RAL Tier1 weekly operations Grid 20100712


Operational Issues

Description | Start | End | Affected VO(s) | Severity | Status
Job status monitoring from CREAMCE | 2-Feb-2010 | | CMS | medium | [10-Feb-2010] WMS patch available soon; CREAMCE new version available soon. [07-Apr-2010] CMS tests have shown that WMS patches resolve the problem; still waiting for patch to be installed on the production WMSs in Italy
Software server overloaded | | | Atlas | High | Software server problems

Downtimes

Description | Hosts | Type | Start | End | Affected VO(s)

Blocking Issues

Description | Requested Date | Required By Date | Priority | Status
HW needed to test Dataguard technology for LFC/FTS | 19 May 2010 | 15 June 2010 | Medium | [24-05-2010] HW available; needs to be deployed by Fabric and then handed over to Dataservices
#61658: HW request for CMS Squid VOBOX | 30 June 2010 | | Medium | [30-06-2010] Request made

Developments/Plans

Highlights for Tier-1 Ops Meeting

  • Mayo's last week
  • Deployed lcgce03 (CREAM CE) for ALICE
  • Applied job limits to ATLAS after problems with the ATLAS software server

Highlights for Tier-1 VO Liaison Meeting

  • LHCb have requested that we raise walltime on grid2000M to 140 hours (from 96)
  • Testing update to glite WN
  • New CREAM CE (lcgce03) deployed for ALICE
  • Applied job limits to ATLAS after problems with the ATLAS software server

Detailed Individual Reports

Alastair

  • Working on ATLAS software server on /afs [ongoing]
  • Wrote a script to identify unavailable files when a disk server is taken out of production [testing]
  • Looking into slow LHCb transfers between SARA and RAL (fix with James T now)
  • Working to improve pbsjobs database to allow easier monitoring of production work.
  • Working on ATLAS Frontier service, monitoring and backup.
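
The idea behind the unavailable-files script mentioned above can be sketched as follows. This is only an illustrative sketch, not the script itself: the input format (pairs of file path and disk server name), the function name `unavailable_files` and the server names in the usage note are all assumptions made for the example.

```python
from collections import defaultdict


def unavailable_files(replicas, offline_servers):
    """Return the files whose every known replica sits on an offline server.

    replicas: iterable of (file_path, disk_server) pairs. This input format
              is hypothetical; a real script would query the storage system.
    offline_servers: names of disk servers taken out of production.
    """
    offline = set(offline_servers)
    locations = defaultdict(set)
    for path, server in replicas:
        locations[path].add(server)
    # A file is unavailable only if none of its replicas is on a live server.
    return sorted(
        path for path, servers in locations.items() if servers <= offline
    )
```

For example, with replicas [("/a", "srv1"), ("/a", "srv2"), ("/b", "srv1")] and "srv1" out of production, only "/b" is reported, because "/a" still has a copy on "srv2".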

Andrew

  • June accounting [Done]
  • Wrote BDII-DAC disk capacity consistency checking script [Done]
  • Checking new stage-out config for RAL so that unmerged files can be deleted (ProdAgent, SAM tests, site local config required updates) [Done]
  • Checking checksums of files from T10KB tapes
  • CMS data ops
    • Completing MC reprocessing [Done]
    • Started MC production backfill at RAL & IN2P3 [Ongoing]
    • Real MC production at CNAF
  • glite-APEL
    • Reading documentation
    • To do: set up glite-APEL instance in testbed
  • Listened to CMS talks at WLCG meeting (evo)
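
The checksum check on files recalled from tape, listed above, could look roughly like the sketch below. It assumes Adler-32 checksums stored as 8-digit hex strings; the report does not say which algorithm or catalogue was used, and the names `adler32_of` and `find_mismatches` are hypothetical.

```python
import zlib


def adler32_of(path, chunk_size=1024 * 1024):
    """Compute the Adler-32 checksum of a file as an 8-digit hex string."""
    value = 1  # Adler-32 starting value
    with open(path, "rb") as fh:
        # Read in chunks so large files do not have to fit in memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            value = zlib.adler32(chunk, value)
    return format(value & 0xFFFFFFFF, "08x")


def find_mismatches(expected):
    """Return paths whose computed checksum differs from the expected one.

    expected: dict mapping file path -> expected hex checksum (assumed to
              come from a catalogue dump; the source is not specified).
    """
    return [
        path
        for path, want in expected.items()
        if adler32_of(path) != want.lower()
    ]
```

An empty list from `find_mismatches` means every listed file matched its recorded checksum.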

Catalin

  • Test LFC deployment using Quattor [ongoing]
  • LFC talk for NGS [done]
  • Frontier monitoring [ongoing]
  • ALICE CASTOR + xrootd issues [ongoing]

Derek

  • Testing glexec update [ongoing]
  • Setting up NGS UEE on worker nodes
  • Implementing updated change control process on dev helpdesk
  • Quattorising CREAM CE
  • Mayo leaving stuff

Matt

Richard

  • Catch-up after last week's leave
  • Planning updates to RAL top-level BDIIs [ongoing]
  • Working on the "team status page" being developed as an action from team awayday [ongoing]
  • Reviewing G/S process documentation [ongoing]
  • Developed a tool to help with automating the wiki page on grid middleware versions [done]
  • CASTOR items:
    • Ran stress tests on pre-prod
  • Next Week
    • Assemble results from last week's stress test runs
    • Try to get 2.1.9 functional tests running on pre-prod
    • Finishing off 2.1.7 metrics documentation [ongoing]

Mayo

  • Implement David Meredith's feedback into Certificate viewer [Done]
  • Integrate certificate viewer module with existing NGS certificate wizard code [Done]
  • Create Handover Documentation for finished projects [ongoing]
  • Enter job plan into ssc [Done]
  • Create Certificate Query class for David Meredith [Done]

VO Reports

ALICE

  • Waiting for CREAM-CE 1.6 deployment at RAL
  • Cannot roll out new xrootd version (20100510-1509_dbg) on CASTOR 2.1.7

ATLAS

CMS

LHCb

OnCall/AoD Cover

  • Primary OnCall:
  • Grid OnCall: Derek
  • AoD: