RAL Tier1 weekly operations Grid 20100719

From GridPP Wiki

Operational Issues

Description | Start | End | Affected VO(s) | Severity | Status
Job status monitoring from CREAMCE | 2-Feb-2010 | | CMS | medium | [10-Feb-2010] WMS patch available soon; CREAMCE new version available soon [07-Apr-2010] CMS tests have shown that WMS patches resolve the problem; still waiting for patch to be installed on the production WMSs in Italy [13-Jul-2010] CNAF WMSs have been updated; testing using backfill is in progress
WMS03 | 16-Jul-2010 | 16-Jul-2010 | Non-LHC | low | Was unresponsive and rebooted

Downtimes

Description | Hosts | Type | Start | End | Affected VO(s)

Blocking Issues

Description | Requested Date | Required By Date | Priority | Status
HW needed to test Dataguard technology for LFC/FTS | 19 May 2010 | 15 June 2010 | Medium | [24-05-2010] HW available; needs to be deployed by Fabric and then handed over to Dataservices
#61658: HW request for CMS Squid VOBOX | 30 June 2010 | | Medium | [30-06-2010] Request made
#62179: Request for new CMS pool accounts | 16 July 2010 | | High | [16-07-2010] Request made

Developments/Plans

Highlights for Tier-1 Ops Meeting

  • Mayo has now left; please remove any access he may have had
  • Only 2 Grid Team members in on Wed-Thu
  • New CMS t1production role
  • Batch farm full :-), causing issues for CMS :-(

Highlights for Tier-1 VO Liaison Meeting

  • Investigating options for limiting Alice jobs after CMS ran work elsewhere over the weekend
  • Progressing with enabling new CMS role on batch farm
  • Roll out an upgrade of the top-level BDIIs next week (At-risk)
  • 2 crashes of WMS03 with no obvious cause

Detailed Individual Reports

Alastair

  • Working on ATLAS software server on /afs [ongoing]
  • Written script to identify unavailable files when a disk server is taken out of production. [testing]
  • Looking into slow LHCb transfers between SARA and RAL. (fix with James T now)
  • Working to improve pbsjobs database to allow easier monitoring of production work.
  • Working on ATLAS Frontier service, monitoring and backup.

Andrew

  • Investigated slow transfers of an important MC dataset to many T2s [Done]
  • Added Ganglia monitoring of CMS data transfers (volume per day & rates) to/from CERN, T1s, T2s [Done]
  • Preparations for new CMS t1production role
    • Working on change-control form & implementation plan; submitted request to Fabric for new pool accounts
  • Updated FTS monitor to v1.4 [Done]
  • Understanding disk & tape capacity calculations
  • CMS data ops
    • MC production at CNAF
    • backfill (MC production) at RAL; testing CREAM CEs
    • Data reprocessing at FNAL
  • Try glite-APEL installation in testbed [To do]
  • Write script for checksum checking of last file on T10KB tapes [To do]

Catalin

  • Python course (Mon - Thu) RAL R1

Derek

  • Sync'd testbed against QWG profiles [Done]
  • Rebooted lcgwms03 [Done]
  • Debugging t2k job submission issues
  • CIC broadcast for lcgce02 decommission [Done]
  • Writing Strawman Cloud strategy [ongoing]
  • Sync production templates against QWG

Matt

Richard

  • Submitted change control request for updating RAL top-level BDIIs [done]
  • Working on the "team status page" being developed as an action from team awayday [ongoing]
  • Reviewing G/S process documentation [ongoing]
  • Developed a tool to help with automating the wiki page on grid middleware versions [done]
  • CASTOR items:
    • Continue trying to get 2.1.9 functional tests running on pre-prod

VO Reports

ALICE

  • waiting for CREAM-CE 1.6 deployment at RAL
  • cannot roll out new xrootd version (20100510-1509_dbg) on Castor 2.1.7

ATLAS

CMS

  • Because CMS was unable to get any job slots at RAL, v2 of an urgent workflow was run at FNAL. The v1 that was eventually generated at RAL has been deleted.
  • Started to use CREAM CEs again due to upgrade of CNAF WMSs; no problems so far.

LHCb

OnCall/AoD Cover

  • Primary OnCall: Catalin
  • Grid OnCall:
  • AoD: