RAL Tier1 weekly operations Grid 20100726

From GridPP Wiki

Operational Issues

Description | Start | End | Affected VO(s) | Severity | Status
Job status monitoring from CREAMCE | 2-Feb-2010 | | CMS | Medium | [10-Feb-2010] WMS patch available soon; CREAMCE new version available soon [07-Apr-2010] CMS tests have shown that WMS patches resolve the problem; still waiting for patch to be installed on the production WMSs in Italy [13-Jul-2010] CNAF WMSs have been updated; testing using backfill is in progress [19-Jul-2010] So far everything looks good
WMS03 | 16-Jul-2010 | 16-Jul-2010 | Non-LHC | Low | Was unresponsive and rebooted
FTS02 | 21-Jul-2010 | | All | High | SMART errors on both FTS02 disks; Fabric have replacements and wish to arrange a swap-out

Downtimes

Description | Hosts | Type | Start | End | Affected VO(s)

Blocking Issues

Description | Requested Date | Required By Date | Priority | Status
HW needed to test Dataguard technology for LFC/FTS | 19 May 2010 | 15 June 2010 | Medium | [24-05-2010] HW available; needs to be deployed by Fabric and then handed over to Dataservices
#61658: HW request for CMS Squid VOBOX | 30 June 2010 | | Medium | [30-06-2010] Request made
#62179: Request for new CMS pool accounts | 16 July 2010 | | High | [16-07-2010] Request made [21-07-2010] Ticket closed by Fabric team [26-07-2010] Pool accounts were created yesterday

Developments/Plans

Highlights for Tier-1 Ops Meeting

Highlights for Tier-1 VO Liaison Meeting

Detailed Individual Reports

Alastair

  • Working on ATLAS software server on /afs [ongoing]
  • Written script to identify unavailable files when a disk server is taken out of production. [testing]
  • Looking into slow LHCb transfers between SARA and RAL (fix with James T now).
  • Working to improve pbsjobs database to allow easier monitoring of production work.
  • Working on ATLAS Frontier service, monitoring and backup.

Andrew

  • Completed & submitted change control documents about the new CMS production role [Done]
  • Prepared changes required for monitoring & accounting for new CMS production role [Done]
  • PhEDEx backup
    • Grid services on-call spreadsheet now contains details about temporarily moving PhEDEx to lcgvo0598
    • Ensured lcgvo0598 is ready to run PhEDEx in an emergency. [Done]
  • CMS Data Ops
    • Backfill at IN2P3 & RAL
  • Add VOBOX proxy renewal restarter to lcgvo-02-21 [To do]
  • CMS storage consistency check [To do]
  • A/L Wed - Fri

Catalin

  • Python course [done]
  • ATLAS Frontier monitoring
  • LFC quattorising (SL4 and SL5) [ongoing]

Derek

  • Moved LHCb to grid3000M queue [done]
  • Writing Strawman Cloud strategy [ongoing]
  • Sync production templates against QWG [ongoing]
  • CREAM CE quattor profile

Matt

  • Using FTS dev endpoint to test new timeout parameters.
  • Test deployment of gLite 3.2 FTS.

Richard

  • Submitted downtime for applying the BDII update approved in change control request #62184
  • Working on the "team status page" being developed as an action from team awayday [ongoing]
  • Reviewing G/S process documentation [ongoing]
  • CASTOR items:
    • Further progress on getting the 2.1.9 functional tests running on pre-prod

VO Reports

ALICE

ATLAS

CMS

  • Discussions about having all Tier-1s publish CPU farm information in a common XML format:
    • Summary information - number of jobs running, pending, CPU time, wall time, number of jobs with efficiency < 10% (overall & for different groups)
    • (Optional) Details about individual jobs
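
To make the proposed summary concrete, the following is a minimal sketch of how a site might compute those figures and serialise them as XML. No schema had been agreed at the time of this report, so the element names (farm, summary, running, pending, cpu_time, wall_time, low_efficiency), the site name and the sample job records are all hypothetical. Efficiency is taken here to mean CPU time divided by wall time, which is an assumption rather than something stated in the discussion.

```python
import xml.etree.ElementTree as ET

# Hypothetical job records: (group, state, cpu_seconds, wall_seconds).
JOBS = [
    ("production", "running", 3400.0, 3600.0),
    ("production", "running", 120.0, 3600.0),
    ("analysis", "running", 900.0, 1800.0),
    ("analysis", "pending", 0.0, 0.0),
]

# Jobs below this CPU/wall ratio are counted as low efficiency (< 10%).
LOW_EFFICIENCY = 0.10


def summarise(jobs):
    """Return the summary counters for a list of job records."""
    running = [j for j in jobs if j[1] == "running"]
    pending = [j for j in jobs if j[1] == "pending"]
    # Skip jobs with no wall time yet to avoid dividing by zero.
    low = [j for j in running if j[3] > 0 and j[2] / j[3] < LOW_EFFICIENCY]
    return {
        "running": len(running),
        "pending": len(pending),
        "cpu_time": sum(j[2] for j in running),
        "wall_time": sum(j[3] for j in running),
        "low_efficiency": len(low),
    }


def build_xml(jobs, site="EXAMPLE-SITE"):
    """Build an XML tree with one overall summary and one per group."""
    root = ET.Element("farm", site=site)
    groups = [("overall", jobs)]
    for name in sorted({j[0] for j in jobs}):
        groups.append((name, [j for j in jobs if j[0] == name]))
    for name, subset in groups:
        summary = ET.SubElement(root, "summary", group=name)
        for key, value in summarise(subset).items():
            ET.SubElement(summary, key).text = str(value)
    return root


if __name__ == "__main__":
    print(ET.tostring(build_xml(JOBS), encoding="unicode"))
```

The optional per-job details could be added as further child elements under each summary; they are left out here because the discussion did not specify which fields they would carry.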

LHCb

OnCall/AoD Cover

  • Primary OnCall:
  • Grid OnCall: Derek
  • AoD: