RAL Tier1 weekly operations castor 20/12/2010

Operations News

  • Remaining SRMs (LHCb, Gen) upgraded to 2.8-6

Operations Issues

  • Incorrect checksums were found to have been recorded for ~30 LHCb files, leading to errors in the rtcopyd log. This was due to a bug (fixed in 2.1.9-9) affecting incompletely transferred files. Since the transfer error was originally reported back to the user, this is not considered data corruption on our side (a checksum verification sketch follows this list).
  • The CASTOR jobManager stopped working for LHCb for ~45 minutes. Secondary job managers will be enabled for the remaining instances (LHCb, Gen).
  • The garbage-collection (GC) limits on aliceTape were found to be set incorrectly, leaving too little spare capacity. They have been corrected.
  • On Friday the ATLAS instance became very busy. The six SRMs coped, but a backlog built up in LSF, so the FTS was throttled to prevent further congestion.
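
The checksum issue above can be cross-checked by recomputing the Adler-32 checksum of a suspect file and comparing it with the value recorded for it. The following is a minimal Python sketch, assuming the files carry adler32 checksums stored as 8-digit hex strings (the usual convention for LHC data on CASTOR); the file path and expected checksum are illustrative arguments supplied by the caller.

  import sys
  import zlib

  def adler32_of_file(path, chunk_size=1024 * 1024):
      """Recompute the Adler-32 checksum of a file, returned as an 8-digit hex string."""
      value = 1  # Adler-32 starts from 1, matching zlib's default
      with open(path, "rb") as f:
          while True:
              chunk = f.read(chunk_size)
              if not chunk:
                  break
              value = zlib.adler32(chunk, value)
      return "%08x" % (value & 0xffffffff)

  if __name__ == "__main__":
      # Usage: python adler32_check.py <local copy of file> <expected checksum>
      path, expected = sys.argv[1], sys.argv[2].lower()
      actual = adler32_of_file(path)
      status = "OK" if actual == expected.zfill(8) else "MISMATCH"
      print("%s  expected=%s  actual=%s" % (status, expected.zfill(8), actual))

Since the bug affected incompletely transferred files, a mismatch here would point to a bad transfer rather than silent corruption on disk, consistent with the assessment above.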

Blocking issues

  • The lack of production-class hardware running Oracle 10g needs to be resolved before the CASTOR for Facilities instance goes into full production.

Planned, Scheduled and Cancelled Interventions

Entries in/planned to go to GOCDB

Description                                    Start             End               Type      Affected VO(s)  Lead by
Update ATLAS disk servers to SL5 64bit (TBC)   17/01/2011 08:00  18/01/2011 16:00  Downtime  ATLAS           MV

Advanced Planning

  • CASTOR for Facilities instance in production by end of 2010
  • Upgrade the ATLAS, CMS and Gen disk servers to SL5 64bit, and Quattorize the non-Quattorized disk servers
  • CASTOR certification and upgrade to 2.1.9-10, which incorporates the gridftp-internal fix to support multiple service classes, enabling checksums for Gen
  • CASTOR upgrade to 2.1.9-10 and SRM upgrade to 2.10 to fix the 'unavailable' status being reported to the FTS while disk servers are draining

Staffing

  • CASTOR on-call person: Chris
  • Staff absence/out of the office:
    • Shaun out all week
    • Jens out from Wednesday