RAL Tier1 weekly operations castor 04/10/2010

From GridPP Wiki
Revision as of 15:27, 4 October 2010 by Matt viljoen (Talk | contribs)

Work previous week

  • Matthew:
    • LHCb Upgrade and Testing
  • Shaun:
    • ..
  • Chris:
    • LHCb Upgrade and Testing
    • Castor Facilities work
  • Richard:
    • Ran the 2.1.9 functional test suite on the upgraded LHCb instance of CASTOR
  • Brian:
    • ..
  • Jens:
    • ..

Operations Issues

  • Very heavy load on CMS on 27-28/9/10. Requests were throttled back at Fermilab.
  • FTS channels were not requested to be turned on after the LHCb upgrade and stayed closed until 30/9/10.
  • gridftp-internal RPMs were missing from the 15 upgraded 2.1.9 LHCb disk servers, causing transfers to fail. Fixed on the morning of 30/9/10.
  • Wrong checksums were written to the NS and filesystem attributes after the LHCb upgrade. Checksums were turned off on the morning of 30/9/10. A migration backlog of approx. 1200 files built up because a number of files had wrong checksums. The wrong checksums were manually deleted afterwards and the migration backlog cleared.
  • 3 ATLAS SRM server daemons crashed at the same time on 29/9/10 for unknown reasons.

Blocking issues

none

Planned, Scheduled and Cancelled Interventions

Entries in/planned to go to GOCDB

Description                  Start             End               Type      Affected VO(s)
Update Gen to 2.1.9 (STC)    25/10/2010 08:00  27/10/2010 18:00  Downtime  Gen
Update CMS to 2.1.9 (STC)    08/11/2010 08:00  10/11/2010 18:00  Downtime  CMS
Update ATLAS to 2.1.9 (STC)  22/11/2010 08:00  24/11/2010 18:00  Downtime  ATLAS

Advanced Planning

  • Upgrade to 2.1.9-8 after all instances are upgraded to 2.1.9-6
  • CASTOR for Facilities instance in production by end of 2010

Staffing

  • CASTOR on-call person: Matthew
  • Staff absences:
    • ..