RAL Tier1 weekly operations castor 21/03/2011

From GridPP Wiki

Operations News

  • Tested the CASTOR client upgrade from 2.1.7-27 to 2.1.9-6 on 3 worker nodes.
  • During the downtime on 15/3/11, power to the remaining CASTOR EMC unit was configured to be fed from the UPS through an isolating transformer.

Operations Issues

  • On 11/3/11, one of the three LHCb SRMs died and was taken out of the DNS round robin. Its replacement has yet to be tested and put back in.
  • Sluggish SRM-DB performance on ATLAS and CMS on Monday suggested network issues, but none could be found. Similar problems affected the LHCb SRMs on Wednesday; this was traced to two "decommissioned" LHCb SRMs (srm204, srm205) that were still connecting to the DB.

  • On 17/3/11, LHCb accidentally deleted 4100 reconstruction files from their 2010 data. We will try to recover them from tape.
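
The stale-SRM problem above suggests a simple check: compare the client hosts holding DB sessions against the list of SRMs expected to be in service. The following is a minimal sketch only, not the procedure actually used. The function name and the sample host lists are hypothetical (only srm204 and srm205 come from the report), and the session host names are assumed to have been exported already, for example from the MACHINE column of Oracle's V$SESSION view.

```python
from collections import Counter


def unexpected_clients(session_hosts, expected_hosts):
    """Return {host: session_count} for hosts that hold DB sessions
    but are not in the expected set (comparison is case-insensitive)."""
    expected = {h.lower() for h in expected_hosts}
    counts = Counter(h.lower() for h in session_hosts)
    return {host: n for host, n in counts.items() if host not in expected}


# Hypothetical sample data: one entry per open DB session.
sessions = ["srm201", "srm202", "srm204", "srm205", "srm204"]
# Hypothetical list of SRMs that should still be in service.
in_service = ["srm201", "srm202", "srm203"]

print(unexpected_clients(sessions, in_service))
# prints {'srm204': 2, 'srm205': 1}
```

Any host reported here would be a candidate for a machine that was decommissioned on paper but is still connecting to the DB.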

Blocking Issues

  • Lack of production-class hardware running ORACLE 10g needs to be resolved before CASTOR for Facilities goes into full production. The hardware has arrived and is awaiting installation.

Planned, Scheduled and Cancelled Interventions

Entries in/planned to go to GOCDB

Description                                                | Start          | End            | Type     | Affected VO(s)
Upgrade CASTOR clients on all WNs from 2.1.7-27 to 2.1.9-6 | 21 March 10:00 | 21 March 12:00 | At-risk  | All
Upgrade CMS to 2.1.10-0 (STC)                              | 28 March 08:00 | 28 March 16:00 | Downtime | CMS
Upgrade ATLAS, LHCb and Gen to 2.1.10-0 (STC)              | 30 March 08:00 | 30 March 16:00 | Downtime | ATLAS, LHCb, Gen

Advanced Planning

  • Move Tier1 instances to new Database infrastructure with a Dataguard backup instance in R26
  • Move Facilities instance to new Database hardware running 10g
  • Upgrade tape subsystem to 2.1.10-1, which allows us to support files larger than 2TB
  • Start migrating from T10KA to T10KC media later this year


  • Castor on Call person: Shaun
  • Staff absence/out of the office:
    • ..