RAL Tier1 weekly operations castor 14/03/2011

From GridPP Wiki
Revision as of 11:26, 16 March 2011 by Matt viljoen (Talk | contribs)


Operations News

  • Upgraded NS schema to 2.1.10-0 and switched one EMC unit to the power supply with the isolating transformer.
  • ATLAS did a mass rename of 3.2 million files following the merging of two disk pools.

Operations Issues

  • There is an incompatibility between the upgraded NS schema and the 2.1.8-17 NS client installed on the SRM machines. This was only noticed after the upgrade, and had the effect of stopping nsmkdir working. The SRMs were reconfigured to point back to the central NS daemon a few hours after the upgrade, which fixed the problem. There were also some minor complications on the repack server (which also has a 2.1.8 NS).
  • Due to an oversight, the mass rename of ATLAS files involved new directories being created owned by root instead of atlas001, which gave users permission problems over the weekend. CoC was called out and corrected the ownership.
  • On 14/3/11 sluggish performance was noticed between the ATLAS and CMS SRMs and databases. On 16/3 this reappeared, this time for LHCb. The cause is unknown.
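
The ownership problem above (directories created as root instead of atlas001) is the kind of thing a post-change check can catch early. The following is a minimal sketch only, assuming an ordinary POSIX directory tree; the CASTOR namespace is managed through the name server rather than a local filesystem, so a real check would have to go through the name server's own tools. The function name find_misowned_dirs and any path passed to it are hypothetical.

```python
import os


def find_misowned_dirs(root, expected_uid):
    """Return, sorted, every directory under root (root included)
    whose owner uid is not expected_uid.

    Illustration for a local POSIX tree only; it does not talk to CASTOR.
    """
    misowned = []
    # os.walk does not follow symlinked directories by default,
    # so each dirpath visited is a real directory.
    for dirpath, _dirnames, _filenames in os.walk(root):
        if os.stat(dirpath).st_uid != expected_uid:
            misowned.append(dirpath)
    return sorted(misowned)
```

On a local tree the expected uid could be looked up with pwd.getpwnam("atlas001").pw_uid from the standard library and the returned list reviewed before any ownership is changed.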

Blocking Issues

  • Lack of production-class hardware running ORACLE 10g needs to be resolved prior to CASTOR for Facilities going into full production. The hardware has arrived and is awaiting installation.

Planned, Scheduled and Cancelled Interventions

Entries in/planned to go to GOCDB

Description | Start | End | Type | Affected VO(s)
Network outage and switch to isolating transformers for remaining EMCs | 15 March 08:00 | 15 March 13:00 | Downtime | All
Upgrade ATLAS to 2.1.10-0 (STC) | 28 March 08:00 | 28 March 16:00 | Downtime | ATLAS
Upgrade CMS to 2.1.10-0 (STC) | 29 March 08:00 | 29 March 16:00 | Downtime | CMS
Upgrade LHCb, Gen to 2.1.10-0 (STC) | 30 March 08:00 | 30 March 16:00 | Downtime | LHCb, Gen

Advanced Planning

  • Move Tier1 instances to new Database infrastructure with a Dataguard backup instance in R26
  • Move Facilities instance to new Database hardware running 10g
  • Upgrade tape subsystem to 2.1.10-1 which allows us to support files >2TB
  • Start migrating from T10KA to T10KC media later this year

Staffing

  • Castor on Call person: Matthew
  • Staff absence/out of the office:
    • Chris (Mon)
    • Shaun at dCache workshop (Tue-Thu)