RAL Tier1 weekly operations castor 18/04/2011

Operations News

  • LHCb, CMS and ATLAS SRMs upgraded to 2.10-2

Operations Problems

  • Transfer failures reported by CMS, clustered across different d/s on the Mon/Tue and Tue/Wed nights at very similar times. These correlated with packet-loss Nagios test failures, indicating networking problems.
  • On Friday ATLAS reported problems connecting to the SRM. The SRM database was performing badly under the high load from ATLAS, and the service was put into ~4 hours of downtime. Locking the database statistics improved performance (see the statistics sketch after this list).
  • A new host certificate and key installed on gdss66 (cmsFarmRead) did not match, resulting in transfer failures. There should be a check that the certificate and key match when renewing certificates or deploying new disk servers (see the certificate check sketch after this list).
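 
 For reference, locking Oracle optimizer statistics is normally done through the DBMS_STATS package. The sketch below is a minimal illustration using the cx_Oracle Python client; the connection details, schema and table names are placeholders, not the actual SRM database objects.
 
  import cx_Oracle
  
  # Placeholder credentials/DSN; the real SRM database details differ.
  conn = cx_Oracle.connect("srm_admin", "secret", "db-host:1521/SRMDB")
  cur = conn.cursor()
  
  # Lock the optimizer statistics for a (hypothetical) heavily-used table so
  # the optimizer keeps the current, known-good execution plans under load.
  cur.callproc("DBMS_STATS.LOCK_TABLE_STATS", ["SRM_SCHEMA", "REQUESTS"])
  
  # Statistics for a whole schema can be locked in the same way:
  # cur.callproc("DBMS_STATS.LOCK_SCHEMA_STATS", ["SRM_SCHEMA"])
  
  cur.close()
  conn.close()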
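 
 A host certificate and private key can be verified to belong together by comparing their public components before a new certificate goes live. The sketch below is one possible check using Python's cryptography library; the file paths are placeholders and an RSA key pair is assumed.
 
  from cryptography import x509
  from cryptography.hazmat.primitives import serialization
  
  # Placeholder paths; adjust to wherever the host credentials are installed.
  with open("/etc/grid-security/hostcert.pem", "rb") as f:
      cert = x509.load_pem_x509_certificate(f.read())
  with open("/etc/grid-security/hostkey.pem", "rb") as f:
      key = serialization.load_pem_private_key(f.read(), password=None)
  
  # The certificate and key match if they share the same public numbers
  # (assumes an RSA key pair).
  if cert.public_key().public_numbers() == key.public_key().public_numbers():
      print("hostcert and hostkey match")
  else:
      print("MISMATCH: hostcert and hostkey do not belong together")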

Blocking Issues

  • Lack of production-class hardware running ORACLE 10g needs to be resolved before CASTOR for Facilities goes into full production. The hardware has arrived and we are awaiting installation.

Planned, Scheduled and Cancelled Interventions

Entries in/planned to go to GOCDB

  • None

Advanced Planning

  • Upgrade of CASTOR clients on WNs to 2.1.10-0
  • Upgrade tape subsystem to 2.1.10-1, which allows us to support files >2TB
  • Move Tier1 instances to the new database infrastructure, with a Dataguard backup instance in R26
  • Upgrade Facilities instance to 2.1.10-0
  • Move Facilities instance to new Database hardware running 10g
  • Upgrade SRMs to 2.10-3 which incorporates
    • VOMS support
  • Start migrating from T10KA to T10KC media later this year
  • Quattorization of remaining SRM servers
  • Hardware upgrade and Quattorization of CASTOR headnodes

Staffing

  • Castor on Call person: Chris
  • Staff absence/out of the office:
    • Shaun A/L (all week)