RAL Tier1 weekly operations castor 17/10/2011

From GridPP Wiki

Operations News

  • A new diskpool, aliceDisk (100TB), went into production (by redeploying existing disk servers from aliceTape) and is now being used by Alice for processing jobs.
  • Initial testing of SRM 2.11 was successful.

Operations Problems

  • The database corruption problem recurred on Tuesday morning at 08:43. As soon as it was noticed, the fix was applied and we were back in production at 10:00. A new Nagios callout test has been deployed that alerts when the problem appears, so we can act even faster if it happens again. A new hypothesis is that draining disk servers was a possible cause. As a precautionary measure, we will no longer drain disk servers outside working hours.
  • It was discovered that the CIP has been publishing 1024x the real tape capacity since the 2.1.10-1 upgrade, due to changes in the CASTOR code.
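
A factor of exactly 1024 is what a one-step slip between binary unit prefixes would produce. The sketch below illustrates that arithmetic only; the capacity figure, the units involved, and the helper names are hypothetical, and nothing here is taken from the CIP or CASTOR code.

```python
KIB = 1024  # one binary-prefix step (e.g. MiB -> GiB)


def to_bytes(value, unit_bytes):
    """Convert a reported value to bytes, given the size of its unit in bytes."""
    return value * unit_bytes


def prefix_steps_off(published_bytes, real_bytes):
    """Return n if published == real * 1024**n for an integer n >= 0, else None."""
    if real_bytes <= 0 or published_bytes < real_bytes:
        return None
    if published_bytes % real_bytes:
        return None
    ratio = published_bytes // real_bytes
    steps = 0
    while ratio % KIB == 0:
        ratio //= KIB
        steps += 1
    return steps if ratio == 1 else None


# Hypothetical figures: a real capacity of 5 PiB, reported by the source in MiB.
real_bytes = 5 * KIB**5
reported_in_mib = real_bytes // KIB**2

# Assumed failure mode: the publisher interprets the MiB figure as GiB.
published_bytes = to_bytes(reported_in_mib, KIB**3)

print(published_bytes // real_bytes)                  # inflation factor
print(prefix_steps_off(published_bytes, real_bytes))  # prefix steps off
```

A check of this kind, comparing a published figure against an independently known capacity, would flag a unit slip as a ratio that is an exact power of 1024 rather than an arbitrary discrepancy.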

Blocking Issues

  • We need to understand the cause of the new database disk array hardware problem before we can migrate production databases over to it.

Planned, Scheduled and Cancelled Interventions

Entries in/planned to go to GOCDB: none

Advanced Planning

  • Move Tier1 instances to the new database infrastructure, with a Dataguard backup instance in R26
  • Upgrade SRMs to 2.11 which incorporates VOMS support
  • Certify 2.1.11 and evaluate the Transfer Manager (the new LSF replacement)
  • Quattorization of remaining SRM servers
  • Hardware upgrade, Quattorization and Upgrade to SL5 of Tier1 CASTOR headnodes


  • Castor on Call person: Matthew
  • Staff absence/out of the office:
    • Shaun at EUDAT (Mon-Wed)