RAL Tier1 weekly operations castor 13/06/2011

From GridPP Wiki

Operations News

  • The CIP was successfully run against the Facilities database, producing accounting info. This will be deployed into production and the information monitored.
  • On Thursday the DB team changed an OS parameter on 2 nodes to fix internal logging, at ORACLE's request. Whether this change has worked will only be known at the next intervention.
  • 4 WNs have just been upgraded to CASTOR client 2.1.10-0.

Operations Problems

  • LHCb were experiencing a lot of failures because files pre-staged from tape were being deleted by the gc on lhcbRawRdst, which is very full. The gc policy on this service class was changed on Wednesday from default to LRU (Least Recently Used), which appears to have improved matters.
  • On Thursday a stuck tape stopped recalls for LHCb and had to be manually unmounted.
  • CMS continue to be very heavily loaded, and have periodically encountered timeouts on the CASTOR instance.
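
As an illustration of the LRU idea behind the gc policy change on lhcbRawRdst, the following is a minimal sketch, not CASTOR's actual gc code. The names CachedFile and select_lru_victims are hypothetical, and the file sizes and access times are made up. It assumes the gc has to free a given number of bytes and picks the least recently accessed files first, so files that were only just pre-staged would be the last candidates for deletion.

```python
from dataclasses import dataclass


@dataclass
class CachedFile:
    """Hypothetical record for a file held on a disk pool."""
    name: str
    size: int            # bytes
    last_access: float   # seconds since epoch


def select_lru_victims(files, bytes_to_free):
    """Return the files to delete, least recently accessed first.

    Files are added until at least bytes_to_free bytes would be freed.
    This is an illustrative policy only, not the CASTOR implementation.
    """
    victims = []
    freed = 0
    for f in sorted(files, key=lambda f: f.last_access):
        if freed >= bytes_to_free:
            break
        victims.append(f)
        freed += f.size
    return victims


if __name__ == "__main__":
    pool = [
        CachedFile("old_raw_file", 20, 100.0),
        CachedFile("older_rdst_file", 30, 200.0),
        CachedFile("just_prestaged_file", 10, 300.0),
    ]
    for victim in select_lru_victims(pool, 25):
        print(victim.name)
```

In this sketch the most recently accessed file survives as long as older files can cover the space that is needed, which would be consistent with the improvement seen after the change.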

Blocking Issues

  • Lack of production-class hardware running ORACLE 10g needs to be resolved before CASTOR for Facilities can guarantee the same level of service as the Tier1 instances. The hardware has arrived and we are awaiting installation.

Planned, Scheduled and Cancelled Interventions

Entries in/planned to go to GOCDB

  • none

Advanced Planning

  • Upgrade of CASTOR clients on WNs to 2.1.10-0
  • Upgrade Tier1 tape subsystem to 2.1.10-1, which allows us to support files > 2TB and T10KC
  • Move Tier1 instances to new Database infrastructure, with a Dataguard backup instance in R26
  • Move Facilities DB instance to new Database hardware running 10g
  • Upgrade SRMs to 2.11 which incorporates VOMS support
  • Start migrating from T10KA to T10KC media later this year
  • Quattorization of remaining SRM servers
  • Hardware upgrade, Quattorization and Upgrade to SL5 of Tier1 CASTOR headnodes


  • Castor on Call person: Chris
  • Staff absence/out of the office:
    • ..