RAL Tier1 weekly operations castor 25/07/2011

From GridPP Wiki

Operations News

  • CASTOR stress testing of the V/SL10 disk servers has started and is progressing well. We have decided to introduce these disk servers with 4x the 'old' slot count (apart from GridFTP). The SL08s are being phased back in with 2x the 'old' slot count.
  • First version of the new Ganglia-based castormon is now available.

Operations Problems

  • More load-related problems on the Gen SRM, caused by a T2K user. The Maui fairshare will be reduced.
  • castor151 (LHCb Stager) spontaneously rebooted. The CASTOR headnodes were unaffected, but a number of LHCb tapes were put into a disabled state. The VMGR schema will be moved onto another node.
  • Approx. 25% of transfers using the RAL FTS are failing for unknown reasons. This does not appear to be CASTOR-related.

Blocking Issues

none

Planned, Scheduled and Cancelled Interventions

Entries in/planned to go to GOCDB: none

Advanced Planning

  • Move Tier1 instances to new Database infrastructure with a Dataguard backup instance in R26
  • Move Facilities DB instance to new Database hardware running 10g
  • Upgrade SRMs to 2.11 which incorporates VOMS support
  • Start migrating from T10KA to T10KC media later this year
  • Certify 2.1.11 and evaluate the new LSF replacement
  • Quattorization of remaining SRM servers
  • Hardware upgrade, Quattorization and upgrade to SL5 of Tier1 CASTOR headnodes

Staffing

  • Castor on Call person: Matthew
  • Staff absence/out of the office:
    • none