RAL Tier1 weekly operations castor 02/05/2011

From GridPP Wiki

Operations News

  • none

Operations Problems

  • On 22-23/4/11 the Gen stager database ran into problems with internal memory. The exact cause is unknown and is being followed up with ORACLE.
  • On 25/4/11 the LHCb tape servers went into UNKNOWN status for 24 hours for unknown reasons. This created a very large backlog of unmigrated files, which in turn caused the lhcbDst service class to run out of space. To improve matters, other available tape servers were re-assigned to LHCb and LHCb activity on the farm was reduced.
  • On 28/4/11 LSF jobs on gdss457 (atlasScratchDisk) were timing out, resulting in failed reads/writes to this disk server. This appeared to be caused by an LSF problem: after the jobs were killed, operation continued as usual.

Blocking Issues

  • Lack of production-class hardware running ORACLE 10g needs to be resolved prior to CASTOR for Facilities going into full production. The hardware has arrived and we are awaiting installation.

Planned, Scheduled and Cancelled Interventions

Entries in/planned to go to GOCDB

  • None

Advanced Planning

  • Upgrade of CASTOR clients on WNs to 2.1.10-0
  • Upgrade tape subsystem to 2.1.10-1 which allows us to support files >2TB
  • Move Tier1 instances to new Database infrastructure with a Dataguard backup instance in R26
  • Upgrade Facilities instance to 2.1.10-0
  • Move Facilities instance to new Database hardware running 10g
  • Upgrade SRMs to 2.10-3 which incorporates VOMS support
  • Start migrating from T10KA to T10KC media later this year
  • Quattorization of remaining SRM servers
  • Hardware upgrade and Quattorization of CASTOR headnodes

Staffing

  • Castor on Call person: Shaun
  • Staff absence/out of the office:
    • Chris A/L