RAL Tier1 weekly operations castor 16/05/2011

From GridPP Wiki
Revision as of 09:04, 17 May 2011 by Matt viljoen

Operations News

  • SRM upgrade to 2.1.10 on Gen
  • Created new space tokens for LHCb:
    • LHCb-Disk (in front of lhcbDst, merged with the old lhcbMdst)
    • LHCb-Tape (in front of lhcbRawRdst)
  • Quarterly patch applied to Pluto

Operations Problems

  • On the afternoon of 9/5/11, Gen LSF started failing jobs. This was not picked up by the CASTOR team until 0800 on 10/5/11, when LSF was restarted and jobs started working again. The callout status of Gen is being reviewed by the Production team.
  • On 16/5/11 between 0730 and 0830, Neptune node castor151 crashed during a routine Oracle backup, affecting LHCb (and, strangely, ATLAS). The cause is unknown; an SR has been sent to Oracle.
  • On 16/5/11 between 0721 and 0730, Oracle errors (ORA-00060) appeared in the CMS SRM, indicating deadlocks. This resulted in failed transfers. The cause is unknown; an SR has been sent to Oracle.
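
When confirming an incident window such as the ORA-00060 burst above, it can help to count Oracle error codes per time window in the relevant log. The following is a minimal sketch only: the log line format ("YYYY-MM-DD HH:MM:SS message"), the sample lines, and the function name ora_errors_in_window are assumptions for illustration, not the actual CMS SRM log format or any existing tool.

```python
import re
from datetime import datetime

# Assumed (hypothetical) log line format: "YYYY-MM-DD HH:MM:SS <message>"
LINE_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (.*)$")
# Oracle error codes have the form ORA- followed by five digits
ORA_RE = re.compile(r"ORA-\d{5}")


def ora_errors_in_window(lines, start, end):
    """Count ORA-xxxxx codes in lines timestamped within [start, end]."""
    counts = {}
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue  # skip lines that do not match the assumed format
        ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
        if not (start <= ts <= end):
            continue
        for code in ORA_RE.findall(m.group(2)):
            counts[code] = counts.get(code, 0) + 1
    return counts


# Made-up sample lines, not real log output
sample = [
    "2011-05-16 07:20:59 request ok",
    "2011-05-16 07:21:03 ORA-00060: deadlock detected while waiting for resource",
    "2011-05-16 07:25:40 ORA-00060: deadlock detected while waiting for resource",
    "2011-05-16 07:31:00 ORA-00060: deadlock detected while waiting for resource",
]
window = (datetime(2011, 5, 16, 7, 21), datetime(2011, 5, 16, 7, 30))
print(ora_errors_in_window(sample, *window))
```

Only lines inside the window are counted, so the third deadlock line in the sample (07:31:00) is excluded.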

Blocking Issues

  • Lack of production-class hardware running ORACLE 10g needs to be resolved prior to CASTOR for Facilities going into full production. The hardware has arrived and is awaiting installation.

Planned, Scheduled and Cancelled Interventions

Entries in/planned to go to GOCDB

Description                                Start         End           Type     Affected VO(s)
Install quarterly ORACLE patches on Pluto  16 May 11:00  16 May 15:00  At-risk  CMS, Gen
Install xroot on farm                      17 May 10:00  17 May 12:00  At-risk  All

Advanced Planning

  • Upgrade of CASTOR clients on WNs to 2.1.10-0
  • Upgrade tape subsystem to 2.1.10-1 which allows us to support files >2TB
  • Move Tier1 instances to new Database infrastructure with a Dataguard backup instance in R26
  • Upgrade Facilities instance to 2.1.10-0
  • Move Facilities instance to new Database hardware running 10g
  • Upgrade SRMs to 2.10-3 which incorporates VOMS support
  • Start migrating from T10KA to T10KC media later this year
  • Quattorization of remaining SRM servers
  • Hardware upgrade, Quattorization and SL5 upgrade of CASTOR headnodes


  • Castor on Call person: Chris
  • Staff absence/out of the office:
    • Jens out all week