RAL Tier1 weekly operations castor 26/12/2011

From GridPP Wiki
Revision as of 10:05, 3 January 2012 by Chris kruk (Talk | contribs)

Operations News

  • All 2.1.11 headnode components (inc. Transfer Manager) are now set up and are being tested.
  • All new SRM machines are installed and are awaiting testing.

Operations Problems

  • On the night/morning of 22 Dec, problems with the SAR caused all Ops SAM tests to fail from 01:30 to 09:00.
  • During the early morning of 23 Dec, performance of the LHCb SRM DB degraded. This was picked up, and the DB On-Call regenerated stats, which improved matters.
  • The atlasStager var partition was close to its limit on 24 Dec.
  • 25% failures in lhcbDst on 25 Dec; investigation showed 6 hot files that clients were trying to access.

Blocking Issues

  • none

Planned, Scheduled and Cancelled Interventions

Entries in/planned to go to GOCDB

Description | Start | End | Type | Affected VO(s) | Lead by
Stage 1 of move to new CASTOR DB hardware | 05/01/2012 08:30 | 05/01/2012 16:00 | Downtime | All | Rich
SRM 2.11 upgrade, inc. move to new hardware+SL5+Quattor (STC) | 16/01/2012 08:00 | 18/01/2012 16:00 | Downtime | All | Shaun
CIP 2.2.0 upgrade (STC) | 26/01/2012 10:00 | 26/01/2012 12:00 | At-risk | All | Matthew
Stage 2 of CASTOR DB move (STC) | 07/02/2012 08:00 | 07/02/2012 16:00 | Downtime | All | Rich
CASTOR 2.11-8 upgrade, inc. move to new hardware+SL5+Quattor (STC) | 13/02/2012 08:00 | 24/02/2012 16:00 | Downtime | All | Matthew

Advanced Planning

  • Move Tier1 instances to the new database infrastructure, with a Dataguard backup instance in R26

Staffing

  • CASTOR on-call person: Chris
  • Staff absence/out of the office:
    • All (Xmas)