RAL Tier1 weekly operations castor 09/05/2011

From GridPP Wiki

Latest revision as of 13:42, 16 May 2011

Operations News

  • Quarterly patch applied to Neptune

Operations Problems

  • On 3/5/11 one tape server started disabling LHCb tapes because of a hardware error, leaving files inaccessible. The tapes were marked as enabled again and the tape server was removed from production.
  • On 3/5/11 ATLAS migrations stopped because recycled 'B' media had the wrong labels. The T0Raw mighunter was restarted with the time between runs increased to 30 minutes, which worked around the problem.
  • On 4/5/11 lhcbDst jobs were failing due to an imbalance (every disk server was full apart from gdss457, which was overloaded with write jobs). Two disk servers were added from nonProd and other disk servers were put into draining.
  • On 8/5/11 a faulty workflow caused CMS to flood their stager with PrepareToGet requests, which filled the partition containing the logs. CMS throttled back until Monday, when the workflow was stopped.

Blocking Issues

  • Lack of production-class hardware running Oracle 10g needs to be resolved before CASTOR for Facilities goes into full production. The hardware has arrived and is awaiting installation.

Planned, Scheduled and Cancelled Interventions

Entries in/planned to go to GOCDB

Description                 | Start        | End          | Type     | Affected VO(s)
CASTOR Oracle DB Patches    | 09 May 11:00 | 09 May 15:00 | At-risk  | All
Upgrade SRM to 2.10-2       | 10 May 10:00 | 10 May 12:00 | Downtime | Gen
Install xroot on farm (STC) | 17 May 10:00 | 17 May 12:00 | At-risk  | All

Advanced Planning

  • Upgrade of CASTOR clients on WNs to 2.1.10-0
  • Upgrade tape subsystem to 2.1.10-1, which allows us to support files larger than 2 TB
  • Move Tier1 instances to new database infrastructure, with a Dataguard backup instance in R26
  • Upgrade Facilities instance to 2.1.10-0
  • Move Facilities instance to new Database hardware running 10g
  • Upgrade SRMs to 2.10-3, which incorporates VOMS support
  • Start migrating from T10KA to T10KC media later this year
  • Quattorization of remaining SRM servers
  • Hardware upgrade, Quattorization and upgrade to SL5 of CASTOR headnodes

Staffing

  • CASTOR on-call person: Shaun
  • Staff absence/out of the office:
    • Shaun (Mon: working from home)
    • Chris (Mon-Wed: A/L)
    • Matthew (Wed afternoon-Fri: EGI virtualization workshop)