RAL Tier1 weekly operations castor 05/10/2009

Latest revision as of 11:29, 5 October 2009

Summary of Previous Week

  • Working on a kernel clash with the FC card, which prevents us from upgrading the tape servers to the latest kernel (Chris)
  • Working with the vendor to isolate the cause of the database RAID controller problem (Cheney)
  • Kernel patching of robot controllers (Cheney)
  • Setting up CIP hosting machine (Cheney)
  • Setting up repack server (Richard, Fabric team, DB Team, Chris)
  • SRM 2.8.1 upgrade on ATLAS (Shaun, DB Team)
  • Debugging ATLAS transfer problems (Shaun, Matt)
  • Fixed LHCb bottleneck on lhcbMdst by increasing job slots (Shaun)
  • Chasing up strategic objectives (Matt)
  • Establishing CASTOR change control policy (Matt)

Developments for this week

  • Carry on working on kernel problem for tape servers (Chris)
  • Set up 2.1.8 on repack server with Puppet (Chris)
  • Working on Puppet manifest for polymorphic central servers (Chris)
  • SRM 2.8-1 deployment on Gen, LHCb, CMS (Shaun)
  • Preparing for CASTOR F2F meeting (Matt, All)
  • Add extra RAID controller to LHCb D1T0 disk servers (Matt, Fabric team, Production team)

Ongoing

  • CastorMon monitoring graphs for Gen instance (Brian)
  • Black and White list tests (Chris)
  • Disaster recovery document (Matt)

Operations Issues

  • ATLAS SRM get failures affecting some jobs - being investigated on certification
  • LHCb ran out of slots on lhcbMdst. Increased job slots.

Blocking issues

  • Problems with the Ganglia check on the GEN instance are delaying work on monitoring (in hand)

Planned, Scheduled and Cancelled Down Times

Entries in/planned to go to GOCDB

Description                         Start          End            Type       Affected VO(s)
SRM 2.8-1 upgrade                   5/10/09 1000   5/10/09 1030   At risk    Gen, LHCb, CMS
Replace faulty ORACLE voting disk   6/10/09 1000   6/10/09 1200   Downtime   ATLAS, LHCb

Changes to Production Milestones

Advanced Planning

  • Add extra RAID controller to LHCb D1T0 servers
  • Black and White lists? (delayed until it is required on a 'per-instance' basis)
  • Improve resilience of central services (this year)

Staffing

  • Brian A/L
  • Richard away
  • CASTOR on-call person: Shaun