RAL Tier1 weekly operations castor 28/09/2009

From GridPP Wiki

Summary of Previous Week

  • SRM 2.8 upgrade on ATLAS (Shaun, DB Team)
  • Finalizing testing for CIP 2.0 (Jens)
  • Investigating cause of D2D Transfer incident (Chris)
  • Finalized disk server deployment documentation (Chris)
  • Deployed 5 disk servers for atlasHotDisk and 14 for AtlasSimStrip (Chris)
  • Working on a kernel conflict with the FC card that prevents us from upgrading the tape servers to the latest kernel (Chris)
  • Distributing Raid5/6 servers across service classes using draining (Brian)
  • Diagnosed and fixed a network cable problem on the Vulcan test database (Cheney)
  • Fixed a sendmail problem on the DLF single-instance database (Cheney)
  • Started build of a new failover tape robot controller (Cheney)
  • Fixed SLS (out of inodes due to logrotate failure) (Cheney)
  • Fixed controller crash on database hardware (twice) (Cheney)
  • Applied changes to nagios config for new diskservers (Cheney)
  • Applied Oracle ASM Patch on Production RACs (DB Team)
  • Installing and acceptance testing new CASTOR servers (Richard, Cheney)
  • Coordinating bringing CASTOR down for UPS test (Matt)
  • Writing post mortem of NS upgrade D2D transfer incident (Matt)
  • Working with GOCDB developers to suggest including 'DEGRADED' status (Matt)
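
For reference, the out-of-inodes condition behind the SLS fix above can be diagnosed with standard tools; this is a minimal sketch, with the paths purely illustrative:

```shell
# Show inode usage per filesystem; IUse% at 100% means no new files
# (e.g. rotated logs) can be created even if disk space remains.
df -i /

# Locate the directories containing the most files -- the usual
# culprits when a logrotate failure lets rotated logs pile up:
find /var/log -xdev -type f 2>/dev/null | awk -F/ '{print $2"/"$3}' | sort | uniq -c | sort -rn | head -5
```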

Developments for this week

  • Carry on working on kernel problem for tape servers (Chris)
  • Black and White list tests (Chris)
  • Carry on LSF investigation (Chris)
  • Working on puppet manifest for polymorphic central servers (Chris)
  • 2.8-1 deployment and testing (Shaun)
  • Install and Configure Database Agent for Oracle Enterprise Manager at CERN (DB Team)
  • Installing 64-bit SLC on new preprod machines (Richard)
  • Finish off patching including non-castor (Cheney)
  • Write next Techwatch newsletter (Cheney)
  • Distributing Raid5/6 servers across service classes using draining (Brian)
  • Chasing up strategic objectives (Matt)
  • Disaster recovery documentation (Matt)

Ongoing

  • CastorMon monitoring graphs for Gen instance (Brian)

Operations Issues

  • Oracle ASM failed again on the night of 24/9/09; however, the Oracle patch worked and Oracle recovered without any adverse service impact.

Blocking issues

  • Problems with ganglia check on GEN instance delaying work on monitoring (in hand)

Planned, Scheduled and Cancelled Down Times

Entries in/planned to go to GOCDB

Description       Start          End            Type      Affected VO(s)
CIP 2.0 upgrade   29/9/09 1200   29/9/09 1400   At risk   All instances

Changes to Production Milestones

Advanced Planning

  • SRM 2.8-1 to be deployed this week
  • Black and White lists? (delayed until it is required on a 'per-instance' basis)
  • Improve resiliency to central services (This year)

Staffing

  • Richard away Thurs, Fri
  • Brian A/L Thurs, Fri
  • Castor on Call person: Chris