Operations Report 05/10/2009

From GridPP Wiki

Summary of Previous Week

Developments

  • 3D LHCb: migrated to new hardware
  • 3D ATLAS: ASM patch applied
  • LFCs and FTS: ASM patch applied
  • Castor: Refined the metrics gathering for the CASTOR databases in Grid Control (part of our improved monitoring plan)
  • Castor: Performance analysis on DLF database
  • Castor: Installed new Grid Control monitoring agents on Neptune/Pluto (for our new Grid Control server)
  • Castor: Installed new Grid Control monitoring agents on Neptune/Pluto for CERN's use (waiting for ports to be opened)
  • Castor: Analysed Tuesday's cdbc08 node crash and the associated voting disk problem
  • Castor: Upgraded ATLAS SRM (well, changed the version number!)

Operational Issues and Incidents

  • Castor: Databases down since Sunday afternoon (04/10/09) because of disk array problems
  • Castor: cdbc08 (Neptune1) rebooted because of problems with the voting disk partition

Plans for Week(s) Ahead

  • Castor: Recover from disk failures
  • Castor: Roll out the new monitoring script
  • Castor: Intervention to change Clusterware timeout parameter and move NFS mounted voting disk to Neptune2
  • Castor: Install new REPACK schemas
  • LFC: Test resilience


Downtimes and At Risk

Description | Start | End | Affected VO(s)

Development Priorities

  • CASTOR Database Monitoring
  • Migrate ATLAS TAGs to 64-bit systems
  • Investigate Oracle replication techniques for LFC/FTS resilience

Requirements and Blocking Issues

Description                               | Required By | Priority    | Status
Hardware for Tag databases                |             | Medium      | Waiting
Hardware to test LFC database replication |             | Medium/high | Waiting

OnCall

  • Eter Pani