Operations Report 12/02/2010

From GridPP Wiki

Summary of Previous Week

  • Setting up test environment
  • Comparing test environment with production one
  • 18 crash tests to reproduce production problems
  • Adding last archived logs to the Automated Recovery Script (to reduce the lag between actual data and the backup copy)
  • Tests to add / remove cluster nodes

Operational Issues and Incidents

  • Tier1: still some EMC/Oracle problems on PLUTO/NEPTUNE. Trying to reproduce them on the test system.

Plans for Week(s) Ahead

  • CASTOR nodes Upgrade
  • Resilience tests
  • Node manipulation (add/remove) tests

Downtimes and At Risk

Description | Start | End | Affected VO(s) | Type
Memory upgrade on Castor nodes | start of the week | start of the week | ATLAS, LHCb | At risk
Moving backup area from cdbc08 to spare node | start of the week | start of the week | ATLAS, LHCb | At risk
Disable clusterware on cdbc08 node | start of the week | start of the week | ATLAS, LHCb | At risk

Development Priorities

  • Deploy CASTOR Database Monitoring
  • Migrate ATLAS TAGs to 64-bit systems
  • Investigate ORACLE replication technique for LFC/FTS resilience
  • Investigate hardware architecture, backup and recovery strategy, resilience and validation of restored backup.


Requirements and Blocking Issues

Description | Required By | Priority | Status
Hardware for new Neptune node | | Medium/high | Waiting
Hardware to test LFC database replication | | Medium/high | Waiting

OnCall

  • Keir

Absences

  • Rich Out Until 22nd February
  • Carmine Out Until 17th February
  • Keir Away on 16th February