Operations Report 19/02/2010

Summary of Previous Week

  • Memory upgrade
  • Recovery tests using non backed-up archived logs
  • Recovery tests from 64bit machines to 32bit machines
  • Housekeeping script improvements
  • Remove orisa nodes from oncall
  • OCR checksum calculations
  • Tests to add / remove cluster nodes
  • Developing resilience plans with Matt Viljoen and Martin Bly
  • Review quarterly Oracle patch update

Operational Issues and Incidents

  • Tier1: still some EMC/Oracle problems on PLUTO/NEPTUNE. Looking for resilient hardware for nodes.
  • Pluto1 reboot. Error in the swapping daemon on the Linux machine.
  • Pluto2 reboot. Crash caused by motherboard failure.
  • Recovery Catalog database crashed due to suspected hardware failure.

Plans for Week(s) Ahead

  • Documentation for add / remove cluster nodes
  • Restoration of VULCAN pre-prod system
  • Test quarterly Oracle patch update
  • Implement protection against the Oracle Java security vulnerability
  • Move backup process from Neptune1 to Neptune4

Downtimes and At Risk

Description | Start | End | Affected VO(s) | Type
Migrate backup area and scripts from neptune1 (cdbc08) | 23rd of February | 23rd of February | ATLAS, LHCb, Nameserver | At risk
Disable clusterware on cdbc08 node | Next week | Next week | ATLAS, LHCb | At risk

Development Priorities

  • Deploy CASTOR Database Monitoring
  • Migrate ATLAS TAGs to 64bit systems
  • Investigate ORACLE replication technique for LFC/FTS resilience
  • Investigate hardware architecture, backup and recovery strategy, resilience and validation of restored backup.


Requirements and Blocking Issues

Description | Required By | Priority | Status
Hardware for new Neptune node | | Medium/high | Waiting
Hardware to test LFC database replication | | Medium/high | Waiting

OnCall

  • Rich

Absences

  • Carmine: out 22nd-23rd February