Operations Report 19/02/2010
From GridPP Wiki
Latest revision as of 13:22, 22 February 2010
Summary of Previous Week
- Memory upgrade
- Recovery tests using non backed-up archived logs
- Recovery tests from 64bit machines to 32bit machines
- Housekeeping script improvements
- Remove orisa nodes from oncall
- OCR checksum calculations
- Tests to add / remove cluster nodes
- Developing resilience plans with Matt Viljoen and Martin Bly
- Review quarterly Oracle patch update
Operational Issues and Incidents
- Tier1: still some EMC/Oracle problems on PLUTO/NEPTUNE. Looking for resilient hardware for nodes.
- Pluto1 reboot. Error in the swapping daemon on the Linux machine.
- Pluto2 reboot. Crash caused by motherboard failure.
- Recovery Catalog database crashed due to suspected hardware failure.
Plans for Week(s) Ahead
- Documentation for add / remove cluster nodes
- Restoration of VULCAN pre-prod system
- Test quarterly Oracle patch update
- Implement protection against the Oracle Java security vulnerability
- Move backup process from Neptune1 to Neptune4
Downtimes and At Risk
Description | Start | End | Affected VO(s) | Type |
---|---|---|---|---|
Migrate backup area and scripts from neptune1 (cdbc08) | 23rd of February | 23rd of February | ATLAS, LHCb, Nameserver | At risk |
Disable clusterware on cdbc08 node | Next Week | Next Week | ATLAS, LHCb | At risk |
Development Priorities
- Deploy CASTOR Database Monitoring
- Migrate ATLAS TAGs to 64bit systems
- Investigate ORACLE replication technique for LFC/FTS resilience
- Investigate hardware architecture, backup and recovery strategy, resilience and validation of restored backup.
Requirements and Blocking Issues
Description | Required By | Priority | Status |
---|---|---|---|
Hardware for new Neptune node | | Medium/high | Waiting |
Hardware to test LFC database replication | | Medium/high | Waiting |
OnCall
- Rich
Absences
- Carmine: out 22nd-23rd February