Resiliency and Disaster Planning
- 1 Summary
- 1.1 ScotGrid
- 1.2 LondonGrid
- 1.3 NorthGrid
- 1.4 SouthGrid
- 1.5 Tier1
A major theme of GridPP22 was resiliency and disaster planning, with topics ranging from the loss of a site through to the tasks faced every day by system administrators. This page has been created to collate information about resiliency and disaster planning on a site-by-site basis. This should generate discussion of what preparations and precautions are being taken at each site.
- Conducted review of backup strategy. All new machines are now included in backups.
- Dirvish used for backups [10 days of daily backups, 3 months of weekly, 1 year of monthly].
- Daily off-site backup of cluster administration server [svr031] allowing full tier2 rebuild if necessary.
- OSSEC installed on all machines at ScotGrid. Provides a web interface, generation of alerts, a rules engine, a rootkit checker and scriptable actions. The Glasgow installation was very noisy at first, so time is required to tailor the rules for the site.
- Splunk installed on all machines at ScotGrid. Log aggregator and indexer with a web interface for searching. The free version has a 500 MB a day limit; Glasgow uses 100 MB a day. The full license is very expensive. Use cases: searching for suspicious IP addresses and hardware faults.
- OSSEC has Splunk integration and the two work nicely together.
- Cold start procedures updated after power outages. This helped to highlight missing steps.
- Appropriate machine room signage created after issues identifying server rooms, circuit breakers, switches etc.
- Emergency contacts list created. Phone numbers distributed amongst team.
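The backup retention scheme listed above (10 days of daily backups, 3 months of weekly, 1 year of monthly) can be sketched as a simple keep-or-expire check. This is only an illustration and not the Dirvish configuration used at the site: the function names are hypothetical, "3 months" and "1 year" are approximated as 90 and 365 days, and the choice of Sunday for the weekly copy and the 1st of the month for the monthly copy is an assumption.

```python
from datetime import date, timedelta

# Windows taken from the retention figures in the list above.
DAILY_DAYS = 10      # keep every backup from the last 10 days
WEEKLY_DAYS = 90     # "3 months" approximated as 90 days (assumption)
MONTHLY_DAYS = 365   # "1 year" approximated as 365 days (assumption)


def keep_backup(taken: date, today: date) -> bool:
    """Return True if a backup taken on `taken` should still be kept."""
    age = (today - taken).days
    if age < DAILY_DAYS:
        return True
    # Assumption: the Sunday backup is the one kept as the weekly copy.
    if age < WEEKLY_DAYS and taken.weekday() == 6:
        return True
    # Assumption: the backup from the 1st is kept as the monthly copy.
    if age < MONTHLY_DAYS and taken.day == 1:
        return True
    return False


def retained(today: date, history_days: int = 400) -> list[date]:
    """List the dates that survive, assuming one backup was taken every day."""
    days = (today - timedelta(days=n) for n in range(history_days))
    return [d for d in days if keep_backup(d, today)]
```

Calling `retained(date.today())` gives the set of backup dates that such a policy would leave on disk, which can be compared against what the backup server actually holds when reviewing the strategy.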