RAL Tier1 weekly operations castor 19/10/2009

Summary of Previous Week

  • CASTOR F2F at CERN (Chris, Matt)
  • Continuing to deal with the fallout from the Oracle disk controller crash, specifically the rollback of the databases (All)
    • Investigation into exactly what happened (All, DB Team)
    • Investigating the consequences of re-using NS uniqueids (Chris, Matt, CERN team)
    • Producing lists of lost and at-risk files (Chris, Matt)
    • Gathering information for post mortem (All)
    • Increased NS uniqueid counter in NS database (All, DB Team)
  • Deployed one new disk server for LHCb (Chris)
  • Tweaked database backups to try out a grandfather/father/son cycle; see the retention sketch after this list (Cheney)
  • Continued with build of new db server cdbe07 (Cheney)
  • Tweaked backups of redo logs to dmf for Pluto (Cheney)
  • Added bulk log disk array for Pluto redo log archive (Cheney)
  • Fixed cdbe02 and configured it to pick up the Overland array (Cheney)
  • Shifted EMC array to run on a different PDU but the same power supply (Cheney)
  • Building tape robot controller to swap out buxton (Cheney)
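
The grandfather/father/son backup cycle above could be checked by a small script along the following lines. This is only an illustrative Python sketch: the filename format, tier rules and retention counts are assumptions, not the actual RAL backup configuration.

  # Illustrative sketch only: classify dated backup files into GFS tiers and
  # flag those outside their tier's retention window. Filename format and
  # retention counts are assumptions, not the real CASTOR backup setup.
  import datetime
  import pathlib

  KEEP = {"son": 7,           # daily backups kept for a week
          "father": 4,        # weekly (Sunday) backups kept for a month
          "grandfather": 12}  # monthly (1st of month) backups kept for a year

  def tier(day: datetime.date) -> str:
      """Return the GFS tier of a backup taken on 'day'."""
      if day.day == 1:
          return "grandfather"
      if day.weekday() == 6:   # Sunday
          return "father"
      return "son"

  def expired(backups: list[pathlib.Path]) -> list[pathlib.Path]:
      """Return backups that fall outside their tier's retention window."""
      by_tier = {"son": [], "father": [], "grandfather": []}
      # Filenames are assumed to look like 'pluto-2009-10-19.dmp'.
      for path in sorted(backups, key=lambda p: p.stem, reverse=True):
          day = datetime.date.fromisoformat(path.stem.split("-", 1)[1])
          by_tier[tier(day)].append(path)
      return [p for name, paths in by_tier.items() for p in paths[KEEP[name]:]]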

Developments for this week

  • Set up 2.1.8 on repack server with Puppet (Chris)
  • Working on puppet manifest for polymorphic central servers (Chris)
  • Testing various combinations of EMC kit versus power supply (Cheney)
  • Regenerate Nagios config for disk servers; see the sketch after this list (Cheney)
  • Build spare tape robot controller (Cheney)
  • Build replacement db server (Cheney)
  • Techwatch newsletter (Cheney)
  • Making ATLAS file lists for comparison against the LFC; see the comparison sketch after this list (Matt)
  • Contributing to incident PMs (Matt)
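
The Nagios regeneration for disk servers could look roughly like the following. This is a sketch only: the file paths, hostgroup and host template names are assumptions, not the actual RAL Nagios configuration.

  # Illustrative sketch only: regenerate Nagios host definitions for disk
  # servers from a plain host list. Paths, the hostgroup and the template
  # name are assumptions, not the real RAL Nagios setup.
  HOST_TEMPLATE = """define host {{
      use         generic-diskserver
      host_name   {name}
      address     {name}.example.ac.uk
      hostgroups  castor-diskservers
  }}
  """

  def regen(hostlist="diskservers.txt", out="diskservers.cfg"):
      with open(hostlist) as src, open(out, "w") as cfg:
          for line in src:
              name = line.strip()
              if name and not name.startswith("#"):   # skip blanks and comments
                  cfg.write(HOST_TEMPLATE.format(name=name))

  if __name__ == "__main__":
      regen()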

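The ATLAS file-list comparison against the LFC could follow the pattern below. The sketch assumes both the CASTOR file list and the LFC dump are plain text with one entry per line and a comparable key (path or SURL) in the first column; the input filenames are hypothetical.

  # Illustrative sketch only: compare a CASTOR file list with an LFC dump.
  # Dump formats and filenames are assumptions for illustration.
  def load(dump):
      with open(dump) as f:
          return {line.split()[0] for line in f if line.strip()}

  def compare(castor_dump, lfc_dump):
      castor, lfc = load(castor_dump), load(lfc_dump)
      # Catalogue entries with no CASTOR file are candidates for the 'lost'
      # list; CASTOR files absent from the catalogue are dark data.
      for entry in sorted(lfc - castor):
          print("missing-from-castor", entry)
      for entry in sorted(castor - lfc):
          print("missing-from-lfc", entry)

  if __name__ == "__main__":
      compare("castor_atlas_files.txt", "lfc_atlas_dump.txt")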

Ongoing

  • SRM 2.8-1 deployment on Gen, LHCb, CMS (Shaun)
  • CastorMon monitoring graphs for Gen instance (Brian)
  • Black and White list tests (Chris)
  • Disaster recovery document (Matt)

Operations Issues

  • Possible data loss resulting from re-using NS uniqueids (TBC).
  • Problems with the DNS server (chiton) affected all CASTOR instances for 4-5 hours.

Blocking issues

  • Problems with the Ganglia check on the Gen instance are delaying work on monitoring (in hand)

Planned, Scheduled and Cancelled Down Times

none

Changes to Production Milestones

none

Advanced Planning

  • Black and White lists? (delayed until it is required on a 'per-instance' basis)
  • Improve resiliency of central services (this year)

Staffing

  • Brian on annual leave (A/L)
  • Tim at LTUG (Mon-Wed)
  • Shaun away (?)
  • CASTOR on-call person: Chris