RAL Tier1 weekly operations castor 21/09/2009
From GridPP Wiki
Revision as of 14:43, 21 September 2009 by Matt viljoen
Summary of Previous Week
- NS 2.1.8-3 upgrade & testing (Chris, All)
- SRM 2.8 upgrade on CMS (Shaun, DB Team)
- Implementing database performance tuning (DB Team)
- Updating database kernels (Cheney)
- Dealing with D2D Transfer incident following NS upgrade (All)
- Investigating distributing Raid5/6 servers across service classes (Brian)
- Installation of new CASTOR servers (Tim)
- T10KB tape deployment and hardware plans (Tim/Matt)
- Strategic plans updates (Matt)
- Preprod plans (Matt, Chris, Richard)
Developments for this week
- SRM 2.8 upgrade on ATLAS (Shaun, DB Team)
- Finalizing testing for CIP 2.0 (Jens)
- Investigating cause of D2D Transfer incident (Chris)
- Preparing disk server deployment documentation (Chris)
- Investigating distributing Raid5/6 servers across service classes (Brian)
- Investigating cause of DB hardware problems (Cheney)
- Acceptance testing new CASTOR servers (Richard)
- Chasing up strategic objectives (Matt)
- Disaster recovery documentation (Matt)
Ongoing
- CastorMon monitoring graphs for GEN instance (Brian)
- Setting up Preproduction (Richard, Chris)
Operations Issues
- D2D Transfer incident following NS upgrade affecting all instances
Blocking issues
- Problems with the Ganglia check on the GEN instance are delaying work on monitoring (in hand)
Planned, Scheduled and Cancelled Down Times
Entries in, or planned to go into, GOCDB
Description | Start | End | Type | Affected VO(s) |
---|---|---|---|---|
Upgrade ATLAS SRM to 2.8 | 21/9/09 1000 | 21/9/09 1200 | Downtime | ATLAS |
Oracle patch to prevent recurrence of recent hardware problem | 21/9/09 1200 | 21/9/09 1400 | At risk | All instances |
Suspend CASTOR during R89 UPS test | 22/9/09 0800 | 22/9/09 1000 | Downtime | All |
CIP 2.0 upgrade | 29/9/09 1200 | 29/9/09 1400 | At risk | All |
Changes to Production Milestones
Description | Changed Status |
---|---|
SRM upgrade to 2.8 (H) Shaun | DONE |
Nameserver upgrade to 2.8 (L) Chris | DONE |
Move CMS to T10KB (M) Tim | Ongoing. Meeting with AS and Chris Brew on 18/9/09 about how to implement this. |
Advanced Planning
- CIP upgrade to include nearline publishing (Sept)
- Black and White lists? (delayed until it is required on a 'per-instance' basis)
- Improve resiliency to central services (This year)
Staffing
- Brian away Monday and Tuesday
- Richard away Monday
- CASTOR on-call person: Matthew