RAL Tier1 weekly operations castor 28/09/2009
From GridPP Wiki
Matt viljoen
Latest revision as of 09:58, 30 September 2009
Summary of Previous Week
- SRM 2.8 upgrade on ATLAS (Shaun, DB Team)
- Finalizing testing for CIP 2.0 (Jens)
- Investigating cause of D2D Transfer incident (Chris)
- Finalized disk server deployment documentation (Chris)
- Deployed 5 disk servers for atlasHotDisk and 14 for AtlasSimStrip (Chris)
- Working on a problem with the kernel clashing with the FC card, which prevents us from upgrading the tape servers to the latest kernel (Chris)
- Distributing Raid5/6 servers across service classes using draining (Brian)
- Diagnosed and fixed network cable problem on Vulcan test database (Cheney)
- Fixed sendmail problem on DLF database single (Cheney)
- Started build of a new failover tape robot controller (Cheney)
- Fixed SLS (out of inodes due to logrotate failure) (Cheney)
- Fixed controller crash on database hardware (twice) (Cheney)
- Applied changes to nagios config for new diskservers (Cheney)
- Applied Oracle ASM Patch on Production RACs (DB Team)
- Installing and acceptance testing new CASTOR servers (Richard, Cheney)
- Coordinating bringing CASTOR down for UPS test (Matt)
- Writing post mortem of NS upgrade D2D transfer incident (Matt)
- Working with GOCDB developers to suggest including 'DEGRADED' status (Matt)
Developments for this week
- Carry on working on kernel problem for tape servers (Chris)
- Black and White list tests (Chris)
- Carry on LSF investigation (Chris)
- Working on puppet manifest for polymorphic central servers (Chris)
- 2.8-1 deployment and testing (Shaun)
- Install and Configure Database Agent for Oracle Enterprise Manager at CERN (DB Team)
- Installing SLC 64 bit on new preprod machines (Richard)
- Finish off patching including non-castor (Cheney)
- Write next Techwatch newsletter (Cheney)
- Distributing Raid5/6 servers across service classes using draining (Brian)
- Chasing up strategic objectives (Matt)
- Disaster recovery documentation (Matt)
Ongoing
- CastorMon monitoring graphs for Gen instance (Brian)
Operations Issues
- Oracle ASM failed again on the night of 24/9/09. However, the Oracle patch worked and Oracle was able to recover without any adverse service impact.
Blocking issues
- Problems with the ganglia check on the GEN instance are delaying work on monitoring (in hand)
Planned, Scheduled and Cancelled Down Times
Entries in/planned to go to GOCDB
Description | Start | End | Type | Affected VO(s) |
---|---|---|---|---|
CIP 2.0 upgrade | 29/9/09 1200 | 29/9/09 1400 | At risk | All instances |
Changes to Production Milestones
Advanced Planning
- SRM 2.8-1 to be deployed this week
- Black and White lists? (delayed until they are required on a 'per-instance' basis)
- Improve resiliency of central services (This year)
Staffing
- Richard away Thurs, Fri
- Brian A/L Thurs, Fri
- CASTOR on-call person: Chris