RAL Tier1 weekly operations castor 02/11/2009
From GridPP Wiki
Matt viljoen (Talk | contribs) |
Latest revision as of 15:53, 2 November 2009
Summary of Previous Week
- SRM 2.8-2 deployment on all instances (Shaun)
- Developed DB fix to allow checksumming to work on 2.1.7 (Shaun)
- Deploying new disk servers (Chris)
- Deployed new disk servers for LHCb, ATLAS, CMS (Chris)
- Setting up repack (Chris)
- Split grid-mapfile across instances (Chris)
- Fixed FC problem (Cheney)
- Bringing up Vulcan array (Cheney)
- Enabling monitoring of Overland array (Cheney)
- Enabling SNMP monitoring of arrays (Cheney)
- Working on tape server problems (Cheney, Tim)
- Nagios training for John and Tiju (Cheney)
- Continuing to investigate EMC problems (Tim)
- CastorMon monitoring graphs for Gen instance (Brian)
- Draining disk servers for LHCb to move from lhcbDst->lhcbRawRdst (Brian, Matthew)
- Disaster management of recent data loss (Matthew)
- Deploying 4 new disk servers for repack server (Matthew)
- Incident review of CIP upgrade (Matthew, Jens, All)
- CASTOR-Fabric work proposal (All)
Developments for this week
- Improving resilience on central servers (Chris, Shaun)
- Working on puppet manifest for polymorphic central servers (Chris)
- Setup 2.1.8 on repack server (Chris)
- Installing T10KB drives (Tim)
- CastorMon monitoring graphs for Gen instance (Brian)
- Building Quattor templates for preprod (Richard)
- Deploying 3 new tape servers for repack server (Matthew)
- Move 2 disk servers from lhcbDst->lhcbRawRdst (Matthew)
- Deploying new disk servers (Matthew, Shaun)
- Reviewing preprod plans (Matthew)
- Disaster recovery document (Matthew)
Operations Issues
- A broken FC cable led to a failed DB node and instability of the redistributed databases. In particular, the Nameserver was unavailable for 1h (Mon-Tues).
- Unknown problems affected the Gen SRM database (Wed)
- Disk server status became unsynchronized between RMMaster and Stager on CMS, for unknown reasons (Thurs)
Blocking issues
none
Planned, Scheduled and Cancelled Down Times
Entries in/planned to go to GOCDB
Description | Start | End | Type | Affected VO(s)
---|---|---|---|---
Application of Quarterly ORACLE patches | 10/11/09 0900 | 10/11/09 1300 | At Risk | All instances
Changes to Production Milestones
none
Advanced Planning
- Black and White lists will be tested and introduced on ATLAS
- Install/enable gridftp-internal on Gen (this year)
Staffing
- Tim away (Tues)
- CASTOR on-call person: Chris