RAL Tier1 weekly operations castor 01/02/2010
From GridPP Wiki
Summary of Previous Week
- Matthew:
- Fixed bug in persistent test suite
- Configured IPMI on central nodes
- Planning CERN trip
- CoD duties
- Testing CASTOR
- Shaun:
- SRM Development
- Repack instance tape problem
- Corrupted files
- Strategy Meeting
- Testing SRM monitoring
- Chris:
- Coordinating Castor shutdown
- Configured IPMI for SRM machines
- Applied the latest kernel and other outstanding upgrades to all castor servers
- Working on PreProd instance
- Reinstalling repack instance
- Pre-work for Security Challenge
- Cheney:
- ..
- Tim:
- ..
- Richard:
- Set up CCSE02..CCSE07 as CASTOR disk servers
- Brian:
- Draining ATLAS RAID5 servers
- Audit of RAID5 in D1TX disk servers
- Jens:
- CIP deployment discussion with CERN
Developments for this week
- Matthew:
- Testing CASTOR
- Setting up preprod stress test
- Assessing relevance of 2009 strategic actions and dropping where necessary
- Shaun:
- Analysis of ATLAS timeouts
- Disk-to-disk copy problems on LHCb
- Pre-production tape server-vdqm problem
- Chris:
- Restart production Castor services
- Finish work for repack instance
- Finish work for PreProd instance
- Test number of job slots for new disk servers on a per-protocol basis
- Test redeployment procedure
- Test xrootd disk server for aliceTape
- Cheney:
- ..
- Tim:
- ..
- Richard:
- Finalise configuration of CASTOR disk servers: CCSE02..07 + CASTOR301 + CASTOR303 + GDSS198 + GDSS368
- Brian:
- Draining ATLAS RAID5 servers
- Check removal of ATLAS servers ready for re-install
- Jens:
- CIP upgrade deployed
Operations Issues
- Major problems with the multipath setup were encountered when moving the databases back to the EMC kit, resulting in 5 days of unscheduled downtime
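For reference, a minimal `/etc/multipath.conf` stanza of the kind used with EMC CLARiiON-class arrays is sketched below. This is an illustrative example only: the WWID, alias, and blacklist entries are hypothetical placeholders, not the actual RAL production configuration.

```
# Illustrative /etc/multipath.conf sketch for an EMC array.
# WWID, alias, and blacklist values are hypothetical examples.
defaults {
    user_friendly_names yes
}

blacklist {
    devnode "^sda$"              # exclude the local system disk
}

devices {
    device {
        vendor                 "DGC"           # EMC CLARiiON vendor string
        product                "*"
        path_grouping_policy   group_by_prio
        failback               immediate
        no_path_retry          60
    }
}

multipaths {
    multipath {
        wwid   360060160a0b01e00xxxxxxxxxxxxxxxx   # placeholder WWID
        alias  castor_db_lun0                      # hypothetical alias
    }
}
```

A mismatch between stanzas like these and the paths actually presented by the array is a common cause of exactly this kind of extended outage, since the database volumes become inaccessible until the mapping is corrected.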
Blocking issues
- Lack of Quattor configuration files for SLC4.8 is stopping us evaluating Quattor alongside CASTOR 2.1.8. Preprod setup will initially proceed with a Kickstart-based deployment.
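Since the Preprod setup will initially use Kickstart, a minimal illustrative SLC4-style Kickstart fragment is sketched below. The repository URL, partitioning, and package set are assumptions for illustration, not the actual RAL profile.

```
# Illustrative SLC4.8 Kickstart fragment for a Preprod CASTOR node.
# URL, partitioning, and package list are hypothetical examples.
install
url --url http://linuxsoft.cern.ch/cern/slc48/i386/
lang en_GB.UTF-8
keyboard uk
network --device eth0 --bootproto dhcp
timezone Europe/London
bootloader --location=mbr
clearpart --all --initlabel
part /boot --size=100
part swap  --size=4096
part /     --size=1 --grow

%packages
@ base
openssh-server

%post
# Site-specific CASTOR configuration would be applied here,
# pending migration of these steps into Quattor templates.
```

Keeping the `%post` section thin should ease the later move to Quattor, since the site-specific steps remain in one place.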
Planned, Scheduled and Cancelled Interventions
Entries in/planned to go to GOCDB
Description | Start | End | Type | Affected VO(s)
---|---|---|---|---
Upgrade of memory on database nodes | (Ongoing) | (Ongoing) | At Risk | All instances
Upgrade LHC CIP and introduce redundancy of CIPs | 1/2/10 11:00 UTC | 1/2/10 12:00 UTC | At Risk | All instances
Big Intervention - Day 1: Fsck & kernel upgrades of all disk servers & head nodes (apart from tape servers). Add IPMI to head nodes. Restrict user login on disk servers. Update fetch-crl rpm on disk servers | 27/1/10 08:00 | 27/1/10 24:00 | Downtime | All instances
Big Intervention - Day 2: Move DB to EMC kit. Replace cdbc08 and add new DB archive log destination. Install NameServer CheckSum Trigger | 28/1/10 00:00 | 28/1/10 17:00 | Downtime | All instances
Advanced Planning
- Upgrade Gen instance to CASTOR 2.1.8 (2010 Q1)
- Install/enable gridftp-internal on Gen (this year, before the 2.1.8 upgrade)
Staffing
- Castor on Call person: Shaun
- Staff absences: Matthew (Thursday)