RAL Tier1 weekly operations castor 18/01/2010
Summary of Previous Week
- Matthew:
- Intervention planning
- Tape-backed persistent tests
- Access to CASTOR for facilities testing
- Preprod stress test planning
- Shaun:
- Chris:
- Testing new kernel on certification before intervention
- Working with Tim on getting repack instance working again
- Working on certification and preproduction instance
- Fixed ralreplicas for LHCb and Gen
- Redeployed gdss110 for repack instance
- Cheney:
- Re-validation of EMC kit
- Relocate neptune voting disk ahead of cdbc08 retirement
- Prep for DB take-on of EMC kit
- Fixed IPMI SOL on IBM x3550
- A little patching
- Tim:
- Working on getting repack working again
- Setting up and deleting ATLAS tape families
- Richard:
- Brian:
- Draining of RAID5 disk servers within ATLAS
- Removal of service classes which are no longer needed
- Planning for new ATLAS/LHCb server deployment and consequent draining
- Jens:
- CIP upgrade planning
- Debugging CASTOR SRM authentication problem with Shaun
Developments for this week
- Matthew:
- Intervention planning
- CIP and new castoradm1 testing
- Disk server redeployment
- Setting up preprod stress test
- Shaun:
- Testing SRM with CASTOR 2.1.8-17 client libraries
- Testing database load with nameserver checksum trigger
- SRM testing
- Understanding the castormon source code
- Recovering certification system
- Chris:
- Update repack instance plus preprod disk servers (castor30x)
- Test disk server deployment procedure using Quattor
- Test maximum number of LSF job slots for an 18TB disk server (see the LSF sketch after this list)
- Test a 64-bit disk server with XFS (see the XFS sketch after this list)
- Test access restriction to disk servers with Jonathan (see the pam_access sketch after this list)
- Work on preproduction instance
- Cheney:
- Revalidation of EMC kit
- Restore castoradm1 (again)
- Fitting memory sticks
- Training on how to use IPMI (see the ipmitool sketch after this list)
- Tim:
- Look at what needs to be purchased for LHC and non-LHC CASTOR instances
- Richard:
- Brian:
- Draining of RAID5 disk servers within ATLAS
- Jens:
- Prepare for and upgrade all RAL T1 production CIPs to 2.1.0
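For the LSF job slot test on Chris's list, the per-host slot limit lives in LSF's lsb.hosts file. A minimal sketch of the kind of entry we would vary during the test is below; the hostname and slot count are placeholders, not production values.

    # lsb.hosts - per-host maximum job slots (MXJ); values are illustrative only
    Begin Host
    HOST_NAME     MXJ       # Keywords
    gdss999       40        # placeholder 18TB disk server and trial slot count
    default       !         # "!" means one slot per CPU
    End Host

After editing, running badmin reconfig reloads the batch configuration without a restart.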
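For the 64-bit XFS test, the filesystem creation and mount steps are standard; a minimal sketch is below, with the device and mount point as placeholders.

    # Illustrative only - device and mount point are placeholders
    mkfs.xfs -f /dev/sdb1
    mkdir -p /exportstage/fs1
    mount -t xfs -o noatime,inode64 /dev/sdb1 /exportstage/fs1
    # matching /etc/fstab entry:
    # /dev/sdb1   /exportstage/fs1   xfs   noatime,inode64   0 0

The inode64 option is the main point of testing on a 64-bit host, since it lets XFS allocate inodes anywhere on a multi-terabyte filesystem.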
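For the access restriction test with Jonathan, one likely mechanism is pam_access; the sketch below assumes that approach, and the permitted admin account is a placeholder rather than an agreed policy.

    # /etc/security/access.conf - illustrative policy only
    + : root : ALL
    + : csfadm : ALL        # placeholder admin account
    - : ALL : ALL
    # enforced by adding this line to /etc/pam.d/sshd (or system-auth):
    # account    required    pam_access.so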
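For the IPMI training item, the commands to cover are standard ipmitool invocations such as the ones below; the BMC hostname and credentials are placeholders.

    # Illustrative ipmitool usage - BMC address and credentials are placeholders
    ipmitool -I lanplus -H x3550-bmc.example.ac.uk -U admin -P secret chassis power status
    ipmitool -I lanplus -H x3550-bmc.example.ac.uk -U admin -P secret sol activate      # serial-over-LAN console
    ipmitool -I lanplus -H x3550-bmc.example.ac.uk -U admin -P secret sol deactivate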
Operations Issues
- ATLAS LSF became momentarily unstable due to large log files caused by missing servers (a log-rotation sketch follows below)
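One way to keep the LSF logs from growing unbounded while servers are missing would be size-based rotation; the sketch below is illustrative only, and the log path and limits are assumptions rather than the current configuration.

    # /etc/logrotate.d/lsf - illustrative; path and limits are assumptions
    /var/lsf/log/*.log {
        size 100M
        rotate 4
        compress
        missingok
        notifempty
    }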
Blocking issues
- The lack of Quattor configuration files for SLC4.8 is preventing us from evaluating Quattor alongside CASTOR 2.1.8. Preprod setup will initially proceed with a Kickstart-based deployment.
- Preprod DB can only be delivered after EMC testing is done (3rd week after Jan '10)
Planned, Scheduled and Cancelled Interventions
- 18-22 January - At-risk while memory on database nodes is upgraded
- 19 January - Move castoradm1 to a newer host. No downtime or at-risk
- 19-20 January - Upgrade SRM CASTOR client to 2.1.8-17
- 21 January - Upgrade LHC CIP and introduce redundancy of CIPs. 1 hour at-risk
- 27-28 January
- FSCK disk servers and pick up new kernels
- Add IPMI to CASTOR head nodes
- Replace cdbc08 and add new DB archive log destination
- Install NameServer checksum trigger
- Restrict user login on disk servers
- The following have not been folded into the above schedule; they can be fitted in around other work since they are, at worst, an 'at-risk'.
- Update fetch-crl RPM on disk servers (see the sketch below)
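Rolling the fetch-crl update out across the disk servers is a plain RPM update; a sketch assuming yum and an ssh loop is below, with the host list as a placeholder.

    # Illustrative only - hostnames are placeholders
    for h in gdss101 gdss102 gdss103; do
        ssh root@"$h" "yum -y update fetch-crl && /usr/sbin/fetch-crl"
    done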
Advanced Planning
- Gen upgrade to 2.1.8 in 2010 Q1
- Install/enable gridftp-internal on Gen (this year, before the 2.1.8 upgrade)
Staffing
- CASTOR on-call person: Chris