RAL Tier1 weekly operations castor 25/01/2010
From GridPP Wiki
Revision as of 15:47, 25 January 2010 by Matt viljoen
Summary of Previous Week
- Matthew:
- Intervention planning
- CIP and new castoradm1 testing
- Setting up preprod stress test
- Planning RAFL spend for new non-LHC instance
- Assisting DCI team members in evaluating CASTOR for Facilities
- Defining CASTOR roles for GridPP4
- Shaun:
- Load testing of checksum trigger
- Analysis of diskserver problem on certification
- Assisting ASGC
- Documentation updates
- Installation of new castor client library and kernel on srm boxes
- Chris:
- Castor on Duty
- Testing access restriction to disk servers with Jonathan
- Working and testing on preproduction instance
- Working on certification instance
- Doing verification work for disk server deployed by Quattor
- Cheney:
- Assist DB team with database failover testing
- Dished out programming work for NRPE plugins
- Started on restore of broken ads0pt02
- Rebuilt using DR procedure and cut over new castoradm1
- Listened to a chap from Google tell us how it should be done
- Assist Kash with fitting memory upgrades to database servers
- Tim:
- Configuring preprod tape system
- Richard:
- Began rebuilding castor301 and castor303 as pre-prod disk servers. castor301 is down with memory problems (JA is handling it); the apparent SCSI errors on castor303 have now been overcome. Some castor packages still need to be upgraded to the required versions
- Deployed gdss272 and gdss273 disk servers into atlasScratchDisk
- Brian:
- Draining of ATLAS simStrip RAID5 disk servers
- Cleanup of empty obsolete directories in the ATLAS nameserver
- Continued '10 tape families for ATLAS DATATAPE
- Analysis of SCRATCHDISK filling and decommissioning of 7 DATADISK servers for migration to SCRATCHDISK
- Jens:
- CIP upgrade testing
Developments for this week
- Matthew:
- Intervention planning
- Setting up preprod stress test
- Assisting DCI team members in evaluating CASTOR for Facilities
- Planning CERN trip
- CoD duties
- Shaun:
- Installation and testing of nameserver trigger
- Investigating high failure rates of ATLAS SAM tests
- castormon developments
- Documentation updates
- New disk pool for t2k
- Chris:
- Preparation for 'Big intervention' this week
- Coordinate Castor section for the intervention on Wednesday and Thursday
- Test and verify disk server deployed by Quattor
- Brian and I need to prepare 'redeployment' procedure for disk servers which were in production
- Test maximum number of job slots for new disk servers
- Verify lcgflex01/02/03 and enable Quattor if all 3 are identical
- Test 64 bit disk server with XFS (if time permits)
- Cheney:
- Database changes
- Bring into use EMC kit
- Tim:
- Spend stuff
- More pre-prod tape stuff
- Richard:
- Complete the update of castor packages on pre-prod disk servers castor301 and castor303.
- Brian:
- Draining of ATLAS simStrip RAID5 disk servers
- Jens:
- CIP upgrade (postponed from previous week), new release for other CASTOR sites
Operations Issues
- none
Blocking issues
- Lack of Quattor configuration files for SLC4.8 is stopping us evaluating Quattor alongside CASTOR 2.1.8. Preprod setup will initially proceed with a Kickstart-based deployment.
- Preprod DB can only be delivered after EMC testing is done (3rd week after Jan '10)
Planned, Scheduled and Cancelled Interventions
Entries in/planned to go to GOCDB
Description | Start | End | Type | Affected VO(s) |
---|---|---|---|---|
Upgrade of memory on database nodes | (Ongoing) | (Ongoing) | At Risk | All instances |
Upgrade LHC CIP and introduce redundancy of CIPs | 1/2/10 11:00 STC | 1/2/10 12:00 STC | At Risk | All instances |
Big Intervention - Day 1: fsck and kernel upgrades of all disk servers and head nodes (apart from tape servers). Add IPMI to head nodes. Restrict user login on disk servers. Update fetch-crl rpm on disk servers | 27/1/10 08:00 | 27/1/10 24:00 | Downtime | All instances |
Big Intervention - Day 2: Move DB to EMC kit. Replace cdbc08 and add new DB archive log destination. Install NameServer CheckSum Trigger | 28/1/10 00:00 | 28/1/10 17:00 | Downtime | All instances |
Advanced Planning
- Gen upgrade to 2.1.8 (2010 Q1)
- Install/enable gridftp-internal on Gen (This year/before 2.1.8 upgrade)
Staffing
- Castor on Call person: Matthew