RAL Tier1 weekly operations castor 24/08/2009
Summary of Previous Week
- CIP development on certification (Jens)
- Wrote LSF archiving mechanism that archives logs for 3 months on DMF and keeps 24 hours of logs on the local partition (Chris)
- Restart and recovery of all instances following AC failure (All)
- Development of load test generator for database tuning testing (Chris)
- Redeployed hotspare from preprod to production instances (Chris)
- Identification of historical data for deletion from ATLAS MCDISK (Brian)
- Improved database metric monitoring (Eter)
- Fixed (?) partitioning problem on ATLAS DLF (Rich)
- Investigation of problems with repacking some tapes (Tim/Shaun)
- Media cleaning and robot checkout following water ingress (Tim)
- Deployment of new puppet restarter (Shaun)
- Tested database cleanup procedure in preparation for the CASTOR 2.1.8 upgrade (Shaun)
Developments for this week
- CIP development on certification (Jens)
- Implement newly written LSF archiving mechanism in production (Chris)
- Review and update disk server deployment procedure (Chris)
- Co-ordinate ATLAS data reconciliation (Brian)
- CASTOR Database tuning testing (Database team)
- Prepare certification for SRM 2.8 testing (Shaun)
Ongoing
- CastorMon monitoring graphs for Gen instance (Brian)
- Cleaning up database for a future 2.1.8 upgrade (Shaun)
- Setting up Preproduction (Matt, Chris)
Operations Issues
- Lost gdss169. ATLAS has been informed and replication of the data should be under way.
- Expired CRLs caused problems bringing CASTOR instances back up.
- Tape migration problems on CMS on 21 Aug. 2009
Blocking issues
- Problems with the Ganglia check on the GEN instance are delaying work on monitoring
- Testing of gridFTP internal is delaying work on the SRM upgrade
Scheduled and Cancelled Down Times
none
Changes to Production Milestones
none
Advanced Planning
- CIP upgrade to include nearline publishing (August)
- SRM 2.8 upgrade (August)
- Work with Fabric to add extra RAID card in remaining Viglen'06 disk servers (Second half of August)
- Database optimization tasks (September)
- Upgrade nameserver to 2.1.8 (Possibly during September)
- Black and White lists? (Possibly during September)
- Improve resiliency to central services (This year)
Staffing
- CASTOR on-call person: Chris
- Cheney on A/L
- Matt on A/L