RAL Tier1 weekly operations castor 20090706
Summary of Previous Week
- Moving CASTOR central services to R89, then bringing them up and testing them
- SRM development (Shaun)
- Certification of 2.1.7-27 with new LSF configuration (Chris)
Developments for this week
- Monitoring CASTOR as it is brought back into production (All)
- 2.1.7-27 upgrade preparation - testing synchronisation and kernel upgrades (Chris)
- SRM development (Shaun)
- CIP development (Jens)
Ongoing
- Cleaning up database for a future 2.1.8 upgrade
- Setting up Preproduction (Matt)
- Test 2.1.8-8 on tape drives (Tim)
- Prepare preproduction platform for stress testing (crosstalk investigations suspended) (Chris/Matt)
- Adding virtual disk servers to preproduction (Matt)
Operations Issues
- Tape servers were stuck in BUSY state after CASTOR startup and needed to be reset
- 3 dead PSUs on head nodes
- ypbind did not start up on a head node, even though it was enabled (set to "on") with chkconfig
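The ypbind issue above is a mismatch between the configured state (enabled in chkconfig) and the actual state (not running). A minimal sketch of how the configured side could be checked is shown here, assuming the usual `chkconfig --list` line layout of `name 0:off 1:off ... 6:off`. The function names and the sample line are hypothetical and for illustration only; they are not taken from the affected head node.

```python
import re


def enabled_runlevels(chkconfig_line):
    """Return the set of runlevels marked 'on' in one `chkconfig --list` line."""
    pairs = re.findall(r"(\d):(on|off)", chkconfig_line)
    return {int(level) for level, state in pairs if state == "on"}


def should_be_running(chkconfig_line, current_runlevel):
    """True if the service is configured to start in the given runlevel."""
    return current_runlevel in enabled_runlevels(chkconfig_line)


# Illustrative sample line (an assumption, not real output from the head node).
sample = "ypbind          0:off   1:off   2:off   3:on    4:on    5:on    6:off"

if __name__ == "__main__":
    print(sorted(enabled_runlevels(sample)))
    print(should_be_running(sample, 3))
```

Comparing this result with the real service status after a reboot would flag services that are enabled but did not start, which is the situation seen here.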
Blocking issues
none
Scheduled and Cancelled Down Times
Entries in/planned to go to GOCDB
Description | Start | End | Type | Affected VO(s) |
---|---|---|---|---|
R89 move | 25/6/09 0600 | 6/7/09 1200 | Downtime | All |
R89 move | 6/7/09 1200 | 10/7/09 1700 | At Risk | All |
Apply Oracle BigID patch | 13/7/09 0800 | 13/7/09 1700 | At Risk | All |
2.1.7-27 upgrade and LSF reconfiguration | 14/7/09 0800 | 14/7/09 1700 | Downtime | All |
2.1.7-27 upgrade and LSF reconfiguration | 14/7/09 0700 | 15/7/09 1700 | At Risk | All |
Advanced Planning
- Preferably do kernel upgrades of all systems during 2.1.7-27 upgrade
- SRM 2.8 upgrade (sometime during July)
- Start using Black and White lists (sometime during July)
- CIP upgrade to include nearline publishing (sometime during July)
Staffing
- Castor on Call person (is also Castor on Day Duty): Shaun
- Chris at CRISTAL1 course Mon-Wed
- Matt at CERN for the STEP09 post-mortem Thurs-Fri