RAL Tier1 weekly operations castor 03/10/2011
From GridPP Wiki
Revision as of 14:47, 3 October 2011 by Matt viljoen
Operations News
- 5TB T10KC drives moved into production (and repack) for ATLAS on Wednesday
- The Facilities database moved to new hardware running 10g on Wednesday
Operations Problems
- Another spate of database inconsistencies brought down the ATLAS instance for 11 hours on Tuesday. The incident resembled one in July, when a single subrequest without an entry in id2type brought the instance down; this time there were many orphaned subrequests. All subrequests had to be invalidated. The incident will be reviewed on Wednesday.
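As a rough illustration of the kind of consistency check involved, the sketch below finds ids that lack an entry in a type mapping. It uses in-memory Python data rather than the CASTOR database. The function name, the sample ids and the type codes are all hypothetical, and the real schema and the exact definition of an orphaned subrequest may differ.

```python
def find_missing_type_entries(subrequest_ids, id2type):
    """Return, in sorted order, the ids that have no entry in the id-to-type mapping."""
    return sorted(i for i in set(subrequest_ids) if i not in id2type)


# Hypothetical sample data: four subrequest ids, only two of which
# have a corresponding entry in the (hypothetical) id2type mapping.
subrequest_ids = [101, 102, 103, 104]
id2type = {101: 27, 103: 27}

missing = find_missing_type_entries(subrequest_ids, id2type)
print(missing)
```

Run periodically, a check of this shape could flag inconsistent entries before they affect an instance, but whether that is practical depends on the actual database layout.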
Blocking Issues
- We need to understand the cause of the new database disk array hardware problem before we can migrate production databases over to it.
Planned, Scheduled and Cancelled Interventions
Entries in/planned to go to GOCDB: none
Advanced Planning
- Move Tier1 instances to the new database infrastructure, with a Dataguard backup instance in R26
- Upgrade SRMs to 2.11, which incorporates VOMS support
- Certify 2.1.11 and evaluate the new LSF replacement
- Quattorization of remaining SRM servers
- Hardware upgrade, Quattorization and Upgrade to SL5 of Tier1 CASTOR headnodes
Staffing
- Castor on Call person: Shaun
- Staff absence/out of the office:
- none