RAL Tier1 weekly operations castor 03/10/2011

From GridPP Wiki

Contents

1 Operations News
2 Operations Problems
3 Blocking Issues
4 Planned, Scheduled and Cancelled Interventions
5 Advanced Planning
6 Staffing

Operations News

5TB T10KC drives moved into production (and repack) for ATLAS on Wednesday
The Facilities database moved to new hardware running 10g on Wednesday

Operations Problems

Another spate of database inconsistencies brought down the ATLAS instance for 11 hours on Tuesday. The incident resembles one in July, when a single subrequest without an entry in id2type brought the instance down; this time there were many orphaned subrequests. All subrequests had to be invalidated. This incident will be reviewed on Wednesday.
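
As an illustration of the kind of check that flags orphaned subrequests, the sketch below runs an anti-join between two toy tables named after the ones mentioned above. The column layout, the status and type values, and the use of an in-memory SQLite database are all assumptions made for this example; this is not the production schema or the procedure used during the incident.

```python
import sqlite3


def find_orphaned_subrequests(conn):
    """Return ids of subrequest rows that have no matching id2type row."""
    rows = conn.execute(
        """
        SELECT s.id
        FROM subrequest AS s
        LEFT JOIN id2type AS t ON t.id = s.id
        WHERE t.id IS NULL
        ORDER BY s.id
        """
    ).fetchall()
    return [row[0] for row in rows]


def build_toy_database():
    """Create a hypothetical two-table layout with two orphaned rows."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical columns: the real tables are not described in this report.
    conn.execute("CREATE TABLE subrequest (id INTEGER PRIMARY KEY, status TEXT)")
    conn.execute("CREATE TABLE id2type (id INTEGER PRIMARY KEY, type INTEGER)")
    conn.executemany(
        "INSERT INTO subrequest VALUES (?, ?)",
        [(1, "ok"), (2, "ok"), (3, "ok"), (4, "ok")],
    )
    # Subrequests 2 and 4 deliberately get no id2type entry.
    conn.executemany("INSERT INTO id2type VALUES (?, ?)", [(1, 27), (3, 27)])
    return conn


if __name__ == "__main__":
    print(find_orphaned_subrequests(build_toy_database()))
```

A query of this shape only reports inconsistencies; deciding whether to invalidate or repair the affected rows remains an operational decision.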

Blocking Issues

We need to understand the cause of the new database disk array hardware problem before we can migrate production databases over to it.

Planned, Scheduled and Cancelled Interventions

Entries in, or planned to go to, GOCDB: none

Advanced Planning

Move Tier1 instances to the new database infrastructure, with a Dataguard backup instance in R26
Upgrade SRMs to 2.11, which incorporates VOMS support
Certify 2.1.11 and evaluate the new LSF replacement
Quattorization of remaining SRM servers
Hardware upgrade, Quattorization and upgrade to SL5 of Tier1 CASTOR headnodes

Staffing

CASTOR on-call person: Shaun
Staff absence/out of the office:
- none
