RAL Tier1 weekly operations castor 24/10/2011

Operations News

WAN tuning changes were rolled out to approximately half of the production disk servers on the 21st. It remains to be seen whether they have improved transfer rates.

Operations Problems

3 CMS disk servers (gdss303, 304, 305) were found to hold a large amount of dark data. They had been redeployed from another instance after cleanLostFiles had been run on them, but without waiting for garbage collection to run. In future, the CASTOR team will wipe the data partitions of redeployed disk servers with "rm -rf" to avoid this problem.
Database hardware problems on Saturday brought down all instances of CASTOR. Service was restored on Sunday after hardware reconfiguration.

Blocking Issues

We need to understand the cause of the new database disk array hardware problem before we can migrate production databases over to it.

Planned, Scheduled and Cancelled Interventions

Entries in/planned to go to GOCDB: none

Advanced Planning

Move Tier1 instances to the new database infrastructure, with a Dataguard backup instance in R26
Upgrade SRMs to 2.11, which incorporates VOMS support
Certify 2.1.11 and evaluate the Transfer Manager (the new LSF replacement)
Quattorization of remaining SRM servers
Hardware upgrade, Quattorization and upgrade to SL5 of Tier1 CASTOR headnodes

Staffing

Castor on Call person: Matthew
Staff absence/out of the office:
- Matthew at LTUG (Wed) and in DL (Fri)
