RAL Tier1 weekly operations castor 15/11/2010
From GridPP Wiki
Work previous week
- LHCb testing during and after interventions
- CMS 2.1.9 upgrade coordination
- Facilities instance reconfiguration after database hosting hardware moves to Maia
- Castor Facilities work
- Upgrading LHCb disk servers to 64-bit
- Preparation for CMS upgrade next week
- Castor on Duty work
- Used the grid to stress test Facilities instance
- After the LHCb disk server upgrade to SL5 64-bit, we received reports of some ROOTD jobs failing with authentication errors. The cause was missing entries in a configuration file that was previously controlled by Puppet. A workaround was implemented on Thursday morning.
- The Puppetmaster became overloaded (again) after the LHCb disk server upgrade. We are bringing its upgrade forward so that it follows the CMS 2.1.9 upgrade.
- gdss289 was handed over by Fabric with a number of 2.1.9 RPMs installed on it, although it was being deployed into ATLAS production (2.1.7). In future, re-deployed servers will be installed from scratch by Quattor.
- On 10/11/10 a large backlog of stager requests appeared on the ATLAS SRMs. These were cleaned up in the database.
- On 13-14/11/10 ATLAS experienced high load and, due to a bug in the FTS, the SRMs were flooded with srmStatusOfPutRequest calls. We want to deploy two more SRMs dedicated to running the daemon only, which should make them more efficient until the FTS bug is fixed.
- Lack of production-class hardware running Oracle 10g needs to be resolved before CASTOR for Facilities goes into full production
Planned, Scheduled and Cancelled Interventions
Entries in/planned to go to GOCDB
|Description||Start||End||Type||Affected VO|
|Update CMS to 2.1.9-6||16/11/2010 08:00||18/11/2010 18:00||Downtime||CMS|
|Update ATLAS to 2.1.9-6 (STC)||06/12/2010 08:00||08/12/2010 18:00||Downtime||ATLAS|
- Upgrade ATLAS, CMS and Gen disk servers to a 64-bit OS
- CASTOR upgrade to 2.1.9-10 and SRM upgrade to 2.10, to fix the unavailable status reported to FTS when disk servers are draining
- CASTOR upgrade to 2.1.9-10, which incorporates the fix for gridftp-internal to support multiple service classes, enabling checksums for Gen
- CASTOR for Facilities instance in production by end of 2010
- Castor on Call person: Shaun
- Staff absence/out of the office:
- Chris (Monday)