RAL Tier1 weekly operations castor 15/11/2010

From GridPP Wiki

Work previous week

  • Matthew:
    • LHCb testing during and after interventions
    • CMS 2.1.9 upgrade coordination
    • Facilities reconfiguration after DB hosting hardware moves to Maia
  • Shaun:
    • ..
  • Chris:
    • Castor Facilities work
    • Upgrading LHCb disk servers to 64-bit
    • Preparation for CMS upgrade next week
    • Castor on Duty work
  • Richard:
    • Used the grid to stress test Facilities instance
  • Brian:
    • ..
  • Jens:
    • ..

Operations Issues

  • After the LHCb disk servers were upgraded to SL5 64-bit, we received reports of some ROOTD jobs failing with authentication errors. The cause was missing entries in a configuration file that was previously controlled by Puppet. A workaround was implemented on Thursday morning.
  • The Puppetmaster was overloaded (again) after the LHCb disk server upgrade. Its own upgrade is being brought forward and will now follow the CMS 2.1.9 upgrade.
  • Fabric handed over gdss289 with a number of 2.1.9 RPMs installed, and it was deployed into ATLAS production (2.1.7) in that state. In future, re-deployed servers will be installed from scratch by Quattor.
  • On 10/11/10 a large backlog of stager requests appeared on the ATLAS SRMs. These were cleaned on the database.
  • On 13-14/11/10 ATLAS experienced high load and, due to a bug in the FTS, the SRMs were flooded with srmStatusOfPutRequest calls. We want to deploy two more SRMs dedicated to running the daemon only, which should make them more efficient until the FTS bug is fixed.

Blocking issues

  • Lack of production-class hardware running Oracle 10g needs to be resolved before CASTOR for Facilities goes into full production

Planned, Scheduled and Cancelled Interventions

Entries in/planned to go to GOCDB

Description                    Start             End               Type      Affected VO(s)
Update CMS to 2.1.9-6          16/11/2010 08:00  18/11/2010 18:00  Downtime  CMS
Update ATLAS to 2.1.9-6 (STC)  06/12/2010 08:00  08/12/2010 18:00  Downtime  ATLAS

Advanced Planning

  • Upgrade ATLAS, CMS, Gen disk servers to 64-bit OS
  • CASTOR upgrade to 2.1.9-10 and SRM upgrade to 2.10 to fix the unavailable status being reported to FTS with draining disk servers
  • CASTOR upgrade to 2.1.9-10 which incorporates the fix for gridftp-internal to support multiple service classes, enabling checksums for Gen
  • CASTOR for Facilities instance in production by end of 2010

Staffing

  • Castor on Call person: Shaun
  • Staff absence/out of the office:
    • Chris (Monday)