RAL Tier1 weekly operations castor 22/11/2010

From GridPP Wiki

Work previous week

  • Matthew:
    • CMS 2.1.9 upgrade planning
    • Testing during and after CMS upgrade
  • Shaun:
    • ..
  • Chris:
    • Castor Facilities work
    • CMS 2.1.9 upgrade
  • Richard:
    • Working on the 4 CIP servers to apply RPM errata and new kernel versions. In the process, discovered that one of them would not reboot unattended (which could, of course, have caused problems for an on-call person).
  • Brian:
    • ..
  • Jens:
    • ..

Operations Issues

  • During testing of CMS after the 2.1.9 upgrade, migration policies were initially being ignored and fewer than 350 files were migrated to the wrong tape pools. This was fixed before the end of the upgrade, but the files remain in the wrong pools.
  • During the night of 18-19/11/10, a number of CMS disk2disk copies failed due to a known LSF problem. The problem was fixed on Friday morning. We have modified our instance restart procedures to get around this problem.
  • On 19/11/10, transfers from cmsWanOut were very slow. This was due to a high number of unscheduled disk2disk copies from cmsFarm (51 disk servers) that were swamping the network on the far fewer disk servers in cmsWanOut (5 disk servers). The number of diskcopies was temporarily reduced from 5 to 1.
  • On 22/11/10, a large number of accesses to hot files on 3 cmsWanOut disk servers caused very low data transfer rates. Putting these disk servers into Draining mode redistributed the files, which helped.

Blocking issues

  • Lack of production-class hardware running ORACLE 10g needs to be resolved prior to CASTOR for Facilities going into full production

Planned, Scheduled and Cancelled Interventions

Entries in/planned to go to GOCDB

Description             | Start            | End              | Type     | Affected VO(s)
Update ATLAS to 2.1.9-6 | 06/12/2010 08:00 | 08/12/2010 18:00 | Downtime | ATLAS

Advanced Planning

  • Deploy new puppetmaster, ideally before ATLAS upgrade
  • Upgrade ATLAS, CMS and Gen disk servers to a 64-bit OS
  • CASTOR upgrade to 2.1.9-10 and SRM upgrade to 2.10 to fix the unavailable status being reported to FTS with draining disk servers
  • CASTOR upgrade to 2.1.9-10 which incorporates the fix for gridftp-internal to support multiple service classes, enabling checksums for Gen
  • CASTOR for Facilities instance in production by end of 2010


  • Castor on Call person: Chris
  • Staff absence/out of the office:
    • Matthew on A/L Thurs PM