RAL Tier1 weekly operations castor 29/11/2010


Operations News

  • On 25/11/10 all ATLAS and CMS SL08 disk servers were put into read-only mode via LSF, to prevent new files being lost should there be a further catastrophic crash.

Operations Issues

  • On 22/11/10 CMS experienced slow transfers of files out of cmsWanOut. Three disk servers were running very hot; putting them into draining mode, which redistributed the hot files, helped.
  • On 24/11/10 at 00:34 and again on 27/11/10 at 22:59 the CMS jobmanager stopped processing requests for approx. 30 minutes (on both occasions) for unknown reasons, after which it resumed normal operation. During these periods, transfers to/from RAL failed. A second jobmanager instance has been enabled on the CMS instance to protect against a future recurrence.
  • Very slow connectivity is affecting a number of disk servers in the CMS, ATLAS and LHCb instances. Indications are that there may be a common networking problem.

Blocking issues

  • The lack of production-class hardware running Oracle 10g needs to be resolved before CASTOR for Facilities can go into full production

Planned, Scheduled and Cancelled Interventions

Entries in/planned to go to GOCDB

Description | Start | End | Type | Affected VO(s)
Update ATLAS to 2.1.9-6 | 06/12/2010 08:00 | 08/12/2010 18:00 | Downtime | ATLAS

Advanced Planning

  • Deploy new puppetmaster
  • Upgrade ATLAS, CMS and Gen disk servers to a 64-bit OS
  • CASTOR upgrade to 2.1.9-10 and SRM upgrade to 2.10, to fix the "unavailable" status reported to FTS when disk servers are draining
  • CASTOR upgrade to 2.1.9-10, which incorporates the gridftp-internal fix to support multiple service classes, enabling checksums for Gen
  • CASTOR for Facilities instance in production by end of 2010

Staffing

  • Castor on Call person: Matthew
  • Staff absence/out of the office:
    • Matthew on A/L Friday PM