RAL Tier1 weekly operations castor 01/11/2010

From GridPP Wiki
Revision as of 14:43, 1 November 2010 by Matt viljoen (Talk | contribs)


Work previous week

  • Matthew:
    • GEN Upgrade and Testing
    • LHCb problems
    • Castor for Facilities planning
  • Shaun:
    • ..
  • Chris:
    • GEN Upgrade and Testing
    • Castor Facilities work
    • Working on LSF problem in ASGS
  • Richard:
    • Finishing the tape section of the 2.1.9 functional tests on the Facilities instance
    • Developing a script to run stress tests by running grid jobs
  • Brian:
    • ..
  • Jens:
    • ..

Operations Issues

  • On 26/10/10 LHCb reported that their SRMs were returning malformed TURLs, affecting approx. 4% of transfers. The cause of this bug is not yet known, but the last occurrence was on the morning of 30/10/10.
  • On 27/10/10 LHCb reported they were having recall problems for a file. CERN informed us that this is due to a known bug.
  • On 30/10/10, while LHCb was being rebooted in an attempt to fix the malformed TURL problem above, one SRM daemon failed to reconnect to the database because of a known bug. This created a backlog of requests and poor SRM performance. LHCb was put into downtime until 13:00, by which time the backlog had cleared on its own. On 31/10/10, LHCb was under very high load, which again created a backlog. LHCb was again put into downtime until Monday, when investigations indicated that the problem was load related, and the SRMs were reconfigured to improve performance under high load. The SRMs are now being replaced with new hardware (plus one extra SRM) to improve performance further.
  • On 1/11/10 the ATLAS SRMs crashed repeatedly because a new, unsupported command (statusOfBringOnlineRequest) was being passed to them.

Blocking issues

  • Lack of production-class hardware running ORACLE 10g needs to be resolved prior to CASTOR for Facilities going into production

Planned, Scheduled and Cancelled Interventions

Entries in/planned to go to GOCDB

Description                    Start             End               Type      Affected VO(s)
Update CMS to 2.1.9-6 (STC)    08/11/2010 08:00  10/11/2010 18:00  Downtime  CMS
Update ATLAS to 2.1.9-6 (STC)  22/11/2010 08:00  24/11/2010 18:00  Downtime  ATLAS

Advanced Planning

  • New SRM machines for LHCb
  • Upgrade disk servers to a 64-bit OS
  • CASTOR upgrade to 2.1.9-10 and SRM upgrade to 2.10 to fix the "unavailable" status reported to FTS when disk servers are draining
  • CASTOR upgrade to the latest 2.1.9 release, which incorporates the fix for grid-ftp-internal to support multiple service classes, enabling checksums for GEN
  • CASTOR for Facilities instance in production by end of 2010


  • Castor on Call person: Matthew
  • Staff absence/out of the office:
    • ..