RAL Tier1 weekly operations castor 08/11/2010

From GridPP Wiki
Revision as of 16:24, 8 November 2010 by Matt viljoen (Talk | contribs)

Work previous week

  • Matthew:
    • ATLAS permissions testing
    • Change control for updating ATLAS permissions
    • Change control for upgrading ATLAS SRMs
    • Change control for upgrading disk servers
    • CoD work
  • Shaun:
    • ..
  • Chris:
    • Castor Facilities work
    • Testing 64-bit disk servers in preProd
    • Doing work for Gen Repack
  • Richard:
    • Running stress tests on pre-prod and facilities instances of CASTOR
  • Brian:
    • ..
  • Jens:
    • ..

Operations Issues

  • On 1/11/10 the ATLAS SRMs crashed repeatedly because a new, unsupported command (statusOfBringOnlineRequest) was being passed to them. The SRMs were upgraded from 2.8-2 to 2.8-6 on 2/11/10 and the problem has not recurred.
  • A problem was reported during the night of 2-3 Nov with CE SAM tests timing out when trying to use the CMS CASTOR instance. This appears to be a recurrence of an earlier problem in which CASTOR is very busy doing disk-to-disk copies. CMS have further limited the rate at which PhEDEx stages files.

Blocking issues

  • The lack of production-class hardware running Oracle 10g needs to be resolved before CASTOR for Facilities goes into production

Planned, Scheduled and Cancelled Interventions

Entries in/planned to go to GOCDB

Description | Start | End | Type | Affected VO(s)
Upgrade disk servers to Quattorized SL5 64-bit and replace SRM hardware | 10/11/2010 08:00 | 10/11/2010 17:00 | Downtime | LHCb
Upgrade SRM hardware and add a new SRM (STC) | 11/11/2010 11:00 | 11/11/2010 13:00 | At Risk | LHCb
Update CMS to 2.1.9-6 | 16/11/2010 08:00 | 18/11/2010 18:00 | Downtime | CMS
Update ATLAS to 2.1.9-6 (STC) | 06/12/2010 08:00 | 08/12/2010 18:00 | Downtime | ATLAS

Advanced Planning

  • Upgrade all disk servers to a 64-bit OS
  • CASTOR upgrade to 2.1.9-10 and SRM upgrade to 2.10 to fix the unavailable status being reported to FTS for draining disk servers
  • CASTOR upgrade to the latest 2.1.9 which incorporates the fix for grid-ftp-internal to support multiple service classes, enabling checksums for Gen
  • CASTOR for Facilities instance in production by end of 2010

Staffing

  • CASTOR on-call person: Chris
  • Staff absence/out of the office:
    • ..