RAL Tier1 weekly operations castor 24/01/2011

From GridPP Wiki

Operations News

  • Upgraded (and Quattorized) all remaining ATLAS disk servers to SL5 64 bit
  • Enabled checksums for ATLAS
  • Discussing merging of ATLAS disk pools at CERN
  • Tested TCP tuning on PreProd disk servers (Richard & Brian)
  • Migration CastorMon plot has been fixed for Gen
  • Logging to DLF from CMS now uses UDP, to see whether this fixes the hanging disk server problem

Operations Issues

  • PreProd SRM is failing; the cause is still under investigation
  • A few hundred files on the Gen instance are in CanBeMigr status although the copy is already on tape. Their status needs to be changed to Staged in GenStagerDB
  • PreProd DLF DB has had problems with partitions, fixed by Rich/Shaun
  • gdss498 was put into read-only mode in CASTOR due to checksum errors; the cause was a misconfiguration in Puppet. It is now back in read/write mode.
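
A checksum error such as the one on gdss498 can be confirmed independently by recomputing the checksum of the file on disk and comparing it with the stored value. The following is a minimal sketch, assuming the checksum type is Adler-32 (the notes above do not state the type); the function names `adler32_of_file` and `matches_expected` are hypothetical helpers and not part of CASTOR.

```python
import zlib


def adler32_of_file(path, chunk_size=1024 * 1024):
    """Return the Adler-32 checksum of a file as an 8-digit lowercase hex string.

    Assumption: Adler-32 is the checksum type in use; this is not stated
    in the operations notes.
    """
    value = 1  # Adler-32 is defined to start from 1
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            # Feed the running value back in so large files are read in pieces
            value = zlib.adler32(chunk, value)
    return format(value & 0xFFFFFFFF, "08x")


def matches_expected(path, expected_hex):
    """Compare the recomputed checksum with an expected hex string."""
    return adler32_of_file(path) == expected_hex.lower().zfill(8)
```

Reading in chunks keeps memory use constant for large files. A file whose transfer was cut short will normally produce a different value from the complete file, so a mismatch alone does not distinguish corruption from truncation; comparing file sizes first helps separate the two cases.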

Blocking issues

  • Lack of production-class hardware running ORACLE 10g needs to be resolved prior to CASTOR for Facilities going into full production. The hardware is now being ordered.

Planned, Scheduled and Cancelled Interventions

Entries in/planned to go to GOCDB

Description | Start | End | Type | Affected VO(s) | Lead by
Upgrade CASTOR ORACLE to 10.2.0.5 | 31/01/2011 08:00 | 31/01/2011 18:00 | Downtime | All | MV
Update CMS disk servers to SL5 64bit | 31/01/2011 08:00 | 01/02/2011 17:00 | Downtime | CMS | MV
Point Quattorized disk servers to new puppetmaster | 03/02/2011 09:00 | 03/02/2011 16:00 | At-Risk | All (except for Gen) | MV
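
The length of each intervention window listed above can be checked with a few lines of Python. This is only an illustrative sketch: the `INTERVENTIONS` list and the `duration_hours` helper are hypothetical names, and the dates are copied from the entries above in day/month/year order.

```python
from datetime import datetime

DATE_FORMAT = "%d/%m/%Y %H:%M"  # day/month/year, as used in the entries above

# (description, start, end) copied from the planned interventions
INTERVENTIONS = [
    ("Upgrade CASTOR ORACLE to 10.2.0.5", "31/01/2011 08:00", "31/01/2011 18:00"),
    ("Update CMS disk servers to SL5 64bit", "31/01/2011 08:00", "01/02/2011 17:00"),
    ("Point Quattorized disk servers to new puppetmaster", "03/02/2011 09:00", "03/02/2011 16:00"),
]


def duration_hours(start, end):
    """Return the length of a window in hours, given start and end strings."""
    delta = datetime.strptime(end, DATE_FORMAT) - datetime.strptime(start, DATE_FORMAT)
    return delta.total_seconds() / 3600


if __name__ == "__main__":
    for description, start, end in INTERVENTIONS:
        print(f"{description}: {duration_hours(start, end):.0f} h")
```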

Advanced Planning

  • Upgrade Gen disk servers to SL5 64bit and Quattorize the remaining non-Quattorized disk servers
  • CASTOR certification and upgrade to 2.1.9-10 which incorporates:
    • fix for bad checksums being generated for incompletely transferred files
    • fix for gridftp-internal to support multiple service classes, enabling checksums for Gen
  • CASTOR certification and upgrade to 2.1.10 and upgrade of SRM to 2.10 which incorporates:
    • fix so that files on draining disk servers accessed by FTS are reported as NEARLINE rather than UNAVAILABLE
  • Upgrade the NS to 2.1.9

Staffing

  • CASTOR on-call person: Chris, Shaun (working hours: Thu, Fri)
  • Staff absence/out of the office:
    • Thu-Fri: Matt, Chris and Richard at Project Management Course, TCH
    • Thu-Fri: Shaun at DL