RAL Tier1 weekly operations castor 07/01/2011

From GridPP Wiki
Revision as of 15:39, 10 February 2011 by Matt viljoen

Operations News

  • ORACLE successfully upgraded to 10.2.0.5
  • CMS disk servers successfully upgraded to SL5 64-bit
  • Checksumming turned on for cmsWanIn on 2/2/11 and for everything else on 7/2/11
  • Fix for bad checksums (upgraded gridftp RPM) rolled out to lhcbMdst on 2/2/11 and to everything else (apart from Gen) on 7/2/11
  • New puppetmaster02 rolled out for all Quattorized disk servers on 3/2/11
  • Inactive job manager monitoring script rolled out to all primary job managers on 3/2/11
  • CASTOR 2.1.9-10 installed on Preprod; testing can now start

Operations Issues

  • Tape CS7541 was lost. 78 files were declared lost to LHCb. The remaining files were still on disk and were restaged.
  • The number of incompletely transferred LHCb files receiving wrong checksums increased until the fix was rolled out. The checksums have since been corrected and the migration queue has been reduced.
  • A small number of files (<10) were given wrong checksums where the checksum should contain '0000'. The fix rolled out for LHCb also helps with this bug.

Blocking Issues

  • The lack of production-class hardware running ORACLE 10g must be resolved before CASTOR for Facilities goes into full production. The hardware is now being ordered.

Planned, Scheduled and Cancelled Interventions

Entries in/planned to go to GOCDB

Description | Start | End | Type | Affected VO(s)
Upgrade gridftp RPM on remaining LHCb, ATLAS and CMS disk servers | 07/02/2011 10:00 | 07/02/2011 12:00 | At-Risk | ATLAS,CMS,LHCb
Roll out WAN tuning changes to cmsWanIn and cmsWanOut | 08/02/2011 09:00 | 08/02/2011 16:00 | At-Risk | CMS
Upgrade and quattorize Gen disk servers to SL5 64-bit | 15/02/2011 08:00 | 15/02/2011 16:00 | Downtime | Gen
Roll out WAN tuning changes to remaining CMS disk pools | 15/02/2011 10:00 | 15/02/2011 12:00 | At-Risk | CMS
Roll out WAN tuning changes to all remaining disk servers (STC) | 01/03/2011 09:00 | 01/03/2011 16:00 | At-Risk | ATLAS,LHCb,Gen

Advanced Planning

  • Upgrade Gen disk servers to SL5 64-bit and Quattorize the remaining non-Quattorized disk servers
  • CASTOR certification and upgrade to 2.1.10, and upgrade of SRM to 2.10, which incorporates:
    • fix for gridftp-internal to support multiple service classes, enabling checksums for Gen
    • fix to report files on draining disk servers accessed by FTS as NEARLINE rather than UNAVAILABLE
  • Upgrade the NS to 2.1.10

Staffing

  • CASTOR on-call person: Matthew
  • Staff absence/out of the office:
    • Chris (all week)
    • Richard (Mon,Tue,Thu)