RAL Tier1 weekly operations castor 21/02/2011

From GridPP Wiki

Latest revision as of 13:10, 28 February 2011

Operations News

  • Last disk servers (Gen) quattorized and upgraded to SL5 64bit
  • WAN tuning rolled out to all remaining CMS disk servers
  • srm0662 (ATLAS) repartitioned to give more space to logs. Two more to go.
  • atlasSimStrip was successfully merged into atlasStripInput

Operations Issues

  • Around 10k ATLAS FTS transfers failed on Monday after the switch to a new robot certificate. The certificate was not pushed out correctly in the grid-mapfiles because of a misconfiguration made when upgrading to the new puppetmaster02.
  • After the disk pool merging, ATLAS continued using SIMSTRIP and failed to modify their pilot jobs to use DATADISK.
  • On 17/2 the xrootd redirector crashed, causing functional tests to fail. The crash was noticed quickly and the redirector was restarted after 2 hours. An automatic restarter has been written and installed that will kick in if the crash happens again.
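
The automatic restarter mentioned above is not reproduced on this page. As a rough illustration of the idea only, a minimal watchdog sketch in Python might look like the following. The function names, the check and restart commands, and the polling interval are all assumptions made for illustration and are not taken from the script actually installed.

```python
import subprocess
import time


def is_running(check_cmd):
    """Return True if the check command exits with status 0."""
    result = subprocess.run(
        check_cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    return result.returncode == 0


def ensure_running(check_cmd, restart_cmd):
    """Run the restart command if the check fails.

    Returns True if a restart was attempted, False if the service was up.
    """
    if is_running(check_cmd):
        return False
    subprocess.run(restart_cmd, check=False)
    return True


def watch(check_cmd, restart_cmd, interval=60, max_checks=None):
    """Poll the service every `interval` seconds and restart it when down.

    `max_checks=None` loops forever; a number limits the loop (useful for
    testing). Returns the number of restarts attempted.
    """
    restarts = 0
    checks = 0
    while max_checks is None or checks < max_checks:
        if ensure_running(check_cmd, restart_cmd):
            restarts += 1
        checks += 1
        time.sleep(interval)
    return restarts


# Hypothetical usage (placeholder commands, not the real ones):
#   watch(["pgrep", "-f", "xrootd"], ["service", "xrootd", "restart"])
```

Passing the check and restart commands in as arguments keeps the sketch independent of how the redirector is actually started on the host.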

Blocking Issues

  • Lack of production-class hardware running ORACLE 10g needs to be resolved before CASTOR for Facilities goes into full production. The hardware has been ordered: the servers arrive this week and the RAID device in mid-March.

Planned, Scheduled and Cancelled Interventions

Entries in/planned to go to GOCDB

Description | Start | End | Type | Affected VO(s)
Roll out WAN tuning changes to all remaining disk servers | 22/02/2011 09:00 | 22/02/2011 16:00 | At-Risk | ATLAS, LHCb, Gen
Upgrade NS to 2.1.10 (STC) | mid March | mid March | Downtime | ALL

Advanced Planning

  • CASTOR certification and upgrade to 2.1.10 and upgrade of SRM to 2.10 which incorporates:
    • fix for gridftp-internal to support multiple service classes, enabling checksums for Gen
    • fix to report files on draining disk servers accessed by FTS to be NEARLINE not UNAVAILABLE
  • Move Tier1 instances to new Database infrastructure with a Dataguard backup instance in R26
  • Move Facilities instance to new Database hardware running 10g
  • Start migrating from T10KA to T10KC media later this year

Staffing

  • Castor on Call person: Chris
  • Staff absence/out of the office:
    • Shaun, Richard (all week)