RAL Tier1 weekly operations castor 21/05/2012

From GridPP Wiki

Operations News

  • Successful switch from rtcpclientd to tapegateway for remaining Tier1 instances: LHCb, CMS and ATLAS
  • Switched SLS monitoring from castoradm1 to new virtualized admin machine: lcgccvm02

Operations Problems

  • (Mon) 12/13 new disk servers deployed into atlasScratchDisk had problems reading/writing due to an error during deployment and were removed from production. They were fixed and put back into production on Thursday.
  • (Tue) Another failure to update CRLs caused a service interruption between 16:00 and 21:00, a repetition of the incident last Friday. In this case we failed to pick up a new CERN CA CRL because it was issued only 10 hours before the expiry of the previous one. We suspect the previous incident had a similar cause. See RT#98979, incident: FP#246.
  • Two incidents of data loss were recorded for gdss374: first 34 files on Monday, then a further 90 files on Friday. This was due to incorrectly putting the disk server into read-only mode on Monday, followed by further files being written to it. Read-only mode will no longer be supported after the switch to the transfer manager at the end of this month.
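
The timing behind the CRL incident above can be sketched roughly as follows. This is only an illustration, not the mechanism used at the Tier1: the function names, the 6-hour and 24-hour refresh intervals and the example dates are all hypothetical; only the 10-hour overlap comes from the report. The point is that a periodic fetch is only certain to pick up a new CRL if it runs at least once per overlap window.

```python
from datetime import datetime, timedelta


def refresh_guaranteed(overlap: timedelta, refresh_interval: timedelta) -> bool:
    """Return True if a fetch repeated every `refresh_interval` is certain to
    run at least once while the new CRL is already published and the old one
    is still valid (a window of length `overlap`). Assumes every fetch succeeds."""
    return refresh_interval <= overlap


def crl_needs_attention(next_update: datetime, now: datetime,
                        margin: timedelta) -> bool:
    """Return True if the installed CRL expires within `margin` of `now`."""
    return next_update - now <= margin


# 10-hour overlap as in the incident; the refresh intervals are hypothetical.
overlap = timedelta(hours=10)
print(refresh_guaranteed(overlap, timedelta(hours=6)))   # frequent refresh
print(refresh_guaranteed(overlap, timedelta(hours=24)))  # daily refresh
```

Under these assumptions, any refresh interval longer than the overlap can miss the new CRL entirely, which is why an additional expiry check with a warning margin (the second function) may be worth considering.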

Blocking Issues

  • Need to relocate Repack stager to a database running 11g before being able to upgrade it to 2.1.12.

Planned, Scheduled and Cancelled Interventions

Entries in/planned to go to GOCDB

Description | Start | End | Type | Affected VO(s) | Lead by
Switch from LSF to Transfer Manager | 28/05/12 10:00 | 28/05/12 11:00 | Downtime | LHCb | Matthew
Switch from LSF to Transfer Manager | 29/05/12 10:00 | 29/05/12 11:00 | Downtime | Gen | Matthew
Switch from LSF to Transfer Manager | 30/05/12 10:00 | 30/05/12 11:00 | Downtime | CMS | Matthew
Switch from LSF to Transfer Manager | 07/06/12 10:00 | 07/06/12 11:00 | Downtime | ATLAS | Matthew
CIP 2.2.0 upgrade (STC) | TBD | TBD | At-risk | All | Matthew

Advanced Planning

Tasks

  • Test and re-apply CIP upgrade (Jens, Matthew)
  • Test and certify 2.1.12-4 and 2.1.11-9 (Matthew, Chris)
  • Stress testing of Transfer Manager (TM) (Shaun, All) DONE
  • Ganglia monitoring for TM (Rob, Chris) DONE
  • Re-instantiate certification on HyperV VMs using Quattor+Puppet (Rob)
  • Stress testing of CV11 generation disk servers on preprod (Rob, Matthew) DONE
  • Selection of disk-only prototype solution (Shaun, Rob, Brian, James)

Interventions

  • Upgrade repack to 2.1.12-4 (May)
  • Upgrade Castor Facilities and Tier1 instances to 2.1.11-9 (Jun)
  • Upgrade Oracle to 11g (Jun)
  • Upgrade to 2.1.12 on Tier1 instances once we are happy with the performance of TM and TG (Jul)

Staffing

  • Castor on Call person: Chris
  • Staff absence/out of the office:
    • (Mon) Matthew A/L