RAL Tier1 weekly operations castor 04/06/2012

From GridPP Wiki

Operations News

  • Switched from LSF to Transfer Manager (TM) for Gen and CMS
  • Repack stager database moved to a host running 11g, so we can now upgrade it to 2.1.12
  • DLF database was re-initialized this week to try to improve its performance

Operations Problems

  • (Wed) Inaccessible files reported by ALICE were traced to timeouts within the XRD manager. The timeout threshold was raised from 30s to 60s, which improved things.
  • (Thu) CMS migrations stopped due to interference between mighunters. The problem was fixed on Friday and the migration queue then drained successfully.
  • (Fri) LHCb SRMs became unresponsive overnight, possibly due to a memory leak in the logprocessors, which repeatedly tried to contact the database while it was unavailable during its upgrade.

Blocking Issues

None

Planned, Scheduled and Cancelled Interventions

Entries in/planned to go to GOCDB

Description                          Start           End             Type        Affected VO(s)  Lead by
CIP 2.2.0 upgrade (STC)              06/06/12 09:00  06/06/12 10:00  (Internal)  All             Matthew
Switch from LSF to Transfer Manager  07/06/12 09:00  07/06/12 11:00  Downtime    ATLAS           Matthew
2.1.11-9 upgrade (STC)               13/06/12 09:00  13/06/12 14:00  Downtime    All             Matthew
ORACLE 11g upgrade (STC)             27/06/12 09:00  27/06/12 17:00  Downtime    All             Rich

Advanced Planning

Tasks

  • Test and re-apply CIP upgrade (Jens, Matthew)
  • Test and certify 2.1.12-4 and 2.1.11-9 (Matthew, Chris)
  • Stress testing of Transfer Manager (TM) (Shaun, All) DONE
  • Ganglia monitoring for TM (Rob, Chris) DONE
  • Re-instantiate certification on HyperV VMs using Quattor+Puppet (Rob)
  • Stress testing of CV11 generation disk servers on preprod (Rob, Matthew) DONE
  • Selection of disk-only prototype solution (Shaun, Rob, Brian, James)

Interventions

  • Upgrade repack to 2.1.12-4 (Jun)
  • Upgrade Castor Facilities and Tier1 instances to 2.1.11-9 (Jun)
  • Upgrade to 2.1.12 on Tier1 instances once we are happy with the performance of TM and TG (Jul)

Staffing

  • Castor on-call person: Shaun
  • Staff absence/out of the office:
    • (Mon/Tue) Public holiday