RAL Tier1 weekly operations castor 30/04/2012
From GridPP Wiki
Revision as of 10:41, 3 May 2012 by Matt viljoen
Operations News
- Repack configuration files now completely puppetized.
Operations Problems
- A gSOAP error has been found that sometimes prevents a user's VO being validated during a running job. This results in an invalid error message being passed to the FTS.
- An xrootd misconfiguration affecting ATLAS jobs was fixed on Thursday.
- Draining too many disk servers concurrently caused ATLAS and LHCb jobs to fail. From now on, we will drain only one disk server at a time on busy instances.
Blocking Issues
- none
Planned, Scheduled and Cancelled Interventions
Entries in, or planned for, GOCDB
Description | Start | End | Type | Affected VO(s) | Led by |
---|---|---|---|---|---|
CIP 2.2.0 upgrade (STC) | TBD | TBD | At-risk | All | Matthew |
Advanced Planning
Tasks
- Test and re-apply CIP upgrade (Jens, Matthew)
- Test and certify 2.1.12-4 and 2.1.11-9 (Matthew, Chris)
- Stress testing of Transfer Manager (TM) (Shaun, All) DONE
- Ganglia monitoring for TM (Rob, Chris) IN PROGRESS
- Re-instantiate certification on HyperV VMs using Quattor+Puppet (Rob)
- Stress testing of CV11 generation disk servers on preprod (Rob, Matthew) DONE
- Selection of disk-only prototype solution (Shaun, Rob, Brian, James)
- Switch to Tape Gateway on repack and test (Tim, Matthew) DONE
Interventions
- Upgrade repack to 2.1.12-4 (Apr)
- Switch from LSF to TM after 2.1.11-8 upgrade. Will need to better stress-test TM on preprod with more disk servers. (Apr)
- Switch to Tape Gateway (TG) once it has been tested on repack (May)
- Upgrade Castor Facilities and Tier1 instances to 2.1.11-9 (Jun)
- Upgrade Oracle to 11g (Jun)
- Upgrade to 2.1.12 on Tier1 instances once we are happy with the performance of TM and TG (Jul)
Staffing
- Castor on-call person: Shaun
- Staff absence/out of the office:
  - Matthew TOIL (Mon)