RAL Tier1 weekly operations castor 01/03/2010
From GridPP Wiki
Summary of Previous Week
- Matthew:
- CASTOR database - Disaster recovery coordination
- Forming our 2.1.8/2.1.9 strategy
- Coordinating interventions
- Depmon duties - deploying 100 TB into atlasSimStrip
- First look at CIP code
- Shaun:
- Identifying and correcting problems with new disk server deployment
- Completed investigation of ATLAS SAM timeouts
- Prototyping of monitoring updates.
- Chris:
- Continuing to test the number of job slots on a per-protocol basis
- Doing some work on Quattor Tape Server
- Started preparing test infrastructure for CASTOR upgrades
- Finished testing fixes for ATLAS
- Cheney:
- Build of Vulcan database cluster for preprod
- Fixed backups, which could not write to their index file (cause unknown).
- Tim:
- ..
- Richard:
- Converted the CERN CASTOR stress tests to Perl, both to get around limits on the number of concurrent threads and to make it easier to bolt on instrumentation for benchmarking
- Brian:
- Disk Deployment assignment
- Comparing CASTOR stager_qry/bdii/dq2 accounting values
- Disabled Tape investigation
- Jens:
- Support for experiments interpreting CIP information, SRM related support
Developments for this week
- Matthew:
- 2.1.8/2.1.9 strategy
- Database - DR and new hardware plans
- Hardware spend plans
- Install lcg_utils on castoradm3 for stress testing
- Depmon (and backup CASTOR on Duty) duties
- Write presentation for T1 Away Day
- Shaun:
- More monitoring prototyping
- SRM work
- Chris:
- Castor on Duty
- Implement ATLAS fix: "Reduce Atlas LSF clean period to 14400 (sec)"
- Continue testing the number of job slots on a per-protocol basis. Waiting for LHCb to test rootd
- Do some work with polymorphic machines
- Concentrate on Quattor Tape Server
- Cheney:
- Handover new Vulcan database cluster
- Tim:
- Sort out new hardware
- Start installing new tape servers?
- More work on RAC resilience planning
- Jens:
- Work on CIP 2.2.0 release
Operations Issues
- Some disk servers lost their routing table: the switch needs to have its cache refreshed
- Missing RPMs on new disk servers - kickstart repository was incomplete
- xinetd not working on newly deployed disk servers. Needed restarting.
Blocking issues
- Still don't have an IP address allocated for one node of the new Vulcan database cluster.
Planned, Scheduled and Cancelled Interventions
Entries in/planned to go to GOCDB
Description | Start | End | Type | Affected VO(s) |
---|---|---|---|---|
ORACLE security patch | 02/03/2010 10:00 | 02/03/2010 11:00 | At-risk | All |
Change to LSF configuration | 02/03/2010 10:00 | 02/03/2010 11:00 | At-risk | ATLAS |
Advanced Planning
- Gen upgrade to 2.1.8 (2010 Q1)
- Install/enable gridftp-internal on Gen (This year/before 2.1.8 upgrade)
Staffing
- Castor on Call person: Chris
- Matt on paternity leave for 2 weeks from approx 8 March
- Staff absences: Chris (Friday)