RAL Tier1 weekly operations castor 22/02/2010
From GridPP Wiki
Revision as of 16:50, 22 February 2010 by Matt viljoen (Talk | contribs)
Summary of Previous Week
- Matthew:
  - Production resiliency investigations
  - CoD work
  - Facilities evaluation support
  - CASTOR database - Disaster recovery coordination
- Shaun:
  - ..
- Chris:
  - Testing the maximum number of job slots on a per-protocol basis (still ongoing)
  - Working on the LHCb disk-to-disk (d2d) copy problem
  - Investigating the ATLAS problem with jobs being dispatched to full disk servers, and LSF mbatchd memory usage. Testing fixes that might improve the situation
  - Investigating CMS slowness in the scheduler
- Cheney:
  - ..
- Tim:
  - Investigating the CMS migration backlog
  - Working on new hardware
- Richard:
  - ..
- Brian:
  - Investigating LSF jobs pending on full disk servers in ATLAS
  - Disk deployment strategy and prioritisation of outstanding DD tickets
- Jens:
  - ..
Developments for this week
- Matthew:
  - CASTOR database - Disaster recovery coordination
  - Forming our 2.1.8/2.1.9 strategy
  - Coordinating interventions
- Shaun:
  - SRM development
  - Investigating LHCb disk-to-disk copy problems
  - CoD work
- Chris:
  - Continue testing the number of job slots on a per-protocol basis
  - Do some work with polymorphic machines
  - Concentrate on Quattor Tape Server
  - Start preparing test infrastructure for CASTOR upgrades
  - Finish testing fixes for ATLAS
- Cheney:
  - ..
- Tim:
  - None, I'm on leave for most of it :-)
- Richard:
  - ..
- Brian:
  - Testing a solution for LSF jobs pending on full disk servers in ATLAS
- Jens:
  - ..
Operations Issues
- LSF slowdowns affecting ATLAS (Sunday, Thursday)
- An unscheduled reboot of a Pluto node caused a 10-minute hang (Tuesday)
- Jobs writing into CMSWanIn are taking a disproportionately long time. This seems to coincide with high write activity into CMSWanOut
Blocking issues
- None
Planned, Scheduled and Cancelled Interventions
Entries in/planned to go to GOCDB
Description | Start | End | Type | Affected VO(s) |
---|---|---|---|---|
NFS reconfiguration on database | 23/02/2010 10:00 | 23/02/2010 16:00 | At-risk | All |
Clusterware reconfiguration on database | 24/02/2010 10:00 | 24/02/2010 11:00 | Downtime | All |
Advanced Planning
- Gen upgrade to 2.1.8 in 2010 Q1
- Install/enable gridftp-internal on Gen (this year/before the 2.1.8 upgrade)
Staffing
- CASTOR on-call person: Shaun
- Staff absences: Cheney (Mon), Tim (Tue-Fri)