RAL Tier1 weekly operations castor 08/02/2010
From GridPP Wiki
Revision as of 14:55, 15 February 2010 by Matt viljoen (Talk | contribs)
Summary of Previous Week
- Matthew:
- High level CASTOR strategy for 2010 (inc. 2.1.9)
- Coordinating team debugging multipath EMC problems
- Shaun:
- Configuration Analysis of Production Systems
- Castor On Duty things
- Fixing CMS recall problem
- Fixing missing passwd entries for stage:st and lsfadmin accounts
- Bringing up and testing of CASTOR instances
- Chris:
- Configuring repack instance
- Working on PreProd instance
- Preparing test disk server for new Alice peer/manager
- Preparing preprod instance to test max number of job slots
- Writing puppet manifests for preproduction disk servers
- Cheney:
- ..
- Tim:
- RAC upgrade and getting it working again...
- Hardware purchasing
- Getting repack working after install
- Richard:
- Completed setting up current set of pre-prod disk servers
- Brian:
- ATLAS D1T0 draining and disk removal.
- Jens:
- CIP upgrade finally completed, along with the related pre- and post-upgrade coordination and testing.
Developments for this week
- Matthew:
- Meetings at CERN and ATLAS Jamboree
- Coordinating team debugging multipath EMC problems
- 2.1.9 fact finding at CERN
- Shaun:
- More configuration analysis
- Looking at ways of improving resilience of current system
- LHCb disk-2-disk copy problems
- SRM development if time permits
- Chris:
- Castor On Duty (M-F)
- Test max number of job slots on a per-protocol basis
- Looking at why vdqm/vmgr are not working on preprod
- Get back to polymorphic configuration
- Install SRM machine(s) for preprod
- Cheney:
- ..
- Tim:
- More hardware purchasing
- Looking at why RAC is not working
- Getting RAC stability back to what it should be
- Richard:
- Run stress tests on pre-prod instance
- Brian:
- CASTOR Draining
- Educating AD on Draining.
- Jens:
- Ideally, some CIP development.
Operations Issues
- Problem with CMS recalls - now fixed
- c08 continues to be unstable. Plan for its removal from production.
- Approx. 8 corrupt files discovered on gdss66 (cmsFarmRead) and reported to CMS. None were critical.
- Entries in /etc/passwd disappeared on gdss67 and gdss110. Accidental redeployment?
Blocking issues
- Lack of Quattor configuration files for SLC4.8 is stopping us from evaluating Quattor alongside CASTOR 2.1.8. Preprod setup will initially proceed with a Kickstart-based deployment.
Planned, Scheduled and Cancelled Interventions
Entries in/planned to go to GOCDB
Description | Start | End | Type | Affected VO(s) |
---|---|---|---|---|
Upgrade of memory on database nodes | (Ongoing) | (Ongoing) | At Risk | All instances |
Fix EMC multipath issues | (Ongoing) | (Ongoing) | At Risk | All instances |
Advanced Planning
- Gen upgrade to 2.1.8 in 2010 Q1
- Install/enable gridftp-internal on Gen (This year/before 2.1.8 upgrade)
Staffing
- Castor on Call person: Chris
- Staff absences: Matthew at CERN (Tuesday, Wednesday)