RAL Tier1 Weekly Operations CASTOR 08/02/2010

Summary of Previous Week

Matthew:
- High level CASTOR strategy for 2010 (inc. 2.1.9)
- Coordinating team debugging multipath EMC problems
Shaun:
- Configuration Analysis of Production Systems
- CASTOR On Duty tasks:
  - Fixing CMS recall problem
  - Fixing missing passwd entries for stage:st and lsfadmin accounts
  - Bringing up and testing CASTOR instances
Chris:
- Configuring repack instance
- Working on PreProd instance
- Preparing test disk server for new ALICE peer/manager
- Preparing preprod instance to test the maximum number of job slots
- Writing puppet manifests for preproduction disk servers
Cheney:
- ..
Tim:
- RAC upgrade and getting it working again
- Hardware purchasing
- Getting repack working after install
Richard:
- Completed setting up current set of pre-prod disk servers
Brian:
- ATLAS D1T0 draining and disk removal.
Jens:
- CIP upgrade finally completed, with the related pre- and post-upgrade coordination and testing.

Developments for This Week

Matthew:
- Meetings at CERN and ATLAS Jamboree
- Coordinating team debugging multipath EMC problems
- 2.1.9 fact finding at CERN
Shaun:
- More configuration analysis
- Looking at ways of improving resilience of current system
- LHCb disk-to-disk copy problems
- SRM development if time permits
Chris:
- Castor On Duty (M-F)
- Test the maximum number of job slots on a per-protocol basis
- Investigating why vdqm/vmgr are not working on preprod
- Get back to polymorphic configuration
- Install SRM machine(s) for preprod
Cheney:
- ..
Tim:
- More hardware purchasing
- Investigating why RAC is not working
- Getting RAC stability back to what it should be
Richard:
- Run stress tests on pre-prod instance
Brian:
- CASTOR Draining
- Educating AD on Draining.
Jens:
- Ideally, some CIP development.

Operations Issues

Problem with CMS recalls: now fixed.
c08 continues to be unstable. Planning its removal from production.
Approximately 8 corrupt files were discovered on gdss66 (cmsFarmRead) and reported to CMS. None were critical.
Entries in /etc/passwd disappeared on gdss67 and gdss110. Accidental redeployment?

The lack of Quattor configuration files for SLC4.8 is preventing us from evaluating Quattor alongside CASTOR 2.1.8. The preprod setup will initially proceed with a Kickstart-based deployment.

Entries in/planned to go to GOCDB

Description	Start	End	Type	Affected VO(s)
Upgrade of memory on database nodes	(Ongoing)	(Ongoing)	At Risk	All instances
Fix EMC multipath issues	(Ongoing)	(Ongoing)	At Risk	All instances