RAL Tier1 weekly operations castor 22/03/2010

Summary of Previous Week

Matthew:
- Mostly away on paternity leave
Shaun:
- Detailed planning of 2.1.8/2.1.9 upgrades
- Started testing SRM 2.8-6 on certification
- Worked on preparation for PreProd stress testing
- Planning for LHCb Jamboree
- Slide prep for open day
Chris:
- Testing maximum number of job slots for root protocol with Raja
- Built 4 cold stand-by central castor servers (still need the final configuration and basic tests)
- Deploying disk servers
- DepMon duties
- Castor on Call duties
- Doing work related to Tier1 Security Group project
Cheney:
- ..
Tim:
- Drained CS5116; the server may have issues, but there was no data loss
- T10K testing on PreProd
Richard:
- Further tweaks to stress-testing suite in preparation for benchmarking exercise
Brian:
- Disk server removal
Jens:
- ..

Plans for This Week

Matthew:
- 2.1.8/2.1.9 strategy
- Polymorphic servers' 'morphing' scripts
- Tier1 talk
Shaun:
- LHCb Jamboree
- Progress Strategic Objectives
- Testing SRM 2.8-6
- Progressing stress testing on PreProd
Chris:
- Finish configuring and testing cold stand-by central castor servers
- Continue doing work related to Tier1 Security Group project
- DepMon duties
- Castor on Call duties (Mon-Tue)
- Progress with Quattor tape server
- Stress tests on PreProd (part of castor upgrade plan)
- Prepare certification instance for 2.1.8 upgrade
- Preparation for 2.1.8/2.1.9 castor upgrade
Cheney:
- ..
Tim:
- More T10K testing prior to CMS migration
- Installation of new tape servers
Richard:
- Adding "random file size" feature to stress testing suite
- Running stress tests on PreProd instance
Jens:

An ATLAS user is causing SRM failures. This is under investigation.
ATLAS has reported a checksum problem that is causing some transfer failures from the CE to the SE.
CMS migrations continue to be troublesome. The production team will restart migHunters when problems are observed, but the Nagios alert needs to be revisited since it does not seem to be working.
Castor151 (part of the Neptune RAC) rebooted on Friday morning for no obvious reason. It was pointed out that backups now run from this node, having been moved off the previously flaky node, which has been rock solid since.
During disk server deployments we have again seen instances where not all the required RPMs were installed (roll on Quattor).

Entries in/planned to go to GOCDB

None