RAL Tier1 weekly operations castor 4/04/2010
From GridPP Wiki
Matt viljoen (Talk | contribs) |
Latest revision as of 12:16, 7 April 2010
Summary of Previous Week
- Matthew:
- Tier1 Open Day
- Determine 2.1.8/2.1.9 upgrade plan for test upgrades of production db snapshots
- CASTOR DB Disaster Recovery and Way Forward plans
- CoD/Depmon work
- Published list of 'approved exceptions' - changes that don't require formal change control
- Running functional tests against 2.1.7 certification
- Shaun:
- ..
- Chris:
- Partially tested cold stand-by central castor servers
- Disk server deployment duties
- Castor 2.1.8/2.1.9 upgrade work
- Doing work related to Tier1 Security Group project
- Richard:
- Created the necessary setups for the mix of commands agreed for the pre-prod stress testing
- Ran the above commands (using a limited number of threads and repeats) to "test the tests"
- Brian:
- ..
- Jens:
- Continued CIP 2.2 dev. Discussions with CERN about roadmap for CIP/CASTOR interface.
- Added to test plan for CASTOR upgrade, testing with the grid.
Developments for this week
- Matthew:
- Preparing for GridPP24 CASTOR session
- Install puppetmaster on new hardware
- Depmon work
- Shaun:
- Meet ORACLE consultant
- Chris:
- Test SL5 (64bit) disk server with xfs
- Test cold stand-by central castor servers and then write documentation
- Disk server deployment duties
- Test Quattor disk server procedure and build castor disk server
- Castor 2.1.8/2.1.9 upgrade work
- Doing work related to Tier1 Security Group project
- Richard:
- Will kick off tests to run during the weekend
- Brian:
- ..
- Jens:
- See if I can finally get enough femtoseconds to install CIP on preprod or cert.
Operations Issues
- Puppet master became unstable after latest disk server deployment. We need to move it to new hardware.
- Reboot of Neptune node: castor151. Suspicion is that it is related to the NFS mounts for backup
- A few CMS job failures due to incorrect free space being reported by SRM; the errors appeared in the FTS log
- Networking issues over the weekend (4/4/10) led to a site outage, including CASTOR
Blocking issues
None
Planned, Scheduled and Cancelled Interventions
Entries in/planned to go to GOCDB
None
Advanced Planning
- Upgrade to 2.1.8/2.1.9 2010
- Upgrade to SRM 2.8-6 after testing is complete
- ATLAS want to know how much capacity is available in disabled servers (published as Capability). Low priority CIP change to do this.
- CASTOR Instance for Non LHC 2010Q2
- Install/enable gridftp-internal on Gen (Before 2.1.8 upgrade)
Staffing
- Castor on Call person: Chris
- Staff absences:
- Matthew (Wed, Thu morning)
- Richard (Tue/Wed)