RAL Tier1 weekly operations castor 08/03/2010
From GridPP Wiki
Summary of Previous Week
- Matthew:
- 2.1.8/2.1.9 strategy and presented new features to Liaison meeting
- Planning for CASTOR session at GridPP24
- Database DR
- Kicked off plans for moving forward to new production database hardware
- Installed lcg_utils on castoradm3 for stress testing
- Depmon (and backup CASTOR on Duty) duties
- Wrote presentation for T1 Away Day
- Shaun:
- Assisted with disk server deployment problems
- Fixed t2k tape recall problem
- Implemented tweak to address CMS job problems
- CODD (Friday)
- Chris:
- Continued testing the number of job slots on a per-protocol basis
- Doing some work on Quattor Tape and Disk Server
- Started preparing test infrastructure for castor upgrades
- Implemented fix for Atlas for LFS events
- Castor on Duty person
- Friday off
- Cheney:
- ..
- Tim:
- Hardware installs
- CS1818 problem investigation
- Pre-prod VDQM (big-id) problems.
- T10KB drives on Pre-prod
- Richard:
- Worked on new version of pre-prod benchmarking tool
- Brian:
- ..
- Jens:
- Expounding on the Correct Interpretation(tm) of information
Developments for this week
- Matthew:
- ..
- Shaun:
- COD
- Castor Monitoring prototyping
- Testing distribution of new tnsnames file
- Chris:
- Continue testing the number of job slots on a per-protocol basis. Waiting for LHCb to test rootd
- Do some work with polymorphic machines
- Prepare cold stand-by central server
- Do some work on Quattor Tape Server
- Preparing test infrastructure for castor upgrades
- Cheney:
- ..
- Tim:
- T10KB drive testing on Pre-prod
- Getting new tape servers into operation
- Richard:
- Complete new version of pre-prod benchmarking tool and create a Wiki page to document it
- Brian:
- ..
- Jens:
- Getting preprod and/or cert cipped. Pick up CIP 2.2.0 again.
Operations Issues
- Large number of jobs failing because access to a small number of hot files was saturated. A new service class with replica=30 was added, using the same disk pool as cmsFarmRead, to deal with this.
- 1 faulty Atlas tape identified (cs1818)
- Problems with missing RPMs on redeployed disk servers after they went into production. A final disk server sign-off by CASTOR team members has been introduced for deploying new disk servers to production.
- Another BigID occurrence, this time on Preprod VDQM (first time on this schema)
Blocking issues
- Still awaiting preprod database
Planned, Scheduled and Cancelled Interventions
Entries in/planned to go to GOCDB
none
Advanced Planning
- Gen upgrade to 2.1.8 2010Q1
- Install/enable gridftp-internal on Gen (before 2.1.8 upgrade)
Staffing
- Castor on Call person: Shaun
- Matt on paternity leave for 2 weeks from approx 8 March
- Staff absences: