RAL Tier1 weekly operations castor 26/04/2010
Summary of Previous Week
- Matthew:
  - Building and testing new puppetmaster server
  - Testing quattorized disk server deployment
  - Writing plans for Investigating Alternatives to CASTOR project
- Shaun:
  - Solved problem of ATLAS recalls (missing account on disk server)
  - Worked with James Jackson on implementing solution to mighunter contention for CMS
  - Solving and testing of ATLAS software problems (32 bit issue)
  - APR
  - Strategic Objectives
- Chris:
  - ..
- Richard:
  - Continuing with p/p stress testing
- Brian:
  - ..
- Jens:
  - ..
Developments for this week
- Matthew:
  - CASTOR Database - The Way Forward meeting
  - Testing quattorized disk server deployment
  - Writing plans for Investigating Alternatives to CASTOR project
  - Testing new puppetmaster server
- Shaun:
  - Make LSF pending jobs information available to CMS
  - Catch up with SRM developments
  - Keep trying to get SRM rate monitoring into castormon
  - APR (cont...)
- Chris:
  - ..
- Richard:
  - Continuing with p/p stress testing
- Brian:
  - ..
- Jens:
  - ..
Operations Issues
- The number of pending jobs in CMS increased substantially, resulting in a callout. We will make castormon data files available to provide feedback to PhEDEx and avoid overloading CASTOR in the future.
- gdss420 was found to be missing its /exportstage/castor* partitions during deployment into atlasNonProd
Blocking issues
- Preprod stress testing is taking longer than anticipated. We are cutting the number of tests from 10k to 5k and limiting file size tests to 100MB and 2GB only.
Planned, Scheduled and Cancelled Interventions
Entries in/planned to go to GOCDB
Description | Start | End | Type | Affected VO(s) |
---|---|---|---|---|
Add new node to database SAN for running backups | 28/4/10 1015 | 28/4/10 1430 | At risk | Downtime |
Advanced Planning
- Upgrade to 2.1.8/2.1.9 2010
- Upgrade to SRM 2.8-6 after testing is complete
- ATLAS want to know how much capacity is available in disabled servers (published as Capability). Low priority CIP change to do this.
- CASTOR Instance for Non LHC 2010Q2
- Install/enable gridftp-internal on Gen (Before 2.1.8 upgrade)
Staffing
- Castor on Call person: Shaun
- Staff absences:
  - Matthew (Tues morning)
  - Jens (Mon-Wed)