RAL Tier1 weekly operations castor 18/01/2010
Summary of Previous Week
- Matthew:
- Intervention planning
- Tape-backed persistent tests
- Access to CASTOR for facilities testing
- Preprod stress test planning
- Shaun:
- Chris:
- Testing new kernel on certification before intervention
- Working with Tim on getting repack instance working again
- Working on certification and preproduction instance
- Fixed ralreplicas for LHCb and Gen
- Redeployed gdss110 for repack instance
- Cheney:
- Re-validation of EMC kit
- Relocate neptune voting disk ahead of cdbc08 retirement
- Prep for DB take-on of EMC kit
- Fixed IPMI SOL on IBM x3550
- A little patching
- Tim:
- Working on getting repack working again
- Setting up and deleting ATLAS tape families
- Richard:
- Brian:
- Draining of RAID5 disk servers within ATLAS
- Removal of service classes which are no longer needed
- Planning for new ATLAS/LHCb server deployment and consequent draining
- Jens:
- CIP upgrade planning
- Debugging CASTOR SRM authentication problem with Shaun
Developments for this week
- Matthew:
- Intervention planning
- CIP and new castoradm1 testing
- Disk server redeployment
- Setting up preprod stress test
- Shaun:
- Testing SRM with CASTOR 2.1.8-17 client libraries
- Testing database load with nameserver checksum trigger
- SRM testing
- Understanding the castormon source code
- Recovering certification system
- Chris:
- Update repack instance plus preprod disk servers (castor30x)
- Test disk server deployment procedure using Quattor
- Test maximum number of LSF job slots for an 18TB disk server (see the LSF sketch after this list)
- Test a 64-bit disk server with XFS (see the XFS sketch after this list)
- Test access restriction to disk servers with Jonathan (see the pam_access sketch after this list)
- Work on preproduction instance
- Cheney:
- Revalidation of EMC kit
- Restore castoradm1 (again)
- Fitting memory sticks
- Training on how to use IPMI (see the ipmitool sketch after this list)
- Tim:
- Look at what needs to be purchased for LHC and non-LHC CASTOR instances
- Richard:
- Brian:
- Draining of RAID5 disk servers within ATLAS
- Jens:
- Prepare for and upgrade all RAL T1 production CIPs to 2.1.0
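For the LSF job slot test on Chris's list, the per-host slot limit lives in LSF's lsb.hosts file. A minimal sketch of the kind of entry we would vary during the test is below; the hostname and slot count are placeholders, not production values.

    # lsb.hosts - per-host maximum job slots (MXJ); values are illustrative only
    Begin Host
    HOST_NAME     MXJ       # Keywords
    gdss999       40        # placeholder 18TB disk server and trial slot count
    default       !         # "!" means one slot per CPU
    End Host

After editing, running badmin reconfig reloads the batch configuration without a restart.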
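For the 64-bit XFS test, the filesystem creation and mount steps are standard; a minimal sketch is below, with the device and mount point as placeholders.

    # Illustrative only - device and mount point are placeholders
    mkfs.xfs -f /dev/sdb1
    mkdir -p /exportstage/fs1
    mount -t xfs -o noatime,inode64 /dev/sdb1 /exportstage/fs1
    # matching /etc/fstab entry:
    # /dev/sdb1   /exportstage/fs1   xfs   noatime,inode64   0 0

The inode64 option is the main point of testing on a 64-bit host, since it lets XFS allocate inodes anywhere on a multi-terabyte filesystem.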
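For the access restriction test with Jonathan, one likely mechanism is pam_access; the sketch below assumes that approach, and the permitted admin account is a placeholder rather than an agreed policy.

    # /etc/security/access.conf - illustrative policy only
    + : root : ALL
    + : csfadm : ALL        # placeholder admin account
    - : ALL : ALL
    # enforced by adding this line to /etc/pam.d/sshd (or system-auth):
    # account    required    pam_access.so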
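For the IPMI training item, the commands to cover are standard ipmitool invocations such as the ones below; the BMC hostname and credentials are placeholders.

    # Illustrative ipmitool usage - BMC address and credentials are placeholders
    ipmitool -I lanplus -H x3550-bmc.example.ac.uk -U admin -P secret chassis power status
    ipmitool -I lanplus -H x3550-bmc.example.ac.uk -U admin -P secret sol activate      # serial-over-LAN console
    ipmitool -I lanplus -H x3550-bmc.example.ac.uk -U admin -P secret sol deactivate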
Operations Issues
- ATLAS LSF became momentarily unstable due to large log files caused by missing servers (a log-rotation sketch follows below)
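One way to keep the LSF logs from growing unbounded while servers are missing would be size-based rotation; the sketch below is illustrative only, and the log path and limits are assumptions rather than the current configuration.

    # /etc/logrotate.d/lsf - illustrative; path and limits are assumptions
    /var/lsf/log/*.log {
        size 100M
        rotate 4
        compress
        missingok
        notifempty
    }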
Blocking issues
- The lack of Quattor configuration files for SLC4.8 is preventing us from evaluating Quattor alongside CASTOR 2.1.8. Preprod setup will initially proceed with a Kickstart-based deployment.
- Preprod DB can only be delivered after EMC testing is done (3rd week after Jan '10)
Planned, Scheduled and Cancelled Interventions
- 18-22 January - At-risk while memory on database nodes is upgraded
- 19 January - Move castoradm1 to a newer host. No downtime or at-risk
- 19-20 January - Upgrade SRM CASTOR client to 2.1.8-17
- 21 January - Upgrade LHC CIP and introduce redundancy of CIPs. 1 hour at-risk
- 27-28 January
- FSCK disk servers and pick up new kernels
- Add IPMI to CASTOR head nodes
- Replace cdbc08 and add new DB archive log destination
- Install NameServer checksum trigger
- Restrict user login on disk servers
- The following have not been folded into the above schedule; they can be fitted in around other work since they are, at worst, an 'at-risk'.
- Update fetch-crl RPM on disk servers (see the sketch below)
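Rolling the fetch-crl update out across the disk servers is a plain RPM update; a sketch assuming yum and an ssh loop is below, with the host list as a placeholder.

    # Illustrative only - hostnames are placeholders
    for h in gdss101 gdss102 gdss103; do
        ssh root@"$h" "yum -y update fetch-crl && /usr/sbin/fetch-crl"
    done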
Advanced Planning
- Gen upgrade to 2.1.8 in 2010 Q1
- Install/enable gridftp-internal on Gen (this year, before the 2.1.8 upgrade)
Staffing
- CASTOR on-call person: Chris