RAL Tier1 weekly operations castor 18/01/2010

Summary of Previous Week

  • Matthew:
    • Intervention planning
    • Tape-backed persistent tests
    • Access to CASTOR for facilities testing
    • Preprod stress test planning
  • Shaun:
  • Chris:
    • Testing new kernel on certification before intervention
    • Working with Tim on getting repack instance working again
    • Working on certification and preproduction instance
    • Fixed ralreplicas for LHCB and GEN
    • Redeployed gdss110 for repack instance
  • Cheney:
    • Re-validation of EMC kit
    • Relocate the neptune voting disk ahead of the cdbc08 retirement
    • Prep for DB take-on of EMC kit
    • Fixed IPMI SOL on the IBM x3550
    • A little patching
  • Tim:
    • Working on getting repack working again
    • Setting up and deleting ATLAS tape families
  • Richard:
  • Brian:
    • Draining of RAID5 disk servers within ATLAS
    • Removal of service classes which are no longer needed
    • Planning for new ATLAS/LHCb server deployment and the consequent draining
  • Jens:
    • CIP upgrade planning
    • Debugging CASTOR SRM authentication problem with Shaun

Developments for this week

  • Matthew:
    • Intervention planning
    • CIP and new castoradm1 testing
    • Disk server redeployment
    • Setting up preprod stress test
  • Shaun:
    • Testing SRM with CASTOR 2.1.8-17 client libraries
    • Testing database load with nameserver checksum trigger
    • SRM testing
    • Understanding the castormon source code
    • Recovering certification system
  • Chris:
    • Update repack instance plus preprod disk servers (castor30x)
    • Test disk server deployment procedure using Quattor
    • Test the maximum number of LSF job slots for an 18TB disk server
    • Test a 64-bit disk server with XFS (a write-test sketch follows this list)
    • Test access restriction to disk servers with Jonathan
    • Work on preproduction instance
  • Cheney:
    • Revalidation of EMC kit
    • Restore castoradm1 (again)
    • Fitting memory sticks
    • Training on how to use IPMI
  • Tim:
    • Look at what needs to be purchased for the LHC and non-LHC CASTOR instances
  • Richard:
  • Brian:
    • Draining of RAID5 disk servers within ATLAS
  • Jens:
    • Prepare for and upgrade all RAL T1 production CIPs to 2.1.0
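
For the 64-bit XFS disk server test in the list above, a first sanity check of sequential write performance could look like the sketch below. This is only an illustration, not the agreed test procedure: the device, mount point and transfer size are placeholders.

    #!/usr/bin/env python
    # Sketch only: format a spare partition as XFS, mount it and time a
    # sequential write. DEVICE and MOUNTPOINT are placeholders, and
    # mkfs.xfs destroys whatever is currently on DEVICE.
    import subprocess
    import time

    DEVICE = "/dev/sdb1"          # hypothetical spare data partition
    MOUNTPOINT = "/mnt/xfstest"   # hypothetical mount point
    SIZE_GB = 10                  # amount of data to write

    def run(cmd):
        print("+ " + " ".join(cmd))
        subprocess.check_call(cmd)

    run(["mkfs.xfs", "-f", DEVICE])
    run(["mount", "-t", "xfs", DEVICE, MOUNTPOINT])

    start = time.time()
    run(["dd", "if=/dev/zero", "of=%s/write_test.dat" % MOUNTPOINT,
         "bs=1M", "count=%d" % (SIZE_GB * 1024), "oflag=direct"])
    elapsed = time.time() - start
    print("%d GB written in %.0f s (%.0f MB/s)"
          % (SIZE_GB, elapsed, SIZE_GB * 1024 / elapsed))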

Operations Issues

  • ATLAS LSF became momentarily unstable due to large log files caused by missing servers
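
As a follow-up to the LSF log file problem above, a trivial check along these lines could be run from cron to flag oversized scheduler logs before they cause instability. The directory and threshold are assumptions, not the actual paths on the production machines.

    #!/usr/bin/env python
    # Sketch only: warn about oversized files in the LSF log directory.
    # LOG_DIR and THRESHOLD_GB are illustrative guesses.
    import os

    LOG_DIR = "/var/lsf/logs"   # hypothetical LSF log directory
    THRESHOLD_GB = 1.0          # warn above this size

    for name in sorted(os.listdir(LOG_DIR)):
        path = os.path.join(LOG_DIR, name)
        if not os.path.isfile(path):
            continue
        size_gb = os.path.getsize(path) / (1024.0 ** 3)
        if size_gb > THRESHOLD_GB:
            print("WARNING: %s is %.1f GB" % (path, size_gb))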

Blocking issues

  • The lack of Quattor configuration files for SLC4.8 is preventing us from evaluating Quattor alongside CASTOR 2.1.8. The preprod setup will initially proceed with a Kickstart-based deployment.
  • The preprod DB can only be delivered after EMC testing is done (3rd week after Jan '10)

Planned, Scheduled and Cancelled Interventions

  • 18-22 January - at-risk while memory on database nodes is upgraded
  • 19 January - Move castoradm1 to a newer host. No downtime or at-risk
  • 19,20 January - Upgrade the SRM CASTOR client to 2.1.8-17
  • 21 January - Upgrade the LHC CIP and introduce CIP redundancy. 1 hour at-risk
  • 27-28 January
    • FSCK disk servers and pick up new kernels
    • Add IPMI to CASTOR head nodes
    • Replace cdbc08 and add a new DB archive log destination
    • Install nameserver checksum trigger
    • Restrict user login on disk servers (a pam_access sketch follows this list)
  • The following has not been folded into the above schedule. It can be fitted in around the other work as it is, at worst, an ‘At Risk’.
    • Update the fetch-crl rpm on disk servers (a rollout sketch follows this list)
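
One possible way to implement the disk server login restriction above is pam_access: append a rule set to /etc/security/access.conf that allows root plus an admin group and denies everyone else. The group name below is a placeholder, and pam_access.so must already be enabled in the PAM account stack for the rules to be consulted.

    #!/usr/bin/env python
    # Sketch only: append pam_access rules allowing root and an assumed
    # "castoradm" group, denying all other interactive logins.
    RULES = [
        "+ : root : ALL",
        "+ : (castoradm) : ALL",   # hypothetical admin group
        "- : ALL : ALL",
    ]

    conf = open("/etc/security/access.conf", "a")
    conf.write("\n# Restrict interactive logins on CASTOR disk servers\n")
    for rule in RULES:
        conf.write(rule + "\n")
    conf.close()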
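
For the fetch-crl rpm update, a minimal rollout loop over ssh could look like the sketch below; in practice this would more likely be handled by the usual configuration management. The host list is a placeholder, not the production disk server list.

    #!/usr/bin/env python
    # Sketch only: update the fetch-crl rpm on each disk server over ssh.
    import subprocess

    DISK_SERVERS = ["gdss-example-1", "gdss-example-2"]   # placeholder hostnames

    for host in DISK_SERVERS:
        print("Updating fetch-crl on " + host)
        rc = subprocess.call(["ssh", host, "yum", "-y", "update", "fetch-crl"])
        if rc != 0:
            print("  FAILED on %s (exit code %d)" % (host, rc))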

Advanced Planning

  • GEN upgrade to 2.1.8 in 2010 Q1
  • Install/enable gridftp-internal on GEN (this year, before the 2.1.8 upgrade)

Staffing

  • CASTOR on-call person: Chris