RAL Tier1 weekly operations castor 20/09/2010

From GridPP Wiki

Work previous week

  • Matthew:
    • Upgrade planning and coordination with CERN
    • Coordination of test (certification) upgrade
    • Debugging the 2.1.9 rmmaster problem that appeared to be database corruption; the cause was actually corrupted shared memory
  • Shaun:
    • ..
    • Monitoring LHCb production instance
  • Chris:
    • Preparation for the test upgrade (installing SRM, moving disk servers, preparing Puppet templates)
    • Tested the upgrade procedure; writing the post-upgrade report
    • Castor on Duty
  • Richard:
    • Running 2.1.9 functional tests on CERT instance
  • Brian:
    • ..
  • Jens:
    • Upgrade pre-planning for CIP: in principle nothing needs doing, but checking to be certain.

Operations Issues

  • LHCb performance degradation: Still running acceptably, so farm slots were raised from 800 to 1000. However, one SRM is swapping heavily because it has half the memory of the other SRM; its memory will be increased.
  • gdss280 went back into production after an intervention and soon afterwards started reporting fsprobe errors. Checking the checksums of its files showed that data had been corrupted.
  • The gdss280 problem shows that we need extra testing before putting disk servers that have had an intervention back into production.


  • ..
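
The checksum check described for gdss280 could be scripted as a pre-production test. The following is a minimal sketch only, assuming Adler-32 checksums and a recorded list of expected values; the notes do not say which algorithm or tool was used, and the function names and the `expected` mapping are hypothetical.

```python
import zlib
from pathlib import Path


def adler32_of_file(path, chunk_size=1 << 20):
    """Return the Adler-32 checksum of a file as 8 lowercase hex digits."""
    checksum = 1  # Adler-32 starting value
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            # Feed the running value back in so large files are streamed.
            checksum = zlib.adler32(chunk, checksum)
    return format(checksum & 0xFFFFFFFF, "08x")


def find_mismatches(expected):
    """Return the sorted paths that are missing or whose checksum differs.

    `expected` is a hypothetical mapping of file path -> recorded hex checksum.
    """
    bad = []
    for path, recorded in expected.items():
        if not Path(path).is_file() or adler32_of_file(path) != recorded.lower():
            bad.append(path)
    return sorted(bad)
```

An empty result from `find_mismatches` would be one condition to meet before returning a server to production; any listed path would need investigation first.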

Blocking issues

  • Any ongoing production problems will jeopardize the timeline for starting the 2.1.9 upgrades at the end of this month.

Planned, Scheduled and Cancelled Interventions

Entries in/planned to go to GOCDB

Description          | Start            | End              | Type     | Affected VO(s)
Update LHCb to 2.1.9 | 27/09/2010 08:00 | 29/09/2010 18:00 | Downtime | LHCb

Advanced Planning

  • Upgrade to 2.1.9 2010


  • Castor on Call person: Matt
  • Staff absences:
    • ..