RAL Tier1 weekly operations castor 07/11/2011

From GridPP Wiki
Revision as of 13:28, 9 November 2011 by Matt viljoen

Operations News

  • Preprod is now working with the stg, lsf and dlf headnodes quattorized
  • The new CIP, which fixes the tape capacity over-reporting, was tested successfully and rolled out to production today

Operations Problems

  • ATLAS outage from Oct 29-31, caused by a bad execution plan and later fixed by a hint provided by Nilo. PM: https://www.gridpp.ac.uk/wiki/RAL_Tier1_Incident_20111031_Castor_ATLAS_Outage
  • On Thursday a problem with the new setup of xroot for aliceDisk was identified and fixed. ALICE were unaffected.
  • On Friday morning a second Pluto node was lost after a motherboard burnout.
  • On Friday morning CMS jobmanager activity ceased for approx. 1 hour. The strace script picked this up but did not provide any useful information. The script has now been changed to actively restart the daemon if it is triggered again.
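
The detect, trace and restart behaviour described in the last item could be sketched roughly as below. This is not the script in use at RAL: the stall test (log file modification time), the 600 second threshold, the service name passed in, and the strace output path are all assumptions made for illustration.

```python
import os
import subprocess
import time

STALL_SECONDS = 600  # assumed threshold; the report does not state one


def is_stalled(log_path, now=None, threshold=STALL_SECONDS):
    """Return True if the log file has not been modified within `threshold` seconds."""
    now = time.time() if now is None else now
    return now - os.path.getmtime(log_path) > threshold


def handle_stall(pid, service, run=subprocess.run):
    """Capture a short strace sample for later diagnosis, then restart the service."""
    # timeout(1) bounds the trace to 10 s; strace -f -p attaches to the
    # running process and its children, -o writes the trace to a file.
    run(["timeout", "10", "strace", "-f", "-p", str(pid),
         "-o", f"/tmp/{service}.strace"], check=False)
    # Hypothetical restart command; the real init script name may differ.
    run(["service", service, "restart"], check=True)


def check_once(log_path, pid, service, now=None, run=subprocess.run):
    """Run one watchdog pass; return True if a restart was triggered."""
    if is_stalled(log_path, now=now):
        handle_stall(pid, service, run=run)
        return True
    return False
```

A cron job or loop would call check_once(...) periodically. The `run` parameter exists only so that the commands can be replaced with a stub when testing.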

Blocking Issues

  • none

Planned, Scheduled and Cancelled Interventions

Entries in/planned to go to GOCDB: none

Advanced Planning

  • Move Tier1 instances to the new database infrastructure, with a Dataguard backup instance in R26
  • Upgrade SRMs to 2.11, which incorporates VOMS support
  • Certify 2.1.11 and evaluate the Transfer Manager (the new LSF replacement)
  • Quattorization of remaining SRM servers
  • Hardware upgrade, Quattorization and Upgrade to SL5 of Tier1 CASTOR headnodes

Staffing

  • Castor on Call person: Shaun
  • Staff absence/out of the office:
    • Chris (all week)
    • (Thu) Shaun in DL
    • (Thu PM) Matthew working from home