RAL Tier1 weekly operations castor 12/11/2012

From GridPP Wiki
Revision as of 13:41, 12 November 2012 by Matt viljoen (Talk | contribs)

Operations News

  • The August errata were applied to the central NS hosts during the unscheduled downtime caused by the power cut.

Operations Problems

  • A power cut took down CASTOR and all Tier1 services at 11:00 on Wed. Full service was not restored until 13:00 on Thu.
  • lcgsrm03 (ATLAS SRM) sustained hardware problems during the power outage, and was replaced by lcgsrm13.
  • On Sun, Oracle switched to a bad execution plan, which caused the ATLAS SRMs to perform sub-optimally. A change to Oracle planned for this week will prevent a recurrence by freezing a good execution plan.
  • NTP was not running on the disk servers; the resulting clock skew caused the SRM VO SAM tests to fail for ATLAS and CMS.

Blocking Issues

  • Central syslog collection of the central service logs must be enabled before we turn off Amanda backups on all CASTOR headnodes.

Planned, Scheduled and Cancelled Interventions

Entries in/planned to go to GOCDB: none

Advanced Planning

Tasks

  • Simplify and document Quattor templates to make them easier to maintain
  • Test and certify 2.1.13-5 with simplified Quattor templates

Interventions

  • Upgrade stagers from 2.1.12 to 2.1.13 and central services (NS, CUPV, VDQM) from 2.1.11 to 2.1.13

Staffing

  • CASTOR on-call person
    • Matthew
  • Staff absence/out of the office:
    • ..