RAL Tier1 weekly operations castor 25/04/2011


Operations News

  • None

Operations Problems

  • On 15/4/11 the ATLAS SRM began underperforming because its Oracle database was using incorrect execution plans. ATLAS were put into 6 hours of downtime.
  • On 20/4/11 the co-location of the NS and ATLAS schemas on the same database node caused problems that made the node unresponsive and affected all users for 1 hour. The NS was moved to another node and the affected node was rebooted.
  • On 21/4/11 the ATLAS SRM again became too slow because the optimiser statistics were stale and the default execution plan needed changing (see the sketch after this list). ATLAS were put into 2 hours of downtime.
  • Gen stager database problems over the weekend (details to be completed).
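
The two ATLAS SRM slowdowns above (15/4 and 21/4) were both traced to the Oracle optimiser choosing poor execution plans because its statistics were stale. As a minimal sketch only, and not the exact procedure used on the production databases, the snippet below shows how schema statistics could be regathered with Oracle's DBMS_STATS package from Python; the connection details and the schema name SRM_ATLAS are hypothetical.

   # Minimal sketch: regather optimiser statistics for one schema so the
   # optimiser stops building execution plans from stale row counts.
   # The connection details and schema name are hypothetical examples.
   import cx_Oracle

   conn = cx_Oracle.connect("dba_user", "password", "db-host:1521/SRMDB")
   cur = conn.cursor()

   # Anonymous PL/SQL block calling DBMS_STATS.GATHER_SCHEMA_STATS;
   # cascade => TRUE also refreshes the index statistics for the schema.
   cur.execute("""
       BEGIN
           DBMS_STATS.GATHER_SCHEMA_STATS(
               ownname => 'SRM_ATLAS',
               cascade => TRUE
           );
       END;
   """)

   conn.close()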

Blocking Issues

  • The lack of production-class hardware running Oracle 10g needs to be resolved before CASTOR for Facilities goes into full production. The hardware has now arrived and we are awaiting installation.

Planned, Scheduled and Cancelled Interventions

Entries in/planned to go to GOCDB

  • None

Advanced Planning

  • Upgrade of CASTOR clients on WNs to 2.1.10-0
  • Upgrade tape subsystem to 2.1.10-1 which allows us to support files >2TB
  • Move Tier1 instances to the new database infrastructure, with a Data Guard backup instance in R26
  • Upgrade Facilities instance to 2.1.10-0
  • Move Facilities instance to new Database hardware running 10g
  • Upgrade SRMs to 2.10-3, which incorporates:
    • VOMS support
  • Start migrating from T10KA to T10KC media later this year
  • Quattorization of remaining SRM servers
  • Hardware upgrade and Quattorization of CASTOR headnodes

Staffing

  • CASTOR on-call person: Chris
  • Staff absence/out of the office:
    • Shaun A/L (all week)