RAL Tier1 weekly operations castor 19/12/2011

From GridPP Wiki

Operations News

  • As a result of the workaround applied on Thu, the new DB hardware is now in full production for the ATLAS SRM.

Operations Problems

  • On Wed night there were more problems with the DNS server Chilton. This time the DB machines were affected, as their DNS lookup order had not been changed after the previous week's DNS problems.
  • On Thu morning, the ATLAS SRM schema started consuming far more resources than it should. This appears to be a bug in the Oracle Resource Manager. The SRM schema was moved to the new hardware, where it is now running in production.
  • On Fri morning, CMS and Gen started malfunctioning due to one node in the Pluto database becoming unresponsive and affecting the whole rack.
  • The CIP stopped publishing new data as it was not reconfigured to point to the new DB after the ATLAS SRM was moved.

Blocking Issues

  • none

Planned, Scheduled and Cancelled Interventions

Entries in/planned to go to GOCDB: none

Advanced Planning

  • Move Tier1 instances to the new Database infrastructure, with a Dataguard backup instance in R26
  • Upgrade SRMs to 2.11, which incorporates VOMS support
  • Certify 2.1.11 and evaluate the Transfer Manager (the new LSF replacement)
  • Quattorization of remaining SRM servers
  • Hardware upgrade, Quattorization and Upgrade to SL5 of Tier1 CASTOR headnodes

Staffing

  • Castor on Call person: Chris
  • Staff absence/out of the office:
    • Shaun (A/L)