RAL Tier1 weekly operations castor 12/12/2011

From GridPP Wiki

Operations News

  • NS/VDQM/VMGR successfully upgraded to 2.1.11-8 on the certification system, and functional tests run against the 2.1.10-1 stager. The next step is to upgrade the stager.
  • Hardware for the new NS is now set up and working

Operations Problems

  • On Tue morning, during high load on ATLAS, the DB team were alerted to session deadlocks on the SRM schema. Following the established workaround, SRM daemons were restarted on all ATLAS SRMs, which fixed the situation. Although there were some FTS transfer failures, we are not aware of users being disrupted.
  • On Tue late afternoon there were more session deadlock problems, and a GGUS ticket was raised against RAL. On this occasion, the problem disappeared without us doing anything.
  • On Wed the primary DNS failed and this especially affected ATLAS. After changing the DNS lookup order, the situation improved.
  • On Thu afternoon the CMS mighunter stopped working for unknown reasons. Investigations continuing.

Blocking Issues

  • none

Planned, Scheduled and Cancelled Interventions

Entries in/planned to go to GOCDB: none

Advanced Planning

  • Move Tier1 instances to the new database infrastructure, with a Dataguard backup instance in R26
  • Upgrade SRMs to 2.11, which incorporates VOMS support
  • Certify 2.1.11 and evaluate the Transfer Manager (the new LSF replacement)
  • Quattorization of remaining SRM servers
  • Hardware upgrade, Quattorization and Upgrade to SL5 of Tier1 CASTOR headnodes


  • Castor on Call person: Shaun
  • Staff absence/out of the office:
    • ..