RAL Tier1 weekly operations castor 12/08/2013

From GridPP Wiki

Operations News

  • Test for the distributed Oracle transaction bug is now implemented and running across all production stagers
  • Repack successfully upgraded to 2.1.13-9
  • Draining of ATLAS servers is ongoing and so far seems problem-free
  • Work to get HBase logging working is still ongoing
  • Currently moving data at 2 GB/s between cmsTape and cmsDisk

Operations Problems

  • Increasing number of pending jobs within CMS
  • ATLAS deletion problems seem to originate outside RAL. The firewall sent a FIN packet, but it was evidently not received at the destination
  • Two disk servers out of production:
    • gdss664 - Down one drive, waiting for replacement
    • gdss720 - Rob to chase status when Kashif is back

Blocking Issues

  • none

Planned, Scheduled and Cancelled Interventions

Entries in/planned to go to GOCDB

  • none
  • ATLAS renaming to the Rucio namespace to start next week (Wednesday)
  • Continue draining ATLAS disk servers - plan for 5 next week
  • Start draining and decommissioning of cmsTape disk servers
  • Deploy 5 new disk servers into lhcbUser
  • Storage array behind CASTOR standby database needs firmware upgrade

Advanced Planning

Tasks

  • CASTOR 2.1.14 + SL6 testing

Interventions

  • none

Staffing

  • CASTOR on-call person
    • Rob
  • Staff absence/out of the office:
    • Matthew A/L
    • Shaun (Monday/Tuesday)