RAL Tier1 weekly operations castor 14/04/2014

From GridPP Wiki

Operations News

  • The NN_FILE_STAGERTIME constraint has been removed from the Facilities CASTOR database, completing the 2.1.14 upgrade. The upgrade was expected to be transparent, but some daemons did not reconnect, TM and VMGR in particular. This was fixed by restarting the services.
  • The 2.1.14 upgrade has been repeated on Preprod, this time with the NS Compatibility flag enabled, as it will be in Tier 1 when we do staggered upgrades across the instances after the initial NS upgrade.
  • The xrootd timeout in castor.conf is now set to 30s for all nodes.

Operations Problems

  • A 2.1.14 bug was uncovered by Facilities where the DiskManager timeout (set to 2 min) prevented recalled files from being returned to users. We've disabled this timeout.
  • gdss673 failed after draining and has been removed from CASTOR for Fabric intervention.
  • An ATLAS user caused a callout by specifying an incorrect space token on write.

Blocking Issues

  • none

Planned, Scheduled and Cancelled Interventions

Entries in/planned to go to GOCDB: none

Advanced Planning

Tasks

  • ATLAS would like to store c. 2 million EVNT Monte Carlo files. Brian to discuss with Alastair. Other Tier 1s are not keen, but RAL Tier 1 / CASTOR should be able to cope with this.
  • CASTOR 2.1.14 for Tier 1

Interventions

Staffing

  • CASTOR on-call person
    • Rob
  • Staff absence/out of the office:
    • (Mon) Chris A/L
    • (Mon-Tues) Matt A/L
    • (Mon-Thu) Shaun A/L