RAL Tier1 weekly operations castor 04/08/2014

Operations News

  • The 2.1.14-13 upgrade of Facilities is complete.
  • We have received word that a 2.1.14-15 version of CASTOR may be forthcoming.
  • Kashyap's Elasticsearch query script has been rolled out to CASTOR headnodes. Users are encouraged to test it and report any bugs.
  • Plans to ensure PreProd represents production in terms of hardware generation are underway. A student will be starting soon with the task of investigating visualisation and querying solutions for CASTOR use.
  • LHCb's remaining batch of 2014 disk servers has been deployed into production.


Operations Problems

  • Major problems were found with the draining script when we tried to drain an ATLAS disk server. The accounting reported obviously wrong numbers (a negative number of files left on the node), and the drain 'finished' without moving all files off the node. We have contacted CERN and are awaiting a response.
  • There were a few ATLAS SUM test failures throughout the week, with Monday 28th the worst.
  • Disk server GDSS680 (atlasStripInput) crashed for no obvious reason and has now been returned to service. One file was lost.
  • A new service class called 'cedaRetrieve' has been created to allow CEDA users (aka Kevin) to manually stage files for retrieval.
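
The draining symptoms reported above (a negative count of files left on the node, and a drain that 'finished' with files still present) are the kind of inconsistency a simple sanity check could flag. The following is a minimal sketch only: `DrainReport`, its fields and `drain_problems` are hypothetical names invented for illustration and are not part of CASTOR or the draining script.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DrainReport:
    """Hypothetical summary of one drain job (not a CASTOR structure)."""
    files_total: int      # files on the node when the drain started
    files_moved: int      # files the drain claims to have moved
    files_remaining: int  # files the accounting says are still on the node
    finished: bool        # whether the drain reports itself as finished


def drain_problems(report: DrainReport) -> List[str]:
    """Return a list of inconsistencies found in a drain report."""
    problems = []
    if report.files_remaining < 0:
        problems.append("negative number of files remaining")
    if report.files_moved + report.files_remaining != report.files_total:
        problems.append("moved + remaining does not equal total")
    if report.finished and report.files_remaining > 0:
        problems.append("drain marked finished with files still on node")
    return problems
```

An empty list means the numbers are at least self-consistent; it does not prove that the files were actually moved.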

Blocking Issues

Planned, Scheduled and Cancelled Interventions

Advanced Planning

Tasks

  • Possible future upgrade to CASTOR 2.1.14-15.
  • Resume draining on the ATLAS instance once the draining issues are resolved.
  • Switch admin machines from lcgccvm02 to lcgcadm05.
  • A new VM, configured to run against the standby CASTOR database, will be created as a front end for dark data and similar queries.
  • Replace DLF with Elasticsearch.
  • Correct the partition alignment issue (3rd CASTOR partition) on new CASTOR disk servers.
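
For the partition alignment task in the list above, a partition is conventionally considered aligned when its starting byte offset is a multiple of the chosen alignment boundary. The sketch below illustrates that arithmetic only. The 512-byte sector size and 1 MiB boundary are assumptions chosen for the example, not values taken from the disk servers in question, and `is_aligned` is a hypothetical helper.

```python
def is_aligned(start_sector: int,
               sector_size: int = 512,          # assumed logical sector size in bytes
               alignment: int = 1024 * 1024     # assumed 1 MiB alignment boundary
               ) -> bool:
    """Return True if the partition's byte offset is a multiple of `alignment`."""
    return (start_sector * sector_size) % alignment == 0


# A partition starting at sector 2048 begins exactly 1 MiB into the disk.
print(is_aligned(2048))
# A partition starting at sector 63 (an old DOS-style layout) does not.
print(is_aligned(63))
```

The appropriate boundary depends on the underlying storage (for example the RAID stripe size), so the default here should be replaced with the value that applies to the hardware.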


Interventions


Staffing

  • CASTOR on-call person
    • Shaun
  • Staff absence/out of the office:
    • Chris and Matt out all week