RAL Tier1 weekly operations castor 20/09/2013

From GridPP Wiki
Revision as of 13:02, 20 September 2013 by Shaun de witt


Operations News

  • All cmsTape disk servers scheduled for redeployment have been removed from CASTOR
  • ATLAS have changed the timing of their deletion scripts for ATLASSCRATCHDISK to work around the timeout problems observed.
    • The underlying cause of these is still not understood
  • Kernel and errata updates have been applied on preprod; Bruno and Chris are working on vcert.
  • Rob believes he has a solution to stop the SRMs logging into older files, but it needs a kernel update and a reboot.
    • Postponed until next week


Operations Problems

  • A dblink error was observed on production (as seen previously on Facilities).
    • The problem was resolved in the same manner, with almost no downtime.
    • The same fix should also be applied to the standby (John to investigate).
  • ATLAS HammerCloud tests have shown large but intermittent failure rates; the cause is under investigation (Alastair).
  • Brian believes there are about 1.5M dark files on scratch disk.
    • Shaun to start a namespace dump of scratch disk.
  • No progress on HBase logging

Blocking Issues

  • none

Planned, Scheduled and Cancelled Interventions

Entries in/planned to go to GOCDB

  • none

Advanced Planning

Tasks

  • CASTOR 2.1.14 + SL6 testing, once 2.1.14 is released.

Interventions

  • none

Staffing

  • Castor on Call person
    • Rob
  • Staff absence/out of the office:
    • (Mon-Thu) Matt at RDA
    • (Wed-Thu) Shaun at EUDAT