RAL Tier1 weekly operations castor 03/02/2014

From GridPP Wiki

Operations News

  • Testing of 2.1.14 ongoing.
  • The new headnodes for preprod are ready to be deployed.
  • The Virtual Certification instance running 2.1.14 has been repaired and the DB upgraded.
  • CERN are seeing problems with SRM crashes when running against the current version of 2.1.14. Aggressive restarters would be required to run this version in production.

Operations Problems

  • Caltech are still having problems with writing a particular file into CASTOR.
  • On Monday one node of the Pluto DB crashed and was returned to the correct node at 15:15. This was the likely cause of a SUM test failure for ATLAS.
  • On Tuesday a problem was encountered with SRM-MV operations on GEN for SNO+. This is now understood and fixed in the latest version of the SRM tools (currently only available on one RAL UI).
  • On Wednesday & Thursday we were hit by a load-related issue on the CMS instance. This was caused by an excessive number of xroot daemons running simultaneously. The situation was improved by reducing the number of slots on disk servers in the cmsDisk pool (and purging jobs from CASTOR), hence reducing the number of jobs that can run concurrently on a disk server.
  • On Thursday night and Friday we had a number of SUM test failures (2 for ATLAS and 2 for LHCb). These are currently under investigation.

Blocking Issues

  • none

Planned, Scheduled and Cancelled Interventions

Entries in/planned to go to GOCDB

  • The tape system will be down from 0800 to 1000 on Tuesday to allow work to be carried out on it. This is not expected to have any impact on operations.

Advanced Planning

Tasks

  • CASTOR 2.1.14 + SL5/6 testing
  • iptables to be installed on lcgcviewer01 to harden the logging system against the injection of junk data by security scans.
  • Quattor cleanup process. First step is to deal with 200-odd servers in 'misc'.
  • Installation of new Preprod headnodes
  • A complete dump of the ATLAS namespace is in progress.

Interventions

  • none

Staffing

  • CASTOR on-call person
    • Rob
  • Staff absence/out of the office:
    • Chris out Monday
    • Rob out Monday morning until 11.
    • Matt out Monday