Tier1 Operations Report 2013-12-11

From GridPP Wiki

RAL Tier1 Operations Report for 11th December 2013

Review of Issues during the week 4th to 11th December 2013.
  • A number of files (some tens) have been found to be missing from Castor as part of the Atlas renaming exercise. Currently around 280 files are missing out of around 7 million renamed. So far these have been older files, as the renaming started with those. They are being catalogued and dealt with in blocks. So far we are not aware of any systematic pattern to the missing files. These numbers are broadly in line with those seen by other Tier1s.
  • Independently of the above, one specific file was reported missing by Atlas in a GGUS ticket (and has been declared lost to them).
  • Batch work for the non-LHC VOs was stopped and drained on Tuesday/Wednesday (10/11 Dec) so that the software server could be moved.
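To put the missing-file count in proportion, the fraction of renamed files that are missing can be worked out directly from the two approximate figures quoted above. The following is only an illustrative sketch: the function name `missing_fraction` and the variable names are hypothetical, and both input numbers are the rounded "around" values from this report rather than exact counts.

```python
def missing_fraction(missing: int, total: int) -> float:
    """Return the fraction of renamed files that were found to be missing."""
    if total <= 0:
        raise ValueError("total must be positive")
    return missing / total


# Approximate figures quoted in the report (both are "around" values).
missing_files = 280
renamed_files = 7_000_000

fraction = missing_fraction(missing_files, renamed_files)

# Express the result as a percentage and as a "1 in N" ratio.
print(f"Missing fraction: {fraction:.4%}")
print(f"Roughly 1 file in {round(1 / fraction):,}")
```

With these rounded inputs the loss rate comes out at about 0.004%, or roughly one file in 25,000.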
Resolved Disk Server Issues
  • None.
Current operational status and issues
  • Atlas have reported batch job failures. The most probable cause is thought to be a high inode counter in CVMFS. We are rolling out some updates to the worker nodes that require a reboot of each WN, and will monitor the effect of this on the job failures.
Ongoing Disk Server Issues
  • None.
Notable Changes made this last week.
  • A UPS/Generator load test was carried out successfully this morning.
  • A Condor update and a reduction in the memory over-commit (as well as kernel/errata updates) are being rolled out across the batch farm.
  • FTS3 has been upgraded (to version 3.1.46-1.el6).
Declared in the GOC DB
  • None.
Advanced warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
  • There will be an interruption to the small VOs' software server as it is to be physically moved.

Listing by category:

  • Databases:
    • Switch LFC/FTS/3D to new Database Infrastructure.
  • Castor:
    • Castor 2.1.14 testing is starting. It is expected to be a few months before deployment.
  • Networking:
    • Possible move of Tier1 core network switch in January (TBC).
    • Implementation of new site firewall.
    • Update core Tier1 network and change connection to site and OPN including:
      • Install new Routing layer for Tier1
      • Change the way the Tier1 connects to the RAL network.
      • These changes will lead to the removal of the UKLight Router.
  • Fabric:
    • Firmware updates on remaining EMC disk arrays (Castor, FTS/LFC)
    • There will be circuit testing of the remaining (i.e. non-UPS) circuits in the machine room during 2014.
Entries in GOC DB starting between the 4th and 11th December 2013.
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
Whole Site | SCHEDULED | WARNING | 11/12/2013 10:00 | 11/12/2013 12:00 | 2 hours | RAL Tier1 site in warning state due to UPS/generator test.
Open GGUS Tickets (Snapshot at time of meeting)
GGUS ID Level Urgency State Creation Last Update VO Subject
99556 Yellow Very Urgent In Progress 2013-12-06 2013-12-10 NGI Argus requests for NGI_UK
98249 Red Urgent In Progress 2013-10-21 2013-12-10 SNO+ please configure cvmfs stratum-0 for SNO+ at RAL T1
98122 Red Less Urgent Waiting Reply 2013-10-17 2013-12-09 cernatschool CVMFS access for the cernatschool.org VO
97868 Red Less Urgent In Progress 2013-10-08 2013-12-04 T2K CVMFS for t2k.org
97385 Red Less Urgent In Progress 2013-09-17 2013-12-09 HyperK CVMFS for hyperk.org
97025 Red Less Urgent On Hold 2013-09-03 2013-11-05 Myproxy server certificate does not contain hostname
86152 Red Less Urgent On Hold 2012-09-17 2013-10-18 correlated packet-loss on perfsonar host
Availability Report
Day OPS Alice Atlas CMS LHCb Comment
04/12/13 100 100 100 100 100
05/12/13 100 100 100 100 100
06/12/13 100 100 100 100 100
07/12/13 100 100 100 100 100
08/12/13 100 100 100 100 100
09/12/13 100 100 100 100 100
10/12/13 100 100 100 100 100