Tier1 Operations Report 2013-12-18

From GridPP Wiki
Revision as of 13:18, 18 December 2013 by Gareth Smith


RAL Tier1 Operations Report for 18th December 2013

  • The Tier1's plans for the Christmas and New Year holiday can be seen on our blog.


Review of Issues during the week 11th to 18th December 2013.
  • On Thursday (12th Dec) our internal network link to the Atlas building failed for around 30 minutes. A number of production virtual machines were running on hypervisors in that building. Castor services were unaffected; the break mainly affected batch services (some jobs were lost) and the FTS, which was unavailable for the duration. HTCondor should have been resilient to this break, and a bug report has been submitted to the Condor developers. We now have a workaround in place should this recur.
  • There was a problem with Castor that was noticed by Atlas on Sunday (15th Dec). This was traced to a database issue the previous evening that had moved the ATLAS SRM and Stager databases onto another node.
  • We continue to track Atlas files found to be lost during the renaming operations. Around a thousand files are now known to be lost out of the 9 to 10 million renamed. So far we are not aware of any systematic pattern to the missing files, and these numbers are broadly in line with those seen by other Tier1s.
  • Problems were found following updates to some Grid service nodes (e.g. CEs) on Thursday (12th Dec): the newer version of OpenSSL deployed requires longer certificate keys. These problems were resolved the following day.
  • The current rolling update of the Worker Nodes has resolved the recent problem of Atlas job failures caused by the high inode count in CVMFS.
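
The OpenSSL key-length problem above can be checked for from the command line before an update is rolled out. The following is a minimal sketch, assuming a standard openssl binary; the /tmp path and the 2048-bit size are illustrative choices, not taken from the report.

```shell
# Generate a throw-away 2048-bit RSA key (illustrative path) and
# print its size line, e.g. "Private-Key: (2048 bit)".
# Host certificates with short keys (512/1024 bit) are the kind
# that newer OpenSSL versions reject.
openssl genrsa -out /tmp/demo-key.pem 2048 2>/dev/null
openssl rsa -in /tmp/demo-key.pem -noout -text | head -1
```

Running the same `openssl rsa -noout -text` inspection against an existing host key shows at a glance whether it would need re-issuing.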
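
The CVMFS inode symptom above can be watched for with ordinary filesystem tooling. A minimal sketch, assuming a POSIX df; "/" stands in here for the worker-node CVMFS cache partition, whose actual mount point is site-specific.

```shell
# Report inode usage (-i) in portable, non-wrapping format (-P);
# a steadily climbing IUsed column on the cache partition is the
# kind of symptom that preceded the Atlas job failures.
df -i -P / | awk 'NR==2 {print "IUsed:", $3, "IFree:", $4}'
```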
Resolved Disk Server Issues
  • None.
Current operational status and issues
  • None
Ongoing Disk Server Issues
  • None
Notable Changes made this last week.
  • The batch farm roll-out of a Condor update and a reduction in the memory over-commit (as well as kernel/errata updates) is continuing and is now in its final stages.
  • EMI-3 update 11 has been applied on the WMS nodes.
  • Kernel/errata updates on various service nodes (CEs etc).
  • Two new Logging/Bookkeeping servers have been deployed, effectively replacing the older pair, which are being retired.
  • An Argus server for NGI_UK has been set up; configuration and testing will continue in the New Year.
  • Deliveries of one of this year's tranches of disk system purchases are currently ongoing.
Declared in the GOC DB
  • None.
Advanced warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
  • There will be an interruption to the small VOs' software server as it is to be physically moved.

Listing by category:

  • Databases:
    • Switch LFC/FTS/3D to new Database Infrastructure.
  • Castor:
    • Castor 2.1.14 testing is starting. It is expected to be a few months before deployment.
  • Networking:
    • Possible move of Tier1 core network switch in January (TBC).
    • Implementation of new site firewall.
    • Update the core Tier1 network and change the connection to the site and OPN, including:
      • Install new Routing layer for Tier1
      • Change the way the Tier1 connects to the RAL network.
      • These changes will lead to the removal of the UKLight Router.
  • Fabric:
    • Firmware updates on remaining EMC disk arrays (Castor, FTS/LFC)
    • There will be circuit testing of the remaining (i.e. non-UPS) circuits in the machine room during 2014.
Entries in GOC DB starting between the 11th and 18th December 2013.
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
lcglb03.gridpp.rl.ac.uk, lcglb04.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 18/12/2013 11:00 | 31/01/2014 00:00 | 43 days, 13 hours | Old EMI-2 hosts to be retired
Whole Site | SCHEDULED | WARNING | 11/12/2013 10:00 | 11/12/2013 11:19 | 1 hour and 19 minutes | RAL Tier1 site in warning state due to UPS/generator test.
Open GGUS Tickets (Snapshot at time of meeting)
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject
99768 | Green | Less Urgent | Waiting Reply | 2013-12-13 | 2013-12-17 | Atlas | RAL-LCG2_DATADISK: transfer failures with "source file doesn't exist"
99647 | Yellow | Less Urgent | Waiting Reply | 2013-12-12 | 2013-12-17 | SNO+ | lcg-cp connection timeouts
99556 | Red | Very Urgent | In Progress | 2013-12-06 | 2013-12-17 | NGI | Argus requests for NGI_UK
98249 | Red | Urgent | In Progress | 2013-10-21 | 2013-12-10 | SNO+ | please configure cvmfs stratum-0 for SNO+ at RAL T1
98122 | Red | Less Urgent | Waiting Reply | 2013-10-17 | 2013-12-09 | cernatschool | CVMFS access for the cernatschool.org VO
97868 | Red | Less Urgent | In Progress | 2013-10-08 | 2013-12-04 | T2K | CVMFS for t2k.org
97385 | Red | Less Urgent | In Progress | 2013-09-17 | 2013-12-09 | HyperK | CVMFS for hyperk.org
97025 | Red | Less Urgent | On Hold | 2013-09-03 | 2013-11-05 | | Myproxy server certificate does not contain hostname
86152 | Red | Less Urgent | On Hold | 2012-09-17 | 2013-10-18 | | correlated packet-loss on perfsonar host
Availability Report
Day | OPS | Alice | Atlas | CMS | LHCb | Comment
11/12/13 | 100 | 100 | 100 | 100 | 100 |
12/12/13 | 100 | 100 | 100 | 100 | 91.8 | Break in the network link to the Atlas building, where some of the VMs for Condor & CEs were running.
13/12/13 | 100 | 100 | 100 | 100 | 100 |
14/12/13 | 100 | 100 | 100 | 100 | 100 |
15/12/13 | 100 | 100 | 100 | 100 | 100 |
16/12/13 | 100 | 100 | 99.2 | 100 | 100 | Single SRM GET failure: "could not open connection to srm-atlas.gridpp.rl.ac.uk"
17/12/13 | 100 | 100 | 100 | 100 | 100 |