Tier1 Operations Report 2013-11-06

RAL Tier1 Operations Report for 6th November 2013

Review of Issues during the week 30th October to 6th November 2013.
  • The Torque/Maui batch farm, with one of its batches of worker nodes disabled, continued to run until its final drain.
  • The significant outages of services for the UPS work are described below. To keep the Top-BDII service up, two replacement nodes were installed on non-UPS power; however, these have not run smoothly. The LFC, MyProxy (lcgrbp01) and FTS3 services stayed up. Some services (FTS2, LFC, Atlas 3D/Frontier) were up most of the time, suffering two short (1 - 2 hour) outages during yesterday (5th). Castor was down from 07:00 to 19:00 yesterday; batch (the CEs in front of the Condor farm) was down from 07:00 yesterday until 13:00 today.
  • The "UKLight" router stopped during the afternoon of Tuesday 5th November when the power to its rack failed. This router carries the links to CERN and the 'bypass' route to Tier2s. The failure did not cause any operational problems, as other services were already down for the UPS work in the main computer building. It was unrelated to the planned work - purely coincidental: the UKLight router is in a different building.
Resolved Disk Server Issues
  • None
Current operational status and issues
  • FTS3 testing with Atlas has continued very actively. These tests are uncovering problems in FTS3, and patches are regularly applied to deal with the issues found.
  • We are participating in xrootd federated access tests for Atlas. The server has now been successfully configured to work as an xroot redirector, whereas before it could only serve as a proxy. (A client-side sketch of how a redirector is queried follows this list.)
  • The uplink from the Tier1 core switch to the UKLight router, which was doubled last week, has been working well since that change.
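To illustrate the redirector behaviour referred to above, the sketch below uses the XRootD Python bindings to ask a redirector which data server actually holds a file. This is a hypothetical client-side example, not our test configuration: the hostname, port and file path are placeholders.

    # Hypothetical sketch using the XRootD Python bindings: ask a
    # redirector which data server(s) hold a file. The hostname,
    # port and path are placeholders, not the RAL test endpoints.
    from XRootD import client
    from XRootD.client.flags import OpenFlags

    # Point the client at the redirector rather than at a data server.
    fs = client.FileSystem('root://xroot-redirector.example.ac.uk:1094')

    # locate() makes the redirector resolve the file to the server(s)
    # holding it; a proxy would instead fetch the data on our behalf.
    status, locations = fs.locate('/atlas/some/test/file.root',
                                  OpenFlags.REFRESH)
    if status.ok:
        for replica in locations:
            print(replica.address)  # data server resolved by the redirector
    else:
        print('lookup failed:', status.message)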
Ongoing Disk Server Issues
  • GDSS720 (AtlasDataDisk - D1T0) crashed during the evening of 22nd October. It has been drained. Following a firmware update to the RAID controller it is undergoing two weeks of acceptance testing before being returned to production.
Notable Changes made this last week.
  • There was a significant outage yesterday and this morning for work on the UPS - changes to the "Essential Power Board" and an electrical safety check during which all UPS circuits were tested.
  • The Torque/Maui batch farm has been stopped. Its worker nodes will be moved into the Condor farm. The CREAM CEs that served this farm (lcgce01, lcgce02, lcgce04, lcgce10, lcgce11) are in a long downtime in the GOC DB ahead of decommissioning.
Declared in the GOC DB
Service Scheduled? Outage/At Risk Start End Duration Reason
CEs for Torque/Maui farm. (lcgce01, lcgce02, lcgce04, lcgce10, lcgce11) SCHEDULED OUTAGE 05/11/2013 07:00 30/11/2013 23:59 25 days, 16 hours and 59 minutes Service being decommissioned.
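The quoted duration follows directly from the start and end timestamps in the entry above; a minimal Python check (a sketch using only the published dates) is:

    # Check of the quoted downtime duration, using only the start and
    # end timestamps from the GOC DB entry above.
    from datetime import datetime

    start = datetime(2013, 11, 5, 7, 0)    # 05/11/2013 07:00
    end = datetime(2013, 11, 30, 23, 59)   # 30/11/2013 23:59

    delta = end - start
    hours, remainder = divmod(delta.seconds, 3600)
    print('%d days, %d hours and %d minutes'
          % (delta.days, hours, remainder // 60))
    # -> 25 days, 16 hours and 59 minutes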
Advanced warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.

Listing by category:

  • Databases:
    • Switch LFC/FTS/3D to new Database Infrastructure.
  • Castor:
    • Castor 2.1.14 testing is starting. It is expected to be a few months before deployment.
  • Networking:
    • Update core Tier1 network and change connection to site and OPN including:
      • Install new Routing layer for Tier1
      • Change the way the Tier1 connects to the RAL network.
      • These changes will lead to the removal of the UKLight Router.
  • Fabric:
    • One of the disk arrays hosting the FTS, LFC & Atlas 3D databases is showing a fault and an intervention is required - initially to update the disk array's firmware.
Entries in GOC DB starting between the 30th October and 6th November 2013.
Service Scheduled? Outage/At Risk Start End Duration Reason
CEs for Torque/Maui farm. (lcgce01, lcgce02, lcgce04, lcgce10, lcgce11) SCHEDULED OUTAGE 05/11/2013 07:00 30/11/2013 23:59 25 days, 16 hours and 59 minutes Service being decommissioned.
All Castor (all SRMs), Atlas Frontier (lcgft-atlas.gridpp.rl.ac.uk) SCHEDULED OUTAGE 05/11/2013 07:00 05/11/2013 19:13 12 hours and 13 minutes Stop of systems (Castor, Frontier/3D database) during work on Uninterruptible Power Supply (UPS).
All Batch (arc-ce01, arc-ce02, arc-ce03, cream-ce01, cream-ce02, atlas-squid, cms-squid, VO boxes, WMSs (lcgwms04, lcgwms05, lcgwms06), perfsonar (perfsonar-ps01, perfsonar-ps02)) SCHEDULED OUTAGE 05/11/2013 07:00 06/11/2013 15:00 1 day, 8 hours Stop of systems (Batch, WMS) during work on Uninterruptible Power Supply (UPS).
lcgbdii, site-bdii, lcgfts, lcgrbp01, myproxy, lfc. SCHEDULED WARNING 05/11/2013 07:00 06/11/2013 12:00 1 day, 5 hours Warning (At Risk) on services during intervention on Uninterruptible Power Supply (UPS). Some services (LFC, FTS) will experience two breaks of around one to two hours during this period.
All WMSs (lcgwms04, lcgwms05, lcgwms06) SCHEDULED OUTAGE 01/11/2013 12:00 05/11/2013 07:00 3 days, 19 hours Drain of WMSs ahead of their shutdown during work on UPS.
Open GGUS Tickets (Snapshot at time of meeting)
GGUS ID Level Urgency State Creation Last Update VO Subject
98625 Green Urgent In Progress 2013-11-04 2013-11-04 LHCb Data unavailable for Brazilian proxies at RAL-LCG2
98249 Red Urgent In Progress 2013-10-21 2013-10-30 SNO+ please configure cvmfs stratum-0 for SNO+ at RAL T1
98122 Red Less Urgent In Progress 2013-10-17 2013-10-30 cernatschool CVMFS access for the cernatschool.org VO
97868 Red Less Urgent Waiting Reply 2013-10-08 2013-10-30 T2K CVMFS for t2k.org
97759 Red Urgent On Hold 2013-10-04 2013-10-04 OPS SHA-2 test failing on lcgce01
97385 Red Less Urgent In Progress 2013-09-17 2013-10-14 HyperK CVMFS for hyperk.org
97025 Red Less Urgent On Hold 2013-09-03 2013-11-05 Myproxy server certificate does not contain hostname
91658 Red Less Urgent On Hold 2013-02-20 2013-09-03 LFC webdav support
86152 Red Less Urgent On Hold 2012-09-17 2013-10-18 correlated packet-loss on perfsonar host
Availability Report
Day OPS Alice Atlas CMS LHCb Comment
30/10/13 100 100 100 100 100
31/10/13 100 100 100 97.7 100 Single SRM test failure (on Put) "Error reading token data header:"
01/11/13 100 100 99.0 100 100 Atlas SRM test failure, provoked by local Atlas file deletions taking place at the same time.
02/11/13 100 100 100 100 100
03/11/13 100 100 100 100 100
04/11/13 100 85.0 71.3 100 74.6 Mainly the drain of batch ahead of the next day's UPS work. Atlas also had a single SRM SUM test failure.
05/11/13 29.2 0 0 48.9 0 Effect of the outage for UPS work in building R89.
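For illustration of how a daily figure relates to an outage window, the sketch below computes the fraction of a day a single service was up, given its outage period. This is only a rough single-service illustration: the published numbers come from the SUM availability tests, which combine several service checks - which is why the OPS figure for 05/11/13 is lower than the Castor outage alone would give.

    # Rough single-service illustration only: the published figures
    # come from the SUM tests, which combine several service checks.
    from datetime import datetime

    DAY_MINUTES = 24 * 60

    def availability(outages):
        """Percentage of the day falling outside the outage windows."""
        down = sum((end - start).total_seconds() / 60
                   for start, end in outages)
        return 100.0 * (DAY_MINUTES - down) / DAY_MINUTES

    # Castor outage on 05/11/13, 07:00 - 19:13 (from the GOC DB entry above)
    castor = [(datetime(2013, 11, 5, 7, 0), datetime(2013, 11, 5, 19, 13))]
    print(round(availability(castor), 1))  # -> 49.1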