Tier1 Operations Report 2013-10-30

From GridPP Wiki
Revision as of 10:54, 30 October 2013 by Gareth smith (Talk | contribs)


RAL Tier1 Operations Report for 30th October 2013

Review of Issues during the week 23rd to 30th October 2013.
  • The Torque/Maui batch farm still has one batch of worker nodes disabled. Apart from that it has run reasonably well. The Condor farm has run OK.
  • Two files were declared lost to Atlas following the failure of GDSS720. These were in transit as the server went down.
Resolved Disk Server Issues
  • None
Current operational status and issues
  • FTS3 testing with Atlas has continued very actively. The tests are uncovering problems in FTS3, and patches are regularly being applied to deal with them.
  • We are participating in xrootd federated access tests for Atlas. The server has now been successfully configured to work as an xrootd redirector, whereas before it could only serve as a proxy.
  • We are running with the two farms, Condor and Torque/Maui, in production. The Torque/Maui farm will be decommissioned after the intervention next week and its nodes moved into the Condor farm.
  • The uplink from the Tier1 core switch to the UK Light router that was doubled last week has been working OK since that change.
Ongoing Disk Server Issues
  • GDSS720 (AtlasDataDisk - D1T0) crashed during the evening of 22nd October. It has been drained. Following a firmware update to the RAID controller it is undergoing two weeks of acceptance testing before being returned to production.
Notable Changes made this last week.
  • CVMFS client version 2.1.15-1 has been rolled out to all worker nodes in the Condor farm.
  • A further update was applied to FTS3 last Wednesday, 23rd October (upgraded to version 3.1.33-1).
Declared in the GOC DB
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
BDIIs (lcgbdii, site-bdii), lcgfts.gridpp.rl.ac.uk, lfc.gridpp.rl.ac.uk, Myproxy (lcgrbp01, myproxy) | SCHEDULED | WARNING | 05/11/2013 07:00 | 06/11/2013 12:00 | 1 day, 5 hours | Warning (At Risk) on services during intervention on Uninterruptible Power Supply (UPS). Some services (LFC, FTS) will experience two breaks of around one to two hours during this period.
All Castor (all SRMs), Atlas Frontier | SCHEDULED | OUTAGE | 05/11/2013 07:00 | 05/11/2013 21:00 | 14 hours | Stop of systems (Castor, Frontier/3D database) during work on Uninterruptible Power Supply (UPS).
Condor batch farm (arc-ce01, arc-ce02, arc-ce03, cream-ce01, cream-ce02), lcgargus01, VO boxes, lcgapel01, atlas-squid, cms-squid, UIs (lcgui01, lcgui02), WMSs (lcgwms04, lcgwms05, lcgwms06), Perfsonar (perfsonar-ps01, perfsonar-ps02) | SCHEDULED | OUTAGE | 05/11/2013 07:00 | 06/11/2013 15:00 | 1 day, 8 hours | Stop of systems (Batch, WMS) during work on Uninterruptible Power Supply (UPS).
lcgce01, lcgce02, lcgce04, lcgce10, lcgce11 | SCHEDULED | OUTAGE | 05/11/2013 07:00 | 30/11/2013 23:59 | 25 days, 16 hours and 59 minutes | Service being decommissioned.
lcgwms04, lcgwms05, lcgwms06 | SCHEDULED | OUTAGE | 01/11/2013 12:00 | 05/11/2013 07:00 | 3 days, 19 hours | Drain of WMSs ahead of their shutdown during work on UPS.
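The Duration column in the entries above follows directly from the Start and End times. As a minimal sketch of that arithmetic in Python, assuming the day/month/year timestamp layout used in the table (the function name downtime_duration is hypothetical and is not part of any GOC DB tooling):

```python
from datetime import datetime

# Timestamp layout assumed from the table above: day/month/year hour:minute.
FMT = "%d/%m/%Y %H:%M"

def downtime_duration(start: str, end: str) -> tuple:
    """Return (days, hours, minutes) between two timestamps in FMT layout."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    total_minutes = int(delta.total_seconds()) // 60
    days, remainder = divmod(total_minutes, 24 * 60)
    hours, minutes = divmod(remainder, 60)
    return days, hours, minutes

# UPS intervention warning: 1 day, 5 hours.
print(downtime_duration("05/11/2013 07:00", "06/11/2013 12:00"))  # (1, 5, 0)
# Decommissioning of the lcgce hosts: 25 days, 16 hours and 59 minutes.
print(downtime_duration("05/11/2013 07:00", "30/11/2013 23:59"))  # (25, 16, 59)
```

Both results agree with the durations declared for those entries.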
Advance warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
  • Interruption to services over Tuesday/Wednesday 5/6 November during work on the UPS and safety testing of its circuits. Outages and Warnings declared in GOC DB.

Listing by category:

  • Databases:
    • Switch LFC/FTS/3D to new Database Infrastructure.
  • Castor:
    • None
  • Networking:
    • Update core Tier1 network and change connection to site and OPN including:
      • Install a new routing layer for the Tier1.
      • Change the way the Tier1 connects to the RAL network.
      • These changes will lead to the removal of the UKLight Router.
  • Fabric:
    • One of the disk arrays hosting the FTS, LFC & Atlas 3D databases is showing a fault and an intervention is required - initially to update the disk array's firmware.
Entries in GOC DB starting between the 23rd and 30th October 2013.
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
All Castor (all SRMs), batch (All CEs), lcgfts, lfc | SCHEDULED | OUTAGE | 23/10/2013 09:45 | 23/10/2013 12:15 | 2 hours and 30 minutes | Upgrade (doubling) of network data link. Some risk of disruption to our Tier1 network so some services stopped during the work. Other services at risk.
All systems not in the above outage. | SCHEDULED | WARNING | 23/10/2013 09:45 | 23/10/2013 12:15 | 2 hours and 30 minutes | Upgrade (doubling) of network data link. Some risk of disruption to our Tier1 network - some services At Risk. (Other services declared down in separate GOC DB entry).
Open GGUS Tickets (Snapshot at time of meeting)
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject
98337 | Amber | Urgent | In Progress | 2013-10-23 | 2013-10-23 | Mice | Slow file uploads to castor (MICE)
98249 | Red | Urgent | In Progress | 2013-10-21 | 2013-10-30 | SNO+ | please configure cvmfs stratum-0 for SNO+ at RAL T1
98214 | Red | Less Urgent | In Progress | 2013-10-19 | 2013-10-21 | CMS | HC Job failure reading dataset from T1_UK_RAL storage
98122 | Red | Less Urgent | In Progress | 2013-10-17 | 2013-10-30 | cernatschool | CVMFS access for the cernatschool.org VO
97868 | Red | Less Urgent | Waiting Reply | 2013-10-08 | 2013-10-30 | T2K | CVMFS for t2k.org
97759 | Red | Urgent | On Hold | 2013-10-04 | 2013-10-04 | OPS | SHA-2 test failing on lcgce01
97385 | Red | Less Urgent | In Progress | 2013-09-17 | 2013-10-14 | HyperK | CVMFS for hyperk.org
97025 | Red | Less Urgent | On Hold | 2013-09-03 | 2013-09-12 | | Myproxy server certificate does not contain hostname
91658 | Red | Less Urgent | On Hold | 2013-02-20 | 2013-09-03 | | LFC webdav support
86152 | Red | Less Urgent | On Hold | 2012-09-17 | 2013-10-18 | | correlated packet-loss on perfsonar host
Availability Report
Day | OPS | Alice | Atlas | CMS | LHCb | Comment
23/10/13 | 89.6 | 89.6 | 87.4 | 89.6 | 89.6 | Systems stopped for doubling of data uplink.
24/10/13 | 100 | 100 | 85.9 | 100 | 100 | Atlas Castor problem caused by a draining disk server.
25/10/13 | 100 | 100 | 100 | 100 | 100 |
26/10/13 | 100 | 100 | 99.5 | 100 | 100 | Single SRM test failure "Error reading token data header:"
27/10/13 | 100 | 100 | 100 | 100 | 100 |
28/10/13 | 100 | 100 | 100 | 100 | 100 |
29/10/13 | 100 | 100 | 100 | 95.9 | 100 | Single SRM test failure "Error reading token data header:"
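
The 89.6 figure for 23/10/13 is consistent with the 2 hours and 30 minutes outage declared in the GOC DB for that day. The sketch below shows the arithmetic, under the simplifying assumption that daily availability is just the fraction of the 24-hour day not spent in outage. The real figures are derived from monitoring test results, which may explain why Atlas shows a different value for the same day. The function name availability_percent is hypothetical.

```python
def availability_percent(outage_minutes: float,
                         period_minutes: float = 24 * 60) -> float:
    """Percentage of the period not in outage, rounded to one decimal place.

    Assumption: availability is time-based only; this is not how the
    monitoring framework necessarily computes the published numbers.
    """
    return round(100.0 * (1 - outage_minutes / period_minutes), 1)

# 23/10/2013 09:45 to 12:15 is 150 minutes of declared outage.
print(availability_percent(150))  # 89.6
```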