Tier1 Operations Report 2013-10-23

From GridPP Wiki
Revision as of 10:48, 23 October 2013 by Gareth smith (Talk | contribs)


RAL Tier1 Operations Report for 23rd October 2013

Review of Issues during the week 16th to 23rd October 2013.
  • The Torque/Maui batch farm continued to give problems during the second half of last week. Over the weekend (19/20 October) one batch of worker nodes was identified as problematic and was disabled. Since then the Torque/Maui batch system has run without problems. Investigation of that batch of worker nodes is ongoing; indications are that the cause is most likely the network switch used to connect them.
Resolved Disk Server Issues
  • None
Current operational status and issues
  • The FTS3 testing has continued very actively with Atlas. Problems with FTS3 are being uncovered during these tests. Patches are being regularly applied to FTS3 to deal with issues.
  • We are participating in xrootd federated access tests for Atlas. The server has now been successfully configured to work as an xroot redirector, whereas before it could only serve as a proxy.
  • The Condor batch farm has been marked as in production. It contains around 50% of the total batch capacity, and all its WNs are running SL6. The remaining nodes are in the Torque/Maui farm, whose WNs have also been upgraded to SL6. We plan to keep this configuration (both farms running SL6 WNs, each with around 50% of the total capacity) until early November.
Ongoing Disk Server Issues
  • GDSS720 (AtlasDataDisk - D1T0) crashed yesterday evening. It has been taken out of production. It will be returned to production (planned for today) and drained ahead of further investigations.
Notable Changes made this last week.
  • CVMFS client version 2.1.15-1 has been installed on two batches of worker nodes in the Condor farm.
  • This morning (23rd October) the data uplink from the Tier1 was doubled from 10 to 20 Gbit/sec.
Declared in the GOC DB
  • None
Advance warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
  • Interruption to services over Tuesday/Wednesday 5/6 November during work on the UPS and safety testing of its circuits. Initial plans propose Castor down for the day on Tuesday 5th.

Listing by category:

  • Databases:
    • Switch LFC/FTS/3D to new Database Infrastructure.
  • Castor:
    • None
  • Networking:
    • Single link to UKLight Router to be restored as paired (2*10Gbit) link.
    • Update core Tier1 network and change connection to site and OPN including:
      • Install new Routing layer for Tier1
      • Change the way the Tier1 connects to the RAL network.
      • These changes will lead to the removal of the UKLight Router.
  • Fabric:
    • One of the disk arrays hosting the FTS, LFC & Atlas 3D databases is showing a fault and an intervention is required - initially to update the disk array's firmware.
  • Infrastructure:
    • A 2-day maintenance on the UPS, along with the safety testing of associated electrical circuits, is being planned for the 5th/6th November (TBC). The impact of this on our services is still being worked out. During this work the following issues will be addressed:
      • Intervention required on the "Essential Power Board".
      • Remedial work on the BMS (Building Management System) due to one of its three modules being faulty.
      • Electrical safety check. This will take place over a couple of days during which time individual UPS circuits will need to be powered down.
Entries in GOC DB starting between the 16th and 23rd October 2013.
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
Whole site except CEs; SRMs, LFC & FTS | SCHEDULED | WARNING | 23/10/2013 09:45 | 23/10/2013 12:15 | 2 hours and 30 minutes | Upgrade (doubling) of network data link. Some risk of disruption to our Tier1 network - some services At Risk. (Other services declared down in separate GOC DB entry).
All CEs; All SRMs, LFC & FTS | SCHEDULED | OUTAGE | 23/10/2013 09:45 | 23/10/2013 12:15 | 2 hours and 30 minutes | Upgrade (doubling) of network data link. Some risk of disruption to our Tier1 network so some services stopped during the work. Other services at risk.
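As a cross-check of the Duration column, the interval can be recomputed from the Start and End timestamps. This is a minimal sketch only: the helper name `duration_minutes` and the format string are illustrative assumptions, not part of the GOC DB or any GridPP tooling.

```python
from datetime import datetime

# Timestamp layout used in the GOC DB listing above (day/month/year hour:minute).
FMT = "%d/%m/%Y %H:%M"


def duration_minutes(start: str, end: str) -> int:
    """Return the whole number of minutes between two GOC DB style timestamps."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return int(delta.total_seconds() // 60)


# Start and End values copied from both entries in the listing.
minutes = duration_minutes("23/10/2013 09:45", "23/10/2013 12:15")
print(f"{minutes // 60} hours and {minutes % 60} minutes")
```

Both entries share the same window, so a single calculation covers them.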
Open GGUS Tickets (Snapshot at time of meeting)
GGUS ID Level Urgency State Creation Last Update VO Subject
98294 Green Less Urgent In Progress 2013-10-23 2013-10-23 Atlas Failed transfers from srm-atlas.gridpp.rl.ac.uk to RRC-KI-T1
98249 Green Urgent In Progress 2013-10-21 2013-10-23 SNO+ please configure cvmfs stratum-0 for SNO+ at RAL T1
98214 Green Less Urgent In Progress 2013-10-19 2013-10-21 CMS HC Job failure reading dataset from T1_UK_RAL storage
98122 Green Less Urgent In Progress 2013-10-17 2013-10-22 cernatschool CVMFS access for the cernatschool.org VO
97908 Red Less Urgent In Progress 2013-10-09 2013-10-22 Backup UK VOMS servers
97868 Red Less Urgent Waiting Reply 2013-10-08 2013-10-21 T2K CVMFS for t2k.org
97759 Red Urgent On Hold 2013-10-04 2013-10-04 OPS SHA-2 test failing on lcgce01
97385 Red Less Urgent In Progress 2013-09-17 2013-10-14 HyperK CVMFS for hyperk.org
97025 Red Less Urgent On Hold 2013-09-03 2013-09-12 Myproxy server certificate does not contain hostname
91658 Red Less Urgent On Hold 2013-02-20 2013-09-03 LFC webdav support
86152 Red Less Urgent On Hold 2012-09-17 2013-10-18 correlated packet-loss on perfsonar host
Availability Report
Day | OPS | Alice | Atlas | CMS | LHCb | Comment
16/10/13 | 100 | 100 | 100 | 100 | 100 |
17/10/13 | 100 | 100 | 99.3 | 100 | 100 | Single SRM Get test failure (Error reading token data header).
18/10/13 | 100 | 100 | 100 | 100 | 100 |
19/10/13 | 100 | 100 | 99.2 | 100 | 100 | Single SRM Get test failure (SRM_FILE_BUSY).
20/10/13 | 100 | 100 | 100 | 100 | 100 |
21/10/13 | 100 | 100 | 99.0 | 100 | 100 | Single SRM Put failure (Internal error).
22/10/13 | 100 | 100 | 100 | 100 | 100 |
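To put the three single-test failures in context, the daily Atlas figures can be averaged over the week. The sketch below assumes a simple unweighted mean of the seven daily values; official availability figures may be calculated differently, and the function name `weekly_mean` is hypothetical.

```python
from statistics import mean

# Daily Atlas availability (%) for 16-22 October 2013, copied from the table above.
atlas_daily = [100, 99.3, 100, 99.2, 100, 99.0, 100]


def weekly_mean(values):
    """Return the unweighted arithmetic mean of daily availability percentages."""
    return mean(values)


atlas_week = weekly_mean(atlas_daily)
print(f"Atlas mean availability over the week: {atlas_week:.2f}%")
```

The other columns (OPS, Alice, CMS, LHCb) are 100 on every day in the table, so their mean under the same assumption is 100.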