Tier1 Operations Report 2013-10-09


RAL Tier1 Operations Report for 9th October 2013

Review of Issues during the week 2nd to 9th October 2013.
  • At the time of the meeting last week (2nd Oct) we had a problem with a network switch. This caused a loss of access to some older batch nodes and to the disk servers in AtlasHotDisk. The problem was resolved after about 90 minutes.
  • There was a network break of about 45 minutes overnight Wed/Thu 2/3 October. This broke our connectivity from RAL to the rest of the world. Staff attended on site to fix it. Following this there were some issues with one of the site routers, which was restarted at 10am that day.
  • During the upgrade of the Torque/Maui farm to SL6 we suffered a significant loss of availability for Atlas. During this time Atlas were successfully running jobs through the ARC CEs and the Condor batch farm. However, Atlas do not have critical tests on the ARC CEs, so these CEs were not included in their availability calculations.
Resolved Disk Server Issues
  • GDSS673 (CMSDisk - D1T0) failed on Saturday (28th Sep) - possibly due to a disk failing during a RAID verify. The system was then returned to service on Monday 30th Sep. However, it failed again later that evening. A further two disks failed while it was still rebuilding the RAID array. The system was returned to service on Thursday 3rd Oct. Four CMS data files were declared lost following this incident. These were discovered while performing a routine checksum validation before returning the machine to production. Investigations suggest all four files were in-flight when the machine went down.
Current operational status and issues
  • The uplink to the UKLight Router is running on a single 10Gbit link, rather than a pair of such links.
  • The FTS3 testing has continued very actively with Atlas. Problems with FTS3 are being uncovered during these tests. Patches are being regularly applied to FTS3 to deal with issues.
  • We are participating in xrootd federated access tests for Atlas. The server has now been successfully configured to work as an xroot redirector, whereas before it could only serve as a proxy.
  • The Condor batch farm has been marked as in production. This contains around 50% of the total batch capacity, and all its WNs are running SL6. The remaining nodes are in the Torque/Maui farm, whose WNs have also been upgraded to SL6. We plan to keep this configuration (both farms running SL6 WNs, each with around 50% of the total capacity) until early November.
Ongoing Disk Server Issues
  • None
Notable Changes made this last week.
  • FTS3 was upgraded to version 3.1.22-1.el6 during the afternoon of Wed 2nd October, and then again to 3.1.26-1 on Tuesday 8th Oct.
  • On Thursday 3rd Oct all SL5 nodes in the Torque/Maui farm were stopped and the farm re-started with worker nodes running SL6. Two batches of SL6 WNs were put into the farm that day and another batch the next day. The two final batches of WNs were re-installed with SL6 at the start of this week and added back into the Torque/Maui farm. Some configuration issues with these last two batches of WNs have since been discovered and are being investigated.
  • On Monday morning 7th October a set of fans in the UPS was replaced. During this time there was a marginal additional risk to our services as the UPS was bypassed and would not have been available had there been a general power cut.
  • Tuesday 8th October: Update to Janet6 infrastructure for the primary OPN link to CERN. This was transparent as we switched to the backup link while the work was carried out.
Declared in the GOC DB
  • None
Advanced warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
  • Re-establishing the paired (2*10Gbit) link to the UKLight router.
  • Interruption to some services over Tuesday/Wednesday 5/6 November during work on the UPS and safety testing of its circuits.
  • Alastair reported verbally at the meeting: We plan to start testing CVMFS 2.1.15 now that the SL6 migration has been completed successfully. We are not aware of any specific VO concerns (e.g. GGUS tickets) with the current release and therefore will be testing gradually. We will keep the VOs informed; please let us know of any problems. If things go well we should be doing large scale testing the week after next (week beginning 21st October).

Listing by category:

  • Databases:
    • Switch LFC/FTS/3D to new Database Infrastructure.
  • Castor:
    • None
  • Networking:
    • Single link to UKLight Router to be restored as paired (2*10Gbit) link.
    • Update core Tier1 network and change connection to site and OPN including:
      • Install new Routing layer for Tier1
      • Change the way the Tier1 connects to the RAL network.
      • These changes will lead to the removal of the UKLight Router.
  • Fabric:
    • One of the disk arrays hosting the FTS, LFC & Atlas 3D databases is showing a fault and an intervention is required - initially to update the disk array's firmware.
  • Infrastructure:
    • A 2-day maintenance on the UPS, along with the safety testing of associated electrical circuits, is being planned for the 5th/6th November (TBC). The impact of this on our services is still being worked out. During this work the following issues will be addressed:
      • Intervention required on the "Essential Power Board".
      • Remedial work on the BMS (Building Management System) due to one of its three modules being faulty.
      • Electrical safety check. This will take place over a couple of days during which time individual UPS circuits will need to be powered down.
Entries in GOC DB starting between the 2nd and 9th October 2013.

There was one unscheduled entry in the GOC DB for this period: the Warning for the Janet 6 upgrade of the primary OPN link to CERN, which we advertised late.

| Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason |
|---|---|---|---|---|---|---|
| Whole Site | UNSCHEDULED | WARNING | 08/10/2013 10:30 | 08/10/2013 11:30 | 1 hour | Primary OPN link to CERN being migrated to new Janet 6 infrastructure. |
| lcgce01, lcgce02, lcgce04, lcgce10, lcgce11 | SCHEDULED | OUTAGE | 02/10/2013 11:15 | 03/10/2013 14:00 | 1 day, 2 hours and 45 minutes | Upgrading WNs to SL6. Will drain queues out ahead of WN re-installs with the new OS. |
| Whole Site | SCHEDULED | WARNING | 02/10/2013 10:00 | 02/10/2013 11:30 | 1 hour and 30 minutes | At Risk during test of generator backup to the main UPS. |
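As a cross-check on the Duration column, the sketch below recomputes each duration from the Start and End timestamps listed above. It is illustrative only: the `ENTRIES` list and the `duration` helper are names introduced here, not part of any GOC DB tooling, and the timestamps are simply copied from the table.

```python
from datetime import datetime

# Start/End values copied from the GOC DB entries above (DD/MM/YYYY HH:MM).
# ENTRIES and duration() are hypothetical names used only for this sketch.
ENTRIES = [
    ("Whole Site (Janet 6 migration)", "08/10/2013 10:30", "08/10/2013 11:30"),
    ("lcgce01/02/04/10/11 (SL6 upgrade)", "02/10/2013 11:15", "03/10/2013 14:00"),
    ("Whole Site (generator test)", "02/10/2013 10:00", "02/10/2013 11:30"),
]

TIME_FORMAT = "%d/%m/%Y %H:%M"


def duration(start, end):
    """Return the elapsed time between two timestamps as a timedelta."""
    return datetime.strptime(end, TIME_FORMAT) - datetime.strptime(start, TIME_FORMAT)


for label, start, end in ENTRIES:
    print(f"{label}: {duration(start, end)}")
```

The three results agree with the durations recorded in the table (1 hour; 1 day, 2 hours and 45 minutes; 1 hour and 30 minutes).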
Open GGUS Tickets (Snapshot at time of meeting)
| GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject |
|---|---|---|---|---|---|---|---|
| 97868 | Green | Less Urgent | In Progress | 2013-10-08 | 2013-10-08 | T2K | CVMFS for t2k.org |
| 97759 | Yellow | Urgent | On Hold | 2013-10-04 | 2013-10-04 | OPS | SHA-2 test failing on lcgce01 |
| 97516 | Red | Urgent | In Progress | 2013-09-23 | 2013-09-30 | T2K | [SE][StatusOfPutRequest][SRM_REQUEST_INPROGRESS] errors. |
| 97479 | Red | Very Urgent | On Hold | 2013-09-20 | 2013-09-30 | Atlas | RAL-LCG2, high job failure rate |
| 97385 | Red | Less Urgent | On Hold | 2013-09-17 | 2013-09-26 | HyperK | CVMFS for hyperk.org |
| 97025 | Red | Less Urgent | On Hold | 2013-09-03 | 2013-09-12 | | Myproxy server certificate does not contain hostname |
| 91658 | Red | Less Urgent | On Hold | 2013-02-20 | 2013-09-03 | | LFC webdav support |
| 86152 | Red | Less Urgent | On Hold | 2012-09-17 | 2013-06-17 | | correlated packet-loss on perfsonar host |
Availability Report
| Day | OPS | Alice | Atlas | CMS | LHCb | Comment |
|---|---|---|---|---|---|---|
| 02/10/13 | 100 | 100 | 42.7 | 95.9 | 100 | Atlas: Drain of Torque/Maui farm left Atlas without working CEs in profile; CMS: Single SRM test failure on GET |
| 03/10/13 | 98.4 | 100 | 25.6 | 85.3 | 95.8 | Atlas: As for 02/10; Others: Site Network Break |
| 04/10/13 | 100 | 100 | 100 | 100 | 100 | |
| 05/10/13 | 100 | 100 | 100 | 100 | 100 | |
| 06/10/13 | 100 | 100 | 100 | 100 | 100 | |
| 07/10/13 | 100 | 100 | 100 | 100 | 100 | |
| 08/10/13 | 100 | 100 | 100 | 100 | 100 | |
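For a rough weekly summary, the daily figures above can be averaged per VO. The sketch below takes a simple unweighted mean of the seven daily values; this is an assumption made for illustration and is not necessarily how official availability figures are aggregated. The names `DAILY` and `weekly_mean` are introduced here.

```python
# Daily availability (%) copied from the table above, 02/10/13 to 08/10/13.
# DAILY and weekly_mean() are hypothetical names used only for this sketch.
DAILY = {
    "OPS":   [100, 98.4, 100, 100, 100, 100, 100],
    "Alice": [100, 100, 100, 100, 100, 100, 100],
    "Atlas": [42.7, 25.6, 100, 100, 100, 100, 100],
    "CMS":   [95.9, 85.3, 100, 100, 100, 100, 100],
    "LHCb":  [100, 95.8, 100, 100, 100, 100, 100],
}


def weekly_mean(values):
    """Unweighted mean of the daily availability figures."""
    return sum(values) / len(values)


means = {vo: weekly_mean(values) for vo, values in DAILY.items()}

for vo, mean in means.items():
    print(f"{vo}: {mean:.1f}%")
```

On this simple measure the two low days for Atlas pull its weekly figure down to roughly 81%, while the other VOs stay above 97%.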