Tier1 Operations Report 2013-09-18


RAL Tier1 Operations Report for 18th September 2013

Review of Issues during the week 11th to 18th September 2013.
  • There was a problem in the Oracle database behind Castor that triggered a callout last night (17/18 Sep). This is a problem we were already aware of, and the fix is to change an Oracle parameter; however, the change requires a restart of the database to take effect. A Castor outage, with batch jobs paused, is being taken today so that this can be done.
  • There were some Top-BDII issues last Wednesday (11th Sep) and again on Friday (13th Sep). The (then) current version of the Top-BDII software has performance issues and can become unresponsive; this can sometimes (but not always) be alleviated by restarting the service. On Thursday (12th Sep) one of the Top-BDII nodes (lcgbdii04) was upgraded to a newer version of the software. Although the new version is responsive, it has a bug that can leave information about some Site-BDIIs missing. It was this bug that affected our Top-BDII service on Friday; it was fixed by removing the affected node from the alias.
  • On 11/12 Sep there were problems with one of our (two) Site-BDIIs, causing SUM test failures. This particular problem on the Site-BDII had not been seen before and was fixed by a restart.
  • There was a very brief failure of both links to CERN this morning. We did fail over to the backup link at one point, but tests show there was a short period (maybe a few minutes) with both links affected. We have received a ticket stating that this was due to emergency maintenance.
Resolved Disk Server Issues
  • None.
Current operational status and issues
  • The uplink to the UKLight Router is running on a single 10Gbit link, rather than a pair of such links.
  • FTS3 testing has continued very actively. Atlas have moved the UK, German and French clouds to use it. Problems with FTS3 are being uncovered during these tests and patches are regularly applied to deal with them. (The FTS3 servers were upgraded to version 3.1.12-1.el6 during the last week.)
  • We are participating in xrootd federated access tests for Atlas. The server has now been successfully configured to work as an xroot redirector, whereas before it could only serve as a proxy.
  • The Change Control process has agreed that we will move to a Condor batch farm with (at least initially) both ARC and CREAM CEs. Final testing of this farm (ARC CEs, Condor, SL6) is ongoing. The '08 and '09 batches of worker nodes have been moved from the old (Torque/Maui) farm to the Condor farm, which has run OK since the addition of these nodes. However, we have not yet finalised whether the migration of the remaining nodes to SL6 will be done by moving those WNs to the Condor farm or by upgrading a portion of the Torque/Maui farm 'in situ'.
Ongoing Disk Server Issues
  • None
Notable Changes made this last fortnight.
  • On Thursday (12th Sep) Top-BDII node lcgbdii04 was upgraded to the EMI 3.8.0 Top-BDII.
  • On Thursday (12th Sep) FTS3 was upgraded to version 3.1.12-1.el6.
  • The 'whole node' queue on the Torque/Maui farm is no longer declared as available to VOs in the information system.
  • The '08 and '09 batches of worker nodes have been moved to the Condor batch farm giving it around 40% of the total (HEPSpec) capacity.
Declared in the GOC DB
  • LCGCE12 (CE for SL6 test Queue on the production batch farm) is in a long Outage ready for decommissioning.
Advance warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
  • The "Whole Node" queue on the Torque/Maui batch service is being terminated. Multi-core jobs and those requiring SL6 can be run on the Condor batch system.
  • Thursday 26th September is allocated for upgrading the Torque/Maui batch farm WNs to SL6.
  • On Tuesday 1st October RAL network connections will move to SuperJanet 6.
  • Monday 7th October: Replacement of fans in UPS (UPS not available for 4-5 hours).
  • Re-establishing the paired (2*10Gbit) link to the UKLight router.

Listing by category:

  • Databases:
    • Switch LFC/FTS/3D to new Database Infrastructure.
  • Castor:
    • None
  • Networking:
    • Single link to UKLight Router to be restored as paired (2*10Gbit) link.
    • Update the core Tier1 network and change the connection to the site and the OPN, including:
      • Install new Routing layer for Tier1
      • Change the way the Tier1 connects to the RAL network.
      • These changes will lead to the removal of the UKLight Router.
  • Grid Services
    • Testing of alternative batch systems (Condor) along with ARC & CREAM CEs and SL6 Worker Nodes.
  • Fabric
    • One of the disk arrays hosting the FTS, LFC & Atlas 3D databases is showing a fault and an intervention is required - initially to update the disk array's firmware.
  • Infrastructure:
    • A 2-day maintenance on the UPS, along with safety testing of the associated electrical circuits, is being planned for the 5th/6th November (TBC). The impact of this on our services is still being worked out. During this maintenance the following issues will be addressed:
      • Intervention required on the "Essential Power Board".
      • Remedial work on the BMS (Building Management System) due to one of its three modules being faulty.
      • Electrical safety check. This will take place over a couple of days during which time individual UPS circuits will need to be powered down.
Entries in GOC DB starting between the 11th and 18th September 2013.

There was one unscheduled outage in the GOC DB for this period. This is for the stop of Castor (and batch) following a database problem.

Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
All Castor & Batch | UNSCHEDULED | OUTAGE | 18/09/2013 11:45 | 18/09/2013 14:45 | 3 hours | We are seeing errors in the database behind Castor. A Castor restart will be done to fix this. Will also pause batch jobs during this time.
lcgwms05.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 06/09/2013 13:00 | 11/09/2013 10:55 | 4 days, 21 hours and 55 minutes | Upgrade to EMI-3
lcgce12.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 05/09/2013 13:00 | 04/10/2013 13:00 | 29 days | CE (and the SL6 batch queue behind it) being decommissioned.
Open GGUS Tickets (Snapshot at time of meeting)
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject
97385 | Green | Less Urgent | In Progress | 2013-09-17 | 2013-09-18 | HyperK | CVMFS for hyperk.org
97360 | Green | Less Urgent | In Progress | 2013-09-17 | 2013-09-17 | EPIC | Problem experienced by epic.vo.gridpp.ac.uk VO with RAL and Glasgow WMS
97320 | Green | Urgent | Waiting Reply | 2013-09-15 | 2013-09-16 | Atlas | No such file or directory and checksum mismatch
97025 | Red | Less Urgent | On Hold | 2013-09-03 | 2013-09-12 | | Myproxy server certificate does not contain hostname
95996 | Red | Urgent | On Hold | 2013-07-22 | 2013-09-17 | OPS | SHA-2 test failing on lcgce01
91658 | Red | Less Urgent | On Hold | 2013-02-20 | 2013-09-03 | | LFC webdav support
86152 | Red | Less Urgent | On Hold | 2012-09-17 | 2013-16-17 | | correlated packet-loss on perfsonar host
Availability Report
Day | OPS | Alice | Atlas | CMS | LHCb | Comment
11/09/13 | 91.9 | 100 | 100 | 100 | 100 | Problem with one of the Site-BDIIs overnight.
12/09/13 | 89.9 | 100 | 100 | 100 | 100 | Problem with one of the Site-BDIIs overnight.
13/09/13 | 100 | 100 | 100 | 100 | 100 |
14/09/13 | 100 | 100 | 100 | 95.9 | 100 | Single SRM SUM test failure. (Error reading token data header)
15/09/13 | 100 | 100 | 100 | 96.0 | 100 | Single SRM SUM test failure. (Error reading token data header)
16/09/13 | 100 | 100 | 100 | 96.0 | 100 | Single SRM SUM test failure. (Error reading token data header)
17/09/13 | 100 | 100 | 100 | 95.9 | 100 | Single SRM SUM test failure. (Error reading token data header)