
RAL Tier1 Operations Report for 2nd October 2013

Review of Issues during the fortnight 18th September to 2nd October 2013.
  • As reported at the last meeting: There was a problem in the Oracle database behind Castor overnight 17/18 Sep. This was a problem of which we were aware and the fix was to change an Oracle parameter. A Castor outage, with a batch stop/pause was carried out on Wednesday 18th in order to pick up the changed parameter.
  • On Thursday (19th) there was a problem on some batch worker nodes with /tmp filling up and the nodes being set offline. This was traced to some ALICE jobs. Good communication with ALICE enabled a prompt resolution.
  • On Saturday (28th Sep) a problem with a CVMFS repository at CERN, in particular /cvmfs/lhcb-conddb, caused the Condor farm's health check to fail. For a couple of hours (until this test was removed) Condor batch jobs did not start. A sketch of this kind of worker-node health check follows this list.
  • There have been some problems with the Torque/Maui batch server. These affected ALICE test jobs on Saturday (28th). The problem was resolved on Tuesday (1st Oct) when some stuck batch jobs were cleaned up.
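The worker-node health check mentioned above is site-specific; purely as an illustration, a minimal Python sketch of that kind of check is given below. The usage threshold, the repository list and the action taken on failure are assumptions made for the example, not the production RAL configuration.

  #!/usr/bin/env python
  # Illustrative worker-node health check: flag the node if /tmp is nearly
  # full or a CVMFS repository cannot be listed.  The threshold, repository
  # list and "report and exit non-zero" action are assumptions for this
  # sketch, not the production configuration.
  import os
  import sys

  TMP_USAGE_LIMIT = 0.90                 # assumed threshold for /tmp usage
  CVMFS_REPOS = ["/cvmfs/lhcb-conddb"]   # repositories to probe (example)

  def tmp_usage(path="/tmp"):
      """Fraction of the filesystem holding 'path' that is in use."""
      st = os.statvfs(path)
      total = st.f_blocks * st.f_frsize
      free = st.f_bavail * st.f_frsize
      return (total - free) / float(total)

  def cvmfs_ok(repo):
      """Treat a repository as healthy if its top directory is listable."""
      try:
          os.listdir(repo)
          return True
      except OSError:
          return False

  problems = []
  if tmp_usage() > TMP_USAGE_LIMIT:
      problems.append("/tmp nearly full")
  for repo in CVMFS_REPOS:
      if not cvmfs_ok(repo):
          problems.append("CVMFS repository unavailable: " + repo)

  if problems:
      # A real health check would mark the node offline in the batch system.
      print("UNHEALTHY: " + "; ".join(problems))
      sys.exit(1)
  print("OK")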
Resolved Disk Server Issues
  • GDSS670 (AliceDisk - D1T0) failed on Sunday (22nd Sep). A RAID verify failed owing to a faulty disk. It was returned to service the next day.
  • GDSS595 (GenTape - D0T1) was unavailable for a couple of hours on Monday (23rd Sep). The system needed to be restarted as it wouldn't see a replacement disk drive.
  • GDSS611 (LHCbDst - D1T0) failed on Tuesday (24th Sep). The disk controller was reporting many failed drives. The system was returned to service on Thursday (26th Sep) and has been drained for further investigation.
Current operational status and issues
  • The uplink to the UKLight Router is running on a single 10Gbit link, rather than a pair of such links.
  • The FTS3 testing has continued very actively with Atlas. Problems with FTS3 are being uncovered during these tests. Patches are being regularly applied to FTS3 to deal with issues.
  • We are participating in xrootd federated access tests for Atlas. The server has now been successfully configured to work as an xroot redirector, whereas before it could only serve as a proxy.
  • The Condor batch farm has been marked as in production. It contains around 50% of the total batch capacity, with all of its WNs running SL6. The remaining nodes are in the Torque/Maui farm, whose WNs will be upgraded to SL6 tomorrow. That farm is expected to restart with around 20% of the total batch capacity initially, with the remaining nodes added over the following days.
  • Just before this meeting a problem developed that is believed to lie in a network switch stack. We lost access to some older servers (including those in Atlas HotDisk) and some older worker nodes. Investigations are ongoing.
Ongoing Disk Server Issues
  • GDSS673 (CMSDisk - D1T0) is out of production. It failed on Saturday (28th Sep) - possibly due to a disk failing during a RAID verify. The system was returned to service on Monday (30th Sep). However, it failed again when another disk failed while it was still rebuilding the RAID array.
Notable Changes made this last fortnight.
  • The SRMs have been upgraded to SL5.9 with updated errata and kernel (lcgsrm13 on Monday 23rd, the remaining Atlas SRMs on Wed. 25th, all others on Thursday 26th).
  • FTS3 was upgraded to version 3.1.14-1 on Wed 25th and to version 3.1.16-1 the next day.
  • On Thursday 26th the upgrade of all three Top-BDII nodes (lcgbdii01, lcgbdii03, lcgbdii04) to the latest EMI release (v3.8.0) was completed.
  • The BDII component on a number of systems (including ARC and CREAM CEs for the Condor farm) has been upgraded.
  • Condor farm CEs set to Production in the GOC DB on Monday (30th). At this point the Condor farm had around 50% of total batch capacity.
  • Tuesday 1st October: Update to Janet6 infrastructure for the RAL site connections and the backup OPN link to CERN.
Declared in the GOC DB
  • LCGCE01,02,04,10,11 (The Torque/Maui farm) in an Outage as the WNs are upgraded to SL6.
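For reference, declared downtimes such as the one above can also be listed programmatically from the GOC DB. The sketch below uses the public GOCDB programmatic interface; the method name, parameters and XML element names are quoted from memory and should be checked against the current GOCDB PI documentation before use.

  #!/usr/bin/env python
  # Sketch: list ongoing downtimes declared for the site in the GOC DB via
  # its public programmatic interface.  The endpoint, parameters and XML
  # element names are as recalled and should be verified against the GOCDB
  # PI documentation.
  import xml.etree.ElementTree as ET
  from urllib.request import urlopen

  SITE = "RAL-LCG2"
  URL = ("https://goc.egi.eu/gocdbpi/public/"
         "?method=get_downtime&topentity=" + SITE + "&ongoing_only=yes")

  tree = ET.fromstring(urlopen(URL).read())
  for dt in tree.findall("DOWNTIME"):
      print("%-30s %-12s %s" % (dt.findtext("HOSTNAME"),
                                dt.findtext("SEVERITY"),
                                dt.findtext("DESCRIPTION")))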
Advance warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
  • Tomorrow (Thursday 3rd October) - upgrade of the Torque/Maui batch farm WNs to SL6. Drain already underway.
  • Monday 7th October: Replacement of fans in UPS (UPS not available for 4-5 hours).
  • On Tuesday 8th October the primary RAL OPN link to CERN will migrate to the SuperJanet 6 infrastructure.
  • Re-establishing the paired (2*10Gbit) link to the UKLight router.

Listing by category:

  • Databases:
    • Switch LFC/FTS/3D to new Database Infrastructure.
  • Castor:
    • None
  • Networking:
    • Single link to UKLight Router to be restored as paired (2*10Gbit) link.
    • Update core Tier1 network and change connection to site and OPN including:
      • Install new Routing layer for Tier1
      • Change the way the Tier1 connects to the RAL network.
      • These changes will lead to the removal of the UKLight Router.
  • Grid Services
    • Testing of alternative batch systems (Condor) along with ARC & CREAM CEs and SL6 Worker Nodes.
  • Fabric
    • One of the disk arrays hosting the FTS, LFC & Atlas 3D databases is showing a fault and an intervention is required - initially to update the disk array's firmware.
  • Infrastructure:
    • A 2-day maintenance on the UPS along with the safety testing of associated electrical circuits is being planned for the 5th/6th November (TBC). The impact of this on our services is still being worked out. During this the following issues will be addressed:
      • Intervention required on the "Essential Power Board".
      • Remedial work on the BMS (Building Management System) due to one of its three modules being faulty.
      • Electrical safety check. This will take place over a couple of days during which time individual UPS circuits will need to be powered down.
Entries in GOC DB starting between the 18th September and 2nd October 2013.

There was one unscheduled outage in the GOC DB for this period. This was for the stop of Castor (and batch) following a database problem.

Service Scheduled? Outage/At Risk Start End Duration Reason
All Castor (all SRM end points) and all batch (All CEs) UNSCHEDULED OUTAGE 18/09/2013 11:45 18/09/2013 14:45 3 hours We are seeing errors in the database behind Castor. A Castor restart will be done to fix this. Will also pause batch jobs during this time.
lcgwms05.gridpp.rl.ac.uk SCHEDULED OUTAGE 06/09/2013 13:00 11/09/2013 10:55 4 days, 21 hours and 55 minutes Upgrade to EMI-3
lcgce12.gridpp.rl.ac.uk SCHEDULED OUTAGE 05/09/2013 13:00 04/10/2013 13:00 29 days, CE (and the SL6 batch queue behind it) being decommissioned.
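The Duration column is simply the difference between the Start and End timestamps. As a minimal illustration of that calculation (using the DD/MM/YYYY HH:MM form shown in the table above):

  # Sketch: derive the "Duration" column from the Start/End timestamps.
  from datetime import datetime

  FMT = "%d/%m/%Y %H:%M"

  def duration(start, end):
      return datetime.strptime(end, FMT) - datetime.strptime(start, FMT)

  print(duration("18/09/2013 11:45", "18/09/2013 14:45"))  # 3:00:00
  print(duration("06/09/2013 13:00", "11/09/2013 10:55"))  # 4 days, 21:55:00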
Open GGUS Tickets (Snapshot at time of meeting)
GGUS ID Level Urgency State Creation Last Update VO Subject
97516 Red Urgent Waiting Reply 2013-09-23 2013-09-30 T2K [SE][StatusOfPutRequest][SRM_REQUEST_INPROGRESS] errors.
97479 Red Very Urgent On Hold 2013-09-20 2013-09-30 Atlas RAL-LCG2, high job failure rate
97385 Red Less Urgent On Hold 2013-09-17 2013-09-26 HyperK CVMFS for hyperk.org
97025 Red Less urgent On Hold 2013-09-03 2013-09-12 Myproxy server certificate does not contain hostname
95996 Red Urgent On Hold 2013-07-22 2013-09-17 OPS SHA-2 test failing on lcgce01
91658 Red Less Urgent On Hold 2013-02-20 2013-09-03 LFC webdav support
86152 Red Less Urgent On Hold 2012-09-17 2013-06-17 correlated packet-loss on perfsonar host
Availability Report
Day OPS Alice Atlas CMS LHCb Comment
18/09/13 97.2 100 90.6 93.2 93.0 Castor stop for DB restart caused test failures across all VOs (except Alice). Atlas also had a couple of other SRM SUM test failures.
19/09/13 100 100 100 100 100
20/09/13 100 100 100 100 100
21/09/13 100 100 100 100 100
22/09/13 100 100 100 100 100
23/09/13 100 100 100 100 100
24/09/13 100 100 100 100 100
25/09/13 100 100 98.1 100 100 Failed one SUM test during a SRM upgrade then another (Delete) failure overnight.
26/09/13 100 100 100 95.8 100 Single SRM test failure as SRMs were updated.
27/09/13 100 100 100 100 100
28/09/13 100 95.8 100 100 100 Batch problem (Cannot connect to batch server).
29/09/13 100 100 100 100 100
30/09/13 100 100 100 95.7 100 Single SRM test failure on PUT: Error reading token data header:
01/10/13 100 100 74.8 95.8 95.8 Atlas problem affected all sites; Single test failure for LHCb during Janet6 transition; Single test failure for CMS (timeout).
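For reference, per-VO availability over the period can be summarised from the daily figures above. The sketch below takes the unweighted mean of the daily values as transcribed from the table; whether an official summary figure would be computed this way is an assumption.

  # Sketch: unweighted mean availability per VO for 18/09/13 - 01/10/13,
  # using the daily figures transcribed from the table above.
  daily = {
      "OPS":   [97.2] + [100.0] * 13,
      "Alice": [100.0] * 10 + [95.8] + [100.0] * 3,
      "Atlas": [90.6] + [100.0] * 6 + [98.1] + [100.0] * 5 + [74.8],
      "CMS":   [93.2] + [100.0] * 7 + [95.8] + [100.0] * 3 + [95.7, 95.8],
      "LHCb":  [93.0] + [100.0] * 12 + [95.8],
  }
  for vo, values in sorted(daily.items()):
      print("%-5s %5.1f%%" % (vo, sum(values) / len(values)))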