Tier1 Operations Report 2012-11-28

RAL Tier1 Operations Report for 28th November 2012

Review of Issues during the fortnight 14th to 28th November 2012
  • There has been a recurring problem with the Castor Atlas and GEN stager daemons consuming excessive memory. This caused a number of problems, the first on Thursday afternoon (15th Nov) for GEN; the problem also affected Atlas on the 16th & 17th. A regular re-starter is now in place for this daemon; a minimal sketch of how such a re-starter might work is shown after this list.
  • The known problem of the batch server process consuming memory has recurred on a number of occasions since the power cut on the 7th.
  • Overnight 19/20 Nov there was a failure of one of the network stacks (stack 15) which was resolved the following morning. This affected a small number of services including the Atlas Frontier squids.
  • On Tuesday 20th November there was a major power incident during a planned intervention on the electrical system for the UPS. This resulted in an over-voltage being applied to systems on UPS power. All Tier1 systems were unavailable for around 24 hours; Castor services were unavailable for around 50 hours, with batch services brought back after that. There were numerous broken power supplies, PDUs and network switches, and some services (notably batch capacity and tape throughput) have been running at reduced capacity since then. The Tier1 remains with very limited resilience until more replacements can be obtained. On Friday 23rd, the first full day of services following this incident, there were still some residual issues, including a rack of eight disk servers (a mixture of Atlas & CMS) being unavailable for two to three hours and a short network break while one of the site routers (Router A) was restarted. A Post Mortem report for this incident is being prepared.
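The re-starter mentioned above is not described in detail in this report; the sketch below is only a minimal illustration, assuming a cron-driven watchdog that restarts a daemon once its resident memory grows beyond a threshold. The service name, memory limit and restart command are illustrative assumptions, not the actual Tier1 configuration.

 # Hypothetical sketch: restart a daemon whose resident memory exceeds a limit.
 # Intended to be run regularly (e.g. from cron); names and limits are assumed.
 import subprocess
 import sys

 SERVICE = "castor-stager"        # illustrative daemon name, not the real one
 RSS_LIMIT_KB = 8 * 1024 * 1024   # assumed limit: restart above ~8 GB resident

 def rss_kb(pid):
     """Return the resident set size (kB) of a process, read from /proc."""
     with open("/proc/%d/status" % pid) as status:
         for line in status:
             if line.startswith("VmRSS:"):
                 return int(line.split()[1])
     return 0

 def main():
     # pidof exits non-zero if the daemon is not running.
     try:
         pid = int(subprocess.check_output(["pidof", "-s", SERVICE]).strip())
     except subprocess.CalledProcessError:
         sys.exit(SERVICE + " is not running")

     if rss_kb(pid) > RSS_LIMIT_KB:
         # Restart via the init system; adjust for whichever init system is in use.
         subprocess.check_call(["service", SERVICE, "restart"])

 if __name__ == "__main__":
     main()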
Resolved Disk Server Issues
  • GDSS439 (AtlasDataDisk) failed with a read-only filesystem in the early morning of 17th Nov. It was returned to service on the morning of 18th Nov.
  • GDSS629 to GDSS632 (AtlasDataDisk) & GDSS633 to GDSS636 (CMSTape) were unavailable for a few hours on Friday 23rd Nov when the power to the rack was tripped.
  • GDSS611 (LHCbDst - D1T0) was unavailable for a few hours on Friday 23rd Nov. The Castor partitions were not mounted owing to a RAID error.
  • GDSS523 (CMSTape) shut itself down following a (believed erroneous) over-temperature report in the early hours of Sunday 25th November. It was booted up and drained before being checked out, had its IPMI firmware updated, and was returned to service at lunchtime on 26th Nov.
  • A second CMSTape disk server showed a similar temperature problem to GDSS523 (above). It was also drained out on Sunday 25th Nov and returned to service the next day.
Current operational status and issues
  • Although the planned work on Tuesday 20th November resulted in a major problem, some investigation was made into why the diesel generator did not cut in. A minor fault was found, along with a sensitive trip setting. These have been corrected and it is believed the diesel generator would now work, although this has not yet been tested.
  • On 12th/13th June the first stage of switching took place in preparation for the work on the main site power supply. The work on the two transformers is expected to take until 18th December and involves powering off one half of the resilient supply for three months while it is overhauled, then repeating this with the other half. The work is running to schedule. One half of the new switchboard has been refurbished and was brought into service on 17 September.
  • High load has been observed on the uplink to one of the network stacks (stack 13), which serves the SL09 disk servers (~3PB of storage).
  • Investigations are ongoing into asymmetric bandwidth to/from other sites, in particular we are seeing some poor outbound rates - a problem which disappears when we are not loading the network.
Ongoing Disk Server Issues
  • GDSS673 (CMSTape - D0T1) crashed on Tuesday morning, 27th Nov. It is currently being checked out.
Notable Changes made this last fortnight
  • On Thursday (15th Nov) the job numbers were increased on two further batches of worker nodes.
  • On Monday (19th Nov) the MyProxy service was upgraded from UMD-1 to UMD-2.
  • This morning (28th Nov) lcgui02 was upgraded to EMI-2. Both UIs have now been upgraded.
  • The worker nodes are being progressively re-installed with EMI-2 WN software. Two batches were done on the morning of the 20th November, before the power incident. Since then most of the remaining batches have been done. The final two batches will be drained out this weekend ahead of re-installation.
  • CVMFS continues to be available for testing by non-LHC VOs (including "stratum 0" facilities).
  • Test instance of FTS version 3 continues to be available.
Declared in the GOC DB
  • None
Advanced warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.

Listing by category:

  • Databases:
    • Switch LFC/FTS/3D to new Database Infrastructure.
  • Castor:
    • None
  • Networking:
    • Network trunking change as part of investigation (& possible fix) into asymmetric data rates.
    • Improve the stack 13 uplink
    • Install new Routing layer for Tier1 and update the way the Tier1 connects to the RAL network.
    • Update Spine layer for Tier1 network.
    • Replacement of UKLight Router.
    • Addition of caching DNSs into the Tier1 network.
  • Grid Services:
    • Continuing overcommit on WNs to make use of hyperthreading.
  • Infrastructure:
    • Test of move to diesel power in event of power loss.
    • Intervention required on the "Essential Power Board" & Remedial work on three (out of four) transformers.
    • Remedial work on the BMS (Building Management System) due to one of its three modules being faulty.


Entries in GOC DB starting between 14th and 28th November 2012

There were five unscheduled outages in the GOC DB for this period. One was for a restart of the Castor GEN stager to investigate the memory leak problem. The other four all related to the power incident on 20th November.

Service Scheduled? Outage/At Risk Start End Duration Reason
lcgui02 SCHEDULED OUTAGE 28/11/2012 10:00 28/11/2012 12:00 2 hours Re-install with EMI software version (Upgrade postponed from last week).
All CEs (all batch) UNSCHEDULED OUTAGE 22/11/2012 14:40 22/11/2012 17:00 2 hours and 20 minutes Batch services still down following outage for power incident.
Whole site UNSCHEDULED WARNING 22/11/2012 14:40 23/11/2012 17:00 1 day, 2 hours and 20 minutes Systems at Risk after recovery from power incident.
lcgui02 SCHEDULED OUTAGE 21/11/2012 10:00 21/11/2012 12:00 2 hours Re-install with EMI software version.
All Castor SRM endpoints and all CEs. UNSCHEDULED OUTAGE 20/11/2012 12:18 22/11/2012 14:40 2 days, 2 hours and 22 minutes All storage and batch services down due to power incident
All services except Castor SRM endpoints and CEs. UNSCHEDULED OUTAGE 20/11/2012 12:18 21/11/2012 16:00 1 day, 3 hours and 42 minutes All services down due to power incident
Castor GEN (srm-alice, srm-biomed, srm-cert, srm-dteam, srm-hone, srm-ilc, srm-mice, srm-minos, srm-na62, srm-snoplus, srm-superb, srm-t2k) UNSCHEDULED OUTAGE 20/11/2012 11:30 20/11/2012 11:40 10 minutes Outage while we reboot the castor headnodes. This is part of an ongoing investigation into a memory leak.
Open GGUS Tickets (Snapshot at time of meeting)
GGUS ID Level Urgency State Creation Last Update VO Subject
88596 Red Very Urgent In Progress 2012-10-19 2012-11-28 T2K Jobs don't get delegated to RAL
86690 Red Urgent In Progress 2012-10-03 2012-11-06 T2K JPKEKCRC02 missing from FTS ganglia metrics
86152 Red Less Urgent On Hold 2012-09-17 2012-10-31 correlated packet-loss on perfsonar host
Availability Report
Day OPS Alice Atlas CMS LHCb Comment
14/11/12 100 100 100 95.9 100 Single SRM test failure "user timeout".
15/11/12 97.2 72.7 92.0 92.2 100 Problem with CE configuration.
16/11/12 100 96.0 98.4 100 100 Castor stager memory problem
17/11/12 100 100 85.3 100 100 Castor stager memory problem
18/11/12 100 81.7 100 100 100 Jobs timed out
19/11/12 100 89.9 100 96.0 100 Alice: Jobs timed out; CMS: SRM problem.
20/11/12 30.7 54.7 51.6 44.4 53.9 Power incident took Tier1 down; before that a monitoring problem affected all UK OPS tests.
21/11/12 0 0 0 0 0 Power incident
22/11/12 33.6 28.7 32.0 29.1 34.2 Power incident
23/11/12 100 100 80.3 100 100 Problem with Atlas' monitoring
24/11/12 100 98.6 99.0 95.9 100 Alice: Problem with CEs; Atlas & CMS - single SRM test failure.
25/11/12 100 100 100 100 100
26/11/12 100 88.5 100 100 100 Batch problem.
27/11/12 100 100 100 100 100