Tier1 Operations Report 2012-10-03

From GridPP Wiki

RAL Tier1 Operations Report for 3rd October 2012

Review of Issues during the fortnight 19th September to 3rd October 2012
  • During the evening of Thursday 20th Sep there was a problem with the Atlas Castor instance. This was traced to a problem in the Castor database (an "orphaned sub-request"). The Atlas Castor instance was unavailable for nearly four hours (the GOC DB outage ran from 21:30 until 01:15 the following day).
  • On Sunday evening (30th September) there was a problem with the Atlas SRM database. This is believed to be an Oracle bug, and unrelated to the Castor 2.1.12 update of some days earlier. SRM_Atlas was unavailable for just over three hours (the GOC DB outage ran from 21:00 until 00:08).
  • Yesterday (Tuesday 2nd October) four files were reported lost to LHCb. These were discovered when attempting to recall them from tape. All four files were on the same tape. No other files on that tape are affected.
Resolved Disk Server Issues
  • GDSS399 (LhcbRawRdst - D0T1) was taken out of production on Monday (1st Oct). A failed disk was replaced, but the RAID rebuild was abnormally slow. The machine was rebooted and the rebuild was monitored to confirm that it completed correctly. The system was returned to production this morning (3rd Oct).
Current operational status and issues
  • On 12th/13th June the first stage of switching, in preparation for the work on the main site power supply, took place. The work on the two transformers is expected to take until 18th December and involves powering off one half of the resilient supply for 3 months while it is overhauled, then repeating the process with the other half. The work is running to schedule. One half of the new switchboard has been refurbished and was brought into service on 17th September.
  • The migration of LHCb data from the T10KA to the T10KC tapes is progressing. 193 LHCb tapes are left to migrate and the process should be finished by next week.
  • High load has been observed on the uplink to one of the network stacks (stack 13), which serves the SL09 disk servers (~3 PB of storage). This appears to be having a negative impact on the efficiency of CMS jobs. The Fabric team are looking at improving the uplink.
Ongoing Disk Server Issues
  • None
Notable Changes made this last week
  • Tuesday 25th Sep. Upgrade of Atlas Castor instance to Version 2.1.12-10.
  • On Tuesday 25th Sep. Oracle updates were applied to the Atlas TAGS database.
  • The rolling (transparent) migration of the LHCb LFC front ends to EMI-2 on virtual machines has been completed.
  • On Monday 1st October the FTS front ends were moved to virtual machines and a patch applied that addresses the problem of the 'wrong' proxy being picked up.
  • Updated errata being rolled out across batch farm.
  • Testing of hyperthreading continues on one batch of worker nodes (the Dell 2011 batch). No problems have been observed with a 50% over-commit. Pending Change Control approval, we expect to increase the over-commit on all hyperthreaded nodes at the start of October. Comments/concerns from VOs are welcome.
  • As stated before: CVMFS available for testing by non-LHC VOs (including "stratum 0" facilities).
  • A test queue ("gridTest") is available with (currently) four worker nodes running EMI-2/SL5. In addition, a further ten nodes (one from each hardware generation/batch) installed with EMI-2/SL5 are running as part of the normal batch system.
  • Test instance of FTS version 3 available. The non-LHC VOs that use the existing service have been enabled on it and we are looking for one of the VOs to test it.
Declared in the GOC DB
  • WMS03 - update to EMI v3.3.8 (4 - 10 Oct)
Advance warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
  • Tuesday 9th October: Upgrade of LHCb Castor instance to Version 2.1.12-10.
  • Tuesday 16th October: Upgrade of CMS Castor instance to Version 2.1.12-10.
  • Tuesday 23rd October: Upgrade of GEN Castor instance to Version 2.1.12-10.

Listing by category:

  • Databases:
    • Switch LFC/FTS/3D to new Database Infrastructure.
  • Castor:
    • Upgrade to version 2.1.12. (As detailed above).
  • Networking:
    • Install new Routing layer for Tier1 and update the way the Tier1 connects to the RAL network. (Plan to co-locate with replacement of UKLight network).
    • Update Spine layer for Tier1 network.
    • Replacement of UKLight Router.
    • Addition of caching DNSs into the Tier1 network.
  • Grid Services:
    • Updates of Grid Services as appropriate. (Services now on EMI/UMD versions unless there is a specific reason not.)
  • Infrastructure:
    • Intervention required on the "Essential Power Board". (An "At Risk"). Proposed Date 20th November.
    • Remedial work on three (out of four) transformers. Will require two "At Risk" periods. Likely to be in November.
    • Remedial work on the BMS (Building Management System) due to one of its three modules being faulty. Will require a further "At Risk".


Entries in GOC DB starting between 19th September and 3rd October 2012

There were two unscheduled outages (one followed by a 'warning') in the GOC DB for this period. Both refer to srm-atlas and are detailed above.

Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
srm-atlas | UNSCHEDULED | WARNING | 01/10/2012 00:11 | 01/10/2012 12:00 | 11 hours and 49 minutes | Following the ATLAS SRM database problems, we are returning to service as an AT-RISK until tomorrow.
srm-atlas | UNSCHEDULED | OUTAGE | 30/09/2012 21:00 | 01/10/2012 00:08 | 3 hours and 8 minutes | Atlas SRM failing due to problems in the underlying databases.
srm-atlas | SCHEDULED | OUTAGE | 25/09/2012 09:00 | 25/09/2012 12:10 | 3 hours and 10 minutes | Upgrade of Atlas Castor instance to Version 2.1.12-10.
srm-atlas | UNSCHEDULED | OUTAGE | 20/09/2012 21:30 | 21/09/2012 01:15 | 3 hours and 45 minutes | Atlas SRMs failed due to an orphaned subrequest causing database queries to block for the Atlas Castor instance.
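The durations quoted for the GOC DB entries above can be cross-checked from the start and end times. The following is only a minimal sketch: the names `ENTRIES`, `duration` and `describe` are made up for illustration and are not part of any GOC DB tool, and the timestamps are simply copied from the entries listed above.

```python
from datetime import datetime, timedelta

# Start/end times copied from the GOC DB entries above (DD/MM/YYYY HH:MM).
ENTRIES = [
    ("srm-atlas", "UNSCHEDULED WARNING", "01/10/2012 00:11", "01/10/2012 12:00"),
    ("srm-atlas", "UNSCHEDULED OUTAGE", "30/09/2012 21:00", "01/10/2012 00:08"),
    ("srm-atlas", "SCHEDULED OUTAGE", "25/09/2012 09:00", "25/09/2012 12:10"),
    ("srm-atlas", "UNSCHEDULED OUTAGE", "20/09/2012 21:30", "21/09/2012 01:15"),
]

FMT = "%d/%m/%Y %H:%M"


def duration(start: str, end: str) -> timedelta:
    """Return the elapsed time between two timestamps in the table's format."""
    return datetime.strptime(end, FMT) - datetime.strptime(start, FMT)


def describe(delta: timedelta) -> str:
    """Format a timedelta in the 'H hours and M minutes' style used above."""
    total_minutes = int(delta.total_seconds()) // 60
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours} hours and {minutes} minutes"


if __name__ == "__main__":
    for service, kind, start, end in ENTRIES:
        print(service, kind, describe(duration(start, end)))
```

Run as a script, this prints one line per entry, which can be compared with the Duration column.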
Open GGUS Tickets
GGUS ID Level Urgency State Creation Last Update VO Subject
86570 Yellow Very Urgent In Progress 2012-10-01 2012-10-01 Moving to SHA2 GGUS certificate
86152 Red Less Urgent In Progress 2012-09-17 2012-09-19 correlated packet-loss on perfsonar host
85077 Red Less Urgent In Progress 2012-08-13 2012-09-17 Biomed CE lcgce05.gridpp.rl.ac.uk job cannot register file on SE srm-biomed.gridpp.rl.ac.uk
68853 Red Less Urgent On hold 2011-03-22 2012-09-04 N/A Retirement of SL4 and 32-bit DPM Head nodes and Servers
Availability Report
Day OPS Alice Atlas CMS LHCb Comment
19/09/12 100 100 100 100 100
20/09/12 100 100 86.7 100 100 Castor Database blocking issue late evening
21/09/12 100 100 97.3 100 100 Continuation of above after midnight
22/09/12 100 100 100 100 100
23/09/12 100 100 100 100 100
24/09/12 100 100 99.0 100 100 Single failure to connect to srm-atlas.
25/09/12 100 100 85.3 100 100 Scheduled Atlas Castor Stager 2.1.12-10 update.
26/09/12 100 100 100 100 100
27/09/12 100 100 100 100 100
28/09/12 100 100 100 100 100
29/09/12 100 100 100 100 100
30/09/12 100 100 78.7 100 100 Problem with Atlas SRM database.
01/10/12 100 100 100 100 100
02/10/12 100 93.4 100 100 100 CE test failed when job aborted by VO.
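For a single summary figure per VO, the daily values in the table above can be averaged over the fortnight. The sketch below takes a plain unweighted mean of the fourteen daily figures, which is an assumption made for illustration and not necessarily how official availability summaries are calculated. The names `AVAILABILITY` and `mean_availability` are hypothetical.

```python
# Daily availability (percent) copied from the table above,
# in date order from 19/09/12 to 02/10/12 (14 days).
AVAILABILITY = {
    "OPS": [100] * 14,
    "Alice": [100] * 13 + [93.4],
    "Atlas": [100, 86.7, 97.3, 100, 100, 99.0, 85.3,
              100, 100, 100, 100, 78.7, 100, 100],
    "CMS": [100] * 14,
    "LHCb": [100] * 14,
}


def mean_availability(values):
    """Unweighted mean of the daily availability percentages."""
    return sum(values) / len(values)


if __name__ == "__main__":
    for vo, values in AVAILABILITY.items():
        print(f"{vo}: {mean_availability(values):.1f}%")
```

On this basis the only VOs below 100% for the period are Atlas and Alice, which matches the days with comments in the table.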