Tier1 Operations Report 2012-09-12

RAL Tier1 Operations Report for 12th September 2012

Review of Issues during the week between 29th August and 5th September 2012
  • Problem overnight Wed/Thu (5/6 Sep). One of the pair of uplinks to a switch stack was failing intermittently, which caused problems on at least one of the switches in the stack. Rather than access to the connected systems merely being degraded, access to some of them failed completely for periods. This led to failures accessing one batch of disk servers and some worker nodes.
  • An update to the LHCb Castor stager to version 2.1.12 was announced for Tuesday morning (11th September) but was cancelled when a problem was found in testing the day before.
Resolved Disk Server Issues
  • None
Current operational status and issues
  • On 12th/13th June the first stage of switching in preparation for the work on the main site power supply took place. The work on the two transformers is expected to take until 18th December and involves powering off one half of the resilient supply for 3 months while it is overhauled, then repeating the process with the other half. The work is running to schedule. In particular, one half of the new switchboard has been refurbished and is on track to be brought into service by 17th September. Once this is operational, RAL will be switched over to it and will no longer depend on the old switchgear.
  • The migration of LHCb data from the T10KA to the T10KC tapes is progressing.
Ongoing Disk Server Issues
  • None
Notable Changes made this last week
  • The rolling migration of (non-LHC) LFC front ends to EMI-2 on Virtual Machines is underway.
  • Continuing test of hyperthreading on one batch of worker nodes (the Dell 2011 batch). Problems have been seen when many CPU-bound jobs (ATLAS Monte Carlo) run on the same node: these jobs have taken longer to run on these nodes and have exceeded the maximum wall time. In response the overcommit of jobs was reduced on Tuesday (11th Sep), with the total job slots reduced from 24 to 18 on these 12-core nodes.
  • As previously reported, CVMFS is available for testing by non-LHC VOs (including "stratum 0" facilities).
  • A test queue ("gridTest") is available with (currently) four worker nodes running EMI-2/SL5. In addition, a further ten nodes (one from each hardware generation/batch) installed with EMI-2/SL5 are running as part of the normal batch system.
  • A test instance of FTS version 3 is now available. The non-LHC VOs that use the existing service have been enabled on it and we are looking for one of the VOs to test it.
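The reduction in overcommit noted above for the hyperthreading test can be expressed as job slots per physical core. The following is a minimal Python sketch of that arithmetic only; the function name `overcommit_ratio` and the constant names are illustrative assumptions and are not taken from the batch system configuration.

```python
def overcommit_ratio(job_slots: int, physical_cores: int) -> float:
    """Return the number of job slots offered per physical core."""
    if physical_cores <= 0:
        raise ValueError("physical_cores must be positive")
    return job_slots / physical_cores


# Figures from the report: 12-core nodes, slots reduced from 24 to 18.
CORES_PER_NODE = 12
SLOTS_BEFORE = 24
SLOTS_AFTER = 18

before = overcommit_ratio(SLOTS_BEFORE, CORES_PER_NODE)
after = overcommit_ratio(SLOTS_AFTER, CORES_PER_NODE)

print(f"Overcommit before: {before:.2f} slots per core")
print(f"Overcommit after:  {after:.2f} slots per core")
```

With hyperthreading each core presents two hardware threads, so 24 slots filled every thread, while 18 slots leaves a quarter of the threads unallocated.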
Declared in the GOC DB
  • None
Advanced warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.

Listing by category:

  • Databases:
    • Switch LFC/FTS/3D to new Database Infrastructure.
  • Castor:
    • Upgrade to version 2.1.12. Expected to be ready imminently.
  • Networking:
    • Install new Routing layer for Tier1 and update the way the Tier1 connects to the RAL network. (Plan to co-locate with replacement of UKLight network).
    • Update Spine layer for Tier1 network.
    • Replacement of UKLight Router.
    • Addition of caching DNSs into the Tier1 network.
  • Grid Services:
    • Updates of Grid Services as appropriate. (Services are now on EMI/UMD versions unless there is a specific reason not to be.)
  • Infrastructure:
    • Intervention required on the "Essential Power Board". (Should be an "At Risk"). Likely to be in November.
    • Remedial work on three (out of four) transformers. Will require two "At Risk" periods. Likely to be in November.
    • Remedial work on the BMS (Building Management System) due to one of its three modules being faulty. Will require a further "At Risk".


Entries in GOC DB starting between 29th August and 5th September 2012

There were no Scheduled or Unscheduled entries in the GOC DB for this period. (Note: We did declare a downtime to upgrade the LHCb Castor Stager on Tuesday morning (11th) but that was cancelled.)

Open GGUS Tickets
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject
85889 | Yellow | Less Urgent | In Progress | 2012-09-06 | 2012-09-11 | OPS | ops pilot role not enabled on lcgwms03.gridpp.rl.ac.uk
85077 | Red | Less Urgent | In Progress | 2012-08-13 | 2012-09-03 | biomed | CE lcgce05.gridpp.rl.ac.uk job cannot register file on SE srm-biomed.gridpp.rl.ac.uk
85023 | Red | In Progress | Waiting Reply | 2012-08-09 | 2012-09-11 | SNO+ | WMS
84492 | Red | Urgent | In Progress | 2012-07-24 | 2012-08-31 | SNO+ | Job time/memory requirements not provided
68853 | Red | Less Urgent | On hold | 2011-03-22 | 2012-09-04 | N/A | Retirement of SL4 and 32bit DPM Head nodes and Servers


Availability Report
Day | OPS | Alice | Atlas | CMS | LHCb | Comment
01/09/12 | 100 | 100 | 100 | 100 | 100 |
02/09/12 | 100 | 100 | 100 | 100 | 100 |
03/09/12 | 100 | 100 | 99.2 | 100 | 100 | Failure to connect to srm-atlas.gridpp.rl.ac.uk
04/09/12 | 100 | 57.3 | 100 | 100 | 100 | Test fails version check on EMI2.1 nodes.
05/09/12 | 100 | 89.3 | 94.5 | 91.7 | 96.5 | Mainly problem on Tier1 Network Link causing problems for switch stack.
06/09/12 | 100 | 93.1 | 99.2 | 100 | 91.7 | Continued effect of Tier1 Network Link causing problems for switch stack.
07/09/12 | 100 | 100 | 100 | 100 | 100 |
08/09/12 | 100 | 100 | 99.2 | 100 | 100 | Failure to connect to srm-atlas.gridpp.rl.ac.uk correlates with network router reload.
09/09/12 | 100 | 100 | 100 | 100 | 100 |
10/09/12 | 100 | 100 | 100 | 100 | 100 |
11/09/12 | 100 | 100 | 100 | 100 | 100 |
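
For readers who want a per-VO summary of the daily figures above, the following Python sketch averages each column over the eleven listed days. The data is copied by hand from the table; the names `VOS`, `DAILY` and `mean_availability` are illustrative, and a simple unweighted mean is an assumption rather than the method used for official availability reporting.

```python
from statistics import mean

# Column order matches the availability table above.
VOS = ["OPS", "Alice", "Atlas", "CMS", "LHCb"]

# Daily availability (percent), transcribed from the table.
DAILY = {
    "01/09/12": [100, 100, 100, 100, 100],
    "02/09/12": [100, 100, 100, 100, 100],
    "03/09/12": [100, 100, 99.2, 100, 100],
    "04/09/12": [100, 57.3, 100, 100, 100],
    "05/09/12": [100, 89.3, 94.5, 91.7, 96.5],
    "06/09/12": [100, 93.1, 99.2, 100, 91.7],
    "07/09/12": [100, 100, 100, 100, 100],
    "08/09/12": [100, 100, 99.2, 100, 100],
    "09/09/12": [100, 100, 100, 100, 100],
    "10/09/12": [100, 100, 100, 100, 100],
    "11/09/12": [100, 100, 100, 100, 100],
}


def mean_availability(daily):
    """Return the unweighted mean availability per VO over all days."""
    columns = zip(*daily.values())  # transpose rows (days) into columns (VOs)
    return {vo: mean(col) for vo, col in zip(VOS, columns)}


averages = mean_availability(DAILY)
for vo, value in averages.items():
    print(f"{vo}: {value:.1f}%")
```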