Tier1 Operations Report 2012-09-19

RAL Tier1 Operations Report for 19th September 2012

Review of Issues during the week 12th to 19th September 2012
  • LHCb reported batch job failures over the weekend, following a peak in the number of LHCb jobs (more than 5000 running on Saturday 15th September). The cause is not yet known, but the CEs lost contact with the batch jobs.
Resolved Disk Server Issues
  • None
Current operational status and issues
  • On 12th/13th June the first stage of switching took place in preparation for the work on the main site power supply. The work on the two transformers is expected to take until 18th December. It involves powering off one half of the resilient supply for three months while it is overhauled, then repeating the process with the other half. The work is running to schedule. In particular, one half of the new switchboard has been refurbished and is on track to be brought into service by 17th September. Once this is operational, RAL will be switched over to it and will no longer depend on the old switchgear.
  • The migration of LHCb data from the T10KA to the T10KC tapes is progressing.
  • High load has been observed on the uplink to one of the network stacks (stack 13), which serves the SL09 disk servers (~3PB of storage). This appears to be having a negative impact on the efficiency of CMS jobs. The Fabric team are looking at improving the uplink.
Ongoing Disk Server Issues
  • None
Notable Changes made this last week
  • The rolling migration of (non-LHC) LFC front ends to EMI-2 on Virtual Machines has been completed.
  • The CIP (Castor Information Provider) was updated today (19th Sep) to a version compatible with Castor 2.1.12.
  • Updated errata being rolled out across batch farm.
  • The test of hyperthreading on one batch of worker nodes (the Dell 2011 batch) is continuing. No problems have been observed with 50% over-commit. Pending Change Control approval, we expect to increase the over-commit on all hyperthreaded nodes at the start of October. Comments or concerns from VOs are welcome.
  • As stated before: CVMFS available for testing by non-LHC VOs (including "stratum 0" facilities).
  • A test queue ("gridTest") is available with (currently) four worker nodes running EMI-2/SL5. In addition, a further ten nodes (one from each hardware generation/batch) installed with EMI-2/SL5 are running as part of the normal batch system.
  • Test instance of FTS version 3 available. The non-LHC VOs that use the existing service have been enabled on it and we are looking for one of the VOs to test it.
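As a rough illustration of what the 50% over-commit in the hyperthreading test means for slot counts, the sketch below assumes that over-commit is measured relative to the number of physical cores on a node. The function name `job_slots` and the core counts are hypothetical and are not taken from the batch farm configuration.

```python
import math


def job_slots(physical_cores, overcommit=0.5):
    """Return the number of job slots for a node.

    Assumption: over-commit is expressed as a fraction of the
    physical core count (0.5 means 50% more slots than cores).
    """
    if physical_cores < 1 or overcommit < 0:
        raise ValueError("physical_cores must be >= 1 and overcommit >= 0")
    # Round down so that a node never advertises a fractional slot.
    return math.floor(physical_cores * (1 + overcommit))


# Hypothetical core counts, for illustration only.
for cores in (8, 12, 16):
    print(cores, "cores ->", job_slots(cores), "job slots")
```

Under this assumption a 12-core node would run 18 job slots rather than 12.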
Declared in the GOC DB
  • Tuesday 25th Sep. Upgrade of Atlas Castor instance to Version 2.1.12-10.
Advance warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
  • Rolling (transparent) migration of LHCb LFC front ends to EMI-2 on Virtual Machines imminent.

Listing by category:

  • Databases:
    • Switch LFC/FTS/3D to new Database Infrastructure.
  • Castor:
    • Upgrade to version 2.1.12. (Atlas stager upgrade announced).
  • Networking:
    • Install new Routing layer for Tier1 and update the way the Tier1 connects to the RAL network. (Plan to co-locate with replacement of the UKLight network).
    • Update Spine layer for Tier1 network.
    • Replacement of UKLight Router.
    • Addition of caching DNSs into the Tier1 network.
  • Grid Services:
    • Updates of Grid Services as appropriate. (Services are now on EMI/UMD versions unless there is a specific reason not to be.)
  • Infrastructure:
    • Intervention required on the "Essential Power Board". (An "At Risk"). Likely to be in November.
    • Remedial work on three (out of four) transformers. Will require two "At Risk" periods. Likely to be in November.
    • Remedial work on the BMS (Building Management System) due to one of its three modules being faulty. Will require a further "At Risk".


Entries in GOC DB starting between 12th and 19th September 2012

There were no Scheduled or Unscheduled entries in the GOC DB for this period.

Open GGUS Tickets
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject
86156 | Green | Very Urgent | In Progress | 2012-09-17 | 2012-09-18 | LHCb | Aborted pilots at RAL
86152 | Green | Less Urgent | In Progress | 2012-09-17 | 2012-09-17 | | correlated packet-loss on perfsonar host
85077 | Red | Less Urgent | Waiting Reply | 2012-09-17 | 2012-09-14 | biomed | CE lcgce05.gridpp.rl.ac.uk job cannot register file on SE srm-biomed.gridpp.rl.ac.uk
68853 | Red | Less Urgent | On hold | 2011-03-22 | 2012-09-04 | N/A | Retirement of SL4 and 32bit DPM Head nodes and Servers


Availability Report
Day | OPS | Alice | Atlas | CMS | LHCb | Comment
12/09/12 | 100 | 100 | 100 | 100 | 100 |
13/09/12 | 100 | 100 | 100 | 100 | 100 |
14/09/12 | 100 | 100 | 100 | 100 | 100 |
15/09/12 | 100 | 100 | 100 | 100 | 100 |
16/09/12 | 100 | 100 | 100 | 100 | 100 |
17/09/12 | 100 | 100 | 100 | 100 | 100 |
18/09/12 | 100 | 100 | 100 | 100 | 95.8 | Failure to connect to srm-lhcb.gridpp.rl.ac.uk
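
The LHCb figure of 95.8 for 18/09/12 is consistent with roughly one hour of failed tests in a 24-hour day. A minimal sketch of that arithmetic follows, assuming that daily availability is simply the fraction of the day during which tests passed. The function name `daily_availability` and the one-hour outage are illustrative assumptions and are not taken from the monitoring system.

```python
def daily_availability(failed_hours, hours_in_day=24.0):
    """Return availability as a percentage, rounded to one decimal place.

    Assumption: availability = time with passing tests / total time.
    """
    if not 0 <= failed_hours <= hours_in_day:
        raise ValueError("failed_hours must lie between 0 and hours_in_day")
    passed_hours = hours_in_day - failed_hours
    return round(100.0 * passed_hours / hours_in_day, 1)


# One failed hour out of 24 (hypothetical outage length).
print(daily_availability(1))  # prints 95.8
```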