Tier1 Operations Report 2012-09-05

From GridPP Wiki

RAL Tier1 Operations Report for 5th September 2012

Review of Issues during the week between 29th August and 5th September 2012
  • We have been intermittently failing some ALICE CE SUM tests: their test fails if it runs on one of the EMI-2 worker nodes.
  • We suffered some SRM SUM test failures for ATLAS and CMS between around 05:00 and 06:00 this morning (Wed. 5th Sep). The cause is not yet understood but is unlikely to lie within Castor or the SRMs.
Resolved Disk Server Issues
  • None
Current operational status and issues
  • The first stage of switching in preparation for the work on the main site power supply took place on 12th/13th June. The work on the two transformers is expected to take until 18th December and involves powering off one half of the resilient supply for three months while it is overhauled, then repeating the process with the other half. The work is running to schedule. In particular, one half of the new switchboard has been refurbished and is on track to be brought into service by 17th September. Once this is operational, RAL will be switched over to it and will no longer depend on the old switchgear.
Ongoing Disk Server Issues
  • None
Notable Changes made this last week
  • On Tuesday (4th Sep) LHCb were switched to use the T10KC media and drives. Any data on T10KA media will be read on the A drives. The migration of LHCb data from A to C tapes has also started.
  • As part of the continuing test of hyperthreading, one batch of worker nodes (the Dell 2011 batch) has 24 job slots per node (twice the number of cores). However, memory limitations do not allow these nodes to fill, so the plan is to continue testing with an increased memory overcommit (raised from 50% to 75%).
  • As stated before, CVMFS is available for testing by non-LHC VOs (including "stratum 0" facilities).
  • A test queue ("gridTest") is available with (currently) four worker nodes running EMI-2/SL5. In addition, a further ten nodes (one from each hardware generation/batch) have been re-installed with EMI-2/SL5 and are running as part of the normal batch system.
Declared in the GOC DB
  • None
Advance warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
  • Migration of (non-LHC) LFC front ends to EMI-2. This will be a rolling upgrade over the coming weeks.
  • The Castor 2.1.12 update is expected to be ready within a few weeks.

Listing by category:

  • Databases:
    • Switch LFC/FTS/3D to new Database Infrastructure.
  • Castor:
    • Upgrade to version 2.1.12.
  • Networking:
    • Install new Routing layer for Tier1 and update the way the Tier1 connects to the RAL network. (Plan to co-locate with replacement of the UKLight network.)
    • Update Spine layer for Tier1 network.
    • Replacement of UKLight Router.
    • Addition of caching DNSs into the Tier1 network.
  • Grid Services:
    • Updates of Grid Services as appropriate. (Services are now on EMI/UMD versions unless there is a specific reason not to be.)
  • Infrastructure:
    • Intervention required on the "Essential Power Board". (Should be an "At Risk"). Likely to be in November.
    • Remedial work on three (out of four) transformers. Will require two "At Risk" periods. Likely to be in November.
    • Remedial work on the BMS (Building Management System) due to one of its three modules being faulty. Will require a further "At Risk".


Entries in GOC DB starting between 29th August and 5th September 2012

There were no Scheduled or Unscheduled entries in the GOC DB for this period.


Open GGUS Tickets
GGUS ID  Level   Urgency      State          Creation    Last Update  VO         Subject
85438    Yellow  Less Urgent  Waiting Reply  2012-08-23  2012-08-29   Atlas      FTS errors from SRM srm-atlas.gridpp.rl.ac.uk
85077    Yellow  Less Urgent  In progress    2012-08-13  2012-09-03   biomed     CE lcgce05.gridpp.rl.ac.uk job cannot register file on SE srm-biomed.gridpp.rl.ac.uk
85023    Red     Less Urgent  Waiting Reply  2012-08-09  2012-08-10   SNO+       WMS
84492    Red     Urgent       Waiting Reply  2012-07-24  2012-08-31   SNO+       Job time/memory requirements not provided
84408    Red     Urgent       Waiting Reply  2012-07-20  2012-08-29   neurogrid  Enable neurogrid.incf.org on WMS and LFC
68853    Red     Less Urgent  On hold        2011-03-22  2012-09-04   N/A        Retirement of SL4 and 32bit DPM Head nodes and Servers