Tier1 Operations Report 2012-10-31

RAL Tier1 Operations Report for 31st October 2012

Review of Issues during the week 24th to 31st October 2012
  • During the afternoon of Tuesday 23rd Oct. one of the LHCb Castor headnodes showed signs of an impending significant hardware fault and was replaced with a hot spare before it failed. After the vendor fixed the hardware, the original system was swapped back in this morning (Wed 31st Oct.).
  • Around 04:00 on the morning of Thursday 25th Oct. a fault was reported on the Alice VO box (lcgvo-alice). This was fixed when staff arrived at work the next morning.
  • Some problems with the site firewall caused short breaks in connectivity through this route on both Monday and Tuesday mornings for around 10 to 15 minutes each time. The cause of this has been understood.
  • The primary OPN link to CERN failed and we automatically switched to the backup around 09:15 Tuesday morning (30th). The cause was a fibre break during road works in France. The problem was fixed and we reverted to the primary link around 19:00 the same day.
Resolved Disk Server Issues
  • None.
Current operational status and issues
  • On 12th/13th June the first stage of switching ready for the work on the main site power supply took place. The work on the two transformers is expected to take until 18th December. It involves powering off one half of the resilient supply for 3 months while it is overhauled, then repeating the process with the other half. The work is running to schedule. One half of the new switchboard has been refurbished and was brought into service on 17 September.
  • High load has been observed on the uplink to one of the network stacks (stack 13), which serves the SL09 disk servers (~3PB of storage). We aim to resolve this at the same time as the network outage on 13th November.
  • Investigations are ongoing (e.g. using perfsonar) into asymmetric bandwidth to/from other sites. In particular, we are seeing some poor outbound rates.
  • A fault has been found in the card that connects the Tier1 to one of the main RAL routers ("Router A"), and the card requires replacement (scheduled for 13th November).
  • The new EMI CREAM CEs are bedding in. Some intermittent SUM test failures are being followed up. Checks are being made for any remaining jobs that still arrive via the old glite CEs.
Ongoing Disk Server Issues
  • None
Notable Changes made this last week
  • WMS02 updated to EMI v3.3.8. This completes the software updates of all three WMSs.
  • The routing of network packets back from North American Tier1s (BNL, Fermilab, TRIUMF) has been corrected to use the OPN rather than other production networks.
  • 30th Oct - Castor GEN instance upgraded to version 2.1.12-10. This completes the Castor 2.1.12 upgrade.
  • Hyperthreading continues to run on one batch of worker nodes and will be rolled out on all suitable worker nodes once the CE changes have bedded in.
  • As stated before: CVMFS available for testing by non-LHC VOs (including "stratum 0" facilities).
  • A test queue ("gridTest") is available with (currently) four worker nodes running EMI-2/SL5. In addition, a further ten nodes (one from each hardware generation/batch) installed with EMI-2/SL5 are running as part of the normal batch system.
  • Test instance of FTS version 3 available. The non-LHC VOs that use the existing service have been enabled on it and we are looking for one of the VOs to test it.
Declared in the GOC DB
  • None
Advanced warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
  • 13th November: Intervention on network router card. Aim to use this time to also improve the stack 13 uplink and possibly carry out further tests to find the cause of the poor outbound data rates.
  • 20th November: Intervention required on the "Essential Power Board" and transformers. (An "At Risk").

Listing by category:

  • Databases:
    • Switch LFC/FTS/3D to new Database Infrastructure.
  • Castor:
    • None
  • Networking:
    • Install new Routing layer for Tier1 and update the way the Tier1 connects to the RAL network.
    • Update Spine layer for Tier1 network.
    • Replacement of UKLight Router.
    • Addition of caching DNSs into the Tier1 network.
  • Grid Services:
    • Enabling overcommit on WNs to make use of hyperthreading (will be implemented after the CE upgrades are complete).
    • Migration to EMI software for worker nodes.
  • Infrastructure:
    • Intervention required on the "Essential Power Board" and remedial work on three (out of four) transformers. (Scheduled for 20th November).
    • Remedial work on the BMS (Building Management System) due to one of its three modules being faulty.


Entries in GOC DB starting between 24th and 31st October 2012

There was one unscheduled outage in the GOC DB for this period when one of the LHCb Castor headnodes showed hardware errors shortly after the LHCb Castor upgrade.

Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
srm-lhcb.gridpp.rl.ac.uk | SCHEDULED | WARNING | 31/10/2012 11:00 | 31/10/2012 12:00 | 1 hour | Swapping one of the Castor headnodes back following repair after hardware failure.
castor GEN Instance (srm-alice, srm-biomed, srm-dteam, srm-hone, srm-ilc, srm-mice, srm-minos, srm-na62, srm-snoplus, srm-superb, srm-t2k) | SCHEDULED | OUTAGE | 30/10/2012 08:00 | 30/10/2012 11:10 | 3 hours and 10 minutes | Upgrade of Castor GEN instance to Version 2.1.12-10.
srm-lhcb.gridpp.rl.ac.uk | UNSCHEDULED | WARNING | 23/10/2012 16:30 | 24/10/2012 12:30 | 20 hours | At risk due to hardware fault on castor headnode. Services are being moved to alternative hardware.
New EMI CREAM CEs (lcgce01, lcgce02, lcgce04, lcgce10, lcgce11) | SCHEDULED | WARNING | 23/10/2012 10:00 | 24/10/2012 12:00 | 1 day, 2 hours | Post EMI-2 CREAM migration
Old Glite CREAM CEs (lcgce03, lcgce05, lcgce07, lcgce08, lcgce09) | SCHEDULED | OUTAGE | 23/10/2012 09:00 | 30/11/2012 12:00 | 38 days, 4 hours | Replacement with EMI-2 CREAM nodes
lcgwms02.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 21/10/2012 10:00 | 25/10/2012 10:20 | 4 days, 20 minutes | EMI WMS upgrade to v3.3.8
Open GGUS Tickets
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject
86690 | Red | Urgent | In Progress | 2012-10-03 | 2012-10-31 | T2K | JPKEKCRC02 missing from FTS ganglia metrics
86152 | Red | Less Urgent | In Progress | 2012-09-17 | 2012-10-30 | | correlated packet-loss on perfsonar host
68853 | Red | Less Urgent | In Progress | 2011-03-22 | 2012-10-30 | N/A | Retirement of SL4 and 32bit DPM Head nodes and Servers
Availability Report
Day | OPS | Alice | Atlas | CMS | LHCb | Comment
24/10/12 | 100 | 0 | 0 | 100 | 0 | New EMI CEs not in tests for all VOs.
25/10/12 | 100 | 52.4 | 29.1 | 100 | 46.6 | New EMI CEs appeared in tests during this day.
26/10/12 | 100 | 100 | 96.8 | 95.8 | 100 | CMS - Single failure of SRM test. Atlas - appears spurious.
27/10/12 | 100 | 100 | 100 | 100 | 100 |
28/10/12 | 100 | 100 | 100 | 99.0 | 100 | Single failure of SRM Put: User timeout over
29/10/12 | 100 | 100 | 100 | 100 | 100 |
30/10/12 | 100 | 100 | 98.6 | 100 | 100 | Failures on both monitored CEs. (No compatible resources returned by BDII.)
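
For a quick per-VO summary of the seven days in the table above, the daily figures can be averaged. The following is a minimal sketch only: it assumes a plain arithmetic mean of the daily percentages, which is not necessarily how official availability figures are aggregated, and the names VOS, DAILY and weekly_mean are illustrative rather than part of any Tier1 tooling.

```python
from statistics import mean

# Column order of the availability table.
VOS = ["OPS", "Alice", "Atlas", "CMS", "LHCb"]

# Daily availability (percent), copied from the table above.
DAILY = {
    "24/10/12": [100, 0, 0, 100, 0],
    "25/10/12": [100, 52.4, 29.1, 100, 46.6],
    "26/10/12": [100, 100, 96.8, 95.8, 100],
    "27/10/12": [100, 100, 100, 100, 100],
    "28/10/12": [100, 100, 100, 99.0, 100],
    "29/10/12": [100, 100, 100, 100, 100],
    "30/10/12": [100, 100, 98.6, 100, 100],
}


def weekly_mean(daily):
    """Return the unweighted mean availability per VO, rounded to 0.1%."""
    # zip(*rows) transposes the per-day rows into per-VO columns.
    columns = zip(*daily.values())
    return {vo: round(mean(col), 1) for vo, col in zip(VOS, columns)}


if __name__ == "__main__":
    for vo, value in weekly_mean(DAILY).items():
        print(f"{vo}: {value:.1f}%")
```

The low means for Alice, Atlas and LHCb are dominated by 24th and 25th October, when the new EMI CEs were not yet in the tests for all VOs.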