Tier1 Operations Report 2012-10-24

From GridPP Wiki

RAL Tier1 Operations Report for 24th October 2012

Review of Issues during the week 17th to 24th October 2012
  • During planned maintenance the OPN link to CERN failed over to the backup route from around 07:30 until 17:30 on Saturday 20th October.
  • While the LHCb disk servers were being rebooted during the Castor instance upgrade, one of the disk servers re-installed itself as another disk server. No data was lost, but the server was out of production until later that afternoon; a further fault was then found and fixed the following morning.
  • During the afternoon of Tuesday 23rd Oct. one of the LHCb Castor headnodes showed a significant hardware fault and was replaced.
  • The FTS service failed (with a known bug) early yesterday evening (23rd Oct). The monitoring test for this condition failed to detect the problem, and the service was down for most VOs until around 09:00 this morning (24th).
Resolved Disk Server Issues
  • GDSS454 (AtlasDataDisk - D1T0) failed on 16th Oct. It was returned to production during the afternoon of 17th October. As reported at the last meeting one file was declared lost from this server.
  • GDSS639 (GENScratchDisk - D0T0) failed on Saturday morning (20th Oct). It was returned to production on Monday afternoon (22nd Oct) after faulty memory had been replaced.
  • GDSS213 (AtlasScratchDisk - D1T0) failed on Sunday afternoon (21st Oct). It was returned to production on Monday afternoon (22nd Oct).
  • GDSS535 (LHCbDst - D1T0) The system was re-installed as another node when rebooted during the LHCb Castor upgrade on Tuesday 23rd Oct. It was returned to production later that afternoon. However, a further problem was found on this server which was fixed during the following morning (24th).
Current operational status and issues
  • At the moment we are failing the VO SUM tests for the CEs for a number of VOs. This is because those tests have not yet been updated to target the new EMI CEs.
  • On 12th/13th June the first stage of switching in preparation for the work on the main site power supply took place. The work on the two transformers is expected to last until 18th December: one half of the resilient supply is powered off for three months while it is overhauled, and the process is then repeated with the other half. The work is running to schedule. One half of the new switchboard has been refurbished and was brought into service on 17th September.
  • High load has been observed on the uplink to one of the network stacks (stack 13), which serves the SL09 disk servers (~3PB of storage). The Fabric team is working to improve the uplink.
  • Investigations using perfSONAR are ongoing into asymmetric routing of data over (but not back over) the OPN. A routing problem with CNAF has been resolved. The problem also appears with the North American Tier1 sites and is being followed up.
Ongoing Disk Server Issues
  • None
Notable Changes made this last week
  • WMS01 updated to EMI v3.3.8
  • On 19th Oct an update to the Castor information provider removed some unnecessary references to gLite and fixed a problem with tape usage reporting.
  • 23rd Oct - LHCb Castor instance upgraded to version 2.1.12-10.
  • 23rd October: gLite CREAM CEs replaced with EMI CREAM CEs.
  • Hyperthreading continues to run on one batch of worker nodes ahead of its rollout to all suitable worker nodes.
  • As stated before: CVMFS available for testing by non-LHC VOs (including "stratum 0" facilities).
  • A test queue ("gridTest") is available with (currently) four worker nodes running EMI-2/SL5. In addition, a further ten nodes (one from each hardware generation/batch) installed with EMI-2/SL5 are running as part of the normal batch system.
  • A test instance of FTS version 3 is available. The non-LHC VOs that use the existing service have been enabled on it, and we are looking for one of these VOs to test it.
Declared in the GOC DB
  • Ongoing WMS02 update to EMI v3.3.8
  • Tuesday 30th October: Upgrade of GEN Castor instance to Version 2.1.12-10.
Advanced warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
  • 20th November: Intervention required on the "Essential Power Board" and transformers. (An "At Risk").

Listing by category:

  • Databases:
    • Switch LFC/FTS/3D to new Database Infrastructure.
  • Castor:
    • Upgrade to version 2.1.12. (As detailed above).
  • Networking:
    • Install new routing layer for Tier1 and update the way the Tier1 connects to the RAL network. (Planned to be co-located with the replacement of the UKLight router.)
    • Update spine layer of the Tier1 network.
    • Replacement of UKLight Router.
    • Addition of caching DNSs into the Tier1 network.
  • Grid Services:
    • CEs are being upgraded to the EMI version now.
    • A rolling upgrade of the WMSs to the EMI version is underway.
    • Enabling overcommit on WNs to make use of hyperthreading (will be implemented after the CE upgrades are complete).

Updates of Grid Services as appropriate. (Services are now on EMI/UMD versions unless there is a specific reason not to be.)

  • Infrastructure:
    • Intervention required on the "Essential Power Board".
    • Remedial work on three (out of four) transformers.
    • Remedial work on the BMS (Building Management System) due to one of its three modules being faulty.


Entries in GOC DB starting between 17th and 24th October 2012

There are two unscheduled outages in the GOC DB for this period. One is for the failure of one of the LHCb Castor headnodes, the other is for the new EMI CREAM CEs (not in production at that time).

Service Scheduled? Outage/At Risk Start End Duration Reason
srm-lhcb UNSCHEDULED WARNING 23/10/2012 16:30 24/10/2012 12:30 20 hours At risk due to hardware fault on Castor headnode. Services are being moved to alternative hardware.
lcgce01, lcgce02, lcgce04, lcgce10, lcgce11 SCHEDULED WARNING 23/10/2012 10:00 24/10/2012 12:00 1 day, 2 hours post EMI-2 CREAM migration
lcgce03, lcgce05, lcgce07, lcgce08, lcgce09 SCHEDULED OUTAGE 23/10/2012 09:00 30/11/2012 12:00 38 days, 4 hours replacement with EMI-2 CREAM nodes
srm-lhcb SCHEDULED OUTAGE 23/10/2012 08:00 23/10/2012 10:50 2 hours and 50 minutes Upgrade of LHCb Castor instance to Version 2.1.12-10
lcgwms02 SCHEDULED OUTAGE 21/10/2012 10:00 26/10/2012 13:00 5 days, 3 hours EMI WMS upgrade to v3.3.8
lcgce01, lcgce02, lcgce04, lcgce10, lcgce11 UNSCHEDULED OUTAGE 19/10/2012 15:00 23/10/2012 10:00 3 days, 19 hours migration to EMI-2 CREAM
lcgwms01 SCHEDULED OUTAGE 19/10/2012 13:00 22/10/2012 15:00 3 days, 2 hours EMI WMS upgrade to v3.3.8
lcgwms01 SCHEDULED OUTAGE 17/10/2012 15:00 19/10/2012 13:00 1 day, 22 hours EMI WMS update to v3.3.8
lcgwms01 SCHEDULED OUTAGE 12/10/2012 10:00 17/10/2012 15:00 5 days, 5 hours EMI WMS update to v3.3.8
Open GGUS Tickets
GGUS ID Level Urgency State Creation Last Update VO Subject
86705 Red Less Urgent In Progress 2012-10-03 2012-10-23 SNO+ RAL jobs returning errors
86690 Red Urgent In Progress 2012-10-03 2012-10-22 T2K JPKEKCRC02 missing from FTS ganglia metrics
86152 Red Less Urgent In Progress 2012-09-17 2012-10-22 correlated packet-loss on perfsonar host
68853 Red Less Urgent In Progress 2011-03-22 2012-10-23 N/A Retirement of SL4 and 32bit DPM Head nodes and Servers
Availability Report
Day OPS Alice Atlas CMS LHCb Comment
17/10/12 96.0 100 100 100 100 CE07 had a problem (according to tests). This coincided with a block of missing data.
18/10/12 100 100 100 100 100
19/10/12 100 100 100 100 100
20/10/12 100 100 99.1 100 100 Single failure of SRM Put at 07:46 ("zero number of replicas").
21/10/12 100 100 98.2 100 100 Failures of SRM Get at 02:05 & 02:19 ("could not open connection to srm-atlas.gridpp.rl.ac.uk")
22/10/12 100 100 100 100 100
23/10/12 92.6 33.3 33.3 82.0 29.2 Mainly the effect of replacing the gLite CREAM CEs with EMI CREAM CEs. Some effect on LHCb from the Castor upgrade.