Tier1 Operations Report 2012-10-17

From GridPP Wiki

RAL Tier1 Operations Report for 17th October 2012

Review of Issues during the fortnight 3rd to 17th October 2012
  • There was a problem with the FTS during the evening of 4th October. The FTS was recently patched, but the patch does not fix this problem (although it does slightly change its behaviour). There were problems again during the evenings of Monday 8th Oct and Monday 15th Oct.
  • On 5th October we declared one lost file to CMS. This was picked up by the checksum checker. The file was on a tape-backed service class but had not yet been migrated to tape.
  • On Saturday morning 6th Oct one of the nodes in the TopBDII crashed. The BDII set ran with four out of five nodes available until the node was restarted on Monday morning.
  • The planned upgrade of the LHCb Castor instance to version 2.1.12 on 9th Oct was cancelled at the end of the preceding afternoon. An increase in free disk space had been noticed shortly after the upgrade of the Atlas Castor instance, and further upgrades were put on hold until the issue was understood (which it now is). See Tier1 BLOG entry.
Resolved Disk Server Issues
  • GDSS673 (CMSTape - D0T1) crashed on Friday evening 5th Oct. It was returned to production on Sunday evening (7th).
  • GDSS555 (AtlasDataDisk - D1T0) crashed on Wednesday afternoon (10th Oct). After the restart, a memory test was run and the disk array was verified. The system was returned to production on Friday morning (12th Oct), when the disk verification was around one third complete without problems.
Current operational status and issues
  • The first stage of switching in preparation for the work on the main site power supply took place on 12th/13th June. The work on the two transformers is expected to take until 18th December and involves powering off one half of the resilient supply for 3 months while it is overhauled, then repeating the process with the other half. The work is running to schedule. One half of the new switchboard has been refurbished and was brought into service on 17th September.
  • High load has been observed on the uplink to one of the network stacks (stack 13), which serves the SL09 disk servers (~3PB of storage). The Fabric team is working to improve the uplink.
Ongoing Disk Server Issues
  • GDSS454 (AtlasDataDisk - D1T0) failed yesterday afternoon (16th Oct). The RAID array is currently being verified. We have lost one file from this server.
Notable Changes made this last week
  • 5th Oct - Migration of LHCb data from T10KA to T10KC tapes completed successfully. See Tier1 BLOG entry.
  • 9th Oct - CMS castor instance made accessible from UK/European/global xrootd redirectors.
  • 10th Oct - WMS03 upgraded to EMI version.
  • 15th Oct - Oracle patches applied to the databases behind the Atlas conditions, FTS and non-LHC LFC services (databases SOMNUS & OGMA)
  • 16th Oct - CMS Castor instance upgraded to version 2.1.12-10.
  • The LHCb 3D & LFC database systems have been withdrawn from service.
  • Updated errata rolled out across batch farm.
  • Continuing test of hyperthreading on one batch of worker nodes.
  • As stated before: CVMFS available for testing by non-LHC VOs (including "stratum 0" facilities).
  • A test queue ("gridTest") is available with (currently) four worker nodes running EMI-2/SL5. In addition, a further ten nodes (one from each hardware generation/batch) installed with EMI-2/SL5 are running as part of the normal batch system.
  • Test instance of FTS version 3 available. The non-LHC VOs that use the existing service have been enabled on it and we are looking for one of the VOs to test it.
Declared in the GOC DB
  • WMS02 & WMS01 - update to EMI v3.3.8
Advanced warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
  • Tuesday 23rd October: Upgrade of LHCb Castor instance to Version 2.1.12-10. (Re-scheduled after not being done on 9th Oct.)
  • Tuesday 30th October: Upgrade of GEN Castor instance to Version 2.1.12-10. (Re-scheduled from 23rd Oct.)
  • Upgrade of CEs to EMI version. If final tests go well, we propose doing this next Tuesday (23rd Oct).

Listing by category:

  • Databases:
    • Switch LFC/FTS/3D to new Database Infrastructure.
  • Castor:
    • Upgrade to version 2.1.12. (As detailed above).
  • Networking:
    • Install new Routing layer for Tier1 and update the way the Tier1 connects to the RAL network. (Plan to co-locate with replacement of UKLight network).
    • Update Spine layer for Tier1 network.
    • Replacement of UKLight Router.
    • Addition of caching DNSs into the Tier1 network.
  • Grid Services:
    • CEs being upgraded to EMI version now.
    • Rolling upgrade of WMSs to EMI version underway.
    • Enabling overcommit on WNs to make use of hyperthreading (will be implemented after the CE upgrades are complete).

Updates of Grid Services as appropriate. (Services are now on EMI/UMD versions unless there is a specific reason for them not to be.)

  • Infrastructure:
    • Intervention required on the "Essential Power Board". (An "At Risk"). Proposed Date 20th November.
    • Remedial work on three (out of four) transformers. Will require two "At Risk" periods. Likely to be in November.
    • Remedial work on the BMS (Building Management System) due to one of its three modules being faulty. Will require a further "At Risk".


Entries in GOC DB starting between 3rd and 17th October 2012

The only unscheduled outage in the GOC DB for this period is for the retirement of a CMS VO box.

Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
lcgwms01 | SCHEDULED | OUTAGE | 17/10/2012 15:00 | 19/10/2012 13:00 | 1 day, 22 hours | EMI WMS update to v3.3.8
srm-cms | SCHEDULED | OUTAGE | 16/10/2012 08:00 | 16/10/2012 14:00 | 6 hours | Upgrade of CMS Castor instance to Version 2.1.12-10.
lcgfts.gridpp.rl.ac.uk, lfc.gridpp.rl.ac.uk, | SCHEDULED | WARNING | 15/10/2012 09:00 | 15/10/2012 13:00 | 4 hours | Rolling application of Oracle Patches to database systems behind these services.
lcgwms01 | SCHEDULED | OUTAGE | 12/10/2012 10:00 | 17/10/2012 15:00 | 5 days, 5 hours | EMI WMS update to v3.3.8
lcgwms03 | SCHEDULED | OUTAGE | 04/10/2012 11:00 | 10/10/2012 12:00 | 6 days, 1 hour | EMI WMS update to v3.3.8
lcgvo-02-21.gridpp.rl.ac.uk, | UNSCHEDULED | OUTAGE | 03/10/2012 15:30 | 31/10/2012 23:00 | 28 days, 8 hours and 30 minutes | System being decommissioned (This is a CMS VOBOX)
Open GGUS Tickets
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject
87455 | Green | Urgent | In Progress | 2012-10-17 | 2012-10-17 | Atlas | RAL-LCG2_HIMEM: jobs failing with stage-in errors
86705 | Red | Less Urgent | In Progress | 2012-10-03 | 2012-10-09 | SNO+ | RAL jobs returning errors
86690 | Red | Urgent | In Progress | 2012-10-03 | 2012-10-11 | T2K | JPKEKCRC02 missing from FTS ganglia metrics
86152 | Red | Less Urgent | In Progress | 2012-09-17 | 2012-09-19 | | correlated packet-loss on perfsonar host
68853 | Red | Less Urgent | In Progress | 2011-03-22 | 2012-08-10 | N/A | Retirement of SL4 and 32bit DPM Head nodes and Servers
Availability Report
Day | OPS | Alice | Atlas | CMS | LHCb | Comment
03/10/12 | 100 | 100 | 99.1 | 100 | 100 | Single failure to connect to srm-atlas.
04/10/12 | 100 | 100 | 100 | 100 | 100 |
05/10/12 | 100 | 100 | 100 | 100 | 100 |
06/10/12 | 100 | 100 | 100 | 100 | 100 |
07/10/12 | 100 | 100 | 100 | 100 | 100 |
08/10/12 | 100 | 100 | 100 | 100 | 100 |
09/10/12 | 100 | 100 | 100 | 100 | 100 |
10/10/12 | 100 | 100 | 100 | 100 | 100 |
11/10/12 | 100 | 95.9 | 100 | 100 | 100 | Timeout for the CE test job exceeded.
12/10/12 | 100 | 84.7 | 100 | 100 | 100 | Timeout for the CE test job exceeded.
13/10/12 | 100 | 100 | 100 | 100 | 100 |
14/10/12 | 100 | 100 | 100 | 100 | 100 |
15/10/12 | 95.8 | 100 | 100 | 100 | 100 | CE tests failed. Testing of new EMI CEs affected BDII data.
16/10/12 | 100 | 100 | 100 | 85.3 | 100 | CMS Castor instance upgraded to version 2.1.12.
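
The daily figures in the availability report can be summarised per VO. The following is a minimal sketch only: the values are transcribed by hand from the table above, the helper name `mean_availability` is hypothetical, and a plain arithmetic mean of daily percentages is an assumption for illustration, not necessarily how official availability figures are aggregated.

```python
from statistics import mean

# Column order used in the availability report above.
VOS = ["OPS", "Alice", "Atlas", "CMS", "LHCb"]

# Daily availability (%) transcribed from the report, 03/10/12 to 16/10/12.
DAILY = {
    "03/10/12": [100, 100, 99.1, 100, 100],
    "04/10/12": [100, 100, 100, 100, 100],
    "05/10/12": [100, 100, 100, 100, 100],
    "06/10/12": [100, 100, 100, 100, 100],
    "07/10/12": [100, 100, 100, 100, 100],
    "08/10/12": [100, 100, 100, 100, 100],
    "09/10/12": [100, 100, 100, 100, 100],
    "10/10/12": [100, 100, 100, 100, 100],
    "11/10/12": [100, 95.9, 100, 100, 100],
    "12/10/12": [100, 84.7, 100, 100, 100],
    "13/10/12": [100, 100, 100, 100, 100],
    "14/10/12": [100, 100, 100, 100, 100],
    "15/10/12": [95.8, 100, 100, 100, 100],
    "16/10/12": [100, 100, 100, 85.3, 100],
}


def mean_availability(daily):
    """Return {vo: unweighted mean of the daily percentages}, rounded to 2 d.p."""
    # zip(*rows) turns the per-day rows into one tuple of values per VO column.
    columns = zip(*daily.values())
    return {vo: round(mean(col), 2) for vo, col in zip(VOS, columns)}


def days_below(daily, threshold=100):
    """Return {vo: [days on which availability was below the threshold]}."""
    return {
        vo: [day for day, row in daily.items() if row[i] < threshold]
        for i, vo in enumerate(VOS)
    }


if __name__ == "__main__":
    for vo, value in mean_availability(DAILY).items():
        print(f"{vo}: {value}%")
```

Calling `days_below(DAILY)` lists the days that carry a comment in the report, which gives a quick cross-check that the transcription matches the table.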