Tier1 Operations Report 2013-05-22


RAL Tier1 Operations Report for 22nd May 2013

Review of Issues during the week 15th to 22nd May 2013.
  • There was a short (few minutes) break in the primary OPN link on Thursday (16th May).
  • There was an unplanned restart of the Atlas SRM daemons shortly before midnight last night, triggered by a database issue. This caused a momentary spike of transfer failures, but the systems recovered and carried on working normally.
  • A single file has been reported lost to Atlas. This problem was found when draining an older disk server.
Resolved Disk Server Issues
  • None.
Current operational status and issues
  • The uplink to the UKLight Router is running on a single 10Gbit link, rather than a pair of such links.
  • The problem of LHCb and Atlas jobs failing due to long job set-up times remains, and investigations continue.
  • The testing of FTS3 is continuing. (This runs in parallel with our existing FTS2 service).
  • We are participating in xrootd federated access tests for Atlas.
  • A test batch queue with five SL6/EMI-2 worker nodes and its own CE is in place.
Ongoing Disk Server Issues
  • GDSS422 (LHCbUser - D1T0) was removed from production this morning (22nd May) for a memory test following reported memory errors.
Notable Changes made this last week
  • The central networking team made a successful intervention on Tuesday morning (21st May), during which we stopped Castor & batch services as a precaution.
  • A new WLCG VOBOX for ALICE has been deployed and handed over to ALICE for testing. Once ALICE confirm it is OK, the old box will be retired and the new one moved into full production.
  • This morning (22nd May) some ATLAS jobs were switched to accessing their input data directly rather than copying it to scratch disk on the WN (see the sketch after this list).
  • As part of the work to follow up problems reported in GGUS ticket #92266, an additional MyProxy server has been declared in the GOC DB.
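The change to direct data access can be illustrated with a minimal sketch, assuming PyROOT and the xrootd client are available on the worker node; the redirector hostname and file path below are hypothetical placeholders, not the actual ATLAS configuration.

    # Minimal sketch of the two access modes (assumption: PyROOT + xrootd client on the WN).
    import subprocess
    import ROOT

    REMOTE = "root://redirector.example.org//atlas/data/somefile.root"  # hypothetical URL

    # Copy-to-scratch mode: stage the whole file to local disk, then open the local copy.
    subprocess.check_call(["xrdcp", REMOTE, "/tmp/scratch_copy.root"])
    staged = ROOT.TFile.Open("/tmp/scratch_copy.root")

    # Direct access mode: open the remote URL and read only the bytes each job needs.
    direct = ROOT.TFile.Open(REMOTE)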
Declared in the GOC DB
  • None
Advanced warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
  • Re-establishing the paired (2*10Gbit) link to the UKLight router. (Aiming to do this in the next few weeks.)
  • A problem has come to light following the upgrade of the non-Tier1 'facilities' Castor instance to version 2.1.13. The upgrade of Tier1 Castor instances to 2.1.13-9 will await our understanding and resolution of this problem.

Listing by category:

  • Databases:
    • Switch LFC/FTS/3D to new Database Infrastructure.
  • Castor:
    • Upgrade to version 2.1.13
  • Networking:
    • Single link to UKLight Router to be restored as paired (2*10Gbit) link.
    • Update core Tier1 network and change connection to site and OPN including:
      • Install new Routing layer for Tier1
      • Change the way the Tier1 connects to the RAL network.
      • These changes will lead to the removal of the UKLight Router.
  • Grid Services
    • Testing of alternative batch systems (SLURM, Condor).
    • Upgrade of one remaining EMI-1 component (UI) being planned.
  • Fabric
    • One of the disk arrays hosting the FTS, LFC & Atlas 3D databases is showing a fault and an intervention is required.
  • Infrastructure:
    • Intervention required on the "Essential Power Board" and remedial work on three (out of four) transformers.
    • Remedial work on the BMS (Building Management System) due to one of its three modules being faulty.
    • Electrical safety check. This will require significant downtime (most likely 2 days), during which the above infrastructure issues will also be addressed.
Entries in GOC DB starting between 15th and 22nd May 2013.

There were no unscheduled outages during the last week.

Service Scheduled? Outage/At Risk Start End Duration Reason
All Castor & Batch SCHEDULED OUTAGE 21/05/2013 07:00 21/05/2013 08:42 1 hour and 42 minutes Stop of Castor storage and batch job starts during network intervention.
Open GGUS Tickets (Snapshot at time of meeting)
GGUS ID Level Urgency State Creation Last Update VO Subject
94083 Amber Urgent Waiting Reply 2013-05-15 2013-05-15 MICE Jobs sent to RAL fail with an error -10
93149 Red Less Urgent On Hold 2013-04-05 2013-05-13 Atlas RAL-LCG2: jobs failing with "cmtside command was timed out"
92266 Red Less Urgent In Progress 2013-03-06 2013-05-21 Certificate for RAL myproxy server
91658 Red Less Urgent On Hold 2013-02-20 2013-04-03 LFC webdav support
86152 Red Less Urgent On Hold 2012-09-17 2013-03-19 correlated packet-loss on perfsonar host
Availability Report
Day OPS Alice Atlas CMS LHCb Comment
15/05/13 100 100 100 100 100
16/05/13 100 100 98.3 100 100 CE test failed due to a CVMFS problem. (New CVMFS version being tested on a few nodes).
17/05/13 100 100 100 100 100
18/05/13 100 100 -100 100 100 Atlas' monitoring not working
19/05/13 100 100 -100 100 100 Atlas' monitoring not working
20/05/13 100 100 -100 100 100 Atlas' monitoring not working
21/05/13 92.9 92.9 -100 92.4 92.0 Castor & Batch services stopped around the planned networking intervention.
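As an illustrative consistency check, the OPS and Alice figures for 21/05 correspond to the 102-minute scheduled Castor & batch stop (07:00 to 08:42) taken over a 24-hour day; the CMS and LHCb figures differ slightly because their tests run on different schedules.

    # Illustrative check: 1 hour 42 minutes of downtime out of a 24-hour day.
    outage_minutes = 1 * 60 + 42                   # 07:00 to 08:42 on 21/05/2013
    availability = 100.0 * (1 - outage_minutes / (24 * 60.0))
    print("%.1f%%" % availability)                 # prints 92.9%, matching OPS and Alice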