
RAL Tier1 Operations Report for 24th July 2013

Review of Issues during the week 17th to 24th July 2013.
  • On Thursday (18th) the main RAL link to Janet was failed over to the alternative route (via London) when one of the multiple connections to Reading failed. This was transparent to us.
  • On Thursday (18th) a configuration error affecting all the CEs caused batch problems for a couple of hours until noticed and corrected.
  • Yesterday (Tuesday 23rd) the primary OPN link to CERN failed and we switched over to the backup route. We ran on the backup link for around 7 hours until the problem (JANET ticket reports a broken fibre) was fixed.

A post mortem report of the Atlas Castor outage on 28-30 June has been prepared and can be seen at:

 https://www.gridpp.ac.uk/wiki/RAL_Tier1_Incident_20130628_Atlas_Castor_Outage
Resolved Disk Server Issues
  • GDSS664 (AtlasDataDisk, D1T0) failed on 11th July. The server was down until Tuesday 16th July and has since been completely drained; all files that were on it are available to Atlas. The server itself is undergoing hardware checks before being returned to service.
Current operational status and issues
  • A problem with the batch server has persisted for a few weeks and investigations are continuing. The problem started at the same time as a batch server update, and it has remained even though that update was rolled back.
  • The uplink to the UKLight Router is running on a single 10Gbit link, rather than a pair of such links.
  • The problem of LHCb jobs failing due to long job set-up times is still under investigation. The recent update of the CVMFS clients to v2.1.12 looks promising; a quick client health check is sketched after this list.
  • The testing of FTS3 is continuing and the service is being put on a more 'production' footing. (This runs in parallel with our existing FTS2 service.) A sketch of a test transfer submission is also given after this list.
  • We are participating in xrootd federated access tests for Atlas.
  • Testing is ongoing with the proposed new batch system (ARC-CEs, Condor, SL6). Atlas and CMS are running work through this; ALICE, LHCb & H1 are being brought on board with the testing.
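
The following is a minimal sketch of how the updated CVMFS client mentioned above can be checked on a worker node. It uses the stock cvmfs_config tools and the standard LHCb repository name rather than any site-specific procedure:

  # confirm the LHCb repository mounts and responds
  cvmfs_config probe lhcb.cern.ch

  # report client version, cache usage and other statistics for the repository
  cvmfs_config stat -v lhcb.cern.ch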
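
As an illustration of the FTS3 testing noted above, a transfer can be submitted and tracked with the standard FTS3 command-line client; the endpoint and SURLs below are placeholders rather than the actual test configuration:

  # submit a single-file transfer to a (hypothetical) FTS3 endpoint; a job ID is returned
  fts-transfer-submit -s https://fts3-test.example.ac.uk:8446 \
      srm://source-se.example.ac.uk/dpm/example/file1 \
      srm://dest-se.example.ac.uk/castor/example/file1

  # poll the state of that job (ACTIVE, FINISHED, FAILED, ...)
  fts-transfer-status -s https://fts3-test.example.ac.uk:8446 <job-id>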
Ongoing Disk Server Issues
  • None
Notable Changes made this last week
  • The CMS & LHCb Castor instances (stagers) were upgraded to version 2.1.13-9 yesterday (Tuesday 23rd).
Declared in the GOC DB
  • None
Advanced warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
  • Update the remaining Castor stager (GEN) on Tuesday 30th July.
  • The SL6 and "Whole Node" queues on the production batch service will be terminated. Multi-core jobs and those requiring SL6 can be run on the test Condor batch system (an illustrative submit description follows this list).
  • Wednesday 24th July: Transition of Thames Valley Network to Janet 6.
  • Re-establishing the paired (2*10Gbit) link to the UKLight router.
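
As a sketch of the kind of multi-core, SL6-only job the test Condor batch system is intended to take (the file name, resource request and requirements expression are illustrative assumptions, not the agreed VO configuration):

  # multicore-test.sub -- illustrative HTCondor submit description
  universe      = vanilla
  executable    = /bin/sleep
  arguments     = 300
  request_cpus  = 8
  # restrict the job to SL6 worker nodes via the machine ClassAd
  requirements  = (OpSysAndVer == "SL6")
  output        = test.out
  error         = test.err
  log           = test.log
  queue

  # submit and monitor
  condor_submit multicore-test.sub
  condor_q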

Listing by category:

  • Databases:
    • Switch LFC/FTS/3D to new Database Infrastructure.
  • Castor:
    • Upgrade to version 2.1.13 (ongoing)
  • Networking:
    • Single link to UKLight Router to be restored as paired (2*10Gbit) link.
    • Update the core Tier1 network and change the connection to the site and the OPN, including:
      • Install a new routing layer for the Tier1.
      • Change the way the Tier1 connects to the RAL network.
      • These changes will lead to the removal of the UKLight Router.
  • Grid Services
    • Testing of alternative batch systems (SLURM, Condor) along with ARC-CEs and SL6 Worker Nodes (an illustrative ARC submission is sketched after this list).
  • Fabric
    • One of the disk arrays hosting the FTS, LFC & Atlas 3D databases is showing a fault and an intervention is required.
  • Infrastructure:
    • A 2-day maintenance is being planned for sometime in October or November to cover the following. It is expected to require around a half-day outage of power to the UPS room, with Castor & batch down for the remaining 1.5 days as equipment is switched off in rotation for the tests.
      • Intervention required on the "Essential Power Board" & Remedial work on three (out of four) transformers.
      • Remedial work on the BMS (Building Management System) due to one of its three modules being faulty.
      • Electrical safety check. This will require significant downtime (most likely 2 days), during which the above infrastructure issues will also be addressed.
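
As an illustration of the ARC-CE testing listed under Grid Services, a minimal job can be described in xRSL and submitted with the standard ARC client tools; the CE hostname below is a placeholder and a valid grid proxy is assumed:

  # test.xrsl -- minimal illustrative job description
  &(executable="/bin/hostname")
   (jobname="arc-condor-test")
   (stdout="stdout.txt")
   (stderr="stderr.txt")

  # submit to a (hypothetical) ARC CE and check the status of all known jobs
  arcsub -c arc-ce-test.example.ac.uk test.xrsl
  arcstat -a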
Entries in GOC DB starting between the 17th and 24th July 2013.

There were no unscheduled entries in the GOC DB during this last week.

Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
srm-cms, srm-lhcb | SCHEDULED | OUTAGE | 23/07/2013 10:00 | 23/07/2013 13:54 | 3 hours and 54 minutes | Upgrade of CMS and LHCb Castor instances to version 2.1.13-9
Whole site | SCHEDULED | WARNING | 23/07/2013 07:45 | 23/07/2013 08:45 | 1 hour | Site warning for one hour around two reboots of the site firewall which will take place within this time window.
lcgwms06 | SCHEDULED | OUTAGE | 19/07/2013 10:00 | 25/07/2013 12:00 | 6 days, 2 hours | Upgrade to EMI-3
Open GGUS Tickets (Snapshot at time of meeting)
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject
96102 | Green | Less Urgent | In Progress | 2013-07-24 | 2013-07-24 | CMS | File Read Error: T1_UK_RAL
96079 | Green | Urgent | In Progress | 2013-07-23 | 2013-07-23 | Atlas | Slow deletion rate at RAL
95996 | Green | Urgent | In Progress | 2013-07-22 | 2013-07-22 | OPS | SHA-2 test failing on lcgce01
95904 | Yellow | Very Urgent | In Progress | 2013-07-20 | 2013-07-22 | LHCb | Pilots aborted at RAL-LCG2
95435 | Red | Urgent | In Progress | 2013-07-04 | 2013-07-19 | LHCb | CVMFS problem at RAL-LCG2
91658 | Red | Less Urgent | Waiting Reply | 2013-02-20 | 2013-07-16 | | LFC webdav support
86152 | Red | Less Urgent | On Hold | 2012-09-17 | 2013-06-17 | | correlated packet-loss on perfsonar host
Availability Report
Day | OPS | ALICE | Atlas | CMS | LHCb | Comment
17/07/13 | 100 | 96.9 | 100 | 100 | 96.9 | Batch server problems - CEs then unable to contact it.
18/07/13 | 96.9 | 91.6 | 92.3 | 95.4 | 88.9 | Configuration error on CEs.
19/07/13 | 100 | 96.4 | 100 | 97.5 | 100 | ALICE: Batch server problems - CEs then unable to contact it; CMS: Single failure of SRM Put Test.
20/07/13 | 98.3 | 84.9 | 100 | 100 | 85.7 | Batch server problems - CEs then unable to contact it.
21/07/13 | 100 | 100 | 100 | 100 | 96.9 | Batch server problems - CEs then unable to contact it.
22/07/13 | 100 | 100 | 100 | 100 | 100 |
23/07/13 | 100 | 100 | 100 | 83.8 | 83.8 | Castor 2.1.13 upgrade for CMS & LHCb.