Tier1 Operations Report 2013-07-17

RAL Tier1 Operations Report for 17th July 2013

Review of Issues during the week 10th to 17th July 2013.
  • The CVMFS problems - notably affecting CMS - continued while we verified that CVMFS client version 2.1.12 works correctly. It does, and this version has been rolled out across the batch farm (an illustrative verification sketch follows this list).
  • The Atlas Castor 2.1.13-9 upgrade overran significantly (by 4 hours) last Wednesday (10th July). The delays arose while updating the configurations and operating systems of the head nodes and disk servers. The upgrade was completed successfully.
  • The problem reported last week of connections to the batch server failing has continued. It started at the same time as the batch server was updated; that update was rolled back last Thursday (11th) but the problem remains.
  • Late on Wednesday afternoon, monitoring showed unusual activity (or lack of it) on the Castor GEN instance, which was put into a 'warning' state in the GOC DB overnight. No problems were subsequently identified.
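As an illustration of the per-node check involved in verifying a CVMFS client rollout of this kind, the sketch below queries the installed cvmfs package version and probes a few repositories. This is an assumption-laden example, not the Tier1's actual deployment tooling: the repository list and expected version string are placeholders, and it assumes the cvmfs RPM and the standard cvmfs_config utility are present on the worker node.

  # Minimal sketch of a per-node CVMFS client check (illustrative only).
  import subprocess

  EXPECTED_VERSION = "2.1.12"                                 # version being rolled out
  REPOS = ["atlas.cern.ch", "cms.cern.ch", "lhcb.cern.ch"]    # placeholder repository list

  def installed_cvmfs_version():
      """Return the installed cvmfs package version, e.g. '2.1.12'."""
      out = subprocess.run(["rpm", "-q", "--qf", "%{VERSION}", "cvmfs"],
                           capture_output=True, text=True, check=True)
      return out.stdout.strip()

  def probe(repo):
      """Return True if 'cvmfs_config probe <repo>' succeeds on this node."""
      return subprocess.run(["cvmfs_config", "probe", repo]).returncode == 0

  if __name__ == "__main__":
      version = installed_cvmfs_version()
      status = "OK" if version == EXPECTED_VERSION else "unexpected"
      print("cvmfs client version: %s (%s)" % (version, status))
      for repo in REPOS:
          print("%s: %s" % (repo, "mounted OK" if probe(repo) else "PROBE FAILED"))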
Resolved Disk Server Issues
  • None
Current operational status and issues
  • The uplink to the UKLight Router is running on a single 10Gbit link, rather than a pair of such links.
  • The problem of LHCb jobs failing due to long job set-up times is still under investigation. The recent update of the CVMFS clients to v2.1.12 is promising.
  • The testing of FTS3 is continuing and the service is being put on a more 'production' footing. (This runs in parallel with our existing FTS2 service).
  • We are participating in xrootd federated access tests for Atlas.
  • Testing is ongoing with the proposed new batch system (ARC-CEs, Condor, SL6). Atlas and CMS are running work through it; ALICE, LHCb & H1 are being brought on board with the testing.
Ongoing Disk Server Issues
  • On Thursday evening, 11th July, GDSS664 (AtlasDataDisk, D1T0) failed. There have been significant problems rebuilding the RAID array containing the data, and at one point Atlas were warned that we might have data loss. However, the server was brought back up on Tuesday (16th) and, following checksumming of a sample of files to validate the data (an illustrative sketch follows below), the server is being drained ahead of further investigations.
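A minimal sketch of the kind of checksum comparison used to validate a sample of files is shown below. It is illustrative only: it assumes a two-column catalogue dump (file path, expected Adler-32 checksum in hex) supplied as a CSV, whose name and layout are hypothetical rather than any Castor export format.

  # Illustrative sketch: compare on-disk Adler-32 checksums of sampled files
  # against expected values from a catalogue dump (hypothetical CSV layout).
  import csv
  import sys
  import zlib

  def adler32_of(path, chunk_size=1 << 20):
      """Compute the Adler-32 checksum of a file, reading it in chunks."""
      value = 1  # Adler-32 starting value
      with open(path, "rb") as f:
          for chunk in iter(lambda: f.read(chunk_size), b""):
              value = zlib.adler32(chunk, value)
      return format(value & 0xFFFFFFFF, "08x")

  def main(catalogue_csv):
      mismatches = 0
      with open(catalogue_csv, newline="") as f:
          for path, expected in csv.reader(f):
              exp = expected.strip().lower()
              if exp.startswith("0x"):
                  exp = exp[2:]
              actual = adler32_of(path)
              if actual != exp.zfill(8):
                  mismatches += 1
                  print("MISMATCH %s: expected %s, got %s" % (path, exp, actual))
      print("%d mismatching file(s) in sample" % mismatches)
      return 1 if mismatches else 0

  if __name__ == "__main__":
      sys.exit(main(sys.argv[1]))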
Notable Changes made this last week
  • Castor Atlas instance (stager) was upgraded to version 2.1.13-9 last Wednesday (10th).
  • CVMFS client version 2.1.12 has been rolled out to most of the batch farm.
  • Software updates applied to the batch server the week before were rolled back on Thursday 11th July.
  • The two ARC-CEs were added to the GOC DB a week ago and were set to 'monitored' this Monday (15th).
Declared in the GOC DB
  • Tuesday 23rd July: Upgrade of CMS and LHCb Castor instances to version 2.1.13-9
Advanced warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
  • Two reboots of site firewall between 07:45 and 08:45: Tuesday 23rd July.
  • Update the remaining Castor stagers on the following dates: CMS & LHCb: Tuesday 23rd July; GEN: Tuesday 30th July.
  • Wednesday 24th July: Transition of Thames Valley Network to Janet 6.
  • Re-establishing the paired (2*10Gbit) link to the UKLight router.

Listing by category:

  • Databases:
    • Switch LFC/FTS/3D to new Database Infrastructure.
  • Castor:
    • Upgrade to version 2.1.13 (ongoing)
  • Networking:
    • Single link to UKLight Router to be restored as paired (2*10Gbit) link.
    • Update core Tier1 network and change connection to site and OPN including:
      • Install new Routing layer for Tier1
      • Change the way the Tier1 connects to the RAL network.
      • These changes will lead to the removal of the UKLight Router.
  • Grid Services
    • Testing of alternative batch systems (SLURM, Condor) along with ARC-CEs and SL6 Worker Nodes.
  • Fabric
    • One of the disk arrays hosting the FTS, LFC & Atlas 3D databases is showing a fault and an intervention is required.
  • Infrastructure:
    • A 2-day maintenance is being planned for sometime in October or November for the following. This is expected to require around a half-day outage of power to the UPS room, with Castor and batch services down for the remaining 1.5 days as equipment is switched off in rotation for the tests.
      • Intervention required on the "Essential Power Board" & Remedial work on three (out of four) transformers.
      • Remedial work on the BMS (Building Management System) due to one of its three modules being faulty.
      • Electrical safety check. This will require significant (most likely 2 days) downtime during which time the above infrastructure issues will also be addressed.
Entries in GOC DB starting between 10th and 17th July 2013.

There were three unscheduled entries in the GOC DB. One was an unscheduled OUTAGE, declared when the Atlas Castor upgrade overran. The other two were unscheduled WARNINGs: one for the batch system (as a change made earlier was reverted), and the other for the Castor 'GEN' instance, which was experiencing problems.

Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
All CEs: lcgce01, lcgce02, lcgce04, lcgce10, lcgce11, lcgce12 | UNSCHEDULED | WARNING | 11/07/2013 09:30 | 11/07/2013 10:35 | 1 hour and 5 minutes | Batch service At Risk during work on batch server.
Castor GEN instance: srm-alice, srm-biomed, srm-dteam, srm-hone, srm-ilc, srm-mice, srm-minos, srm-na62, srm-snoplus, srm-superb, srm-t2k | UNSCHEDULED | WARNING | 10/07/2013 18:00 | 11/07/2013 09:25 | 15 hours and 25 minutes | Some problems seen with Castor GEN instance which are not fully understood. Instance working but being put in Warning overnight.
srm-atlas | UNSCHEDULED | OUTAGE | 10/07/2013 14:00 | 10/07/2013 18:00 | 4 hours | Extending outage of Atlas Castor instance as the upgrade is overrunning.
srm-atlas | SCHEDULED | OUTAGE | 10/07/2013 09:00 | 10/07/2013 14:00 | 5 hours | Upgrade of Atlas Castor Stager to version 2.1.13-9.
Open GGUS Tickets (Snapshot at time of meeting)
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject
95820 | Green | Less Urgent | In Progress | 2013-07-17 | 2013-07-17 | CMS | Many errors with file access at RAL today, maybe related to high load (~5000 jobs running) on the file server.
95757 | Green | Less Urgent | In Progress | 2013-07-15 | 2013-07-17 | CMS | Jobs are failing at a particular node.
95671 | Yellow | Less Urgent | In Progress | 2013-07-11 | 2013-07-17 | LHCb | Many jobs are failing at T1_UK_RAL, related to availability of a CMSSW release.
95435 | Red | Urgent | In Progress | 2013-07-04 | 2013-07-04 | LHCb | CVMFS problem at RAL-LCG2
91658 | Red | Less Urgent | In Progress | 2013-02-20 | 2013-07-16 | | LFC webdav support
86152 | Red | Less Urgent | On Hold | 2012-09-17 | 2013-06-17 | | correlated packet-loss on perfsonar host
Availability Report
Day | OPS | Alice | Atlas | CMS | LHCb | Comment
10/07/13 | 100 | 97.1 | 67.6 | 100 | 100 | Atlas: Castor upgrade; ALICE: CE test failures (CEs could not contact batch server).
11/07/13 | 100 | 97.5 | 100 | 100 | 96.2 | CE test failures (CEs could not contact batch server).
12/07/13 | 100 | 100 | 100 | 100 | 96.9 | CE test failures (CEs could not contact batch server).
13/07/13 | 100 | 100 | 98.7 | 100 | 100 | SRM test failures (Castor)
14/07/13 | 100 | 100 | 90.4 | 100 | 100 | SRM test failures (Castor)
15/07/13 | 100 | 100 | 94.8 | 91.9 | 100 | SRM test failures (Castor)
16/07/13 | 100 | 96.9 | 100 | 95.9 | 100 | ALICE: CE test failure; CMS: SRM test failures (Castor)