RAL Tier1 Operations Report for 7th August 2013
Review of Issues during the fortnight 24th July to 7th August 2013.
- There have been problems for a number of weeks with the batch server. On Thursday 25th July a mis-configured node was found that was repeatedly contacting the batch server. Stopping this node resolved the problems.
- On Thursday morning, 1st August, there was a problem with one of the DNS servers at RAL, which was responding very slowly. This caused problems for the Castor & Batch services. The problem occurred around 6am and was fixed around 9am.
- The problem of LHCb jobs failing due to long job set-up times is now closed. The update of the CVMFS clients to v2.1.12 is working well.
- Over last weekend there was a problem with CVMFS that affected Atlas batch work. On Monday (5th August) this was traced to a corrupt CVMFS catalog entry on webfs (i.e. on the RAL Stratum-1).
- On Tuesday morning there was a planned network intervention. However, problems were subsequently encountered, leading to two breaks in connectivity for the RAL site. The first (07:54 – 08:20) was within the time window pre-declared in the GOC DB; the second, around an hour later (09:20 – 09:50), was not, and an Outage was declared for it.
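As a rough illustration of the kind of check that would have flagged the slow DNS responses on 1st August, a resolver lookup can simply be timed. This is a minimal sketch, not the actual RAL monitoring; the hostname is a placeholder (localhost is used here so the snippet runs anywhere).

```python
import socket
import time

def time_lookup(hostname: str) -> float:
    """Return the wall-clock time (seconds) taken to resolve a hostname."""
    start = time.perf_counter()
    socket.gethostbyname(hostname)  # raises socket.gaierror on failure
    return time.perf_counter() - start

# A healthy resolver answers in milliseconds; the 1st August incident
# showed responses slow enough to disrupt Castor and batch services.
elapsed = time_lookup("localhost")
print(f"lookup took {elapsed:.4f}s")
```

In practice such a probe would run periodically against each of the site's DNS servers and alarm on a latency threshold.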
Resolved Disk Server Issues
- GDSS492 (AtlasDataDisk, D1T0) was taken out of production for around an hour yesterday (6th Aug). Following replacement of a disk, the new drive was not seen by the disk controller and a system restart was needed before it appeared.
Current operational status and issues
- The uplink to the UKLight Router is running on a single 10Gbit link, rather than a pair of such links.
- The FTS3 testing has continued very actively. Atlas have moved both the UK and German clouds to use it. There have been some problems (over the weekend of 4/5 August and overnight on 6/7 August); these were both issues with FTS3 itself and were rapidly resolved.
- We are participating in xrootd federated access tests for Atlas.
- Testing is ongoing with the proposed new batch system (ARC-CEs, Condor, SL6). Atlas and CMS are running work through this, and ALICE, LHCb & H1 are being brought on board with the testing.
- Atlas have reported a problem with slow file deletions. This is being investigated. The problem also appears to affect the RAL Tier2.
Ongoing Disk Server Issues
Notable Changes made this last fortnight.
- Castor GEN instance (stager) upgraded to version 2.1.13-9 on Tuesday 30th July.
- Lcgwms06 has been re-deployed and configured as an EMI-3 SL6 WMS server. The installation also includes the Condor release needed for interfacing with the ARC-CE nodes.
- LCGCE12 (CE for SL6 test Queue on the production batch farm) is in a long Outage ready for decommissioning.
Advanced warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
- The SL6 and "Whole Node" queues on the production batch service will be terminated. Multi-core jobs and those requiring SL6 can be run on the test Condor batch system.
- Re-establishing the paired (2*10Gbit) link to the UKLight router.
Listing by category:
- Databases:
- Switch LFC/FTS/3D to new Database Infrastructure.
- Castor:
- Networking:
- Single link to UKLight Router to be restored as paired (2*10Gbit) link.
- Update core Tier1 network and change connection to site and OPN including:
- Install new Routing layer for Tier1
- Change the way the Tier1 connects to the RAL network.
- These changes will lead to the removal of the UKLight Router.
- Grid Services:
- Testing of alternative batch systems (SLURM, Condor) along with ARC-CEs and SL6 Worker Nodes.
- Fabric:
- One of the disk arrays hosting the FTS, LFC & Atlas 3D databases is showing a fault and an intervention is required.
- Infrastructure:
- A 2-day maintenance is being planned for the first week in November (TBC) for the following items. This is expected to require around a half-day outage of power to the UPS room, with Castor & Batch down for the remaining 1.5 days as equipment is switched off in rotation for the tests.
- Intervention required on the "Essential Power Board" & Remedial work on three (out of four) transformers.
- Remedial work on the BMS (Building Management System) due to one of its three modules being faulty.
- Electrical safety check. This will require significant (most likely 2 days) downtime during which time the above infrastructure issues will also be addressed.
Entries in GOC DB starting between the 24th July and 7th August 2013.
There were three unscheduled Outages in the GOC DB during this last fortnight: two for network problems and one for batch server problems. There was also an unscheduled Warning for a forthcoming network intervention (the notification was initially overlooked by the Tier1 team).
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
lcgce12 | SCHEDULED | OUTAGE | 06/08/2013 13:00 | 05/09/2013 13:00 | 30 days | CE (and the SL6 batch queue behind it) being decommissioned.
Whole Site | UNSCHEDULED | OUTAGE | 06/08/2013 09:20 | 06/08/2013 09:50 | 30 minutes | All services unavailable due to break in external network connectivity.
Whole Site | UNSCHEDULED | WARNING | 06/08/2013 07:45 | 06/08/2013 08:45 | 1 hour | Site At Risk around two short firewall reboots.
All Castor & Batch (CEs) | UNSCHEDULED | OUTAGE | 01/08/2013 06:00 | 01/08/2013 09:00 | 3 hours | Many services (all Castor and Batch) disrupted by DNS problem. (Retrospective declaration.)
Castor GEN instance (srm-alice, srm-biomed, srm-dteam, srm-hone, srm-ilc, srm-mice, srm-minos, srm-na62, srm-snoplus, srm-superb, srm-t2k) | SCHEDULED | OUTAGE | 30/07/2013 10:00 | 30/07/2013 11:35 | 1 hour, 35 minutes | Upgrade of the Castor GEN instance to version 2.1.13-9.
All batch (All CEs) | UNSCHEDULED | OUTAGE | 24/07/2013 14:20 | 24/07/2013 16:20 | 2 hours | Communication problem between the pbs/maui batch server and the CEs.
lcgwms06 | SCHEDULED | OUTAGE | 19/07/2013 10:00 | 25/07/2013 12:00 | 6 days, 2 hours | Upgrade to EMI-3.
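For cross-checking, the durations in the GOC DB entries follow directly from the start and end timestamps. A minimal sketch (the function name is illustrative, not part of the GOC DB tooling; the format string is assumed from the DD/MM/YYYY HH:MM layout used above):

```python
from datetime import datetime

FMT = "%d/%m/%Y %H:%M"  # assumed from the DD/MM/YYYY HH:MM timestamps above

def outage_minutes(start: str, end: str) -> int:
    """Duration in whole minutes between two GOC DB-style timestamps."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return int(delta.total_seconds() // 60)

# The unscheduled whole-site outage on 6th August:
print(outage_minutes("06/08/2013 09:20", "06/08/2013 09:50"))  # 30
```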
Open GGUS Tickets (Snapshot at time of meeting)
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject
96321 | Yellow | Less Urgent | Waiting Reply | 2013-08-02 | 2013-08-06 | SNO+ | SNO+ srm tests failing
96235 | Red | Less Urgent | In Progress | 2013-07-29 | 2013-07-30 | hyperk.org | LFC for hyperk.org
96233 | Red | Less Urgent | In Progress | 2013-07-29 | 2013-07-30 | hyperk.org | WMS for hyperk.org - RAL
96079 | Amber | Urgent | Reopened | 2013-07-23 | 2013-08-06 | Atlas | Slow deletion rate at RAL
95996 | Red | Urgent | In Progress | 2013-07-22 | 2013-07-22 | OPS | SHA-2 test failing on lcgce01
91658 | Red | Less Urgent | In Progress | 2013-02-20 | 2013-08-02 | | LFC webdav support
86152 | Red | Less Urgent | On Hold | 2012-09-17 | 2013-06-17 | | correlated packet-loss on perfsonar host
Daily VO availability (%):
Day | OPS | Alice | Atlas | CMS | LHCb | Comment
24/07/13 | 100 | 100 | 98.6 | 100 | 100 | Batch problems. Partly ongoing problem (unable to connect). Also problems following the replacement of the pbs_server daemon as part of the investigations into this.
25/07/13 | 100 | 100 | 100 | 100 | 93.3 | Batch server problems - CEs then unable to contact it.
26/07/13 | 100 | 100 | 100 | 100 | 95.8 | Single SRM test failure. Failure to list file.
27/07/13 | 100 | 100 | 100 | 100 | 100 |
28/07/13 | 100 | 100 | 100 | 100 | 100 |
29/07/13 | 100 | 100 | 100 | 100 | 100 |
30/07/13 | 94.9 | 100 | 97.2 | 100 | 100 | OPS: Error in information published to BDIIs by test CEs; Atlas: Two SRM test timeouts.
31/07/13 | 76.0 | 100 | 100 | 100 | 100 | Error in information published to BDIIs by test CEs.
01/08/13 | 100 | 100 | 84.5 | 88.2 | 92.5 | DNS server response very slow. Caused multiple problems, especially for Castor & batch.
02/08/13 | 100 | 100 | 100 | 100 | 100 |
03/08/13 | 100 | 100 | 100 | 100 | 100 |
04/08/13 | 100 | 100 | 100 | 100 | 100 |
05/08/13 | 100 | 100 | 100 | 95.9 | 100 | Single SRM test failure (timeout).
06/08/13 | 99.4 | 94.4 | 95.1 | 95.9 | 98.9 | Site network problem. (Plus single SRM test failure later in day for CMS.)
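The daily figures can be rolled up into a period average per VO. A minimal sketch using the Atlas column from the availability table (values copied from the 14 days listed above):

```python
# Atlas daily availability (%) for 24th July - 6th August, from the table above.
atlas = [98.6, 100, 100, 100, 100, 100, 97.2,
         100, 84.5, 100, 100, 100, 100, 95.1]

average = sum(atlas) / len(atlas)
print(f"Atlas period average: {average:.2f}%")  # 98.24%
```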