RAL Tier1 Operations Report for 28th August 2013
Review of Issues during the fortnight 15th to 28th August 2013.
- There has been a high rate of timeouts on the CMS Castor instance. These are limited to CMS-tape and may be a consequence of draining a large number of disk servers in that service class. On Thursday 15th the number of draining disk servers was reduced, the number of FTS channels was reduced, and the transfer manager timeout was increased to 10 seconds. The same timeout increase was subsequently applied to all Castor instances on 20th August.
- Atlas had problems transferring files via FTS3 where a file has no checksum. FTS3 was patched for this on Tuesday 20th.
- Over the weekend of the 17th and 18th, Atlas had many file transfer failures, caused by a disk server with bad networking. The disk server was fixed on Monday.
- Alice submitted some pathological jobs on Friday 23rd August. Approximately 200 worker nodes were taken offline due to memory depletion.
- There were problems for Atlas with FTS3 on Friday 23rd August; they reverted to using FTS2 until Saturday. A new FTS3 release expected today should fix this.
- There are ongoing issues today (28/08/2013) with the top-level BDIIs: two of the three are not working correctly. We are looking into this.
Resolved Disk Server Issues
- gdss598 (atlas ATLASDATADISK) developed a problem on Monday 19th and was removed from production for a while. The cause was found to be a faulty switch port.
Current operational status and issues
- The uplink to the UKLight Router is running on a single 10Gbit link, rather than a pair of such links.
- The FTS3 testing has continued very actively. Atlas have moved the UK, German and French clouds to use it. These tests continue to uncover problems with FTS3, and patches are being applied regularly to deal with them.
- We are participating in xrootd federated access tests for Atlas. The server has now been successfully configured to work as an xrootd redirector, whereas before it could only serve as a proxy.
- Testing is ongoing with the proposed new batch system (ARC-CEs, Condor, SL6). Atlas and CMS are running work through it; ALICE, LHCb and H1 are being brought on board with the testing.
- Atlas have reported slow file deletions. This is being investigated; the problem appears also to affect the RAL Tier2.
Ongoing Disk Server Issues
Notable Changes made this last fortnight.
- CVMFS has been upgraded to 2.1.14-1 in response to [EGI-SVG-2013-5890].
- Transfer manager timeout for all VOs has been changed to 10 seconds.
- FTS3 was patched to fix various issues.
- Today we are updating the firmware on the EMC array under the Castor standby databases.
- LCGCE12 (CE for SL6 test Queue on the production batch farm) is in a long Outage ready for decommissioning.
Advanced warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
- The SL6 and "Whole Node" queues on the production batch service will be terminated. Multi-core jobs and those requiring SL6 can be run on the test Condor batch system.
- Re-establishing the paired (2*10Gbit) link to the UKLight router.
Listing by category:
- Databases:
- Switch LFC/FTS/3D to new Database Infrastructure.
- Castor:
- Networking:
- Single link to UKLight Router to be restored as paired (2*10Gbit) link.
- Update core Tier1 network and change connection to site and OPN including:
- Install new Routing layer for Tier1
- Change the way the Tier1 connects to the RAL network.
- These changes will lead to the removal of the UKLight Router.
- Grid Services
- Testing of alternative batch systems (SLURM, Condor) along with ARC-CEs and SL6 Worker Nodes.
- Fabric
- One of the disk arrays hosting the FTS, LFC & Atlas 3D databases is showing a fault and an intervention is required.
- Infrastructure:
- A 2-day maintenance is being planned for the first week in November (TBC) for the items below. This is expected to require around a half-day outage of power to the UPS room, with Castor and batch down for the remaining 1.5 days as equipment is switched off in rotation for the tests.
- Intervention required on the "Essential Power Board" & Remedial work on three (out of four) transformers.
- Remedial work on the BMS (Building Management System) due to one of its three modules being faulty.
- Electrical safety check. This will require significant (most likely 2 days) downtime during which time the above infrastructure issues will also be addressed.
Entries in GOC DB starting between the 7th and 27th August 2013.
There is one unscheduled Warning in the GOC DB for our ongoing top-level BDII issues.
There were two unscheduled Warnings in the GOC DB during the week before last, both for an intervention on a disk array behind the Atlas 3D service. The intervention was postponed on Wednesday because the engineer was not available, but went ahead on Thursday.
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
lcgbdii.gridpp.rl.ac.uk | UNSCHEDULED | WARNING | 28/08/2013 11:30 | 28/08/2013 16:00 | 4 hours and 30 minutes | We are currently investigating problems on 2 (out of 3) top BDII servers.
lcgft-atlas.gridpp.rl.ac.uk | UNSCHEDULED | WARNING | 15/08/2013 10:00 | 15/08/2013 14:54 | 4 hours and 54 minutes | Atlas 3D service at RAL at Risk while a standby power supply is swapped in a disk array. (Delayed intervention from yesterday).
lcgft-atlas.gridpp.rl.ac.uk | UNSCHEDULED | WARNING | 14/08/2013 09:00 | 14/08/2013 17:00 | 8 hours | Atlas 3D service at RAL at Risk while a standby power supply is swapped in a disk array.
lcgce12.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 06/08/2013 13:00 | 05/09/2013 13:00 | 30 days | CE (and the SL6 batch queue behind it) being decommissioned.
Open GGUS Tickets (Snapshot at time of meeting)

GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject
96321 | Red | In Progress | Waiting Reply | 2013-08-02 | 2013-08-06 | SNO+ | SNO+ srm tests failing
96235 | Red | Waiting for Reply | In Progress | 2013-07-29 | 2013-08-09 | hyperk.org | LFC for hyperk.org
96233 | Red | Less Urgent | In Progress | 2013-07-29 | 2013-08-09 | hyperk.org | WMS for hyperk.org - RAL
95996 | Red | Urgent | In Progress | 2013-07-22 | 2013-07-22 | OPS | SHA-2 test failing on lcgce01
91658 | Red | Less Urgent | In Progress | 2013-02-20 | 2013-08-09 | | LFC webdav support
86152 | Red | Less Urgent | On Hold | 2012-09-17 | 2013-06-17 | | correlated packet-loss on perfsonar host
|
Day | OPS | Alice | Atlas | CMS | LHCb | Comment
14/08/13 | 100 | 100 | 98.7 | 83.1 | 92 | Atlas: CMS: SRM timeout, LHCB: SRM error "ERROR: [SE][Ls][SRM_FILE_BUSY]"
15/08/13 | 100 | 100 | 100 | 100 | 95 | LHCB: SRM error "ERROR: [SE][Ls][SRM_FILE_BUSY]"
16/08/13 | 100 | 100 | 91 | 100 | 100 | ATLAS: SRM timeouts.
17/08/13 | 100 | 100 | 100 | 100 | 100 |
18/08/13 | 100 | -100 | 89 | 100 | 100 | ATLAS: SRM User timeout over. Alice: monitoring glitch.
19/08/13 | 100 | 100 | 96 | 100 | 100 | ATLAS: SRM User timeout over.
20/08/13 | 100 | 100 | 98.4 | 100 | 100 | ATLAS: SRM User timeout over.
21/08/13 | 100 | 100 | 100 | 100 | 100 |
22/08/13 | 100 | 100 | 100 | 100 | 100 |
23/08/13 | 100 | 100 | 100 | -100 | 100 | CMS: monitoring glitch due to new CMS accounts not being mapped correctly.
24/08/13 | 100 | 100 | 96 | 92.99 | 100 | CMS: failed test on lcgce11, "software directory non existent or non readable". Atlas: SRM: "critical= 1 lcg_cp timed 120 seconds out"
25/08/13 | 100 | 100 | 98.15 | 100 | 100 | Atlas: SRM timeout at 2013-08-25T12:30:52Z, "lcg_cp timed 120 seconds out"
26/08/13 | 100 | 100 | 100 | 100 | 100 |
27/08/13 | 100 | 100 | 96.36 | 100 | 100 | Atlas: one SRM timeout at 2013-08-27T15:47:04Z, "lcg_cp timed 120 seconds out"
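As a cross-check, the fortnight's average availability per VO can be computed from the daily figures above. This is a minimal sketch, not part of the official availability calculation; in particular, treating the -100 monitoring-glitch entries as 100 is an assumption based on the comments in the table.

```python
# Daily availability figures transcribed from the table above (14/08 to 27/08).
atlas = [98.7, 100, 91, 100, 89, 96, 98.4, 100, 100, 100, 96, 98.15, 100, 96.36]
cms = [83.1, 100, 100, 100, 100, 100, 100, 100, 100, -100, 92.99, 100, 100, 100]
lhcb = [92, 95, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100]

def average(values):
    """Mean availability over the fortnight; -100 entries are monitoring
    glitches (see table comments) and are assumed to count as 100."""
    corrected = [100 if v == -100 else v for v in values]
    return round(sum(corrected) / len(corrected), 2)

print(average(atlas), average(cms), average(lhcb))  # → 97.4 98.29 99.07
```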