Latest revision as of 11:20, 4 September 2013
RAL Tier1 Operations Report for 4th September 2013
Review of Issues during the week 28th August to 4th September 2013.
- There were problems last Wednesday (28/08/2013) with the top level BDIIs. A workaround was put in place and the situation improved at the end of the afternoon.
- At around 08:00 this morning the number of Atlas jobs running on the farm dropped suddenly by around 3000. This is not yet understood.
Resolved Disk Server Issues
Current operational status and issues
- The uplink to the UKLight Router is running on a single 10Gbit link, rather than a pair of such links.
- The FTS3 testing has continued very actively. Atlas have moved the UK, German and French clouds to use it. Problems with FTS3 are being uncovered during these tests. Patches are being regularly applied to FTS3 to deal with issues. (The FTS3 servers were upgraded to version 3.1.5 1.el6 during this last week.)
- We are participating in xrootd federated access tests for Atlas. The server has now been successfully configured to work as an xroot redirector, whereas before it could only serve as a proxy.
- Testing is ongoing with the proposed new batch system (ARC-CEs, Condor, SL6). Atlas and CMS are running work through this. ALICE, LHCb & H1 are being brought on board with the testing.
- Atlas have reported a problem with file deletions running slowly. This was investigated but no cause was found. Atlas have introduced a workaround (shortening the time to re-try) but the underlying problem is not solved.
Ongoing Disk Server Issues
Notable Changes made this last fortnight.
- On Friday (30th Aug) FTS3 was updated to version 3.1.5 1.el6.
- Last Wednesday (28th Aug) the firmware on the EMC array under the Castor standby databases was successfully updated.
- LCGCE12 (CE for SL6 test Queue on the production batch farm) is in a long Outage ready for decommissioning.
- LCGWMS04 is in Outage for an upgrade to EMI-3. (Return to production is anticipated tomorrow.)
Advance warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
- The SL6 and "Whole Node" queues on the production batch service will be terminated. Multi-core jobs and those requiring SL6 can be run on the test Condor batch system.
- Re-establishing the paired (2*10Gbit) link to the UKLight router.
Listing by category:
- Databases:
  - Switch LFC/FTS/3D to new Database Infrastructure.
- Castor:
- Networking:
  - Single link to UKLight Router to be restored as paired (2*10Gbit) link.
  - Update core Tier1 network and change connection to site and OPN including:
    - Install new Routing layer for Tier1.
    - Change the way the Tier1 connects to the RAL network.
    - These changes will lead to the removal of the UKLight Router.
- Grid Services:
  - Testing of alternative batch systems (SLURM, Condor) along with ARC-CEs and SL6 Worker Nodes.
- Fabric:
  - One of the disk arrays hosting the FTS, LFC & Atlas 3D databases is showing a fault and an intervention is required - initially to update the disk array's firmware.
- Infrastructure:
  - A 2-day maintenance on the UPS along with the safety testing of associated electrical circuits is being planned for the 4th/5th November (TBC). The impact of this on our services is still being worked out. During this the following issues will be addressed:
    - Intervention required on the "Essential Power Board".
    - Remedial work on the BMS (Building Management System) due to one of its three modules being faulty.
    - Electrical safety check. This will take place over a couple of days during which time individual UPS circuits will need to be powered down.
Entries in GOC DB starting between the 28th August and 4th September 2013.
There was one unscheduled Warning in the GOC DB for our ongoing top BDII issues (this was ongoing at the time of last week's meeting).
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
lcgwms04.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 29/08/2013 17:00 | 06/09/2013 16:00 | 7 days, 23 hours | Upgrade to EMI-3
lcgbdii.gridpp.rl.ac.uk | UNSCHEDULED | WARNING | 28/08/2013 11:30 | 28/08/2013 16:00 | 4 hours and 30 minutes | We are currently investigating problems on 2 (out of 3) top BDII servers.
lcgce12.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 06/08/2013 13:00 | 05/09/2013 13:00 | 30 days | CE (and the SL6 batch queue behind it) being decommissioned.
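The Duration column can be cross-checked from the Start and End columns. The following is a minimal sketch, not part of the GOC DB tooling: the timestamps are copied from the entries above, while the names `DOWNTIMES`, `downtime_duration` and `describe` are hypothetical and introduced here only for illustration.

```python
from datetime import datetime

# Start/End values copied from the GOC DB entries above (DD/MM/YYYY HH:MM).
DOWNTIMES = {
    "lcgwms04.gridpp.rl.ac.uk": ("29/08/2013 17:00", "06/09/2013 16:00"),
    "lcgbdii.gridpp.rl.ac.uk": ("28/08/2013 11:30", "28/08/2013 16:00"),
    "lcgce12.gridpp.rl.ac.uk": ("06/08/2013 13:00", "05/09/2013 13:00"),
}

FMT = "%d/%m/%Y %H:%M"


def downtime_duration(start, end):
    """Return the length of a downtime as a datetime.timedelta."""
    return datetime.strptime(end, FMT) - datetime.strptime(start, FMT)


def describe(delta):
    """Format a timedelta as days, hours and minutes (hypothetical helper)."""
    hours, remainder = divmod(delta.seconds, 3600)
    minutes = remainder // 60
    return f"{delta.days} days, {hours} hours, {minutes} minutes"


if __name__ == "__main__":
    for host, (start, end) in DOWNTIMES.items():
        print(host, describe(downtime_duration(start, end)))
```

Run as a script, this prints one line per host, which can be compared by eye against the Duration column.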
Open GGUS Tickets (Snapshot at time of meeting)
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject
97025 | Green | Less urgent | In Progress | 2013-09-03 | 2013-09-04 | | Myproxy server certificate does not contain hostname
96968 | Green | Less urgent | In Progress | 2013-08-31 | 2013-08-31 | CMS | Black Node at T1_UK_RAL
96235 | Red | Less urgent | Waiting for reply | 2013-07-29 | 2013-08-30 | hyperk.org | LFC for hyperk.org
96233 | Red | Less urgent | Waiting for reply | 2013-07-29 | 2013-08-28 | hyperk.org | WMS for hyperk.org - RAL
95996 | Red | Urgent | In Progress | 2013-07-22 | 2013-08-23 | OPS | SHA-2 test failing on lcgce01
91658 | Red | Less urgent | On Hold | 2013-02-20 | 2013-08-09 | | LFC webdav support
86152 | Red | Less urgent | On Hold | 2012-09-17 | 2013-06-17 | | correlated packet-loss on perfsonar host
Day | OPS | Alice | Atlas | CMS | LHCb | Comment
28/08/13 | 100 | 100 | 100 | 100 | 100 |
29/08/13 | 100 | 100 | 99.1 | 100 | -100 | Atlas: Single SRM Test failure: (Error reading token data header); LHCb: Test problem.
30/08/13 | 100 | 100 | 98.3 | 100 | 100 | SRM Test failure ("Raising Timeout")
31/08/13 | 100 | 100 | 97.4 | 100 | 100 | SRM Test failure ("Raising Timeout")
01/09/13 | 100 | 100 | 99.1 | 100 | 100 | SRM Test failure ("Raising Timeout")
02/09/13 | 100 | 100 | 100 | 100 | 100 |
03/09/13 | 100 | 100 | 91.7 | 96.4 | 95.9 | Atlas: SRM Test failures ("Raising Timeout"); CMS & LHCb: Each show "Single test failure (Error reading token data header: Connection closed)" - but at different times.
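The daily figures above can be condensed into a per-VO mean for the week. The sketch below is illustrative only: the values are copied from the table, the names `AVAILABILITY` and `weekly_mean` are hypothetical, and it assumes that a negative entry (the LHCb value of -100 on 29/08/13, commented as "Test problem") marks an invalid reading that should be left out of the average. That interpretation is not stated in the report.

```python
# Daily availability (%) per VO for 28/08/13 to 03/09/13, copied from the table above.
AVAILABILITY = {
    "OPS":   [100, 100, 100, 100, 100, 100, 100],
    "Alice": [100, 100, 100, 100, 100, 100, 100],
    "Atlas": [100, 99.1, 98.3, 97.4, 99.1, 100, 91.7],
    "CMS":   [100, 100, 100, 100, 100, 100, 96.4],
    "LHCb":  [100, -100, 100, 100, 100, 100, 95.9],
}


def weekly_mean(values):
    """Mean of the valid readings, or None if there are none.

    Assumption: negative values flag a test problem, not a real
    measurement, so they are skipped.
    """
    valid = [v for v in values if v >= 0]
    return sum(valid) / len(valid) if valid else None


if __name__ == "__main__":
    for vo, values in AVAILABILITY.items():
        print(f"{vo}: {weekly_mean(values):.1f}")
```

If the negative entry should instead count as zero availability, the filter in `weekly_mean` would need to change accordingly.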