RAL Tier1 Operations Report for 5th May 2019
Review of Issues during the week 29th April 2019 to the 5th May 2019.
- We are seeing high outbound packet loss over IPv6. Investigation has restarted now that the relevant staff are back in the office.
- High CMS job failure rates. With the workloads CMS is currently submitting, this is not causing a problem at present. However, we are still looking at making changes to the XRootD caches on the worker nodes (WN) to improve performance.
- On Wednesday 24th April gdss738 (LHCb) crashed and was removed from production. It was returned to production at lunchtime on Friday 26th April.
Current operational status and issues
Resolved Castor Disk Server Issues
Machine | VO | DiskPool | dxtx | Comments
- | - | - | - | -
Ongoing Castor Disk Server Issues
Machine | VO | DiskPool | dxtx | Comments
- | - | - | - | -
Limits on concurrent batch system jobs.
Notable Changes made since the last meeting.
Entries in GOC DB starting since the last report.
Service | ID | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
- | - | - | - | - | - | - | -
Service | ID | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
- | - | - | - | - | - | - | -
Advanced warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
Listing by category:
- DNS servers will be rolled out within the Tier1 network.
Open GGUS Tickets (snapshot taken during the morning of the meeting).
Request id | Affected vo | Status | Priority | Date of creation | Last update | Type of problem | Subject | Scope | Solution
140870 | t2k.org | in progress | less urgent | 25/04/2019 | 08/05/2019 | Data Management - generic | Files vanished from RAL tape? | EGI |
140773 | lhcb | in progress | top priority | 18/04/2019 | 08/05/2019 | Storage Systems | Removal of Echo unbearably slow | WLCG |
140447 | dteam | on hold | less urgent | 27/03/2019 | 08/05/2019 | Network problem | packet loss outbound from RAL-LCG2 over IPv6 | EGI |
140220 | mice | in progress | less urgent | 15/03/2019 | 08/04/2019 | Other | mice LFC to DFC transition | EGI |
139672 | other | in progress | urgent | 13/02/2019 | 30/04/2019 | Middleware | No LIGO pilots running at RAL | EGI |
GGUS Tickets Closed Last Week
Request id | Affected vo | Status | Priority | Date of creation | Last update | Type of problem | Subject | Scope | Solution
140965 | cms | solved | urgent | 02/05/2019 | 03/05/2019 | CMS_Data Transfers | Datatransfers T2_AT_VIenna -> T1_UK_RAL_Buffer failing | WLCG | Possible corruption of the file coming from Vienna may have made the file undeletable at RAL. The file was replaced with a new copy at Vienna and the transfer in the debug stream is now green.
140725 | cms | closed | urgent | 15/04/2019 | 30/04/2019 | CMS_Facilities | T1_UK_RAL intermittent xrootd relative failures | WLCG | Reason is clear; additional hardware is on the way.
140683 | lhcb | closed | top priority | 10/04/2019 | 29/04/2019 | Local Batch System | Pilots failing at RAL across all CEs | WLCG | Problem was resolved and checks put in place to prevent recurrence.
140660 | cms | solved | urgent | 09/04/2019 | 02/05/2019 | CMS_Central Workflows | File read issues for Workflows where data is located at T1_UK_RAL | WLCG | The gridmap has been replaced and we appear to be working again. As such I'm going to close this ticket, as implementation of the full solution is beyond the scope of this ticket.
138033 | atlas | closed | urgent | 01/11/2018 | 02/05/2019 | Other | singularity jobs failing at RAL | EGI | Hi James, looks good. The test I sent succeeded. Closing ticket. Cheers, Alessandra
Day | Atlas | Atlas-Echo | CMS | LHCB | Alice | OPS | Comments
2019-04-22 | 100 | 100 | 100 | 100 | 100 | 100 |
2019-04-23 | 100 | 100 | 98 | 97 | 83 | 100 |
2019-04-24 | 100 | 100 | 100 | 100 | 100 | 100 |
2019-04-25 | 100 | 100 | 100 | 100 | 100 | 100 |
2019-04-26 | 100 | 100 | 100 | 100 | 100 | 100 |
2019-04-27 | 100 | 100 | 100 | 100 | 100 | 100 |
2019-04-28 | 100 | 100 | 100 | 100 | 100 | 100 |
2019-04-29 | 100 | 100 | 100 | 100 | 100 | 100 |
Target Availability for each site is 97.0%. Red <90%; Orange <97%.
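As a rough illustration of how the colour thresholds apply, the sketch below classifies the availability figures reported for 2019-04-23. The function name `availability_colour` and the label "ok" for values at or above target are assumptions made for this example; the report itself only defines the Red and Orange bands.

```python
def availability_colour(percent: float) -> str:
    """Map a daily availability percentage to the report's colour bands.

    Thresholds come from the report: Red < 90%, Orange < 97%.
    The "ok" label for values at or above 97% is an assumption.
    """
    if percent < 90:
        return "red"
    if percent < 97:
        return "orange"
    return "ok"


# Availability figures for 2019-04-23, copied from the table in this report.
day_2019_04_23 = {
    "Atlas": 100,
    "Atlas-Echo": 100,
    "CMS": 98,
    "LHCB": 97,
    "Alice": 83,
    "OPS": 100,
}

# Classify every VO for that day.
colours = {vo: availability_colour(value) for vo, value in day_2019_04_23.items()}

for vo, colour in colours.items():
    print(f"{vo}: {day_2019_04_23[vo]}% -> {colour}")
```

Under these thresholds only Alice (83%) falls into the Red band on that day; LHCB at exactly 97% meets the target because the Orange band is strictly below 97%.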
Day | Atlas HC | CMS HC | Comment
2019-04-22 | - | 97 |
2019-04-23 | 100 | 100 |
2019-04-24 | 100 | n/a |
2019-04-25 | 100 | n/a |
2019-04-26 | 100 | n/a |
2019-04-27 | 100 | 99 |
2019-04-28 | 100 | 96 |
2019-04-29 | 100 | 96 |
Key: Atlas HC = Atlas HammerCloud (Queue RAL-LCG2_UCORE, Template 841); CMS HC = CMS HammerCloud