RAL Tier1 Operations Report for 29th April 2019
Review of Issues during the week 22nd April 2019 to the 29th April 2019.
|
- We are seeing high outbound packet loss over IPv6. Investigation has restarted now that the appropriate resources are back in the office.
- High CMS job failure rates. With the workloads CMS is currently submitting, the issue is not occurring at present. However, we are still looking at changes to the XRootD caches on the worker nodes (WNs) to improve performance.
- On Wednesday 24th April gdss738 (LHCb) crashed and was removed from production. It was returned to production at lunchtime on Friday 26th April.
Current operational status and issues
|
Resolved Castor Disk Server Issues
|
Machine | VO | DiskPool | dxtx | Comments
gdss738 | LHCb | lhcb | d1t0 | -
|
Ongoing Castor Disk Server Issues
|
Machine | VO | DiskPool | dxtx | Comments
- | - | - | - | -
|
Limits on concurrent batch system jobs.
|
Notable Changes made since the last meeting.
|
Entries in GOC DB starting since the last report.
|
Service | ID | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
- | - | - | - | - | - | - | -
|
Service | ID | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
- | - | - | - | - | - | - | -
|
Advanced warning for other interventions
|
The following items are being discussed and are still to be formally scheduled and announced.
|
Listing by category:
- DNS servers will be rolled out within the Tier1 network.
Open GGUS Tickets (Snapshot taken during the morning of the meeting).
|
Request id | Affected vo | Status | Priority | Date of creation | Last update | Type of problem | Subject | Scope | Solution
140870 | t2k.org | in progress | less urgent | 25/04/2019 | 29/04/2019 | Data Management - generic | Files vanished from RAL tape? | EGI |
140773 | lhcb | in progress | top priority | 18/04/2019 | 29/04/2019 | Storage Systems | Removal of Echo unbearably slow | WLCG |
140660 | cms | in progress | urgent | 09/04/2019 | 18/04/2019 | CMS_Central Workflows | File read issues for Workflows where data is located at T1_UK_RAL | WLCG |
140447 | dteam | on hold | less urgent | 27/03/2019 | 26/04/2019 | Network problem | packet loss outbound from RAL-LCG2 over IPv6 | EGI |
140220 | mice | in progress | less urgent | 15/03/2019 | 08/04/2019 | Other | mice LFC to DFC transition | EGI |
139672 | other | in progress | urgent | 13/02/2019 | 25/04/2019 | Middleware | No LIGO pilots running at RAL | EGI |
|
GGUS Tickets Closed Last week
|
Request id | Affected vo | Status | Priority | Date of creation | Last update | Type of problem | Subject | Scope | Solution
140887 | atlas | solved | urgent | 27/04/2019 | 27/04/2019 | File Transfer | UK RAL-LCG2 transfer error with: srm-ifce err: Communication error on send | WLCG | This is not a RAL issue, but a problem with Wuppertalprod already ticketed at https://ggus.eu/index.php?mode=ticket_info&ticket_id=140883 . Closing this ticket.
140758 | lhcb | solved | urgent | 17/04/2019 | 24/04/2019 | File Access | lhcbUser svcClass not working as it should ? | WLCG | Hi guys, I'm assuming I can now resolve this one again? Cheers D.
140577 | lhcb | closed | less urgent | 04/04/2019 | 25/04/2019 | File Access | LHCb disk only files requested with the wrong service class | EGI | No solution found so far. LHCb is close to migrating from the old CASTOR instance.
138665 | mice | closed | urgent | 04/12/2018 | 23/04/2019 | Middleware | Problem accessing LFC at RAL | EGI | As I understand it this ticket has been superseded by https://ggus.eu/?mode=ticket_info&ticket_id=140220. As such I'm closing this ticket. Please feel free to reopen this ticket if you disagree.
|
|
Day | Atlas | Atlas-Echo | CMS | LHCB | Alice | OPS | Comments
2019-04-22 | 100 | 100 | 100 | 100 | 100 | 100 |
2019-04-23 | 100 | 100 | 98 | 97 | 83 | 100 |
2019-04-24 | 100 | 100 | 100 | 100 | 100 | 100 |
2019-04-25 | 100 | 100 | 100 | 100 | 100 | 100 |
2019-04-26 | 100 | 100 | 100 | 100 | 100 | 100 |
2019-04-27 | 100 | 100 | 100 | 100 | 100 | 100 |
2019-04-28 | 100 | 100 | 100 | 100 | 100 | 100 |
2019-04-29 | 100 | 100 | 100 | 100 | 100 | 100 |
|
Target Availability for each site is 97.0%
|
Red <90%
|
Orange <97%
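As a rough illustration of how the colour key above could be applied to a day's figures, here is a minimal Python sketch. The function name `classify`, the band label `on target` and the dictionary layout are assumptions made for this example and are not part of any RAL tooling; the numbers are copied from the 2019-04-23 row of the availability table. It also assumes that a value of exactly 97 meets the target, since the key only marks values below 97% as orange.

```python
RED_BELOW = 90.0     # "Red <90%" from the key
ORANGE_BELOW = 97.0  # "Orange <97%" from the key; 97.0% is also the stated target


def classify(availability):
    """Return the colour band for one daily availability percentage."""
    if availability < RED_BELOW:
        return "red"
    if availability < ORANGE_BELOW:
        return "orange"
    # Assumed label: the key does not name the band at or above the target.
    return "on target"


# Availability figures for 2019-04-23, copied from the table.
day_2019_04_23 = {
    "Atlas": 100,
    "Atlas-Echo": 100,
    "CMS": 98,
    "LHCB": 97,
    "Alice": 83,
    "OPS": 100,
}

bands = {vo: classify(value) for vo, value in day_2019_04_23.items()}
for vo, band in bands.items():
    print(f"{vo}: {band}")
```

Under these assumptions only Alice (83) falls into the red band for that day; LHCB at 97 sits exactly on the target and is not flagged.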
|
Day | Atlas HC | CMS HC | Comment
2019-04-22 | - | 97 |
2019-04-23 | 100 | 100 |
2019-04-24 | 100 | n/a |
2019-04-25 | 100 | n/a |
2019-04-26 | 100 | n/a |
2019-04-27 | 100 | 99 |
2019-04-28 | 100 | 96 |
2019-04-29 | 100 | 96 |
Key: Atlas HC = Atlas HammerCloud (Queue RAL-LCG2_UCORE, Template 841); CMS HC = CMS HammerCloud