RAL Tier1 Operations Report for 29th April 2019
Review of Issues during the week 22nd April 2019 to the 29th April 2019.
|
- We are seeing high outbound packet loss over IPv6. Investigation has restarted now that the appropriate staff are back in the office.
- High CMS job failure rates: with the workloads CMS is currently submitting, this is not presently an issue. We are, however, still looking at changes to the XrootD caches on the worker nodes (WNs) to improve performance.
- On Wednesday 24th April, gdss738 (LHCb) crashed and was removed from production. It was returned to production at lunchtime on Friday 26th April.
Current operational status and issues
|
Resolved Castor Disk Server Issues
|
Machine | VO | DiskPool | dxtx | Comments
gdss738 | LHCb | lhcb | d1t0 | -
|
Ongoing Castor Disk Server Issues
|
Machine | VO | DiskPool | dxtx | Comments
- | - | - | - | -
|
Limits on concurrent batch system jobs.
|
Notable Changes made since the last meeting.
|
Entries in GOC DB starting since the last report.
|
Service | ID | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
- | - | - | - | - | - | - | -
|
Service | ID | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
- | - | - | - | - | - | - | -
|
Advance warning for other interventions
|
The following items are being discussed and are still to be formally scheduled and announced.
|
Listing by category:
- DNS servers will be rolled out within the Tier1 network.
Open GGUS Tickets (snapshot taken during the morning of the meeting).
|
Request id | Affected vo | Status | Priority | Date of creation | Last update | Type of problem | Subject | Scope | Solution
140870 | t2k.org | in progress | less urgent | 25/04/2019 | 29/04/2019 | Data Management - generic | Files vanished from RAL tape? | EGI |
140773 | lhcb | in progress | top priority | 18/04/2019 | 29/04/2019 | Storage Systems | Removal of Echo unbearably slow | WLCG |
140660 | cms | in progress | urgent | 09/04/2019 | 18/04/2019 | CMS_Central Workflows | File read issues for Workflows where data is located at T1_UK_RAL | WLCG |
140447 | dteam | on hold | less urgent | 27/03/2019 | 26/04/2019 | Network problem | packet loss outbound from RAL-LCG2 over IPv6 | EGI |
140220 | mice | in progress | less urgent | 15/03/2019 | 08/04/2019 | Other | mice LFC to DFC transition | EGI |
139672 | other | in progress | urgent | 13/02/2019 | 25/04/2019 | Middleware | No LIGO pilots running at RAL | EGI |
|
GGUS Tickets Closed Last week
|
Request id | Affected vo | Status | Priority | Date of creation | Last update | Type of problem | Subject | Scope | Solution
140758 | lhcb | solved | urgent | 17/04/2019 | 17/04/2019 | File Access | lhcbUser svcClass not working as it should? | WLCG | Should be fixed now.
140725 | cms | solved | urgent | 15/04/2019 | 16/04/2019 | CMS_Facilities | T1_UK_RAL intermittent xrootd relative failures | WLCG | Reason is clear; additional hardware is on the way.
140599 | lhcb | verified | very urgent | 05/04/2019 | 18/04/2019 | File Access | Data access problem at RAL-LCG2 | WLCG | Files have been transferred out of this diskserver into ECHO.
140589 | lhcb | verified | very urgent | 04/04/2019 | 15/04/2019 | Local Batch System | Pilots killed at RAL-LCG2 | WLCG | As per Raja's comments, this original issue has now been resolved so ticket is being closed.
140511 | cms | closed | urgent | 01/04/2019 | 16/04/2019 | CMS_Facilities | T1_UK_RAL SAM job run out of date | WLCG | Issue is related to SAM dashboard.
140493 | atlas | closed | less urgent | 29/03/2019 | 15/04/2019 | File Transfer | UK RAL-LCG2 MCTAPE: transfer error with "Connection timed out," | WLCG | Hi xin wang, This looks more like a problem with the source site, LRZ-LMU_DATADISK. The error message refers to the source URL at httpg://lcg-lrz-srm.grid.lrz.de:8443/srm/managerv2. Also, if I try to download from the source path, it gets stuck: gfal-copy srm://lcg-lrz-srm.grid.lrz.de:8443/srm/managerv2?SFN=/pnfs/lrz-muenchen.de/data/atlas/dq2/atlasdatadisk/rucio/mc16_13TeV/35/cb/HITS.17137527._002530.pool.root.1 . Can you assign a new ticket for LRZ-LMU? I hope it's OK for me to mark this ticket "solved". Please reopen if I was mistaken. Thanks, Tim.
140467 | cms | closed | urgent | 28/03/2019 | 15/04/2019 | CMS_Data Transfers | Stuck file at RAL | WLCG | Stuck file had missing stripes and a zeroth stripe with zero size. This was deleted by hand and the errors stopped appearing.
138665 | mice | closed | urgent | 04/12/2018 | 23/04/2019 | Middleware | Problem accessing LFC at RAL | EGI | As I understand it, this ticket has been superseded by https://ggus.eu/?mode=ticket_info&ticket_id=140220. As such I'm closing this ticket. Please feel free to reopen this ticket if you disagree.
138033 | atlas | solved | urgent | 01/11/2018 | 18/04/2019 | Other | singularity jobs failing at RAL | EGI | Hi James, looks good. The test I sent succeeded. Closing ticket. cheers alessandra
|
|
Day | Atlas | Atlas-Echo | CMS | LHCB | Alice | OPS | Comments
2019-04-15 | 100 | 100 | 99 | 100 | 100 | 100 |
2019-04-16 | 100 | 100 | 100 | 100 | 100 | 100 |
2019-04-17 | 100 | 100 | 100 | 100 | 100 | 100 |
2019-04-18 | 100 | 100 | 100 | 100 | 100 | 100 |
2019-04-19 | 100 | 100 | 100 | 100 | 100 | 100 |
2019-04-20 | 100 | 100 | 100 | 100 | 100 | 100 |
2019-04-21 | 100 | 100 | 100 | 100 | 100 | 100 |
2019-04-22 | 100 | 100 | 100 | 100 | 100 | 100 |
2019-04-23 | 100 | 100 | 98 | 97 | 83 | 100 |
|
Target Availability for each site is 97.0%
|
Red <90%
|
Orange <97%
|
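The colour bands above can be applied mechanically to a row of the availability table. The following is a minimal sketch, not the tooling actually used to produce the report: the function name `classify_availability` and the label "ok" for values at or above target are assumptions made for illustration. The sample values are the 2019-04-23 row from the availability table.

```python
def classify_availability(percent: float) -> str:
    """Return the colour band for a daily availability percentage.

    Bands follow the legend in this report: Red below 90%, Orange below
    the 97.0% target. The label "ok" is an assumed name for everything else.
    """
    if percent < 90.0:
        return "red"
    if percent < 97.0:
        return "orange"
    return "ok"


# Values copied from the 2019-04-23 row of the availability table.
day = {"Atlas": 100, "Atlas-Echo": 100, "CMS": 98, "LHCB": 97, "Alice": 83, "OPS": 100}

# Map each column to its band so that below-target entries stand out.
flags = {vo: classify_availability(value) for vo, value in day.items()}
below_target = [vo for vo, band in flags.items() if band != "ok"]
print(below_target)
```

Under these assumptions, a value exactly equal to the 97.0% target is treated as meeting it, since the legend only flags values strictly below the thresholds.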
Day | Atlas HC | CMS HC | Comment
2019-04-15 | - | 80 |
2019-04-16 | - | 96 |
2019-04-17 | - | 100 |
2019-04-18 | - | 98 |
2019-04-19 | - | 98 |
2019-04-20 | - | 99 |
2019-04-21 | - | 97 |
2019-04-22 | - | 98 |
2019-04-23 | - | 90 |
|
Key: Atlas HC = Atlas HammerCloud (Queue RAL-LCG2_UCORE, Template 841); CMS HC = CMS HammerCloud