RAL Tier1 Operations Report for 11th February 2019
Review of Issues during the week 28th January 2019 to 5th February 2019.
- Our tape robot experienced a two-hour unscheduled outage on Wednesday. The outage was the result of a failed network switch (internal to the robot) and was resolved after diligent use of the power switch.
- CASTOR team are moving Alice to the new xrootd redirector setup.
- The garbage collection issues with the new wlcgTape, which were partly responsible for the problems NA62 experienced last week, are now understood and have been resolved.
- The ARC-CEs are intermittently reporting ‘unknown’ for SAM test results. We believe we understand the cause and that upgrading to the current release version will resolve it.
Current operational status and issues
Resolved Castor Disk Server Issues
Machine | VO | DiskPool | dxtx | Comments
- | - | - | - | -
Ongoing Castor Disk Server Issues
Machine | VO | DiskPool | dxtx | Comments
- | - | - | - | -
Limits on concurrent batch system jobs.
Notable Changes made since the last meeting.
Entries in GOC DB starting since the last report.
Service | ID | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
- | - | - | - | - | - | - | -

Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
- | - | - | - | - | - | -
Advance warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
Listing by category:
- DNS servers will be rolled out within the Tier1 network.
Open GGUS Tickets (Snapshot taken during morning of the meeting).
Request id | Affected vo | Status | Priority | Date of creation | Last update | Type of problem | Subject | Scope
139575 | cms | in progress | urgent | 07/02/2019 | 08/02/2019 | CMS_AAA WAN Access | T1_UK_RAL SAM xrootd reads failing | WLCG
139477 | ops | in progress | less urgent | 01/02/2019 | 07/02/2019 | Operations | [Rod Dashboard] Issue detected : org.nordugrid.ARC-CE-submit-ops@arc-ce04.gridpp.rl.ac.uk | EGI
139476 | mice | in progress | less urgent | 01/02/2019 | 06/02/2019 | Other | LFC dump | EGI
139306 | dteam | in progress | less urgent | 24/01/2019 | 29/01/2019 | Monitoring | perfsonar hosts need updating | EGI
138665 | mice | on hold | urgent | 04/12/2018 | 30/01/2019 | Middleware | Problem accessing LFC at RAL | EGI
138500 | cms | on hold | urgent | 26/11/2018 | 30/01/2019 | CMS_Data Transfers | Transfers failing from T2_PL_Swierk to RAL | WLCG
138361 | t2k.org | in progress | less urgent | 19/11/2018 | 31/01/2019 | Other | RAL-LCG2: t2k.org LFC to DFC transition | EGI
138033 | atlas | in progress | urgent | 01/11/2018 | 31/01/2019 | Other | singularity jobs failing at RAL | EGI
137897 | enmr.eu | on hold | urgent | 23/10/2018 | 31/01/2019 | Workload Management | enmr.eu accounting at RAL | EGI
GGUS Tickets Closed Last week
Request id | Affected vo | Status | Priority | Date of creation | Last update | Type of problem | Subject | Scope
139538 | cms | solved | urgent | 05/02/2019 | 05/02/2019 | CMS_Data Transfers | Some transfers failing to RAL - SRM_AUTHORIZATION_FAILURE | WLCG
139414 | lhcb | verified | very urgent | 30/01/2019 | 05/02/2019 | Other | Jobs Failed with Segmentation fault at RAL-LCG2 | WLCG
139405 | ops | verified | less urgent | 30/01/2019 | 05/02/2019 | Operations | [Rod Dashboard] Issue detected : org.bdii.GLUE2-Validate@site-bdii.gridpp.rl.ac.uk | EGI
139375 | atlas | solved | urgent | 29/01/2019 | 04/02/2019 | Other | RAL-LCG2 transfers fail with "the server responded with an error 500" | WLCG
139302 | atlas | closed | urgent | 24/01/2019 | 08/02/2019 | File Transfer | RAL: transfer issues between BNL and UK due to a wrong DNS alias? | WLCG
139245 | cms | solved | urgent | 21/01/2019 | 04/02/2019 | CMS_Data Transfers | Transfers failing from CNAF_Disk to RAL_Buffer | WLCG
139210 | cms | closed | urgent | 17/01/2019 | 08/02/2019 | CMS_Data Transfers | Transfers failing from CSCS to UCL - issue with RAL FTS | WLCG
139209 | cms | closed | urgent | 17/01/2019 | 08/02/2019 | CMS_AAA WAN Access | file open error at RAL | WLCG
138891 | ops | solved | less urgent | 17/12/2018 | 07/02/2019 | Operations | [Rod Dashboard] Issue detected : egi.eu.lowAvailability-/RAL-LCG2@RAL-LCG2_Availability | EGI
Day | Atlas | Atlas-Echo | CMS | LHCB | Alice | OPS | Comments
2019-02-05 | 100 | 100 | 100 | 100 | 100 | 100 |
2019-02-06 | 100 | 100 | 100 | 100 | 100 | 100 |
2019-02-07 | 100 | 100 | 100 | 100 | 100 | 100 |
2019-02-08 | 100 | 100 | 100 | 100 | 100 | 100 |
2019-02-09 | 100 | 100 | 100 | 100 | 100 | 100 |
2019-02-10 | 85 | 85 | 44 | 100 | 100 | 100 |
2019-02-11 | 29 | 29 | 8 | 100 | 98 | 100 |
2019-02-12 | 77 | 69 | 65 | 100 | 100 | 100 |
Target Availability for each site is 97.0%
Red <90%
Orange <97%
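The colour key above can be read as a simple threshold rule. The following is a minimal Python sketch of that rule, not the tool actually used to produce the report; the function name `availability_colour` and the label `"ok"` for values at or above the target are assumptions made for illustration.

```python
# Thresholds taken from the key: target 97.0%, red below 90%, orange below 97%.
TARGET = 97.0
RED_BELOW = 90.0


def availability_colour(value: float) -> str:
    """Classify a daily availability percentage against the report's key.

    The "ok" label is hypothetical: the report names no colour for
    values that meet the target.
    """
    if value < RED_BELOW:
        return "red"
    if value < TARGET:
        return "orange"
    return "ok"


if __name__ == "__main__":
    # CMS figures for 2019-02-09 to 2019-02-12, copied from the availability table.
    cms = {"2019-02-09": 100, "2019-02-10": 44, "2019-02-11": 8, "2019-02-12": 65}
    for day, value in cms.items():
        print(day, value, availability_colour(value))
```

Under this reading, a value of exactly 90 is orange and a value of exactly 97 meets the target.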
Day | Atlas HC | CMS HC | Comment
2019-01-23 | 100 | 98 |
2019-01-24 | 100 | 98 |
2019-01-25 | 100 | 98 |
2019-01-26 | 100 | 91 |
2019-01-27 | 100 | 97 |
2019-01-28 | 100 | 93 |
2019-01-29 | 100 | 98 |
Key: Atlas HC = Atlas HammerCloud (Queue RAL-LCG2_UCORE, Template 841); CMS HC = CMS HammerCloud