RAL Tier1 Operations Report for 14th January 2019
Review of Issues during the week 7th January 2019 to the 14th January 2019.
|
- With the minor exceptions below, it's still (suspiciously!) calm and peaceful at the Tier-1. Very much business as usual.
- Ongoing issues with the ARC CEs, not yet causing significant operational problems. We have identified a large number of LHCb MCFastSimulation jobs that are probably responsible for the load on the system.
- On Wednesday 9th January an Echo storage node started rebooting itself and was removed from the cluster. No loss of availability to the cluster. Engineer to attend site to fix.
Current operational status and issues
|
Resolved Castor Disk Server Issues
|
Machine | VO | DiskPool | dxtx | Comments
gdss804 | lhcb | lhcbDst | d1t0 | -
|
Ongoing Castor Disk Server Issues
|
Machine | VO | DiskPool | dxtx | Comments
- | - | - | - | -
|
Limits on concurrent batch system jobs.
|
Notable Changes made since the last meeting.
|
Entries in GOC DB starting since the last report.
|
Service | ID | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
- | - | - | - | - | - | - | -
|
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
- | - | - | - | - | - | -
|
Advance warning for other interventions
|
The following items are being discussed and are still to be formally scheduled and announced.
|
Listing by category:
- DNS servers will be rolled out within the Tier1 network.
Open GGUS Tickets (Snapshot taken during morning of the meeting).
|
Request id | Affected vo | Status | Priority | Date of creation | Last update | Type of problem | Subject | Scope
139081 | lhcb | in progress | top priority | 07/01/2019 | 14/01/2019 | Local Batch System | Aborted pilots for LHCb | WLCG
138891 | ops | waiting for reply | less urgent | 17/12/2018 | 09/01/2019 | Operations | [Rod Dashboard] Issue detected : egi.eu.lowAvailability-/RAL-LCG2@RAL-LCG2_Availability | EGI
138665 | mice | in progress | urgent | 04/12/2018 | 11/01/2019 | Middleware | Problem accessing LFC at RAL | EGI
138500 | cms | in progress | urgent | 26/11/2018 | 20/12/2018 | CMS_Data Transfers | Transfers failing from T2_PL_Swierk to RAL | WLCG
138361 | t2k.org | on hold | less urgent | 19/11/2018 | 15/01/2019 | Other | RAL-LCG2: t2k.org LFC to DFC transition | EGI
138033 | atlas | in progress | urgent | 01/11/2018 | 08/01/2019 | Other | singularity jobs failing at RAL | EGI
137897 | enmr.eu | waiting for reply | urgent | 23/10/2018 | 15/01/2019 | Workload Management | enmr.eu accounting at RAL | EGI
|
GGUS Tickets Closed Last week
|
Request id | Affected vo | Status | Priority | Date of creation | Last update | Type of problem | Subject | Scope
139108 | ops | solved | less urgent | 09/01/2019 | 15/01/2019 | Operations | [Rod Dashboard] Issue detected : org.nordugrid.ARC-CE-ARIS@arc-ce04.gridpp.rl.ac.uk | EGI
139106 | lhcb | verified | very urgent | 09/01/2019 | 10/01/2019 | File Transfer | FTS3 transfer problem at RAL-LCG2 | WLCG
138833 | none | verified | urgent | 13/12/2018 | 14/01/2019 | Storage Systems | Tape reporting is not being updated | EGI
|
Target Availability for each site is 97.0%
|
Red <90%
|
Orange <97%
|
Day | Atlas | Atlas-Echo | CMS | LHCB | Alice | OPS | Comments
2019-01-08 | 100 | 100 | 100 | 98 | 100 | -1 | Ref GGUS#138891
2019-01-09 | 100 | 100 | 100 | 100 | 98 | -1 | Ref GGUS#138891
2019-01-10 | 100 | 100 | 100 | 100 | 100 | -1 | Ref GGUS#138891
2019-01-11 | 100 | 100 | 100 | 100 | 100 | -1 | Ref GGUS#138891
2019-01-12 | 100 | 100 | 100 | 100 | 100 | -1 | Ref GGUS#138891
2019-01-13 | 100 | 100 | 100 | 100 | 100 | -1 | Ref GGUS#138891
2019-01-14 | 100 | 100 | 100 | 100 | 100 | -1 | Ref GGUS#138891
2019-01-15 | 100 | 100 | 100 | 100 | 100 | -1 | Ref GGUS#138891
2019-01-16 | 100 | 100 | 100 | 100 | 100 | -1 | Ref GGUS#138891
|
Target Availability for each site is 97.0%
|
Red <90%
|
Orange <97%
|
Day | Atlas HC | CMS HC | Comment
2019-01-07 | 100 | 99 |
2019-01-08 | 100 | 100 |
2019-01-09 | 100 | 99 |
2019-01-10 | 100 | 100 |
2019-01-11 | 100 | 100 |
2019-01-12 | 100 | 95 |
2019-01-13 | 100 | 98 |
2019-01-14 | 100 | 100 |
2019-01-14 | 100 | 100 |
Key: Atlas HC = Atlas HammerCloud (Queue RAL-LCG2_UCORE, Template 841); CMS HC = CMS HammerCloud
|
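The colour bands quoted with the availability tables (target 97.0%, Red below 90%, Orange below 97%) can be expressed as a small check. The following is only an illustrative sketch, not the tooling used to produce the report: the function name `classify`, the band labels `"ok"` and `"no data"`, and the reading of negative values such as `-1` as "no measurement available" are all assumptions made for this example.

```python
from typing import Optional

TARGET = 97.0     # target availability in percent, as stated in the report
RED_BELOW = 90.0  # "Red <90%" threshold from the report


def classify(availability: Optional[float]) -> str:
    """Return an illustrative colour band for one daily availability figure."""
    # Assumption: a missing or negative figure (e.g. -1) means no data.
    if availability is None or availability < 0:
        return "no data"
    if availability < RED_BELOW:
        return "red"
    if availability < TARGET:
        return "orange"
    # Hypothetical label: the report does not name the band at or above target.
    return "ok"


if __name__ == "__main__":
    # CMS HammerCloud figures taken from the table in this report.
    cms_hc = {"2019-01-12": 95, "2019-01-13": 98}
    for day, value in cms_hc.items():
        print(day, value, classify(value))
```

Under these assumptions, the only figure in the tables above that falls below the target is the CMS HammerCloud value of 95 on 2019-01-12, which lands in the Orange band.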