RAL Tier1 Operations Report for 10th December 2018
Review of Issues during the week 4th December to the 10th December 2018.
- Argo tests for CMS Castor failed on Monday and Tuesday last week (3rd and 4th December) because of a BDII problem: it stopped publishing information.
- There was a successful load test of the generator on Wednesday.
- ~5% of SAM tests via GridFTP against Echo have been failing with the “Address already in use” error. We are investigating and have disabled NIS on the gateways, as it is not needed and was using ~13000 ports. We are monitoring to see whether the situation improves.
- CMS successfully migrated to the new consolidated Castor tape instance on Thursday (6th December).
- The physical machine hosting MySQL databases for the Tier-1 (RT ticket system + LFC) died on Thursday. The service was restored from backup on Friday on a VM.
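For readers unfamiliar with the “Address already in use” error mentioned above, the following is a minimal Python sketch of one way it arises at the socket level: a second socket tries to bind to a port that is already held. It is illustrative only and does not reproduce the GridFTP gateway configuration; the function name `bind_twice` and the use of the loopback address are assumptions made for the example.

```python
import errno
import socket


def bind_twice():
    """Bind two sockets to the same address; return the errno of the second bind.

    Hypothetical helper for illustration only. Returns None if the second
    bind unexpectedly succeeds.
    """
    first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        # Port 0 asks the OS to pick any free port on the loopback interface.
        first.bind(("127.0.0.1", 0))
        first.listen(1)
        port = first.getsockname()[1]
        try:
            # The port is now held by `first`, so this bind should fail.
            second.bind(("127.0.0.1", port))
        except OSError as exc:
            return exc.errno
        return None
    finally:
        first.close()
        second.close()


if __name__ == "__main__":
    code = bind_twice()
    print("second bind failed with EADDRINUSE:", code == errno.EADDRINUSE)
```

A service that holds a large number of ports can cause the same error for other processes that need ports from the same range, which is why freeing ports is a reasonable first step.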
Current operational status and issues
Resolved Castor Disk Server Issues
Machine | VO | DiskPool | dxtx | Comments
- | - | - | - | -
Ongoing Castor Disk Server Issues
Machine | VO | DiskPool | dxtx | Comments
- | - | - | - | -
Limits on concurrent batch system jobs.
Notable Changes made since the last meeting.
Entries in GOC DB starting since the last report.
Service | ID | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
- | - | - | - | - | - | - | -
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
- | - | - | - | - | - | -
Advanced warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
Listing by category:
- DNS servers will be rolled out within the Tier1 network.
Open GGUS Tickets (Snapshot taken during morning of the meeting).
Request id | Affected vo | Status | Priority | Date of creation | Last update | Type of problem | Subject | Scope
138762 | cms | on hold | urgent | 10/12/2018 | 10/12/2018 | CMS_Data Transfers | Transfers failing from FNAL to RAL_Disk | WLCG
138760 | cms | in progress | urgent | 10/12/2018 | 10/12/2018 | CMS_Data Transfers | Transfers failing from RAL to CCIN2P3_Disk | WLCG
138758 | none | in progress | less urgent | 10/12/2018 | 10/12/2018 | File Transfer | RAL-LCG2-ECHO: No such file or directory | EGI
138736 | cms | in progress | urgent | 07/12/2018 | 10/12/2018 | CMS_Facilities | T1_UK_RAL intermittent SRM VOGet/VOPut failures | WLCG
138665 | mice | waiting for reply | urgent | 04/12/2018 | 05/12/2018 | Middleware | Problem accessing LFC at RAL | EGI
138500 | cms | in progress | urgent | 26/11/2018 | 07/12/2018 | CMS_Data Transfers | Transfers failing from T2_PL_Swierk to RAL | WLCG
138361 | t2k.org | in progress | less urgent | 19/11/2018 | 07/12/2018 | Other | RAL-LCG2: t2k.org LFC to DFC transition | EGI
138033 | atlas | in progress | urgent | 01/11/2018 | 30/11/2018 | Other | singularity jobs failing at RAL | EGI
137897 | enmr.eu | in progress | less urgent | 23/10/2018 | 28/11/2018 | Accounting | enmr.eu accounting at RAL | EGI
GGUS Tickets Closed Last week
Request id | Affected vo | Status | Priority | Date of creation | Last update | Type of problem | Subject | Scope
138549 | atlas | solved | top priority | 28/11/2018 | 28/11/2018 | Other | This TEST ALARM has been raised for testing GGUS alarm work flow after a new GGUS release. | WLCG
138331 | cms | closed | urgent | 16/11/2018 | 03/12/2018 | CMS_Data Transfers | Posible expired proxy at RAL | WLCG
138327 | cms | solved | urgent | 16/11/2018 | 27/11/2018 | CMS_Data Transfers | RAL FTS reporting connection issue with many hosts | WLCG
138315 | cms | closed | urgent | 15/11/2018 | 03/12/2018 | CMS_Data Transfers | Transfers failing from T2_US_Wisconsin to T1_UK_RAL_Disk | WLCG
138218 | cms | closed | urgent | 09/11/2018 | 28/11/2018 | CMS_Data Transfers | Transfers failing from RAL_Buffer to TIFR | WLCG
137822 | lhcb | solved | top priority | 18/10/2018 | 04/12/2018 | File Transfer | FTS server seems in bad state. | WLCG
Target Availability for each site is 97.0% (Red <90%, Orange <97%)
Day | Atlas | Atlas-Echo | CMS | LHCB | Alice | OPS | Comments
2018-11-28 | 100 | 100 | 99 | 100 | 100 | -1 |
2018-11-29 | 100 | 100 | 98 | 99 | 100 | -1 |
2018-11-30 | 100 | 100 | 98 | 100 | 100 | -1 |
2018-12-01 | 100 | 100 | 95 | 100 | 100 | -1 |
2018-12-02 | 100 | 100 | 95 | 100 | 100 | -1 |
2018-12-03 | 100 | 100 | 95 | 100 | 100 | 66.4 |
2018-12-04 | 100 | 100 | 96 | 100 | 100 | 90.625 |
Target Availability for each site is 97.0% (Red <90%, Orange <97%)
Day | Atlas HC | CMS HC | Comment
2018-11-28 | 100 | 100 |
2018-11-29 | 100 | 99 |
2018-11-30 | 100 | 99 |
2018-12-01 | 100 | 100 |
2018-12-02 | 100 | 99 |
2018-12-03 | 100 | 99 |
2018-12-04 | 100 | 99 |
Key: Atlas HC = Atlas HammerCloud (Queue RAL-LCG2_UCORE, Template 841); CMS HC = CMS HammerCloud
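The colour thresholds quoted with the availability tables (target 97.0%, Red <90%, Orange <97%) can be expressed as a small classifier. This is a sketch for illustration, not the tool that produces the report: the function name `classify` and the label strings are hypothetical, and treating a negative value such as -1 as "no data" is an assumption, since the report does not say what -1 means.

```python
TARGET = 97.0     # target availability, from the report
RED_BELOW = 90.0  # values below this are Red, from the report


def classify(availability):
    """Map a daily availability percentage to a status label (hypothetical labels)."""
    if availability < 0:
        # Assumption: negative values (e.g. -1) mark days with no measurement.
        return "no data"
    if availability < RED_BELOW:
        return "red"
    if availability < TARGET:
        return "orange"
    return "ok"


if __name__ == "__main__":
    # OPS column values taken from the availability table in this report.
    for day, value in [("2018-12-02", -1), ("2018-12-03", 66.4), ("2018-12-04", 90.625)]:
        print(day, value, classify(value))
```

Under these thresholds a value exactly at 97.0 meets the target and a value exactly at 90.0 is Orange rather than Red, because both comparisons are strict.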