RAL Tier1 Operations Report for 2 October 2018
Review of Issues during the week 24th September 2018 to the 2nd October 2018.
- The ongoing LFC issue appeared to have been resolved early in the week (w/e 30th), so the service was returned to Production.
- First database breakage: 19th August (fixed 21st September). Second database breakage: Tuesday 25th September, 1pm.
- Currently we have reasonable confidence that LFC issues are now understood and there is a fix:
- In conjunction with Fabrizio Furano (former LFC code developer at CERN), we have identified a flaw in the Oracle to MySQL migration process.
- An Oracle SEQUENCE had not been migrated into a MySQL table, so it was only possible to register 6820 new records in the new MySQL backend each time before a crash.
- This explanation also accounts for the fact that rolling back the database to an earlier backup also worked.
- Fabrizio provided a simple solution to fix the problem.
- We are now planning to implement the fix and restore the LFC to a full production service by close of business 02/10/2018.
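The kind of check that would catch a missing sequence after a migration can be sketched as follows. This is an illustration only, not the fix Fabrizio provided (which this report does not detail): it uses an in-memory SQLite database as a stand-in for the MySQL backend, and the table and column names (`file_metadata`, `unique_id_seq`, `fileid`) are hypothetical.

```python
import sqlite3


def next_id(conn):
    """Emulate an Oracle-style SEQUENCE.NEXTVAL with a single-row counter table."""
    with conn:  # one transaction: increment the counter, then read it back
        cur = conn.execute("UPDATE unique_id_seq SET id = id + 1")
        if cur.rowcount != 1:
            raise RuntimeError("counter table is empty: sequence was not migrated")
        return conn.execute("SELECT id FROM unique_id_seq").fetchone()[0]


def sequence_is_consistent(conn):
    """Post-migration check: the counter must exist and be >= the highest id in use."""
    seq = conn.execute("SELECT id FROM unique_id_seq").fetchone()
    max_id = conn.execute(
        "SELECT COALESCE(MAX(fileid), 0) FROM file_metadata"
    ).fetchone()[0]
    return seq is not None and seq[0] >= max_id


# Hypothetical schema standing in for the migrated backend.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE file_metadata (fileid INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE unique_id_seq (id INTEGER NOT NULL)")
conn.executemany(
    "INSERT INTO file_metadata VALUES (?, ?)", [(1, "a"), (2, "b"), (3, "c")]
)
conn.commit()

# Data rows were migrated, but the sequence value was not carried over.
before = sequence_is_consistent(conn)

# Seed the counter from the existing data, then re-check and draw a new id.
conn.execute("INSERT INTO unique_id_seq (id) SELECT MAX(fileid) FROM file_metadata")
conn.commit()
after = sequence_is_consistent(conn)
new_id = next_id(conn)
```

Running a check like `sequence_is_consistent` immediately after a migration, before returning the service to Production, would flag a counter that is missing or behind the existing data.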
Current operational status and issues
- Tier-1 encountered another IPv6 issue (17/9/18, 23:00); however, Digital Solutions were able to resolve it within 2 hours.
Resolved Castor Disk Server Issues
Machine | VO | DiskPool | dxtx | Comments
- | - | - | - | -
Ongoing Castor Disk Server Issues
Machine | VO | DiskPool | dxtx | Comments
gdss747 | Atlas | atlasStripInput | d1t0 | Currently in intervention.
gdss753 | Atlas | atlasStripInput | d1t0 | Currently in intervention.
Limits on concurrent batch system jobs.
- GROUP_CMS_LIMIT = 4000
- GROUP_ATLAS_LIMIT = 8000
Notable Changes made since the last meeting.
Entries in GOC DB starting since the last report.
Service | ID | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
- | - | - | - | - | - | - | Investigating problems with the LFC - possibly in the back end database.
Service | ID | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
- | - | - | - | - | - | - | -
- No ongoing downtime
- No downtime scheduled in the GOCDB for next 2 weeks
Advanced warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
Listing by category:
- Castor:
  - Update systems to use SL7, configured by Quattor/Aquilon. (Tape servers done)
  - Move to generic Castor headnodes.
- Internal:
  - DNS servers will be rolled out within the Tier1 network.
Open GGUS Tickets (Snapshot taken during morning of the meeting).
Request id | Affected vo | Status | Priority | Date of creation | Last update | Type of problem | Subject | Scope
137498 | cms | in progress | urgent | 01/10/2018 | 02/10/2018 | CMS_AAA WAN Access | Xrootd FileOpenErrors in production jobs | WLCG
137398 | cms | in progress | urgent | 26/09/2018 | 01/10/2018 | CMS_Data Transfers | Transfers failing from SPRACE to RAL - No data available | WLCG
137391 | atlas | waiting for reply | urgent | 25/09/2018 | 28/09/2018 | Network problem | UK RAL-LCG2 transfer errors with Communication error on send | WLCG
137254 | ops | in progress | less urgent | 18/09/2018 | 01/10/2018 | Operations | [Rod Dashboard] Issue detected : ch.cern.LFC-Write-ops@lfc.gridpp.rl.ac.uk | EGI
137195 | ops | in progress | less urgent | 14/09/2018 | 26/09/2018 | Operations | [Rod Dashboard] Issues detected at RAL-LCG2 | EGI
137153 | t2k.org | in progress | urgent | 12/09/2018 | 25/09/2018 | Data Management - generic | LFC entry has file size 0, preventsw registering of additional replicas | EGI
136840 | snoplus.snolab.ca | waiting for reply | very urgent | 23/08/2018 | 28/09/2018 | Other | Cannot upload files to LFN from Storage node | EGI
136701 | lhcb | waiting for reply | very urgent | 14/08/2018 | 26/09/2018 | File Transfer | background of transfer errors | WLCG
136199 | lhcb | on hold | very urgent | 18/07/2018 | 01/10/2018 | File Transfer | Lots of submitted transfers on RAL FTS | WLCG
124876 | ops | in progress | less urgent | 07/11/2016 | 23/07/2018 | Operations | [Rod Dashboard] Issue detected : hr.srce.GridFTP-Transfer-ops@gridftp.echo.stfc.ac.uk | EGI
GGUS Tickets Closed Last week
Request id | Affected vo | Status | Priority | Date of creation | Last update | Type of problem | Subject | Scope
137434 | cms | solved | urgent | 26/09/2018 | 27/09/2018 | CMS_AAA WAN Access | xrootd host not reachable at T1_UK_RAL | WLCG
137416 | none | verified | top priority | 26/09/2018 | 01/10/2018 | Other | This TEST ALARM has been raised for testing GGUS alarm work flow after a new GGUS release. | WLCG
137169 | none | closed | less urgent | 13/09/2018 | 28/09/2018 | Monitoring | enable the nagios notifications - CVMFS | EGI
137128 | snoplus.snolab.ca | closed | urgent | 11/09/2018 | 28/09/2018 | File Access | RAL remote host name change? | EGI
136884 | t2k.org | verified | top priority | 27/08/2018 | 27/09/2018 | Data Management - generic | lcg-cr not working for t2k vo | EGI
136757 | mice | solved | less urgent | 17/08/2018 | 25/09/2018 | Other | Missing lsc files for mice VO on lfc.gridpp.rl.ac.uk ? | EGI
136366 | mice | verified | less urgent | 25/07/2018 | 28/09/2018 | Local Batch System | Remove MICE Queue from RAL T1 Batch | EGI
136028 | cms | solved | top priority | 10/07/2018 | 28/09/2018 | CMS_AAA WAN Access | Issues reading files at T1_UK_RAL_Disk | WLCG
Day | Atlas | Atlas-Echo | CMS | LHCB | Alice | OPS | Comments
2018-09-17 | 100 | 100 | 96 | 100 | 100 | 100 |
2018-09-18 | 100 | 100 | 100 | 100 | 100 | 100 |
2018-09-19 | 100 | 100 | 100 | 100 | 100 | 100 |
2018-09-20 | 100 | 100 | 100 | 100 | 100 | 100 |
2018-09-21 | 100 | 100 | 97 | 100 | 100 | 100 |
2018-09-22 | 100 | 100 | 96 | 100 | 100 | 100 |
2018-09-23 | 100 | 100 | 100 | 100 | 100 | 100 |
2018-09-24 | 100 | 100 | 100 | 100 | 100 | 100 |
Target Availability for each site is 97.0% | Red <90% | Orange <97%
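The colour key above can be applied mechanically. The following is a minimal sketch, assuming daily availability is given as a percentage; the function name `classify` and the label `"ok"` for values at or above target are illustrative choices, not part of the report. The sample values are the CMS figures from the availability table.

```python
TARGET = 97.0      # target availability for each site (percent)
RED_BELOW = 90.0   # values under this are flagged red


def classify(availability):
    """Map a daily availability percentage to the report's colour key."""
    if availability < RED_BELOW:
        return "red"
    if availability < TARGET:
        return "orange"
    return "ok"  # label for on-target days is an assumption


# CMS availability for three of the days listed in the table.
cms = {"2018-09-17": 96, "2018-09-21": 97, "2018-09-22": 96}
flags = {day: classify(value) for day, value in cms.items()}
```

With these thresholds, the two CMS days at 96% fall into the orange band, while 97% meets the target.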
Day | Atlas HC | CMS HC | Comment
2018-08-10 | 100 | 98 |
2018-08-11 | 100 | 98 |
2018-08-12 | 98 | 99 |
2018-08-13 | 95 | 98 |
2018-08-14 | 100 | 99 |
2018-08-15 | 79 | 99 |
2018-08-16 | 100 | 99 |
2018-08-17 | - | - |
Key: Atlas HC = Atlas HammerCloud (Queue RAL-LCG2_UCORE, Template 841); CMS HC = CMS HammerCloud