Revision as of 07:32, 9 October 2018
RAL Tier1 Operations Report for 8th October 2018
Review of Issues during the week 2nd October 2018 to the 8th October 2018.
- LFC
  - The LFC is STILL running - hurrah! We now believe that this problem has been properly fixed.
- FTS / gfal version issues
  - The newest version of gfal changed the way certificates are processed. This broke transfers to a significant number of sites. Tier-1 was ticketed to roll back the FTS service.
- New Castor tape service:
  - Tier-1 have successfully run the ATLAS performance tape test. Excellent results at 2GB/s. This was reported at the WLCG Archival Storage WG[1].
  - We were second only to IN2P3, who got 50% better throughput but had 36 dedicated T10KD drives compared to our 8 T10KD drives.
  - As we migrate more VOs to the new Castor instance we will have more drives available (up to 22), and this should mean we are likely to double our overall throughput.
[1] https://indico.cern.ch/event/756338/attachments/1723845/2784624/update-atlas-data-carousel-wlcg-wg.pdf
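The drive comparison above can be made concrete with a little arithmetic. The following is a rough sketch only: the throughput and drive counts come from this report, the IN2P3 figure is derived from the "50% better" statement, and linear scaling with drive count is an assumption used to obtain an upper bound, not a measured result.

```python
# Figures from the report; linear scaling with drive count is an assumption.
RAL_THROUGHPUT_GBS = 2.0                         # measured in the ATLAS tape test
RAL_DRIVES = 8                                   # T10KD drives used at RAL
IN2P3_THROUGHPUT_GBS = RAL_THROUGHPUT_GBS * 1.5  # "50% better throughput"
IN2P3_DRIVES = 36                                # dedicated T10KD drives at IN2P3
RAL_FUTURE_DRIVES = 22                           # "up to 22" after VO migration


def per_drive(throughput_gbs, drives):
    """Average throughput per tape drive in GB/s."""
    return throughput_gbs / drives


ral_per_drive = per_drive(RAL_THROUGHPUT_GBS, RAL_DRIVES)
in2p3_per_drive = per_drive(IN2P3_THROUGHPUT_GBS, IN2P3_DRIVES)

# Upper bound if throughput scaled linearly with the number of drives.
linear_upper_bound = ral_per_drive * RAL_FUTURE_DRIVES
# The report's more cautious expectation: roughly double today's figure.
doubled = 2 * RAL_THROUGHPUT_GBS

print(f"RAL per drive:   {ral_per_drive:.3f} GB/s")
print(f"IN2P3 per drive: {in2p3_per_drive:.3f} GB/s")
print(f"Linear upper bound with 22 drives: {linear_upper_bound:.1f} GB/s")
print(f"Doubled throughput estimate:       {doubled:.1f} GB/s")
```

Under these assumptions the doubling estimate sits below the linear upper bound, which is consistent with the cautious wording of the report.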
Current operational status and issues
Resolved Castor Disk Server Issues
Machine | VO | DiskPool | dxtx | Comments
- | - | - | - | -
Ongoing Castor Disk Server Issues
Machine | VO | DiskPool | dxtx | Comments
gdss747 | Atlas | atlasStripInput | d1t0 | Currently in intervention.
gdss753 | Atlas | atlasStripInput | d1t0 | Currently in intervention.
gdss738 | LHCb | LHCb_FAILOVER,LHCb-Disk | d1t0 | Currently in intervention.
Limits on concurrent batch system jobs.
|
Notable Changes made since the last meeting.
|
Entries in GOC DB starting since the last report.
Service | ID | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
- | - | - | - | - | - | - | -

Service | ID | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
- | - | - | - | - | - | - | -
- No ongoing downtime
- No downtime scheduled in the GOCDB for next 2 weeks
Advanced warning for other interventions
|
The following items are being discussed and are still to be formally scheduled and announced.
|
Listing by category:
- Castor:
  - Disk-only storage will end in Castor. A new single tape-only Castor instance (called WLCGTape) is being tested. This uses generic Castor headnodes on SL7, configured by Quattor/Aquilon, with a slightly newer Castor version.
  - Update systems to use SL7 and be configured by Quattor/Aquilon. (Tape servers done)
- Internal
  - DNS servers will be rolled out within the Tier1 network.
Open GGUS Tickets (Snapshot taken during morning of the meeting).
Request id | Affected vo | Status | Priority | Date of creation | Last update | Type of problem | Subject | Scope
137619 | cms | in progress | urgent | 07/10/2018 | 08/10/2018 | CMS_AAA WAN Access | T1_UK_RAL xrootd read failures | WLCG
137391 | atlas | in progress | urgent | 25/09/2018 | 05/10/2018 | Network problem | UK RAL-LCG2 transfer errors with Communication error on send | WLCG
137195 | ops | in progress | less urgent | 14/09/2018 | 05/10/2018 | Operations | [Rod Dashboard] Issues detected at RAL-LCG2 | EGI
137153 | t2k.org | in progress | urgent | 12/09/2018 | 25/09/2018 | Data Management - generic | LFC entry has file size 0, preventsw registering of additional replicas | EGI
136701 | lhcb | waiting for reply | very urgent | 14/08/2018 | 03/10/2018 | File Transfer | background of transfer errors | WLCG
136199 | lhcb | on hold | very urgent | 18/07/2018 | 01/10/2018 | File Transfer | Lots of submitted transfers on RAL FTS | WLCG
124876 | ops | in progress | less urgent | 07/11/2016 | 23/07/2018 | Operations | [Rod Dashboard] Issue detected : hr.srce.GridFTP-Transfer-ops@gridftp.echo.stfc.ac.uk | EGI
GGUS Tickets Closed Last week
Request id | Affected vo | Status | Priority | Date of creation | Last update | Type of problem | Subject | Scope
137565 | atlas | solved | less urgent | 03/10/2018 | 05/10/2018 | Other | failing handshake for transfers from CA-VICTORIA-WESTGRID-T2_DATADISK to UK RAL-LCG2-ECHO | WLCG
137498 | cms | solved | urgent | 01/10/2018 | 05/10/2018 | CMS_AAA WAN Access | Xrootd FileOpenErrors in production jobs | WLCG
137416 | none | verified | top priority | 26/09/2018 | 01/10/2018 | Other | This TEST ALARM has been raised for testing GGUS alarm work flow after a new GGUS release. | WLCG
137398 | cms | solved | urgent | 26/09/2018 | 03/10/2018 | CMS_Data Transfers | Transfers failing from SPRACE to RAL - No data available | WLCG
137294 | snoplus.snolab.ca | closed | very urgent | 20/09/2018 | 05/10/2018 | File Access | No copy on tape error on Castor at RAL | EGI
137267 | atlas | closed | urgent | 19/09/2018 | 03/10/2018 | Other | RAL-LCG2 transfer fail with "Communication error on send" | WLCG
137254 | ops | verified | less urgent | 18/09/2018 | 04/10/2018 | Operations | [Rod Dashboard] Issue detected : ch.cern.LFC-Write-ops@lfc.gridpp.rl.ac.uk | EGI
137047 | t2k.org | closed | less urgent | 06/09/2018 | 04/10/2018 | File Access | Connection to RAL tape storage fails | EGI
136840 | snoplus.snolab.ca | solved | very urgent | 23/08/2018 | 03/10/2018 | Other | Cannot upload files to LFN from Storage node | EGI
Day | Atlas | Atlas-Echo | CMS | LHCB | Alice | OPS | Comments
2018-10-02 | 91 | 99 | 100 | 100 | 100 | 100 |
2018-10-03 | 100 | 100 | 100 | 100 | 100 | 100 |
2018-10-04 | 100 | 100 | 100 | 100 | 100 | 100 |
2018-10-05 | 100 | 100 | 100 | 100 | 100 | 100 |
2018-10-06 | 100 | 100 | 100 | 100 | 100 | 100 |
2018-10-07 | 100 | 100 | 99 | 100 | 100 | 100 |
2018-10-08 | 100 | 100 | 100 | 100 | 100 | 100 |
Target Availability for each site is 97.0%
|
Red <90%
|
Orange <97%
|
Day | Atlas HC | CMS HC | Comment
2018-08-24 | 100 | 99 |
2018-08-25 | 100 | 99 |
2018-08-26 | 100 | 100 |
2018-08-27 | 100 | 100 |
2018-08-28 | 100 | 100 |
2018-08-29 | 100 | 99 |
2018-08-30 | 100 | 99 |
2018-08-01 | 100 | 99 |
Key: Atlas HC = Atlas HammerCloud (Queue RAL-LCG2_UCORE, Template 841); CMS HC = CMS HammerCloud