Revision as of 14:46, 21 August 2018
RAL Tier1 Operations Report for 20th August 2018
Review of Issues during the week 13th August to the 20th August 2018.
|
- The upgrade of Echo was completed successfully on Thursday (16/8/18), with greatly reduced memory usage. The cluster was allowed to recover overnight. Everything appeared to be working well on Friday (17/8/18), and there is currently no evidence of data loss. We therefore ended the downtime at 12:00 UTC on Friday (17/8/18). As a precaution for the weekend, we limited the ATLAS (and CMS) quotas on our batch farm to 50% of their nominal amounts. Assuming we encounter no problems, we intend to lift this limit on Monday (20/8/18).
Current operational status and issues
|
- The new shiny O2 SIM for the SMS service has been delivered and installed.
Resolved Castor Disk Server Issues
|
Machine | VO | DiskPool | dxtx | Comments
- | - | - | - | -
Ongoing Castor Disk Server Issues
|
Machine | VO | DiskPool | dxtx | Comments
gdss747 | Atlas | atlasStripInput | d1t0 | Currently in intervention.
Limits on concurrent batch system jobs.
|
GROUP_CMS_LIMIT = 4000
GROUP_ATLAS_LIMIT = 8000
Notable Changes made since the last meeting.
|
Entries in GOC DB starting since the last report.
|
Service | ID | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
- | - | - | - | - | - | - | -
Service | ID | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
- | - | - | - | - | - | - | -
- No ongoing downtime
- No downtime scheduled in the GOCDB for the next 2 weeks
Advance warning for other interventions
|
The following items are being discussed and are still to be formally scheduled and announced.
|
Listing by category:
- Castor:
  - Update systems to use SL7, configured by Quattor/Aquilon. (Tape servers done)
  - Move to generic Castor headnodes.
- Internal:
  - DNS servers will be rolled out within the Tier1 network.
Open GGUS Tickets (Snapshot taken during morning of the meeting).
|
Request id | Affected vo | Status | Priority | Date of creation | Last update | Type of problem | Subject | Scope
136757 | mice | in progress | less urgent | 17/08/2018 | 21/08/2018 | Other | Missing lsc files for mice VO on lfc.gridpp.rl.ac.uk ? | EGI
136701 | lhcb | in progress | very urgent | 14/08/2018 | 21/08/2018 | File Transfer | background of transfer errors | WLCG
136366 | mice | in progress | less urgent | 25/07/2018 | 20/08/2018 | Local Batch System | Remove MICE Queue from RAL T1 Batch | EGI
136199 | lhcb | in progress | very urgent | 18/07/2018 | 07/08/2018 | File Transfer | Lots of submitted transfers on RAL FTS | WLCG
136028 | cms | in progress | top priority | 10/07/2018 | 21/08/2018 | CMS_AAA WAN Access | Issues reading files at T1_UK_RAL_Disk | WLCG
124876 | ops | in progress | less urgent | 07/11/2016 | 23/07/2018 | Operations | [Rod Dashboard] Issue detected : hr.srce.GridFTP-Transfer-ops@gridftp.echo.stfc.ac.uk | EGI
GGUS Tickets Closed Last week
|
Request id | Affected vo | Status | Priority | Date of creation | Last update | Type of problem | Subject | Scope
136655 | lhcb | verified | less urgent | 10/08/2018 | 15/08/2018 | File Access | Missing File At RAL | WLCG
136460 | cms | closed | urgent | 30/07/2018 | 15/08/2018 | CMS_Data Transfers | Transfers failing to RAL_Buffer | WLCG
136427 | atlas | closed | urgent | 28/07/2018 | 13/08/2018 | File Transfer | UK RAL-LCG2: Transfer errors as destination | WLCG
136408 | cms | closed | urgent | 27/07/2018 | 15/08/2018 | CMS_Data Transfers | missing files at RAL | WLCG
Target Availability for each site is 97.0% | Red <90% | Orange <97%
Day | Atlas | Atlas-Echo | CMS | LHCB | Alice | OPS | Comments
2018-08-13 | 100 | 100 | 0 | 100 | 100 | 100 |
2018-08-14 | 100 | 100 | 0 | 100 | 100 | 100 |
2018-08-15 | 100 | 100 | 0 | 100 | 100 | 100 |
2018-08-16 | 100 | 100 | 0 | 100 | 100 | 100 |
2018-08-17 | 100 | 100 | 60 | 100 | 100 | 100 |
2018-08-18 | 100 | 100 | 100 | 100 | 100 | 100 |
2018-08-19 | 100 | 100 | 100 | 100 | 100 | 100 |
2018-08-20 | 100 | 100 | 100 | 100 | 100 | 100 |
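The colour bands quoted with the availability figures (Red below 90%, Orange below 97%) can be applied mechanically to a column of daily values. The following is a minimal sketch, not the tooling used to produce the report: the function name `classify` and the label "ok" for days at or above target are assumptions made for illustration, and the CMS figures are transcribed by hand from the table above.

```python
# Thresholds quoted in the report: Red below 90%, Orange below 97% (the target).
RED_BELOW = 90.0
ORANGE_BELOW = 97.0


def classify(availability: float) -> str:
    """Return the colour band for one daily availability percentage.

    "ok" is a hypothetical label for days that meet the 97% target;
    the report itself only names the Red and Orange bands.
    """
    if availability < RED_BELOW:
        return "red"
    if availability < ORANGE_BELOW:
        return "orange"
    return "ok"


# CMS daily availability, transcribed from the table above.
cms = {
    "2018-08-13": 0,
    "2018-08-14": 0,
    "2018-08-15": 0,
    "2018-08-16": 0,
    "2018-08-17": 60,
    "2018-08-18": 100,
    "2018-08-19": 100,
    "2018-08-20": 100,
}

# Band for every day, in the order the days were listed.
bands = {day: classify(value) for day, value in cms.items()}

# Days that fall in the Red band.
red_days = [day for day, band in bands.items() if band == "red"]

for day, band in bands.items():
    print(day, band)
```

With these inputs the five days from 2018-08-13 to 2018-08-17 land in the Red band and the remaining three meet the target. The same function could be applied to any other column of either availability table.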
Target Availability for each site is 97.0% | Red <90% | Orange <97%
Day | Atlas HC | CMS HC | Comment
2018-08-13 | 0 | 0 |
2018-08-14 | 0 | 0 |
2018-08-15 | 0 | 0 |
2018-08-16 | 0 | 0 |
2018-08-17 | 76 | 60 |
2018-08-18 | 100 | 100 |
2018-08-19 | 100 | 100 |
2018-08-20 | 100 | 100 |
Key: Atlas HC = Atlas HammerCloud (Queue RAL-LCG2_UCORE, Template 841); CMS HC = CMS HammerCloud
|