RAL Tier1 Operations Report for 24th January 2018
Review of Issues during the week 18th January 2018 to 24th January 2018
Current operational status and issues
- Ongoing security patching.
Resolved Castor Disk Server Issues
- gdss736 (lhcbDst - D1T0) – rebuilt and back in production (RO)
- gdss776 (lhcbDst - D1T0) - Failed Wednesday afternoon (17th). Returned to service on Friday.
Ongoing Castor Disk Server Issues
- gdss717 (CMSTape - D0T1) – multiple drive failure
Limits on concurrent batch system jobs.
Notable Changes made since the last meeting.
- Security patching done or underway.
- The WMS service has been declared as not in production in the GOC DB.
- The update of Echo CEPH to the "Luminous" version is underway. The service will continue to operate during this intervention.
Entries in GOC DB starting since the last report.
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
srm-atlas.gridpp.rl.ac.uk | UNSCHEDULED | OUTAGE | 22/01/2018 22:00 | 23/01/2018 10:00 | 12 hours | urgent fixes needed on Oracle DB backend - extension
srm-atlas.gridpp.rl.ac.uk | UNSCHEDULED | OUTAGE | 22/01/2018 17:45 | 22/01/2018 22:00 | 4 hours and 15 minutes | urgent fixes needed on Oracle DB backend
srm-alice.gridpp.rl.ac.uk, srm-atlas.gridpp.rl.ac.uk, srm-biomed.gridpp.rl.ac.uk, srm-cert.gridpp.rl.ac.uk, srm-cms-disk.gridpp.rl.ac.uk, srm-cms.gridpp.rl.ac.uk, srm-dteam.gridpp.rl.ac.uk, srm-ilc.gridpp.rl.ac.uk, srm-lhcb.gridpp.rl.ac.uk, srm-lhcb.gridpp.rl.ac.uk, srm-mice.gridpp.rl.ac.uk, srm-minos.gridpp.rl.ac.uk, srm-na62.gridpp.rl.ac.uk, srm-pheno.gridpp.rl.ac.uk, srm-preprod.gridpp.rl.ac.uk, srm-snoplus.gridpp.rl.ac.uk, srm-solid.gridpp.rl.ac.uk, srm-t2k.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 17/01/2018 10:00 | 17/01/2018 13:00 | 3 hours | Outage of Castor Storage to apply Security patches.
lcglb01.gridpp.rl.ac.uk, lcglb02.gridpp.rl.ac.uk, lcgwms04.gridpp.rl.ac.uk, lcgwms05.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 12/01/2018 10:00 | 19/01/2018 12:00 | 7 days, 2 hours | WMS Decommissioning RAL Tier1

Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
srm-atlas.gridpp.rl.ac.uk | UNSCHEDULED | OUTAGE | 23/01/2018 15:45 | 24/01/2018 14:00 | 22 hours and 15 minutes | emergency downtime of Castor Atlas while rebuilding some database tables
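The Duration values in the GOC DB entries above follow directly from the Start and End timestamps. As a minimal sketch of that arithmetic, the snippet below recomputes a few of them. The helper name `downtime_duration` is hypothetical and not part of any GOC DB tooling, and the timestamps are assumed to be naive local times in day/month/year order.

```python
from datetime import datetime
from typing import Tuple

# Assumption: Start and End are written as day/month/year hour:minute.
GOCDB_TIME_FORMAT = "%d/%m/%Y %H:%M"


def downtime_duration(start: str, end: str) -> Tuple[int, int, int]:
    """Return (days, hours, minutes) between two timestamps.

    Hypothetical helper for illustration only. Timestamps are treated as
    naive local times, so a clock change inside the interval is ignored.
    """
    started = datetime.strptime(start, GOCDB_TIME_FORMAT)
    ended = datetime.strptime(end, GOCDB_TIME_FORMAT)
    total_minutes = int((ended - started).total_seconds() // 60)
    days, remainder = divmod(total_minutes, 24 * 60)
    hours, minutes = divmod(remainder, 60)
    return days, hours, minutes


if __name__ == "__main__":
    # Start/End pairs copied from the entries listed in this report.
    entries = [
        ("22/01/2018 17:45", "22/01/2018 22:00"),
        ("12/01/2018 10:00", "19/01/2018 12:00"),
        ("23/01/2018 15:45", "24/01/2018 14:00"),
    ]
    for start, end in entries:
        print(start, "->", end, downtime_duration(start, end))
```

For the three pairs shown, the results agree with the Duration column: 4 hours and 15 minutes; 7 days, 2 hours; and 22 hours and 15 minutes.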
Advance warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
Ongoing or Pending - but not yet formally announced:
- Update to next CEPH version ("Luminous"). Ongoing.
Listing by category:
- Castor:
  - Update systems to use SL7, configured by Quattor/Aquilon. (Tape servers done)
  - Move to generic Castor headnodes.
- Echo:
  - Update to next CEPH version ("Luminous"). Ongoing.
- Networking:
  - Extend the number of services on the production network with IPv6 dual stack. (Done for Perfsonar, FTS3, all squids and the CVMFS Stratum-1 servers).
- Internal:
  - DNS servers will be rolled out within the Tier1 network.
- Infrastructure:
  - Testing of power distribution boards in the R89 machine room is being scheduled for some time in late July / early August. The effect of this on our services is being discussed.
Open GGUS Tickets (Snapshot during morning of meeting)
Request id | Affected vo | Status | Priority | Date of creation | Last update | Type of problem | Subject
117683 | none | on hold | less urgent | 18/11/2015 | 03/01/2018 | Information System | CASTOR at RAL not publishing GLUE 2
124876 | ops | on hold | less urgent | 07/11/2016 | 13/11/2017 | Operations | [Rod Dashboard] Issue detected : hr.srce.GridFTP-Transfer-ops@gridftp.echo.stfc.ac.uk
127597 | cms | on hold | urgent | 07/04/2017 | 05/10/2017 | File Transfer | Check networking and xrootd RAL-CERN performance
132589 | lhcb | in progress | very urgent | 21/12/2017 | 24/01/2018 | Local Batch System | Killed pilots at RAL
132712 | other | in progress | less urgent | 04/01/2018 | 23/01/2018 | Other | support for the hyperk VO (RAL-LCG2)
132802 | cms | in progress | urgent | 11/01/2018 | 24/01/2018 | CMS_AAA WAN Access | Low HC xrootd success rates at T1_UK_RAL
132830 | cms | reopened | very urgent | 12/01/2018 | 24/01/2018 | CMS_AAA WAN Access | Reading issues T1_UK_RAL
132844 | atlas | in progress | urgent | 14/01/2018 | 19/01/2018 | Storage Systems | UK RAL-LCG2 DATADISK transfer errors "DESTINATION OVERWRITE srm-ifce err:"
132935 | atlas | in progress | less urgent | 18/01/2018 | 22/01/2018 | Storage Systems | UK RAL-LCG2: deletion errors

Day | OPS | Alice | Atlas | CMS | LHCb | Atlas Echo | Comment
17/01/18 | 87.15 | 100 | 100 | 100 | 100 | 100 |
18/01/18 | 100 | 100 | 100 | 100 | 98 | 100 |
19/01/18 | 100 | 100 | 100 | 100 | 100 | 100 |
20/01/18 | 100 | 92 | 100 | 100 | 100 | 100 |
21/01/18 | 100 | 0 | 100 | 100 | 100 | 100 |
22/01/18 | 100 | 49 | 65 | 100 | 89 | 100 |
23/01/18 | 100 | 94 | 0 | 100 | 94 | 100 |

Key: Atlas HC = Atlas HammerCloud (Queue ANALY_RAL_UCORE, Template 841); CMS HC = CMS HammerCloud
Day | Atlas HC | CMS HC | Comment
17/01/18 | 92 | 100 |
18/01/18 | 100 | 100 |
19/01/18 | 100 | 99 |
20/01/18 | 100 | 99 |
21/01/18 | 100 | 100 |
22/01/18 | 59 | 100 |
23/01/18 | 100 | 99 |

- There was a discussion about the performance of the Tier1 when worker nodes access off-site data. This data path goes through the site firewall, which imposes some restrictions.