RAL Tier1 Operations Report for 6th December 2017
Review of Issues during the week 30th November to 6th December 2017.
- IPv6 issues have now been resolved – [Tier1] Unit 2 is master for IPv6 but there are no physical connections to that router from the switch core. Consequently the fail-over did not complete successfully. Once understood, this was resolved on 23/11/17.
Current operational status and issues
- Certificate deployment issues with UKeScience 2B ICA 1.88-1 and SL6. Possible SHA-1/SHA-2 incompatibility.
Resolved Disk Server Issues
- GDSS896 (CMS_DEFAULT) has been returned to full production
- GDSS771 has been returned to full production
Ongoing Disk Server Issues
- GDSS753 - Faulty drive - Port 5:6. Being investigated
Limits on concurrent batch system jobs.
Notable Changes made since the last meeting.
Entries in GOC DB starting since the last report.
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
srm-alice.gridpp.rl.ac.uk, srm-atlas.gridpp.rl.ac.uk, srm-biomed.gridpp.rl.ac.uk, srm-cert.gridpp.rl.ac.uk, srm-cms-disk.gridpp.rl.ac.uk, srm-cms.gridpp.rl.ac.uk, srm-dteam.gridpp.rl.ac.uk, srm-ilc.gridpp.rl.ac.uk, srm-mice.gridpp.rl.ac.uk, srm-minos.gridpp.rl.ac.uk, srm-na62.gridpp.rl.ac.uk, srm-pheno.gridpp.rl.ac.uk, srm-preprod.gridpp.rl.ac.uk, srm-snoplus.gridpp.rl.ac.uk, srm-solid.gridpp.rl.ac.uk, srm-t2k.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 06/12/2017 13:00 | 06/12/2017 15:00 | 2 hours | Upgrade of non-LHCb SRM to version 2.1.16-18
lcgfts3.gridpp.rl.ac.uk | SCHEDULED | WARNING | 05/12/2017 11:00 | 05/12/2017 13:00 | 2 hours | FTS update to v3.7.7
Advance warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
Ongoing or Pending - but not yet formally announced:
Listing by category:
- Castor:
  - Update systems (initially tape servers) to use SL7 and configured by Quattor/Aquilon.
  - Move to generic Castor headnodes.
- Echo:
  - Update to next CEPH version ("Luminous").
- Networking
  - Extend the number of services on the production network with IPv6 dual stack. (Done for Perfsonar, FTS3, all squids and the CVMFS Stratum-1 servers).
- Services
- Internal
  - DNS servers will be rolled out within the Tier1 network.
Open GGUS Tickets (Snapshot during morning of meeting)
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject
132222 | Green | Urgent | In Progress | 2017-11-30 | 2017-12-04 | CMS | Transfers failing to T1_UK_RAL_Disk
131840 | Green | Urgent | Waiting for reply | 2017-11-14 | 2017-11-15 | Other | solidexperiment.org CASTOR tape copy fails
131815 | Green | Less Urgent | In Progress | 2017-11-13 | 2017-11-20 | T2K.Org | Extremely long download times for T2K files on tape at RAL
130207 | Red | Urgent | On Hold | 2017-08-24 | 2017-11-13 | MICE | Timeouts when copying MICE reco data to CASTOR
127597 | Red | Urgent | On Hold | 2017-04-07 | 2017-10-05 | CMS | Check networking and xrootd RAL-CERN performance
124876 | Red | Less Urgent | On Hold | 2016-11-07 | 2017-11-13 | Ops | [Rod Dashboard] Issue detected : hr.srce.GridFTP-Transfer-ops@gridftp.echo.stfc.ac.uk
117683 | Red | Less Urgent | On Hold | 2015-11-18 | 2017-11-06 | None | CASTOR at RAL not publishing GLUE 2
Day | OPS | Alice | Atlas | CMS | LHCb | Atlas Echo | Comment
29/11/17 | 100 | 100 | 100 | 100 | 100 | 100 |
30/11/17 | 100 | 100 | 100 | 100 | 100 | 100 |
1/12/17 | 100 | 100 | 100 | 100 | 100 | 100 |
2/12/17 | 100 | 100 | 100 | 100 | 100 | 100 |
3/12/17 | 100 | 100 | 100 | 100 | 100 | 100 |
4/12/17 | 100 | 100 | 96 | 88 | 100 | 100 |
5/12/17 | 100 | 100 | 42 | 100 | 100 | 100 |
Key: Atlas HC = Atlas HammerCloud (Queue ANALY_RAL_SL6, Template 845); Atlas HC Echo = Atlas Echo (Template 841); CMS HC = CMS HammerCloud
Day | Atlas HC | Atlas HC Echo | CMS HC | Comment
29/11/17 | 85 | 100 | 100 |
30/11/17 | 98 | 100 | 99 |
1/12/17 | 100 | 100 | 100 |
2/12/17 | 100 | 100 | 100 |
3/12/17 | 99 | 100 | 100 |
4/12/17 | 99 | 97 | 100 |
5/12/17 | 100 | 100 | 100 |