RAL Tier1 Operations Report for 13th December 2017
Review of Issues during the week 21st December 2017 to 3rd January 2018
Network:
• Network problem on Stack 9 in the UPS room. Faulty transceiver replaced.
Current operational status and issues
Resolved Castor Disk Server Issues
- GDSS688 (cmsDisk - D1T0) is back in production.
- GDSS743 (atlasStripInput - D1T0) is back in production.
Ongoing Castor Disk Server Issues
- GDSS757 (cmsDisk - D1T0) is back in production.
- GDSS756 (cmsDisk - D1T0) is back in production.
Limits on concurrent batch system jobs.
Notable Changes made since the last meeting.
• None.
Entries in GOC DB starting since the last report.
No downtime scheduled in the GOCDB between 2017-12-12 and 2017-12-20
Advanced warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
Ongoing or Pending - but not yet formally announced:
Listing by category:
- Castor:
- Update systems (initially tape servers) to use SL7 and configured by Quattor/Aquilon.
- Move to generic Castor headnodes.
- Echo:
- Update to next CEPH version ("Luminous").
- Networking:
- Extend the number of services on the production network with IPv6 dual stack. (Done for Perfsonar, FTS3, all squids and the CVMFS Stratum-1 servers).
- Services:
- Internal:
- DNS servers will be rolled out within the Tier1 network.
Open GGUS Tickets (Snapshot during morning of meeting)
Ticket-ID | Type | VO | Notified Site | Resp. Unit | Status | Priority | Creation | Last Update | ToI | Subject
132589 | TEAM | lhcb | RAL-LCG2 | NGI_UK | in progress | very urgent | 2017-12-21 06:45:00 | 2017-12-21 16:22:00 | Local Batch System | Killed pilots at RAL
132540 | TEAM | lhcb | RAL-LCG2 | NGI_UK assign to:lcg-support@gridpp.rl.ac.uk | in progress | top priority | 2017-12-18 09:32:00 | 2017-12-23 10:13:00 | Other | Upload problems at RAL
131815 | USER | t2k.org | RAL-LCG2 | NGI_UK | in progress | less urgent | 2017-11-13 14:42:00 | 2017-12-01 19:30:00 | Storage Systems | Extremely long download times for T2K files on tape at RAL
130207 | USER | mice | RAL-LCG2 | NGI_UK assign to:lcg-support@gridpp.rl.ac.uk | on hold | urgent | 2017-08-24 09:46:00 | 2017-12-18 17:22:00 | Network problem | Timeouts when copying MICE reco data to CASTOR
127597 | USER | cms | RAL-LCG2 | NGI_UK assign to:lcg-support@gridpp.rl.ac.uk share with:sexton@fnal.gov | on hold | urgent | 2017-04-07 10:34:00 | 2017-10-05 09:14:00 | File Transfer | Check networking and xrootd RAL-CERN performance
124876 | USER | ops | RAL-LCG2 | NGI_UK assign to:lcg-support@gridpp.rl.ac.uk | on hold | less urgent | 2016-11-07 12:06:00 | 2017-11-13 16:55:00 | Operations | [Rod Dashboard] Issue detected : hr.srce.GridFTP-Transfer-ops@gridftp.echo.stfc.ac.uk
117683 | USER | none | RAL-LCG2 | NGI_UK assign to:lcg-support@gridpp.rl.ac.uk | on hold | less urgent | 2015-11-18 11:36:00 | 2017-11-06 16:59:00 | Information System | CASTOR at RAL not publishing GLUE 2
Day | OPS | Alice | Atlas | CMS | LHCb | Atlas Echo | Comment
20/12/17 | 100 | 100 | 100 | 100 | 100 | 100 |
21/12/17 | 100 | 100 | 100 | 100 | 100 | 100 |
22/12/17 | 100 | 100 | 100 | 100 | 100 | 100 |
23/12/17 | 100 | 100 | 100 | 100 | 53 | 100 |
24/12/17 | 100 | 100 | 100 | 98 | 100 | 100 |
25/12/17 | 100 | 100 | 100 | 100 | 100 | 100 |
26/12/17 | 100 | 100 | 100 | 100 | 100 | 100 |
27/12/17 | 100 | 100 | 100 | 100 | 100 | 100 |
28/12/17 | 100 | 100 | 100 | 100 | 100 | 100 |
29/12/17 | 100 | 100 | 100 | 100 | 100 | 100 |
30/12/17 | 100 | 100 | 100 | 100 | 100 | 100 |
31/12/17 | 100 | 100 | 100 | 100 | 100 | 100 |
01/01/18 | 100 | 100 | 100 | 100 | 100 | 100 |
02/01/18 | 100 | 100 | 100 | 100 | 100 | 100 |
Key: Atlas HC = Atlas HammerCloud (Queue ANALY_RAL_SL6, Template 845); Atlas HC Echo = Atlas Echo (Template 841); CMS HC = CMS HammerCloud
Day | Atlas HC | Atlas HC Echo | CMS HC | Comment
20/12/17 | 100 | 0 | 100 | Atlas HC Echo - No test run in time bin
21/12/17 | 98 | 0 | 100 | Atlas HC Echo - No test run in time bin
22/12/17 | 100 | 0 | 98 | Atlas HC Echo - No test run in time bin
23/12/17 | 98 | 0 | 100 | Atlas HC Echo - No test run in time bin
24/12/17 | 0 | 0 | 100 | Atlas HC Echo - No test run in time bin
25/12/17 | 86 | 0 | 100 | Atlas HC Echo - No test run in time bin
26/12/17 | 100 | 0 | 100 | Atlas HC Echo - No test run in time bin
27/12/17 | 100 | 0 | 100 | Atlas HC Echo - No test run in time bin
28/12/17 | 100 | 0 | 100 | Atlas HC Echo - No test run in time bin
29/12/17 | 100 | 0 | 100 | Atlas HC Echo - No test run in time bin
30/12/17 | 93 | 0 | 100 | -
31/12/17 | 100 | 0 | 100 | Atlas HC Echo - No test run in time bin
01/01/18 | 100 | 0 | 100 | Atlas HC Echo - No test run in time bin
02/01/18 | 100 | 0 | 100 | Atlas HC Echo - No test run in time bin
- Ceph scrubbing now runs in the daytime only, to help reduce call-outs at night.