RAL Tier1 Operations Report for 13th December 2017
Review of Issues during the week 13th to 20th December 2017.
Echo:
• Background scrubbing has been running and has flushed out more bad disks, causing some callouts through the week.
Network:
• Emergency card replacement at Harwell PoP on Thursday morning. This was announced to us and, as expected, caused a short break in two of the three OPN links.
Current operational status and issues
Resolved Castor Disk Server Issues
- GDSS743 (AtlasDataDisk - D1T0) is back in production.
- GDSS705 (AtlasTape - D0T1) is back in production.
Ongoing Castor Disk Server Issues
Limits on concurrent batch system jobs.
Notable Changes made since the last meeting.
Infrastructure:
• There was a successful generator load test last Wednesday (13th Dec).
Certificates:
• The re-update to pick up the updated UK CA certificate in the IGTF 1.88 rollout took place successfully last Tuesday (12th), as planned.
Castor:
• Three Castor disk servers were moved from LHCb tape buffer to their disk-only (D1T0) storage.
Entries in GOC DB starting since the last report.
No downtime scheduled in the GOCDB between 2017-12-12 and 2017-12-20
Advance warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
Ongoing or Pending - but not yet formally announced:
Listing by category:
- Castor:
  - Update systems (initially tape servers) to use SL7, configured by Quattor/Aquilon.
  - Move to generic Castor headnodes.
- Echo:
  - Update to next Ceph version ("Luminous").
- Networking:
  - Extend the number of services on the production network with IPv6 dual stack. (Done for Perfsonar, FTS3, all squids and the CVMFS Stratum-1 servers).
- Services
- Internal
  - DNS servers will be rolled out within the Tier1 network.
Open GGUS Tickets (Snapshot during morning of meeting)
Ticket-ID | Type | VO | Notified Site | Resp. Unit | Status | Priority | Creation | Last Update | ToI | Subject
132540 | TEAM | lhcb | RAL-LCG2 | NGI_UK assign to:lcg-support@gridpp.rl.ac.uk | in progress | top priority | 2017-12-18 09:32:00 | 2017-12-18 11:36:00 | Other | Upload problems at RAL
132336 | USER | ops | RAL-LCG2 | NGI_UK | in progress | less urgent | 2017-12-06 14:34:00 | 2017-12-18 11:40:00 | Operations | [Rod Dashboard] Issue detected : org.nagios.GLUE2-Check@site-bdii.gridpp.rl.ac.uk
132314 | USER | ops | RAL-LCG2 | NGI_UK assign to:lcg-support@gridpp.rl.ac.uk | in progress | less urgent | 2017-12-05 10:48:00 | 2017-12-18 14:10:00 | Operations | [Rod Dashboard] Issue detected : org.nordugrid.ARC-CE-SRM-result-ops@arc-ce02.gridpp.rl.ac.uk
131815 | USER | t2k.org | RAL-LCG2 | NGI_UK | in progress | less urgent | 2017-11-13 14:42:00 | 2017-12-01 19:30:00 | Storage Systems | Extremely long download times for T2K files on tape at RAL
130207 | USER | mice | RAL-LCG2 | NGI_UK assign to:lcg-support@gridpp.rl.ac.uk | on hold | urgent | 2017-08-24 09:46:00 | 2017-12-18 17:22:00 | Network problem | Timeouts when copying MICE reco data to CASTOR
127597 | USER | cms | RAL-LCG2 | NGI_UK assign to:lcg-support@gridpp.rl.ac.uk share with:sexton@fnal.gov | on hold | urgent | 2017-04-07 10:34:00 | 2017-10-05 09:14:00 | File Transfer | Check networking and xrootd RAL-CERN performance
124876 | USER | ops | RAL-LCG2 | NGI_UK assign to:lcg-support@gridpp.rl.ac.uk | on hold | less urgent | 2016-11-07 12:06:00 | 2017-11-13 16:55:00 | Operations | [Rod Dashboard] Issue detected : hr.srce.GridFTP-Transfer-ops@gridftp.echo.stfc.ac.uk
117683 | USER | none | RAL-LCG2 | NGI_UK assign to:lcg-support@gridpp.rl.ac.uk | on hold | less urgent | 2015-11-18 11:36:00 | 2017-11-06 16:59:00 | Information System | CASTOR at RAL not publishing GLUE 2
Day | OPS | Alice | Atlas | CMS | LHCb | Atlas Echo | Comment
13/12/17 | 100 | 100 | 100 | 100 | 100 | 100 |
14/12/17 | 100 | 100 | 100 | 100 | 100 | 100 |
15/12/17 | 100 | 100 | 100 | 100 | 100 | 100 |
16/12/17 | 100 | 100 | 100 | 100 | 100 | 100 |
17/12/17 | 100 | 100 | 100 | 100 | 100 | 100 |
18/12/17 | 100 | 100 | 100 | 100 | 100 | 100 |
19/12/17 | 100 | 100 | 100 | 100 | 100 | 100 |
Key: Atlas HC = Atlas HammerCloud (Queue ANALY_RAL_SL6, Template 845); Atlas HC Echo = Atlas Echo (Template 841); CMS HC = CMS HammerCloud
Day | Atlas HC | Atlas HC Echo | CMS HC | Comment
13/12/17 | 99 | 0 | 100 | Atlas HC Echo - No test run in time bin
14/12/17 | 100 | 0 | 100 | Atlas HC Echo - No test run in time bin
15/12/17 | 100 | 0 | 100 | Atlas HC Echo - No test run in time bin
16/12/17 | 98 | 0 | 100 | Atlas HC Echo - No test run in time bin
17/12/17 | 85 | 0 | 100 | Atlas HC Echo - No test run in time bin
18/12/17 | 86 | 0 | 100 | Atlas HC Echo - No test run in time bin
19/12/17 | 100 | 0 | 100 | Atlas HC Echo - No test run in time bin
- Ceph scrubbing now runs in the daytime only, to help reduce call-outs at night.
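
As a rough illustration of the daytime-only scrubbing noted above: Ceph provides the OSD options osd_scrub_begin_hour and osd_scrub_end_hour to restrict when scheduled scrubs may start. The Python sketch below only models that kind of hour-window check. The function name and the 08:00 to 18:00 window are assumptions for illustration, not the values set on Echo, and this is not Ceph's own code.

```python
def scrub_allowed(hour: int, begin_hour: int = 8, end_hour: int = 18) -> bool:
    """Return True if a scheduled scrub may start at the given local hour (0-23).

    begin_hour and end_hour are hypothetical defaults chosen for this sketch.
    """
    if not 0 <= hour <= 23:
        raise ValueError("hour must be in 0..23")
    if begin_hour < end_hour:
        # Ordinary same-day window, e.g. 08:00 up to (not including) 18:00.
        return begin_hour <= hour < end_hour
    # Window wraps past midnight (e.g. 22:00 to 06:00); equal bounds
    # mean the window covers the whole day.
    return hour >= begin_hour or hour < end_hour


if __name__ == "__main__":
    for h in (3, 8, 12, 17, 18, 23):
        state = "scrub allowed" if scrub_allowed(h) else "no scheduled scrub"
        print(f"{h:02d}:00 -> {state}")
```

With a window like this, disk failures exposed by scrubbing would tend to surface during working hours. As far as the Ceph documentation describes it, a placement group whose scrub is overdue past osd_scrub_max_interval may still be scrubbed outside the window.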