Tier1 Operations Report 2018-01-03

RAL Tier1 Operations Report for 3rd January 2018

Review of Issues during the week 21st December 2017 to 3rd January 2018

  • Network: Problem on Stack 9 in the UPS room. A faulty transceiver was replaced.


Current operational status and issues
  • None
Resolved Castor Disk Server Issues
  • GDSS688 (cmsDisk - D1T0) is back in production.
  • GDSS743 (atlasStripInput - D1T0) is back in production.


Ongoing Castor Disk Server Issues
  • GDSS757 (cmsDisk - D1T0) is back in production.
  • GDSS756 (cmsDisk - D1T0) is back in production.
Limits on concurrent batch system jobs.
  • CMS Multicore 550
Notable Changes made since the last meeting.

  • None

Entries in GOC DB starting since the last report.

No downtime scheduled in the GOCDB between 2017-12-12 and 2017-12-20

Declared in the GOC DB
  • None
Advanced warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.

Ongoing or Pending - but not yet formally announced:

Listing by category:

  • Castor:
    • Update systems (initially tape servers) to use SL7 and to be configured by Quattor/Aquilon.
    • Move to generic Castor headnodes.
  • Echo:
    • Update to the next Ceph version ("Luminous").
  • Networking
    • Extend the number of services on the production network with IPv6 dual stack (done for perfSONAR, FTS3, all squids and the CVMFS Stratum-1 servers).
  • Services
  • Internal
    • DNS servers will be rolled out within the Tier1 network.
Open GGUS Tickets (Snapshot during morning of meeting)
Ticket-ID | Type | VO | Notified Site | Resp. Unit | Status | Priority | Creation | Last Update | ToI | Subject
132589 | TEAM | lhcb | RAL-LCG2 | NGI_UK | in progress | very urgent | 2017-12-21 06:45:00 | 2017-12-21 16:22:00 | Local Batch System | Killed pilots at RAL
132540 | TEAM | lhcb | RAL-LCG2 | NGI_UK assign to:lcg-support@gridpp.rl.ac.uk | in progress | top priority | 2017-12-18 09:32:00 | 2017-12-23 10:13:00 | Other | Upload problems at RAL
131815 | USER | t2k.org | RAL-LCG2 | NGI_UK | in progress | less urgent | 2017-11-13 14:42:00 | 2017-12-01 19:30:00 | Storage Systems | Extremely long download times for T2K files on tape at RAL
130207 | USER | mice | RAL-LCG2 | NGI_UK assign to:lcg-support@gridpp.rl.ac.uk | on hold | urgent | 2017-08-24 09:46:00 | 2017-12-18 17:22:00 | Network problem | Timeouts when copying MICE reco data to CASTOR
127597 | USER | cms | RAL-LCG2 | NGI_UK assign to:lcg-support@gridpp.rl.ac.uk share with:sexton@fnal.gov | on hold | urgent | 2017-04-07 10:34:00 | 2017-10-05 09:14:00 | File Transfer | Check networking and xrootd RAL-CERN performance
124876 | USER | ops | RAL-LCG2 | NGI_UK assign to:lcg-support@gridpp.rl.ac.uk | on hold | less urgent | 2016-11-07 12:06:00 | 2017-11-13 16:55:00 | Operations | [Rod Dashboard] Issue detected : hr.srce.GridFTP-Transfer-ops@gridftp.echo.stfc.ac.uk
117683 | USER | none | RAL-LCG2 | NGI_UK assign to:lcg-support@gridpp.rl.ac.uk | on hold | less urgent | 2015-11-18 11:36:00 | 2017-11-06 16:59:00 | Information System | CASTOR at RAL not publishing GLUE 2
Availability Report
Day | OPS | Alice | Atlas | CMS | LHCb | Atlas Echo | Comment
20/12/17 | 100 | 100 | 100 | 100 | 100 | 100 |
21/12/17 | 100 | 100 | 100 | 100 | 100 | 100 |
22/12/17 | 100 | 100 | 100 | 100 | 100 | 100 |
23/12/17 | 100 | 100 | 100 | 100 | 53 | 100 |
24/12/17 | 100 | 100 | 100 | 98 | 100 | 100 |
25/12/17 | 100 | 100 | 100 | 100 | 100 | 100 |
26/12/17 | 100 | 100 | 100 | 100 | 100 | 100 |
27/12/17 | 100 | 100 | 100 | 100 | 100 | 100 |
28/12/17 | 100 | 100 | 100 | 100 | 100 | 100 |
29/12/17 | 100 | 100 | 100 | 100 | 100 | 100 |
30/12/17 | 100 | 100 | 100 | 100 | 100 | 100 |
31/12/17 | 100 | 100 | 100 | 100 | 100 | 100 |
01/01/18 | 100 | 100 | 100 | 100 | 100 | 100 |
02/01/18 | 100 | 100 | 100 | 100 | 100 | 100 |
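
The availability figures above can be summarised per VO with a few lines of Python. This is only an illustrative sketch: the function names (`parse_rows`, `mean_availability`, `days_below`) are hypothetical and not part of any Tier1 tooling, and the data are simply transcribed from the table in this report.

```python
# Availability figures transcribed from the table in this report.
# Columns after the date: OPS, Alice, Atlas, CMS, LHCb, Atlas Echo.
ROWS = """\
20/12/17 100 100 100 100 100 100
21/12/17 100 100 100 100 100 100
22/12/17 100 100 100 100 100 100
23/12/17 100 100 100 100 53 100
24/12/17 100 100 100 98 100 100
25/12/17 100 100 100 100 100 100
26/12/17 100 100 100 100 100 100
27/12/17 100 100 100 100 100 100
28/12/17 100 100 100 100 100 100
29/12/17 100 100 100 100 100 100
30/12/17 100 100 100 100 100 100
31/12/17 100 100 100 100 100 100
01/01/18 100 100 100 100 100 100
02/01/18 100 100 100 100 100 100
"""

COLUMNS = ["OPS", "Alice", "Atlas", "CMS", "LHCb", "Atlas Echo"]


def parse_rows(text):
    """Yield (day, [percentages]) for each non-empty line (hypothetical helper)."""
    for line in text.splitlines():
        fields = line.split()
        if fields:
            yield fields[0], [int(v) for v in fields[1:1 + len(COLUMNS)]]


def mean_availability(text):
    """Return {column name: mean percentage over all days}."""
    rows = list(parse_rows(text))
    return {
        name: sum(values[i] for _, values in rows) / len(rows)
        for i, name in enumerate(COLUMNS)
    }


def days_below(text, threshold=100):
    """Return (day, column, value) for every figure under the threshold."""
    return [
        (day, name, value)
        for day, values in parse_rows(text)
        for name, value in zip(COLUMNS, values)
        if value < threshold
    ]


if __name__ == "__main__":
    for name, mean in mean_availability(ROWS).items():
        print(f"{name}: {mean:.1f}%")
    for day, name, value in days_below(ROWS):
        print(f"{day} {name} {value}%")
```

Over these 14 days the only figures below 100 are LHCb on 23/12/17 (53) and CMS on 24/12/17 (98), which gives period means of roughly 96.6 for LHCb and 99.9 for CMS.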
HammerCloud Test Report

Key: Atlas HC = Atlas HammerCloud (Queue ANALY_RAL_SL6, Template 845); Atlas HC Echo = Atlas Echo (Template 841); CMS HC = CMS HammerCloud

Day | Atlas HC | Atlas HC Echo | CMS HC | Comment
20/12/17 | 100 | 0 | 100 | Atlas HC Echo - No test run in time bin
21/12/17 | 98 | 0 | 100 | Atlas HC Echo - No test run in time bin
22/12/17 | 100 | 0 | 98 | Atlas HC Echo - No test run in time bin
23/12/17 | 98 | 0 | 100 | Atlas HC Echo - No test run in time bin
24/12/17 | 0 | 0 | 100 | Atlas HC Echo - No test run in time bin
25/12/17 | 86 | 0 | 100 | Atlas HC Echo - No test run in time bin
26/12/17 | 100 | 0 | 100 | Atlas HC Echo - No test run in time bin
27/12/17 | 100 | 0 | 100 | Atlas HC Echo - No test run in time bin
28/12/17 | 100 | 0 | 100 | Atlas HC Echo - No test run in time bin
29/12/17 | 100 | 0 | 100 | Atlas HC Echo - No test run in time bin
30/12/17 | 93 | 0 | 100 | -
31/12/17 | 100 | 0 | 100 | Atlas HC Echo - No test run in time bin
01/01/18 | 100 | 0 | 100 | Atlas HC Echo - No test run in time bin
02/01/18 | 100 | 0 | 100 | Atlas HC Echo - No test run in time bin
Notes from Meeting.
  • Ceph scrubbing now runs during the daytime only, to help reduce call-outs at night.