Difference between revisions of "Tier1 Operations Report 2017-12-20"

From GridPP Wiki

Revision as of 09:02, 19 December 2017

RAL Tier1 Operations Report for 13th December 2017

Review of Issues during the week 7th to 13th December 2017.

Echo: • Background scrubbing has continued. This has flushed out more bad disks, causing some callouts through the week.

Network: • Emergency card replacement at Harwell PoP on Thursday morning. This was announced to us and, as expected, caused a short break in two of the three OPN links.

Infrastructure: • There was a successful generator load test last Wednesday (13th Dec).

Certificates: • The repeat update to pick up the updated UK CA certificate in the IGTF 1.88 rollout was completed successfully last Tuesday (12th), as planned.

Christmas Plans (repeat of last week’s entry) • We will follow the same pattern as in previous years. The on-call team will be in place as usual. Some additional checks will be made by those on-call. RAL is closed after Friday afternoon 22nd December and will re-open on Tuesday 2nd January.

Current operational status and issues
  • None
Resolved Disk Server Issues
  • None
Ongoing Disk Server Issues
  • None
Limits on concurrent batch system jobs.
  • CMS Multicore 550
Notable Changes made since the last meeting.
  • None
Entries in GOC DB starting since the last report.

No downtime scheduled in the GOCDB between 2017-12-12 and 2017-12-20

Declared in the GOC DB
  • None
Advance warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.

Ongoing or Pending - but not yet formally announced:

Listing by category:

  • Castor:
    • Update systems (initially tape servers) to use SL7, configured by Quattor/Aquilon.
    • Move to generic Castor headnodes.
  • Echo:
    • Update to the next Ceph version ("Luminous").
  • Networking
    • Extend the number of services on the production network with IPv6 dual stack. (Done for Perfsonar, FTS3, all squids and the CVMFS Stratum-1 servers).
  • Services
  • Internal
    • DNS servers will be rolled out within the Tier1 network.
Open GGUS Tickets (Snapshot during morning of meeting)
Ticket-ID | Type | VO | Notified Site | Resp. Unit | Status | Priority | Creation | Last Update | ToI | Subject
132540 | TEAM | lhcb | RAL-LCG2 | NGI_UK assign to:lcg-support@gridpp.rl.ac.uk | in progress | top priority | 2017-12-18 09:32:00 | 2017-12-18 11:36:00 | Other | Upload problems at RAL
132336 | USER | ops | RAL-LCG2 | NGI_UK | in progress | less urgent | 2017-12-06 14:34:00 | 2017-12-18 11:40:00 | Operations | [Rod Dashboard] Issue detected : org.nagios.GLUE2-Check@site-bdii.gridpp.rl.ac.uk
132314 | USER | ops | RAL-LCG2 | NGI_UK assign to:lcg-support@gridpp.rl.ac.uk | in progress | less urgent | 2017-12-05 10:48:00 | 2017-12-18 14:10:00 | Operations | [Rod Dashboard] Issue detected : org.nordugrid.ARC-CE-SRM-result-ops@arc-ce02.gridpp.rl.ac.uk
131815 | USER | t2k.org | RAL-LCG2 | NGI_UK | in progress | less urgent | 2017-11-13 14:42:00 | 2017-12-01 19:30:00 | Storage Systems | Extremely long download times for T2K files on tape at RAL
130207 | USER | mice | RAL-LCG2 | NGI_UK assign to:lcg-support@gridpp.rl.ac.uk | on hold | urgent | 2017-08-24 09:46:00 | 2017-12-18 17:22:00 | Network problem | Timeouts when copying MICE reco data to CASTOR
127597 | USER | cms | RAL-LCG2 | NGI_UK assign to:lcg-support@gridpp.rl.ac.uk share with:sexton@fnal.gov | on hold | urgent | 2017-04-07 10:34:00 | 2017-10-05 09:14:00 | File Transfer | Check networking and xrootd RAL-CERN performance
124876 | USER | ops | RAL-LCG2 | NGI_UK assign to:lcg-support@gridpp.rl.ac.uk | on hold | less urgent | 2016-11-07 12:06:00 | 2017-11-13 16:55:00 | Operations | [Rod Dashboard] Issue detected : hr.srce.GridFTP-Transfer-ops@gridftp.echo.stfc.ac.uk
117683 | USER | none | RAL-LCG2 | NGI_UK assign to:lcg-support@gridpp.rl.ac.uk | on hold | less urgent | 2015-11-18 11:36:00 | 2017-11-06 16:59:00 | Information System | CASTOR at RAL not publishing GLUE 2
Hammercloud Test Report

Key: Atlas HC = Atlas HammerCloud (Queue ANALY_RAL_SL6, Template 845); Atlas HC Echo = Atlas Echo (Template 841); CMS HC = CMS HammerCloud

Day | Atlas HC | Atlas HC Echo | CMS HC | Comment
6/12/17 | 99 | 99 | 81 |
7/12/17 | 89 | 100 | 100 |
8/12/17 | 100 | 100 | 100 |
9/12/17 | 100 | 0 | 100 | Atlas HC Echo - No test run in time bin
10/12/17 | 100 | 0 | 100 | Atlas HC Echo - No test run in time bin
11/12/17 | 100 | 0 | 100 | Atlas HC Echo - No test run in time bin
12/12/17 | 99 | 0 | 100 | Atlas HC Echo - No test run in time bin
Notes from Meeting.
  • EGI will withdraw support for the WMS from the end of 2017. Our WMS service will be stopped on this timescale.
  • There is a problem with Perfsonar measurements using IPv6 to nodes accessed via JANET.
  • There was a discussion about how best to bring files back online from tape. The MICE VO needs a better (bulk) solution than they are using at the moment.