Tier1 Operations Report 2019-06-24

From GridPP Wiki
Revision as of 09:18, 26 June 2019 by Brian Davies 77058998a5 (Talk | contribs)


RAL Tier1 Operations Report for 24th June 2019

Review of Issues during the week 10th June 2019 to the 17th June 2019.
  • Scheduled optical replacement work on the Janet Core in London suggested that there could be a prolonged outage at RAL.
 ** Additionally, there was concern that IPv6 might break and not fail over correctly (based on previous experience).
 ** In the event, the outage was momentary and no services were impacted. Both IPv6 and IPv4 failovers worked correctly.
  • CMS CPU efficiencies are currently describing a veritable sine curve over a weekly period.
 ** Investigations seem to suggest a 100% failure of “log collection” jobs at RAL.
 ** However, despite extensive investigation on the part of the Tier-1 Liaison, no one seems to know what this job type does (other than the obvious), or who is actually responsible for the monitoring/processing of this job type as CMS
Current operational status and issues
Resolved Castor Disk Server Issues
Machine VO DiskPool dxtx Comments
- - - - -
Ongoing Castor Disk Server Issues
Machine VO DiskPool dxtx Comments
- - - - -
Limits on concurrent batch system jobs.
Notable Changes made since the last meeting.
  • NTR
Entries in GOC DB starting since the last report.
Service ID Scheduled? Outage/At Risk Start End Duration Reason
- - - - - - - -
Declared in the GOC DB
Service ID Scheduled? Outage/At Risk Start End Duration Reason
- - - - - - - -
  • No ongoing downtime
Advanced warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.

Listing by category:

  • DNS servers will be rolled out within the Tier1 network.
Open GGUS Tickets (Snapshot taken during morning of the meeting).
Ticket-ID | Type | VO | Site | Priority | Responsible Unit | Status | Last Update | Subject | Scope
141872 | TEAM | lhcb | RAL-LCG2 | top priority | NGI_UK | in progress | 2019-06-26 08:29:00 | srm-lhcb.gridpp.rl.ac.uk seems in a bad state (time out) | WLCG
141838 | USER | cms | RAL-LCG2 | urgent | NGI_UK | in progress | 2019-06-24 11:13:00 | Transfers failing from CERN Tape to RAL Disk | WLCG
141608 | USER | snoplus.snolab.ca | RAL-LCG2 | less urgent | NGI_UK | in progress | 2019-06-06 08:55:00 | Permissions on RAL SE | EGI
140870 | USER | t2k.org | RAL-LCG2 | less urgent | NGI_UK | in progress | 2019-06-20 14:35:00 | Files vanished from RAL tape? | EGI
140447 | USER | dteam | RAL-LCG2 | less urgent | NGI_UK | on hold | 2019-05-22 14:20:00 | packet loss outbound from RAL-LCG2 over IPv6 | EGI
140220 | USER | mice | RAL-LCG2 | less urgent | NGI_UK | in progress | 2019-06-25 13:03:00 | mice LFC to DFC transition | EGI
139672 | USER | other | RAL-LCG2 | urgent | NGI_UK | waiting for reply | 2019-06-17 08:24:00 | No LIGO pilots running at RAL | EGI
GGUS Tickets Closed Last week
Ticket-ID | Type | VO | Site | Priority | Responsible Unit | Status | Last Update | Subject | Scope
141901 | USER | cms | RAL-LCG2 | urgent | NGI_UK | solved | 2019-06-25 18:49:00 | T1_UK_RAL SRM is timing out | WLCG
141771 | USER | cms | RAL-LCG2 | urgent | NGI_UK | solved | 2019-06-24 14:00:00 | file read error at T1_UK_RAL | WLCG
141638 | USER | cms | RAL-LCG2 | urgent | NGI_UK | closed | 2019-06-25 23:59:00 | SAM XROOTD read failure at T1_UK_RAL | WLCG
141549 | TEAM | atlas | RAL-LCG2 | less urgent | NGI_UK | closed | 2019-06-25 23:59:00 | ATLAS-RAL-Frontier and some of Lpad-RAL-LCG2 squid degraded | WLCG
141537 | TEAM | lhcb | RAL-LCG2 | very urgent | NGI_UK | verified | 2019-06-25 12:52:00 | Pilots Failed at RAL-LCG2 | WLCG
141462 | TEAM | lhcb | RAL-LCG2 | top priority | NGI_UK | solved | 2019-06-25 15:52:00 | Error: Connection limit exceeded | WLCG


Availability Report

Day Atlas CMS LHCB Alice Comments
2019-06-19 100 100 100 100
2019-06-20 100 86 100 100
2019-06-21 100 96 100 100
2019-06-22 100 22 100 100
2019-06-23 100 80 100 100
2019-06-24 100 95 91 93
2019-06-25 100 62 100 100
Hammercloud Test Report
Target Availability for each site is 97.0%. Red: <90%; Orange: <97%.
Day Atlas HC CMS HC Comment
2019-06-19 100 98
2019-06-20 100 85
2019-06-21 0 93
2019-06-22 100 98
2019-06-23 100 98
2019-06-24 100 97
2019-06-25 100 97



Key: Atlas HC = Atlas HammerCloud (Queue RAL-LCG2_UCORE, Template 841); CMS HC = CMS HammerCloud
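
The colour thresholds quoted for the Hammercloud report can be applied mechanically to the daily figures. The following is a minimal sketch only: the function name classify and the label "ok" for scores at or above the 97.0% target are assumptions made for illustration (the report names only the Red and Orange bands), and the CMS HC values are copied from the table above.

```python
# Sketch: band daily HammerCloud percentages using the thresholds quoted
# in the report (Red <90%, Orange <97%). The "ok" label is an assumption;
# the report does not name the band at or above the 97.0% target.

def classify(score, red_below=90.0, orange_below=97.0):
    """Return the colour band for one daily HammerCloud percentage."""
    if score < red_below:
        return "red"
    if score < orange_below:
        return "orange"
    return "ok"

# CMS HC column from the Hammercloud Test Report table.
cms_hc = {
    "2019-06-19": 98,
    "2019-06-20": 85,
    "2019-06-21": 93,
    "2019-06-22": 98,
    "2019-06-23": 98,
    "2019-06-24": 97,
    "2019-06-25": 97,
}

# Map each day to its band.
bands = {day: classify(score) for day, score in cms_hc.items()}

for day, band in bands.items():
    print(day, cms_hc[day], band)
```

Under these assumptions, 2019-06-20 falls in the Red band and 2019-06-21 in the Orange band for CMS HC; a score of exactly 97 meets the target because both thresholds are strict.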

Notes from Meeting.