Tier1 Operations Report 2017-04-12

RAL Tier1 Operations Report for 12th April 2017

Review of Issues during the week 5th to 12th April 2017.
  • The LHCb Castor instance has had problems throughout the last week. Initially the new SRM version appeared to be causing a bottleneck. This was fixed, but it then became apparent that the stager was also struggling. Work to resolve this is ongoing.
  • Some batch job submission errors have been seen by CMS and LHCb. These are not yet understood (possibly still ongoing).
  • Over the weekend there were problems with the Atlas Frontier systems. Lyon were also affected.
  • On Monday problems were reported on one of the ARC CEs (AC-CE4) and its services were restarted.
Resolved Disk Server Issues
  • None
Current operational status and issues
  • We are still seeing CMS SAM test failures against the SRM. These are affecting our CMS availability, but the failure rate is lower than it was a few weeks ago.
Ongoing Disk Server Issues
  • gdss673 (LHCb-Tape) was removed from production on the morning of 05/04/2017 because of a double disk failure.
Limits on concurrent batch system jobs.
  • Atlas Pilot (Analysis) 1500
  • CMS Multicore 550
  • LHCb 1000
Notable Changes made since the last meeting.
  • Increased limit on number of CMS multicore jobs from 460 to 550 due to increased pledge for 2017.
  • Out of Hours cover for the CEPH ECHO service is being piloted.
Declared in the GOC DB
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
lcgwms04.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 12/04/2017 09:05 | 18/04/2017 12:00 | 6 days, 2 hours and 55 minutes | server migration
Advance warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.

Pending - but not yet formally announced:

  • Update Castor SRMs - CMS & GEN still to do. This is awaiting a full understanding of the problem seen with LHCb.
  • Chiller replacement - work ongoing.
  • Merge AtlasScratchDisk into larger Atlas disk pool.

Listing by category:

  • Castor:
    • Update SRMs to new version, including updating to SL6.
    • Bring some newer disk servers ('14 generation) into service, replacing some older ('12 generation) servers.
  • Networking
    • Enable first services on the production network with IPv6 once the addressing scheme is agreed.
  • Infrastructure:
    • Two of the chillers supplying the air-conditioning for the R89 machine room are being replaced.
Entries in GOC DB starting since the last report.
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
lcgwms04.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 12/04/2017 09:05 | 18/04/2017 12:00 | 6 days, 2 hours and 55 minutes | server migration
srm-lhcb.gridpp.rl.ac.uk | UNSCHEDULED | OUTAGE | 09/04/2017 12:00 | 10/04/2017 12:00 | 24 hours | Problems with LHCb transfers
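
The Duration column in the GOC DB entries listed above follows from the Start and End timestamps. As a minimal sketch of that arithmetic (the function name `downtime_duration` and the format constant are illustrative assumptions, not part of any GOC DB tooling), the figures can be cross-checked like this:

```python
from datetime import datetime

# Timestamp layout used in the listings above: day/month/year, 24-hour clock.
# GOCDB_FORMAT and downtime_duration are hypothetical names for this sketch.
GOCDB_FORMAT = "%d/%m/%Y %H:%M"


def downtime_duration(start: str, end: str) -> tuple:
    """Return (days, hours, minutes) between two timestamps in GOCDB_FORMAT."""
    delta = datetime.strptime(end, GOCDB_FORMAT) - datetime.strptime(start, GOCDB_FORMAT)
    total_minutes = int(delta.total_seconds() // 60)
    days, remainder = divmod(total_minutes, 24 * 60)
    hours, minutes = divmod(remainder, 60)
    return days, hours, minutes


# lcgwms04.gridpp.rl.ac.uk: scheduled outage for server migration
print(downtime_duration("12/04/2017 09:05", "18/04/2017 12:00"))  # (6, 2, 55)
# srm-lhcb.gridpp.rl.ac.uk: unscheduled outage, i.e. 24 hours
print(downtime_duration("09/04/2017 12:00", "10/04/2017 12:00"))  # (1, 0, 0)
```

Both results agree with the durations declared in the listing. The timestamps are treated as naive local times, so no time zone handling is attempted.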
Open GGUS Tickets (Snapshot during morning of meeting)
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject
127388 | Green | Less Urgent | In Progress | 2017-03-29 | 2017-04-03 | LHCb | [FATAL] Connection error for some file
127240 | Green | Urgent | In Progress | 2017-03-21 | 2017-03-27 | CMS | Staging Test at UK_RAL for Run2
126905 | Green | Less Urgent | Waiting Reply | 2017-03-02 | 2017-04-03 | solid | finish commissioning cvmfs server for solidexperiment.org
126184 | Amber | Less Urgent | In Progress | 2017-01-26 | 2017-02-07 | Atlas | Request of inputs for new sites monitoring
124876 | Red | Less Urgent | On Hold | 2016-11-07 | 2017-01-01 | OPS | [Rod Dashboard] Issue detected : hr.srce.GridFTP-Transfer-ops@gridftp.echo.stfc.ac.uk
117683 | Red | Less Urgent | On Hold | 2015-11-18 | 2017-03-02 | | CASTOR at RAL not publishing GLUE 2.
Availability Report

Key: Atlas HC = Atlas HammerCloud (Queue ANALY_RAL_SL6, Template 845); Atlas HC ECHO = Atlas ECHO (Template 841); CMS HC = CMS HammerCloud

Day | OPS | Alice | Atlas | CMS | LHCb | Atlas HC | Atlas HC ECHO | CMS HC | Comment
29/03/17 | 100 | 100 | 100 | 97 | 83 | 100 | 99 | 100 | SRM test failures
30/03/17 | 100 | 100 | 100 | 100 | 88 | 100 | 100 | 100 | SRM test failures
31/03/17 | 100 | 100 | 100 | 98 | 50 | 96 | 100 | 100 | SRM test failures
01/04/17 | 100 | 100 | 67 | 100 | 79 | 100 | 100 | 100 | Atlas: Missing data; LHCb: SRM test failures
02/04/17 | 100 | 100 | 100 | 98 | 100 | 100 | 89 | 100 | SRM test failures
03/04/17 | 100 | 100 | 100 | 97 | 82 | 98 | 94 | 99 | SRM test failures
05/04/17 | 100 | 100 | 100 | 100 | 88 | 100 | 100 | 100 | SRM test failures
06/04/17 | 100 | 100 | 100 | 99 | 84 | 100 | 100 | 100 | SRM test failures for both CMS and LHCb
07/04/17 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
08/04/17 | 100 | 96 | 88 | 75 | 94 | 100 | 100 | 100 | A hypervisor failure led to problems for one of the CEs and argus
09/04/17 | 100 | 100 | 100 | 45 | 88 | 100 | 100 | 100 | CMS: Problem with CMS Castor (transfer manager problems); LHCb: SRM test failures
10/04/17 | 100 | 100 | 100 | 60 | 100 | 100 | 100 | 100 | CMS: Ongoing problem with CMS Castor (above) fixed during morning
11/04/17 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
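
For a quick view of the review week, the daily figures above can be summarised per VO. The sketch below copies the OPS, Alice, Atlas, CMS and LHCb columns for 05/04/17 to 11/04/17 from the table. The helper names and the 90% threshold are assumptions chosen for illustration, and a simple mean of daily values is not necessarily how official availability figures are calculated.

```python
# Daily availability (%) for 05/04/17 to 11/04/17, copied from the table above.
# Column order matches VOS. All names here are hypothetical, for illustration only.
VOS = ("OPS", "Alice", "Atlas", "CMS", "LHCb")
ROWS = {
    "05/04/17": (100, 100, 100, 100, 88),
    "06/04/17": (100, 100, 100, 99, 84),
    "07/04/17": (100, 100, 100, 100, 100),
    "08/04/17": (100, 96, 88, 75, 94),
    "09/04/17": (100, 100, 100, 45, 88),
    "10/04/17": (100, 100, 100, 60, 100),
    "11/04/17": (100, 100, 100, 100, 100),
}


def weekly_mean(vo: str) -> float:
    """Unweighted mean of the daily availability figures for one VO."""
    idx = VOS.index(vo)
    values = [row[idx] for row in ROWS.values()]
    return sum(values) / len(values)


def days_below(vo: str, threshold: float) -> list:
    """Days on which the VO's availability fell below the given threshold."""
    idx = VOS.index(vo)
    return [day for day, row in ROWS.items() if row[idx] < threshold]


print(round(weekly_mean("CMS"), 1))   # 82.7
print(round(weekly_mean("LHCb"), 1))  # 93.4
print(days_below("CMS", 90))          # ['08/04/17', '09/04/17', '10/04/17']
```

The three CMS days flagged here line up with the hypervisor failure and the CMS Castor transfer manager problems noted in the Comment column.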
Notes from Meeting.
  • None yet