Tier1 Operations Report 2017-03-22

From GridPP Wiki

RAL Tier1 Operations Report for 22nd March 2017

Review of Issues during the week 15th to 22nd March 2017.
  • There was a problem with the Atlas Castor instance on the evening of Wednesday 15th March. The on-call engineer was contacted and the Castor services were restarted, which fixed it. The cause was a known bug that exhausts a particular database resource.
  • A crash of one of the five hypervisors in the Microsoft Hyper-V high availability cluster caused a number of VMs to reboot overnight Thursday-Friday (16-17 Mar).
  • We have an ongoing problem with the SRM SAM tests for Atlas, which are failing much of the time. We have confirmed that this is not affecting Atlas operationally; only the tests fail. We still have a GGUS ticket open with Atlas, as the test itself appears to be problematic.
  • On Friday (17th March) the Argus server had a full disk partition. This affected the CMS glexec tests.
Resolved Disk Server Issues
  • None
Current operational status and issues
  • We are still seeing failures of the CMS SAM tests against the SRM. These are affecting our (CMS) availabilities, but the failure rate is lower than it was a few weeks ago.
Ongoing Disk Server Issues
  • None
Limits on concurrent batch system jobs.
  • Atlas Pilot (Analysis) 1500
  • CMS Multicore 460
Notable Changes made since the last meeting.
  • Last week nine of the '14 generation disk servers, each 100TB, were deployed into AtlasDataDisk. (These are from the batch that was used as CEPH test servers.)
  • Work ongoing on replacing two of the chillers.
Declared in the GOC DB
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
srm-lhcb.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 23/03/2017 11:00 | 23/03/2017 17:00 | 6 hours | Upgrade of Castor SRMs for LHCb to version 2.1.16-10
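As a quick cross-check of the declared duration, the start and end times of the entry above can be subtracted. This is only a minimal sketch: the timestamps are copied from the entry, and the variable names are illustrative rather than anything taken from the GOC DB.

```python
from datetime import datetime

# Start and end times copied from the GOC DB entry above (day/month/year).
FMT = "%d/%m/%Y %H:%M"
start = datetime.strptime("23/03/2017 11:00", FMT)
end = datetime.strptime("23/03/2017 17:00", FMT)

# Difference between the two timestamps, expressed in hours.
duration_hours = (end - start).total_seconds() / 3600
print(f"Outage duration: {duration_hours:g} hours")
```

The result agrees with the 6 hours declared in the Duration column.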
Advanced warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.

Pending - but not yet formally announced:

  • Update Castor SRMs starting with LHCb. (Announced for 23rd March.)
  • Chiller replacement - work ongoing.
  • Merge AtlasScratchDisk into larger Atlas disk pool.

Listing by category:

  • Castor:
    • Update SRMs to new version, including updating to SL6.
    • Bring some newer disk servers ('14 generation) into service, replacing some older ('12 generation) servers.
  • Databases
    • Removal of "asmlib" layer on Oracle database nodes. (Ongoing)
  • Networking
    • Enable first services on production network with IPv6 once addressing scheme agreed.
  • Infrastructure:
    • Two of the chillers supplying the air-conditioning for the R89 machine room will be replaced.
Entries in GOC DB starting since the last report.

None

Open GGUS Tickets (Snapshot during morning of meeting)
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject
126905 | Green | Less Urgent | In Progress | 2017-03-02 | 2017-03-02 | solid | finish commissioning cvmfs server for solidexperiment.org
126184 | Yellow | Less Urgent | In Progress | 2017-01-26 | 2017-02-07 | Atlas | Request of inputs for new sites monitoring
124876 | Red | Less Urgent | On Hold | 2016-11-07 | 2017-01-01 | OPS | [Rod Dashboard] Issue detected : hr.srce.GridFTP-Transfer-ops@gridftp.echo.stfc.ac.uk
117683 | Red | Less Urgent | On Hold | 2015-11-18 | 2017-03-02 | | CASTOR at RAL not publishing GLUE 2.
Availability Report

Key: Atlas HC = Atlas HammerCloud (Queue ANALY_RAL_SL6, Template 845); Atlas HC ECHO = Atlas ECHO (Template 842); CMS HC = CMS HammerCloud

Day | OPS | Alice | Atlas | CMS | LHCb | Atlas HC | Atlas HC ECHO | CMS HC | Comment
08/03/17 | 100 | 100 | 60 | 96 | 100 | 75 | 196 | 100 | Atlas: Ongoing problems with SRM test (Atlas Castor restarted to try and fix this, but no effect); CMS: timeouts in SRM tests.
15/03/17 | 100 | 100 | 100 | 99 | 100 | 99 | 98 | 99 | Single SRM test failure (timeout)
16/03/17 | 100 | 99 | 100 | 99 | 100 | 79 | 72 | 99 | ALICE: Test failed with 'no compatible resources found in BDII'. CMS: Single SRM test failure
17/03/17 | 100 | 100 | 100 | 85 | 100 | 100 | 100 | 97 | Problem with glexec tests from CMS.
18/03/17 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
19/03/17 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
20/03/17 | 100 | 100 | 100 | 98 | 100 | 100 | 100 | 100 | Two SRM test failures on GET (User timeout).
21/03/17 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
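
The availability figures above can be screened mechanically. The sketch below lists values that fall outside the 0 to 100 percentage range and values below a chosen threshold. The numbers are copied from the table as printed; the function names and the threshold of 90 are assumptions made for illustration and are not part of any Tier1 tooling.

```python
# Column order as in the availability table above.
COLUMNS = ["OPS", "Alice", "Atlas", "CMS", "LHCb",
           "Atlas HC", "Atlas HC ECHO", "CMS HC"]

# Values copied from the table as printed, keyed by day (dd/mm/yy).
ROWS = {
    "08/03/17": [100, 100, 60, 96, 100, 75, 196, 100],
    "15/03/17": [100, 100, 100, 99, 100, 99, 98, 99],
    "16/03/17": [100, 99, 100, 99, 100, 79, 72, 99],
    "17/03/17": [100, 100, 100, 85, 100, 100, 100, 97],
    "18/03/17": [100, 100, 100, 100, 100, 100, 100, 100],
    "19/03/17": [100, 100, 100, 100, 100, 100, 100, 100],
    "20/03/17": [100, 100, 100, 98, 100, 100, 100, 100],
    "21/03/17": [100, 100, 100, 100, 100, 100, 100, 100],
}


def out_of_range(rows):
    """Return (day, column, value) for every value outside 0..100."""
    return [(day, col, val)
            for day, values in rows.items()
            for col, val in zip(COLUMNS, values)
            if not 0 <= val <= 100]


def below_threshold(rows, threshold):
    """Return (day, column, value) for every value under the threshold."""
    return [(day, col, val)
            for day, values in rows.items()
            for col, val in zip(COLUMNS, values)
            if val < threshold]


if __name__ == "__main__":
    # Entries that cannot be valid percentages.
    for entry in out_of_range(ROWS):
        print("out of range:", entry)
    # Entries under the hypothetical threshold of 90.
    for entry in below_threshold(ROWS, 90):
        print("below 90:", entry)
```

With the table as printed, the range check reports the Atlas HC ECHO entry for 08/03/17, and the threshold check reports the Atlas and Atlas HC entries for 08/03/17, the Atlas HC and Atlas HC ECHO entries for 16/03/17, and the CMS entry for 17/03/17.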
