
RAL Tier1 Operations Report for 22nd March 2017

Review of Issues during the week 15th to 22nd March 2017.
  • A problem with one of the five hypervisors in the Microsoft Hyper-V high availability cluster caused a number of VMs to reboot overnight Thursday-Friday.
Resolved Disk Server Issues
  • GDSS689 (AtlasDataDisk - D1T0) reported 'fsprobe' errors and was taken out of production last Wednesday (8th March). It was returned to service on Friday (10th) having had two disks replaced.
  • GDSS623 (GenTape - D0T1) had one partition go read-only on Friday evening, 10th March. It was put back in service read-only on Sunday (12th) so that the files awaiting migration to tape could be drained off.
Current operational status and issues
  • We are still seeing failures of the CMS SAM tests against the SRM. These affect our CMS availability figures, although the failure rate is lower than it was a few weeks ago.
Ongoing Disk Server Issues
  • None
Limits on concurrent batch system jobs (see the configuration sketch after the list).
  • Atlas Pilot (Analysis) 1500
  • CMS Multicore 460
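
The report lists the limits but does not say how they are applied. Purely as an illustration, and assuming an HTCondor batch system (the group names, file path and mechanism below are assumptions, not RAL's actual configuration), caps of this kind can be expressed as static group quotas in the negotiator configuration. Note that HTCondor quotas count slots/cores rather than jobs, so a real multicore cap would be scaled by the cores per job; the numbers below simply echo the report's figures as placeholders.

  # /etc/condor/config.d/group_quotas.conf -- illustrative sketch only;
  # group names and values are hypothetical, not RAL's actual settings.
  GROUP_NAMES = group_atlas_analysis, group_cms_multicore

  # Quotas are counted in slots/cores claimed by each accounting group.
  GROUP_QUOTA_group_atlas_analysis = 1500
  GROUP_QUOTA_group_cms_multicore  = 460

  # Do not let these groups take surplus beyond their quota.
  GROUP_ACCEPT_SURPLUS_group_atlas_analysis = False
  GROUP_ACCEPT_SURPLUS_group_cms_multicore  = False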
Notable Changes made since the last meeting.
  • Increased the max_filedesc parameter on the squids to enable them to better cope with high load (a minimal configuration sketch follows this list).
  • Last week nine of the '14 generation disk servers (each 100TB) were deployed into AtlasDataDisk. (These are from the batch that was used as CEPH test servers.)
  • ECHO: Two additional 'MON' boxes are being set up, bringing the total to five. The existing three can cope with normal activity, but the additional ones will speed up recoveries and starts. Two additional gateway nodes are also being set up (also bringing the total to five), which will improve access bandwidth.
  • The first of two chillers for the R89 machine room air-conditioning has been replaced; the chiller replacement work is ongoing.
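
The report does not give the new file-descriptor limit or the exact mechanism used on the RAL squids, so the following is only an illustrative sketch of this kind of change: in a stock Squid installation the in-process ceiling is set with the max_filedescriptors directive in squid.conf (the "max_filedesc" of the report), and the operating-system per-process limit has to be at least as high. The value below is a placeholder, not the value used at RAL.

  # /etc/squid/squid.conf -- illustrative value only; the figure used at RAL
  # is not stated in the report.
  max_filedescriptors 16384

  # The OS per-process limit must be raised to match, e.g. via the systemd
  # unit for squid (or an equivalent ulimit in the init script):
  # LimitNOFILE=16384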
Declared in the GOC DB
Service Scheduled? Outage/At Risk Start End Duration Reason
srm-lhcb.gridpp.rl.ac.uk SCHEDULED OUTAGE 23/03/2017 11:00 23/03/2017 17:00 6 hours Upgrade of Castor SRMs for LHCb to version 2.1.16-10
Advanced warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.

Pending - but not yet formally announced:

  • Update Castor SRMs. Propose LHCb SRMs first - target date 22nd March.
  • Chiller replacement - work ongoing.
  • Merge AtlasScratchDisk into larger Atlas disk pool.

Listing by category:

  • Castor:
    • Update SRMs to new version, including updating to SL6.
    • Bring some newer disk servers ('14 generation) into service, replacing some older ('12 generation) servers.
  • Databases:
    • Removal of "asmlib" layer on Oracle database nodes. (Ongoing)
  • Networking:
    • Enable first services on production network with IPv6 once addressing scheme agreed.
  • Infrastructure:
    • Two of the chillers supplying the air-conditioning for the R89 machine room will be replaced.
Entries in GOC DB starting since the last report.

None

Open GGUS Tickets (Snapshot during morning of meeting)
GGUS ID Level Urgency State Creation Last Update VO Subject
126905 Green Less Urgent In Progress 2017-03-02 2017-03-02 solid finish commissioning cvmfs server for solidexperiment.org
126184 Yellow Less Urgent In Progress 2017-01-26 2017-02-07 Atlas Request of inputs for new sites monitoring
124876 Red Less Urgent On Hold 2016-11-07 2017-01-01 OPS [Rod Dashboard] Issue detected : hr.srce.GridFTP-Transfer-ops@gridftp.echo.stfc.ac.uk
117683 Red Less Urgent On Hold 2015-11-18 2017-03-02 CASTOR at RAL not publishing GLUE 2.
Availability Report

Key: Atlas HC = Atlas HammerCloud (Queue ANALY_RAL_SL6, Template 845); Atlas HC ECHO = Atlas ECHO (Template 842); CMS HC = CMS HammerCloud

Day OPS Alice Atlas CMS LHCb Atlas HC Atlas HC ECHO CMS HC Comment
08/03/17 100 100 60 96 100 75 196 100 Atlas: Ongoing problems with SRM test (Atlas Castor restarted to try and fix this - but no effect); CMS - timeouts in SRM tests.
15/03/17 100 100 100 99 100 99 98 99 Single SRM test failure (timeout)
16/03/17 100 99 100 99 100 79 72 99 ALICE: Test failed with 'no compatible resources found in BDII'. CMS: Single SRM test failure
17/03/17 100 100 100 85 100 100 100 97 Problem with glexec tests from CMS. This followed a problem with one of the Hyper-V 2012 hypervisors.
18/03/17 100 100 100 100 100 100 100 100
19/03/17 100 100 100 100 100 100 100 100
20/03/17 100 100 100 98 100 100 100 100 Two SRM test failures on GET (User timeout).
21/03/17 100 100 100 100 100 100 100 100
