Tier1 Operations Report 2014-02-12

RAL Tier1 Operations Report for 12th February 2014

Review of Issues during the week 5th to 12th February 2014.
  • There was a successful UPS/Generator load test this morning.
  • There was a problem updating grid-mapfiles in Castor, caused by a certificate problem. It was first seen on Friday (7th) and resolved on Tuesday (11th).
Resolved Disk Server Issues
  • GDSS653 (LHCbDst - D1T0) had a problem around 06:00 on Monday morning (10th Feb). The on-call person worked on the system and it was unavailable for less than an hour. One file was lost and this has been declared to LHCb.
Current operational status and issues
  • The intermittent failures of Castor access via the SRM (as seen in the availability tests) reported last week are still present. They have been seen across multiple Castor instances. The Castor team are actively working on this and have been in contact with the Castor developers at CERN to try to find a solution.
  • We are participating in an extensive FTS3 test with Atlas and CMS.
  • There has been a problem over the last couple of days with LHCb jobs aborting.
Ongoing Disk Server Issues
  • None
Notable Changes made this last week.
  • CVMFS client version 2.1.17 continues to be tested on one batch of worker nodes (approx 10% of the batch farm).
  • On Thursday (6th Feb) all remaining worker nodes were configured to access the new CernVM-FS Stratum-1 service at RAL (cvmfs-wlcg.gridpp.rl.ac.uk).
  • There have been updates to the WMSs to resolve the proxy renewal problems.
  • There was a successful intervention on the Tier1 network yesterday morning (Tuesday 11th February) to add equipment that will form the new 'mesh' network.
Declared in the GOC DB
  • None
Advanced warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.

(Proposed) Tuesday 25th February: Change Tier1 connection to site network (expect around 1 day outage).

Listing by category:

  • Databases:
    • Switch LFC/FTS/3D to new Database Infrastructure.
  • Castor:
    • Castor 2.1.14 testing is ongoing. A date for deployments awaits successful completion of this testing.
  • Networking:
    • Implementation of new site firewall. Date for Tier1 proposed to be 11th March. (Initial changes for links that do not affect the Tier1 commenced this week.)
    • Update core Tier1 network and change connection to site and OPN including:
      • Install new Routing layer for Tier1 & change the way the Tier1 connects to the RAL network. (Proposed for Tuesday 25th February).
      • These changes will lead to the removal of the UKLight Router.
  • Fabric:
    • We are phasing out the use of the software server used by the small VOs.
    • Firmware updates on remaining EMC disk arrays (Castor, FTS/LFC).
    • There will be circuit testing of the remaining (i.e. non-UPS) circuits in the machine room during 2014.
    • The floor in the machine room in the Atlas building is being replaced. We currently run some production services on hypervisors located there. These will be moved ahead of the first part of this work (re-routing some networking) on the morning of Wednesday 19th February. We are experiencing some problems with the hypervisors, so this move may not be transparent.
Entries in GOC DB starting between the 5th and 12th February 2014.
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
Whole Site. | SCHEDULED | WARNING | 12/02/2014 10:00 | 12/02/2014 12:00 | 2 hours | RAL Tier1 site in warning state due to UPS/generator test.
Whole Site. | SCHEDULED | WARNING | 11/02/2014 09:30 | 11/02/2014 11:30 | 2 hours | Site services at risk as additional equipment added to the internal network.
Open GGUS Tickets (Snapshot during morning of meeting)
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject
101164 | Green | Less Urgent | In Progress | 2014-02-12 | 2014-02-12 | Atlas | Fair amount of "file not found" srm-atlas.gridpp.rl.ac.uk
101079 | Green | Urgent | In Progress | 2014-02-09 | 2014-02-10 | | ARC CEs have VOViews with a default SE of "0"
101068 | Green | Less Urgent | In Progress | 2014-02-07 | 2014-02-10 | CMS | [sr #141938] fts problem
101052 | Green | Urgent | In Progress | 2014-02-06 | 2014-02-11 | Biomed | Can't retrieve job result file from cream-ce02.gridpp.rl.ac.uk
101015 | Green | Less Urgent | In Progress | 2014-02-05 | 2014-02-06 | CMS | [sr #141890] Failed PhEDEx transfers between T3_US_Minnesota and T1_UK_RAL_Buffer
100887 | Green | Less Urgent | In Progress | 2014-01-31 | 2014-02-07 | | Please update gridsite on WebDAV LFC
100343 | Red | Less Urgent | In Progress | 2014-01-16 | 2014-02-12 | | RAL WMS still generating 512 proxies
100114 | Red | Waiting Reply | On Hold | 2014-01-08 | 2014-02-11 | | Jobs failing to get from RAL WMS to Imperial
99556 | Red | Very Urgent | In Progress | 2013-12-06 | 2014-01-30 | | NGI Argus requests for NGI_UK
98249 | Red | Urgent | On Hold | 2013-10-21 | 2014-01-29 | SNO+ | please configure cvmfs stratum-0 for SNO+ at RAL T1
97025 | Red | Less Urgent | On Hold | 2013-09-03 | 2014-02-05 | | Myproxy server certificate does not contain hostname
Availability Report
Day | OPS | Alice | Atlas | CMS | LHCb | Comment
05/02/14 | 100 | 100 | 96.8 | 96.1 | 91.6 | Various SRM test failures.
06/02/14 | 100 | 100 | 100 | 96.1 | 100 | Single SRM test failure (Error reading token data header)
07/02/14 | 100 | 100 | 100 | 100 | 95.9 | Single SRM test failure (User timeout)
08/02/14 | 100 | 100 | 100 | 100 | 95.8 | Single SRM test failure (SRM_FILE_BUSY)
09/02/14 | 100 | 100 | 100 | 100 | 95.8 | Single SRM test failure (SRM_FILE_BUSY)
10/02/14 | 100 | 100 | 99.5 | 88.4 | 95.7 | Various SRM test failures.
11/02/14 | 100 | 100 | 100 | 100 | 91.7 | 2 SRM test failures (both with SRM_FILE_BUSY)
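
For readers who want a per-VO summary of the week, the figures in the availability table can be reduced to a mean and a lowest day. The following is only a minimal sketch: it assumes the figures are daily availability percentages, and the names VOS, AVAILABILITY and summarise are hypothetical, chosen for illustration rather than taken from any Tier1 tooling.

```python
from statistics import mean

# Column order of the availability table (one entry per VO).
VOS = ("OPS", "Alice", "Atlas", "CMS", "LHCb")

# Daily figures copied by hand from the table above (assumed to be percentages).
AVAILABILITY = {
    "05/02/14": (100, 100, 96.8, 96.1, 91.6),
    "06/02/14": (100, 100, 100, 96.1, 100),
    "07/02/14": (100, 100, 100, 100, 95.9),
    "08/02/14": (100, 100, 100, 100, 95.8),
    "09/02/14": (100, 100, 100, 100, 95.8),
    "10/02/14": (100, 100, 99.5, 88.4, 95.7),
    "11/02/14": (100, 100, 100, 100, 91.7),
}


def summarise(table):
    """Return {vo: (mean availability, lowest day, lowest value)}."""
    summary = {}
    for column, vo in enumerate(VOS):
        values = {day: row[column] for day, row in table.items()}
        # On ties, min() keeps the earliest day because dicts preserve insertion order.
        worst_day = min(values, key=values.get)
        summary[vo] = (mean(values.values()), worst_day, values[worst_day])
    return summary


if __name__ == "__main__":
    for vo, (avg, day, low) in summarise(AVAILABILITY).items():
        print(f"{vo:6s} mean {avg:6.2f}%  lowest {low:5.1f}% on {day}")
```

A plain arithmetic mean over the seven days is used here for simplicity; it is not necessarily how the official availability figures are aggregated.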