Tier1 Operations Report 2014-02-05

From GridPP Wiki

Latest revision as of 13:13, 5 February 2014

RAL Tier1 Operations Report for 5th February 2014

Review of Issues during the week 29th January to 5th February 2014.
  • During the second half of last week there were problems with the CMS Castor instance. Many timeouts were seen within Castor and batch job efficiencies were very poor. Changes were made that improved the behaviour, including reducing the number of concurrent xroot transfers on each disk server and CMS re-enabling 'lazy download'.
  • There was a successful test of a new interface system for the tape libraries on Tuesday morning (4th Feb).
Resolved Disk Server Issues
  • None
Current operational status and issues
  • We are investigating intermittent failures of Castor access via the SRM (as seen in the availability tests) for multiple Castor instances. The SRM front-end daemons were restarted (for the Atlas, CMS & LHCb instances) late this morning and we will continue to track this problem.
Ongoing Disk Server Issues
  • None
Notable Changes made this last week.
  • The t2k.org VO has been enabled on the ARC CEs.
  • CVMFS client version 2.1.17 is being tested on one batch of worker nodes (approx 10% of the batch farm).
  • The same batch of worker nodes has also been configured to access the new CernVM-FS Stratum-1 service at RAL (cvmfs-wlcg.gridpp.rl.ac.uk).
Declared in the GOC DB
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
Whole Site | SCHEDULED | WARNING | 12/02/2014 10:00 | 12/02/2014 12:00 | 2 hours | RAL Tier1 site in warning state due to UPS/generator test.
Advanced warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.

Listing by category:

  • Databases:
    • Switch LFC/FTS/3D to new Database Infrastructure.
  • Castor:
    • Castor 2.1.14 testing is ongoing. A deployment date awaits successful completion of this testing.
  • Networking:
    • Implementation of new site firewall. Date for Tier1 proposed to be 11th March. (Initial changes for links that do not affect the Tier1 commenced this week.)
    • Update core Tier1 network and change connection to site and OPN including:
      • Install new Routing layer for Tier1 & change the way the Tier1 connects to the RAL network. (Required before firewall changes on 11th March).
      • These changes will lead to the removal of the UKLight Router.
  • Fabric:
    • We are phasing out the use of the software server used by the small VOs.
    • Firmware updates on remaining EMC disk arrays (Castor, FTS/LFC).
    • There will be circuit testing of the remaining (i.e. non-UPS) circuits in the machine room during 2014.
Entries in GOC DB starting between the 29th January and 5th February 2014.
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
All Castor endpoints (srm-alice, srm-atlas, srm-biomed, srm-cert, srm-cms, srm-dteam, srm-hone, srm-ilc, srm-lhcb, srm-mice, srm-minos, srm-na62, srm-preprod, srm-snoplus, srm-superb, srm-t2k) | SCHEDULED | WARNING | 04/02/2014 08:00 | 04/02/2014 10:00 | 2 hours | Testing of new interface to the tape library. During this time Castor disk services will remain up but there will be no tape access. Tape recalls will stall. Writes to tape-backed service classes will carry on, with files flushed from the disk caches to tape once the testing is completed.
lcglb03, lcglb04 | SCHEDULED | OUTAGE | 18/12/2013 11:00 | 31/01/2014 00:00 | 43 days, 13 hours | Old EMI-2 hosts to be retired.
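The Duration column can be cross-checked from the Start and End timestamps. The following is a minimal sketch, assuming the day/month/year format shown in the tables; the function name `downtime_duration` and the constant `GOCDB_FORMAT` are illustrative and are not part of any GOC DB tooling.

```python
from datetime import datetime

# Assumed timestamp layout, matching the Start/End columns (day/month/year).
GOCDB_FORMAT = "%d/%m/%Y %H:%M"


def downtime_duration(start: str, end: str) -> str:
    """Return the elapsed time between two timestamps as days and whole hours."""
    delta = datetime.strptime(end, GOCDB_FORMAT) - datetime.strptime(start, GOCDB_FORMAT)
    hours = delta.seconds // 3600  # whole hours left over after full days
    if delta.days == 0:
        return f"{hours} hours"
    return f"{delta.days} days, {hours} hours"


print(downtime_duration("18/12/2013 11:00", "31/01/2014 00:00"))  # 43 days, 13 hours
print(downtime_duration("04/02/2014 08:00", "04/02/2014 10:00"))  # 2 hours
```

Both results agree with the durations declared for the lcglb03/lcglb04 outage and the tape library test. The sketch ignores minutes and singular forms ("1 hour"), which do not occur in the entries listed here.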
Open GGUS Tickets (Snapshot during morning of meeting)
GGUS ID Level Urgency State Creation Last Update VO Subject
100887 Green Less Urgent In Progress 2014-01-31 2014-01-31 Please update gridsite on WebDAV LFC
100343 Red Less Urgent In Progress 2014-01-16 2014-02-03 RAL WMS still generating 512 proxies
100114 Red Less Urgent On Hold 2014-01-08 2014-01-30 Jobs failing to get from RAL WMS to Imperial
99556 Red Very Urgent In Progress 2013-12-06 2014-01-30 NGI Argus requests for NGI_UK
98249 Red Urgent On Hold 2013-10-21 2014-01-29 SNO+ please configure cvmfs stratum-0 for SNO+ at RAL T1
97025 Red Less Urgent On Hold 2013-09-03 2014-01-06 Myproxy server certificate does not contain hostname
Availability Report
Day | OPS | Alice | Atlas | CMS | LHCb | Comment
29/01/14 | 100 | 100 | 98.2 | 54.1 | 100 | CMS: Main availability loss in morning: Condor scheduling (as yesterday); plus a single SRM test failure. Atlas: Two separate SRM Put test failures.
30/01/14 | 100 | 100 | 99.7 | 95.9 | 96.0 | One SRM test failure in each case (Atlas, CMS & LHCb).
31/01/14 | 100 | 100 | 98.5 | 100 | 100 | Single SRM test failure.
01/02/14 | 100 | 100 | 100 | 100 | 100 |
02/02/14 | 100 | 100 | 100 | 100 | 100 |
03/02/14 | 100 | 100 | 99.5 | 98.8 | 95.7 | One SRM test failure in each case (Atlas, CMS & LHCb).
04/02/14 | 100 | 100 | 97.4 | 96.0 | 91.9 | A number of SRM test failures across the VOs.
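
As a rough summary of the week, the daily figures above can be averaged per column. This is a minimal sketch using a plain unweighted mean of the seven daily values; it is not the official availability calculation, and the names `ROWS`, `COLUMNS` and `weekly_means` are illustrative.

```python
from statistics import mean

# (day, OPS, Alice, Atlas, CMS, LHCb), copied from the availability table.
ROWS = [
    ("29/01/14", 100, 100, 98.2, 54.1, 100),
    ("30/01/14", 100, 100, 99.7, 95.9, 96.0),
    ("31/01/14", 100, 100, 98.5, 100, 100),
    ("01/02/14", 100, 100, 100, 100, 100),
    ("02/02/14", 100, 100, 100, 100, 100),
    ("03/02/14", 100, 100, 99.5, 98.8, 95.7),
    ("04/02/14", 100, 100, 97.4, 96.0, 91.9),
]
COLUMNS = ("OPS", "Alice", "Atlas", "CMS", "LHCb")


def weekly_means(rows):
    """Unweighted mean of the daily percentages for each column, to one decimal."""
    return {
        name: round(mean(row[i + 1] for row in rows), 1)
        for i, name in enumerate(COLUMNS)
    }


for name, value in weekly_means(ROWS).items():
    print(f"{name}: {value}")
```

On these figures the simple means are 100 for OPS and Alice, 99.0 for Atlas, 92.1 for CMS and 97.7 for LHCb. The CMS value is dominated by the 54.1 recorded on 29/01/14.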