Tier1 Operations Report 2014-03-05

From GridPP Wiki

Latest revision as of 13:20, 5 March 2014

RAL Tier1 Operations Report for 5th March 2014

Review of Issues during the week 26th February to 5th March 2014.
  • The Atlas disk space in Castor has become full. We are aware of an ongoing problem where file deletions triggered by Atlas' central service are slow. Some 'manual' deletions of files are taking place to speed up the process.
  • There have been significant problems with part of our Hyper-V infrastructure, which runs many production virtual machines. This started on Friday (28th). The more important VMs have been moved elsewhere while the underlying problem is investigated. The problems were only worked around sufficiently for us to report services as OK on Tuesday (4th). Services impacted included FTS3, which was running a large-scale test for Atlas & CMS. At our request, on Tuesday (4th) Atlas moved the bulk of their file transfers (all except for the UK) to other FTS3 servers.
  • Three CMS files have been lost from a tape. The tape monitoring showed a problem when the tape was being read. On investigation a number of bad files were found on the tape. After further work some of the files were recovered. Three files were finally declared lost to CMS.
Resolved Disk Server Issues
  • None
Current operational status and issues
  • The intermittent failures of Castor access via the SRM reported in recent weeks are still present. These have been seen across multiple Castor instances and the Castor team are actively working to understand them. Some changes have been made with the aim of alleviating the problem, but it recurred this morning (Wednesday 5th March).
Ongoing Disk Server Issues
  • None
Notable Changes made this last week.
  • Two updates have been applied to FTS3 (now at 3.1.80-1)
  • Increased daemon thread counts for transfermanagerd and stagerd have been rolled out to all CASTOR instances. This is part of investigations into the Castor problems reported elsewhere in this report.
  • Reduced number of replicas for atlasHotDisk from 10 to 1
  • The new MyProxy server (myproxy.gridpp.rl.ac.uk) has been added to the BDII. UIs have been changed to use this as their default MyProxy server.
Declared in the GOC DB
  • None
Advanced warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
  • The Tier1 will move to use the new site firewall on Monday 17th March. There will be some interruption to services as seen from outside RAL. Internally, services are expected to continue uninterrupted.

Listing by category:

  • Databases:
    • Switch LFC/FTS/3D to new Database Infrastructure.
  • Castor:
    • Castor 2.1.14 testing is largely complete. We are starting to look at possible dates for rolling this out (probably around April).
  • Networking:
    • Implementation of new site firewall.
    • Update core Tier1 network and change connection to site and OPN including:
      • Install new Routing layer for Tier1 & change the way the Tier1 connects to the RAL network.
      • These changes will lead to the removal of the UKLight Router.
  • Fabric:
    • We are phasing out the use of the software server used by the small VOs.
    • Firmware updates on remaining EMC disk arrays (Castor, FTS/LFC)
    • There will be circuit testing of the remaining (i.e. non-UPS) circuits in the machine room during 2014.
Entries in GOC DB starting between the 26th February and 5th March 2014.
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
arc-ce01.gridpp.rl.ac.uk | SCHEDULED | WARNING | 26/02/2014 10:00 | 26/02/2014 12:00 | 2 hours | At Risk during software upgrade to version 13.11 / 4.0.0.
Open GGUS Tickets (Snapshot during morning of meeting)
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject
101729 | Green | Top Priority | Waiting Reply | 2014-03-01 | 2014-03-05 | LHCb | Pilots failed at cream-ce02.gridpp.rl.ac.uk RAL-LCG2
101701 | Green | Less Urgent | In Progress | 2014-02-28 | 2014-02-28 | ILC | Pilots aborted on ARC CEs
101557 | Green | Less Urgent | In Progress | 2014-02-25 | 2014-03-04 | SNO+ | Unable to delegate proxy to fts
101532 | Green | Less Urgent | In Progress | 2014-02-25 | 2014-02-25 | | Publishing default value for Max CPU Time
101079 | Red | Urgent | In Progress | 2014-02-09 | 2014-02-25 | | ARC CEs have VOViews with a default SE of "0"
101052 | Red | Urgent | In Progress | 2014-02-06 | 2014-02-26 | Biomed | Can't retrieve job result file from cream-ce02.gridpp.rl.ac.uk
100114 | Red | Less Urgent | In Progress | 2014-01-08 | 2014-03-04 | | Jobs failing to get from RAL WMS to Imperial
99556 | Red | Very Urgent | In Progress | 2013-12-06 | 2014-02-13 | | NGI Argus requests for NGI_UK
98249 | Red | Urgent | On Hold | 2013-10-21 | 2014-01-29 | SNO+ | please configure cvmfs stratum-0 for SNO+ at RAL T1
97025 | Red | Less urgent | On Hold | 2013-09-03 | 2014-03-04 | | Myproxy server certificate does not contain hostname
Availability Report
Day | OPS | Alice | Atlas | CMS | LHCb | Comment
26/02/14 | 100 | 100 | 100 | 100 | 100 |
27/02/14 | 100 | 100 | 100 | 100 | 100 |
28/02/14 | 100 | 100 | 100 | 100 | 100 |
01/03/14 | 100 | 100 | 100 | 100 | 100 |
02/03/14 | 100 | 100 | 100 | 100 | 100 |
03/03/14 | 100 | 100 | 94.0 | 100 | 100 | Multiple SRM test failures. (4 * "User timeout"; 1 * "SRM_FILE_BUSY")
04/03/14 | 100 | 100 | 99.3 | 100 | 100 | Single SRM test failure ("Invalid argument")
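
For readers who want a single figure per VO for the week, the daily availability values above can be averaged. The following is a minimal sketch only: it assumes a simple unweighted mean over the seven listed days, which is not necessarily how official availability figures are aggregated, and the names ROWS, COLUMNS and weekly_means are illustrative rather than part of any Tier1 tooling.

```python
# Daily availability percentages copied from the Availability Report
# (comment column omitted). Column order: OPS, Alice, Atlas, CMS, LHCb.
ROWS = """\
26/02/14 100 100 100 100 100
27/02/14 100 100 100 100 100
28/02/14 100 100 100 100 100
01/03/14 100 100 100 100 100
02/03/14 100 100 100 100 100
03/03/14 100 100 94.0 100 100
04/03/14 100 100 99.3 100 100
"""

COLUMNS = ["OPS", "Alice", "Atlas", "CMS", "LHCb"]


def weekly_means(rows):
    """Return the unweighted mean availability per column (assumed method)."""
    totals = {name: 0.0 for name in COLUMNS}
    days = 0
    for line in rows.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        # fields[0] is the date; the next five fields are the percentages.
        values = [float(x) for x in fields[1:1 + len(COLUMNS)]]
        for name, value in zip(COLUMNS, values):
            totals[name] += value
        days += 1
    return {name: totals[name] / days for name in COLUMNS}


means = weekly_means(ROWS)
for name in COLUMNS:
    print(f"{name}: {means[name]:.2f}")
```

Under this assumption only the Atlas figure drops below 100, reflecting the SRM test failures on 3rd and 4th March.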