RAL Tier1 Operations Report for 29th May 2013

Review of Issues during the week 22nd to 29th May 2013.
  • There were a couple of restarts of the Atlas srmServer daemons during the afternoon of Wednesday 22nd May as some of the database services were relocated to other nodes within the RACs.
  • There was a problem with access to some Atlas files overnight Wed/Thu (22/23 May). We are in the process of draining some older disk servers. The draining was left running overnight and blocked access to the remaining files on these servers.
  • The ganglia server failed on Saturday (25th May) and was brought back into service on Tuesday (28th), after the bank holiday. This did not affect services but shows up as a gap in the data on the Tier1 dashboard.
Resolved Disk Server Issues
  • GDSS422 (LHCbUser - D1T0) was removed from production during the morning of Wednesday 22nd May and returned to service later that afternoon. The system had reported some memory errors beforehand, although memory tests did not find a problem.
Current operational status and issues
  • The uplink to the UKLight Router is running on a single 10Gbit link, rather than a pair of such links.
  • The problem of LHCb and Atlas jobs failing due to long job set-up times remains, and investigations continue.
  • The testing of FTS3 is continuing. (This runs in parallel with our existing FTS2 service).
  • We are participating in xrootd federated access tests for Atlas.
  • A test batch queue with five SL6/EMI-2 worker nodes and its own CE is in place.
Ongoing Disk Server Issues
  • None
Notable Changes made this last week
  • Many services have had kernel/errata updates. The CEs (amongst other service nodes) have been upgraded to SL6.4.
Declared in the GOC DB
  • Tuesday 4th June. Warning (At Risk) on whole site for the UPS / generator load test.
Advanced warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
  • On Tuesday 4th June there will be a UPS / Generator load test between 10:00 and 11:00.
  • Re-establishing the paired (2*10Gbit) link to the UKLight router. (Aiming to do this in the coming weeks.)
  • The problem reported last week following the upgrade of the non-Tier1 'facilities' Castor instance to version 2.1.13 is now understood and fixed. We will continue to monitor this closely ahead of re-scheduling the upgrade of Tier1 Castor instances.
  • The database team will re-organise the OGMA (Atlas 3D) database RAC to simplify the recovery procedures should this database have an operational problem.

Listing by category:

  • Databases:
    • Reconfigure OGMA (Atlas 3D) voting disk to simplify recovery procedures.
    • Switch LFC/FTS/3D to new Database Infrastructure.
  • Castor:
    • Upgrade to version 2.1.13
  • Networking:
    • Single link to UKLight Router to be restored as paired (2*10Gbit) link.
    • Update core Tier1 network and change connection to site and OPN including:
      • Install new Routing layer for Tier1
      • Change the way the Tier1 connects to the RAL network.
      • These changes will lead to the removal of the UKLight Router.
  • Grid Services
    • Testing of alternative batch systems (SLURM, Condor).
    • Upgrade of one remaining EMI-1 component (UI) being planned.
  • Fabric
    • One of the disk arrays hosting the FTS, LFC & Atlas 3D databases is showing a fault and an intervention is required.
  • Infrastructure:
    • A 2-day maintenance is being planned for sometime in October or November to cover the following items. This is expected to require around a half-day outage of power to the UPS room.
      • Intervention required on the "Essential Power Board" and remedial work on three (out of four) transformers.
      • Remedial work on the BMS (Building Management System) due to one of its three modules being faulty.
      • Electrical safety check. This will require a significant downtime (most likely 2 days), during which the above infrastructure issues will also be addressed.
Entries in GOC DB starting between 22nd and 29th May 2013.

There were no entries in the GOC DB starting during the last week.

Open GGUS Tickets (Snapshot at time of meeting)
GGUS ID  Level  Urgency      State          Creation    Last Update  VO     Subject
93149    Red    Less Urgent  On Hold        2013-04-05  2013-05-13   Atlas  RAL-LCG2: jobs failing with "cmtside command was timed out"
92266    Red    Less Urgent  Waiting Reply  2013-03-06  2013-05-21          Certificate for RAL myproxy server
91658    Red    Less Urgent  On Hold        2013-02-20  2013-05-29          LFC webdav support
86152    Red    Less Urgent  On Hold        2012-09-17  2013-03-19          correlated packet-loss on perfsonar host
Availability Report
Day       OPS  Alice  Atlas  CMS  LHCb  Comment
22/05/13  100  100    94.2   100  100   Atlas srmServer not responding during database reconfiguration.
23/05/13  100  100    99.1   100  100   Single test failure. Probably a knock-on effect of the stuck disk server draining which had not been stopped overnight.
24/05/13  100  100    100    100  100
25/05/13  100  100    100    100  100
26/05/13  100  100    100    100  100
27/05/13  100  100    100    100  100
28/05/13  100  100    100    100  100