Difference between revisions of "Tier1 Operations Report 2013-12-04"

From GridPP Wiki

Latest revision as of 13:23, 4 December 2013

RAL Tier1 Operations Report for 4th December 2013

Review of Issues during the week 27th November to 4th December 2013.
  • A problem was reported last week with one of the WMS systems, WMS05, caused by a user job filling up the available disk space. Our initial clean-up was insufficient: the disk on WMS05 again became nearly full and the service stopped accepting jobs overnight Thursday/Friday.
  • One file has been reported lost to Atlas. It was found to be missing during the (ongoing) Atlas file renaming.
Resolved Disk Server Issues
  • Two disk servers (gdss238, gdss239) in AtlasHotDisk were out of production from Thursday to Friday (28-29 Nov) while they were physically moved; the rack space was required for this year's purchases.
Current operational status and issues
  • Nothing To Report.
Ongoing Disk Server Issues
  • None
Notable Changes made this last week.
  • On Friday 29th Nov. the site-BDIIs were updated to EMI-3 update 9.
  • Some batch system parameters have been adjusted as experience is gained with the new system, notably when Atlas were running a large number of whole node jobs.
Declared in the GOC DB
  • Wednesday 11th December: UPS/Generator Load Test at 10:00. Site in 'warning' state.
Advance warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
  • There will be an interruption to the small VO's software server as it is to be physically moved.

Listing by category:

  • Databases:
    • Switch LFC/FTS/3D to new Database Infrastructure.
  • Castor:
    • Castor 2.1.14 testing is starting. It is expected to be a few months before deployment.
  • Networking:
    • Possible move of Tier1 core network switch in January (TBC).
    • Implementation of new site firewall.
    • Update core Tier1 network and change connection to site and OPN including:
      • Install new Routing layer for Tier1
      • Change the way the Tier1 connects to the RAL network.
      • These changes will lead to the removal of the UKLight Router.
  • Fabric
    • Firmware updates on remaining EMC disk arrays (Castor, FTS/LFC)
Entries in GOC DB starting between the 27th November and 4th December 2013.
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
lcgfts.gridpp.rl.ac.uk | UNSCHEDULED | OUTAGE | 26/11/2013 15:00 | 26/11/2013 15:15 | 15 minutes | Investigating problems with restarting FTS2 service after intervention earlier today
lcgft-atlas.gridpp.rl.ac.uk, lcgfts.gridpp.rl.ac.uk, lfc.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 26/11/2013 09:30 | 26/11/2013 15:00 | 5 hours and 30 minutes | Outage of LFC, FTS2 and Atlas 3D/Frontier during work on disk array used by back end database.
Open GGUS Tickets (Snapshot at time of meeting)
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject
98249 | Red | Urgent | Waiting Reply | 2013-10-21 | 2013-11-18 | SNO+ | please configure cvmfs stratum-0 for SNO+ at RAL T1
98122 | Red | Less Urgent | Waiting Reply | 2013-10-17 | 2013-11-18 | cernatschool | CVMFS access for the cernatschool.org VO
97868 | Red | Less Urgent | Waiting Reply | 2013-10-08 | 2013-12-03 | T2K | CVMFS for t2k.org
97385 | Red | Less Urgent | In Progress | 2013-09-17 | 2013-11-18 | HyperK | CVMFS for hyperk.org
97025 | Red | Less Urgent | On Hold | 2013-09-03 | 2013-11-05 | | Myproxy server certificate does not contain hostname
86152 | Red | Less Urgent | On Hold | 2012-09-17 | 2013-10-18 | | correlated packet-loss on perfsonar host
Availability Report
Day | OPS | Alice | Atlas | CMS | LHCb | Comment
27/11/13 | 100 | 91.1 | 100 | 100 | 58.4 | Ongoing problem that affected all sites. (For Alice additional scheduling issue - see 28/11)
28/11/13 | 100 | 51.5 | 100 | 100 | 100 | Problem scheduling Alice test jobs coming into the 'whole node' queue.
29/11/13 | 100 | 100 | 100 | 100 | 100 |
30/11/13 | 100 | 100 | 100 | 100 | 100 |
01/12/13 | 100 | 100 | 100 | 100 | 100 |
02/12/13 | 100 | 100 | 100 | 100 | 100 |
03/12/13 | 100 | 100 | 100 | 100 | 100 |