Tier1 Operations Report 2011-11-23

RAL Tier1 Operations Report for 23rd November 2011

Review of Issues during the week 16th to 23rd November 2011.

  • On Thursday 17th an internal network problem at RAL made RAL-LCG2 unavailable from 07:50 to 09:00. This was caused by the failure of a power supply for one of the main RAL routers. Once this recovered there was a problem with the LHCb Castor instance, where the LSF scheduler within Castor had lost contact with some of its disk servers. This was resolved at around 14:45 (and the other Castor instances were checked for the same failure). There were also around 300 batch job failures at the time, although it later became clear that more batch jobs had probably run into trouble; this was only fully resolved some days later (see below).
  • Late in the evening on Monday (21st Nov) a problem on the Oracle database affected the Atlas & LHCb Castor instances for an hour or two shortly before midnight. This was traced to an Oracle bug and was resolved by the on-call team.
  • On Tuesday morning (22nd) there was scheduled maintenance work on both the main RAL link (to Reading) and the OPN link to CERN. Both failed over to their backup routes for a while between 07:00 and 08:00.
  • During the second half of Tuesday 22nd we were not starting enough batch jobs and the farm was partly empty. This was traced to some stuck jobs. Attempts to clear these out late on Tuesday afternoon helped but did not resolve the problem. Further work on Wednesday morning (today) has resolved the issue. The main cause appears to be linked to jobs that started on Thursday (17th) and was possibly triggered by the networking problem of that day.

Resolved Disk Server Issues

  • None.

Current operational status and issues.

  • The slow data transfers into the RAL Tier1 from other Tier1s and CERN (i.e. asymmetrical performance) continue to be investigated. Improvements to rates to/from the RAL Tier1 have been made and now only two channels (NIKHEF to RAL; Birmingham to RAL) remain below an acceptable threshold.
  • We continue to work with the Perfsonar network test system to understand some anomalies that have been seen. The initial set-up was on virtual machines; hardware has now been obtained to run Perfsonar.

Ongoing Disk Server Issues

  • Gdss456 (AtlasDataDisk) failed with a read-only file system on Wednesday 28th September. The Fabric Team have completed their work on this server and it is awaiting re-deployment.
  • On Tuesday morning gdss375 (AtlasTape D0T1) had two failed drives and was taken out of production.

Notable Changes made this last week

  • Started roll-out of the UMD version of the Top BDII (Site BDII roll-out also under way).
  • Firmware update of all remaining disk servers in the affected batch to resolve spurious 'SMART' errors.

Forthcoming Work & Interventions

  • Tuesday 29th November. Failover of OPN link to backup during maintenance. Should be transparent.

Declared in the GOC DB

  • None

Advance warning for other interventions

The following items are being discussed and are still to be formally scheduled and announced:

  • Update, in rolling manner, the Site and Top-BDII nodes to the UMD release.
  • Regular Oracle "PSU" patches are pending.
  • There are also plans to move part of the cooling system onto the UPS supply. The use of temporary power arrangements means this should no longer require downtime of computer systems.
  • Switch Castor and LFC/FTS/3D to the new Database Infrastructure. The problems that caused the postponement of this migration are now understood and, apart from some detailed re-configuration, the migration should be ready to go at the start of the new year.
  • Networking change required to extend range of addresses that route over the OPN.
  • Address permissions problem regarding Atlas User access to all Atlas data.
  • Replace hardware running Castor Head Nodes (aimed for end of year).
  • Updates to the RAL DNS infrastructure (replacing DNS servers).

Entries in GOC DB starting between 16th and 23rd November 2011.

There was 1 unscheduled entry in the GOC DB for this last week, which was for the problem on the site network.

Service Scheduled? Outage/At Risk Start End Duration Reason
Whole Site UNSCHEDULED OUTAGE 17/11/2011 07:45 17/11/2011 09:00 1 hour and 15 minutes Site Outage following network failure. (GOCDB item added retrospectively).

Open GGUS Tickets

GGUS ID Level Urgency State Creation Last Update VO Subject
76750 Green Very urgent In progress 2011-11-23 2011-11-23 T2K Jobs get aborted due to proxy(?) issues
76735 Green Urgent In progress 2011-11-22 2011-11-23 vo.londongrid.ac.uk lcglb02 GSS error
76564 Amber Very urgent In progress 2011-11-17 2011-11-18 geant4 jobs abort on lcgce05.gridpp.rl.ac.uk
76521 Yellow Less urgent Waiting for reply 2011-11-16 2011-11-22 snoplus Support for snoplus.snolab.ca
75395 Red Urgent Waiting for reply 2011-10-17 2011-11-22 T2K WMS 'jumping' (Ticket now with L&B support)
74353 Red Very urgent Waiting for reply 2011-09-16 2011-11-22 Pheno Proxy not renewing properly from WMS
68853 Red Less urgent On hold 2011-03-22 2011-11-07 Retirement of SL4 and 32bit DPM Head nodes and Servers (Holding Ticket for Tier2s)
68077 Red Less urgent In progress 2011-02-28 2011-09-20 Mandatory WLCG InstalledOnlineCapacity not published
64995 Red Less urgent In progress 2010-12-03 2011-09-20 No GlueSACapability defined for WLCG Storage Areas