Tier1 Operations Report 2011-11-16


RAL Tier1 Operations Report for 16th November 2011

Review of Issues during the week 9th to 16th November 2011.

  • In the middle of last week there were problems with transfers for two LHCb users. This was traced to a root CA certificate that had been introduced in the latest distribution but, owing to an oversight, had not been rolled out to the three LHCb SRMs. (These three SRMs have a different set-up from the other SRMs.)
  • Since the weekend there have been problems for Alice accessing Castor. The problem remains unresolved despite intensive investigation on Monday and Tuesday. At the moment we are awaiting input from Alice, and our current thinking is that the problem lies at the Alice end.
  • A large number of pilot jobs were reported as failing (GGUS ticket from LHCb). On Monday (14th Nov) two nodes were found to be failing a significant number of jobs, although below the limit used by the 'black hole' detector (see the sketch after this list). These two batch worker nodes were removed from production.
  • Around 05:00 this morning (Wednesday 16th) there was a DNS problem that affected CERN. This also affected SAM tests of the Tier1 and had some impact on VO usage of our site.
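
For illustration only, the kind of per-node failure-rate check referred to above ('black hole' detection) can be sketched as follows. All node names, job counts and limits in the sketch are hypothetical and do not reflect the actual Tier1 monitoring configuration; this is a minimal threshold-based check, not the production detector.

 # Minimal sketch of a threshold-based 'black hole' worker node check.
 # All node names, counts and limits below are hypothetical examples.

 MAX_FAILED_JOBS = 50          # hypothetical limit per reporting period
 MIN_FAILURE_FRACTION = 0.9    # hypothetical: most of the node's jobs must fail

 # Hypothetical per-node accounting: node -> (failed jobs, total jobs)
 job_counts = {
     "lcg0001.example.ac.uk": (12, 240),
     "lcg0002.example.ac.uk": (65, 70),
     "lcg0003.example.ac.uk": (3, 180),
 }

 def black_hole_nodes(counts):
     """Return nodes whose failure count and failure fraction both exceed the limits."""
     flagged = []
     for node, (failed, total) in counts.items():
         if total == 0:
             continue
         if failed >= MAX_FAILED_JOBS and failed / total >= MIN_FAILURE_FRACTION:
             flagged.append(node)
     return flagged

 if __name__ == "__main__":
     for node in black_hole_nodes(job_counts):
         print("Candidate 'black hole' worker node: %s" % node)

A check of this form only flags nodes above the configured limits, so nodes failing jobs at a lower rate, as in the incident above, still have to be found by other means.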

Resolved Disk Server Issues

  • None.

Current operational status and issues.

  • The slow data transfers into the RAL Tier1 from other Tier1s and CERN (i.e. asymmetrical performance) continue to be investigated. Improvements to rates to/from the RAL Tier1 have been made and now only two channels (NIKHEF to RAL; Birmingham to RAL) remain below an acceptable threshold.
  • We continue to work with the Perfsonar network test system to understand some anomalies that have been seen. The initial set-up was on virtual machines; hardware has now been obtained on which to run Perfsonar.
  • We are currently patching all Grid Services nodes that run a BDII.

Ongoing Disk Server Issues

  • Gdss456 (AtlasDataDisk) failed with a read only file system on Wednesday 28th September. Fabric Team have completed their work on this server and it is awaiting re-deployment.
  • As reported in recent weeks, we are seeing a high number of 'SMART' errors reported by a particular batch of disk servers. Most of these are spurious and are resolved by an updated version of the disk controller firmware. This update has been applied successfully to the D0T1 disk servers and will be rolled out to the affected D1T0 disk servers over the next week or two (a sketch of the kind of per-drive check involved is given below).
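
As a rough illustration only (this is not the Fabric Team's actual tooling), the sketch below reads a drive's overall SMART health and its reported firmware version with smartctl, the sort of information used to pick out servers still on the old controller firmware. The device path, the expected firmware string and the assumptions that smartmontools is installed and the drive reports ATA-style output are all hypothetical.

 # Hedged sketch: read a drive's SMART health and firmware version via smartctl.
 # Assumes smartmontools is installed, an ATA-style drive, and sufficient privileges.
 # The device path and EXPECTED_FIRMWARE value are hypothetical examples.

 import subprocess

 EXPECTED_FIRMWARE = "FW1.23"   # hypothetical post-update firmware string

 def smartctl(args):
     """Run smartctl with the given arguments and return its text output."""
     result = subprocess.run(["smartctl"] + args, capture_output=True, text=True)
     return result.stdout

 def check_drive(device):
     """Return (health OK?, reported firmware version) for one drive."""
     health = smartctl(["-H", device])
     info = smartctl(["-i", device])
     healthy = "PASSED" in health or "OK" in health
     firmware = None
     for line in info.splitlines():
         if line.startswith("Firmware Version:"):
             firmware = line.split(":", 1)[1].strip()
     return healthy, firmware

 if __name__ == "__main__":
     healthy, firmware = check_drive("/dev/sda")
     print("SMART health OK: %s, firmware: %s" % (healthy, firmware))
     if firmware is not None and firmware != EXPECTED_FIRMWARE:
         print("Drive still appears to be on pre-update firmware.")

In practice such a check would be run across all servers in the affected batch and the reported firmware compared against the post-update version.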

Notable Changes made this last week

  • A firmware update was applied to the RAL Firewall yesterday morning (Tuesday 15th Nov) which addresses a problem of some packet loss.

Forthcoming Work & Interventions

  • Tuesday 22nd November. Failover of main RAL link (from Reading link to London link) during maintenance. Should be transparent.
  • Tuesday 29th November. Failover of OPN link to backup during maintenance. Should be transparent.

Declared in the GOC DB

  • None

Advance warning for other interventions

The following items are being discussed and are still to be formally scheduled and announced:

  • Update, in a rolling manner, the Site and Top-BDII nodes to the UMD release.
  • Regular Oracle "PSU" patches are pending.
  • Update the disk controller firmware on D1T0 nodes in the batch of servers reporting spurious SMART errors.
  • There are also plans to move part of the cooling system onto the UPS supply that may require a complete outage (including systems on UPS).
  • Switch Castor and LFC/FTS/3D to new Database Infrastructure. This will only proceed once the problem that caused the cancellation of the first stage of this work last week is understood and fixed.
  • Networking change required to extend range of addresses that route over the OPN.
  • Address a permissions problem regarding Atlas user access to all Atlas data.
  • Replace hardware running Castor Head Nodes (aimed for end of year).

Entries in the GOC DB starting between 9th and 16th November 2011.

There was 1 unscheduled entry in the GOC DB for this last week, which was for the problems on the Alice xrootd manager.

Service Scheduled? Outage/At Risk Start End Duration Reason
Whole Site SCHEDULED OUTAGE 15/11/2011 07:55 15/11/2011 09:00 1 hour and 5 minutes Short outage during work on site network link. We anticipate around a 30 minute break in connectivity but have allowed some contingency.
srm-alice UNSCHEDULED OUTAGE 14/11/2011 09:30 14/11/2011 11:25 1 hour and 55 minutes Investigating problem with xrootd manager.

Open GGUS Tickets

GGUS ID Level Urgency State Creation Last Update VO Subject
76023 Red very urgent in progress 2011-11-05 2011-11-16 Camont LB query failed
75395 Red urgent in progress 2011-10-17 2011-11-15 T2K WMS 'jumping' (Ticket now with L&B support)
74353 Red very urgent waiting for reply 2011-09-16 2011-11-07 Pheno Proxy not renewing properly from WMS
68853 Red less urgent On hold 2011-03-22 2011-11-07 Retirement of SL4 and 32bit DPM Head nodes and Servers (Holding Ticket for Tier2s)
68077 Red less urgent in progress 2011-02-28 2011-09-20 Mandatory WLCG InstalledOnlineCapacity not published
64995 Red less urgent in progress 2010-12-03 2011-09-20 No GlueSACapability defined for WLCG Storage Areas