Difference between revisions of "Tier1 Operations Report 2011-11-30"


Latest revision as of 13:15, 30 November 2011

RAL Tier1 Operations Report for 30th November 2011

Review of Issues during the week 23rd to 30th November 2011.

  • In the middle of last week we had a problem starting Alice batch work. Early Thursday morning (24th Nov) Alice found and fixed a problem with the Alice VO box that resolved the issue.
  • On Thursday afternoon (24th Nov) we observed a very low job start rate on the farm. This was traced to batch jobs that were in a queued state but already had a particular execution host allocated. That node was disabled in the batch system and the affected jobs were deleted, after which the job start rate returned to normal levels.
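
The diagnosis in the last item can be illustrated with a minimal sketch. It assumes a hypothetical list of job records (the field names `id`, `state` and `host` are invented for illustration and are not the batch system's real output format). It counts queued jobs that already have an execution host allocated, so that a node holding many of them stands out.

```python
from collections import Counter

def hosts_holding_queued_jobs(jobs, threshold=1):
    """Return {host: count} for hosts that have at least `threshold`
    jobs still queued but already allocated to that host."""
    counts = Counter(
        job["host"]
        for job in jobs
        if job["state"] == "queued" and job.get("host")
    )
    return {host: n for host, n in counts.items() if n >= threshold}

# Hypothetical job records; host names are made up for this example.
example_jobs = [
    {"id": 101, "state": "queued", "host": "wn0042"},
    {"id": 102, "state": "queued", "host": "wn0042"},
    {"id": 103, "state": "queued", "host": None},      # queued, no host yet
    {"id": 104, "state": "running", "host": "wn0007"},  # running normally
]

suspects = hosts_holding_queued_jobs(example_jobs, threshold=2)
print(suspects)
```

A host that appears here with a large count would be a candidate for disabling, as was done for the node described above.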

Resolved Disk Server Issues

  • Gdss456 (AtlasDataDisk) had failed with a read-only file system on Wednesday 28th September. This server was replaced on 3rd November.
  • On Tuesday morning (22nd Nov) gdss375 (AtlasTape D0T1) had two failed drives and was taken out of production. It was returned to production on Friday morning (25th Nov).
  • On Tuesday (29th) the monitoring reported a problem with the xroot daemon on gdss569 (LHCbDst D1T0). This was traced to the clock being out by 90 seconds and fixed later that day.

Current operational status and issues.

  • The slow data transfers into the RAL Tier1 from other Tier1s and CERN (i.e. asymmetrical performance) continue to be investigated. Improvements to rates to/from the RAL Tier1 have been made and now only one channel (Birmingham to RAL) remains a cause for concern.
  • We continue work with the Perfsonar network test system to understand some anomalies seen. The initial set-up was on virtual machines. Hardware has now been obtained to run Perfsonar.
  • CERN has reported issues with AFS callbacks to RAL worker nodes (29/11/2011). This is being investigated.

Ongoing Disk Server Issues

  • GDSS296 (CMSFarmRead - D0T1) was set "read-only" on Monday (28th), and will be removed from production. This follows the daily "checksum-mismatch" checks flagging as corrupt four files that had been written the day before.
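
As an illustration of the kind of check mentioned above, here is a minimal sketch of a checksum comparison. It assumes Adler-32 checksums and a hypothetical mapping from file path to expected checksum; the report does not state which algorithm is used or how the daily checks are implemented.

```python
import zlib

def adler32_of_file(path, chunk_size=1024 * 1024):
    """Compute the Adler-32 checksum of a file, reading it in chunks."""
    checksum = 1  # Adler-32 starting value
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            checksum = zlib.adler32(chunk, checksum)
    return checksum & 0xFFFFFFFF

def find_mismatches(expected):
    """Given {path: expected checksum as 8-digit hex}, return the paths
    whose current checksum differs from the recorded one."""
    return [
        path
        for path, want in expected.items()
        if format(adler32_of_file(path), "08x") != want.lower()
    ]
```

Files returned by `find_mismatches` would be the ones to treat as potentially corrupt; the catalogue of expected values is assumed to exist elsewhere.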

Notable Changes made this last week

  • Roll-out of UMD versions of Top BDII & Site BDII well under way.

Forthcoming Work & Interventions

  • Saturday 10th December. Replacement of some DNS servers at RAL. These are not the servers mainly used by the Tier1. The two remaining DNS servers, which are the ones mainly used by the Tier1, will be updated in January.

Declared in the GOC DB

  • None

Advance warning for other interventions

The following items are being discussed and are still to be formally scheduled and announced:

  • Regular Oracle "PSU" patches are pending.
  • There are also plans to move part of the cooling system onto the UPS supply. The use of temporary power arrangements means this should no longer require downtime of computer systems.
  • Switch Castor and LFC/FTS/3D to new Database Infrastructure. The problems that caused the postponement of this migration are now understood and, apart from some detailed re-configuration, the migration should be ready to go at the start of the new year.
  • Networking change required to extend range of addresses that route over the OPN.
  • Address permissions problem regarding Atlas User access to all Atlas data.
  • Replace hardware running Castor Head Nodes (aimed for end of year).
  • Updates to the RAL DNS infrastructure (replacing DNS servers)

Entries in GOC DB starting between 23rd and 30th November 2011.

There were no entries in the GOC DB for this last week.

Open GGUS Tickets

GGUS ID Level Urgency State Creation Last Update VO Subject
76877 Green Urgent In progress 2011-11-29 2011-11-29 T2K FTS transfers RALLCG2-VICTORIALCG2
76750 Green Very Urgent In progress 2011-11-23 2011-11-29 T2K Jobs get aborted due to proxy(?) issues
76735 Green Urgent In progress 2011-11-22 2011-11-25 vo.londongrid.ac.uk lcglb02 GSS error
76564 Amber Very urgent waiting for reply 2011-11-17 2011-11-29 geant4 jobs abort on lcgce05.gridpp.rl.ac.uk
75395 Red urgent unsolved 2011-10-17 2011-11-28 T2K WMS 'jumping' (Set unsolved by L&B support.)
74353 Red very urgent waiting for reply 2011-09-16 2011-11-22 Pheno Proxy not renewing properly from WMS
68853 Red less urgent On hold 2011-03-22 2011-11-07 Retirement of SL4 and 32bit DPM Head nodes and Servers (Holding Ticket for Tier2s)
68077 Red less urgent in progress 2011-02-28 2011-09-20 Mandatory WLCG InstalledOnlineCapacity not published
64995 Red less urgent in progress 2010-12-03 2011-09-20 No GlueSACapability defined for WLCG Storage Areas