Tier1 Operations Report 2011-12-07

From GridPP Wiki

RAL Tier1 Operations Report for 7th December 2011

Review of Issues during the week 30th November to 7th December 2011.

  • An operational error on Friday afternoon (2nd Dec) led to the software server used by the non-LHC VOs (called gdss142) being unavailable over the weekend. The service was restored around 2pm on Monday (5th Dec). A Post Mortem will be produced for this incident.
  • On Tuesday (6th Dec) there were two interruptions to the Atlas SRM, both caused by locking issues in the SRM database. The first was between 07:45 and 08:30 and was resolved when staff arrived at work. The second was between around 16:00 and 16:30, when SRM database queries hung under heavy load. This fixed itself when the load reduced.
  • The problem reported last week that CERN had seen issues with AFS callbacks to RAL worker nodes has been resolved. The AFS rule set had not been applied to worker nodes when they were moved to a different IP subnet.

Resolved Disk Server Issues

  • GDSS296 (CMSFarmRead - D0T1), which was reported at the last meeting as being retired following some "checksum-mismatch" failures, has now been removed from service.

Current operational status and issues.

  • This morning (Wednesday 7th Dec.) we are having some problems with Atlas transfers and the batch system. We are seeing some underlying DNS issues that are being investigated at the time of this report/meeting.
  • The slow data transfers into the RAL Tier1 from other Tier1s and CERN (i.e. asymmetrical performance) continue to be investigated. However, the only channel that is still of concern is from Birmingham to RAL. This item will now be dropped from this report.
  • We continue work with the Perfsonar network test system. The initial set-up was on virtual machines. Hardware has now been obtained to run Perfsonar and one of these nodes has now been installed with Perfsonar for bandwidth tests.

Ongoing Disk Server Issues

  • None

Notable Changes made this last week

  • Roll-out of UMD versions of the Site BDIIs is complete; for the Top BDIIs, three out of five have been completed.
  • Migration of MINOS data from an NFS server to Castor.
  • Completed the roll-out of glite 3.2.11 and CVMFS 2.0.4-1 across the batch farm.

Forthcoming Work & Interventions

  • Saturday 10th December. Replacement of some DNS servers at RAL. These are not the ones mainly used by the Tier1. The two remaining DNS servers, which are mainly used by the Tier1, will be updated in January.

Declared in the GOC DB

  • None

Advance warning for other interventions

The following items are being discussed and are still to be formally scheduled and announced:

  • Regular Oracle "PSU" patches are pending.
  • There are also plans to move part of the cooling system onto the UPS supply. The use of temporary power arrangements means this should no longer require downtime of computer systems.
  • Switch Castor and LFC/FTS/3D to new Database Infrastructure. The problems that caused the postponement of this migration are now understood and, apart from some detailed re-configuration, the migration should be ready to go at the start of the new year.
  • Networking change required to extend range of addresses that route over the OPN.
  • Address permissions problem regarding Atlas User access to all Atlas data.
  • Replace hardware running Castor Head Nodes (aimed for end of year).
  • Updates of Grid Services (including LB, APEL, batch server) to UMD versions (mainly in new year).
  • Updates to the RAL DNS infrastructure (replacing DNS servers).

Entries in GOC DB starting between 30th November and 7th December 2011.

There were no entries in the GOC DB for this last week.

Open GGUS Tickets

GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject
77026 | Red | Urgent | In progress | 2011-11-29 | 2011-12-06 | | BDII
76750 | Green | Very urgent | In progress | 2011-11-23 | 2011-11-29 | T2K | Jobs get aborted due to proxy(?) issues
76564 | Amber | Very urgent | Waiting for reply | 2011-11-17 | 2011-11-29 | geant4 | jobs abort on lcgce05.gridpp.rl.ac.uk
74353 | Red | Very urgent | In progress | 2011-09-16 | 2011-12-05 | Pheno | Proxy not renewing properly from WMS
68853 | Red | Less urgent | On hold | 2011-03-22 | 2011-11-07 | | Retirement of SL4 and 32bit DPM Head nodes and Servers (Holding Ticket for Tier2s)
68077 | Red | Less urgent | In progress | 2011-02-28 | 2011-09-20 | | Mandatory WLCG InstalledOnlineCapacity not published
64995 | Red | Less urgent | In progress | 2010-12-03 | 2011-09-20 | | No GlueSACapability defined for WLCG Storage Areas