Tier1 Operations Report 2011-12-14

From GridPP Wiki

RAL Tier1 Operations Report for 14th December 2011

Review of Issues during the week 7th to 14th December 2011.

  • The operational error on Friday afternoon (2nd Dec) that left the software server used by the non-LHC VOs unavailable over the weekend has been the subject of a post mortem, which is available at:
 https://www.gridpp.ac.uk/wiki/RAL_Tier1_Incident_20111202_VO_Software_Server
  • As reported last week there were problems last Tuesday & Wednesday (6/7 Dec) on the Atlas SRM. During Wednesday there were more general problems, which were traced to very slow responses from one of the DNS servers. The DNS issues were resolved that day, partly by the Tier1 re-configuring a large number of nodes to select the DNS servers in a different order, which lightened the load on the one failing DNS server. It is possible that the various problems seen earlier (including those with the Atlas SRM) were triggered by the DNS issues.
  • CMS migration to tape was not working overnight Thursday-Friday (8/9 Dec). This was picked up late Thursday afternoon but initial attempts to resolve it failed. Work continued on Friday morning, when a configuration problem was found and fixed.
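The DNS workaround described above amounts to changing the order in which nameservers are listed in each node's resolver configuration, since the resolver queries them in listed order. A minimal sketch (the addresses below are documentation placeholders, not the actual RAL DNS servers):

```
# /etc/resolv.conf -- before: the overloaded server is queried first
#   nameserver 192.0.2.1   (slow/failing DNS server)
#   nameserver 192.0.2.2
#   nameserver 192.0.2.3
#
# after: healthy servers listed first, so most lookups
# never reach the failing one
nameserver 192.0.2.2
nameserver 192.0.2.3
nameserver 192.0.2.1
```

Applying different orderings across the node population spreads queries over the healthy servers and takes most of the load off the failing one.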

Resolved Disk Server Issues

  • Early Wednesday evening (7th Dec) gdss428 (AtlasDataDisk) failed with a read-only file system. It was restored to service in read-only mode (as a precaution) on Thursday morning (8th), and back into full production on Tuesday (13th) after no further problems were seen.

Current operational status and issues.

  • Work on the Perfsonar network test system is ongoing. The throughput measurements are now made with Perfsonar running on dedicated hardware rather than on a virtual machine. Studies of network behaviour with Perfsonar will continue and, unless specific operational problems arise, will no longer be tracked here.

Ongoing Disk Server Issues

  • None

Notable Changes made this last week

  • Roll-out of UMD versions of Top BDIIs complete.
  • Job limits on our grid4000M queue were changed after ATLAS created a new queue pointing specifically to it.
  • DNS servers at RAL were updated on Saturday 10th Dec. (As planned two servers remain to be updated.)

Forthcoming Work & Interventions

  • Thursday 5th January. First step of migration of Oracle database infrastructure. This is the move of Castor databases and will cause an outage of Castor for some hours (with a batch drain beforehand.)

Declared in the GOC DB

  • None

Advanced warning for other interventions

The following items are being discussed and are still to be formally scheduled and announced. A significant amount of work will need to be done during the LHC stop at the start of 2012.

  • Infrastructure:
    • Move part of the cooling system onto the UPS supply. (Should not require service interruption.)
  • Databases:
    • Regular Oracle "PSU" patches are pending.
    • Switch Castor and LFC/FTS/3D to new Database Infrastructure. The problems that caused the postponement of this migration are now understood and, apart from some detailed re-configuration, the migration should be ready to go at the start of the new year.
  • Castor:
    • Castor 2.1.11 upgrade.
    • SRM 2.11 upgrade.
    • Replace hardware running Castor Head Nodes.
  • Networking:
    • Changes required to extend range of addresses that route over the OPN.
    • Re-configure networking in UPS room.
    • Install new Routing & Spine layers.
    • Final updates to the RAL DNS infrastructure (two DNS servers still to be replaced).
  • Fabric:
    • BIOS/firmware updates, equipment moves in machine room (consolidation of equipment in racks; some rack moves.)
  • Grid Services:
    • Updates of Grid Services (including LB, APEL, batch server) to UMD versions (mainly in new year).
  • VO:
    • Address the permissions problem regarding Atlas user access to all Atlas data.
    • Migrate Atlas LFC to CERN.

Entries in GOC DB starting between 7th and 14th December 2011.

There were no entries in the GOC DB for this last week.

Open GGUS Tickets

GGUS ID Level Urgency State Creation Last Update VO Subject
77026 Red Less urgent Waiting reply 2011-11-29 2011-12-07 BDII
76564 Red Very urgent In progress 2011-11-17 2011-12-14 geant4 jobs abort on lcgce05.gridpp.rl.ac.uk
74353 Red Very urgent In progress 2011-09-16 2011-12-07 Pheno Proxy not renewing properly from WMS
68853 Red Less urgent On hold 2011-03-22 2011-11-07 Retirement of SL4 and 32bit DPM Head nodes and Servers (Holding Ticket for Tier2s)
68077 Red Less urgent In progress 2011-02-28 2011-09-20 Mandatory WLCG InstalledOnlineCapacity not published
64995 Red Less urgent In progress 2010-12-03 2011-09-20 No GlueSACapability defined for WLCG Storage Areas