Tier1 Operations Report 2012-03-14

RAL Tier1 Operations Report for 14th March 2012

Review of Issues during the week 7th to 14th March 2012.

  • There have been more failures of the network link between the UKlight and SAR routers (this affects transfers to/from Tier2s). These occurred on Friday evening and Tuesday morning. In both cases the problem was seen and resolved by the networking team.
  • The programmatic interface to the SAM tests that we had been using was switched off on Tuesday 6th March. This broke both the display of SAM test updates on our dashboard and call-outs on SAM test failures. These have been re-worked to use the newer interface, although some adjustments remain to be done. We are also using the newer "SUM" web pages in our monitoring rather than the now defunct Gridview pages.
  • The loss of two files from tape has been reported to MINOS. These were discovered during a tape re-pack operation. They date from 2008 and the loss is believed to have been caused by a bug that was present in the version of Castor running at that time.

Resolved Disk Server Issues

  • GDSS379 (CMSTape D0T1) was taken out of production last Wednesday morning (7th) and returned to operation the following day. A faulty disk was causing problems for the RAID controller.

Current operational status and issues.

  • There is a known issue with the Atlas SRMs which is being investigated. A patched version has been rolled out that provides a workaround for the problem and stops the SRMs crashing. The remaining impact of this problem is minimal.
  • Work is ongoing to get to the root of the problems that affect the network link between the UKlight and SAR routers.
  • There has been a low-level problem with the new FTS (version 2.2.8) whereby the FTS agent can crash. The re-starts mean this has minimal impact. It is being followed up with the developers.

Ongoing Disk Server Issues

  • None

Notable Changes made this last week

  • Re-working of infrastructure that picks up SAM test results.
  • New internal e-mail server being brought into use.

Forthcoming Work & Interventions

  • The Castor Client software on the Worker Nodes is being upgraded to version 2.1.11-8.
  • The second batch of worker nodes is expected to go into production within a few weeks.
  • A further intervention on a power board supplied by the UPS will be needed. This is a very low-risk intervention and will probably take place in the week beginning 26th March.

Declared in the GOC DB

  • None

Advance warning for other interventions

The following items are being discussed and are still to be formally scheduled and announced. We are carrying out a significant amount of work during the current LHC stop.

  • Databases:
    • Regular Oracle "PSU" patches are pending for SOMNUS (LFC & FTS).
    • Switch LFC/FTS/3D to new Database Infrastructure.
    • Update LFC/FTS databases to Oracle 11.
  • Castor:
    • Update the Castor Information Provider (CIP) (Need to re-schedule.)
    • Move to use Oracle 11g (requires a minor Castor update.)
  • Networking:
    • Install new Routing & Spine layers for Tier1 network.
    • Main RAL network updates - early summer.
    • Addition of caching DNSs into the Tier1 network.
  • Grid Services:
    • Updates of Grid Services (including WMS, MyProxy, LFC front ends) to EMI/UMD versions.

Entries in GOC DB starting between 7th and 14th March 2012.

There were no entries in the GOC DB for this last week.

Open GGUS Tickets

GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject
80119 | Green | Less Urgent | In Progress | 2012-03-12 | 2012-03-12 | SNO+ | ROOT build failing
79428 | Red | Less Urgent | Waiting Reply | 2012-02-21 | 2012-03-14 | SNO+ | glite-wms-job aborted
68853 | Red | Less Urgent | On hold | 2011-03-22 | 2012-03-12 | | Retirement of SL4 and 32bit DPM Head nodes and Servers (Holding Ticket for Tier2s)