Tier1 Operations Report 2011-03-02

RAL Tier1 Operations Report for 2nd March 2011

Review of Issues during the week from 23rd February to 2nd March 2011.

  • On Tuesday (22nd Feb) gdss115 was taken out of Castor following problems. (It was a CMSWanIn machine with no migration candidates on it.) This server is part of a batch that is due for decommissioning, so it has been removed from service rather than repaired.
  • On Monday (28th Feb) GDSS424 (AtlasStripInput) was out of service for just over half an hour while memory was replaced.
  • Changes:
    • On Tuesday (1st March) gLite update 21 was applied to the RAL top-level BDIIs.
    • On Wednesday (2nd March) the FTS channels were reconfigured to use channel groups.

Current operational status and issues.

  • Atlas have reported slow data transfers into the RAL Tier1 from other Tier1s and CERN (i.e. asymmetrical performance). This is being investigated.
  • We are aware of a problem with the Castor Job manager that can occasionally hang up. The service recovers by itself after around 30 minutes. An automated capture of more diagnostic information is in place and we still await the next occurrence.
  • We have previously reported a problem with disk servers becoming unresponsive. We believe we understand the problem and are testing the fix which has been rolled out to all CMS disk servers. This problem (and fix) is still being monitored.

Declared in the GOC DB

  • Wednesday 2nd March: At Risk on FTS during reconfiguration of FTS channels to use channel groups.
  • From Wednesday 2nd March: RGMA registry (lcgic01) being decommissioned.
  • Thursday 10th March: At Risk for gLite update on Site BDIIs.

Advance warning:

The following items are being discussed and are still to be formally scheduled:

  • Castor Name Server upgrade - Proposed date Wednesday 9th March. (Will require Castor outage).
  • Updates to Site Routers on morning of Tuesday 15th March. (Will require Tier1 outage).
  • Castor upgrades (possibly directly to 2.1.10) - Possibly late March (27/28/29 during technical stop.)
  • Switch Castor to new Database Infrastructure - Possibly end March.
  • Increase shared memory for OGMA, LUGH & SOMNUS (rolling change). (OGMA done).
  • Address permissions problem regarding Atlas User access to all Atlas data.
  • Small isolating transformers have been received that should eliminate the electrical noise seen by the disk arrays hosting the Oracle databases. The plan is to install them for one of the Castor systems on 8th March (during the Castor stop), with the others following during the networking intervention on 15th March.
  • Atlas requested changes pending:
    • Renaming Atlas files (that were in MCDisk)
    • Adding xrootd libraries to worker nodes

Entries in GOC DB starting between 23rd February and 2nd March 2011.

There were no unscheduled entries this last week.

Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
lcgic01.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 02/03/2011 12:00 | 01/04/2011 01:00 | 29 days, 12 hours | RGMA registry to be decommissioned off
lcgfts.gridpp.rl.ac.uk | SCHEDULED | AT_RISK | 02/03/2011 10:00 | 02/03/2011 13:00 | 3 hours | Reconfiguration of FTS channels to use channel groups.
lcgbdii.gridpp.rl.ac.uk | SCHEDULED | AT_RISK | 01/03/2011 10:00 | 01/03/2011 13:00 | 3 hours | Applying gLite updates to the RAL top-level BDIIs.
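
The Duration column in the table above can be cross-checked with a short script. This is only a sketch under a stated assumption: the report does not say which time zone the Start and End columns use, and the code below assumes UK local time (GMT/BST). Under that assumption the lcgic01 entry spans the switch to BST on 27th March 2011, which would account for the quoted 29 days, 12 hours rather than the 29 days, 13 hours that a naive subtraction gives. The names `ENTRIES`, `to_utc` and `duration` are illustrative and not part of any GOC DB tooling.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # Python 3.9+; needs system tz data

# ASSUMPTION: the times in the table are UK local time (GMT/BST).
UK = ZoneInfo("Europe/London")
FMT = "%d/%m/%Y %H:%M"

# (service, start, end) copied from the table above.
ENTRIES = [
    ("lcgic01.gridpp.rl.ac.uk", "02/03/2011 12:00", "01/04/2011 01:00"),
    ("lcgfts.gridpp.rl.ac.uk", "02/03/2011 10:00", "02/03/2011 13:00"),
    ("lcgbdii.gridpp.rl.ac.uk", "01/03/2011 10:00", "01/03/2011 13:00"),
]


def to_utc(stamp: str) -> datetime:
    """Parse a local-time stamp and convert it to UTC."""
    local = datetime.strptime(stamp, FMT).replace(tzinfo=UK)
    return local.astimezone(timezone.utc)


def duration(start: str, end: str) -> tuple[int, int]:
    """Return the elapsed time between two stamps as (days, hours)."""
    # Subtract in UTC: two datetimes sharing one tzinfo object are
    # subtracted as wall-clock values, which would ignore the BST change.
    delta = to_utc(end) - to_utc(start)
    return delta.days, delta.seconds // 3600


if __name__ == "__main__":
    for service, start, end in ENTRIES:
        days, hours = duration(start, end)
        print(f"{service}: {days} days, {hours} hours")
```

The conversion to UTC before subtracting is the design choice that matters here; without it the clock change is silently ignored.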