Tier1 Operations Report 2011-02-16

From GridPP Wiki
Revision as of 11:32, 16 February 2011 by John kelly (Talk | contribs)

RAL Tier1 Operations Report for 16th February 2011

Review of Issues during the week from 9th to 16th February 2011.

  • On Tuesday (8th Feb) GDSS84 (CMSFarmRead) was taken out of production to investigate memory errors. There were two un-migrated files on it. It was returned to production on Thursday 10th Feb.
  • GDSS496 (CMSFarmRead), which came out of production on Thursday 6th January following a problem and data loss, was replaced by GDSS514 on Monday 14th Feb.
  • On Monday 14th Feb disk servers gdss104 and gdss113 (both AtlasFarm) and gdss121 (CMSWanOut) were taken out of production to investigate reported memory errors. All are D0T1 machines, with no files awaiting migration to tape.
  • On Monday 14th Feb some problems were seen with Atlas FTS transfers. They were caused by a bad entry in the gridmap file on the SRMs and were fixed on Tuesday morning.
  • Changes:
    • On Wednesday 9th Feb the Oracle database for LFC, FTS and 3D services was updated to version 10.2.0.5.
    • On Tuesday 15th Feb the Castor GEN instance disk servers were upgraded to 64-bit OS.
    • On Tuesday 15th Feb modified network parameters were rolled out for remaining CMS service classes.
    • On Wednesday 16th Feb: first part of the update of the FTS back-end (which runs the FTS agents) to a Quattorised box.
    • On Friday 11th Feb the number of Atlas user jobs was increased from 500 to 1000.

Current operational status and issues.

  • We are aware of a problem with the Castor Job Manager, which can occasionally hang. The service recovers by itself after around 30 minutes. An automated capture of more diagnostic information is in place and we still await the next occurrence.
  • We have previously reported a problem with disk servers becoming unresponsive. We believe we understand the problem and are testing a fix, which has been rolled out to all CMS disk servers. Both the problem and the fix are still being monitored.

Declared in the GOC DB

  • Thursday 17th 11:00 to 15:00. At risk for Atlas SRMs and CEs while we merge the DATADISK and MCDISK diskpools.
  • Thursday 17th 14:00 to 15:00. At Risk for the Atlas 3D database. This is not a GOC DB entry; it follows a request made at the WLCG daily operations meeting. Atlas are aware of this.

Advance warning:

The following items are being discussed and are still to be formally scheduled:

  • Second part of the update of the FTS back-end (which runs the FTS agents) to a Quattorised box. Probably Wed 23rd Feb.
  • Update WAN tuning parameters on all remaining disk servers (CMS already done) - 22nd Feb.
  • FTS re-configuration to make use of channel Groups (for CMS). Probably Wed 2nd March
  • Castor Name Server upgrade - Possibly Tuesday 1st March.
  • Castor upgrades (possibly directly to 2.1.10) - Possibly late March (27/28/29 during technical stop.)
  • Switch Castor to new Database Infrastructure - Possibly end March.
  • Increase shared memory for OGMA, LUGH & SOMNUS (rolling change). (OGMA done).
  • Address permissions problem regarding Atlas User access to all Atlas data.
  • Firmware updates for central networking components (likely to have some short network breaks - maybe in March)
  • Small isolating transformers have been received that should eliminate the electrical noise seen by the disk arrays hosting the Oracle databases. An opportunity needs to be found to install them. (Possible install for one of the Castor systems on 1st March.)

Entries in GOC DB starting between 9th and 16th February 2011.

There were no unscheduled entries this last week.

Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
lcgfts | SCHEDULED | AT_RISK | 16/02/2011 08:00 | 16/02/2011 12:00 | 4 hours | At Risk during update of host running FTS agents to a Quattorized server.
srm-cms | SCHEDULED | AT_RISK | 15/02/2011 10:00 | 15/02/2011 12:00 | 2 hours | At-Risk to roll out changes to WAN tuning for disk servers in CMS service classes FarmRead Temp and Test.
srm-alice, srm-dteam, srm-hone, srm-ilc, srm-mice, srm-minos, srm-superb, srm-t2k | SCHEDULED | OUTAGE | 15/02/2011 08:00 | 15/02/2011 16:00 | 8 hours | Upgrade to 64-bit OS for disk servers and apply updated GridFTP Castor component.
lcgftm, lcgfts, lfc-atlas.gridpp, lfc.gridpp, lhcb-lfc.gridpp | SCHEDULED | OUTAGE | 09/02/2011 09:00 | 09/02/2011 15:35 | 6 hours and 35 minutes | Outage on LFC, FTS and 3D services while Oracle databases are updated to version 10.2.0.5.
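
The Duration values in the entries above follow directly from the Start and End timestamps. As a minimal sketch for cross-checking them, assuming the dd/mm/yyyy hh:mm format seen in these entries (the function name downtime_duration is hypothetical and not part of any GOC DB tooling):

```python
from datetime import datetime

# Timestamp layout assumed from the entries above, e.g. "09/02/2011 15:35".
GOCDB_FORMAT = "%d/%m/%Y %H:%M"


def downtime_duration(start, end):
    """Return the time between two timestamps as (hours, minutes)."""
    delta = datetime.strptime(end, GOCDB_FORMAT) - datetime.strptime(start, GOCDB_FORMAT)
    total_minutes = int(delta.total_seconds() // 60)
    return divmod(total_minutes, 60)


# Oracle update outage on the LFC, FTS and 3D services.
print(downtime_duration("09/02/2011 09:00", "09/02/2011 15:35"))  # (6, 35)
```

The same call can be applied to each of the other entries to confirm the reported 4, 2 and 8 hour durations.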