Tier1 Operations Report 2011-02-23

From GridPP Wiki

RAL Tier1 Operations Report for 23rd February 2011

Review of Issues during the week from 16th to 23rd February 2011.

  • On Monday (14th Feb) disk servers gdss104 and gdss113 (both AtlasFarm) and gdss121 (CMSWanOut) were taken out of production for investigation of reported memory errors. All are D0T1 machines with no files awaiting migration to tape. All three were returned to production on Wednesday 16th Feb after running memory tests.
  • Thursday (17th Feb) gdss113 (AtlasFarm) and gdss121 (CMSWanOut) were out of production for an hour while memory was replaced.
  • Tuesday (22nd Feb) gdss115 taken out of Castor following problems. It is a CMSWanIn machine with no migration candidates on it.
  • Changes:
    • Thursday (17th Feb) Castor Atlas 3D database audit has been turned on successfully.
    • Thursday (17th Feb) Atlas disk pools merged. The diskpool atlasSimStrip has been merged into atlasStripInput. The SRM space tokens ATLASDATADISK and ATLASMCDISK now point to the AtlasStripInput service class.
    • Tuesday (22nd Feb) WAN tuning applied to all Castor disk servers for Atlas, LHCb & GEN service classes.
    • Wednesday (23rd Feb) All FTS agents moved to the Quattorised server, lcgfts01.
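The report does not state which settings the WAN tuning of the Castor disk servers comprised. As a rough illustration only, tuning of this kind on Linux storage servers typically raises the kernel TCP buffer limits so that transfers over high-latency WAN paths can keep a large window in flight; the parameter names below are standard Linux sysctls, but the values are hypothetical examples, not the values actually deployed:

```shell
# Illustrative sketch only - values are examples, not the settings applied at RAL.
# Typical Linux TCP tuning for high-latency WAN transfers, e.g. in /etc/sysctl.conf:
net.core.rmem_max = 16777216              # max socket receive buffer (16 MB)
net.core.wmem_max = 16777216              # max socket send buffer (16 MB)
net.ipv4.tcp_rmem = 4096 87380 16777216   # min/default/max TCP receive buffer
net.ipv4.tcp_wmem = 4096 65536 16777216   # min/default/max TCP send buffer
# Apply without a reboot:
#   sysctl -p
```

Larger maximum buffers let TCP auto-tuning grow the congestion window on long fat pipes; defaults are left low so LAN connections do not over-allocate memory.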

Current operational status and issues.

  • We are aware of a problem with the Castor Job Manager, which occasionally hangs. The service recovers by itself after around 30 minutes. An automated capture of additional diagnostic information is in place and we await the next occurrence.
  • We have previously reported a problem with disk servers becoming unresponsive. We believe we understand the cause and are testing a fix, which has been rolled out to all CMS disk servers. Both the problem and the fix are still being monitored.

Declared in the GOC DB

  • Wednesday 23rd Feb: At Risk on FTS during update of host running FTS agents to a Quattorized server.

Advance warning:

The following items are being discussed and are still to be formally scheduled:

  • FTS re-configuration to make use of channel groups (for CMS). Planned for Wed 2nd March.
  • Castor Name Server upgrade - Earliest possible date Tuesday 8th March.
  • Castor upgrades (possibly directly to 2.1.10) - Possibly late March (27/28/29 March, during the technical stop).
  • Switch Castor to new Database Infrastructure - Possibly end March.
  • Increase shared memory for OGMA, LUGH & SOMNUS (rolling change). (OGMA done).
  • Address permissions problem regarding Atlas User access to all Atlas data.
  • Firmware updates for central networking components (likely to cause some short network breaks; possibly in March).
  • Small isolating transformers have been received that should eliminate the electrical noise seen by the disk arrays hosting the Oracle databases. An opportunity needs to be found to install them. (Possible installation for one of the Castor systems on 1st March.)
  • Atlas changes in pipeline:
    • Renaming Atlas files (that were in MCDisk) - Starting 1st March.
    • Adding xrootd libraries to worker nodes

Entries in GOC DB starting between 16th and 23rd February 2011.

There were no unscheduled entries this last week.

Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
lcgfts | SCHEDULED | AT_RISK | 23/02/2011 08:00 | 23/02/2011 12:00 | 4 hours | At Risk during update of host running FTS agents to a Quattorised server.
All Castor (SRM endpoints) except CMS | SCHEDULED | AT_RISK | 22/02/2011 09:00 | 22/02/2011 12:00 | 3 hours | At risk while rolling out WAN tuning.
lcgce06, lcgce08, lcgce09, srm-atlas | SCHEDULED | AT_RISK | 17/02/2011 11:00 | 17/02/2011 15:00 | 4 hours | At risk on CASTOR ATLAS to merge DATADISK and MCDISK.