Tier1 Operations Report 2010-06-09

From GridPP Wiki

RAL Tier1 Operations Report for 9th June 2010

Review of Issues during week from 2nd to 9th June 2010.

  • On Wednesday 2nd June the application of the Oracle April PSU patch to the 'SOMNUS' database, which sits behind the LFC and FTS services, ran into a problem. The update was backed out. There was around 2.5 hours of downtime on these services.
  • One consequence of these problems was the need to re-balance (re-mirror) the disk for the LUGH database (LHCb 3D & LFC). This was completed successfully during an "At Risk" period on Tuesday morning, 8th June.
  • Gdss390 (Atlas Data Disk) was unavailable for a few hours on Thursday 3rd June. The memory was replaced following an error.
  • One of the four machines behind srm-atlas failed on Wednesday afternoon 2nd June following an operational error. It was removed from the DNS alias. We ran for a few days with three nodes behind "srm-atlas" until the DNS entry was put back to point at all four Atlas SRM nodes.
  • The Atlas software server was replaced with a more powerful machine. Farm worker nodes were moved across progressively onto the new server during last week, a move that is now complete.

Current operational status and issues.

  • gdss67 (CMSFarmRead) had problems (FSPROBE errors) and was removed from service on the morning of Thursday 20th May. The memory in this system was replaced but it is still showing problems and is being worked on.
  • On Monday 24th May gdss207 (AliceTape) was removed from service due to possible file system corruption. The system is being rebuilt and is not yet back in service.
  • On Thursday 3rd June gdss474 (LHCbMDst) failed with a read-only file system. Work is ongoing to resolve hardware issues on this system.
  • Dust in the Computer Room - particularly the HPD room: This is coming from lagging on cold water pipes which is abrading in the high airflow in some locations close to the CRAC (Computer Room Air Conditioning) units. This is particularly bad in the HPD room. The situation is being monitored carefully and possible solutions investigated.

Declared in the GOC DB

None

Advance warning:

The following items remain to be scheduled:

  • FTS upgrade to version 2.2.4: not yet scheduled, but imminent.
  • Preventative maintenance work on transformers in R89 will require site 'At Risk's. The first of these will take place on 28-30th June. This was supposed to fall during a technical stop, but the technical stops have been moved by CERN.
  • Closure of SL4 batch workers at RAL-LCG2 announced for the start of August.
  • Doubling of network link to network stack for tape robot and Castor head nodes.

Entries in GOC DB starting between 2nd and 9th June 2010.

The GOC DB recorded one unscheduled outage for the past week, along with three unscheduled "At Risk"s. The outage was caused by the failure while applying the Oracle patch to the SOMNUS database, and two of the "At Risk"s had the same cause. The third "At Risk" covered the re-balancing of the disks for the LUGH (LHCb 3D & LFC) database.

Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
LHCb LFC & 3D (lhcb-lfc.gridpp.rl.ac.uk, lugh.gridpp.rl.ac.uk) | UNSCHEDULED | AT_RISK | 08/06/2010 09:00 | 08/06/2010 12:00 | 3 hours | At risk on lugh (LHCb 3D and LHCb LFC) while we re-balance database nodes.
FTS, FTM, lfc-atlas, lfc.gridpp.rl.ac.uk | UNSCHEDULED | OUTAGE | 02/06/2010 15:20 | 02/06/2010 17:31 | 2 hours and 11 minutes | The problems on the databases behind these services (LFC and FTS) have led to a loss of service. Declaring an outage while we attempt to resolve the problem.
FTS, FTM, lfc-atlas, lfc.gridpp.rl.ac.uk | UNSCHEDULED | AT_RISK | 02/06/2010 15:00 | 03/06/2010 09:05 | 18 hours and 5 minutes | Following the application of the Oracle patch (April PSU) this morning there are some problems with one of the RAC nodes. This is being investigated but is expected to take some time.
FTS, FTM, lfc-atlas, lfc.gridpp.rl.ac.uk | UNSCHEDULED | AT_RISK | 02/06/2010 12:06 | 02/06/2010 15:00 | 2 hours and 54 minutes | Following the application of the Oracle patch (April PSU) this morning there are some problems with one of the RAC nodes. Adding an At Risk while this is investigated.
FTS, FTM, lfc-atlas, lfc.gridpp.rl.ac.uk | SCHEDULED | AT_RISK | 02/06/2010 10:00 | 02/06/2010 12:00 | 2 hours | At Risk for application of Oracle PSU patches.
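
The durations and counts quoted for the entries above can be cross-checked with a short script. The Python sketch below is illustrative only: the tuple layout, the ENTRIES list and the duration_minutes helper are assumptions made for this example and are not part of the GOC DB or any of its tools. The start and end times are copied from the entries above, in the same order.

```python
from collections import Counter
from datetime import datetime

# Hypothetical layout: (scheduled?, outage/at risk, start, end),
# with times copied from the GOC DB entries listed above.
ENTRIES = [
    ("UNSCHEDULED", "AT_RISK", "08/06/2010 09:00", "08/06/2010 12:00"),
    ("UNSCHEDULED", "OUTAGE", "02/06/2010 15:20", "02/06/2010 17:31"),
    ("UNSCHEDULED", "AT_RISK", "02/06/2010 15:00", "03/06/2010 09:05"),
    ("UNSCHEDULED", "AT_RISK", "02/06/2010 12:06", "02/06/2010 15:00"),
    ("SCHEDULED", "AT_RISK", "02/06/2010 10:00", "02/06/2010 12:00"),
]

FMT = "%d/%m/%Y %H:%M"  # day/month/year, as used in the report


def duration_minutes(start, end):
    """Return the whole number of minutes between two timestamps."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return int(delta.total_seconds() // 60)


# Duration of each entry, and a tally per (scheduled?, outage/at risk) pair.
durations = [duration_minutes(s, e) for _, _, s, e in ENTRIES]
counts = Counter((sched, kind) for sched, kind, _, _ in ENTRIES)

for (sched, kind, start, end), minutes in zip(ENTRIES, durations):
    hours, mins = divmod(minutes, 60)
    print(f"{sched:<11} {kind:<7} {start} -> {end}: {hours} h {mins} min")
```

Run as written, the script prints one line per entry with its computed duration, which can be compared against the Duration column; the counts tally can be compared against the summary of unscheduled outages and "At Risk"s.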