Tier1 Operations Report 2010-04-07

RAL Tier1 Operations Report for 7th April 2010.

Review of Issues during week 31st March to 7th April 2010.

  • On Sunday 4th April there were two breaks in RAL's network connectivity to the outside world. The first was between 06:00 and 09:00 and the second between 20:00 and 21:00 (approx.). Both were traced to a faulty interface (GBIC) in a router.
  • GDSS274 (part of AtlasScratchDisk) was out of production from Tuesday to Wednesday following issues reported on multiple disks.
  • The rolling upgrade of the batch nodes to SL 5.4 has been completed.
  • There was a load problem on the AtlasFarm service class caused by access to files from both batch work and data migrations out of the Tier1. This morning (Wednesday 7th April) the number of servers in this pool was increased from 6 to 15.

Current operational status and issues.

  • There have been intermittent failures with access to the top-bdii (lcgbdii.gridpp.rl.ac.uk). These have been traced to a load problem. Work is ongoing to improve the performance of the top-bdii.

Declared in the GOC DB:

  • Thursday 8th April. At Risk 09:30 - 11:30 on OGMA (Atlas 3D) for kernel and glibc updates.
  • Thursday 8th April. At Risk 10:00 - 13:00 on the top-bdii (lcgbdii.gridpp.rl.ac.uk) when the DNS will be updated to point to a new quintet of nodes.

Advance warning:

The following items remain to be scheduled:

  • Castor Oracle Database infrastructure. One change, the removal of an unstable node from the Oracle RAC and its replacement by another node, remains to be done.
  • Kernel and glibc updates will need to be done on LFC, FTS & LHCb 3D Oracle database (RAC) nodes.
  • Update microcode on remaining half of tape drives.

Entries in GOC DB starting between 31st March and 7th April 2010.

There was one unscheduled entry in the GOC DB for this last week. There should have been two as the first network outage (06:00 to 09:00 on Sunday 4th April) should also have been added retrospectively.

Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
Whole Site | UNSCHEDULED | OUTAGE | 04/04/2010 20:00 | 04/04/2010 21:00 | 1 hour | Site outage owing to a network failure. This was the second of two such failures during the day. (The first was from 05:00 to 08:00 UTC.) The cause was traced to a failing interface (GBIC) in a router that was replaced following the second failure. (This entry being added retrospectively to the GOC DB).
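
The issue summary gives the first network break as 06:00 to 09:00, while the GOC DB entry quotes it as 05:00 to 08:00 UTC. The two agree if the summary times are UK local time (BST, UTC+1 on 4th April 2010). That is an assumption, as the report does not state the time zone of the summary. The sketch below only illustrates that conversion for the first break; the names `BST`, `to_utc` and `first_outage_local` are hypothetical and are not part of any GOC DB tooling.

```python
from datetime import datetime, timedelta, timezone

# Assumption: times in the issue summary are UK local time (BST = UTC+1).
BST = timezone(timedelta(hours=1), "BST")


def to_utc(local: datetime) -> datetime:
    """Convert a timezone-aware local time to UTC."""
    return local.astimezone(timezone.utc)


# First network break as given in the issue summary (assumed local time).
first_outage_local = (
    datetime(2010, 4, 4, 6, 0, tzinfo=BST),
    datetime(2010, 4, 4, 9, 0, tzinfo=BST),
)

# The same interval expressed in UTC, in the date format used by the table.
first_outage_utc = tuple(to_utc(t) for t in first_outage_local)
for t in first_outage_utc:
    print(t.strftime("%d/%m/%Y %H:%M UTC"))
```

Under that assumption the interval comes out as 05:00 to 08:00 UTC, matching the figure quoted in the GOC DB entry.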