Tier1 Operations Report 2010-04-21

RAL Tier1 Operations Report for 21st April 2010.

Note: This report covers a two-week period as there was no meeting last week owing to the GridPP meeting.

Review of Issues during weeks 7th to 21st April 2010.

  • Although not a problem, we note the high level of outbound traffic over the OPN link to CERN (and other Tier1s) during the afternoon and evening of 7th April, when we were saturating the 10Gbit link. (This is the subject of a BLOG entry - see http://www.gridpp.rl.ac.uk/blog/2010/04/08/white-hot-outbound-opn-network-rates/ )
  • On Thursday 8th April the five nodes that make up the Top-BDII were all replaced with five higher specification nodes. This has resolved the issue of occasional time-outs seen when making requests to the Top-BDII (a simple reachability check is sketched after this list).
  • A planned update to the database nodes behind OGMA (Atlas 3D) encountered some problems on 8th April and overran.
  • Over the weekend 10/11 April there was a recurrence of the problems we have seen before on the Atlas software server.
  • During the morning of Monday 12th April one of the nodes in the Oracle RAC (SOMNUS) behind the LFC and FTS services was rebooted. This caused unexpected problems, including some minor corruption of the database. Services (LFC, FTS) were unavailable for a little under three hours while this was resolved. Before re-enabling the LFC service, checks were made on the integrity of the database. It was found that only one update, the last requested before the crash, had not completed and was lost.
  • During Monday an Atlas software install job ran for a long time (it was possibly stuck), and Atlas CE SAM tests queued behind it and failed.
  • Early afternoon on Sunday 11th April GDSS405 (part of ATLAS MCDISK space token) failed. This was diagnosed as a memory problem. The system was returned to production on Wednesday (14th). This took some time as one of the replacement memory modules used to fix the problem was itself found to be faulty and had to be replaced.
  • On Thursday 15th we experienced very high load on some CMS Castor servers. This was traced to a very large number of requests from the CERN FTS. (This is the subject of a BLOG entry - see http://www.gridpp.rl.ac.uk/blog/2010/04/15/high-load-on-cms-srms/ )
  • At the end of last week, around Friday 16th April, a very large backlog of CMS tape migration requests built up (around 70,000 files). The lack of tape migration was traced to a configuration issue and resolved on Sunday (18th). The backlog of files was migrated over the following couple of days.
  • On Tuesday 20th April the microcode was updated in the remaining half of the tape drives.
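
As an aside to the Top-BDII item above: the following is a minimal sketch, not the SAM test that originally detected the time-outs, of how one might resolve the service alias to its current backend nodes and time a TCP connection to each. The alias lcgbdii.gridpp.rl.ac.uk is taken from the GOC DB entry later in this report; the port 2170 is assumed to be the conventional top-BDII LDAP port.

  #!/usr/bin/env python
  """Illustrative reachability check of a round-robin service alias.

  Resolves the alias to its current set of backend addresses and times a
  plain TCP connect to each one. The host name is taken from the GOC DB
  entry in this report; the port and timeout are assumptions.
  """
  import socket
  import time

  ALIAS = "lcgbdii.gridpp.rl.ac.uk"   # alias from the GOC DB entry below
  PORT = 2170                          # conventional top-BDII LDAP port (assumption)
  TIMEOUT = 10                         # seconds before we call it a time-out

  def backend_addresses(alias, port):
      """Return the unique IPv4 addresses the alias currently resolves to."""
      infos = socket.getaddrinfo(alias, port, socket.AF_INET, socket.SOCK_STREAM)
      return sorted(set(info[4][0] for info in infos))

  def time_connect(address, port, timeout=TIMEOUT):
      """Time a TCP connect to one backend; return seconds, or None on failure."""
      sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
      sock.settimeout(timeout)
      start = time.time()
      try:
          sock.connect((address, port))
          return time.time() - start
      except socket.error:
          return None
      finally:
          sock.close()

  if __name__ == "__main__":
      for addr in backend_addresses(ALIAS, PORT):
          elapsed = time_connect(addr, PORT)
          if elapsed is None:
              print("%s:%d  no response within %ds" % (addr, PORT, TIMEOUT))
          else:
              print("%s:%d  connected in %.3fs" % (addr, PORT, elapsed))

A check of this kind only confirms that each node behind the alias accepts connections promptly; it does not exercise the LDAP queries whose occasional time-outs prompted the hardware change.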

Current operational status and issues.

  • Following the two database related issues (problems delaying updates applied to OGMA on the 8th, and the problems on SOMNUS on the 12th) it was necessary to 'rebalance' the databases on disk. (This operation brings the mirrored pairs of databases back into synchronization.) This has been carried out successfully on OGMA (Atlas 3D) and LUGH (LHCb 3D & LHCb-LFC). However, the re-balancing on SOMNUS (FTS & LFC) has been delayed owing to issues encountered during the operation.
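
For reference, the sketch below shows one way the progress of a rebalance of this kind could be watched, assuming the storage layer is Oracle ASM, the cx_Oracle client module is available, and placeholder connection details for a SYSDBA session on the ASM instance. It is an illustration of the kind of monitoring involved, not the procedure actually used on OGMA, LUGH or SOMNUS.

  #!/usr/bin/env python
  """Sketch: watch an Oracle ASM rebalance via the v$asm_operation view.

  Assumptions: the databases sit on Oracle ASM, cx_Oracle is installed,
  and the DSN/credentials below are placeholders for a SYSDBA connection
  to the ASM instance. Not the actual RAL procedure.
  """
  import time
  import cx_Oracle

  # Placeholder connection string for the ASM instance (assumption).
  DSN = "asm-host.example:1521/+ASM"

  def rebalance_status(connection):
      """Return rows describing any rebalance operation currently running."""
      cursor = connection.cursor()
      cursor.execute(
          "SELECT group_number, operation, state, sofar, est_work, est_minutes "
          "FROM v$asm_operation"
      )
      return cursor.fetchall()

  if __name__ == "__main__":
      conn = cx_Oracle.connect("sys", "password", DSN, mode=cx_Oracle.SYSDBA)
      while True:
          rows = rebalance_status(conn)
          if not rows:
              print("No rebalance in progress.")
              break
          for group, op, state, sofar, est_work, est_min in rows:
              print("diskgroup %s: %s %s, %s/%s allocation units, ~%s min left"
                    % (group, op, state, sofar, est_work, est_min))
          time.sleep(60)   # poll once a minute

The loop exits once v$asm_operation reports nothing in flight, i.e. once the mirrored copies are back in synchronization.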

Declared in the GOC DB:

  • None

Advanced warning:

The following items remain to be scheduled:

  • Castor Oracle Database infrastructure. One change, the removal of an unstable node from the Oracle RAC and its replacement by another node, remains to be done.
  • Kernel and glibc updates will need to be done on LFC, FTS & LHCb 3D Oracle database (RAC) nodes.
  • Addition of a further node to the SAN that supports the Castor databases. This node will process the backup files for copying to tape.
  • Re-balancing of SOMNUS (LFC & FTS) databases.
  • Addition of switch into network stack that provides connections to core services in the UPS room.

It is planned to have a Castor STOP (and a batch pause) either on Tuesday or Wednesday next week (27/28 April, during the LHC technical stop) while the last three of the above items are carried out.

Entries in GOC DB starting between 7th and 21st April 2010.

There were three unscheduled entries in the GOC DB for this last fortnight.

  • Kernel updates to nodes behind the 3D services (OGMA) overran.
  • There was a problem on the databases (SOMNUS) behind LFC & FTS services on 12th.
  • Intervention at short notice on the databases on 20th April.
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
LFCs, FTS and 3D systems (lcgftm, lcgfts, lfc-atlas, lfc, lhcb-lfc, lugh, ogma) | UNSCHEDULED | AT_RISK | 20/04/2010 12:00 | 20/04/2010 17:00 | 5 hours | At Risk on some services (LFCs, FTS and 3D) while urgent maintenance is carried out on the back-end databases.
lcgfts, lfc-atlas, lfc | UNSCHEDULED | OUTAGE | 12/04/2010 11:59 | 12/04/2010 14:39 | 2 hours and 40 minutes | Outage while we are investigating backend database problems.
ogma | UNSCHEDULED | AT_RISK | 08/04/2010 14:00 | 08/04/2010 17:00 | 3 hours | Extending the At Risk for kernel updates to Oracle RAC nodes.
lcgbdii | SCHEDULED | AT_RISK | 08/04/2010 10:00 | 08/04/2010 13:00 | 3 hours | At Risk as TOP-BDII is moved to more powerful hardware. This will be implemented by changing the DNS entry for lcgbdii.gridpp.rl.ac.uk to point to a more powerful set of systems.
ogma | SCHEDULED | AT_RISK | 08/04/2010 09:30 | 08/04/2010 11:30 | 2 hours | At Risk for kernel updates to Oracle RAC nodes.