RAL Tier1 Operations Report for 20th April 2011

Review of Issues during the week from 13th to 20th April 2011.

  • In the early hours of Friday morning, 15th April, there were problems with the Atlas SRM, with FTS transfers failing. A GGUS alarm ticket was received from Atlas that morning. The problem was traced to slow database performance and a fix has been put in place in the database. The problem occurred some hours after the Atlas SRM upgrade on the Thursday morning; however, it is not clear that the upgrade was the trigger, as load issues can be seen on the Atlas SRM database starting in the days before the upgrade. Atlas FTS transfers were re-established during Friday, but only a very limited amount of Atlas batch work was run over the weekend. The Atlas batch queues were opened up again on Monday (18th). As a result of the problems following the Atlas SRM upgrade, the GEN SRM upgrade planned for the Friday was cancelled and will be rescheduled.
  • On the morning of Tuesday 19th April we came in to find a very large queue of Alice jobs (around 20,000). This was investigated and found to be due to time-outs in Maui's responses to the CEs' requests, which in turn led to Alice submitting more jobs (the first sketch after this list illustrates this amplification effect).
  • The network intervention yesterday morning (Tuesday 19th) encountered some problems. A switch with some non-working ports was removed from one of the switch stacks and the stack was reset. Following the intervention, a number of ports on a different switch in the stack were found not to be working. This caused some disruption to services, particularly the FTS. In addition, there was a software problem with some Castor daemons not initially reconnecting to the database after the network interruption (the second sketch after this list shows the generic reconnect pattern involved).
  • Last night and this morning we discovered a further network problem: one of the site BDIIs (lcgsbdii0652) had connectivity problems. This was fixed by 10:00.
  • As reported last week, some short breaks in connectivity within the Tier1 network have been seen over the last couple of weeks. Two network changes have been made in response. One was to correct the load balancing on the pair of links to one of the switch stacks, done on Wednesday (13th). The other was a reset of the network switch stack in the UPS room, which took place on Tuesday morning (19th) and is referred to in the previous bullet point. These network breaks are being monitored closely.
  • Summary of changes made during the last fortnight:
    • On Thursday 14th April the Atlas SRM was upgraded to version 2.10-2.
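
The first sketch below is a minimal, hypothetical Python illustration of the Alice queue growth described above. The names used (QUEUE, submit, query_status) are invented stand-ins, not Maui, CE or Alice interfaces. It shows how a client that resubmits whenever a status call times out can queue far more work than intended, even though every submission actually succeeded.

    # Hypothetical sketch, not the real submission chain: a scheduler that
    # always accepts jobs but whose status calls frequently time out.
    import random

    QUEUE = []                        # stand-in for the batch queue

    def submit(job):
        QUEUE.append(job)             # every submission succeeds...

    def query_status(job, timeout_rate=0.5):
        if random.random() < timeout_rate:
            raise TimeoutError        # ...but status checks often time out

    jobs_wanted = 100
    for i in range(jobs_wanted):
        for attempt in range(5):      # naive policy: resubmit on timeout
            submit((i, attempt))
            try:
                query_status((i, attempt))
                break                 # status confirmed; stop retrying
            except TimeoutError:
                continue              # assume the job was lost; try again

    print(f"wanted {jobs_wanted} jobs, queued {len(QUEUE)}")
    # With a 50% timeout rate roughly twice the intended number of jobs ends
    # up queued; higher timeout rates push this towards the retry limit.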

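The second sketch shows, generically, the reconnect-with-backoff pattern a daemon needs in order to survive a network break between itself and its database. It is an illustration under stated assumptions only: db_connect() and DBError are invented stand-ins, not Castor code or the Oracle client API.

    # Generic sketch: retry a database connection with capped exponential
    # backoff rather than failing once and staying disconnected.
    import time

    class DBError(Exception):
        pass

    _attempts = 0

    def db_connect():
        """Invented stand-in for opening a database session."""
        global _attempts
        _attempts += 1
        if _attempts < 4:                    # simulate the outage window
            raise DBError("database unreachable")
        return "session"                     # network restored; success

    def connect_with_retry(max_wait=60):
        wait = 1
        while True:
            try:
                return db_connect()
            except DBError as err:
                # A daemon that gives up after the first error stays down;
                # backing off and retrying recovers once the network returns.
                print(f"connect failed ({err}); retrying in {wait}s")
                time.sleep(wait)
                wait = min(wait * 2, max_wait)

    session = connect_with_retry()
    print("connected:", session)
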
Current operational status and issues.

  • Following a report from Atlas of failures accessing the LFC at RAL, we have been investigating network issues at RAL that are believed to be the underlying cause. These problems are intermittent and their investigation is ongoing.
  • Atlas have reported slow data transfers into the RAL Tier1 from other Tier1s and CERN (i.e. asymmetrical performance). The investigation into this is ongoing.
  • The long standing problem with the Castor Job Manager occasionally hanging up has not been seen since the Castor 2.1.10 update.

Declared in the GOC DB

  • None

Advance warning:

The following items are being discussed and are still to be formally scheduled:

  • SRM 2.10-2 upgrade for the Castor GEN instance. (Re-scheduled from 15th April).
  • Updates to Site Routers (the Site Access Router and the UKLight router) are required.
  • Upgrade Castor clients on the Worker Nodes to version 2.1.10 & Atlas request to add xrootd libraries to worker nodes.
  • Address permissions problem regarding Atlas User access to all Atlas data.
  • Minor Castor update to enable access to T10KC tapes.
  • Networking upgrade to provide sufficient bandwidth for T10KC tapes.
  • Microcode updates for the tape libraries are due.
  • Switch Castor and LFC/FTS/3D to new Database Infrastructure.

Entries in GOC DB starting between 13th and 20th April 2011.

There were four unscheduled entries during this period:

  • The problem on the Atlas SRM on Friday 15th led to an outage that had to be extended (via another outage being declared). This was followed by an unscheduled At Risk over the weekend.
  • An intervention was made on a network switch at short notice, leading to an unscheduled At Risk on the whole site.
Service    | Scheduled?  | Outage/At Risk | Start            | End              | Duration                        | Reason
Whole Site | UNSCHEDULED | WARNING        | 19/04/2011 09:00 | 19/04/2011 11:00 | 2 hours                         | A network switch stack within the RAL Tier1 network needs to be reset. This operation should only cause a break of a few minutes to services. Declaring an At Risk in case of problems with the reset.
srm-atlas  | UNSCHEDULED | WARNING        | 15/04/2011 12:35 | 18/04/2011 12:00 | 2 days, 23 hours and 25 minutes | At Risk on ATLAS SRM following problems on Friday.
srm-atlas  | UNSCHEDULED | OUTAGE         | 15/04/2011 11:00 | 15/04/2011 12:35 | 1 hour and 35 minutes           | Problems on atlas-srm still ongoing.
srm-atlas  | UNSCHEDULED | OUTAGE         | 15/04/2011 02:00 | 15/04/2011 11:00 | 9 hours                         | SRM-ATLAS not functioning. Problem under investigation.
srm-atlas  | SCHEDULED   | OUTAGE         | 14/04/2011 11:00 | 14/04/2011 12:40 | 1 hour and 40 minutes           | Upgrade of Atlas SRM to version 2.10-2.
srm-cms    | SCHEDULED   | OUTAGE         | 13/04/2011 11:00 | 13/04/2011 13:00 | 2 hours                         | Upgrade of CMS SRM to version 2.10-2.