Tier1 Operations Report 2011-09-21

RAL Tier1 Operations Report for 21st September 2011

Review of Issues during the fortnight 7th to 21st September 2011.

  • Wednesday 7th Sep: There was a large backlog of Atlas migrations (peaking at around 13,000). Following investigations the backlog started to reduce, but it came down only slowly as there was a very large number of files to migrate.
  • Friday 9th Sep: Some sites were unable to submit FTS transfers. The problem was traced to an incorrect gridmap file on one of the FTS front-end systems and was fixed manually on Monday (12th). It is not clear why this one front-end was mis-configured and the others were not.
  • Sunday 11th Sep: Very high load on both AtlasScratchDisk and AtlasStripInput started during the evening. This correlated with the start of a large number of batch jobs (more than 1,000 in an hour), along with FTS transfers and some disk servers being drained.
  • Monday 12th Sep: Problem with Castor LHCb - all jobs were failing. Fixed by restarting LSF.
  • Tuesday 13th Sep: Problem on Castor for Atlas traced to LSF: the /lsf disk partition on the LSF server machine was full. Once the offending logs were cleared the node still required a reboot to resolve the issue.
  • Tuesday 13th Sep: A large number of Atlas jobs were queued in the grid4000M queue. This was resolved partly by increasing the running job limit for that queue and partly by restoring a priority boost for grid4000M jobs over grid3000M ones.
  • Wednesday 14th Sep: Backlog on the GEN migration queue, which peaked at around 12,000 waiting transfers.
  • Thursday 15th Sep: Problem on worker nodes caused by /tmp filling up. A change had taken place whereby Alice started using BitTorrent to distribute software to the worker nodes, and this was initially mis-configured. It had a significant effect on Atlas batch work and on CMS availability.
  • Friday 16th Sep: In the late afternoon, under heavy load, SRM problems were caused by a database deadlock. In the early evening there was a further problem: the Atlas Castor LSF scheduler was very busy but making little progress. This was resolved during the evening by restarting LSF and killing the pending Castor jobs.
  • Sunday/Monday 18/19th Sep: Large backlog of tape migrations for LHCb. This was reduced during Monday.
  • Tuesday 20th Sep: There were problems with the Atlas SRMs and FTS channels overnight. Remedial action included reducing the number of Atlas batch jobs and cutting their FTS capacity to 25% of its nominal value. These settings were wound back up during the following day.

Resolved Disk Server Issues

  • Wednesday 7th Sep: There was a problem on gdss233 (AtlasGroupDisk), with three drives reporting problems. This server was being drained, a process that was almost complete. It was put back into production on Friday (9th Sep) to finish being drained.
  • Thursday 8th Sep: gdss474 (LHCbDst), which is in the process of being drained, was rebooted to clear up disk-to-disk copy problems. The system fsck'd its disks and was unavailable until later that morning.

Current operational status and issues.

  • Following a routine maintenance check, an intermittent short was found on the 11kV feed into the computer building. The fault has now been located and, following some internal switching, the discharge has stopped. A recent transparent intervention may have fixed the problem, but further tests are needed to confirm this.
  • The problem of packet loss on the main network link from the RAL site remains. RAL networking team continue to actively investigate this problem. This is currently at a low level and is not causing operational problems, although a concern remains that it may become worse if the network load rises.
  • Atlas reported slow data transfers into the RAL Tier1 from other Tier1s and from CERN (i.e. asymmetrical performance). CMS appears to experience this as well, but between RAL and foreign Tier2s. The pattern of asymmetrical flows appears complex and is being actively investigated; we are in contact with people in the US investigating a similar issue. The types of things being studied include disk server TCP/IP configurations (see the illustrative sketch after this list).
  • The first step of the database migration that was planned for today (21st Sep) has had to be postponed following a problem on the disk array to be used in the intermediate configuration.
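
As an aside on the transfer-asymmetry item above: the sketch below is a hypothetical helper, not part of any Tier1 tooling, showing the sort of standard Linux TCP settings (socket buffer limits, window scaling, congestion control) that are typically compared across disk servers in this kind of investigation. The exact parameter list is an assumption.

    #!/usr/bin/env python
    # Sketch: print a few Linux TCP tuning parameters commonly examined when
    # investigating asymmetric WAN transfer rates. Illustrative only; the
    # parameter list is an assumption, not the set actually used at RAL.
    import os

    PARAMS = [
        "net/core/rmem_max",               # max receive socket buffer (bytes)
        "net/core/wmem_max",               # max send socket buffer (bytes)
        "net/ipv4/tcp_rmem",               # TCP receive buffer min/default/max
        "net/ipv4/tcp_wmem",               # TCP send buffer min/default/max
        "net/ipv4/tcp_window_scaling",     # RFC 1323 window scaling on/off
        "net/ipv4/tcp_congestion_control", # congestion control algorithm
    ]

    def read_param(name):
        """Return the current value of a /proc/sys entry, or None if absent."""
        try:
            with open(os.path.join("/proc/sys", name)) as f:
                return f.read().strip()
        except IOError:
            return None

    if __name__ == "__main__":
        for name in PARAMS:
            value = read_param(name)
            print("%-35s %s" % (name, value if value is not None else "<not available>"))

Comparing such a dump from the servers at each end of a slow transfer path is one simple way to spot mismatched buffer or window settings.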

Ongoing Disk Server Issues

  • Thursday 13th Sep: FSPROBE reported a problem on gdss396 (CMSWanIn). This resulted in the loss of a single CMS file (since copied in from elsewhere). The server is still undergoing tests.

Notable Changes made this last fortnight

  • Work is ongoing to update the disk controller firmware on the Viglen2007a batch of disk servers. Those in D0T1 service classes were done on 12th September. Work has been progressing this week (starting 19th Sep) on the D1T0 servers.
  • There were short interruptions to the site network connection to the outside world on the mornings of Tuesday 13th and Tuesday 20th September (a few minutes in the first case, about one minute in the second). These are part of the ongoing investigations into the packet loss problem. We planned to drain the FTS and pause batch work around these outages. During the first (13th) only a portion of the batch work was successfully paused; the FTS drain was not successful for the second (20th), although the outage was so short that the effect on file transfers was negligible.
  • On Wednesday/Thursday the distribution of Alice software to worker nodes was changed to use BitTorrent.

Forthcoming Work & Interventions

  • Further updates to the disk controller firmware for the Viglen2007a batch of disk servers.

Declared in the GOC DB

  • None

Advanced warning for other interventions

The following items are being discussed and are still to be formally scheduled and announced:

  • Intervention to fix problem on 11kV power feed to building and connect up some parts of the cooling system to the UPS. This is being planned but may require a complete outage (including systems on UPS).
  • Switch Castor and LFC/FTS/3D to new Database Infrastructure.
  • Networking change required to extend range of addresses that route over the OPN.
  • Address permissions problem regarding Atlas User access to all Atlas data.
  • Microcode updates for the tape libraries are due.
  • Further updates to CEs (CE06 de-commissioning; gLite update on CE09 still outstanding).
  • Replace hardware running Castor Head Nodes (aimed for end of year).

Entries in GOC DB starting between 7th and 21st September 2011.

There were no unscheduled entries during this fortnight.

Service Scheduled? Outage/At Risk Start End Duration Reason
Whole site SCHEDULED WARNING 20/09/2011 07:55 20/09/2011 08:30 35 minutes Short break in site network connectivity (few minutes) at 07:00 UTC. At Risk in case intervention problematic.
Whole site SCHEDULED WARNING 13/09/2011 07:55 13/09/2011 08:30 35 minutes Short break in site network connectivity (few minutes) at 07:00 UTC. At Risk in case intervention problematic.
lcgftm, lcgfts, lfc-atlas.gridpp.rl.ac.uk, lfc.gridpp.rl.ac.uk, lhcb-lfc.gridpp.rl.ac.uk SCHEDULED WARNING 07/09/2011 09:00 07/09/2011 15:04 6 hours and 4 minutes At Risk during rolling upgrade to apply Oracle Critical Patch Update.