RAL Tier1 PBS Efficiencies


This page is obsolete

Introduction

This page describes the efficiency data calculated for jobs executed on the RAL Tier1 farm. Efficiency is defined as the ratio of CPU time to wall time. A job with very high efficiency (close to one) kept the CPU in use for most of the time it took to complete; conversely, a job with very low efficiency (close to zero) used very little CPU during its lifetime. Scenarios leading to high and low efficiencies are discussed below.

Efficiency statistics are calculated monthly, and are available (along with those for CPU, disk and tape) at the Tier 1/A Statistics page.

Calculation

We calculate and report overall efficiencies for each month, and these are defined as ratios of the sum of all CPU times to the sum of all wall times:

   overall efficiency = SUM (CPU times) / SUM (wall times).
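As a minimal sketch of the formula above, the following Python computes the overall efficiency from a list of (CPU time, wall time) pairs. The function names and the sample figures are illustrative assumptions, not taken from the Tier1 accounting scripts. The second function is included only to show that the ratio of sums differs from a simple average of per-job efficiencies, because the ratio of sums weights each job by its wall time.

```python
from typing import Iterable, Tuple


def overall_efficiency(jobs: Iterable[Tuple[float, float]]) -> float:
    """Return SUM(CPU times) / SUM(wall times) for (cpu_s, wall_s) pairs."""
    total_cpu = 0.0
    total_wall = 0.0
    for cpu_s, wall_s in jobs:
        total_cpu += cpu_s
        total_wall += wall_s
    if total_wall == 0:
        raise ValueError("no wall time recorded")
    return total_cpu / total_wall


def mean_job_efficiency(jobs: Iterable[Tuple[float, float]]) -> float:
    """Return the unweighted mean of per-job efficiencies (for comparison only)."""
    ratios = [cpu_s / wall_s for cpu_s, wall_s in jobs if wall_s > 0]
    if not ratios:
        raise ValueError("no jobs with non-zero wall time")
    return sum(ratios) / len(ratios)


# Hypothetical sample: one long efficient job and one short inefficient job.
sample_jobs = [(3600.0, 4000.0), (10.0, 100.0)]
overall = overall_efficiency(sample_jobs)       # 3610 / 4100
unweighted = mean_job_efficiency(sample_jobs)   # (0.9 + 0.1) / 2
print(f"overall = {overall:.3f}, unweighted mean = {unweighted:.3f}")
```

In this sample the long job dominates the overall figure, while the unweighted mean gives the short job equal influence.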

Efficiency data have been collected dating back to the beginning of 2005.

Graphs

The latest plots for the current year (up to the end of the previous month) are available here.

Global Efficiencies

We calculate global efficiencies that include jobs from all groups:

   File:RAL Tier1-PBS-Efficiencies-year-global-summary.png

Group Efficiencies

Data for jobs submitted by group are also calculated:

   File:RAL Tier1-PBS-Efficiencies-year-expt-summary.png
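A per-group figure could be obtained along the following lines. This is a sketch that assumes each accounting record carries a group name, a CPU time and a wall time, and that the same ratio-of-sums definition is applied within each group; the record layout and the function name are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def group_efficiencies(records: Iterable[Tuple[str, float, float]]) -> Dict[str, float]:
    """Return {group: SUM(cpu) / SUM(wall)} for (group, cpu_s, wall_s) records."""
    totals = defaultdict(lambda: [0.0, 0.0])  # group -> [cpu sum, wall sum]
    for group, cpu_s, wall_s in records:
        totals[group][0] += cpu_s
        totals[group][1] += wall_s
    # Groups with no recorded wall time are skipped to avoid dividing by zero.
    return {g: cpu / wall for g, (cpu, wall) in totals.items() if wall > 0}


# Hypothetical records: (group, CPU seconds, wall seconds).
records = [
    ("atlas", 900.0, 1000.0),
    ("atlas", 100.0, 1000.0),
    ("h1", 300.0, 400.0),
]
by_group = group_efficiencies(records)
print(by_group)
```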

CPU Time vs Efficiency

Plots of CPU time vs efficiency for each group and each month are generated.

For example, this plot shows the ATLAS data for April 2005:

   File:RAL Tier1-PBS-Efficiencies-2005-04-ATLAS-scatter.png

Each point on the graph represents one or more jobs, and CPU time is plotted against efficiency (CPU time / wall time).

A summary of all the monthly efficiencies, together with links to pages containing scatter plots of efficiency vs CPU time (by experiment and by month), is available.

Analysis

The efficiency of a particular job is dependent on factors such as the nature of the code (CPU intensive or I/O intensive), access to external resources (for example a storage element), and the hardware that the job is running on.

Low Efficiencies

Startup Overheads

Very short jobs will be inefficient when startup overheads such as job accounting are comparable to the length of the job.

File Transfer

Long jobs may be inefficient for a number of reasons. For example, if data has to be transferred to the worker node, this may take a significant amount of wall time, but little CPU time (depending on the size and location of the files being transferred, and the available network bandwidth). If the time taken to transfer the file is a significant fraction of the total elapsed wall time, the efficiency of the job will be impacted.
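To give a feel for the size of this effect, the sketch below estimates the efficiency of a job whose wall time is its CPU time plus a file transfer. It assumes, as a simplification, that the transfer uses no CPU and does not overlap with computation; the function name and the figures are illustrative only.

```python
def efficiency_with_transfer(cpu_s: float, transfer_bytes: float,
                             bandwidth_bytes_per_s: float) -> float:
    """Estimate CPU time / wall time when a transfer precedes the computation."""
    if bandwidth_bytes_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    transfer_s = transfer_bytes / bandwidth_bytes_per_s
    wall_s = cpu_s + transfer_s  # assumption: no overlap, no CPU used in transfer
    return cpu_s / wall_s


# Hypothetical job: one hour of CPU, 10 GB of input fetched at 10 MB/s.
estimate = efficiency_with_transfer(3600.0, 10e9, 10e6)
print(f"estimated efficiency = {estimate:.3f}")
```

With these assumed figures the transfer takes 1000 seconds, so the job spends 3600 of its 4600 wall seconds on the CPU.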

File Access

If a job involves a significant amount of I/O (reading large amounts of data from a transferred file, or writing large amounts of data to temporary storage), and this is the limiting factor in the execution of the job, then the CPU will at times be idle. For such I/O-bound jobs, the more time the CPU is idle the less efficient the job will be.

If access to a disk server is required (via an NFS mount), the I/O will generally be slower than for local disks, and hangs may lead to periods of CPU inactivity or eventually failure/termination of the job.

Large Memory

Jobs that need more memory than the physical memory available on the worker node may be inefficient, since the node then has to swap to disk. Jobs that leak memory may become increasingly inefficient as their execution continues.

Operational Reasons

Batch jobs may be suspended while routine maintenance of the farm is being carried out (for example when rebuilding disk servers). This will increase the wall time that a job uses, and decrease its efficiency.

High Efficiencies

Jobs that do not require access to large amounts of data and are not I/O intensive are likely to be efficient. Note that efficiency here is simply defined as the ratio of CPU time to wall time, so poorly written code that leads to long CPU times may still result in high efficiency according to this criterion.

Trends

Some information can be extracted from the CPU time vs efficiency graphs. As in the ATLAS plot shown above, several straight-line structures are visible in a plot of the H1 efficiencies for July 2005:

   File:RAL Tier1-PBS-Efficiencies-2005-07-H1-scatter-annot.png

The highlighted vertical line represents constant CPU time, and the other highlighted line (which if extrapolated would pass through the origin) represents constant wall time: because efficiency is CPU time divided by wall time, jobs sharing a wall time lie on a straight line through the origin. Many lines of this latter type appear in this example, and each represents a different wall time (and probably a different class of job).

Both constant CPU time and constant wall time may correspond to jobs that are not terminating correctly.
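The structures described above could also be looked for directly in the accounting data rather than by eye. The sketch below bins CPU times or wall times and reports bins shared by several jobs; the bin width, the threshold and the sample jobs are all assumptions chosen for illustration. For a set of jobs with a common wall time W, the points satisfy efficiency = CPU time / W, which is the straight line through the origin seen in the plot.

```python
from collections import Counter
from typing import Dict, Iterable


def common_values(values: Iterable[float], bin_seconds: int = 60,
                  min_jobs: int = 3) -> Dict[int, int]:
    """Return {bin_start_seconds: job_count} for bins holding >= min_jobs values."""
    counts = Counter(int(v // bin_seconds) * bin_seconds for v in values)
    return {start: n for start, n in counts.items() if n >= min_jobs}


# Hypothetical (CPU seconds, wall seconds) pairs; three jobs end near 24 hours.
jobs = [
    (1000.0, 86400.0),
    (20000.0, 86410.0),
    (60000.0, 86430.0),
    (3000.0, 3500.0),
    (7200.0, 9000.0),
]
wall_clusters = common_values(wall for _, wall in jobs)
cpu_clusters = common_values(cpu for cpu, _ in jobs)
print("constant wall time candidates:", wall_clusters)
print("constant CPU time candidates:", cpu_clusters)
```

A cluster found this way is only a candidate for investigation; fixed-width bins can split a cluster that straddles a bin boundary, so the bin width would need tuning against real data.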

Summary

Job efficiencies are one way of examining the performance of the farm. A variety of factors can contribute to low efficiencies: some problems may be transient (such as obtaining data from a storage element), while others may be intrinsic to the farm (such as disk speeds).

The data that we plot may help to identify jobs that are failing, in which case actions can be taken to investigate the underlying causes.

Related Documents