https://www.gridpp.ac.uk/w/index.php?title=RAL_Tier1_PBS_Efficiencies&feed=atom&action=historyRAL Tier1 PBS Efficiencies - Revision history2024-03-28T13:56:25ZRevision history for this page on the wikiMediaWiki 1.22.0https://www.gridpp.ac.uk/w/index.php?title=RAL_Tier1_PBS_Efficiencies&diff=3801&oldid=prevAndrew Lahiff b13e4f09e2 at 12:24, 18 March 20142014-03-18T12:24:08Z<p></p>
<table class='diff diff-contentalign-left'>
<col class='diff-marker' />
<col class='diff-content' />
<col class='diff-marker' />
<col class='diff-content' />
<tr style='vertical-align: top;'>
<td colspan='2' style="background-color: white; color:black; text-align: center;">← Older revision</td>
<td colspan='2' style="background-color: white; color:black; text-align: center;">Revision as of 12:24, 18 March 2014</td>
</tr><tr><td colspan="2" class="diff-lineno">Line 1:</td>
<td colspan="2" class="diff-lineno">Line 1:</td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="color:black; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div><ins style="font-weight: bold; text-decoration: none;">'''This page is obsolete'''</ins></div></td></tr>
<tr><td colspan="2"> </td><td class='diff-marker'>+</td><td style="color:black; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div><ins style="font-weight: bold; text-decoration: none;"></ins></div></td></tr>
<tr><td class='diff-marker'> </td><td style="background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;"><div>== Introduction ==</div></td><td class='diff-marker'> </td><td style="background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;"><div>== Introduction ==</div></td></tr>
<tr><td class='diff-marker'> </td><td style="background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;"></td><td class='diff-marker'> </td><td style="background-color: #f9f9f9; color: #333333; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #e6e6e6; vertical-align: top; white-space: pre-wrap;"></td></tr>
</table>Andrew Lahiff b13e4f09e2https://www.gridpp.ac.uk/w/index.php?title=RAL_Tier1_PBS_Efficiencies&diff=1187&oldid=prevMatt hodges at 07:51, 22 August 20082008-08-22T07:51:07Z<p></p>
<p><b>New page</b></p><div>== Introduction ==<br />
<br />
Efficiency data are calculated for jobs executed on the [[RAL Tier1]]<br />
farm. Efficiency is defined as the ratio of CPU time to wall time. A<br />
job with very high efficiency (close to one) used the CPU for most of<br />
the time it took to complete; conversely, a job with very low<br />
efficiency (close to zero) used very little CPU during its lifetime.<br />
Scenarios leading to high and low efficiencies are discussed below.<br />
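<br />
As an illustration of the definition above, the following Python sketch<br />
computes the efficiency of a single job from a PBS accounting record. It<br />
assumes Torque/OpenPBS-style end-of-job ("E") records carrying<br />
resources_used.cput and resources_used.walltime as HH:MM:SS values. This<br />
page does not say how the RAL data were actually extracted, and the helper<br />
names and the example record are hypothetical.<br />
<br />
```python
# Minimal sketch: per-job efficiency from a PBS/Torque-style accounting
# "E" (job end) record. The record layout and the resources_used.* keys
# are assumptions about the log format, not taken from this page.

def hms_to_seconds(hms: str) -> int:
    """Convert an 'HH:MM:SS' duration (hours may exceed 24) to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def parse_end_record(line: str) -> dict:
    """Split 'date;type;jobid;key=value key=value ...' into a dict."""
    timestamp, record_type, job_id, message = line.rstrip("\n").split(";", 3)
    fields = dict(item.split("=", 1) for item in message.split() if "=" in item)
    return {"timestamp": timestamp, "type": record_type, "job_id": job_id, **fields}


def job_efficiency(record: dict) -> float:
    """Efficiency = CPU time / wall time for a single finished job."""
    cpu = hms_to_seconds(record["resources_used.cput"])
    wall = hms_to_seconds(record["resources_used.walltime"])
    if wall == 0:
        raise ValueError(f"job {record['job_id']} has zero wall time")
    return cpu / wall


# Hypothetical example record (job id, user and group are invented).
example = ("04/15/2005 12:34:56;E;1234.example-server;user=alice group=atlas "
           "resources_used.cput=01:30:00 resources_used.walltime=02:00:00")
record = parse_end_record(example)
print(record["group"], job_efficiency(record))  # atlas 0.75
```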
<br />
Efficiency statistics are calculated monthly, and are available (along<br />
with those for CPU, disk and tape) at the<br />
[http://www.gridpp.rl.ac.uk/stats/ Tier 1/A Statistics] page.<br />
<br />
== Calculation ==<br />
<br />
We calculate and report overall efficiencies for each month, and these<br />
are defined as ratios of the sum of all CPU times to the sum of all<br />
wall times:<br />
<br />
overall efficiency = SUM (CPU times) / SUM (wall times).<br />
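<br />
Because this figure divides summed CPU time by summed wall time, it equals<br />
the average of the per-job efficiencies weighted by each job's wall time,<br />
so long jobs dominate it. The Python sketch below, using invented job<br />
times, contrasts it with an unweighted mean of per-job ratios (which is<br />
not what is reported here).<br />
<br />
```python
# Sketch: the monthly "overall efficiency" defined above, compared with a
# plain average of per-job efficiencies. Jobs are (cpu_seconds, wall_seconds).

def overall_efficiency(jobs):
    """SUM(CPU times) / SUM(wall times) over all jobs."""
    total_cpu = sum(cpu for cpu, _ in jobs)
    total_wall = sum(wall for _, wall in jobs)
    return total_cpu / total_wall


def mean_job_efficiency(jobs):
    """Unweighted mean of the individual CPU/wall ratios (for comparison only)."""
    return sum(cpu / wall for cpu, wall in jobs) / len(jobs)


# Hypothetical month: one long, efficient job and nine short, mostly idle ones.
jobs = [(36_000, 40_000)] + [(10, 100)] * 9
print(round(overall_efficiency(jobs), 3))   # 0.882
print(round(mean_job_efficiency(jobs), 3))  # 0.18
```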
<br />
Efficiency data have been collected dating back to the beginning of<br />
2005.<br />
<br />
== Graphs ==<br />
<br />
The latest plots for the current year (up to the end of the previous<br />
month) are available [http://www.gridpp.rl.ac.uk/stats/ here].<br />
<br />
=== Global Efficiencies ===<br />
<br />
We calculate global efficiencies that include jobs from all groups:<br />
<br />
[[Image:RAL_Tier1-PBS-Efficiencies-year-global-summary.png]]<br />
<br />
=== Group Efficiencies ===<br />
<br />
Efficiencies are also calculated separately for each submitting group:<br />
<br />
[[Image:RAL_Tier1-PBS-Efficiencies-year-expt-summary.png]]<br />
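<br />
Applying the same SUM/SUM definition within each group gives the<br />
per-group figures. A minimal Python sketch, assuming each job is available<br />
as a (group, CPU seconds, wall seconds) tuple; the group names and times<br />
are hypothetical.<br />
<br />
```python
from collections import defaultdict


def group_efficiencies(jobs):
    """Return {group: SUM(cpu) / SUM(wall)} for jobs given as (group, cpu_s, wall_s)."""
    cpu_by_group = defaultdict(int)
    wall_by_group = defaultdict(int)
    for group, cpu, wall in jobs:
        cpu_by_group[group] += cpu
        wall_by_group[group] += wall
    # Skip any group with no recorded wall time to avoid dividing by zero.
    return {g: cpu_by_group[g] / wall_by_group[g]
            for g in wall_by_group if wall_by_group[g] > 0}


# Hypothetical jobs for two groups.
jobs = [
    ("atlas", 3_000, 4_000),
    ("atlas", 1_000, 4_000),
    ("h1", 7_200, 7_500),
]
effs = group_efficiencies(jobs)
print(effs["atlas"], effs["h1"])  # 0.5 0.96
```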
<br />
=== CPU Time ''vs'' Efficiency ===<br />
<br />
Plots of CPU time ''vs'' efficiency for each group and each month are generated.<br />
<br />
For example, this plot shows the ATLAS data for April 2005:<br />
<br />
[[Image:RAL_Tier1-PBS-Efficiencies-2005-04-ATLAS-scatter.png]]<br />
<br />
Each point on the graph represents one or more jobs, and CPU time is plotted<br />
against efficiency (CPU time / wall time).<br />
<br />
A summary of all the monthly efficiencies, with links to pages of<br />
scatter plots of efficiency ''vs'' CPU time (by experiment and by<br />
month), is<br />
[http://www.gridpp.rl.ac.uk/stats/eff/RAL/All/archive/summary.html available].<br />
<br />
== Analysis ==<br />
<br />
The efficiency of a particular job depends on factors such as the<br />
nature of the code (CPU intensive or I/O intensive), access to<br />
external resources (for example, a storage element) and the hardware<br />
on which the job is running.<br />
<br />
=== Low Efficiencies ===<br />
<br />
==== Startup Overheads ====<br />
<br />
Very short jobs will be inefficient when startup overheads such as job<br />
accounting are comparable to the length of the job.<br />
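<br />
A simple model makes the effect concrete: if a job is otherwise fully<br />
CPU-bound but spends a fixed overhead o that adds wall time without adding<br />
CPU time, its efficiency is c / (c + o) for CPU time c. The Python sketch<br />
below assumes a 30-second overhead; that figure is invented, not measured.<br />
<br />
```python
def efficiency_with_overhead(cpu_seconds: float, overhead_seconds: float) -> float:
    """Efficiency of a job that is CPU-bound apart from a fixed startup
    overhead, which is assumed to add wall time but no CPU time."""
    return cpu_seconds / (cpu_seconds + overhead_seconds)


OVERHEAD_SECONDS = 30.0  # hypothetical fixed per-job overhead

for cpu in (10, 60, 600, 36_000):
    eff = efficiency_with_overhead(cpu, OVERHEAD_SECONDS)
    print(f"{cpu:>6} s of CPU -> efficiency {eff:.3f}")
```
With these numbers a 10-second job reaches an efficiency of only 0.25,<br />
while a ten-hour job exceeds 0.99.<br />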
<br />
==== File Transfer ====<br />
<br />
Long jobs may be inefficient for a number of reasons. For example, if<br />
data have to be transferred to the worker node, this may take a<br />
significant amount of wall time but little CPU time (depending on the<br />
size and location of the files being transferred, and the available<br />
network bandwidth). If the transfer time is a significant fraction of<br />
the total elapsed wall time, the efficiency of the job will be<br />
reduced.<br />
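<br />
The same reasoning can be sketched for an input file that is copied to<br />
the worker node before processing starts. The Python below assumes an<br />
idealised transfer (size divided by bandwidth, with no latency, no<br />
contention and no overlap between copying and computing); the 10 GB file<br />
and the 100 Mbit/s share of the network are hypothetical.<br />
<br />
```python
def transfer_seconds(size_bytes: float, bandwidth_bits_per_second: float) -> float:
    """Idealised copy time: size / bandwidth, ignoring latency and contention."""
    return size_bytes * 8 / bandwidth_bits_per_second


def efficiency_with_transfer(cpu_seconds, size_bytes, bandwidth_bits_per_second):
    """Copy the input first, then compute: the two phases do not overlap."""
    wall = cpu_seconds + transfer_seconds(size_bytes, bandwidth_bits_per_second)
    return cpu_seconds / wall


# Hypothetical: 10 GB (1e10 bytes) over a 100 Mbit/s share, then 1 h of CPU work.
copy_time = transfer_seconds(10e9, 100e6)
eff = efficiency_with_transfer(3_600, 10e9, 100e6)
print(f"transfer {copy_time:.0f} s, efficiency {eff:.2f}")  # transfer 800 s, efficiency 0.82
```
Here the copy alone adds 800 s of wall time, which lowers an hour of CPU<br />
work to an efficiency of about 0.82.<br />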
<br />
==== File Access ====<br />
<br />
If a job involves a significant amount of I/O (reading large amounts<br />
of data from a transferred file, or writing large amounts of data to<br />
temporary storage), and this is the limiting factor in the execution<br />
of the job, then the CPU will at times be idle. For such I/O-bound<br />
jobs, the more time the CPU is idle the less efficient the job will<br />
be.<br />
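<br />
The difference between CPU-bound and I/O-bound behaviour can be observed<br />
directly by comparing a process's CPU time with its elapsed wall time. In<br />
this Python sketch, time.sleep stands in for a process blocked on I/O (an<br />
assumption made to keep the example self-contained). Note that a batch<br />
system accounts CPU time across all of a job's processes, whereas<br />
time.process_time covers only the current Python process.<br />
<br />
```python
import time


def measured_efficiency(work) -> float:
    """Run work() and return this process's CPU time / elapsed wall time."""
    cpu_start, wall_start = time.process_time(), time.perf_counter()
    work()
    cpu_used = time.process_time() - cpu_start
    wall_used = time.perf_counter() - wall_start
    return cpu_used / wall_used


def cpu_bound():
    # Pure arithmetic keeps the CPU busy for the whole call.
    total = 0
    for i in range(3_000_000):
        total += i * i
    return total


def waiting_on_io():
    # time.sleep stands in for a process blocked on slow I/O (e.g. a remote read).
    time.sleep(0.5)


print(f"CPU-bound: {measured_efficiency(cpu_bound):.2f}")
print(f"I/O-wait:  {measured_efficiency(waiting_on_io):.2f}")
```
The CPU-bound loop should report an efficiency close to one and the<br />
sleeping function one close to zero, although exact values depend on the<br />
machine and its load.<br />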
<br />
If data are accessed on a disk server via an NFS mount, the I/O will<br />
generally be slower than for local disks, and hangs on the mount may<br />
lead to periods of CPU inactivity or, eventually, to failure or<br />
termination of the job.<br />
<br />
==== Large Memory ====<br />
<br />
Jobs that use a large amount of memory (more physical memory than is<br />
available on the worker node) may be inefficient, because pages must<br />
then be swapped to and from disk and the CPU is idle while it waits<br />
for them. Jobs that leak memory may become increasingly inefficient<br />
as their execution continues.<br />
<br />
==== Operational Reasons ====<br />
<br />
Batch jobs may be suspended while routine maintenance of the farm is<br />
being carried out (for example when rebuilding disk servers). This<br />
will increase the wall time that a job uses, and decrease its<br />
efficiency.<br />
<br />
=== High Efficiencies ===<br />
<br />
Jobs that do not require access to large amounts of data and are not<br />
I/O intensive are likely to be efficient. Note that efficiency here is<br />
simply defined as the ratio of CPU time to wall time, so poorly<br />
written code that leads to long CPU times may still score highly<br />
according to this criterion.<br />
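<br />
A worked example of this caveat, using invented figures: two jobs perform<br />
the same task and both wait one hour for I/O, but one needs 1 hour of CPU<br />
while a wastefully written version needs 4 hours. The Python sketch<br />
assumes wall time is simply CPU time plus waiting time (no overlap).<br />
<br />
```python
def efficiency(cpu_hours: float, idle_hours: float) -> float:
    """CPU time / wall time, where wall time = CPU time + time spent waiting."""
    return cpu_hours / (cpu_hours + idle_hours)


# Hypothetical jobs doing the same task with the same 1 h of I/O waiting.
tidy = efficiency(cpu_hours=1.0, idle_hours=1.0)
wasteful = efficiency(cpu_hours=4.0, idle_hours=1.0)
print(tidy, wasteful)  # 0.5 0.8
```
The wasteful job takes longer to finish yet scores 0.8 against 0.5, so<br />
efficiency alone says nothing about how well the CPU time was used.<br />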
<br />
=== Trends ===<br />
<br />
Some information can be extracted from the CPU time ''vs'' efficiency<br />
graphs. As in the ATLAS plot shown above, several straight-line<br />
structures are visible in a plot of the H1 efficiencies for July 2005:<br />
<br />
[[Image:RAL_Tier1-PBS-Efficiencies-2005-07-H1-scatter-annot.png]]<br />
<br />
The highlighted vertical line represents jobs with constant CPU time.<br />
The other highlighted line (which, if extrapolated, would pass through<br />
the origin) represents jobs with constant wall time W: for these jobs<br />
efficiency = CPU time / W, which is proportional to CPU time. There<br />
are many lines of this latter type in this example, each representing<br />
a different wall time (and probably a different class of job).<br />
<br />
Both constant CPU time and constant wall time may correspond to jobs<br />
that are not terminating correctly.<br />
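<br />
Candidates for such lines can also be found without a plot, by looking<br />
for CPU or wall times shared by many jobs. The Python sketch below bins<br />
times to the nearest minute and reports values shared by at least three<br />
jobs. The bin width, the threshold and the job list are hypothetical; the<br />
48-hour wall time is invented, and jobs stopped at a wall-time limit<br />
would be only one possible reason for a cluster like it.<br />
<br />
```python
from collections import Counter


def repeated_times(jobs, index, min_jobs=3, bin_seconds=60):
    """Return {time: count} for times shared by at least min_jobs jobs.

    jobs are (cpu_seconds, wall_seconds) tuples; index=0 looks at CPU time,
    index=1 at wall time. Times are rounded to the nearest bin_seconds so
    that nearly identical values are counted together.
    """
    counts = Counter(round(job[index] / bin_seconds) * bin_seconds for job in jobs)
    return {value: n for value, n in counts.items() if n >= min_jobs}


# Hypothetical jobs: four share a 48 h wall time, three share 1 h of CPU time.
jobs = (
    [(cpu, 172_800) for cpu in (1_000, 50_000, 120_000, 170_000)]
    + [(3_600, wall) for wall in (4_000, 9_000, 20_000)]
    + [(500, 900), (80_000, 90_000)]
)

print("constant wall time:", repeated_times(jobs, index=1))  # {172800: 4}
print("constant CPU time: ", repeated_times(jobs, index=0))  # {3600: 3}

# Jobs with a fixed wall time W lie on efficiency = cpu / W: a line through
# the origin with slope 1 / W on a CPU time vs efficiency plot.
for wall in repeated_times(jobs, index=1):
    print(f"W = {wall} s -> slope 1/W = {1 / wall:.2e} per second of CPU")
```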
<br />
== Summary ==<br />
<br />
Job efficiencies are one way of examining the performance of the farm.<br />
Low efficiencies can have a variety of causes: some problems may be<br />
transient (such as obtaining data from a storage element), while<br />
others may be intrinsic to the farm (such as disk speeds).<br />
<br />
The data that we plot may help to identify jobs that are failing, so<br />
that the underlying causes can be investigated.<br />
<br />
== Related Documents ==<br />
<br />
* [[RAL Tier1 PBS Scheduling]]<br />
<br />
[[Category:Batch Systems]]<br />
[[Category:RAL Tier1]]</div>Matt hodges