Publishing tutorial for power and accounting
This document summarizes a way to publish the values for accurate power and accounting. It uses an example Grid cluster loosely based on Liverpool. This way of publishing does not involve the use of the glite-CLUSTER node type, which is described elsewhere. In the sections below, published values are shown in italics and bold.
Note on units: Power figures are now generally given in HEPSPEC06 (HS06). However, for historical reasons, values are sometimes transmitted to the outside world in an obsolete standard known as SpecInt2K (SI2K). In a bizarre twist, it was decided to redefine SI2K to be equal to 1/250th of an HS06, and to transmit values in these new SI2K units (which might be called "bogoSpecInt2k" if I were being rigorous!). HS06 values are converted to this SI2K by multiplying them by 250 prior to transmission. As a corollary, one KiloSpecInt2k (KSI2K) equals 4 HS06, i.e. 1000/250.
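The unit conversion described above can be sketched in a few lines of Python. The function and constant names are illustrative only and are not part of any Grid middleware.

```python
# Sketch of the HS06 <-> "new" SI2K conversion described in the note on units.
# All names here are hypothetical helpers, not middleware functions.

HS06_TO_SI2K = 250  # one HS06 is defined as 250 of the redefined SI2K units


def hs06_to_si2k(hs06):
    """Convert an HS06 figure to the redefined SI2K units."""
    return hs06 * HS06_TO_SI2K


def ksi2k_to_hs06(ksi2k):
    """Convert KiloSpecInt2k (1000 SI2K) to HS06."""
    return ksi2k * 1000 / HS06_TO_SI2K


print(hs06_to_si2k(10))   # 2500
print(ksi2k_to_hs06(1))   # 4.0
```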
Note on hyper-threading: A chassis may contain one or more worker node systems. Each worker node system is composed of several CPUs. A CPU contains several cores, and each core can run in a mode called "hyper-threading" that allows it to run 2 threads at once efficiently. Job slots are allotted to worker nodes. Previously, it was usual to allot 1 job slot to each core; throughput could then be increased further by running the cores in hyper-threading mode and allotting 2 job slots per core. However, it became apparent that this approach does not necessarily maximise throughput. Experiments show that, to fully utilize a system, it is often necessary to choose a number of slots that is higher than the number of cores but lower than twice the number of cores. The reason for this phenomenon is related to contention.
At the example site, there are three (fictional) node types, which were benchmarked (see GridPP benchmark guidelines), yielding the power values per slot listed below. The ideal number of slots per node, which varies according to the type, is also shown. The BASELINE node type is "abstract" and is discussed below.
F5620: 14.7875 HS06 per slot, 10 slots per node
Y5650: 14.9750 HS06 per slot, 16 slots per node
M5420: 10.9375 HS06 per slot, 8 slots per node
BASELINE: 10 HS06 per slot (abstract reference)
Logical and physical CPUs
We can now calculate some values.
Consider an example site which has 64 x F5620 nodes, 16 x Y5650 nodes and 7 x M5420 nodes, which makes (64 * 10 * 14.7875 + 16 * 16 * 14.9750 + 7 * 8 * 10.9375) = 13910.1 HS06 total power.
It has (64 * 10 + 16 * 16 + 7 * 8) = 952 slots, AKA Logical CPUs.
Sites are also obliged to state the total physical CPUs. F5620, Y5650 and M5420 all contain 2 physical CPUs, so (64 * 2 + 16 * 2 + 7 * 2) = 174 Physical CPUs.
Sites also need to transmit the average number of cores (or slots) per CPU, i.e. logical CPUs / physical CPUs, which comes to 952 / 174, i.e. Cores=5.47. They also need to send the average benchmark strength of a single core (or slot), i.e. total power of site / logical CPUs, which comes to 13910.1 / 952 = 14.611 HS06.
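The arithmetic in this section can be reproduced with a short Python sketch. The node counts, slots per node, HS06 per slot and physical CPUs per node are taken from the figures above; the dictionary layout and function name are illustrative assumptions.

```python
# Sketch reproducing the site totals computed in this section.
# node type -> (node count, slots per node, HS06 per slot, physical CPUs per node)
NODES = {
    "F5620": (64, 10, 14.7875, 2),
    "Y5650": (16, 16, 14.9750, 2),
    "M5420": (7, 8, 10.9375, 2),
}


def site_totals(nodes):
    """Return total power, logical/physical CPU counts and the two averages."""
    total_hs06 = sum(n * slots * hs06 for n, slots, hs06, _ in nodes.values())
    logical = sum(n * slots for n, slots, _, _ in nodes.values())
    physical = sum(n * cpus for n, _, _, cpus in nodes.values())
    return {
        "total_hs06": total_hs06,
        "logical_cpus": logical,
        "physical_cpus": physical,
        "cores": logical / physical,        # average slots per physical CPU
        "benchmark": total_hs06 / logical,  # average HS06 per slot
    }


totals = site_totals(NODES)
print(round(totals["total_hs06"], 1))  # 13910.1
print(totals["logical_cpus"])          # 952
print(totals["physical_cpus"])         # 174
print(round(totals["cores"], 2))       # 5.47
print(round(totals["benchmark"], 3))   # 14.611
```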
Sites have to transmit (via the BDII and the accounting system) a couple more things: the power of the site and the amount of work done. In a cluster of mixed nodes, the work done is equal to the time spent running in a slot multiplied by a factor that relates the strength of a slot of that particular machine type to some arbitrary reference value. At the example site, an abstract node type called BASELINE was introduced, with a reference power of 10 HS06 (per single-core job).
Note: This is done because machines are replaced when obsolete, so it is convenient to have a node type that never changes and to scale to that. All other node types receive a scaling factor that describes how powerful their job slots are relative to the reference, i.e. the HS06 per slot divided by the reference power of 10 HS06. At the example site, these values are:
F5620: 1.47875
Y5650: 1.4975
M5420: 1.09375
BASELINE: 1.0
These scaling factors are installed on the worker nodes, and all work-done durations are multiplied by the factor to reduce or increase the value of work done, depending on whether the physical node is weaker or stronger than the reference node. Also used in the calculation is whether the job ran a single thread (i.e. a "single core" job) or multiple threads (i.e. a "multi core" job, which is currently always equivalent to 8 singles).
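As a rough illustration of the scaling, the Python sketch below derives a factor for each node type and applies it to a raw duration. It assumes the factor is simply the per-slot HS06 divided by the BASELINE reference of 10 HS06. How the multi-core width enters the real calculation is not spelled out above, so multiplying by 8 here is an assumption, and all names are hypothetical.

```python
# Sketch of per-node-type scaling, assuming factor = HS06 per slot / reference.
BASELINE_HS06 = 10.0   # reference power of the abstract BASELINE node type
MULTICORE_WIDTH = 8    # a multi-core job is treated as 8 single-core jobs

HS06_PER_SLOT = {"F5620": 14.7875, "Y5650": 14.9750, "M5420": 10.9375}


def scaling_factor(node_type):
    """Strength of one slot of this node type relative to BASELINE."""
    return HS06_PER_SLOT[node_type] / BASELINE_HS06


def scaled_duration(raw_seconds, node_type, multicore=False):
    """Scale a raw duration; the multi-core multiplication is an assumption."""
    width = MULTICORE_WIDTH if multicore else 1
    return raw_seconds * scaling_factor(node_type) * width


print(scaling_factor("M5420"))         # 1.09375
print(scaled_duration(3600, "M5420"))  # 3937.5
```

An hour on the weaker M5420 slot counts for less scaled time than an hour on a Y5650 slot, which is what lets mixed hardware be accounted against a single reference.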
These scaled values, in the torque logs, are then suitable to be processed by the APEL system. Furthermore, the scaled duration is used by the batch head node to determine if a job has exceeded the CPU or wall time limits of the queue it is executing on.
To allow the accounting system to translate the scaled durations into actual figures of work done, it is necessary to tell the outside world about the reference value used. This is done via the CPU Scaling Reference value. It takes the reference value, 10, converted to SI2K, i.e. 10 x 250, i.e. CPUScalingReferenceSI00 = 2500.
Total power transmissions
To allow the power of the site to be transmitted, independent of any scaling, it is necessary to calculate the average power per slot by taking 13910.1 (HS06 total power at site) and dividing it by 952 (the number of logical CPUs, or slots), giving 14.611 HS06. This must be converted to SI2K, i.e. 14.611 * 250, i.e. CE_SI00 = 3653 (this must be an integer value).
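The two SI2K values that get published can be derived as in the following Python sketch. Rounding to the nearest integer is an assumption, made because it matches the 3653 quoted above; the variable names are illustrative.

```python
# Sketch deriving the two published SI2K values from the HS06 figures above.
HS06_TO_SI2K = 250

total_hs06 = 13910.1    # total site power in HS06
logical_cpus = 952      # number of slots
reference_hs06 = 10     # BASELINE reference power

per_slot_hs06 = total_hs06 / logical_cpus

# CE_SI00 must be an integer; rounding to nearest is assumed here.
ce_si00 = round(per_slot_hs06 * HS06_TO_SI2K)
cpu_scaling_reference_si00 = reference_hs06 * HS06_TO_SI2K

print(ce_si00)                     # 3653
print(cpu_scaling_reference_si00)  # 2500
```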
All these values are placed in YAIM variables as per the example below.
CE_PHYSCPU=174
CE_LOGCPU=952
CE_CAPABILITY="CPUScalingReferenceSI00=2500 Share=atlas:63 Share=lhcb:25 glexec"
CE_SI00=3653
CE_OTHERDESCR=Cores=5.47,Benchmark=14.61-HEP-SPEC06
Avoiding double counting
The YAIM mapping above would be good for a site with one CE. But the example site has two CEs sharing the same TORQUE server, so it would be correct to set the logical and physical CPU counts in one CE and set them to zero in the other; otherwise double counting would occur. Unfortunately, this raises divide-by-zero errors elsewhere. To work around that, we set the logical and physical CPU counts to 1 in one CE and to "count - 1" in the other CE. This kludge gives the correct arithmetic while avoiding the zero division, so everyone is happy. You can adjust this technique for any number of CEs.
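The split can be sketched in Python as follows. Extending it to more than two CEs by giving each extra CE a count of 1 is one possible generalisation, assumed here for illustration; the function name is hypothetical.

```python
# Sketch of the "1 and count - 1" kludge, generalised to n CEs by assumption.
def split_counts(total, n_ces):
    """Split a CPU count over n CEs so no CE publishes zero and the sum is exact.

    The first CE gets the remainder; every other CE gets 1.
    """
    if n_ces < 1 or total < n_ces:
        raise ValueError("need at least one CPU per CE")
    return [total - (n_ces - 1)] + [1] * (n_ces - 1)


print(split_counts(952, 2))  # [951, 1]  -> CE_LOGCPU for each CE
print(split_counts(174, 2))  # [173, 1]  -> CE_PHYSCPU for each CE
```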
Accounting for VAC which has no BDII
Each VAC "Factory" is regarded as a CE. Since (a) VAC has no BDII and (b) it is a requirement that the total site HS06 be represented (else a site would look smaller than it is), it is necessary to inflate the values published by another CE (say ARC or CREAM) so that it represents the VAC power as well as the traditional batch system power.
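One plausible way to do the inflation is sketched below in Python: add the VAC slots and HS06 to the batch figures and recompute the per-slot average before publishing. This reading is an assumption, since the exact procedure is not specified above, and the VAC numbers (100 slots totalling 1000 HS06) are purely hypothetical.

```python
# Sketch of inflating one CE's published values to include VAC capacity.
# The method and the VAC figures are assumptions for illustration only.
HS06_TO_SI2K = 250


def inflated_ce_values(batch_hs06, batch_slots, vac_hs06, vac_slots):
    """Combine batch and VAC capacity into one set of publishable values."""
    total_hs06 = batch_hs06 + vac_hs06
    total_slots = batch_slots + vac_slots
    benchmark = total_hs06 / total_slots  # average HS06 per slot
    return {
        "CE_LOGCPU": total_slots,
        "benchmark_hs06": benchmark,
        "CE_SI00": round(benchmark * HS06_TO_SI2K),
    }


# Batch figures from the example site; VAC figures are hypothetical.
values = inflated_ce_values(13910.1, 952, vac_hs06=1000.0, vac_slots=100)
print(values["CE_LOGCPU"])  # 1052
print(values["CE_SI00"])    # 3543
```

The physical CPU count would presumably need the same treatment, so that the Cores value stays consistent with the inflated slot count.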
One last thing
If you are using ARC/Condor (or even just ARC), there is some information on using these variables in the Publishing section of this document: Example_Build_of_an_ARC/Condor_Cluster.