RAL Tier1 PBS Scheduling

This page is obsolete

Introduction

This document gives an overview of the batch queues and scheduling algorithms used at the RAL Tier1 Batch Farm. The aim is to give users a better understanding of job submission to the farm, of the factors that influence the time taken for a job to begin running (sometimes referred to as the estimated traversal time, ETT), and of how long a job takes to complete.

Batch Queues

Grid and classical (non-Grid) jobs have different routes to the batch system, but are ultimately prioritised by a single scheduler, and have access to the same resources. The policies that determine when jobs are scheduled to run are applied on a per-group basis. For Grid jobs, there is a one-to-one correspondence between groups and virtual organisations (VOs) such as the four LHC VOs (ALICE, ATLAS, CMS, and LHCb). For classical jobs, the group is the user's login group.

Grid jobs cannot be run on the classical queues, and vice versa.

Classical Queues

The farm can now be accessed by only a small number of non-Grid users (typically for testing purposes).

Note that the default queue is a Grid queue (gridS), and submission to it from a UI will not work; a non-Grid queue (prod4 or express) must be specified when submitting jobs.

The prod4 queue is a routing queue that schedules jobs on the sl4p and sl4m execution queues. The default requested resources are 24 hours of CPU time and 500 MB of memory, and jobs within those limits will execute on the sl4p queue. The sl4m queue is for large-memory jobs. The resource limits are as follows:

 Queue name  Job Type      Default requested CPU Time  Max. CPU Time  Max. Walltime  Min. Memory  Max. Memory
                           (KSI2K hours)               (KSI2K hours)  (KSI2K hours)  (MB)         (MB)
 sl4p        Production    24                          48             56             0            1000
 sl4m        Large memory  24                          48             56             1001         4000

(See RAL Tier1 PBS Scheduling#Farm Scaling for information on KSI2K scaling.)
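
For illustration, a job submitted to the prod4 routing queue that requests more than 1000 MB of memory should be placed on the sl4m execution queue, assuming the routing keys on the requested mem resource (the job script name below is hypothetical):

$ qsub -q prod4 -l mem=2000mb,cput=36:00:00 myjob.sh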

Express Queues

Two express queues exist, sl4pe and sl4me, which have the same resource limits as sl4p and sl4m respectively. Jobs can be routed to these execution queues by submitting them to the express routing queue, for example:

$ qsub -q express ...

At any given time, only one job per user can be run from the express queue.

CPU Time Limits

If you want to increase the default CPU time limit, it may be necessary to specify both the total CPU time (cput) and per-process CPU time (pcput) limits, in which case submit jobs using:

$ qsub -l cput=hh:mm:ss,pcput=hh:mm:ss

Grid Queues

The VOs share a set of queues defined by memory limits: the grid500M queue, for example, requests 500 MB of memory, and there are analogous queues requesting 700 MB, 1000 MB, 2000 MB and 3000 MB. In addition there is a gridS queue for short Grid jobs.

Queue Information

Ganglia

Details of the number of jobs running and queued over the last 24 hours can be obtained from Ganglia plots such as:

   File:RAL Tier1-PBS-Scheduling-queues.png

at http://ganglia.gridpp.rl.ac.uk/cgi-bin/ganglia-pbs/pbs-page.pl?r=day. A breakdown of the queued and running jobs by queue is also shown.

Front Ends

A list of queues and their status can be obtained from the front ends/user interfaces (tier1a.gridpp.rl.ac.uk) with the qstat command:

 $ qstat -q
 
 server: csflnx353.rl.ac.uk
 
 Queue            Memory CPU Time Walltime Node  Run Que Lm  State
 ---------------- ------ -------- -------- ----  --- --- --  -----
 prod4              --      --       --      --    0   0 --   E R
 prod               --      --       --      --    0   0 --   E R
 express            --      --       --      --    0   0 --   E R
 sl4pe              --   48:00:00 56:00:00   --    0   0 --   E R
 sl4p               --   48:00:00 56:00:00   --  427   5 --   E R
 sl4me              --   48:00:00 56:00:00   --    0   0 --   E R
 sl4m               --   48:00:00 56:00:00   --    0   0 --   E R
 grid500M           --   60:00:00 72:00:00   --  127  17 --   E R
 grid700M           --   60:00:00 72:00:00   --   11  12 --   E R
 grid1000M          --   60:00:00 72:00:00   --   13   6 --   E R
 grid2000M          --   60:00:00 72:00:00   --  104  17 --   E R
 grid3000M          --   60:00:00 72:00:00   --   28   0 --   E R
 gridS              --   01:00:00 02:00:00   --    1   1 --   E R
                                                ----- -----
                                                  711    58
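
The full set of limits configured for a particular queue (defaults, maxima and so on) can be inspected from the front ends with the standard PBS queue query, for example:

 $ qstat -Q -f grid500M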

Grid

For Grid users, the Information System can be queried for details about queues with, for example:

 $ lcg-infosites --vo atlas ce | grep gridpp
 2527    1910       4              3        1    lcgce03.gridpp.rl.ac.uk:2119/jobmanager-lcgpbs-grid700M
 2527    1910     128            117       11    lcgce03.gridpp.rl.ac.uk:2119/jobmanager-lcgpbs-grid500M
 2527    1906      25             25        0    lcgce05.gridpp.rl.ac.uk:2119/jobmanager-lcgpbs-grid3000M
 2527    1910       0              0        0    lcgce03.gridpp.rl.ac.uk:2119/jobmanager-lcgpbs-gridS
 2527    1906     129            118       11    lcgce05.gridpp.rl.ac.uk:2119/jobmanager-lcgpbs-grid500M
 2527    1910      76             76        0    lcgce03.gridpp.rl.ac.uk:2119/jobmanager-lcgpbs-grid2000M
 2527    1910       5              5        0    lcgce03.gridpp.rl.ac.uk:2119/jobmanager-lcgpbs-grid1000M
 2527    1906      78             78        0    lcgce05.gridpp.rl.ac.uk:2119/jobmanager-lcgpbs-grid2000M
 2527    1910      24             24        0    lcgce03.gridpp.rl.ac.uk:2119/jobmanager-lcgpbs-grid3000M
 2527    1906       4              3        1    lcgce05.gridpp.rl.ac.uk:2119/jobmanager-lcgpbs-grid700M
 2527    1906       0              0        0    lcgce05.gridpp.rl.ac.uk:2119/jobmanager-lcgpbs-gridS
 2527    1906       5              5        0    lcgce05.gridpp.rl.ac.uk:2119/jobmanager-lcgpbs-grid1000M

or for more detailed information, with:

 $ ldapsearch -x -H ldap://site-bdii.gridpp.rl.ac.uk:2170 -b 'mds-vo-name=RAL-LCG2,o=grid'
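
The full search returns a large amount of information; it can be narrowed to a single queue's CE entry with an LDAP filter (a sketch using Glue 1.3 attribute names; adjust the CE identifier as required):

 $ ldapsearch -x -H ldap://site-bdii.gridpp.rl.ac.uk:2170 -b 'mds-vo-name=RAL-LCG2,o=grid' \
       '(GlueCEUniqueID=lcgce03.gridpp.rl.ac.uk:2119/jobmanager-lcgpbs-grid2000M)' \
       GlueCEStateRunningJobs GlueCEStateWaitingJobs GlueCEStateEstimatedResponseTime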

Queues map on to Clusters, and Clusters map uniquely on to SubClusters, where the latter publish resources to the Grid.

Scheduling

It is important to understand that, although jobs are logically separated into various queues, for scheduling purposes they all sit in a single pool. The only factors that affect job priorities are the fairshare policies. (The one exception is jobs that run on the express queues, whose priorities exceed the maximum available from the fairshare policies.)

Fairshares

Fairshare (FS) targets correspond to certain fractions of the Farm's utilisation being accounted to each group. For example, a FS profile could be groupA utilising 50% of the farm, groupB 25% and so on. Utilisation is determined from the elapsed walltime (not CPU time) of a group's jobs.

Group FS targets are derived from the CPU allocations made by the User Board (UB) by converting them into percentages of the total allocation. These allocations are updated quarterly, and are available in the Current Schedule document. (Questions regarding the allocations should be directed to the Experiment Operational Contacts.)

The FS scheduling is handled by the Maui Cluster Scheduler. Maui assigns a priority to each job, and jobs are started in priority order such that the FS targets are met (or matched as closely as possible given the mix of jobs submitted to the Farm).

Fairshare Policies

The fairshares operate in a hierarchical manner. At every level of the hierarchy the priority component is proportional to

1 - (FS usage / FS target)

but the components at different levels carry different weights. Note that a group whose usage exceeds its target therefore receives a negative component (for example, a usage of 30% against a target of 20% gives 1 - 30/20 = -0.5), while a group using less than its target receives a positive one.

LHC vs. non-LHC Fairshares

The highest weighted fairshare determines the relative scheduling of LHC (ALICE, ATLAS, CMS and LHCb) jobs vs. all other jobs. The LHC target is the sum of the four LHC experiment fairshares, and the non-LHC target is the sum of all the non-LHC experiment fairshares.

This means that if the total LHC fairshare usage is lower than the total LHC target, one or more of the individual LHC experiments can exceed their group fairshare targets in order to meet this policy.

Group Fairshares

Within each LHC or non-LHC set of experiments the group fairshares are those set by the GridPP User Board.

These fairshares contribute less to job priorities than the LHC/non-LHC fairshares.

Intra-Group Fairshares

Configuration of intra-group fairshares is being tested. Currently we have configured CMS such that the target is for 90% of their work to be production jobs.

To meet these policies the experiments have to submit a sufficient mix of production and non-production work.

These fairshares contribute less to job priorities than the group fairshares.

User Fairshares

The fairshares that contribute least to job priorities are the user fairshares. These are the fairshares that discriminate between users in the same group (or sub-group, if intra-group fairshares are configured).

All users have an equal fairshare target.

Summary

(1) All LHC jobs have a highly-weighted priority component that is equal. All non-LHC jobs have a highly-weighted priority component that is equal.

(2) All jobs within a group have an equal priority component from the group fairshare usage. This has a smaller weight than the LHC/non-LHC component.

(3) All jobs within a sub-group have an equal priority component from the intra-group fairshare usage. This has a smaller weight than the group component. (Only ATLAS and CMS currently have sub-groups.)

(4) All jobs for a given user have an equal priority component from the user fairshare usage. This has a smaller weight than the intra-group component.

Fairshare Algorithm

The average use of the farm (for all levels of the hierarchy) is calculated over a configured number of periods (FS windows), the weight of which decreases geometrically by a configured factor. For the RAL Farm, there are nine fairshare windows of 24 hours, and the weight decays by a factor of 0.7 for each successive window. (Farm usage beyond this period is irrelevant to the scheduling of submitted jobs.)

For example, if a group accounts for 25% of the farm utilisation over the last 24 hours, 40% in the 24-hour window preceding that, 60% in the window before that, and nothing in the remaining six windows, its FS usage will be:

25 + (40 * 0.7) + (60 * 0.7 * 0.7) + (0 * 0.7 * 0.7 * 0.7) + ... = 82.40
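
The same calculation can be reproduced on the command line, using the utilisation figures from the example above (one value per 24-hour window, most recent first):

 $ echo "25 40 60 0 0 0 0 0 0" | awk '{ for (i = 1; i <= NF; i++) u += $i * 0.7^(i-1); print u }'
 82.4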

When will my job run?

The jobs are given priorities according to the FS usage described above. Jobs are then run in order as long as there are job slots in queues that can satisfy the requested resources (CPU time, walltime, memory). The factors that influence when a job will run are therefore:

(1) LHC/non-LHC priorities. Priority will be low for non-LHC experiments relative to LHC experiments if, for example, the total non-LHC fairshare usage is above the total non-LHC fairshare target.

(2) Group FS priorities. Target and usage data are accumulated and published via Ganglia at http://ganglia.gridpp.rl.ac.uk/cgi-bin/ganglia-fs/fs-page.pl. Jobs from groups with high (1 - (usage / target)) values on plots like the following will have high priorities.

   File:RAL Tier1-PBS-Scheduling-FS-ratios.png

(3) Intra-group FS priorities. Priority will be low relative to other sub-groups if the intra-group fairshare target has been exceeded.

(4) User FSs. Within a group, jobs are prioritised to try and reach equal FSs amongst the users.

(5) Requested job resources. Jobs requesting more resources may have fewer slots that they can run in. For example, the number of worker nodes that can run large memory jobs is relatively small, so requesting more memory than needed can increase the time taken for a job to be executed. Jobs requesting small amounts of walltime may be able to run in job slots reserved for high-priority jobs, while the high-priority jobs are waiting for sufficient resources (typically memory) to become available.

(6) Whether the Farm is full: if it is not, low-priority jobs will still run.

It is important to request a reasonable estimate of the walltime that a job is expected to use. If excessive walltime is requested, and jobs are blocked for some reason (consuming walltime but not CPU time), then both the user and group FSs will increase, even though no useful work is being done.

Note that if the Farm is not full, jobs will be run so long as there are slots that satisfy the requested resources, regardless of FS targets being exceeded. Conversely, if a group has exceeded its FS target, and the Farm is full, no jobs from that group will be scheduled until either the group's current FS usage has dropped below the target, or the Farm is no longer full, or there is a suitable job slot that cannot be filled by a higher-priority job (due to requested resources).

Command-line tools

The following commands can be used from the front ends:

(1) diagnose -f: examine fairshare targets and usage;

(2) showq -i: examine job priorities;

(3) diagnose -p: examine the components of job priorities (user, group, account, QoS).
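
For example, to pick out the fairshare usage and target for a particular group (assuming the group name appears in the diagnose -f output; atlas is used purely as an illustration):

 $ diagnose -f | grep -i atlas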

Farm Scaling

To hide the details of the farm's heterogeneous mix of CPUs from users, resources are requested against a hypothetical CPU with a KSpecInt2K (KSI2K) rating of 1.0. (The i686 processors here, with a clock speed of about 2.6 GHz, have a KSI2K rating of 1.033.)

When submitting jobs and setting the maximum CPU/walltime, remember that these limits correspond to a machine with a KSI2K rating of 1.0. Not doing so may mean that too much CPU/walltime is requested, possibly increasing the ETT, or that too little CPU/walltime is requested, which may result in the job not running to completion.
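
As a sketch of the required arithmetic (the job script name is hypothetical): a job known to use about 20 hours of CPU on hardware comparable to the 2.6 GHz i686 nodes (rating 1.033) corresponds to 20 * 1.033 = 20.66 KSI2K hours of work, so a CPU time limit of at least 21 hours should be requested, e.g.:

$ qsub -q prod4 -l cput=21:00:00,pcput=21:00:00 myjob.sh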

Related Documents