LCG-on-SGE

Overview

The LCG software distribution assumes that you will be using the provided PBS cluster management system to manage your cluster. This page describes how to adapt the provided software so that you can construct an LCG Compute Element (CE) that submits jobs to a pre-existing Sun Grid Engine (SGE) cluster installation.

All of the documentation here assumes that you already have a functioning GridEngine installation.

NOTE: This document is a work-in-progress. It is not yet complete and may contain factual inaccuracies or out-of-date information.

There is a mailing list for those interested in the use of LCG on SGE; see also [1].

Prerequisites

You will need:

  • An installed machine that will act as your Compute Element (CE).
  • The LCG CE packages suitable for your system installed on your CE.
  • The SGE client-side tools installed on your CE.
  • A copy of the SGE JobManager. (Recent versions of LeSC's original JobManager have not yet been packaged or versioned; however, the code can be downloaded from http://www.doc.ic.ac.uk/~dwm/Code/. Note that this code may change before being officially released, and alternative JobManager implementations are being developed elsewhere.)
  • A copy of the SGE Information Reporter. (http://www.lesc.ic.ac.uk/projects/SGE-LCG.html)
  • Shared home directories between your SGE worker nodes. (This is a prerequisite imposed by the current LeSC JobManager implementation)

[ Note: As of December 2006, a new maintainer has taken over development of the information provider and the version located at http://www.doc.ic.ac.uk/~dwm/Code/ is no longer up-to-date.]

[ Note: These notes assume that all of the core setup steps required for an LCG node -- such as obtaining a trusted, signed host keypair -- have already been completed. ]

Explanation: The JobManager

The function of the JobManager is to translate between the interfaces provided by the general-purpose Globus Gatekeeper and the software-specific interfaces of the underlying cluster system. The JobManager, at its core, is a simple Perl module; it implements three basic methods: submit(), cancel() and poll():

  • submit(), when called with a set of parsed RSL fields as a parameter, generates a job submission script that encodes all of the job requirements and submits it to the local cluster installation. It returns a cluster-local job ID for later use.
  • cancel(), when called with a cluster-local job ID as a parameter, removes that job from the underlying cluster installation.
  • poll(), when called with a cluster-local job ID as a parameter, queries the state of that job on the underlying cluster installation and reports one of a set of general state conditions, such as "running", "queued" or "held".

To use SGE, rather than PBS, as the backend cluster system, you will need to install and configure an SGE-specific JobManager on your CE.
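
As a rough illustration of this interface, a heavily simplified skeleton of an SGE job manager module might look like the sketch below. This is not the LeSC implementation: the qsub/qstat handling is deliberately naive, error handling is omitted, the job script path is a hypothetical placeholder, and a real module must also build the job script from the RSL and detect finished or failed jobs.

package Globus::GRAM::JobManager::sge;

# Sketch only -- the real LeSC JobManager handles full RSL translation,
# virtual queues, preamble/postamble scripts and error reporting.
use Globus::GRAM::Error;
use Globus::GRAM::JobState;
use Globus::GRAM::JobManager;

@ISA = qw(Globus::GRAM::JobManager);

sub submit
{
    my $self        = shift;
    my $description = $self->{JobDescription};     # parsed RSL fields
    my $executable  = $description->executable();  # e.g. the RSL executable

    # A real implementation writes a job script around $executable encoding
    # the RSL requirements (arguments, stdout/stderr, limits, ...).
    my $script      = '/tmp/job_script.sh';        # hypothetical path
    my $qsub_output = `qsub $script`;
    my ($job_id)    = $qsub_output =~ /Your job (\d+)/;   # SGE 6 qsub reply

    return { JOB_ID    => $job_id,
             JOB_STATE => Globus::GRAM::JobState::PENDING };
}

sub poll
{
    my $self   = shift;
    my $job_id = $self->{JobDescription}->jobid();

    # Naive state check: ACTIVE if the job appears in the running list,
    # otherwise assume PENDING. A real module must also detect DONE/FAILED.
    my $state = (`qstat -s r` =~ /^\s*$job_id\s/m)
                ? Globus::GRAM::JobState::ACTIVE
                : Globus::GRAM::JobState::PENDING;

    return { JOB_STATE => $state };
}

sub cancel
{
    my $self   = shift;
    my $job_id = $self->{JobDescription}->jobid();

    system("qdel $job_id");
    return { JOB_STATE => Globus::GRAM::JobState::FAILED };
}

1;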

Implementation of the JobManager

  • The JobManager assumes that shared home directories are used.
  • It has been tested with SGE version 6.
  • The module, sge.pm, derives from Globus::GRAM::JobManager, which provides utility functions for retrieving the RSL.
  • The script uses several configuration files located in /etc/sge-jobmanager:
    • jobmanager.conf contains the following settings (a minimal example is shown after this list):
      preamble-path    Path to a preamble script; can be used to source the environment.
      postamble-path   Path to the postamble (post-job) script.
      vqueues-path     Path to the virtual queue definitions.
    • vqueues.conf contains the virtual queue definitions; an example is shown in the Information Provider section below.
  • The actual implementation of the JobManager can be found here.
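
As a concrete illustration, a minimal jobmanager.conf might look like the following. The paths are purely illustrative and the exact syntax should be checked against the JobManager distribution:

# /etc/sge-jobmanager/jobmanager.conf -- illustrative values only
preamble-path   /etc/sge-jobmanager/preamble.sh
postamble-path  /etc/sge-jobmanager/postamble.sh
vqueues-path    /etc/sge-jobmanager/vqueues.conf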

Explanation: The Information Provider

The function of the Information Reporter is to extract current state information from the underlying cluster system and feed it, through various processes, to the site-local BDII information service. The data that the Information Reporter extracts is published to the world and is used by Resource Brokers to make scheduling decisions.

The information that the Information Reporter publishes is formatted according to the GLUE Schema.
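
For orientation, the fragment below shows roughly what the published GLUE information for a single queue looks like. The hostname, queue name and all values are invented, and the exact DN layout and attribute set depend on your LCG release and static LDIF configuration:

dn: GlueCEUniqueID=ce.example.ac.uk:2119/jobmanager-sge-24hr,mds-vo-name=local,o=grid
GlueCEStateStatus: Production
GlueCEStateRunningJobs: 52
GlueCEStateWaitingJobs: 3
GlueCEStateTotalJobs: 55
GlueCEStateFreeCPUs: 12
GlueCEPolicyMaxWallClockTime: 1440
GlueCEInfoLRMSType: sge
GlueCEInfoLRMSVersion: 6.0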

Implementation of the Information Provider

  • The initial script that is called is located in /opt/lcg/var/gip/plugin/lcg-info-dynamic-ce and consists of a call to the SGE-specific information provider with whatever command-line arguments the provider needs to find its configuration files (or to override settings within them); a sketch of such a wrapper is shown after the virtual queue example below.
  • The current information provider uses three configuration files:
    • /etc/sge-jobmanager/info-reporter.conf, which can be used to override default locations and settings
    • A static LDIF file generated by YAIM containing the static values for the CE (typically located in /opt/lcg/var/gip/plugin/lcg-info-dynamic-ce)
    • The virtual queue configuration /etc/sge-jobmanager/vqueues.conf
Typical content of the virtual queue configuration file might be:
# Virtual
# Queue         Property                New Value
# --------------------------------------------------
10min           queue                   *
30min           queue                   *
1hr             queue                   *
3hr             queue                   *
6hr             queue                   *
12hr            queue                   *
24hr            queue                   *
72hr            queue                   *

10min           max_wall_time           10
30min           max_wall_time           30
1hr             max_wall_time           60
3hr             max_wall_time           180
6hr             max_wall_time           360
12hr            max_wall_time           720
24hr            max_wall_time           1440
72hr            max_wall_time           4320
This configuration file is also used by the jobmanager developed by LeSC.
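
For completeness, the plugin script mentioned at the start of this list is typically just a small wrapper along the lines of the sketch below. The path to the SGE-specific provider and the argument it is given are illustrative only and depend on how the information reporter is installed at your site:

#!/bin/sh
# /opt/lcg/var/gip/plugin/lcg-info-dynamic-ce -- illustrative wrapper only
exec /opt/lcg/libexec/lcg-info-dynamic-sge /etc/sge-jobmanager/info-reporter.conf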

Information provider flow

[ Note: This information refers to version 0.5 of the information provider and may therefore be out-of-date]

Most of the information about the state of the batch system is encoded in two hash tables:

  • cluster
  • queues


  • lookupDNs(): gets the list of virtual queues from the LDIF file and creates a hash table that will contain all the queue parameters.
  • lookupGlobalPolicy(): retrieves the values of
GlueCEInfoLRMSType
GlueCEInfoLRMSVersion
GlueCEPolicyMaxTotalJobs
  • lookupClusterQueuePolicy(): retrieves the value of
GlueCEPolicyMaxWallClockTime
  • lookupVirtualQueuePolicy(): retrieves the virtual queue information from the configuration file located at /etc/sge-jobmanager/vqueues.conf. The function maps the values of h_rt to GlueCEPolicyMaxWallClockTime.


  • The provider then builds a hash keyed by virtual queue name; the associated value is another hash table containing the following keys (a sketch of this structure is shown after this list):
GlueCEInfoTotalCPUs
GlueCEPolicyMaxWallClockTime
GlueCEStateEstimatedResponseTime
GlueCEStateFreeCPUs
GlueCEStateFreeJobSlots
GlueCEStateRunningJobs
GlueCEStateStatus
GlueCEStateTotalJobs
GlueCEStateWaitingJobs
GlueCEStateWorstResponseTime
  • Finally, the GlueVOViewLocalID sub-DN needs to be filled. The assumption in this information provider is that the values are simply copied from the queue view; this should be enhanced.
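
A sketch of the resulting data structures (field names taken from the list above; all values invented for illustration) is:

# Illustrative only: the shape of the two hashes built by the provider.
my %cluster = (
    GlueCEInfoLRMSType       => 'sge',
    GlueCEInfoLRMSVersion    => '6.0',
    GlueCEPolicyMaxTotalJobs => 2000,
);

my %queues = (
    '24hr' => {
        GlueCEInfoTotalCPUs          => 64,
        GlueCEPolicyMaxWallClockTime => 1440,
        GlueCEStateFreeCPUs          => 12,
        GlueCEStateRunningJobs       => 52,
        GlueCEStateWaitingJobs       => 3,
        GlueCEStateTotalJobs         => 55,
        GlueCEStateStatus            => 'Production',
        # ... remaining GlueCEState* keys as listed above
    },
    # one entry per virtual queue defined in vqueues.conf
);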

Explanation: Virtual Queues

The information reporter is designed to advertise virtual queues. These are queues which, whilst accepted as valid by the JobManager, do not actually exist on the underlying batch system. Virtual queues are defined in terms of job constraints, e.g. maximum wallclock time, maximum memory usage, or some combination. When a job is submitted to one of these virtual queues, its metadata is updated with the properties of that virtual queue and the job is then submitted to the cluster's global job queue (or, more precisely, the cluster's global job pool -- jobs are not necessarily scheduled on a FIFO basis).

(This makes the work of the Information Reporter substantially harder, as it has to report job and CPU utilisation statistics for virtual queues defined only in terms of the properties of the individual jobs and execution hosts. Many virtual queues will actually overlap; for example, a 10-minute queue will in fact be a subset of the jobs sitting in the 24-hour queue.)

For example, at LeSC we advertise several queues, each with a different maximum wall-clock time limit: 10min, 24hr, 72hr, etc. (see GStat). When a job is submitted to the 10min virtual queue, the JobManager sets the maximum wallclock time limit to 10 minutes and then injects the job into the batch system. However, if the same job were submitted to the 72hr queue, the JobManager would set the maximum wallclock time limit to 72 hours and then add the job to the global pool.
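
In terms of the underlying batch system, the effect is roughly the following (illustrative command lines only; the real JobManager constructs the submission from the RSL and the matching vqueues.conf entry):

qsub -l h_rt=0:10:0 job_script.sh    # job submitted via the "10min" virtual queue
qsub -l h_rt=72:0:0 job_script.sh    # the same job via the "72hr" virtual queue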

The batch system is then completely free to schedule the jobs in whatever way it (or rather, the cluster manager) sees fit, based on the job's actual constraints, rather than whichever queue it happened to be submitted to. The wonderful advantage of this system is that the advertised LCG/gLite queue configuration is now completely independent of the underlying batch system configuration; the virtual queue definitions can be modified arbitrarily depending on LCG operational constraints, whilst the cluster scheduling configuration can be updated without affecting the external interfaces used by LCG jobs.

Ideally, LCG/gLite would be updated to do away with the concept of queues altogether and require that users specify job constraints explicitly.

Installation Notes

Installation notes for the LeSC cluster can be found here.