Guide to Ganga

From GridPP Wiki

Revision as of 12:58, 26 December 2014

Introduction

This is a guide to installing and configuring the Ganga Job Management tool for use with both local batch systems and the DIRAC workload management system. It's maintained by Mark Slater (mws<AT>hep.ph.bh.bham.ac.uk) - please email if you have any comments/problems!

For a general overview of the Grid, DIRAC and Ganga, please see this talk: http://epweb2.ph.bham.ac.uk/user/slater/DurhamSeminar_Oct2014.pdf

For more information and more in-depth user guides, please visit the main Ganga website: http://ganga.web.cern.ch/ganga/

Requirements

Before you start using Ganga (assuming you want to use it to submit jobs to the grid rather than just for local batch system submission), there are a few steps you need to go through:

Installation and Configuration

Here are the steps to download and configure Ganga:

  • Download the install script from the Ganga website and make it executable:
wget http://ganga.web.cern.ch/ganga/download/ganga-install
chmod +x ganga-install
  • Run the script with the external plugins you want to include to download and install Ganga at ~/Ganga. Generally, this will be the GangaDirac plugin:
./ganga-install --extern=GangaDirac LAST
  • Now run Ganga with the -g flag to create the default .gangarc file:
/home/<username>/Ganga/install/<version>/bin/ganga -g -o[Configuration]RUNTIME_PATH=GangaDirac
  • To configure Ganga to submit using your DIRAC client installation, set up the DIRAC client and export the environment to a file for Ganga to use:
source ~/dirac/bashrc
env > ~/dirac/envfile
  • Now edit your .gangarc file and set the following options:
[Configuration] RUNTIME_PATH = GangaDirac
[Dirac] DiracEnvFile = /home/<username>/dirac/envfile
[defaults_GridCommand]info = dirac-proxy-info
[defaults_GridCommand]init = dirac-proxy-init -g <dirac user group>
  • Now set up the DIRAC client (as you should before running Ganga if you want to use it) and then run Ganga. It should ask you to generate a proxy and then leave you at the IPython prompt:
source ~/dirac/bashrc
/home/<username>/Ganga/install/<version>/bin/ganga
  • To test that all is working, try to submit a basic job to the local machine you're running on and then to DIRAC:
Job().submit()
Job( backend=Dirac() ).submit()
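
As a quick sanity check of the .gangarc edits above, the following minimal sketch uses Python's standard configparser module to report which of the listed options are missing from a .gangarc-style text. It assumes the file is plain INI syntax that configparser can read; the helper name missing_options and the sample text are hypothetical and not part of Ganga.

```python
import configparser

# Options the steps above ask you to set, as (section, option) pairs.
REQUIRED = [
    ("Configuration", "RUNTIME_PATH"),
    ("Dirac", "DiracEnvFile"),
    ("defaults_GridCommand", "info"),
    ("defaults_GridCommand", "init"),
]

def missing_options(text, required=REQUIRED):
    """Return the (section, option) pairs that are absent from INI text."""
    # interpolation=None so that any '%' in a value is read literally
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep option names case-sensitive
    parser.read_string(text)
    return [(s, o) for s, o in required if not parser.has_option(s, o)]

# Hypothetical sample; the <username> placeholder is left unfilled on purpose
# and the 'init' option is deliberately omitted.
sample = """
[Configuration]
RUNTIME_PATH = GangaDirac

[Dirac]
DiracEnvFile = /home/<username>/dirac/envfile

[defaults_GridCommand]
info = dirac-proxy-info
"""

print(missing_options(sample))
```

To check a real file, read its contents into a string and pass that to missing_options instead of the sample.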

Getting Started

Ganga is a general job management tool to help with the submission, monitoring and manipulation of jobs on different systems. It is based on the idea of plugins that tell a Job what to run (Application), where to run (Backend), how to run (Splitter and PostProcessor) and what data to use (InputFiles and OutputFiles). It is written almost entirely in Python, and either the modified IPython prompt or scripts can be used to control it.

To start, we'll submit a default job that will go to the 'Local' backend (i.e. the machine you are using at present). Start Ganga as above and then enter the following:

j = Job()
j.submit()

You should (almost immediately) have the job submit, start running and then complete. By default, the stdout/err are copied back with your job and stored in the Ganga workspace. To view them, you can use the following:

j.peek("stdout", "emacs")    # open any file in the j.outputdir with the given command
!emacs $j.outputdir/stdout   # use '!' to run a shell command and '$' to expand a Python variable

This default job object uses the 'Executable' application with the exe set to 'echo' and the arguments set to 'Hello World'. To run your own scripts, do the following:

j = Job()
j.application = Executable()
j.application.exe = '/path/to/script'
j.application.args = [ ... ]
j.submit()

To view the jobs that you have created, use the 'jobs' command. This gives a list of the job objects along with their status. You can also use this to access the jobs themselves and view all the information about them, e.g.

jobs
j = jobs(0)    # grab the job object with ID 0
j              # view the object
j.application
j.backend

To get more information about the different objects and plugins, use the 'help' system:

help()
help(Job)
help(Executable)
plugins("applications")
plugins("backends")


Input and Output Data

Splitting into Subjobs

Quite often, you need to run the same job but with different arguments or input data, or wish to take advantage of a backend's bulk submission capabilities. The Splitter in Ganga is what you use to achieve this. In GangaCore, there is a single splitter that should serve most purposes: the GenericSplitter. A simple example of its use is shown below:

j = Job()
j.splitter = GenericSplitter()
j.splitter.attribute = "application.args"
j.splitter.args = [ 'arg 1', 'arg 2', 'arg 3' ]
j.submit()

This will create one master job with 3 subjobs, each with a different argument. To view these subjobs, do:

j.subjobs
j.subjobs(0)

Note that these will have been submitted in bulk if the backend supports it (Dirac does not at time of writing).

As a second example, if you want to submit several subjobs but change multiple parameters for each subjob, you can also do this with the GenericSplitter:

j = Job()
j.splitter = GenericSplitter()
j.splitter.multi_args = { "application.args":["hello1", "hello2"], "application.env":[{"MYENV":"test1"}, {"MYENV":"test2"}] }

The multi_args field takes a dictionary whose keys are the names of the parameters you want to change and whose values are lists giving what each parameter should be set to for each subjob. In the example above, 2 subjobs are created, the first with:

application.args = "hello1"
application.env = {"MYENV":"test1"}

and the second with:

application.args = "hello2"
application.env = {"MYENV":"test2"}

For both of these examples, you can split on any property of the job, e.g. inputfiles, backend.requirements, etc.
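
The multi_args mapping described above can be pictured in plain Python. This is only an illustration of the described behaviour, not Ganga's implementation: the function name expand_multi_args is hypothetical, and the check that all lists have the same length is an assumption of this sketch.

```python
def expand_multi_args(multi_args):
    """Turn {parameter: [value per subjob]} into one settings dict per subjob."""
    lengths = {len(values) for values in multi_args.values()}
    if len(lengths) > 1:
        # Assumption of this sketch: every parameter lists one value per subjob.
        raise ValueError("all value lists must have the same length")
    count = lengths.pop() if lengths else 0
    return [
        {name: values[i] for name, values in multi_args.items()}
        for i in range(count)
    ]

multi_args = {
    "application.args": ["hello1", "hello2"],
    "application.env": [{"MYENV": "test1"}, {"MYENV": "test2"}],
}

# Print the settings each subjob would receive, in subjob order.
for index, settings in enumerate(expand_multi_args(multi_args)):
    print(index, settings)
```

Each dictionary in the result corresponds to one subjob, pairing the first entries of every list for the first subjob, the second entries for the second, and so on.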

Submitting to Different Backends

One of the main benefits of Ganga is that you can submit to different backends with very little change to your submission scripts. For example, above we were submitting to the Dirac backend but if you wanted to submit to a local PBS batch system, you only need to change the backend line:

j.backend = PBS()

All input and output data will be handled by Ganga. The supported Core backends include Local, LSF, PBS, Condor, LCG and Dirac. Note that some requirements will be backend dependent so do check the associated requirements object, e.g.

j.backend.requirements

to see what options are available.


Using Queues to Speed Up Submission

When submitting to some backends, DIRAC included, it can take some time to go through the whole submission process. When you have tens to thousands of jobs to submit, this can become a significant problem. You can greatly speed things up by using the Ganga queues system to submit your jobs in parallel, e.g.:

for i in range(0, 10):
   j = Job( backend = Dirac() )
   queues.add(j.submit)

You can view the threads Ganga knows about by using the 'queues' command. To configure the number of worker threads used by the queues, set the following option in your .gangarc file:

[DIRAC] NumWorkerThreads

You can add any function call to the queues system to run in the background. To get more info, use help(queues).
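
To see why handing submissions to worker threads helps, here is a minimal sketch using Python's standard concurrent.futures module. It is an analogy only, not Ganga's queues implementation: fake_submit is a hypothetical stand-in for a slow call such as j.submit, and the pool size of 5 is an arbitrary choice.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_submit(job_id, delay=0.2):
    """Hypothetical stand-in for a slow submission call."""
    time.sleep(delay)  # pretend to wait on the backend
    return job_id

def submit_all(job_ids, workers=5):
    """Run fake_submit for every job on a pool of worker threads."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() hands the calls to the workers and keeps the input order
        return list(pool.map(fake_submit, job_ids))

start = time.time()
done = submit_all(range(10))
print(done, "in %.1fs" % (time.time() - start))
```

Because the waits overlap across the workers, the ten calls should finish in roughly the time of two sequential calls rather than ten.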

Using Tasks for Automated Submission and Job Chaining

Using Ganga as a Service

FAQ

Submission seems quite slow - how can I speed it up?

During submission, Ganga is simply acting as a frontend/wrapper for whatever underlying submission command the backend uses. Consequently, if that command takes a while to submit, Ganga can't magically speed it up. However, there are a couple of things you can do to alleviate the problem:

  • Use queues to submit several jobs in parallel (see "Using Queues to Speed Up Submission" above)
  • Use a splitter to take advantage of bulk submission (note that the backend needs to support this, which at present is really only LCG/WMS)