How can I use TORQUE, and possibly Maui, instead of OpenPBS?

The version of OpenPBS with the FIFO scheduler shipped in LCG has two significant problems for serious production use.

  1. If a batch worker crashes while it is running jobs, pbs_server hangs until the batch worker returns. While the server is locked up trying to check the status of the job, no other jobs can be submitted and the farm slowly drains to nothing. In LCG the information provider scripts also hang on qstat (this may be fixed now).
  2. The amount of scheduling you can do is very limited. You can basically say that any group can run a maximum number of jobs, or that any user can run a maximum number of jobs, and that is about it.

The first problem is solved by using TORQUE; the second is solved by using Maui.

Before continuing any further, your farm must be disabled and completely drained of jobs, because the database format for jobs has changed.
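
One way to drain the farm is to stop the queues accepting new work and then wait for the job count to reach zero. The following is a minimal sketch rather than a tested procedure: the queue names in QUEUES are hypothetical placeholders for your site's queues, and the job count assumes the default qstat listing, which has two header lines and prints nothing when there are no jobs.

```shell
#!/bin/sh
# Sketch only: QUEUES is a hypothetical list; substitute your site's queue names.
QUEUES="${QUEUES:-short long}"

# Stop the queues accepting new jobs. Jobs already queued or running
# are left alone and continue to completion.
disable_queues() {
    for q in $QUEUES; do
        qmgr -c "set queue $q enabled = False"
    done
}

# Print the number of jobs the server still knows about, by skipping
# the two header lines of the default qstat listing.
jobs_remaining() {
    qstat | awk 'NR > 2' | wc -l | tr -d ' '
}
```

Run `disable_queues` once on the CE, then poll `jobs_remaining` until it prints 0 before touching any packages.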

  1. Make a copy of your existing queue information.
    qmgr -c 'p s' > /root/qmgr.conf
  2. Grab the packages: You need maui-*i386.rpm and torque-*i386.rpm from Maui RPMS and Torque RPMS or /afs/rl.ac.uk/user/t/traylens/public_html/rpms.

    Source packages are also available there, and their contents can be browsed in CVS for Maui and TORQUE.

  3. Set up a system to allow you to remove and add packages to your current set of RPMS.
  4. The CE must have openpbs packages removed and torque ones added.
       -openpbs-*-*
       -openpbs-sched-*-*
       -openpbs-mom-*-*
       -openpbs-server-*-*
     
     
       torque-1.0.1p6-4.rh7.3.st
       /* You can use the torque FIFO scheduler instead of 
          maui below if you only want to fix problem 1. */
       /*torque-scheduler.cc.fifo-1.0.1p6-4.rh7.3.st */
       torque-server-1.0.1p6-4.rh7.3.st
       torque-clients-1.0.1p6-4.rh7.3.st
       tclx-8.3-67
     
       maui-3.2.6p6-2_rh73
       maui-client-3.2.6p6-2_rh73
     
  5. The WN must have openpbs packages removed and torque ones added.
        -openpbs-mom-*-*
        -openpbs-*-*
                                                                                    
        torque-1.0.1p6-2.rh7.3.st
        torque-resmom-1.0.1p6-2.rh7.3.st
        torque-clients-1.0.1p6-2.rh7.3.st
        tclx-8.3-67
        
  6. tclx is required by torque-clients. It is available from a standard RH7.3 distribution.
  7. Add the following line to the profile of every WN:
    +pbsexechost.clients PBS_MASTER localhost
    The default is annoyingly the wrong way around.
  8. Compile your profiles and check that these RPMs have been installed.
  9. On the CE do two things.
           # rm /var/spool/pbs/server_priv/serverdb
           # /sbin/chkconfig maui on
          
  10. Reboot all your WNs and the CE.
  11. Restore your old queue information.
    qmgr < /root/qmgr.conf
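
The package and queue checks in the steps above can be scripted. This is a hedged sketch: the function names are invented for illustration, and it assumes the queue dump was saved to /root/qmgr.conf as in step 1. Some differences in server attributes may be expected after the switch, so treat the diff as a prompt for inspection rather than a pass/fail test.

```shell
#!/bin/sh
# List any OpenPBS packages still installed.
# Prints nothing when the swap to TORQUE is complete.
leftover_openpbs() {
    rpm -qa | grep '^openpbs' || true
}

# Succeed only if no OpenPBS package remains and the core
# torque package is installed.
check_rpm_swap() {
    [ -z "$(leftover_openpbs)" ] && rpm -q torque >/dev/null 2>&1
}

# Compare the live server and queue configuration with the saved dump.
# The optional first argument overrides the default dump location.
queue_config_diff() {
    qmgr -c 'p s' | diff - "${1:-/root/qmgr.conf}"
}
```

Run `check_rpm_swap` on the CE and on each WN after the profiles have been applied, and `queue_config_diff` on the CE after step 11.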

TORQUE and Maui should now be running. To confirm, try submitting a job to the jobmanager-lcgpbs jobmanager.
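
If the jobmanager test fails, it can help to check the batch system on its own first. The sketch below is an illustration with invented function names: it counts the worker nodes that pbs_server reports as unavailable and submits a trivial job directly with qsub, bypassing the Globus jobmanager. It assumes the usual `pbsnodes -a` layout with one `state = ...` line per node.

```shell
#!/bin/sh
# Print the number of worker nodes the server reports as down,
# offline or in an unknown state.
nodes_unavailable() {
    pbsnodes -a | awk '$1 == "state" && $3 ~ /down|offline|unknown/ { n++ } END { print n + 0 }'
}

# Submit a one-line job script on standard input and print the job id
# that qsub returns. Run this as an ordinary user; pbs_server normally
# refuses jobs submitted by root.
submit_probe() {
    echo "/bin/hostname" | qsub
}
```

After `submit_probe`, `qstat` shows the job from the TORQUE side and the Maui client command `showq` shows it from the scheduler side.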

The configuration for Maui is in /var/spool/maui/maui.cfg. This is where you change your scheduling parameters, and you should definitely read the Maui documentation before editing this file.
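
Since a bad edit to maui.cfg can stop jobs being scheduled, it is worth keeping a copy before each change. A minimal sketch follows; the function names are invented, and the restart helper assumes that the init script enabled earlier with chkconfig accepts a `restart` argument, which should be checked on your system.

```shell
#!/bin/sh
MAUI_CFG="${MAUI_CFG:-/var/spool/maui/maui.cfg}"

# Save a timestamped copy of the current configuration next to the
# original and print the path of the copy.
backup_maui_cfg() {
    backup="$MAUI_CFG.$(date +%Y%m%d%H%M%S)"
    cp -p "$MAUI_CFG" "$backup" && echo "$backup"
}

# Assumption: the maui init script supports "restart".
restart_maui() {
    /sbin/service maui restart
}
```

Call `backup_maui_cfg` before editing and `restart_maui` afterwards so that the scheduler picks up the new parameters.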

Savannah bug #2667 is also fixed by these RPMs.

This is the current maui.cfg at RAL.



Last modified Mon 21 April 2008