== Dynamic Partitioning with Nikhef scripts ==

This method relies on a custom Python script, written by Jeff Templon, to create dynamic partitions in Torque. It needs a specific queue to be created for multicore (mcore) jobs. It also requires changing the (so far) standard properties of the nodes, as it uses the resources_default.neednodes queue attribute to create the partitions.

=== Scripts Installation ===

You can download the scripts from the Nikhef SVN repository. There is the main script ''mcfloat'' and three Python modules to install. I've opted to install everything under /usr/local. The script also requires a $HOME/tmp directory for a file it writes.

==== Python Modules ====

 mkdir -p $HOME/tmp
 mkdir -p /usr/local/lib/python2.6/site-packages
 cd /usr/local/lib/python2.6/site-packages
 wget https://ndpfsvn.nikhef.nl/cgi-bin/viewvc.cgi/pdpsoft/trunk/nl.nikhef.ndpf.tools/pjobstats/"torqueJobs.py?revision=2698" -O torqueJobs.py
 wget https://ndpfsvn.nikhef.nl/cgi-bin/viewvc.cgi/pdpsoft/trunk/nl.nikhef.ndpf.tools/pjobstats/"torqueAttMappers.py?revision=2526" -O torqueAttMappers.py
 wget https://ndpfsvn.nikhef.nl/cgi-bin/viewvc.cgi/pdpsoft/nl.nikhef.pdp.dynsched-pbs-plugin/trunk/torque_utils.py?revision=2028 -O torque_utils.py

 cd /usr/local/bin
 wget https://ndpfsvn.nikhef.nl/cgi-bin/viewvc.cgi/pdpsoft/nl.nikhef.ndpf.mcfloat/trunk/"mcfloat?revision=2708" -O mcfloat

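The wget above does not set the executable bit, so you will most likely also want to make the script executable:

 chmod +x /usr/local/bin/mcfloat
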
==== torque_utils.py ====

With Python 2.6, torque_utils.py gives a syntax error, because a from __future__ import must appear before any other statement in the module. You need to move the line

 from __future__ import generators # only needed in Python 2.2

above the version line, or simply remove it, as it is apparently only needed in Python 2.2.

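For illustration only, after the edit the top of the file should be ordered roughly like this (the version line shown is just a placeholder, not the real content of torque_utils.py):

 from __future__ import generators  # must precede any other statement; only needed in Python 2.2
 version = "x.y"                    # placeholder for the module's version line
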
==== mcfloat ====

You also need to edit mcfloat itself to set four things.

The Torque server:

 TORQUE = "<torque-server-fqdn>"

The initial set of worker nodes (WN) to use; the default list comprehension expands to node-001.domain through node-018.domain:

 CANDIDATE_NODES = [ 'node-0%02d.domain' % (n) for n in range(1,19) ]

The queue name: you can leave the default or replace it. I've replaced it with a less experiment-oriented name; it should match the multicore queue you create below (mcore in this example). The default in the script is

 MCQUEUE = 'atlasmc'

My nodes don't have sequential names, so I replaced the elegant for loop that builds the array with a plainer comma-separated list of node names, as sketched below. Do not forget the quotes around the names.

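A minimal sketch of that hand-written form, with hypothetical hostnames standing in for real ones:

 CANDIDATE_NODES = [ 'wn-alpha.example.domain', 'wn-bravo.example.domain',
                     'wn-charlie.example.domain', 'wn-delta.example.domain' ]
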
Finally, the drain and free-slot limits:

 MAXDRAIN = 7 # max num of nodes allowed to drain
 MAXFREE = 49 # max num of free slots to tolerate

These depend on the size of your nodes and cluster, so you may want to experiment with them. I reduced MAXDRAIN to 4, for example, but that choice will probably be reviewed.

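As a purely illustrative back-of-the-envelope check (assuming eight-core worker nodes, which is an assumption and not something set in the script):

 # hypothetical sizing check, not part of mcfloat:
 # with 8-core WN, MAXFREE = 49 tolerates roughly six nodes' worth of idle slots
 CORES_PER_NODE = 8
 MAXFREE = 49
 print MAXFREE / float(CORES_PER_NODE)    # 6.125
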
==== qmgr commands and node properties ====

Now you need to create the queue for multicore jobs. The partitioning relies on the node properties in /var/lib/torque/server_priv/nodes. If you are still using YAIM they are usually set to lcgpro. The mcfloat script uses the property ''el6'' for nodes meant for single-core jobs and ''mc'' for nodes meant for multicore jobs; if you want to use something else you need to edit the mcfloat script. I've opted for a simple sed command:

 sed -i.old 's/lcgpro/el6/g' /var/lib/torque/server_priv/nodes

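After this substitution an entry in the nodes file would look something like the following (hostname and core count are hypothetical; the general format is "hostname np=N property ..."):

 node-001.domain np=8 el6

and a node handed over to the multicore partition by mcfloat would instead carry the mc property:

 node-001.domain np=8 mc
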
When creating the new queue you need to limit access to the groups that will run multicore jobs. For the moment at my site that is only ATLAS production, but some sites may need to add CMS too.

 qmgr
 create queue mcore
 set queue mcore queue_type = Execution
 set queue mcore resources_max.cput = 48:00:00
 set queue mcore resources_max.walltime = 72:00:00
 set queue mcore resources_default.cput = 48:00:00
 set queue mcore resources_default.neednodes = mc
 set queue mcore resources_default.walltime = 72:00:00
 set queue mcore acl_group_enable = True
 set queue mcore acl_groups = atlprd
 set queue mcore acl_groups += cmsprd
 set queue mcore enabled = True
 set queue mcore started = True

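To check that the queue came out as intended you can print its definition back, for example:

 qmgr -c "print queue mcore"
 qstat -Q mcore
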
You also need to set resources_default.neednodes on the other queues for the partitioning to work. For example, at my site the other main queue would be set to:

 set queue long resources_default.neednodes = el6

Other queues have their own parameters.

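The same attribute can also be applied to any remaining single-core queue directly from the shell, without entering the interactive qmgr prompt; for example, for a hypothetical queue called short:

 qmgr -c "set queue short resources_default.neednodes = el6"
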
[[Category:Multicore]]