Imperial Condor Log
From GridPP Wiki
Latest revision as of 14:44, 9 June 2016
Plain condor with GPU node as WN
- Install plain condor on cetest03 and lt2gpu00
wget https://research.cs.wisc.edu/htcondor/yum/repo.d/htcondor-stable-rhel6.repo
rpm --import http://research.cs.wisc.edu/htcondor/yum/RPM-GPG-KEY-HTCondor
yum install condor
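(The repo file needs to end up where yum will look for it; a minimal sketch of the same steps, assuming the standard /etc/yum.repos.d/ location:)
# assumption: put the repo file in yum's standard repo directory first
cd /etc/yum.repos.d/
wget https://research.cs.wisc.edu/htcondor/yum/repo.d/htcondor-stable-rhel6.repo
rpm --import http://research.cs.wisc.edu/htcondor/yum/RPM-GPG-KEY-HTCondor
yum install condor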
- Open the relevant ports on both machines (wiki not secret enough to list here, I think)
- Make some users (same uid/gid on both machines).
(Because I am no good at remembering options, here are two samples:)
useradd -m -d /srv/localstage/user004 user004
useradd -m -d /srv/localstage/user002 -g user002 -u 502 user002
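(A quick sanity check that the uid/gid really do match, not part of the original log: run this on both machines and compare the output.)
id user002
id user004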
- All configurations go in /etc/condor/condor_config.local.
We'll try to keep the configurations on both nodes as identical as possible, even if not every option is needed by every node.
- After changing the configuration, condor needs to be restarted to reload the config file:
service condor restart
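(For changes that only touch condor_config.local a reload is usually enough; a full restart is mainly needed when DAEMON_LIST changes. A sketch, assuming the standard condor_reconfig behaviour:)
condor_reconfig         # re-read the config files without stopping running daemons
service condor restart  # the safer option after adding or removing daemons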
- These basic config files work:
On cetest03:
CONDOR_HOST = cetest03.grid.hep.ph.ic.ac.uk
# this makes it a scheduler
DAEMON_LIST = COLLECTOR, MASTER, NEGOTIATOR, SCHEDD
SEC_PASSWORD_FILE = /etc/condor/pool_password
ALLOW_WRITE = *.grid.hep.ph.ic.ac.uk
UID_DOMAIN = cetest03.grid.hep.ph.ic.ac.uk
use feature : GPUs
# stop the emails
SCHEDD_RESTART_REPORT =
On lt2gpu00:
CONDOR_HOST = cetest03.grid.hep.ph.ic.ac.uk
# this makes it a WN
DAEMON_LIST = MASTER, STARTD
# get server and WN to talk to each other
SEC_PASSWORD_FILE = /etc/condor/pool_password
ALLOW_WRITE = *.grid.hep.ph.ic.ac.uk
# I don't want to be nobody: keep same user name throughout
UID_DOMAIN = cetest03.grid.hep.ph.ic.ac.uk
use feature : GPUs
SCHEDD_RESTART_REPORT =
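(Neither config creates the pool_password file named in SEC_PASSWORD_FILE; one way to create it on each node, assuming PASSWORD authentication and the stock condor_store_cred tool, with the same password on both machines:)
condor_store_cred add -c              # -c = pool password; prompts and stores it at SEC_PASSWORD_FILE
chmod 600 /etc/condor/pool_password   # keep it readable by root only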
- To see what's going on:
- On cetest03: condor_q, condor_status -long
[user001@cetest03 ~]$ condor_q

-- Schedd: cetest03.grid.hep.ph.ic.ac.uk : <146.179.247.252:15363?...
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
  19.0   user001         6/6  14:42   0+00:00:02 R  0   0.0  hello_world.sh

1 jobs; 0 completed, 0 removed, 0 idle, 1 running, 0 held, 0 suspended
- On lt2gpu00:
[root@lt2gpu00 ~]# condor_status
Name               OpSys      Arch   State     Activity LoadAv Mem   ActvtyTime

slot10@lt2gpu00.gr LINUX      X86_64 Unclaimed Idle      0.000 4029  0+00:00:26
slot11@lt2gpu00.gr LINUX      X86_64 Unclaimed Idle      0.000 4029  0+00:00:27
[snip]
- Another helpful command to see what's going on: condor_config_val -dump
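(For example, to narrow the dump down to the bits that matter here; illustrative only:)
condor_config_val -dump | grep -i gpu   # just the GPU-related settings
condor_config_val DAEMON_LIST           # check which daemons this node runs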
- Submit a test job (as user001 on cetest03)
[user001@cetest03 ~]$ cat test.submit
Universe     = vanilla
Executable   = hello_world.sh
input        = /dev/null
output       = hello.out.$(Cluster)
error        = hello.error.$(ClusterId)
request_GPUs = 1
Queue
[user001@cetest03 ~]$ cat hello_world.sh
#!/bin/bash
echo "Hello World"
echo "Today is: " `date`
echo "I am running on: " `hostname`
echo "I am " `whoami`
env | sort
echo "+++++++++++++++++++++++++++++++++++"
/srv/localstage/sf105/samples/NVIDIA_CUDA-7.5_Samples/bin/x86_64/linux/release/deviceQuery
sleep 30
The two different GPUs can be distinguished by their Bus IDs: 4 and 10. To submit the job, do:
condor_submit test.submit
For non-GPU jobs set request_GPUs = 0
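(To follow the job and check which GPU it actually saw, something like the following works; hypothetical commands, with cluster id 19 taken from the condor_q output above:)
condor_q -run                # shows which machine/slot the job landed on
cat hello.out.19             # stdout of cluster 19
grep -i "bus" hello.out.19   # deviceQuery prints the Bus ID of the GPU it used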
Only allow jobs that use a GPU
Add the following on lt2gpu00:
SLOT_TYPE_1 = cpus=25%
SLOT_TYPE_1_PARTITIONABLE = False
NUM_SLOTS_TYPE_1 = 4
START = RequestGpus > 0
Jobs that don't request a GPU will not start.
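(After restarting condor on lt2gpu00 the four static slots should show up with the GPU-only START expression; a quick check, where the attribute names are an assumption but ClassAd lookups are case-insensitive:)
condor_status -af Name Cpus Gpus Start   # one line per slot: resources plus its START expression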