Difference between revisions of "Imperial Condor Log"

Latest revision as of 14:44, 9 June 2016

Plain condor with GPU node as WN

Install plain condor on cetest03 and lt2gpu00

wget -P /etc/yum.repos.d/ https://research.cs.wisc.edu/htcondor/yum/repo.d/htcondor-stable-rhel6.repo
rpm --import http://research.cs.wisc.edu/htcondor/yum/RPM-GPG-KEY-HTCondor
yum install condor

Open the relevant ports on both machines (wiki not secret enough to list here, I think)
Make some users (same uid/gid on both machines).
(Because I am no good at remembering options, here are two samples:)

   useradd -m -d /srv/localstage/user004 user004
   useradd -m -d /srv/localstage/user002 -g user002 -u 502 user002
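
The second sample only works if the group user002 already exists, because useradd -g refers to an existing group. A minimal sketch of a helper that prints the matching groupadd and useradd lines, so that the same commands can be run on both machines. The function name make_user_cmds and the choice of gid equal to uid are assumptions, not part of the original setup:

```shell
#!/bin/bash
# Hypothetical helper (not from the original log): print the commands that
# create one account with a fixed uid and a same-numbered gid.
make_user_cmds() {
    name="$1"
    id="$2"
    # useradd -g needs the group to exist already, so create it first.
    echo "groupadd -g ${id} ${name}"
    echo "useradd -m -d /srv/localstage/${name} -g ${name} -u ${id} ${name}"
}

# Print the commands for user002; run the output as root on both machines.
make_user_cmds user002 502
```

Printing the commands rather than executing them keeps the helper harmless to try out and makes it easy to check that both nodes get exactly the same uid and gid.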

All configurations go in /etc/condor/condor_config.local.
We'll try to keep the configurations on both nodes as close to identical as possible, even if not every option is needed by every node.
After changing the configuration, condor needs to be restarted to reload the config file:

  service condor restart

These basic config files work:

On cetest03:

  CONDOR_HOST = cetest03.grid.hep.ph.ic.ac.uk
  # this makes it a scheduler
  DAEMON_LIST = COLLECTOR, MASTER, NEGOTIATOR, SCHEDD
  SEC_PASSWORD_FILE = /etc/condor/pool_password
  ALLOW_WRITE = *.grid.hep.ph.ic.ac.uk
  UID_DOMAIN = cetest03.grid.hep.ph.ic.ac.uk
  use feature : GPUs
  # stop the emails
  SCHEDD_RESTART_REPORT =

On lt2gpu00:

  CONDOR_HOST = cetest03.grid.hep.ph.ic.ac.uk
  # this makes it a WN
  DAEMON_LIST = MASTER, STARTD
  # get server and WN to talk to each other
  SEC_PASSWORD_FILE = /etc/condor/pool_password
  ALLOW_WRITE = *.grid.hep.ph.ic.ac.uk
  # I don't want to be nobody: keep same user name throughout
  UID_DOMAIN = cetest03.grid.hep.ph.ic.ac.uk
  use feature : GPUs
  SCHEDD_RESTART_REPORT =
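
Since the two files are meant to differ as little as possible, a quick comparison can confirm that nothing else has drifted. A minimal sketch, assuming a copy of the other node's condor_config.local has been placed somewhere on the local machine; the function names config_settings and config_diff are made up for illustration:

```shell
#!/bin/bash
# Hypothetical helpers (not from the original log).

# Print the settings of one config file: drop comment lines and blank lines,
# trim surrounding whitespace, and sort so that line order does not matter.
config_settings() {
    grep -v '^[[:space:]]*#' "$1" \
        | sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' \
        | grep -v '^$' \
        | sort
}

# Show only the settings that differ between two config files.
# Exit status follows diff: 0 if identical, 1 if they differ.
config_diff() {
    tmp_a="$(mktemp)"
    tmp_b="$(mktemp)"
    config_settings "$1" > "$tmp_a"
    config_settings "$2" > "$tmp_b"
    diff "$tmp_a" "$tmp_b"
    status=$?
    rm -f "$tmp_a" "$tmp_b"
    return "$status"
}
```

For the two files above, config_diff should report only the DAEMON_LIST lines, since comments are ignored and every other setting is the same on both nodes.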

To see what's going on:
- On cetest03: condor_q, condor_status -long

[user001@cetest03 ~]$ condor_q
-- Schedd: cetest03.grid.hep.ph.ic.ac.uk : <146.179.247.252:15363?...
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
  19.0   user001         6/6  14:42   0+00:00:02 R  0   0.0  hello_world.sh

1 jobs; 0 completed, 0 removed, 0 idle, 1 running, 0 held, 0 suspended

- On lt2gpu00: condor_status

[root@lt2gpu00 ~]# condor_status 
  Name               OpSys      Arch   State     Activity LoadAv Mem   ActvtyTime

  slot10@lt2gpu00.gr LINUX      X86_64 Unclaimed Idle      0.000 4029  0+00:00:26
  slot11@lt2gpu00.gr LINUX      X86_64 Unclaimed Idle      0.000 4029  0+00:00:27
  [snip]

- Another helpful command to see what's going on: condor_config_val -dump (prints every configuration variable with its current value)
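
With many slots the condor_status listing gets long, so a per-state count can be easier to read. A rough sketch, assuming the column layout of the listing above (slot name first, State in the fourth column); count_slot_states is a made-up name:

```shell
#!/bin/bash
# Hypothetical helper (not from the original log): read condor_status output
# on stdin and print how many slots are in each State (4th column).
count_slot_states() {
    awk '$1 ~ /^slot/ { n[$4]++ } END { for (s in n) print s, n[s] }'
}

# Sample input: the two slot lines from the listing above.
printf '%s\n' \
    'slot10@lt2gpu00.gr LINUX      X86_64 Unclaimed Idle      0.000 4029  0+00:00:26' \
    'slot11@lt2gpu00.gr LINUX      X86_64 Unclaimed Idle      0.000 4029  0+00:00:27' \
    | count_slot_states
```

For the two sample lines this prints "Unclaimed 2". On a live pool the same function can be fed directly: condor_status | count_slot_states. Header and summary lines are skipped because they do not start with "slot".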

Submit a test job (as user001 on cetest03)

[user001@cetest03 ~]$ cat test.submit

Universe       = vanilla
Executable     = hello_world.sh  

input   = /dev/null
output  = hello.out.$(ClusterId)
error   = hello.error.$(ClusterId)

request_GPUs = 1
Queue

[user001@cetest03 ~]$ cat hello_world.sh
#!/bin/bash

echo "Hello World"
echo "Today is: " `date`
echo "I am running on: " `hostname`
echo "I am " `whoami`

env | sort

echo "+++++++++++++++++++++++++++++++++++"

/srv/localstage/sf105/samples/NVIDIA_CUDA-7.5_Samples/bin/x86_64/linux/release/deviceQuery

sleep 30

In the deviceQuery output, the two GPUs can be distinguished by their Bus ID: 4 and 10. To submit the job:

condor_submit test.submit

For non-GPU jobs set request_GPUs = 0
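
To switch between the GPU and non-GPU variants without editing the file by hand, the submit description can be generated. A minimal sketch; the function name make_submit and the idea of generating the file are assumptions, and the content mirrors test.submit above:

```shell
#!/bin/bash
# Hypothetical helper (not from the original log): print the test submit
# description with the requested number of GPUs (default 1).
make_submit() {
    gpus="${1:-1}"
    cat <<EOF
Universe       = vanilla
Executable     = hello_world.sh

input   = /dev/null
output  = hello.out.\$(ClusterId)
error   = hello.error.\$(ClusterId)

request_GPUs = ${gpus}
Queue
EOF
}

# Print the non-GPU variant; redirect to a file to use it with condor_submit.
make_submit 0
```

The backslash in front of $(ClusterId) stops the shell from treating it as a command substitution, so the macro reaches condor_submit unchanged.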

Only allow jobs that use a GPU

Add the following on lt2gpu00:

SLOT_TYPE_1 = cpus=25%
SLOT_TYPE_1_PARTITIONABLE = False
NUM_SLOTS_TYPE_1 = 4
START = RequestGpus > 0

Jobs that don't request a GPU will not start.
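
A job refused by this START expression is not rejected at submission; it stays idle in the queue. A rough sketch for spotting that, assuming condor_q prints its summary line in the format shown earlier on this page (other HTCondor versions may format it differently); idle_jobs is a made-up name:

```shell
#!/bin/bash
# Hypothetical helper (not from the original log): read condor_q output on
# stdin and print the number of idle jobs from the summary line.
idle_jobs() {
    awk '/jobs;/ { for (i = 2; i <= NF; i++) if ($i ~ /^idle/) print $(i - 1) }'
}

# Sample input: the summary line from the condor_q listing on this page.
echo '1 jobs; 0 completed, 0 removed, 0 idle, 1 running, 0 held, 0 suspended' | idle_jobs
```

On a live scheduler this would be used as condor_q | idle_jobs. If a job with request_GPUs = 0 stays idle, condor_q -better-analyze with the job ID should explain why no slot accepts it.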
