Difference between revisions of "Imperial Condor Log"

From GridPP Wiki

Latest revision as of 14:44, 9 June 2016

Plain condor with GPU node as WN

  • Install plain condor on cetest03 and lt2gpu00
wget https://research.cs.wisc.edu/htcondor/yum/repo.d/htcondor-stable-rhel6.repo
rpm --import http://research.cs.wisc.edu/htcondor/yum/RPM-GPG-KEY-HTCondor
yum install condor
  • Open the relevant ports on both machines (wiki not secret enough to list here, I think)
  • Make some users (same uid/gid on both machines).
    (Because I am no good at remembering options, here are two samples:)
   useradd -m -d /srv/localstage/user004 user004
   useradd -m -d /srv/localstage/user002 -g user002 -u 502 user002  
  • All configurations go in /etc/condor/condor_config.local.
    We'll try and keep the configurations as identical on both nodes as possible, even if not every option is needed by every node.
  • After changing the configuration, condor needs to be restarted to reload the config file:
  service condor restart 
  • These basic config files work:

On cetest03:

  CONDOR_HOST = cetest03.grid.hep.ph.ic.ac.uk
  # this makes it a scheduler
  DAEMON_LIST = COLLECTOR, MASTER, NEGOTIATOR, SCHEDD
  SEC_PASSWORD_FILE = /etc/condor/pool_password
  ALLOW_WRITE = *.grid.hep.ph.ic.ac.uk
  UID_DOMAIN = cetest03.grid.hep.ph.ic.ac.uk
  use feature : GPUs
  # stop the emails
  SCHEDD_RESTART_REPORT = 
  

On lt2gpu00:
  CONDOR_HOST = cetest03.grid.hep.ph.ic.ac.uk
  # this makes it a WN
  DAEMON_LIST = MASTER, STARTD
  # get server and WN to talk to each other
  SEC_PASSWORD_FILE = /etc/condor/pool_password
  ALLOW_WRITE = *.grid.hep.ph.ic.ac.uk
  # I don't want to be nobody: keep same user name throughout
  UID_DOMAIN = cetest03.grid.hep.ph.ic.ac.uk
  use feature : GPUs
  SCHEDD_RESTART_REPORT = 
  • To see what's going on:
    • On cetest03: condor_q, condor_status -long
[user001@cetest03 ~]$ condor_q
-- Schedd: cetest03.grid.hep.ph.ic.ac.uk : <146.179.247.252:15363?...
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
  19.0   user001         6/6  14:42   0+00:00:02 R  0   0.0  hello_world.sh

1 jobs; 0 completed, 0 removed, 0 idle, 1 running, 0 held, 0 suspended
    • (I know there's a double dot, but I can't make it go away without side effects) On lt2gpu00:
[root@lt2gpu00 ~]# condor_status 
  Name               OpSys      Arch   State     Activity LoadAv Mem   ActvtyTime

  slot10@lt2gpu00.gr LINUX      X86_64 Unclaimed Idle      0.000 4029  0+00:00:26
  slot11@lt2gpu00.gr LINUX      X86_64 Unclaimed Idle      0.000 4029  0+00:00:27
  [snip]
    • The other helpful command to see what's going on: condor_config_val -dump
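The full dump is long, so it can help to filter it. A minimal sketch, assuming the dump prints one "NAME = value" setting per line; filter_settings is a made-up helper name, not a condor command:

```shell
# Hypothetical helper (not part of condor): read "NAME = value" lines on stdin
# and keep those whose NAME contains the given text, ignoring case.
filter_settings() {
    awk -F'=' -v pat="$1" '
        BEGIN { pat = tolower(pat) }
        # NF > 1 skips lines without "="; $1 is the part before the first "="
        NF > 1 && index(tolower($1), pat) > 0
    '
}
```

For example, condor_config_val -dump | filter_settings gpu should show only the GPU-related settings on that node.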
  • Submit a test job (as user001 on cetest03)
[user001@cetest03 ~]$ cat test.submit

Universe       = vanilla
Executable     = hello_world.sh  

input   = /dev/null
output  = hello.out.$(Cluster)                
error   = hello.error.$(ClusterId)       

request_GPUs = 1
Queue
[user001@cetest03 ~]$ cat hello_world.sh
#!/bin/bash

echo "Hello World"
echo "Today is: " `date`
echo "I am running on: " `hostname`
echo "I am " `whoami`

env | sort

echo "+++++++++++++++++++++++++++++++++++"

/srv/localstage/sf105/samples/NVIDIA_CUDA-7.5_Samples/bin/x86_64/linux/release/deviceQuery

sleep 30

In the deviceQuery output, the two different GPUs can be distinguished by their Bus ID: 4 and 10. To submit the job, run:

condor_submit test.submit

For non-GPU jobs, set request_GPUs = 0.
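A sketch of what a CPU-only variant of the test job could look like, written out by a small shell function. The function name write_cpu_only_submit and the file name cpu_only.submit are made up for illustration; everything else is copied from test.submit above with only the GPU request changed.

```shell
# Hypothetical helper: write a CPU-only copy of test.submit to the given path
# (default: cpu_only.submit in the current directory).
write_cpu_only_submit() {
    target="${1:-cpu_only.submit}"
    # The quoted 'EOF' stops the shell from expanding $(Cluster) itself,
    # so the macro reaches condor unchanged.
    cat > "$target" <<'EOF'
Universe       = vanilla
Executable     = hello_world.sh

input   = /dev/null
output  = hello.out.$(Cluster)
error   = hello.error.$(Cluster)

# this job does not ask for a GPU
request_GPUs = 0
Queue
EOF
}
```

After calling write_cpu_only_submit, the job would be submitted the same way as before: condor_submit cpu_only.submit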

Only allow jobs that use a GPU

Add the following on lt2gpu00:

SLOT_TYPE_1 = cpus=25%
SLOT_TYPE_1_PARTITIONABLE = False
NUM_SLOTS_TYPE_1 = 4
START = RequestGpus > 0

Jobs that don't request a GPU will not start.
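
One way to check this is to submit a job with request_GPUs = 0 and watch it sit idle (ST column "I") in condor_q. A rough sketch for counting jobs per state, assuming the condor_q column layout shown earlier on this page (ID, OWNER, SUBMITTED date and time, RUN_TIME, ST, ...); count_jobs_in_state is a made-up helper name. Newer condor releases may need condor_q -nobatch to get this layout.

```shell
# Hypothetical helper: read condor_q output on stdin and count the jobs whose
# ST column equals the given state letter (R = running, I = idle, H = held).
count_jobs_in_state() {
    awk -v state="$1" '
        # Job lines start with a cluster.proc ID such as 19.0;
        # with the layout assumed here, ST is the 6th whitespace-separated field.
        $1 ~ /^[0-9]+\.[0-9]+$/ && $6 == state { n++ }
        END { print n + 0 }
    '
}
```

For example, condor_q | count_jobs_in_state I prints the number of idle jobs in the queue.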