Imperial Condor Log
Plain condor with GPU node as WN
- Install plain condor on cetest03 and lt2gpu00
wget https://research.cs.wisc.edu/htcondor/yum/repo.d/htcondor-stable-rhel6.repo
rpm --import http://research.cs.wisc.edu/htcondor/yum/RPM-GPG-KEY-HTCondor
yum install condor
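For the record, a slightly more complete version of the install; the /etc/yum.repos.d/ destination and the chkconfig line are assumptions for a standard EL6 box, not taken from the original commands:
cd /etc/yum.repos.d/
wget https://research.cs.wisc.edu/htcondor/yum/repo.d/htcondor-stable-rhel6.repo
rpm --import http://research.cs.wisc.edu/htcondor/yum/RPM-GPG-KEY-HTCondor
yum install condor
# assumption: make sure condor comes back after a reboot (EL6 SysV init)
chkconfig condor on
service condor start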
- Open the relevant ports on both machines (wiki not secret enough to list here, I think)
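The real port numbers stay off the wiki. Purely as a generic illustration of the sort of thing needed (all numbers below are assumptions, not our actual settings): the collector listens on 9618 by default, and the other daemons can be pinned to a fixed range in condor_config.local so the firewall only needs two rules per machine:
# condor_config.local: pin the dynamically chosen ports to a range (illustrative values)
LOWPORT = 20000
HIGHPORT = 20100
# EL6 iptables, illustrative only; restrict the source as appropriate
iptables -A INPUT -p tcp --dport 9618 -j ACCEPT
iptables -A INPUT -p tcp --dport 20000:20100 -j ACCEPT
service iptables save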
- Make some users (same uid/gid on both machines).
(Because I am no good at remembering options, here are two samples:)
useradd -m -d /srv/localstage/user004 user004
useradd -m -d /srv/localstage/user002 -g user002 -u 502 user002
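A quick sanity check that the uids/gids really do match (plain shell, nothing condor-specific):
# run on both cetest03 and lt2gpu00; the output should be identical
getent passwd user002 user004
id user002
id user004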
- All configurations go in /etc/condor/condor_config.local.
We'll try to keep the configurations as identical as possible on both nodes, even if not every option is needed by every node.
- After changing the configuration, condor needs to be restarted to reload the config file:
service condor restart
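To confirm a restarted daemon really picked up a new value, condor_config_val can be asked for individual settings (-v also reports where the value was defined):
condor_config_val -v CONDOR_HOST
condor_config_val -v DAEMON_LIST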
- These basic config files work:
On cetest03:
CONDOR_HOST = cetest03.grid.hep.ph.ic.ac.uk
# this makes it a scheduler
DAEMON_LIST = COLLECTOR, MASTER, NEGOTIATOR, SCHEDD
SEC_PASSWORD_FILE = /etc/condor/pool_password
ALLOW_WRITE = *.grid.hep.ph.ic.ac.uk
UID_DOMAIN = cetest03.grid.hep.ph.ic.ac.uk
use feature : GPUs
# stop the emails
SCHEDD_RESTART_REPORT =
On lt2gpu00:
CONDOR_HOST = cetest03.grid.hep.ph.ic.ac.uk
# this makes it a WN
DAEMON_LIST = MASTER, STARTD
# get server and WN to talk to each other
SEC_PASSWORD_FILE = /etc/condor/pool_password
ALLOW_WRITE = *.grid.hep.ph.ic.ac.uk
# I don't want to be nobody: keep same user name throughout
UID_DOMAIN = cetest03.grid.hep.ph.ic.ac.uk
use feature : GPUs
SCHEDD_RESTART_REPORT =
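Both configs point SEC_PASSWORD_FILE at /etc/condor/pool_password, so that file needs to exist on both machines and contain the same password. A sketch of creating it with condor_store_cred; check the options against the installed condor version:
# prompts for the pool password and writes it to the named file; run on both machines
condor_store_cred -f /etc/condor/pool_password
# the file should only be readable by root
chmod 600 /etc/condor/pool_password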
- To see what's going on:
- On cetest03: condor_q, condor_status -long
- On lt2gpu00:
[root@lt2gpu00 ~]# condor_status
Name               OpSys  Arch   State     Activity LoadAv Mem  ActvtyTime
slot10@lt2gpu00.gr LINUX  X86_64 Unclaimed Idle     0.000  4029 0+00:00:26
slot11@lt2gpu00.gr LINUX  X86_64 Unclaimed Idle     0.000  4029 0+00:00:27
[snip]
- The other helpful command to see what's going on: condor_config_val -dump
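To check the GPUs were actually detected and advertised (the grep is just a convenience; attribute names vary a bit between condor versions):
# on cetest03: dump the slot ads from the GPU node and look for GPU attributes
condor_status -long lt2gpu00 | grep -i gpu
# on lt2gpu00: see what 'use feature : GPUs' expanded to
condor_config_val -dump | grep -i gpu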
- Submit a test job (as user001 on cetest03)
[user001@cetest03 ~]$ cat test.submit
Universe = vanilla
Executable = hello_world.sh
input = /dev/null
output = hello.out.$(Cluster)
error = hello.error.$(ClusterId)
request_GPUs = 1
Queue
[user001@cetest03 ~]$ cat hello_world.sh
#!/bin/bash
echo "Hello World"
echo "Today is: " `date`
echo "I am running on: " `hostname`
echo "I am " `whoami`
env | sort
echo "+++++++++++++++++++++++++++++++++++"
/srv/localstage/sf105/samples/NVIDIA_CUDA-7.5_Samples/bin/x86_64/linux/release/deviceQuery
sleep 30
The two different GPUs can be distinguished by their Bus IDs (4 and 10). To submit the job, do:
condor_submit test.submit
For non-GPU jobs set request_GPUs = 0
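To round things off, a short example of following the job through to its output; the cluster id 123 is made up for illustration:
condor_submit test.submit    # prints the assigned cluster id, e.g. "submitted to cluster 123."
condor_q                     # the job should move from idle to running on lt2gpu00
condor_history user001       # once the job has left the queue
cat hello.out.123            # Hello World, the environment dump and the deviceQuery output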