Imperial Condor Log
Plain condor with GPU node as WN
- Install plain condor on cetest03 and lt2gpu00
wget https://research.cs.wisc.edu/htcondor/yum/repo.d/htcondor-stable-rhel6.repo
rpm --import http://research.cs.wisc.edu/htcondor/yum/RPM-GPG-KEY-HTCondor
yum install condor
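For yum to find the repo definition it has to end up in /etc/yum.repos.d/, and the service needs starting once the RPMs are in. A minimal sketch, assuming the stock EL6 init scripts (not part of the original notes):
# assumption: the repo file is moved to where yum looks for it
mv htcondor-stable-rhel6.repo /etc/yum.repos.d/
# start condor now and enable it at boot
service condor start
chkconfig condor on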
- Open the relevant ports on both machines (wiki not secret enough to list here, I think)
- Make some users (same uid/gid on both machines).
(Because I am no good at remembering options, here are two samples:)
useradd -m -d /srv/localstage/user004 user004
useradd -m -d /srv/localstage/user002 -g user002 -u 502 user002
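It's worth double checking that the uid/gid really do agree on both machines before running jobs; a quick check (standard coreutils, nothing condor-specific):
# run on both cetest03 and lt2gpu00; the numbers should match
id user002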
- All configurations go in /etc/condor/condor_config.local.
We'll try and keep the configurations on both nodes as identical as possible, even if not every option is needed by every node.
- After changing the configuration, condor needs to be restarted to reload the config file:
service condor restart
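To confirm the daemons actually came back after a restart, a couple of generic checks (the log location is the usual one for an RPM install, so treat it as an assumption):
# list the condor daemons running on this node
ps -ef | grep [c]ondor_
# the master log normally lives under /var/log/condor/
tail /var/log/condor/MasterLog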
- These basic config files work:
On cetest03:
CONDOR_HOST = cetest03.grid.hep.ph.ic.ac.uk
# this makes it a scheduler
DAEMON_LIST = COLLECTOR, MASTER, NEGOTIATOR, SCHEDD
SEC_PASSWORD_FILE = /etc/condor/pool_password
ALLOW_WRITE = *.grid.hep.ph.ic.ac.uk
UID_DOMAIN = cetest03.grid.hep.ph.ic.ac.uk
use feature : GPUs
# stop the emails
SCHEDD_RESTART_REPORT =
On lt2gpu00:
CONDOR_HOST = cetest03.grid.hep.ph.ic.ac.uk
# this makes it a WN
DAEMON_LIST = MASTER, STARTD
# get server and WN to talk to each other
SEC_PASSWORD_FILE = /etc/condor/pool_password
ALLOW_WRITE = *.grid.hep.ph.ic.ac.uk
# I don't want to be nobody: keep same user name throughout
UID_DOMAIN = cetest03.grid.hep.ph.ic.ac.uk
use feature : GPUs
SCHEDD_RESTART_REPORT =
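Both configs point SEC_PASSWORD_FILE at /etc/condor/pool_password, so that file needs creating on both machines with the same password in it. One way to do it (the chown/chmod lines are just an assumption about sensible permissions):
# prompts for the pool password and writes it to the file; run on both nodes
condor_store_cred -f /etc/condor/pool_password
chown root:root /etc/condor/pool_password
chmod 600 /etc/condor/pool_password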
- To see what's going on:
- On cetest03: condor_q, condor_status -long
[user001@cetest03 ~]$ condor_q

-- Schedd: cetest03.grid.hep.ph.ic.ac.uk : <146.179.247.252:15363?...
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
  19.0   user001         6/6  14:42   0+00:00:02 R  0   0.0  hello_world.sh

1 jobs; 0 completed, 0 removed, 0 idle, 1 running, 0 held, 0 suspended
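If a job sits idle instead of running, condor_q can usually explain why (the job id here is the one from the output above):
condor_q -better-analyze 19.0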
- On lt2gpu00:
[root@lt2gpu00 ~]# condor_status
Name               OpSys      Arch   State     Activity LoadAv Mem   ActvtyTime
slot10@lt2gpu00.gr LINUX      X86_64 Unclaimed Idle      0.000 4029  0+00:00:26
slot11@lt2gpu00.gr LINUX      X86_64 Unclaimed Idle      0.000 4029  0+00:00:27
[snip]
- The other helpful command to see what's going on: condor_config_val -dump
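To check that the GPUs feature actually advertised the cards on the WN, grep the machine ads for GPU-related attributes (the attribute names vary between HTCondor versions, so this is just a rough filter):
condor_status -long lt2gpu00.grid.hep.ph.ic.ac.uk | grep -i gpu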
- Submit a test job (as user001 on cetest03)
[user001@cetest03 ~]$ cat test.submit
Universe = vanilla
Executable = hello_world.sh
input = /dev/null
output = hello.out.$(Cluster)
error = hello.error.$(ClusterId)
request_GPUs = 1
Queue
[user001@cetest03 ~]$ cat hello_world.sh
#!/bin/bash
echo "Hello World"
echo "Today is: " `date`
echo "I am running on: " `hostname`
echo "I am " `whoami`
env | sort
echo "+++++++++++++++++++++++++++++++++++"
/srv/localstage/sf105/samples/NVIDIA_CUDA-7.5_Samples/bin/x86_64/linux/release/deviceQuery
sleep 30
The two different GPUs can be distinguished by their Bus IDs (4 and 10). To submit the job:
condor_submit test.submit
For non-GPU jobs set request_GPUs = 0
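To see which of the two cards a job actually got, grep the environment dump that hello_world.sh already produces; condor normally exports the assigned device to the job, CUDA_VISIBLE_DEVICES being the usual variable, though the exact names depend on the HTCondor version:
# grep the job's stdout (hello.out.<cluster>) after it has run
grep -iE 'cuda|gpu' hello.out.*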