Imperial Condor Log

Plain condor with GPU node as WN

  • Install plain condor on cetest03 and lt2gpu00
wget -P /etc/yum.repos.d/ https://research.cs.wisc.edu/htcondor/yum/repo.d/htcondor-stable-rhel6.repo
rpm --import http://research.cs.wisc.edu/htcondor/yum/RPM-GPG-KEY-HTCondor
yum install condor
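To check the install took, something like:
  # start the daemons and confirm which version got installed
  service condor start
  condor_version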
  • Open the relevant ports on both machines (wiki not secret enough to list here, I think)
  • Make some users (same uid/gid on both machines).
    (Because I am no good at remembering options, here are two samples:)
   useradd -m -d /srv/localstage/user004 user004
   useradd -m -d /srv/localstage/user002 -g user002 -u 502 user002  
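   If more matching accounts are needed, a loop along these lines keeps both machines in sync (the uid base of 500 is an assumption extrapolated from the user002 sample):
   for i in 1 2 3 4; do
       u=$(printf 'user%03d' $i)
       # fixed uid/gid so both machines match; 500+i is assumed, adjust as needed
       groupadd -g $((500 + i)) $u
       useradd -m -d /srv/localstage/$u -g $u -u $((500 + i)) $u
   done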
  • All configurations go in /etc/condor/condor_config.local.
    We'll try to keep the configuration on both nodes as close to identical as possible, even if not every option is needed on every node.
  • After changing the configuration, condor needs to be restarted to reload the config file:
  service condor restart 
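  For configuration-only changes, condor_reconfig reloads the file without killing running daemons or jobs; a full restart is only needed for changes such as DAEMON_LIST:
  condor_reconfig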
  • These basic config files work:

On cetest03:

  CONDOR_HOST = cetest03.grid.hep.ph.ic.ac.uk
  # this makes it a scheduler
  DAEMON_LIST = COLLECTOR, MASTER, NEGOTIATOR, SCHEDD
  SEC_PASSWORD_FILE = /etc/condor/pool_password
  ALLOW_WRITE = *.grid.hep.ph.ic.ac.uk
  UID_DOMAIN = cetest03.grid.hep.ph.ic.ac.uk
  use feature : GPUs
  # stop the emails
  SCHEDD_RESTART_REPORT = 
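The pool password that SEC_PASSWORD_FILE points at has to be created on both machines (same password on each); the standard way is:
  # prompts for the pool password and stores it at the SEC_PASSWORD_FILE path
  condor_store_cred -c add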
  

On lt2gpu00:
  CONDOR_HOST = cetest03.grid.hep.ph.ic.ac.uk
  # this makes it a WN
  DAEMON_LIST = MASTER, STARTD
  # get server and WN to talk to each other
  SEC_PASSWORD_FILE = /etc/condor/pool_password
  ALLOW_WRITE = *.grid.hep.ph.ic.ac.uk
  # I don't want to be nobody: keep the same user name throughout
  UID_DOMAIN = cetest03.grid.hep.ph.ic.ac.uk
  use feature : GPUs
  SCHEDD_RESTART_REPORT = 
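To check that use feature : GPUs actually finds the cards, HTCondor ships a discovery tool, and the advertised attributes can then be inspected from the server:
  # on lt2gpu00: what the startd will detect and advertise
  condor_gpu_discovery -properties
  # on cetest03: the GPU-related attributes in the machine ad
  condor_status -long lt2gpu00 | grep -i cuda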
  • To see what's going on:
    • On lt2gpu00: condor_status
    • On cetest03: condor_q, condor_status -long
  • Another helpful command to see what's going on: condor_config_val -dump
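    For the GPU settings specifically, grepping that dump is quicker:
    condor_config_val -dump | grep -i gpu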
  • Submit a test job (as user001 on cetest03):
[user001@cetest03 ~]$ cat test.submit

Universe       = vanilla
Executable     = hello_world.sh  

input   = /dev/null
output  = hello.out.$(Cluster)
error   = hello.error.$(Cluster)

request_GPUs = 1
Queue
[user001@cetest03 ~]$ cat hello_world.sh
#!/bin/bash

echo "Hello World"
echo "Today is: " `date`
echo "I am running on: " `hostname`
echo "I am " `whoami`

env | sort

echo "+++++++++++++++++++++++++++++++++++"

/srv/localstage/sf105/samples/NVIDIA_CUDA-7.5_Samples/bin/x86_64/linux/release/deviceQuery

sleep 30
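
To actually run it (standard condor commands; the output file name carries the cluster number):

[user001@cetest03 ~]$ chmod +x hello_world.sh
[user001@cetest03 ~]$ condor_submit test.submit
[user001@cetest03 ~]$ condor_q
[user001@cetest03 ~]$ cat hello.out.*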

The two different GPUs can be distinguished by their PCI bus IDs in the deviceQuery output: 4 and 10.
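Since the script dumps its environment, the same output also shows which device condor assigned to the job; the exact variable names depend on the condor version, so this grep is a best guess:
  grep -i -E 'assignedgpus|visible_devices' hello.out.*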