HTCondor Jobs In Containers
This page explains two methods of running LHC jobs in SL6 containers using HTCondor with SL7 worker nodes. Currently this documentation is a work in progress. It describes the setup at RAL.
Versions in use:
- HTCondor 8.6.2
- Docker 17.03.0-ce
The SL7 or CentOS 7 worker nodes should have CVMFS, HTCondor, Docker engine, CA certificates and fetch-crl installed. See https://docs.docker.com/engine/installation/linux/rhel/ for information on how to install Docker engine. CVMFS should be installed as normal using autofs.
Add the following to sudoers to enable HTCondor to use the Docker CLI as root:
User_Alias CONDORUSER = condor
Cmnd_Alias DOCKERCMD = /usr/bin/docker
CONDORUSER ALL = NOPASSWD: DOCKERCMD
and add the following line to the HTCondor configuration:
DOCKER = sudo /usr/bin/docker
The alternative method of giving HTCondor permission to run containers (i.e. adding the condor user to the docker group) is problematic with Docker 1.13.1.
Some additional HTCondor configuration is required in order to automatically bind mount CVMFS and /etc/grid-security into all Docker containers run by HTCondor:
DOCKER_MOUNT_VOLUMES = CVMFS, GRID_SECURITY, PASSWD, GROUP
DOCKER_VOLUME_DIR_CVMFS = /cvmfs:/cvmfs:shared
DOCKER_VOLUME_DIR_GRID_SECURITY = /etc/grid-security:/etc/grid-security:ro
DOCKER_VOLUME_DIR_PASSWD = /etc/passwd:/etc/passwd:ro
DOCKER_VOLUME_DIR_GROUP = /etc/group:/etc/group:ro
Here we also bind mount /etc/passwd and /etc/group into the containers so that pool accounts are available. The pool accounts must be configured on the host: for HTCondor to run a job as a particular user, that user must exist on the host. Note that it is essential to use the "shared" option for CVMFS; without it, CVMFS will not work correctly.
With HTCondor 8.5.8 and above it's possible to specify what directories to mount in containers using an expression. For example, with this configuration:
DOCKER_VOLUME_DIR_GRID_SECURITY = /etc/grid-security:/etc/grid-security:ro
DOCKER_VOLUME_DIR_PASSWD = /etc/passwd:/etc/passwd:ro
DOCKER_VOLUME_DIR_GROUP = /etc/group:/etc/group:ro
DOCKER_VOLUME_DIR_CVMFS = /cvmfs:/cvmfs:shared
DOCKER_MOUNT_VOLUMES = GRID_SECURITY, PASSWD, GROUP, CVMFS
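As a minimal sketch of an expression-controlled mount, HTCondor provides a DOCKER_VOLUME_DIR_<name>_MOUNT_IF setting: a ClassAd expression evaluated against the job and machine ads, with the named volume mounted only when it evaluates to true. The volume name SCRATCH, the host path /pool/scratch and the job attribute WantScratchMounted below are hypothetical names chosen for illustration; they are not part of the RAL setup described here.

```
# Hypothetical extra volume, for illustration only
DOCKER_VOLUME_DIR_SCRATCH = /pool/scratch:/scratch

# Mount SCRATCH only for jobs whose ad sets the (hypothetical)
# attribute WantScratchMounted to True
DOCKER_VOLUME_DIR_SCRATCH_MOUNT_IF = WantScratchMounted =?= True

# Append the new volume to the list already defined above
DOCKER_MOUNT_VOLUMES = $(DOCKER_MOUNT_VOLUMES), SCRATCH
```

Volumes without a _MOUNT_IF expression, such as CVMFS and GRID_SECURITY above, continue to be mounted into every container.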
If you want to force all jobs to run in Docker containers by default, this can be done with configuration like the following:
WantDocker = True
DockerImage = "alahiff/grid-workernode-sl6:latest"
SUBMIT_EXPRS = $(SUBMIT_EXPRS), WantDocker, DockerImage
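For comparison, a job can also request a container explicitly. The following submit description is a sketch of a small test job, assuming the same image as above; the output, error and log file names are arbitrary, and transfer_executable is set to False on the assumption that the container's own /bin/cat should be used rather than a copy from the submit host.

```
# Test job: print the OS release as seen inside the container
universe                = docker
docker_image            = alahiff/grid-workernode-sl6:latest
executable              = /bin/cat
arguments               = /etc/redhat-release
# Assumption: use the binary inside the image, not one from the submit host
transfer_executable     = False
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
output                  = test.out
error                   = test.err
log                     = test.log
queue
```

If the job runs in the expected image, test.out should report an SL6 release even though the worker node itself runs SL7.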
where the image name should be changed as appropriate. In an environment where jobs could be run on either normal worker nodes or in containers (e.g. during migration from SL6 to SL7 with the Docker universe), it is probably better to control the number of jobs requesting the Docker universe by using a job router. E.g.