HTCondor Jobs In Containers
Revision as of 19:18, 27 April 2017
This page explains two methods of running LHC jobs in SL6 containers using HTCondor with SL7 worker nodes. Currently this documentation is a work in progress. It describes the setup at RAL.
Worker nodes
The SL7 or CentOS 7 worker nodes should have CVMFS, HTCondor, Docker engine, CA certificates and fetch-crl installed. See https://docs.docker.com/engine/installation/linux/rhel/ for information on how to install Docker engine. CVMFS should be installed as normal using autofs.
We are currently using Docker 17.03.0-ce. We found that the only reliable choice for the storage driver is OverlayFS. The file /etc/docker/daemon.json contains:
{
  "storage-driver": "overlay",
  "graph": "/pool/docker"
}
The partition /pool is an XFS filesystem formatted with the option -n ftype=1. This is essential, because OverlayFS needs the d_type support that this option enables. Without it there will be lots of kernel errors.
Add the following to sudoers to enable HTCondor to use the Docker CLI as root:
User_Alias CONDORUSER = condor
Cmnd_Alias DOCKERCMD = /usr/bin/docker
CONDORUSER ALL = NOPASSWD: DOCKERCMD
and add the following line to the HTCondor configuration:
DOCKER = sudo /usr/bin/docker
The alternative method of giving HTCondor permission to run containers (i.e. adding the condor user to the docker group) is problematic with Docker 1.13.1 and above.
Our full HTCondor configuration relating to Docker is as follows:
DOCKER = sudo /usr/bin/docker
DOCKER_DROP_ALL_CAPABILITIES=regexp("pilot",x509UserProxyFirstFQAN) =?= False
DOCKER_MOUNT_VOLUMES=GRID_SECURITY, MJF, GRIDENV, GLEXEC, LCMAPS, LCAS, PASSWD, GROUP, CVMFS, CGROUPS, ATLAS_RECOVERY, ETC_ATLAS, ETC_CMS, ETC_ARC
DOCKER_VOLUME_DIR_ATLAS_RECOVERY=/pool/atlas/recovery:/pool/atlas/recovery
DOCKER_VOLUME_DIR_ATLAS_RECOVERY_MOUNT_IF=regexp("atl",Owner)
DOCKER_VOLUME_DIR_CGROUPS=/sys/fs/cgroup:/sys/fs/cgroup:ro
DOCKER_VOLUME_DIR_CGROUPS_MOUNT_IF=regexp("atl",Owner)
DOCKER_VOLUME_DIR_CVMFS=/cvmfs:/cvmfs:shared
DOCKER_VOLUME_DIR_ETC_ARC=/etc/arc:/etc/arc:ro
DOCKER_VOLUME_DIR_ETC_ATLAS=/etc/atlas:/etc/atlas:ro
DOCKER_VOLUME_DIR_ETC_ATLAS_MOUNT_IF=regexp("atl",Owner)
DOCKER_VOLUME_DIR_ETC_CMS=/etc/cms:/etc/cms:ro
DOCKER_VOLUME_DIR_ETC_CMS_MOUNT_IF=regexp("cms",Owner)
DOCKER_VOLUME_DIR_GLEXEC=/etc/glexec.conf:/etc/glexec.conf:ro
DOCKER_VOLUME_DIR_GRIDENV=/etc/profile.d/grid-env.sh:/etc/profile.d/grid-env.sh:ro
DOCKER_VOLUME_DIR_GRID_SECURITY=/etc/grid-security:/etc/grid-security:ro
DOCKER_VOLUME_DIR_GROUP=/etc/group:/etc/group:ro
DOCKER_VOLUME_DIR_LCAS=/etc/lcas:/etc/lcas:ro
DOCKER_VOLUME_DIR_LCMAPS=/etc/lcmaps:/etc/lcmaps:ro
DOCKER_VOLUME_DIR_MJF=/etc/machinefeatures:/etc/machinefeatures:ro
DOCKER_VOLUME_DIR_PASSWD=/etc/passwd:/etc/passwd:ro
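The volume settings above follow a repeating pattern: a named volume, an optional mount condition, and an entry in the list of volumes to mount. As a minimal sketch, one extra mount could be added as shown here. The volume name EXAMPLE and the path /pool/example are hypothetical and not part of the RAL setup, and the sketch assumes the settings behave as they are used in the configuration above.

```
# Hypothetical volume named EXAMPLE: host path, container path, then mount options
DOCKER_VOLUME_DIR_EXAMPLE = /pool/example:/pool/example:ro
# Optional condition: mount only for jobs whose Owner matches the pattern "cms"
DOCKER_VOLUME_DIR_EXAMPLE_MOUNT_IF = regexp("cms",Owner)
# Append the new name to the existing list of volumes to mount
DOCKER_MOUNT_VOLUMES = $(DOCKER_MOUNT_VOLUMES), EXAMPLE
```

Leaving out the _MOUNT_IF line would mount the volume for every job, as is done for CVMFS and /etc/passwd above.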
CEs
To force all jobs to run in Docker containers by default, use configuration like the following:
WantDocker = True
DockerImage = "alahiff/grid-workernode-sl6:latest"
SUBMIT_EXPRS = $(SUBMIT_EXPRS), WantDocker, DockerImage
where the image name should be changed as appropriate. In an environment where jobs could be run on either normal worker nodes or in containers (e.g. during migration from SL6 to SL7 with the Docker universe), it is probably better to control the number of jobs requesting the Docker universe by using a job router.
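A job router route of this kind might look like the following sketch. It is not the RAL configuration: the route name, the Requirements expression and the MaxJobs value are hypothetical, and it assumes the ClassAd-style JOB_ROUTER_ENTRIES syntax, so check it against the HTCondor version in use. MaxJobs is the setting that limits how many jobs are on the route at one time.

```
# Run the job router daemon on the CE (assumes it is not already in DAEMON_LIST)
DAEMON_LIST = $(DAEMON_LIST), JOB_ROUTER

# One route that turns matching jobs into Docker universe jobs.
# Requirements selects jobs that do not already request Docker (hypothetical choice).
# MaxJobs caps the number of jobs on this route; 500 is an arbitrary example value.
JOB_ROUTER_ENTRIES = \
   [ \
     name = "SL6_container"; \
     TargetUniverse = 5; \
     Requirements = (TARGET.WantDocker =!= True); \
     MaxJobs = 500; \
     set_WantDocker = True; \
     set_DockerImage = "alahiff/grid-workernode-sl6:latest"; \
   ]
```

Raising MaxJobs over time would move a larger share of jobs into containers, while the remaining jobs continue to run on normal worker nodes.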
Image
The Dockerfile for the image in use is here: https://github.com/alahiff/grid-workernode/blob/master/centos6/Dockerfile