Example Build of an ARC/Condor Cluster
Introduction
A multi-core job is one that needs more than one processor on a single node. Until recently, multi-core jobs saw little use on the grid infrastructure, but that has changed now that ATLAS and other large users have asked sites to enable multi-core on their clusters.
Unfortunately, enabling it is not simply a matter of setting a parameter on the head node and sitting back while the jobs arrive. Different grid systems have varying levels of support for multi-core, ranging from non-existent to virtually complete.
This report discusses the multi-core configuration at Liverpool. We decided to build a test cluster using one of the most capable batch systems currently available, HTCondor (Condor for short), and to front the system with an ARC CE.
I need to thank Andrew Lahiff at RAL for the initial configuration and for many suggestions and much help.
Infrastructure/Fabric
The multi-core test cluster consists of an SL6 headnode running the ARC CE and the Condor batch system. The headnode has a dedicated set of 11 worker nodes of various types, providing a total of 96 single threads of execution.
Head Node
The headnode is a virtual system running on KVM.
| Host Name | OS | CPUs | RAM | Disk Space |
|---|---|---|---|---|
| hepgrid2.ph.liv.ac.uk | SL6.4 | 8 | 2 GB | 35 GB |
Worker nodes
The physical worker nodes are described below.
| Node names | CPU type | OS | RAM | Disk Space | CPUs per node | Slots per CPU | Slots per node | Total nodes | Total CPUs | Total slots | HEPSPEC per slot | Total HEPSPEC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| r21-n01-4 | E5620 | SL6.4 | 24 GB | 1.5 TB | 2 | 5 | 10 | 4 | 8 | 40 | 12.05 | 482 |
| r26-n05-11 | L5420 | SL6.4 | 16 GB | 1.7 TB | 2 | 4 | 8 | 7 | 14 | 56 | 8.86 | 502 |
Software Builds and Configuration
There are a few particulars of the Liverpool site that I want to get out of the way at the start. For the initial installation of an operating system on our head nodes and worker nodes, we use tools developed at Liverpool (BuildTools) based on Kickstart, NFS, TFTP and DHCP. The source (synctool.pl and linktool.pl) can be obtained from sjones@hep.ph.liv.ac.uk. Alternatively, similar functionality is said to exist in the Cobbler suite, which is open source, and some sites have based their initial install on it. Once the OS is on, the first reboot starts Puppet, which gives the node its personality. As Puppet is becoming something of a de facto standard in its own right, I'll use Puppet stanzas within this document wherever a particular feature needs some explanation.
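As a minimal sketch of that pattern (the class names and the node regex are invented for illustration, not our real manifests), the first Puppet run after kickstart matches the node name and pulls in the classes that define its role:

```puppet
# Hypothetical top-level manifest: map each freshly built node to a
# role. Class and host patterns are assumptions for illustration only.
node 'hepgrid2.ph.liv.ac.uk' {
  include site::condor_head   # ARC CE / Condor headnode personality
}

node /^r(21|26)-n\d+/ {
  include site::condor_wn     # Condor worker node personality
}
```

The two classes are sketched in the corresponding sections below.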
So we'll start with the headnode and work our way down, so to speak.
Head Node
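By way of illustration only, here is a hedged sketch (in modern Puppet heredoc syntax) of the kind of stanza involved; the file names and daemon list are assumptions rather than our production settings. It makes the headnode the Condor central manager and gives it a schedd for the ARC CE to submit into; the ARC CE itself is configured separately through /etc/arc.conf and the a-rex service.

```puppet
# Hedged sketch of the headnode personality: run the Condor central
# manager daemons (collector, negotiator) plus a schedd. Values are
# illustrative, not the Liverpool production configuration.
class site::condor_head {
  package { 'condor':
    ensure => installed,
  }

  file { '/etc/condor/config.d/10-head.conf':
    require => Package['condor'],
    notify  => Service['condor'],
    content => @(CONF),
      CONDOR_HOST = hepgrid2.ph.liv.ac.uk
      DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, SCHEDD
      | CONF
  }

  service { 'condor':
    ensure  => running,
    enable  => true,
    require => Package['condor'],
  }
}
```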
Worker Node
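Again a hedged sketch rather than the real manifest: the setting that matters for multi-core is the single partitionable slot covering the whole machine, which lets one node run any mix of single-core and multi-core jobs.

```puppet
# Hedged sketch of the worker personality. One partitionable slot
# owns all the CPUs; the startd carves off a dynamic slot per job.
class site::condor_wn {
  package { 'condor':
    ensure => installed,
  }

  file { '/etc/condor/config.d/10-wn.conf':
    require => Package['condor'],
    notify  => Service['condor'],
    content => @(CONF),
      CONDOR_HOST = hepgrid2.ph.liv.ac.uk
      DAEMON_LIST = MASTER, STARTD
      # One slot covering the whole node, split up on demand
      SLOT_TYPE_1 = cpus=100%
      SLOT_TYPE_1_PARTITIONABLE = TRUE
      NUM_SLOTS_TYPE_1 = 1
      | CONF
  }

  service { 'condor':
    ensure  => running,
    enable  => true,
    require => Package['condor'],
  }
}
```

With that in place, a multi-core job just asks for cores (request_cpus = 8 in a Condor submit file, or (count=8) in the xRSL passed through the ARC CE) and the negotiator matches it against any partitionable slot with enough free CPUs.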
Performance/Tuning
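The awkward part of mixed single-core and multi-core running is fragmentation: once the whole-node slots have been carved into single-core pieces, something has to drain machines so that 8-core blocks can form again. HTCondor provides the condor_defrag daemon for this; the stanza below is a hedged sketch whose rates are placeholders to show the knobs, not tested values.

```puppet
# Hedged sketch: enable condor_defrag on the central manager and
# drain a modest number of machines at a time. All the rates are
# placeholders, not tuned production values.
class site::condor_defrag {
  file { '/etc/condor/config.d/20-defrag.conf':
    content => @(CONF),
      DAEMON_LIST = $(DAEMON_LIST) DEFRAG
      DEFRAG_INTERVAL = 600
      DEFRAG_DRAINING_MACHINES_PER_HOUR = 1.0
      DEFRAG_MAX_CONCURRENT_DRAINING = 2
      DEFRAG_MAX_WHOLE_MACHINES = 4
      # A machine counts as "whole" when all of its cores are free
      DEFRAG_WHOLE_MACHINE_EXPR = Cpus == TotalCpus
      DEFRAG_SCHEDULE = graceful
      | CONF
  }

  # Condor re-reads its configuration on reconfig; no restart needed
  exec { 'condor_reconfig':
    command     => '/usr/sbin/condor_reconfig',
    refreshonly => true,
    subscribe   => File['/etc/condor/config.d/20-defrag.conf'],
  }
}
```

The trade-off is straightforward: draining faster opens multi-core slots sooner, but wastes more capacity while partially drained nodes sit idle.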