Vac Site Admin Guide 03.00

Andrew McNab <Andrew.McNab AT cern.ch>

Vac implements the Vacuum Platform described in HSF-TN-2016-04, on one or more autonomous hypervisors or "VM Factories".

Quick start
Configuration step-by-step
Starting and stopping vacd
Stopping VMs
Using the vac command
Setting up Nagios
APEL accounting
Puppet
Setting fizzle_seconds and backoff_seconds

Quick start

By following this quick start recipe you can verify that your installation will work with Vac and see it creating and destroying virtual machines. You will almost certainly want to start again from scratch by following the step-by-step part of the Admin Guide so don't invest a lot of time here. If you're already familiar with VMs, you could just skip straight there but it's safest to go through the quick start to make sure the requirements are all there.

To follow the quick start, you need an x86_64 Intel or AMD machine with hardware virtualization (Intel VT-x or AMD-V) enabled in its BIOS; and the machine needs to be installed with a version of Scientific Linux 6, with libvirt installed and enabled. In particular, the packages libvirt, libvirt-client, libvirt-python, qemu-kvm, and then run "service libvirtd restart" to make sure libvirtd daemon is running.

Install the vac RPM and at the command line excecute:

virsh list
virsh create /usr/share/doc/vac-*/testkvm.xml
virsh list
virsh destroy testkvm
virsh list

You should see no VMs listed as running to start with. After the create command, the testkvm VM should be listed as running. Afer destroying it, an empty list of VMs should be returned. If all this doesn't happen, then something is wrong with your installation or hardware virtualization isn't enabled. Please check the libvirt documentation to try to identify where the problem is.

The factory machine must have a fully qualified domain name (FDQN) as its hostname. So factory1.example.com not just factory1. The 169.254.0.0 network should not be configured on the factory machine before you start Vac. In particular, Zeroconf support should be disabled by adding NOZEROCONF=yes to /etc/sysconfig/network and restarting networking.

Next create the /etc/vac.conf configuration file. Copy /usr/share/doc/vac-VERSION/example.vac.conf to /etc/vac.conf and read through its comments. There are 5 lines you need to check and probably change.

vac_space = in [settings]: Set this to vac01 in your site's domain. So if your site is .example.com then set it to vac01.example.com . A Vac space is a group of factory machines that communicate with each other, and is equivalent to a subcluster or subsite. A space's name is a fully qualified domain name (FQDN), and can be used as a virtual CE name where necessary in other systems.
factories = in [settings]: Since we're creating a space that contains a single factory machine, set this to be the FQDN of the factory machine you're workng on.
root_public_key = in [machinetype example]: This setting is not strictly necessary but is very useful. By copying an RSA key pair to /root/.ssh on the factory machine, or creating one with ssh-keygen you will be able to ssh into the VM as root and see how it is laid out and how it is running. If you don't place a public key at the location given in this option, you need to comment the line out.
user_data_option_cvmfs_proxy = in [machinetype example]: The value of this option is included in the user_data file given to the VM. It must be set to the URL of an HTTP cache you have access to. If you are already using cvmfs for grid worker nodes, you can use the same value. Make sure that access from your factory machine isn't blocked in the cache machine's configuration.

The files needed for the example machinetype are fetched over HTTPS, as indicated by the root_disk and user_data options which should not be changed. You should install the lcg-CA RPMs or similar to ensure that Vac can access this and other HTTPS servers which use grid host certificates in most cases.

Just do service vacd restart to make sure vacd is running and look in the log files.

When vacd starts it forks a factory process that watches the VMs and creates or destroys them as necessary; and a responder process that replies to queries from factories about what is running on this host. These two processes have separate log files as /var/log/vacd-factory and /var/log/vacd-responder .

In its log file, you should be able to see the factory daemon trying to decide what to do and then creating the example VM which runs for 5 minutes then shuts itself down. Vac will create hostnames for the VMs from the factory name. For example, factory1.example.com will lead to factory1-00.example.com, factory1-01.example.com, ... When deciding what to do, the factory queries its own responder via UDP and this should be visible in the responder's log file.

You should also be able to see the state of the VM using the command vac machines, where vac is a command line tool that the RPM installs in /usr/sbin.

Configuration step-by-step

This part of the guide covers the same ground as the quick start guide but in a lot more detail. It's intended to help you choose how best to configure your site.

The configuration file /etc/vac.conf uses the Python ConfigParser syntax, which is similar to MS Windows INI files. The file is divided into sections, with each section name in square brackets. For example: [settings]. Each section contains a series of option=value pairs. Values can continue onto the next line by starting that line with some white space. Sections with the same name are merged and if options are duplicated, later values overwrite values given earlier. Any configuration file ending in .conf in the directory /etc/vac.d will also be read. These files are read in alphanumeric order, and then /etc/vac.conf is read if present.

Based on this ordering in /etc/vac.d/, options from space.conf would override any given in site.conf, but themselves be overwritten by options from subspace.conf or vacfactory.conf .

One useful approach is to populate /etc/vac.d with a management system like Puppet, and only create /etc/vac.conf manually to override the state on individual development machines or if a machine is being drained of work for maintenance. Site-wide configuration, such as machinetype definitions, can be included in /etc/vac.d files present on every factory, but host specific options, such as HEPSPEC06 values and the total number of VMs to create, can be given in /etc/vac.d files which are specific to particular subsets of machines. Vac will merge all of this information as outlined above.

CernVM images

Vac currently requires the use of CernVM images with HEPiX contexualization based on EC2/ISO ("CD-ROM") images, and we recommend the use of CernVM 3 micro boot images.

If you need to download an image, they can be found on the CernVM downloads page. You must get the generic .iso image file and not the .hdd file listed for KVM.

However, most experiments will supply you with their own URL from which Vac can automatically fetch their current designated image version, which Vac caches in /var/lib/vac/imagecache .

DNS, IP, MAC

The factory machines must have fully qualified domain names (FDQN) as their hostname. So factory1.example.com etc, not just factory1.

Vac uses a private NAT network for the set of virtual machines on a given factory. Vac then creates the VM FQDNs from the factory name by adding -00, -01, ... So factory1.example.com has factory1-00.example.com, factory1-01.example.com, ... as its VMs. The number of possible virtual machines on the factory is calculated by Vac from the number of logical processors in /proc/cpuinfo. Vac assigns IP addresses starting 169.254.169.0 for VM 0, 169.254.169.1 for VM 1 etc. Unique MAC addresses are also assigned to each VM. Using libvirt NAT machinery means this network is hidden from the rest of the LAN and only visible from the factory and its VMs. libvirt configures the dnsmasq server to run dedicated DNS and DHCP servers on this private network. The factory's address in this private network is 169.254.169.253. An EC2/Openstack metadata service is also provided on 169.254.169.254, which is the so-called Magic IP used by some Cloud systems for a local configuration service.

To use IP addresses in the 169.254.0.0 network, you must ensure you are using a recent version of dnsmasq. For SL6, dnsmasq-2.48-14.el6.x86_64.rpm avaiable as part of SL6 updates, is suitable.

The 169.254.0.0 network should not be configured on the factory machine before you start Vac. For example, Zeroconf support can be disabled by adding NOZEROCONF=yes to /etc/sysconfig/network and restarting networking.

Vac tries to create the vac private network itself if it doesn't already exist. You can check the private network exists with the command virsh net-list --all which should list the vac_169.254.0.0 network and the "default" network defined by libvirtd as both being active. The vac network should be using the virbr1 virtual interface, with virbr0 still used by the default network. The vac network is set up as persistent and auto-starting, so it should survive reboots and restarts of libvirtd.

However, Vac will log the error "Failed to create NAT network vac_169.254.0.0" if it doesn't exist and cannot be created. This is usually after a restart or an upgrade of libvirtd which leaves settings which block the new network. Vac will attempt to fix this too unless the setting fix_networking = false is given in the Vac configuration.

Network problems are usually caused by:

Still needing to upgrade dnsmasq to an RPM >= 2.48-13
Not disabling Zeroconf
An old dnsmasq process running with argument "--listen-address 169.254.169.253" (Use "service dnsmasq stop")
virbr1 already existing (remove with "ifconfig virbr1 down" and "brctl delbr virbr1", using brctl from the RPM bridge-utils.)

libvirtd attempts to configure the factory machine's iptables rules to support network address translation / masquerading from the private NAT network. Assumming the virbr1 virtual interface is used for the private network, then iptables can be set up using the iptables-restore command without relying on libvirtd:

# Set up masquerading from private network for the VMs
*nat
:PREROUTING ACCEPT [0:0]
:OUTPUT ACCEPT [0:0]
:POSTROUTING ACCEPT [0:0]
-A POSTROUTING -s 169.254.0.0/16 ! -d 169.254.0.0/16 -p tcp -j MASQUERADE ---to-ports 1024-65535 
-A POSTROUTING -s 169.254.0.0/16 ! -d 169.254.0.0/16 -p udp -j MASQUERADE ---to-ports 1024-65535 
-A POSTROUTING -s 169.254.0.0/16 ! -d 169.254.0.0/16 -j MASQUERADE 
COMMIT
# Filtering and forwarding rules
*filter
:INPUT ACCEPT [0:0]
:FORWARD ACCEPT [0:0]
:OUTPUT ACCEPT [0:0]
# Catch-all, including HTTP
-A INPUT -i virbr1 -p udp -j ACCEPT 
-A INPUT -i virbr1 -p tcp -j ACCEPT 
# Forward to/from private network
-A FORWARD -d 169.254.0.0/16 -o virbr1 -m state --state RELATED,ESTABLISHED --j ACCEPT 
-A FORWARD -s 169.254.0.0/16 -i virbr1 -j ACCEPT 
-A FORWARD -i virbr1 -o virbr1 -j ACCEPT 
-A FORWARD -o virbr1 -j REJECT --reject-with icmp-port-unreachable 
-A FORWARD -i virbr1 -j REJECT --reject-with icmp-port-unreachable 
COMMIT
*mangle
:PREROUTING ACCEPT [0:0]
:INPUT ACCEPT [0:0]
:FORWARD ACCEPT [0:0]
:OUTPUT ACCEPT [0:0]
:POSTROUTING ACCEPT [0:0]
# Create checksums for DHCP clients even when using virtio
-A POSTROUTING -o virbr1 -p udp -m udp --dport 68 -j CHECKSUM --checksum-fill 
COMMIT

This approach allows you to integrate the extra rules needed for the VMs private network into any existing rules you use at your site.

Vac performs a quick check of the current iptables rules at the start of each cycle using the iptables-save command. If any obvious problems are identified, then "Failed to match XXX in output of iptables-save. Have the NAT rules been removed?" lines will be included in /var/log/vacd-factory . However, if you have an unusual set of rules, these checks may produce false warnings, and they are not an exhaustive check that the rules you have are sufficient for Vac.

Logical volumes

Vac virtual machines can use logical volumes on the factory machine to provide faster disk space. For the cernvm3 VM model, only one disk device is presentd to the VM: either a logical volume or a file-backed virtual disk.

By default, the block device associated with the logical volume is available to the VM is vda, but this can changed with the root_device option in a [machinetype ...] section.

The global volume_group option in [settings] (default vac_volume_group) and the virtual machine's name are used to construct the logical volume path for each VM. For example, /dev/vac_volume_group/factory1-01.example.com

You must create the volume group with a name that matches the volume_group setting. Vac will create the logical volumes when it creates VMs.

During the creation of each virtual machine instance, Vac will attempt to create the logical volume in volume_group with lvcreate. The setting disk_gb_per_processor can be used to set its size explicitly, but by default Vac will calculate the size based on the available space which is not occupied by Vac logical volumes. The calculation takes into account the space occupied by logical volumes created for other purposes so you can safely create volumes for partitions used by the hypervisor machine if necessary. The available space will nominally be shared between the allocatable processors given by total_processors and then multiplied by the number of logical processors per machine to give the size of logical volume to create for each VM at the same time as the VM itself is created. You will need about 40GB of space per processor for most machinetypes.

Installation of Vac: tar vs RPM

RPM is the recommended installation procedure, and RPMs are available from the Vac downloads area.

It is also possible to install Vac from a tar file, using the install Makefile target.

Configuration of the Vac space

The [settings] section must include a vac_space name, which is also used as the virtual CE name.

The factories option takes a space separated list of the fully qualified domain names of all the factories in this Vac space, including this factory. The factories are queried using UDP when a factory needs to decide which machinetype to start. The Vac responder process on the factories replies to these queries with a summary of the VM and the outcome of recent attempts to run a VM of each machinetype.

For ease of management, the factories option can be placed in its own [settings] section in a separate configuration file in /etc/vac.d which can be automatically generated and maintained from another source, such as the site's assets database.

GOCDB and GGUS

Vac is designed to work within the WLCG/EGI grid model of sites composed of one or more CEs. Each Vac space name corresponds to one CE within a site, and can co-exist with conventional CREAM or ARC CEs. If you are at a site registered in the GOCDB, you should add your space name(s) to your site in GOCDB as services. There is a registered service type (uk.ac.gridpp.vac) for Vac spaces.

Problems encountered during the operation of Vac in production may appear as tickets in GGUS. The Vac/Vcycle Support Unit appears under "Second Level - Software" on the GGUS "Assign ticket to support unit" menu.

Vac writes APEL accounting records as described below. The GOCDB site name given by gocdb_sitename in [settings] is included in these records. To avoid the risk of polluting the central APEL database with incorrect site names, please use your real GOCDB sitename for this option.

Setting up machinetypes

One [machinetype ...] section must exist for each machinetype in the system, with the name of the machinetype given in the section name, such as [machinetype example]. A machinetype name must only consist of lowercase letters, numbers, and hyphens. The vac.conf(5) man page lists the options that can be given for each machinetype.

The target_share option for the machinetype gives the desired share of the total VMs available in this space for that machinetype. The shares do not need to add up to 1.0, and if a share is not given for a machinetype, then it is set to 0. The creation of new VMs can be completely disabled by setting all shares to 0. Vac factories consult these shares when deciding which machinetype to start as VMs become available.

For ease of management, the target_shares options can be grouped together in a separate file in /etc/vac.d apart from the main [machinetype ...] sections, which is convenient if shares are generated automatically or frequently edited by hand and pushed out to the factory machines. For example:

[machinetype example1]
target_share = 5.0
[machinetype example2]
target_share = 6.0
[machinetype example3]
target_share = 7.0

The experiment or VO responsible for each machinetype should supply step by step intructions on how to set up the rest of the [machinetype ...] section and how to create the files to be placed in its subdirectory of /var/lib/vac/machinetypes (likely to be a hostcert.pem and hostkey.pem pair to give to the VM, placed in the files subdirectory.)

In Vac 02.00 onwards, a VOs may publish Vacuum Pipe files which specify default options to create their VMs. In this case, a machinetype section will typically just contain the target_share and vacuum_pipe_url options.

Superslots for multiprocessor VMs

Vac 02.00 introduces the concept of superslots: VMs created within the same superslot always have the same finish time, and their processors are likely to become available at the same time. For example, if the VM factory is trying to manage a mix of single processor and 8 processor VMs, the superslots mechanism tries to run single processor VMs in groups to avoid fragmenting the resource and preventing 8 processor VMs from having a chance to start when appropriate.

Machinetype definitions including min_processors and max_processors options, allow VMs to be created with more than one virtual CPU. For Vac to create VMs with more than one processor, the option superslots must be set to a value greater than 1 in [settings], and this option also limits the largest number of processors which may be assigned to a single VM (which will occupy a whole superslot.) Vac may also try to create multiple VMs with less processors than processors_per_superslot to complete a superslot. The choice of which machinetype to create VMs for is driven by the target shares mechanism, but limited by the limits on processors, and min_wallclock_seconds and max_wallclock_seconds.

Starting and stopping vacd

The Vac daemon, vacd, is started and stopped by /etc/rc.d/init.d/vacd on conjunction with the usual service and chkconfig commands. As the configuration files are reread at the start of each cycle (by default, one per minute) it is not necessary to restart vacd after changing the configuration.

Furthermore, as vacd rereads the current state of the VMs from status files and the hypervisor at the start of each cycle, vacd can be restarted without disrupting running VMs or losing information about their state. In most cases it will even be possible to upgrade vacd from one patch level to another within the same minor release without having to drain the factory of running VMs. If problems arise during upgrades, the most likely outcome is that Vac will fail to create new VMs until the configuration is fixed, but the existing VMs will continue to run. ("We want Vac failures to look like planned draining.") Furthermore, since Vac factory machines are autonomous, it is straightforward to upgrade one factory in a production Vac space to check the consequences.

Stopping VMs

It is possible to use libvirt's virsh command to list and kill individual VMs, however this can be done more gracefully by adding the shutdown_time option in [settings] which enforces a deadline for all VMs. This is one way to drain a factory of VMs in a planned way, by setting shutdown_time to the desired point in the future beyond which no VMs must be running. This has the advantage that /etc/machinefeatures/shutdowntime in existing VMs will be updated if necessary and new VMs created before the deadline will be started with /etc/machinefeatures/shutdowntime based on shutdown_time. This maximises the chance of doing useful work right up until the deadline.

However, if the VMs you are running do not frequently check for new values of /etc/machinefeatures/shutdowntime, then you will need to set a shutdown_time that gives a suitable grace period, typically allowing at least max_wallclock_seconds to elapse before the time of your deadline.

Using the vac command

The vac(1) man page explains how the vac command can be used to scan the current Vac space and display the VMs running, along with statistics about their CPU load and wall clock time.

Setting up Nagios

The check-vacd script installed in /usr/sbin can be used with Nagios to monitor the state of the vacd on a factory node.

It can be run from the local Nagios nrpe daemon with a line like this in its configuration file:

command[check-vacd]=/usr/sbin/check-vacd 600

which raises an alarm if the vacd heartbeat wasn't updated in the last 600 seconds.

APEL accounting

When Vac detects that a VM has run for at least fizzle_seconds and now finished, it writes a copy of the APEL accounting message to subdirectories of /var/lib/vac/apel-archive . If you have set gocdb_sitename in [settings], then the file is also written to /var/lib/vac/apel-outgoing .

Vac uses the UUID of the VM as the local job ID, the factory hostname as the local user ID, and the machinetype name as the batch queue name. A unique user DN is constructed from the components of the Vac space name. For example, vac01.example.com becomes /DC=com/DC=example/DC=vac01 . If the accounting_fqan option is present in the [machinetype ...] section, then for VMs of that type the value of that option is included as the user FQAN, which indicates the VO associated with the VM. The GOCDB sitename field is either the value you gave explicitly or the Vac space name as a placeholder.

These accounting messages are designed to be published to the central APEL service using the standard APEL ssmsend command, which can be run on each factory machine from cron. Please see the APEL SSM client documentation for details. To submit records you should agree use of APEL with the APEL team, have your certificate authorized, set up the correct APEL entries in GOCDB, and do any requested tests. On each factory it should be sufficient that: you install the apel-ssm RPM on each machine, install a host certificate (vac-apel-cert.pem) and key (vac-apel-key.pem) authorized to talk to APEL in /etc/grid-security, and make sure gocdb_sitename is set.

The Vac RPM installs suitable cron entries in /etc/cron.d/vac-ssmsend and will start running ssmsend as soon as the PEM files are installed in /etc/grid-security. The ssmsend command can safely be run multiple times per day as it does not connect to APEL if there are no new messages or if the PEM files are not in place, and deletes the copies in /var/lib/vac/apel-outgoing once they are sent. By default, the Vac RPM creates cron entries to run ssmsend at a randomly chosen minute each hour, chosen at install time.

APEL has consistency checking which expects APEL Sync summary records to be published for each month. The vac apel-sync command is also run from the vac-ssmsend cron entries at midday to make these records each day for the current month and for the complete previous month during the first day of the next month. These summaries are based on the contents of /var/lib/vac/apel-archive.

The records can then be republished at any time by copying the record files from /var/lib/vac/apel-archive to /var/lib/vac/apel-outgoing . In case of problems with APEL, you may wish to keep backups of apel-archive.

If you forget to give gocdb_sitename at some point, you can make copies of the records in /var/lib/vac/apel-archive with the "Site:" fields corrected to your GOCDB sitename and put them in /var/lib/vac/apel-outgoing for publishing when ssmsend is next run automatically from cron.

ssmsend is also run from /etc/rc.d/init.d/vacd when the vacd service is stopped. This should ensure that any accounting messages waiting to be sent are passed to the central APEL service as long as the factory is shutdown cleanly.

Puppet

A simple Puppet module for Vac exists as the file init.pp which is installed in the /usr/share/doc/vac-VERSION directory. There are extensive comments at the start of the file which outline how to use it.

Setting fizzle_seconds and backoff_seconds

For each machinetype you can set values of fizzle_seconds and backoff_seconds to make the most efficient use of your resources when deciding to start VMs.

When Vac detects that a VM has started successfully but found that the experiment doesn't have any jobs available, it makes an estimate of the fizzle_seconds value this outcome would correspond to. These estimates are recorded in /var/log/vacd-factory by messages similar to "Minimum MACHINETYPE fizzle_seconds=NNN ?". By looking at a sample of such lines for the machinetype you can arrive at a value of fizzle_seconds to use which is reliably slightly above the estimates you are seeing. The value will depend on what the VM does, the performance of the factory machine and its networking. However, for most VM architectures and factories a value of 600 seconds will work perfectly well and you do not need to spend a lot of effort arriving at an ideal number.

The value of backoff_seconds is a matter for your site policy about how quickly to recover from periods when each experiment has no work available. If you know a particular experiment usually has plenty of work, then you could set a low value of 600 seconds, so that Vac will try creating one VM every 10 minutes or so to see if work is available again after a short idle period. Alternatively, if you know that an experiment usually has no work for you, then you could set much larger values of many hours between creating VMs. In either case, once Vac identifies that VMs for the experiment are indeed passing fizzle_seconds and finding work to do, then the backoff_seconds option is ignored and VMs are created as VM slots become free in line with the machinetype's target_share.

Vac Site Admin Guide 03.00

Contents