Difference between revisions of "UK CVMFS Deployment"

Latest revision as of 10:29, 8 May 2017

Warning: Configuration information is out of date

Introduction

This page details the deployment of CVMFS across the UK cloud. Details about what CVMFS is can be found in the Additional documentation section.

During the ADC meeting on Monday 1st August 2011, Doug Benjamin gave a presentation on ATLAS CVMFS deployment. The main conclusions of this talk were:

CVMFS will become how ATLAS distributes software on the grid; the current method will go away.
Sites that have not familiarized themselves with CVMFS should do so.

As well as being used for distributing software, CVMFS will also be used to access the flat conditions data files. This will mean that hotdisk should no longer be required.

ATLAS have not yet set a definite timeline for when they will withdraw their support for the current software install method. However, it would be unwise to wait until they have, as this would leave little time to fix issues as they arise.

Deployment Strategy

Overview

Sites may follow their own procedure if they wish, but this is the recommended one, as it allows the setup to be fully tested before switchover. Deployment will happen in 3 stages.

The first stage is done entirely by the site and ends when the site has installed (but not started to use) cvmfs across the whole of the farm that ATLAS can use.

The second stage will be performed by ATLAS cloud support. It involves submitting ATLAS jobs via the standard production mechanisms but overriding the $VO_ATLAS_SW_DIR parameter to use CVMFS. If this works, the panda queue for analysis work can be modified so that user jobs use CVMFS too.

In the third stage, once this has been demonstrated to work for a few weeks, the entire site can switch over to CVMFS. This will require the involvement of Alessandro de Salvo, who will change the installation settings for the site. It will also require all the releases at the site to be re-validated, which will mean that the site can run only a limited number of jobs for a day or so.

The hope is that all UK sites will be able to have CVMFS set up comfortably before any ATLAS deadline. GGUS tickets will be sent to all sites during the deployment process.

Site Deployment

Install cernvm.repo, repo keys and rpms

# Install repo file
cd /etc/yum.repos.d/
wget http://cvmrepo.web.cern.ch/cvmrepo/yum/cernvm.repo

# Install repo keys
cd /etc/pki/rpm-gpg/
wget http://cvmrepo.web.cern.ch/cvmrepo/yum/RPM-GPG-KEY-CernVM

# Install rpms
yum -y install fuse cvmfs SL_no_colorls cvmfs-init-scripts

Create the cache space

By default the cache is in /var/cache. If you put it somewhere else (I put it in my scratch area, which is much larger) you need to create the directory with the correct owner and permissions. If you move it to /tmp, check that it is not affected by cleanup scripts such as tmpwatch. This is the directory that CVMFS_CACHE_BASE in default.local (see below) points to, so be careful to use the same path. Below is what I used (it matches the value in the default.local example below):

mkdir -p /scratch/var/cache/cvmfs2
chown cvmfs:cvmfs /scratch/var/cache/cvmfs2
chmod 700 /scratch/var/cache/cvmfs2
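
As a quick sanity check after creating the directory, something like the following could be used. It is only a sketch: check_cache_dir is a hypothetical helper name, not part of cvmfs, and stat -c is the GNU coreutils form of the command.

```shell
#!/bin/bash
# Sketch: verify that a CVMFS cache directory exists with mode 700.
# check_cache_dir is a hypothetical helper, not something shipped with cvmfs.
check_cache_dir() {
    local dir="$1"
    if [ ! -d "$dir" ]; then
        echo "missing: $dir"
        return 1
    fi
    # stat -c is GNU coreutils syntax: %a is the octal mode, %U:%G owner and group
    local mode
    mode=$(stat -c '%a' "$dir")
    if [ "$mode" != "700" ]; then
        echo "wrong mode $mode on $dir (expected 700)"
        return 1
    fi
    echo "ok: $dir owned by $(stat -c '%U:%G' "$dir")"
}
```

For the example above you would call check_cache_dir /scratch/var/cache/cvmfs2 and expect the owner to be reported as cvmfs:cvmfs.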

Install cvmfs configuration files

Below is the Manchester configuration. You might need to tweak some parameters: CVMFS_CACHE_BASE and CVMFS_HTTP_PROXY in default.local and, if you really have disk space problems on the WN, the CVMFS_QUOTA_LIMIT values (unit is MB) in the first 3 files. Leave cern.ch.local as it is; it is already set correctly for UK sites. If you are a non-UK European site you don't need to create cern.ch.local, as cern.ch.conf will work for you. If you are a US site you need to put one of the US servers first.

Generic configuration

/etc/cvmfs/default.local
CVMFS_REPOSITORIES=atlas,atlas-condb,lhcb
CVMFS_CACHE_BASE=/scratch/var/cache/cvmfs2
CVMFS_QUOTA_LIMIT=2000
CVMFS_HTTP_PROXY="http://[YOUR-SQUID-CACHE]:3128"

Atlas configuration

/etc/cvmfs/config.d/atlas.cern.ch.local
CVMFS_QUOTA_LIMIT=10000

Lhcb configuration

/etc/cvmfs/config.d/lhcb.cern.ch.local
CVMFS_QUOTA_LIMIT=5000

Domain configuration

/etc/cvmfs/domain.d/cern.ch.local
CVMFS_SERVER_URL="http://cernvmfs.gridpp.rl.ac.uk/opt/@org@;http://cvmfs-stratum-one.cern.ch/opt/@org@;http://cvmfs.racf.bnl.gov/opt/@org@"
CVMFS_PUBLIC_KEY=/etc/cvmfs/keys/cern.ch.pub
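
If disk space on the WN is tight, it can help to add up the quotas before choosing values. The sketch below does that under two assumptions that are mine and not stated above: each repository keeps its own cache capped by its own CVMFS_QUOTA_LIMIT, and a repository without a config.d/<repo>.cern.ch.local file falls back to the value in default.local. total_quota_mb is a hypothetical helper name.

```shell
#!/bin/bash
# Sketch: estimate the total cache space (MB) the configured repositories may use.
# Assumptions: one cache per repository, each capped by its own CVMFS_QUOTA_LIMIT,
# with default.local providing the fallback value. total_quota_mb is hypothetical.
total_quota_mb() {
    local confdir="$1"    # e.g. /etc/cvmfs
    local repos default total=0 repo limit file
    # Read the repository list and the default quota from default.local
    repos=$(sed -n 's/^CVMFS_REPOSITORIES=//p' "$confdir/default.local" | tr -d '"' | tr ',' ' ')
    default=$(sed -n 's/^CVMFS_QUOTA_LIMIT=//p' "$confdir/default.local" | tr -d '"')
    default=${default:-0}
    for repo in $repos; do
        file="$confdir/config.d/$repo.cern.ch.local"
        limit=""
        if [ -f "$file" ]; then
            limit=$(sed -n 's/^CVMFS_QUOTA_LIMIT=//p' "$file" | tr -d '"')
        fi
        # Use the per-repository value if present, otherwise the default
        total=$(( total + ${limit:-$default} ))
    done
    echo "$total"
}
```

With the files shown above, total_quota_mb /etc/cvmfs would add 10000 (atlas), 2000 (atlas-condb, from the default) and 5000 (lhcb), giving 17000 MB, which should fit comfortably in the partition holding CVMFS_CACHE_BASE.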

Configure fuse

/etc/fuse.conf
user_allow_other
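
If you push this change from a script that may run more than once, you probably want to avoid appending the line repeatedly. A minimal sketch, where ensure_line is a hypothetical helper name:

```shell
#!/bin/bash
# Sketch: append a line to a file only if that exact line is not already there.
# ensure_line is a hypothetical helper name.
ensure_line() {
    local line="$1" file="$2"
    # -q: quiet, -x: match the whole line, -F: treat the pattern as a fixed string
    if ! grep -qxF -- "$line" "$file" 2>/dev/null; then
        echo "$line" >> "$file"
    fi
}
```

Here the call would be ensure_line user_allow_other /etc/fuse.conf, run as root.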

Run the setup

/usr/bin/cvmfs_config setup
chkconfig cvmfs on
service cvmfs restartautofs
service cvmfs restart
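
After the restart it is worth checking that the repositories actually mount. The sketch below simply lists each repository directory, which should be enough to make autofs mount it. check_repos is a hypothetical helper, and the mount root is passed as an argument only so that the function can be tried outside a real WN.

```shell
#!/bin/bash
# Sketch: confirm that each repository can be listed under the mount point.
# Listing the directory is what should trigger autofs to mount it.
# check_repos is a hypothetical helper name.
check_repos() {
    local root="$1"; shift
    local repo failed=0
    for repo in "$@"; do
        if ls "$root/$repo" >/dev/null 2>&1; then
            echo "OK      $repo"
        else
            echo "FAILED  $repo"
            failed=1
        fi
    done
    return "$failed"
}
```

For the configuration above the call would be check_repos /cvmfs atlas.cern.ch atlas-condb.cern.ch lhcb.cern.ch. A non-zero return code means at least one repository could not be listed.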

Squid configuration

Some parameters need to change for squid. Below is what the documentation suggests; I tuned the values to the size of my machine. For example, the suggested 4 GB for maximum_object_size and cache_mem was too big for our squid. I also checked which other parameters were already set, to decide whether they needed changing.

collapsed_forwarding on
max_filedesc 8192
maximum_object_size 4096 MB
cache_mem 4096 MB
maximum_object_size_in_memory 32 KB
cache_dir ufs /var/spool/squid 50000 16 256

How many squids?

It is recommended to have more than one squid to maintain availability, as explained in the small sites section of the RAL T1 installation notes.

squid ACLs

If you are restricting the destinations (dst in the configuration), i.e. the servers you get the data from, you will need to add the 3 cvmfs servers listed in /etc/cvmfs/domain.d/cern.ch.local. Note that each of these server names has two IP addresses.

If instead you are restricting only the sources (src in the configuration), i.e. the WNs accessing your squid, you don't need to modify anything.
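
To get the list of host names for a dst ACL without copying them by hand, the CVMFS_SERVER_URL value can be split up as in the sketch below. server_hosts is a hypothetical helper name, and the output still has to be turned into squid acl lines yourself.

```shell
#!/bin/bash
# Sketch: print the host names found in a CVMFS_SERVER_URL value, one per line,
# in the order they appear. server_hosts is a hypothetical helper name.
server_hosts() {
    local urls="$1"
    # Split on ';', strip the scheme, strip any port and path, drop duplicates
    echo "$urls" | tr ';' '\n' \
        | sed -e 's|^[a-z]*://||' -e 's|[:/].*$||' \
        | awk 'NF && !seen[$0]++'
}

# Example (not run here): show every address of each server
# for h in $(server_hosts "$CVMFS_SERVER_URL"); do getent ahosts "$h"; done
```

Running the commented loop on the squid machine is one way to see all the addresses behind each server name before writing the ACL.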

cvmfs_fsck setup

Although the main problem with corrupted files was solved in cvmfs 2.0.3 (see http://savannah.cern.ch/support/?122564), there are still occasionally some corrupted files. To prevent these files from affecting jobs you can run cvmfs_fsck on a regular basis. It doesn't put a great load on the machines. Different sites have different policies on when to run it: RAL runs it every day, CERN every week, and Manchester went half way and runs cvmfs_fsck every 3 days. You need to run it on each cache instance. In Manchester I just have a small script, called by a cron job, that loops over the subdirectories of the cache directory (CVMFS_CACHE_BASE). Please customize it with your CVMFS_CACHE_BASE value:

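#!/bin/bash
# Run cvmfs_fsck on every cache instance under CVMFS_CACHE_BASE.

cache_dir='/scratch/var/cache/cvmfs2'

# Each subdirectory of the cache base is one cache instance
for a in "$cache_dir"/*; do

   # Skip anything that is not a directory
   [ -d "$a" ] || continue

   /usr/bin/cvmfs_fsck -p "$a"

done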
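
For the cron side, an /etc/cron.d entry along the following lines would approximate the "every 3 days" policy. This is only a sketch: the script path and the 03:00 start time are my assumptions for illustration, and fsck_cron_entry is a hypothetical helper that just prints the line.

```shell
#!/bin/bash
# Sketch: print an /etc/cron.d entry that runs an fsck wrapper script every third day.
# The script path passed in and the 03:00 start time are assumptions for illustration.
fsck_cron_entry() {
    local script="$1"
    # Fields: minute hour day-of-month month day-of-week user command
    # */3 in the day-of-month field fires on days 1, 4, 7, ... of each month
    echo "0 3 */3 * * root $script"
}
```

For example, fsck_cron_entry /usr/local/sbin/cvmfs_fsck_all.sh > /etc/cron.d/cvmfs_fsck would install it, where cvmfs_fsck_all.sh stands for wherever you saved the loop script. Note that the interval restarts at the beginning of each month, so it is not an exact 3-day period.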

CVMFS nagios probe

If you are using Nagios, a probe is supplied by the CernVM group here: Nagios probe

You can also use it standalone or in a cron job. You need to call it on each repository you have. Example:

check_cvmfs.sh atlas(.cern.ch)
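
Since the probe has to be called once per repository, a small wrapper can loop over them and report a single status. The sketch below keeps the highest exit code seen, relying on the usual Nagios plugin convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). run_probe is a hypothetical helper, and I am assuming the probe follows that convention and takes the repository name as its only argument.

```shell
#!/bin/bash
# Sketch: run a Nagios-style probe once per repository and return the highest exit code.
# run_probe is a hypothetical helper; the probe command is passed as the first argument.
run_probe() {
    local probe="$1"; shift
    local repo rc worst=0
    for repo in "$@"; do
        if "$probe" "$repo"; then
            rc=0
        else
            rc=$?
        fi
        # Nagios plugin convention: 0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN
        if [ "$rc" -gt "$worst" ]; then
            worst=$rc
        fi
    done
    return "$worst"
}
```

A call such as run_probe ./check_cvmfs.sh atlas atlas-condb lhcb would then fit in a cron job or a single Nagios service check.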

Things that can affect cvmfs

tmpwatch: if you put the cache in a directory that is periodically cleaned up by a cron job, make sure you exclude the cvmfs cache directories.
process patrol: if you have crons or equivalent to clean up users' leftover processes, make sure cvmfs processes are excluded.
mount point: while lhcb will work with any mount point, atlas is not relocatable yet and therefore works only if cvmfs is mounted under /cvmfs.

LHCb last step

Set VO_LHCB_SW_DIR to /cvmfs/lhcb.cern.ch. You can change it in YAIM and rerun it, or you can replace the value in /etc/profile.d/grid-env.sh. Make sure YAIM is changed as well, to avoid future grievances. The running jobs will finish, and the new ones will get the new env var value and will use the software from cvmfs.

ATLAS last steps

The following steps will happen in order.

1) Testing (this step is to catch most of the problems, if any, before Alessandro De Salvo validates in step 3)).

The appdir variable in SchedConfig for the analysis queues will be set to allow user jobs to access cvmfs without affecting production jobs. This will allow the installation to be monitored for about a week through HC tests, and it is easy to roll back if something goes wrong. Sites that don't run analysis will be tested with production or will simply go to step 2) and run a full validation.

If you are curious you can find the procedure the cloud squad will follow in the section below: ATLAS Testing

2) Final changes to the WNs setup

Set VO_ATLAS_SW_DIR to /cvmfs/atlas.cern.ch/repo/sw. As in the LHCb case, you can change it in YAIM and rerun it, or you can replace the directory in /etc/profile.d/grid-env.sh, remembering to change it in YAIM as well to avoid future grievances.
Set the ATLAS_LOCAL_AREA env var somewhere, pointing to a local directory on your nfs space, and make it writable by the atlas sgm account(s). I created a bash script in /etc/profile.d/ to take care of this; you might want to do the same:

/etc/profile.d/cvmfs.sh
export ATLAS_LOCAL_AREA=<atlas-nfs-sw-dir>/local
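
Before asking for validation you may want to confirm that the variable is picked up and the directory is writable. A minimal sketch, to be run as the atlas sgm account after logging in again so that /etc/profile.d is sourced; check_local_area is a hypothetical helper name.

```shell
#!/bin/bash
# Sketch: check that ATLAS_LOCAL_AREA is set and points to a writable directory.
# check_local_area is a hypothetical helper; run it as the atlas sgm account.
check_local_area() {
    local area="${ATLAS_LOCAL_AREA:-}"
    if [ -z "$area" ]; then
        echo "ATLAS_LOCAL_AREA is not set"
        return 1
    fi
    if [ ! -d "$area" ] || [ ! -w "$area" ]; then
        echo "$area is not a writable directory"
        return 1
    fi
    echo "ok: $area"
}
```

Keep in mind that root passes the writability test on most filesystems, so running this as root says little about the sgm account.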

3) Software validation

Once the final changes have been applied, the ATLAS software team will send the validation jobs and re-validate all the tags present at your site.

Additional documentation

ATLAS Testing

Instructions for the cloud squad

Testing that the CVMFS setup works for ATLAS jobs will be performed by cloud support. It is possible to test manually before changing appdir, but it is easier and more thorough just to let the system do it. A T3 without an analysis queue can be tested by changing appdir in the prod queues or by running a full validation.

Manual

Submit individual test jobs to the site which will run a short ATLAS job using cvmfs. This job calculates the beam spot position in AOD data and requires access to both CVMFS and the conditions data. A Hammer Cloud test is also being set up so that this test can be run by more people.

To run a test manually, you will need an ATLAS grid certificate. Start by configuring prun on your system:

export PATHENA_GRID_SETUP_SH=/afs/cern.ch/project/gd/LCG-share/current/etc/profile.d/grid_env.sh
source /afs/cern.ch/atlas/offline/external/GRID/DA/panda-client/latest/etc/panda/panda_setup.sh

Then create a file called gridscript.sh with the following lines in it:

#!/bin/bash
echo "import AthenaPoolCnvSvc.ReadAthenaPool" > jobOptions.py
echo "from AthenaCommon.AppMgr import ServiceMgr" >> jobOptions.py
echo "ServiceMgr.EventSelector.InputCollections = [\"$1\"]">> jobOptions.py
echo "theApp.EvtMax = -1" >> jobOptions.py
echo "ToolSvc = Service('ToolSvc')" >> jobOptions.py
echo "from InDetBeamSpotFinder.InDetBeamSpotFinderConf import InDet__InDetBeamSpotReader as InDetBeamSpotReader" >> jobOptions.py
echo "from AthenaCommon.AlgSequence import AlgSequence" >> jobOptions.py
echo "topSequence = AlgSequence()" >> jobOptions.py
echo "topSequence += InDetBeamSpotReader()" >> jobOptions.py
# Configure how the job accesses the conditions files
echo "from AthenaCommon.GlobalFlags import globalflags" >> jobOptions.py
echo "globalflags.DetGeo.set_Value_and_Lock('atlas')" >> jobOptions.py
echo "globalflags.DataSource.set_Value_and_Lock('data')" >> jobOptions.py
echo "from IOVDbSvc.CondDB import conddb" >> jobOptions.py
echo "include(\"InDetBeamSpotService/BeamCondSvc.py\")" >> jobOptions.py

export VO_ATLAS_SW_DIR=/cvmfs/atlas.cern.ch/repo/sw
export RELEASE=17.0.3
export ATLAS_POOLCOND_PATH="/cvmfs/atlas.cern.ch/repo/conditions"
echo "setting VO_ATLAS_SW_DIR to $VO_ATLAS_SW_DIR" 
echo "The release to be used is $RELEASE"
source $VO_ATLAS_SW_DIR/software/$RELEASE/cmtsite/asetup.sh AtlasOffline $RELEASE
athena jobOptions.py

Finally run the following command:

prun --exec "source gridscript.sh %IN" --outDS user.dewhurst.CVMFSTest.1 --inDS data10_7TeV.00161379.physics_Muons.merge.AOD.f282_m578 --nFilesPerJob 1 --site ANALY_RAL

where you will need to change "dewhurst" to your panda username and ANALY_RAL to the site of your choice, and make sure that the dataset you pick is actually at the site. This code will run on any AOD from real data (i.e. the dataset name starts with data and contains AOD).

System

Change the analysis queue at the site to use CVMFS. This will be done by changing the appdir parameter in the schedconfig to point to CVMFS:

appdir='/cvmfs/atlas.cern.ch/repo/sw'

This will direct user jobs to use CVMFS while not touching production jobs or your installation setup. This can be left for a few weeks to make sure things are working correctly. If things are not working correctly, your site will be set broker-off by the Hammer Cloud tests and, if the error is not trivial to fix, the schedconfig change can be quickly rolled back. You can see if your site is passing Hammer Cloud tests by going to: http://panda.cern.ch/server/pandamon/query?job=*&site=ANALY_RAL&type=analysis&hours=12&processingType=gangarobot where you can change ANALY_RAL to the analysis queue of your site. Once both the site and cloud support are happy that the jobs are running well, the final steps to full production can be made.

Current Deployment status


Sites	Installed	Tested	In Production	Tier
RAL-LCG2	yes	yes	yes	T1
UKI-LT2-Brunel	yes	yes	yes	T3
UKI-LT2-IC-HEP	yes	yes	yes	T3
UKI-LT2-QMUL	yes	yes	yes	T2
UKI-LT2-RHUL	yes	yes	yes	T2
UKI-LT2-UCL-HEP	yes	yes	yes	T2
UKI-NORTHGRID-LANCS-HEP	yes	yes	yes	T2
UKI-NORTHGRID-LIV-HEP	yes	yes	yes	T2
UKI-NORTHGRID-MAN-HEP	yes	yes	yes	T2
UKI-NORTHGRID-SHEF-HEP	yes	yes	yes	T2
UKI-SCOTGRID-DURHAM	yes	yes	yes	T3
UKI-SCOTGRID-ECDF	yes	yes	yes	T2
UKI-SCOTGRID-GLASGOW	yes	yes	yes	T2
UKI-SOUTHGRID-BHAM-HEP	yes	yes	yes	T2
UKI-SOUTHGRID-BRIS-HEP	yes	yes	yes	T2
UKI-SOUTHGRID-CAM-HEP	yes	yes	yes	T2
UKI-SOUTHGRID-OX-HEP	yes	yes	yes	T2
UKI-SOUTHGRID-RALPP	yes	yes	yes	T2

Revision as of 16:35, 25 February 2016 by Robin Long; latest revision as of 10:29, 8 May 2017 by Alessandra Forti.
