UK CVMFS Deployment
Latest revision as of 10:29, 8 May 2017
Warning: Configuration information is out of date
- 1 Introduction
- 2 Deployment Strategy
- 2.1 Overview
- 2.2 Site Deployment
- 2.2.1 Install cernvm.repo, repo keys and rpms
- 2.2.2 Create the cache space
- 2.2.3 Install cvmfs configuration files
- 2.2.4 Configure fuse
- 2.2.5 Run the setup
- 2.2.6 Squid configuration
- 2.2.7 cvmfs_fsck setup
- 2.2.8 CVMFS nagios probe
- 2.2.9 Things that can affect cvmfs
- 2.2.10 LHCb last step
- 2.2.11 ATLAS last steps
- 2.2.12 Additional documentation
- 2.3 ATLAS Testing
- 3 Current Deployment status
This page details the deployment of CVMFS across the UK cloud. Details about what CVMFS is can be found in the Additional documentation section.
During the ADC meeting on Monday 1st August 2011, Doug Benjamin gave a presentation on ATLAS CVMFS deployment. The main conclusions of this talk were:
- CVMFS will become how ATLAS distributes software on the grid; the current method will go away.
- Sites who have not familiarized themselves with CVMFS should do so.
As well as being used for distributing software, CVMFS will also be used to access the flat conditions data files. This will mean that hotdisk should no longer be required.
ATLAS have not yet set a definite timeline for when they will withdraw their support for the current software install method. However, it would be unwise to wait until they have, as this would leave little time to fix issues as and when they arise.
Sites may do their own thing if they wish; however, this is the recommended procedure, which should allow full testing of the setup before switchover. Deployment will happen in 3 stages. The first stage is done entirely by the site and ends when the site has installed (but not started to use) cvmfs across the whole of the farm that ATLAS can use. The second stage will be performed by ATLAS cloud support. This involves submitting ATLAS jobs via the standard production mechanisms but overriding the $VO_ATLAS_SW_DIR parameter to use CVMFS. If this works, the panda queue for analysis work can be modified so that user jobs use CVMFS too. Once this has been demonstrated to work for a few weeks, the third stage is for the entire site to switch over to CVMFS. This will require the involvement of Alessandro De Salvo, who will change the installation settings for the site. It will also require all the releases at the site to be re-validated, which will mean that the site can run only a limited number of jobs for a day or so.
The hope is that all UK sites will be able to have CVMFS set up comfortably before any ATLAS deadline. GGUS tickets will be sent to all sites during the deployment process.
Install cernvm.repo, repo keys and rpms
# Install repo file
cd /etc/yum.repos.d/
wget http://cvmrepo.web.cern.ch/cvmrepo/yum/cernvm.repo
# Install repo keys
cd /etc/pki/rpm-gpg/
wget http://cvmrepo.web.cern.ch/cvmrepo/yum/RPM-GPG-KEY-CernVM
# Install rpms
yum -y install fuse cvmfs SL_no_colorls cvmfs-init-scripts
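Before moving on, it can be worth confirming that the repo file and key actually landed where the commands above put them. The following is a minimal sketch; the function name check_files_exist is an illustration and not part of any CVMFS tooling.

```shell
# Report which of the given files are missing; return non-zero if any are.
# check_files_exist is a hypothetical helper name used for illustration.
check_files_exist() {
    missing=0
    for f in "$@"; do
        if [ -f "$f" ]; then
            echo "OK      $f"
        else
            echo "MISSING $f"
            missing=1
        fi
    done
    return "$missing"
}
```

For example: check_files_exist /etc/yum.repos.d/cernvm.repo /etc/pki/rpm-gpg/RPM-GPG-KEY-CernVM. The installed packages themselves can be queried with rpm -q fuse cvmfs cvmfs-init-scripts.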
Create the cache space
By default the cache is in /var/cache. If you put it somewhere else (I put it in my scratch area, which is much larger), you need to create the directory with the correct owner and permissions. If you move it to /tmp, check that it is not affected by cleanup scripts such as tmpwatch. This is the directory that CVMFS_CACHE_BASE in default.local (see below) points to, so be careful to use the same path in both places. Below is what I used (it matches the value in the default.local example below):
mkdir -p /scratch/var/cache/cvmfs2
chown cvmfs:cvmfs /scratch/var/cache/cvmfs2
chmod 700 /scratch/var/cache/cvmfs2
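A quick way to confirm that the directory ended up with the owner and mode set above is sketched here. It assumes GNU stat (coreutils); the function name check_cache_dir is hypothetical.

```shell
# Return 0 if the directory exists with the expected owner and octal mode.
# Relies on GNU stat (-c); check_cache_dir is an illustrative name only.
check_cache_dir() {
    dir=$1; want_owner=$2; want_mode=$3
    [ -d "$dir" ] || { echo "$dir: not a directory"; return 1; }
    owner=$(stat -c %U "$dir")
    mode=$(stat -c %a "$dir")
    if [ "$owner" = "$want_owner" ] && [ "$mode" = "$want_mode" ]; then
        echo "$dir: ok ($owner, $mode)"
    else
        echo "$dir: found $owner/$mode, expected $want_owner/$want_mode"
        return 1
    fi
}
```

With the values used on this page the call would be: check_cache_dir /scratch/var/cache/cvmfs2 cvmfs 700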
Install cvmfs configuration files
Below is the Manchester configuration. You might need to tweak some parameters: CVMFS_CACHE_BASE and CVMFS_HTTP_PROXY in default.local and, if you really have disk space problems on the WN, the CVMFS_QUOTA_LIMIT values (unit is MB) in the first 3 files. Leave cern.ch.local as it is; it is already set correctly for UK sites. If you are a non-UK European site you don't need to create cern.ch.local, because cern.ch.conf will work for you; if you are a US site you need to put one of the US servers first.
- Generic configuration
/etc/cvmfs/default.local
CVMFS_REPOSITORIES=atlas,atlas-condb,lhcb
CVMFS_CACHE_BASE=/scratch/var/cache/cvmfs2
CVMFS_QUOTA_LIMIT=2000
CVMFS_HTTP_PROXY="http://[YOUR-SQUID-CACHE]:3128"
- Atlas configuration
- Lhcb configuration
- Domain configuration
/etc/cvmfs/domain.d/cern.ch.local
CVMFS_SERVER_URL="http://cernvmfs.gridpp.rl.ac.uk/opt/@org@;http://cvmfs-stratum-one.cern.ch/opt/@org@;http://cvmfs.racf.bnl.gov/opt/@org@"
CVMFS_PUBLIC_KEY=/etc/cvmfs/keys/cern.ch.pub
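To see which URLs a given repository will actually be fetched from, the server list above can be expanded by hand. This sketch assumes that @org@ is replaced by the short repository name (atlas, atlas-condb, lhcb) and that servers are tried in the order listed; expand_server_urls is a hypothetical helper, not a cvmfs command.

```shell
# Print one URL per line, with @org@ replaced by the short repository name.
# Arguments: the ';'-separated CVMFS_SERVER_URL value and the repository name.
expand_server_urls() {
    urls=$1; org=$2
    printf '%s\n' "$urls" | tr ';' '\n' | sed "s|@org@|$org|g"
}
```

For instance, after sourcing /etc/cvmfs/domain.d/cern.ch.local, running expand_server_urls "$CVMFS_SERVER_URL" atlas lists the three atlas URLs with the RAL server first.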
Run the setup
/usr/bin/cvmfs_config setup
chkconfig cvmfs on
service cvmfs restartautofs
service cvmfs restart
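Once the services have been restarted, a simple smoke test is to list each repository under the mount point, since accessing the path is what makes autofs mount it. The sketch below is only an illustration; the function names are hypothetical, and the fully qualified repository names in the usage line are assumed from the /cvmfs/atlas.cern.ch and /cvmfs/lhcb.cern.ch paths used elsewhere on this page.

```shell
# Return 0 if the directory can be listed and is not empty.
# Listing the path is what triggers autofs to mount the repository.
repo_is_mounted() {
    [ -n "$(ls -A "$1" 2>/dev/null)" ]
}

# Check every repository given as an argument under a common mount root.
check_repos() {
    root=$1; shift
    rc=0
    for repo in "$@"; do
        if repo_is_mounted "$root/$repo"; then
            echo "$repo: ok"
        else
            echo "$repo: FAILED"
            rc=1
        fi
    done
    return "$rc"
}
```

A typical call would be: check_repos /cvmfs atlas.cern.ch lhcb.cern.ch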
Squid configuration
Some parameters need to change for squid. Below is what the documentation suggests. I tuned it to the size of my machine: for example, the suggested 4 GB for maximum_object_size and cache_mem was too big for our squid. I also checked which of the other parameters were already set, to decide whether they needed changing.
collapsed_forwarding on
max_filedesc 8192
maximum_object_size 4096 MB
cache_mem 4096 MB
maximum_object_size_in_memory 32 KB
cache_dir ufs /var/spool/squid 50000 16 256
How many squids?
It is recommended to have more than one squid to maintain availability, as explained in the small sites section of the RAL T1 installation notes.
If you are restricting the destinations (dst in the configuration), i.e. the servers you get the data from, you will need to add the 3 cvmfs servers listed in /etc/cvmfs/domain.d/cern.ch.local. Note that each of these server names resolves to two IP addresses.
If instead you are restricting only the sources (src in the configuration), i.e. the WNs accessing your squid, you don't need to modify anything.
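If you do restrict destinations, the host names to allow can be pulled straight out of CVMFS_SERVER_URL rather than copied by hand. This is a minimal sketch using only standard text tools; server_hosts is a hypothetical helper name.

```shell
# Print the unique host names found in a ';'-separated list of URLs.
# Strips the scheme, then everything from the first ':' or '/' onwards.
server_hosts() {
    printf '%s\n' "$1" | tr ';' '\n' |
        sed -e 's|^[a-z]*://||' -e 's|[:/].*$||' | LC_ALL=C sort -u
}
```

Each name printed can then be resolved (for example with the host command) to see the addresses that a dst rule has to cover.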
cvmfs_fsck setup
Although the main problem with corrupted files was solved in cvmfs 2.0.3 (see http://savannah.cern.ch/support/?122564), there are still occasional corrupted files. To prevent these files from affecting jobs you can run cvmfs_fsck on a regular basis. It doesn't put much load on the machines. Different sites have different policies on how often to run it: RAL runs it every day, CERN every week, and Manchester went halfway and runs cvmfs_fsck every 3 days. You need to run it on each cache instance. In Manchester I just have a small script, called by a cron job, that loops over the subdirectories of the cache directory (CVMFS_CACHE_BASE). Please customize it with your CVMFS_CACHE_BASE value:
#!/bin/bash
# Run cvmfs_fsck on every cache instance under CVMFS_CACHE_BASE
cache_dir='/scratch/var/cache/cvmfs2'
for a in "$cache_dir"/*; do
    [ -d "$a" ] || continue
    /usr/bin/cvmfs_fsck -p "$a"
done
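The cron side of this is not shown above, so here is one possible way to generate the entry. It is a sketch only: the script location /usr/local/sbin/cvmfs_fsck_all.sh and the 03:00 start time are assumptions, and make_fsck_cron_entry is a hypothetical helper. Note that */3 in the day-of-month field restarts on the 1st of each month, so the interval is only approximately 3 days.

```shell
# Print a cron.d-style line that runs the given script as root at 03:00
# on every third day of the month (*/3 in the day-of-month field).
make_fsck_cron_entry() {
    printf '0 3 */3 * * root %s\n' "$1"
}
```

The output of make_fsck_cron_entry /usr/local/sbin/cvmfs_fsck_all.sh could be written to a file under /etc/cron.d/ (the file name is up to the site).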
CVMFS nagios probe
If you are using Nagios, a probe supplied by the CernVM group is available here: Nagios probe
You can also use it standalone or in a cron job. You need to call it once for each repository you have. Example:
Things that can affect cvmfs
tmpwatch: if you put the cache in a directory that is periodically cleaned up by a cron job, make sure you exclude the cvmfs cache directories.
process patrol: if you have crons or equivalent to clean up users' leftover processes, make sure cvmfs processes are excluded.
mount point: while LHCb will work with any mount point, ATLAS is not relocatable yet and therefore works only if cvmfs is mounted under /cvmfs.
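As a small aid for the tmpwatch point, the check below flags a cache path that sits under /tmp or /var/tmp. Which directories are actually swept depends on the site's own cron jobs, so the two paths are an assumption, and cache_in_cleaned_area is a hypothetical name.

```shell
# Return 0 if the cache path lies under a directory that tmpwatch-style
# cleanup jobs commonly sweep (/tmp or /var/tmp), 1 otherwise.
cache_in_cleaned_area() {
    case "$1" in
        /tmp|/tmp/*|/var/tmp|/var/tmp/*) return 0 ;;
        *) return 1 ;;
    esac
}
```

If the function returns 0 for your CVMFS_CACHE_BASE, the cleanup job needs an exclusion for that path; tmpwatch, for example, accepts -x to skip a directory.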
LHCb last step
- Set VO_LHCB_SW_DIR to /cvmfs/lhcb.cern.ch. You can change it in YAIM and rerun it, or you can replace the value in /etc/profile.d/grid-env.sh. Make sure YAIM is changed as well, to avoid future grievances. The running jobs will finish, and new ones will get the new env var value and use the software from cvmfs.
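After the change, a fresh login shell on a WN should show the new value. The following sketch compares an environment variable against the value it is expected to have; check_env_value is a hypothetical helper.

```shell
# Compare the value of an environment variable (given by name) with the
# value it is expected to have; an unset variable counts as empty.
check_env_value() {
    name=$1; expected=$2
    actual=$(printenv "$name") || actual=""
    if [ "$actual" = "$expected" ]; then
        echo "$name ok"
    else
        echo "$name is '$actual', expected '$expected'"
        return 1
    fi
}
```

For LHCb the call is: check_env_value VO_LHCB_SW_DIR /cvmfs/lhcb.cern.ch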
ATLAS last steps
The following steps will happen in order:
- 1) Testing (this step is meant to catch most of the problems, if any, before Alessandro De Salvo validates in step 3)).
- The appdir variable in SchedConfig for the analysis queues will be set to allow user jobs to access cvmfs without affecting production jobs. This allows the installation to be monitored for about a week through HC tests and is easy to roll back if something goes wrong. Sites that don't run analysis will be tested with production or simply go to step 2) and run a full validation.
If you are curious you can find the procedure the cloud squad will follow in the section below: ATLAS Testing
- 2) Final changes to the WNs setup
- Set VO_ATLAS_SW_DIR to /cvmfs/atlas.cern.ch/repo/sw. As in the LHCb case, you can change it in YAIM and rerun it, or you can replace the directory in /etc/profile.d/grid-env.sh, remembering to change it in YAIM as well to avoid future grievances.
- Set the ATLAS_LOCAL_AREA env var so that it points to a local directory on your nfs space, and make that directory writable by the atlas sgm account(s). I created a bash script in /etc/profile.d/ to take care of this; you might want to do the same:
/etc/profile.d/cvmfs.sh
export ATLAS_LOCAL_AREA=<atlas-nfs-sw-dir>/local
- 3) Software validation
- Once the final changes have been applied, the ATLAS software team will send the validation jobs and re-validate all the tags present at your site.
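Before asking for validation it may help to confirm, as one of the atlas sgm accounts, that ATLAS_LOCAL_AREA from step 2) is set and writable. This is a sketch under that assumption; check_local_area is a hypothetical helper.

```shell
# Return 0 if the directory named by ATLAS_LOCAL_AREA exists and is
# writable by the user running the check.
check_local_area() {
    area=${ATLAS_LOCAL_AREA:-}
    [ -n "$area" ] || { echo "ATLAS_LOCAL_AREA is not set"; return 1; }
    [ -d "$area" ] || { echo "$area is not a directory"; return 1; }
    [ -w "$area" ] || { echo "$area is not writable"; return 1; }
    echo "$area ok"
}
```

Run it from a login shell so that /etc/profile.d/cvmfs.sh has been sourced.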
Additional documentation
- Release notes
- Init scripts
- Technical Report
- RAL T1 Installation Notes
- T3 setup page
- Latest changes
- A. Forti blog
- Instructions for the cloud squad
Testing that the CVMFS setup works for ATLAS jobs will be performed by cloud support. It is possible to test manually before changing appdir, but it is easier and more thorough just to let the system do it. A T3 without an analysis queue can be tested by changing appdir in the prod queues or by running a full validation.
Submit individual test jobs to the site which will run a short ATLAS job using cvmfs. This job calculates the beam spot position in AOD data and requires access to both CVMFS and the conditions data. A Hammer Cloud test is also being set up so that this test can be run by more people.
To run a test manually, you will need an ATLAS grid certificate. Start by configuring prun on your system:
export PATHENA_GRID_SETUP_SH=/afs/cern.ch/project/gd/LCG-share/current/etc/profile.d/grid_env.sh
source /afs/cern.ch/atlas/offline/external/GRID/DA/panda-client/latest/etc/panda/panda_setup.sh
Then create a file called gridscript.sh with the following lines in it:
#!/bin/bash
# Write the job options file; $1 is the input file passed in by prun
echo "import AthenaPoolCnvSvc.ReadAthenaPool" > jobOptions.py
echo "from AthenaCommon.AppMgr import ServiceMgr" >> jobOptions.py
echo "ServiceMgr.EventSelector.InputCollections = [\"$1\"]" >> jobOptions.py
echo "theApp.EvtMax = -1" >> jobOptions.py
echo "ToolSvc = Service('ToolSvc')" >> jobOptions.py
echo "from InDetBeamSpotFinder.InDetBeamSpotFinderConf import InDet__InDetBeamSpotReader as InDetBeamSpotReader" >> jobOptions.py
echo "from AthenaCommon.AlgSequence import AlgSequence" >> jobOptions.py
echo "topSequence = AlgSequence()" >> jobOptions.py
echo "topSequence += InDetBeamSpotReader()" >> jobOptions.py
# Configure how the job accesses conditions files
echo "from AthenaCommon.GlobalFlags import globalflags" >> jobOptions.py
echo "globalflags.DetGeo.set_Value_and_Lock('atlas')" >> jobOptions.py
echo "globalflags.DataSource.set_Value_and_Lock('data')" >> jobOptions.py
echo "from IOVDbSvc.CondDB import conddb" >> jobOptions.py
echo "include(\"InDetBeamSpotService/BeamCondSvc.py\")" >> jobOptions.py
# Point the job at the CVMFS software and conditions areas
export VO_ATLAS_SW_DIR=/cvmfs/atlas.cern.ch/repo/sw
export RELEASE=17.0.3
export ATLAS_POOLCOND_PATH="/cvmfs/atlas.cern.ch/repo/conditions"
echo "setting VO_ATLAS_SW_DIR to $VO_ATLAS_SW_DIR"
echo "The release to be used is $RELEASE"
source $VO_ATLAS_SW_DIR/software/$RELEASE/cmtsite/asetup.sh AtlasOffline $RELEASE
athena jobOptions.py
Finally run the following command:
prun --exec "source gridscript.sh %IN" --outDS user.dewhurst.CVMFSTest.1 --inDS data10_7TeV.00161379.physics_Muons.merge.AOD.f282_m578 --nFilesPerJob 1 --site ANALY_RAL
where you will need to change "dewhurst" to your panda username and ANALY_RAL to the site of your choice, and make sure that the dataset you pick is actually at the site. This code will run on any AOD from real data (i.e. a dataset whose name starts with "data" and contains "AOD").
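The naming rule in the last sentence can be checked mechanically before submitting. This is a minimal sketch of that rule only; it does not check that the dataset exists or is at the site, and is_real_data_aod is a hypothetical helper.

```shell
# Return 0 if the dataset name starts with "data" and contains "AOD",
# which is the rule of thumb given above for real-data AOD datasets.
is_real_data_aod() {
    case "$1" in
        data*AOD*) return 0 ;;
        *) return 1 ;;
    esac
}
```

For example, is_real_data_aod data10_7TeV.00161379.physics_Muons.merge.AOD.f282_m578 succeeds, while a name that does not start with "data" is rejected.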
Change the analysis queue at the site to use CVMFS. This will be done by changing the appdir parameter in the schedconfig to point to CVMFS:
This will direct user jobs to use CVMFS while not touching production jobs or your installation setup. This can be left for a few weeks to make sure things are working correctly. If things are not working correctly, your site will be set to broker-off by the Hammer Cloud tests, and if the error is not trivial to fix, the schedconfig change can be quickly rolled back. You can see if your site is passing Hammer Cloud tests by going to: http://panda.cern.ch/server/pandamon/query?job=*&site=ANALY_RAL&type=analysis&hours=12&processingType=gangarobot where you can change ANALY_RAL to the analysis queue of your site. Once both the site and cloud support are happy that the jobs are running well, the final steps to full production can be made.
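Since the monitoring URL differs between sites only in the queue name, it can be generated rather than edited by hand. The sketch below simply substitutes the queue into the URL quoted above; hammercloud_url is a hypothetical helper.

```shell
# Print the panda monitor URL for the gangarobot (Hammer Cloud) analysis
# jobs of the given queue over the last 12 hours.
hammercloud_url() {
    printf 'http://panda.cern.ch/server/pandamon/query?job=*&site=%s&type=analysis&hours=12&processingType=gangarobot\n' "$1"
}
```

Calling hammercloud_url with your own analysis queue name in place of ANALY_RAL prints the link to open in a browser.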
Current Deployment status