A quick guide to CVMFS

From GridPP Wiki
Latest revision as of 13:35, 2 December 2015

Deploying software with CVMFS

For more information about CVMFS at RAL, click here.

Also see the GridPP UserGuide for more information.

Usage guidelines

See the CVMFS usage guidelines here for more detailed information, but for convenience:

  • CVMFS repositories are not for uploading data. Only software - to be used in running your grid jobs - should be uploaded to your CVMFS repository. Any data you need to process should be uploaded to Storage Elements and read during grid jobs as usual.
  • There are currently no restrictions on upload size, but if your software bundle is larger than 20 GB, please contact Catalin to discuss your repository needs.
  • Please note that files larger than 200 MB are not kept in the local cache, and so would need to be downloaded each time they are used (which rather defeats the point of using CVMFS).

Overview of the process

  • Prepare your software;
  • Deploy your software to the CVMFS repository;
  • Prepare for job submission;
  • Submit your job(s).

A worked example

Here we will demonstrate the full process of deploying and running software with CVMFS using a Python script and some sample CERN@school data. All of the code is available via the GridPP GitHub repository - please feel free to adapt and modify for your own needs!

Prepare your software

In order to get your software running on the grid, you'll need to bundle it up into a tarball (.tgz) so that it's ready to upload to the RAL CVMFS stratum-0 server. This tarball will need to include the scripts, executables and libraries you need, all compiled to run on a 64-bit SL6 machine. For convenience, we have provided an example using Python in the GridPP GitHub repository cvmfs-test-001-00-00. You can get this with:

$ cd $CVMFS_UPLOAD_DIR # choose a suitable location for this.
$ wget https://github.com/gridpp/cvmfs-test-001-00-00/archive/master.zip -O cvmfs-test-001-00-00-master.zip
$ unzip cvmfs-test-001-00-00-master.zip
$ rm cvmfs-test-001-00-00-master.zip
$ tar -czvf cvmfs-test-001-00-00.tgz cvmfs-test-001-00-00-master/ # -z so the archive is gzipped, as the .tgz name implies

This contains:

  • process-frame.py: a simple Python script to process a frame of CERN@school Timepix data, either uploaded with the job or retrieved from a Storage Element (SE);
  • lib: some pre-compiled Python libraries for non-standard Python modules used by process-frame.py.

The idea is that process-frame.py will run remotely on the grid, using the non-standard Python libraries supplied with the CVMFS repository. This saves having to install the modules on each Computing Element (CE) every time you want to run a grid job. You will need to compile and supply the libraries you need when assembling your own tarballs.
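For illustration, the kind of per-frame statistics that produce the JSON output shown later in this guide can be sketched as follows. Note that the input format assumed here (one "x y count" triple per line) and the function name are illustrative assumptions - the real processing lives in process-frame.py in the repository:

```python
import json

def process_frame(file_name, lines):
    """Compute simple statistics for one Timepix-style frame.

    Assumes each line holds 'x y count' for one hit pixel - the real
    format handled by process-frame.py may differ.
    """
    counts = []
    for line in lines:
        if not line.strip():
            continue
        x, y, c = line.split()
        counts.append(int(c))
    return {
        "file_name": file_name,
        "n_pixel": len(counts),    # number of hit pixels in the frame
        "max_count": max(counts),  # highest count value recorded
    }

if __name__ == "__main__":
    # A tiny made-up frame with three hit pixels.
    frame = ["1\t2\t10", "3\t4\t639", "5\t6\t7"]
    print(json.dumps(process_frame("data000.txt", frame)))
```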

File permissions

Note: all files uploaded to the repository should have permissions o+r and directories should have permissions o+rx. You can do this with the following commands:

$ find ./ -type d -exec chmod go+rx {} \;
$ find ./ -type f -exec chmod go+r {} \;

If your bundle contains any non-readable files, it will not publish.
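Before building the tarball, it can be worth checking for permission problems automatically. The helper below is an illustrative Python sketch, not part of the repository tooling - it reports any file missing o+r and any directory missing o+rx:

```python
import os
import stat

def unreadable_paths(top):
    """Return paths under 'top' that would block publishing:
    files missing o+r, directories missing o+rx."""
    bad = []
    for dirpath, dirnames, filenames in os.walk(top):
        for d in dirnames:
            p = os.path.join(dirpath, d)
            mode = os.stat(p).st_mode
            if not (mode & stat.S_IROTH and mode & stat.S_IXOTH):
                bad.append(p)  # directory not world-readable/searchable
        for f in filenames:
            p = os.path.join(dirpath, f)
            if not (os.stat(p).st_mode & stat.S_IROTH):
                bad.append(p)  # file not world-readable
    return bad

# e.g. unreadable_paths("cvmfs-test-001-00-00-master") should return []
# before you build and upload the tarball.
```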

Deploy your software to the CVMFS repository

With your tarball prepared, you can now upload it to the RAL CVMFS stratum-0 by generating a grid proxy, gsiscp-ing the tarball over, and unpacking it in your repository:

$ voms-proxy-init --voms [your VO name, e.g. cernatschool.org] -dont_verify_ac
$ gsiscp -P 1975 cvmfs-test-001-00-00.tgz cvmfs-upload01.gridpp.rl.ac.uk:./cvmfs_repo/.
$ gsissh -p 1975 cvmfs-upload01.gridpp.rl.ac.uk

$ cd cvmfs_repo/
$ tar -xvf cvmfs-test-001-00-00.tgz

Your software has now been deployed. However, it may take up to three hours for the cron jobs to deploy it to all sites - be patient!
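If your local cluster mounts your VO's repository, you can watch for the bundle to appear rather than guessing. This is an illustrative sketch - the repository path shown is the cernatschool.org VO's address, so substitute your own:

```python
import os

# Example repository mount point for the cernatschool.org VO;
# substitute your VO's address.
REPO = "/cvmfs/cernatschool.gridpp.ac.uk"

def deployed(repo_path, bundle_name):
    """True once the unpacked bundle is visible under the mounted repo."""
    return os.path.isdir(os.path.join(repo_path, bundle_name))

# e.g. deployed(REPO, "cvmfs-test-001-00-00-master") - poll this (patiently!)
# until the cron jobs have propagated your upload.
```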

Prepare for job submission

You will need to set up whichever User Interface (UI) you use for submitting grid jobs and generate an appropriate proxy. We will use DIRAC in this example.

$ cd $DIRAC_DIR
$ . bashrc # set the DIRAC environment variables
$ dirac-proxy-init -g [VO name]_user -M

For convenience, we have prepared an example JDL file and test data that will run with the software deployed above. You can get this from the GridPP GitHub repository cvmfs-getting-started:

$ cd $CVMFS_SUBMIT_DIR
$ git clone https://github.com/gridpp/cvmfs-getting-started.git
$ cd cvmfs-getting-started

This contains:

  • dirac-test.jdl: a job description file for DIRAC users;
  • glite-test.jdl: a job description file for glite users;
  • run.sh: the script that sets the environment variables and runs the software in CVMFS.

In a nutshell, the job description file submits the run.sh script and the frame of data with the job. run.sh then sets the environment variables to make sure your CVMFS libraries are available, and runs the Python script process-frame.py remotely. You should look at the contents of these files to understand exactly what's going on.
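For reference, a job description along these lines might look roughly as follows. This is an illustrative sketch in the glite JDL style, not the exact contents of dirac-test.jdl or glite-test.jdl - check the files in the repository for the real thing:

```
# Illustrative JDL sketch - file names match those described above,
# but the actual attribute values are assumptions.
Executable    = "/bin/sh";
Arguments     = "run.sh";
InputSandbox  = {"run.sh", "data000.txt"};
StdOutput     = "stdout.txt";
StdError      = "stderr.txt";
OutputSandbox = {"stdout.txt", "stderr.txt", "file-info.json"};
```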

Submit your job(s)

To submit a job with DIRAC:

$ chmod a+x run.sh
$ dirac-wms-job-submit dirac-test.jdl
JobID = [number]

You can monitor the progress of your job using the DIRAC web interface. Once it has completed, you can retrieve the output (which consists of a JSON file containing processed information about the frame and a log file) with:

$ dirac-wms-job-get-output [number]
$ cat [number]/file-info.json
{"n_pixel": 735, "file_name": "data000.txt", "max_count": 639}
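The retrieved JSON is then easy to use in your own analysis; for example, in Python (the string below is the sample output shown above):

```python
import json

# The contents of file-info.json retrieved from the job output.
raw = '{"n_pixel": 735, "file_name": "data000.txt", "max_count": 639}'
info = json.loads(raw)
print("%s: %d pixels hit, max count %d"
      % (info["file_name"], info["n_pixel"], info["max_count"]))
# -> data000.txt: 735 pixels hit, max count 639
```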

And that's it! Congratulations, you've successfully used CVMFS to deploy and run software on the grid.

Useful links

Internal

  • CVMFS Use Cases for GridPP

External

  • The CERN CVMFS page
  • CernVM-FS - Building an Infrastructure for Non-LHC Computing - ISGC 2014 talk from I. Collier (slides)