A quick guide to CVMFS

 
==Deploying software with CVMFS==
 
For more information about CVMFS at RAL, click [[RAL Tier1 CVMFS|here]].
 
Also see the [https://www.gridpp.ac.uk/userguide GridPP UserGuide] for more information.
===Usage guidelines===
See the CVMFS usage guidelines [http://cernvm.cern.ch/portal/filesystem/repository-limits here] for more detailed information, but for convenience:
* '''''CVMFS repositories are not for uploading data'''''. Only software - to be used in running your grid jobs - should be uploaded to your CVMFS repository. Any data you need to process should be uploaded to Storage Elements and read during grid jobs as usual.
* There are currently no restrictions on upload size, but if your software bundle is larger than 20 GB, please contact [mailto:catalin.condurache@stfc.ac.uk Catalin] to discuss your repository needs.
* Please note that files larger than 200 MB are not kept in the local cache, and so would need to be downloaded each time they are used (kind of defeating the point).
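Since files this large are never cached, it is worth scanning a bundle before upload. A minimal sketch, using a throwaway <code>my-software</code> directory as a stand-in for a real bundle:

```shell
# Scan a bundle for files over 200 MB before uploading.
# "my-software" is a placeholder directory created here for illustration.
mkdir -p my-software/lib
echo "a small script" > my-software/run.sh

# Any file listed here would bypass the CVMFS cache on worker nodes:
find my-software/ -type f -size +200M

# Count them so a wrapper script could refuse to build the tarball:
n_big=$(find my-software/ -type f -size +200M | wc -l)
echo "oversized files: $n_big"
```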
  
 
===Overview of the process===

* Prepare your software;
* Deploy your software to the CVMFS repository;
* Prepare for job submission;
* Submit your job(s).
  
==A worked example==

Here we will demonstrate the full process of deploying and running software with CVMFS using a Python script and some sample CERN@school data. All of the code is available via the GridPP GitHub repository - please feel free to adapt and modify for your own needs!
===Prepare your software===

In order to get your software running on the grid, you'll need to bundle it up into a tarball (<code>.tgz</code>) so that it's ready to upload to the RAL CVMFS stratum-1 server. This tarball will need to include the scripts, executables and libraries you need, all compiled to run on a 64-bit SL6 machine. For convenience, we have provided an example using Python in the GridPP GitHub repository [https://github.com/gridpp/cvmfs-test-001-00-00 cvmfs-test-001-00-00]. You can get this with:

<pre>
$ cd $CVMFS_UPLOAD_DIR # choose a suitable location for this.
$ wget https://github.com/gridpp/cvmfs-test-001-00-00/archive/master.zip -O cvmfs-test-001-00-00-master.zip
$ unzip cvmfs-test-001-00-00-master.zip
$ rm cvmfs-test-001-00-00-master.zip
$ tar -czvf cvmfs-test-001-00-00.tgz cvmfs-test-001-00-00-master/
</pre>
This contains:

* <code>process-frame.py</code>: a simple Python script to process a frame of CERN@school Timepix data, either uploaded with the job or retrieved from a Storage Element (SE);
* <code>lib</code>: some pre-compiled Python libraries for non-standard Python modules used by <code>process-frame.py</code>.

The idea is that <code>process-frame.py</code> will run remotely on the grid, using the non-standard Python libraries supplied with the CVMFS repository. This saves having to install the modules on each Computing Element (CE) every time you want to run a grid job. You will need to compile and supply the libraries you need when assembling your own tarballs.
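Before uploading, it is worth listing a tarball's contents to confirm the layout is what you expect. A sketch using a dummy bundle (the <code>cvmfs-demo</code> names are illustrative only, not the real repository contents):

```shell
# Build a dummy bundle with the same shape as the example above.
mkdir -p cvmfs-demo/lib
echo 'print("processing frame")' > cvmfs-demo/process-frame.py

# -z to gzip (matching the .tgz extension), -t to list without extracting.
tar -czf cvmfs-demo.tgz cvmfs-demo/
tar -tzf cvmfs-demo.tgz
```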
====File permissions====

'''Note''': '''''all''''' files uploaded to the repository should have permissions <code>o+r</code> and directories should have permissions <code>o+rx</code>. You can do this with the following commands:

<pre>
$ find ./ -type d -exec chmod go+rx {} \;
$ find ./ -type f -exec chmod go+r {} \;
</pre>

'''''If your bundle contains any non-readable files, it will not publish.'''''
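To confirm that nothing unreadable remains, you can search for files still lacking world-read permission. A sketch on a throwaway directory:

```shell
# Create a throwaway directory tree and apply the permission fix above.
mkdir -p perm-demo/data
echo "payload" > perm-demo/data/file.txt
chmod o-r perm-demo/data/file.txt   # deliberately break one file

find perm-demo/ -type d -exec chmod go+rx {} \;
find perm-demo/ -type f -exec chmod go+r {} \;

# After the fix, nothing should be listed as lacking world-read:
find perm-demo/ ! -perm -o=r
```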
===Deploy your software to the CVMFS repository===
 
With your tarball prepared, you can now upload it to the RAL CVMFS stratum-1 by generating a grid proxy, <code>gsiscp</code>-ing the tarball over, and unpacking it in your repository:

<pre>
$ voms-proxy-init --voms [your VO name, e.g. cernatschool.org] -dont_verify_ac
$ gsiscp -P 1975 cvmfs-test-001-00-00.tgz cvmfs-upload01.gridpp.rl.ac.uk:./cvmfs_repo/.
$ gsissh -p 1975 cvmfs-upload01.gridpp.rl.ac.uk
$ cd cvmfs_repo/
$ tar -xvf cvmfs-test-001-00-00.tgz
</pre>

Your software has now been deployed. However, it may take up to '''three hours''' for the cron jobs to deploy it to all sites - be patient!

===Prepare for job submission===

You will need to set up whichever User Interface (UI) you use for submitting grid jobs and generate an appropriate proxy. We will use [[Quick_Guide_to_Dirac|DIRAC]] in this example.

<pre>
$ cd $DIRAC_DIR
$ . bashrc # set the DIRAC environment variables
$ dirac-proxy-init -g [VO name]_user -M
</pre>

For convenience, we have prepared an example JDL file and test data that will run with the software deployed above. You can get this from the GridPP GitHub repository [https://github.com/gridpp/cvmfs-getting-started cvmfs-getting-started]:
 
<pre>
$ cd $CVMFS_SUBMIT_DIR
$ git clone https://github.com/gridpp/cvmfs-getting-started.git
$ cd cvmfs-getting-started
</pre>

This contains:

* <code>dirac-test.jdl</code>: a job description file for DIRAC users;
* <code>glite-test.jdl</code>: a job description file for glite users;
* <code>run.sh</code>: the script that sets the environment variables and runs the software in CVMFS.

In a nutshell, the job description file submits the <code>run.sh</code> script and the frame of data with the job. <code>run.sh</code> then sets the environment variables to make sure your CVMFS libraries are available, and runs the Python script <code>process-frame.py</code> remotely. You should look at the contents of these files to understand exactly what's going on.

===Submit your job(s)===

To submit a job with DIRAC:

<pre>
$ chmod a+x run.sh
$ dirac-wms-job-submit dirac-test.jdl
JobID = [number]
</pre>

You can monitor the progress of your job using the [https://dirac.grid.hep.ph.ic.ac.uk:8443 DIRAC web interface]. Once it has completed, you can retrieve the output (which consists of a JSON file containing processed information about the frame and a log file) with:
<pre>
$ dirac-wms-job-get-output [number]
$ cat [number]/file-info.json
{"n_pixel": 735, "file_name": "data000.txt", "max_count": 639}
</pre>
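If you want to check the retrieved JSON programmatically, something like the following works (assuming <code>python3</code> is available on your UI; the values are copied from the example output above):

```shell
# Recreate the example output file locally for illustration.
cat > file-info.json <<'EOF'
{"n_pixel": 735, "file_name": "data000.txt", "max_count": 639}
EOF

# Pretty-print it, then pull out a single field with the standard library.
python3 -m json.tool file-info.json
python3 -c "import json; print(json.load(open('file-info.json'))['n_pixel'])"  # prints 735
```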
  
And that's it! Congratulations, you've successfully used CVMFS to deploy and run software on the grid.  
  
 
==Useful links==

===Internal===

===External===

Latest revision as of 13:35, 2 December 2015
