A quick guide to CVMFS

From GridPP Wiki

Deploying software with CVMFS

For more information about CVMFS at RAL, click here.

Also see the GridPP UserGuide for more information.

Usage guidelines

See the CVMFS usage guidelines here for more detailed information, but for convenience:

  • CVMFS repositories are not for uploading data. Only software - to be used in running your grid jobs - should be uploaded to your CVMFS repository. Any data you need to process should be uploaded to Storage Elements and read during grid jobs as usual.
  • There are currently no restrictions on upload size, but if your software bundle is larger than 20 GB, please contact Catalin to discuss your repository needs.
  • Files larger than 200 MB are not kept in the local cache, so they are downloaded again every time they are used, which rather defeats the point of using CVMFS.
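Before uploading, it can help to check a bundle against these limits. The following is a minimal sketch rather than an official GridPP tool: the function names are illustrative assumptions, and the thresholds simply mirror the 20 GB and 200 MB figures quoted above.

```shell
# Hypothetical helpers for checking a bundle directory before upload.

# Print the total size of the bundle directory in kilobytes.
bundle_size_kb() {
    du -sk "$1" | cut -f1
}

# Succeed if the bundle is larger than 20 GB (20 * 1024 * 1024 KB).
bundle_too_large() {
    [ "$(bundle_size_kb "$1")" -gt $((20 * 1024 * 1024)) ]
}

# List files larger than 200 MB, which would not be kept in the local cache.
large_files() {
    find "$1" -type f -size +200M
}
```

For example, `large_files cvmfs-test-001-00-00-master/` prints nothing if every file is small enough to be cached.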

Overview of the process

  • Prepare your software;
  • Deploy your software to the CVMFS repository;
  • Prepare for job submission;
  • Submit your job(s).

A worked example

Here we will demonstrate the full process of deploying and running software with CVMFS using a Python script and some sample CERN@school data. All of the code is available via the GridPP GitHub repository - please feel free to adapt and modify for your own needs!

Prepare your software

In order to get your software running on the grid, you'll need to bundle it up into a tarball (.tgz) so that it's ready to upload to the RAL CVMFS stratum-1 server. This tarball will need to include the scripts, executables and libraries you need, all compiled to run on a 64-bit SL6 machine. For convenience, we have provided an example using Python in the GridPP GitHub repository cvmfs-test-001-00-00. You can get this with:

$ cd $CVMFS_UPLOAD_DIR # choose a suitable location for this.
$ wget https://github.com/gridpp/cvmfs-test-001-00-00/archive/master.zip -O cvmfs-test-001-00-00-master.zip
$ unzip cvmfs-test-001-00-00-master.zip
$ rm cvmfs-test-001-00-00-master.zip
$ tar -czvf cvmfs-test-001-00-00.tgz cvmfs-test-001-00-00-master/

This contains:

  • process-frame.py: a simple Python script to process a frame of CERN@school Timepix data, either uploaded with the job or retrieved from a Storage Element (SE);
  • lib: some pre-compiled Python libraries for non-standard Python modules used by process-frame.py.

The idea is that process-frame.py will run remotely on the grid, using the non-standard Python libraries supplied with the CVMFS repository. This saves having to install the modules on each Computing Element (CE) every time you want to run a grid job. When assembling your own tarballs, you will need to compile and supply the libraries you need.

File permissions

Note: all files uploaded to the repository should be readable by others (o+r), and all directories should be readable and searchable by others (o+rx). The following commands grant these permissions to both group and others:

$ find ./ -type d -exec chmod go+rx {} \;
$ find ./ -type f -exec chmod go+r {} \;

If your bundle contains any non-readable files, it will not publish.
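To confirm that nothing was missed, you can list any entries that still break the permission rule. This is a minimal sketch; the function name is a hypothetical helper, not part of any CVMFS tooling.

```shell
# List files lacking o+r and directories lacking o+rx under a given path.
# No output means every entry satisfies the permission rule.
unreadable_entries() {
    find "$1" \( -type f ! -perm -o=r \) -o \( -type d ! -perm -o=rx \)
}
```

Running `unreadable_entries ./` in the bundle directory should print nothing before you build the tarball.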

Deploy your software to the CVMFS repository

With your tarball prepared, you can now upload it to the RAL CVMFS stratum-1 by generating a grid proxy, gsiscp-ing the tarball over, and unpacking the tarball in your repository:

$ voms-proxy-init --voms [your VO name, e.g. cernatschool.org] -dont_verify_ac
$ gsiscp -P 1975 cvmfs-test-001-00-00.tgz cvmfs-upload01.gridpp.rl.ac.uk:./cvmfs_repo/.
$ gsissh -p 1975 cvmfs-upload01.gridpp.rl.ac.uk

$ cd cvmfs_repo/
$ tar -xvf cvmfs-test-001-00-00.tgz

Your software has now been deployed. However, it may take up to three hours for the cron jobs to deploy it to all sites - be patient!
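If you have access to a machine with a CVMFS client, one way to wait for the deployment is to poll for the unpacked directory. The sketch below assumes your repository is mounted somewhere under /cvmfs; the function name and the path you pass to it are illustrative assumptions.

```shell
# Poll until a path exists or a timeout (in seconds) expires.
# Usage: wait_for_path PATH TIMEOUT INTERVAL
wait_for_path() {
    path=$1
    timeout=$2
    interval=$3
    waited=0
    while [ ! -e "$path" ]; do
        # Give up once the total time waited reaches the timeout.
        if [ "$waited" -ge "$timeout" ]; then
            return 1
        fi
        sleep "$interval"
        waited=$((waited + interval))
    done
    return 0
}
```

For instance, `wait_for_path /cvmfs/[your repository]/cvmfs-test-001-00-00-master 10800 600` would check every ten minutes for up to three hours.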

Prepare for job submission

You will need to set up whichever User Interface (UI) you use for submitting grid jobs and generate an appropriate proxy. We will use DIRAC in this example.

$ cd $DIRAC_DIR
$ . bashrc # set the DIRAC environment variables
$ dirac-proxy-init -g [VO name]_user -M

For convenience, we have prepared an example JDL file and test data that will run with the software deployed above. You can get this from the GridPP GitHub repository cvmfs-getting-started:

$ cd $CVMFS_SUBMIT_DIR
$ git clone https://github.com/gridpp/cvmfs-getting-started.git
$ cd cvmfs-getting-started

This contains:

  • dirac-test.jdl: a job description file for DIRAC users;
  • glite-test.jdl: a job description file for gLite users;
  • run.sh: the script that sets the environment variables and runs the software in CVMFS.

In a nutshell, the job description file submits the run.sh script and the frame of data with the job. run.sh then sets the environment variables to make sure your CVMFS libraries are available, and runs the Python script process-frame.py remotely. You should look at the contents of these files to understand exactly what's going on.
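To illustrate the idea, a wrapper of this kind might look roughly like the sketch below. It is not the actual run.sh from the repository: the function names, the use of PYTHONPATH, and the way the data file is passed to process-frame.py are all assumptions made for illustration.

```shell
# Illustrative only; the real run.sh in cvmfs-getting-started may differ.

# Prepend the bundled lib directory to PYTHONPATH so Python finds the
# non-standard modules shipped with the software.
setup_env() {
    software_dir=$1
    export PYTHONPATH="$software_dir/lib${PYTHONPATH:+:$PYTHONPATH}"
}

# Run the processing script from the deployed bundle on one data file.
# Assumes the script takes the data file as its only argument.
run_frame() {
    software_dir=$1
    data_file=$2
    setup_env "$software_dir"
    python "$software_dir/process-frame.py" "$data_file"
}
```

Here `software_dir` would be the directory where your bundle was unpacked in the repository, as seen from the worker node.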

Submit your job(s)

To submit a job with DIRAC:

$ chmod a+x run.sh
$ dirac-wms-job-submit dirac-test.jdl
JobID = [number]

You can monitor the progress of your job using the DIRAC web interface. Once it has completed, you can retrieve the output (which consists of a JSON file containing processed information about the frame and a log file) with:

$ dirac-wms-job-get-output [number]
$ cat [number]/file-info.json
{"n_pixel": 735, "file_name": "data000.txt", "max_count": 639}
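If you run many jobs, you may want to pull individual values out of each file-info.json. The helper below is a rough sketch with a hypothetical name; it only suits flat, single-line output like the sample above, and a proper JSON parser is the more robust choice.

```shell
# Print the value of an integer field from a flat, single-line JSON file.
# Usage: json_int_field FILE FIELD
json_int_field() {
    sed -n "s/.*\"$2\": *\([0-9][0-9]*\).*/\1/p" "$1"
}
```

With the sample output shown above, `json_int_field [number]/file-info.json n_pixel` prints 735.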

And that's it! Congratulations, you've successfully used CVMFS to deploy and run software on the grid.
