The GridPP UserGuide

The DIRAC Command Line Tools

So you've mastered the DFC Command Line Interface. Great stuff. You'll probably have noticed that, while it's fine for small-scale operations, it's not ideal for working with large numbers of files. This is where the DIRAC command line tools for data management come in.

All of the DIRAC command line tools start with dirac-. The data management tools start with dirac-dms-, as in Data Management System. Press the tab key after typing `dirac-dms-` to see all of the available commands.

Why are these commands useful? Well, they let you use scripting to automate large-scale tasks involving many files. There are many ways to script the DIRAC (or indeed any command line) commands, and you've probably got your own preferred method that reflects your coding background. For the purposes of the UserGuide, we'll use simple bits of Python code (along with Python-based file management libraries) to generate simple bash scripts that can then be run to perform the DIRAC operations we want.

Of course, bash experts will be able to write scripts that perform all of the operations below purely in bash. This is left as an exercise for the reader - answers on a punch card please! (Also, we'll be using Python for the DIRAC Python API, so it's not a bad thing to use Python at this stage!)

Uploading files

The DIRAC file upload command takes the following form:

$ dirac-dms-add-file <LFN> <FILE> <SE>

where:

  • <LFN> is the Logical File Name (LFN) of the entry for the file in the DIRAC File Catalog (DFC);
  • <FILE> is the path to the file on your local machine; and
  • <SE> is the name of the destination Storage Element (SE).

Remember, you can find the names of the available SEs with the dirac-dms-show-se-status command.

Suppose we have a number of files on our local machine in /home/gridpp/mydata/ that we want to upload to the grid. The following Python code will generate a bash script that will upload them to one of the Queen Mary Storage Elements:

$ cat make_upload_script.py
#!/usr/bin/env python
# -*- coding: utf-8 -*-

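import glob
import os

# Local directory containing the files to upload.
data_path = '/home/gridpp/mydata'

# DFC directory under which the files will be registered.
lfn_dir = '/gridpp/user/a/ada.lovelace/mydata/'

# Destination Storage Element.
se = 'UKI-LT2-QMUL2-disk'

# Build up the contents of the bash script, one upload command per file.
s = "#!/bin/bash\n"

for my_file in sorted(glob.glob(data_path + "/*")):
    # Skip subdirectories: only regular files can be uploaded this way.
    if not os.path.isfile(my_file):
        continue
    base_name  = os.path.basename(my_file)
    upload_lfn = os.path.join(lfn_dir, base_name)
    s += "dirac-dms-add-file %s %s %s\n" % (upload_lfn, my_file, se)

# Write the script to the current working directory.
with open("upload_script.sh", "w") as sf:
    sf.write(s)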

After you've generated a proxy and sourced the DIRAC environment, you can generate and run the upload script as follows:

$ python make_upload_script.py
$ chmod a+x upload_script.sh
$ ./upload_script.sh

The results of this will, of course, depend on the contents of /home/gridpp/mydata/, but all being well you should see the message:

Successfully uploaded file to UKI-LT2-QMUL2-disk

(or whichever SE you specified in your Python code) after each file has been uploaded.
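
If you'd rather not read through the terminal output to spot failed uploads, one option is to run each command from Python and record its exit status. The following is a minimal sketch of that idea, not part of DIRAC itself. It assumes that dirac-dms-add-file returns a non-zero exit status when an upload fails (worth checking on your installation), and the helper names build_upload_command and run_uploads are made up for this example.

```python
import os
import subprocess


def build_upload_command(lfn_dir, local_path, se):
    """Return the argument list that uploads one local file to `se`."""
    # LFNs always use forward slashes, so join the parts by hand.
    lfn = lfn_dir.rstrip("/") + "/" + os.path.basename(local_path)
    return ["dirac-dms-add-file", lfn, local_path, se]


def run_uploads(commands, runner=subprocess.call):
    """Run each command in turn and return those that reported a failure.

    ASSUMPTION: a non-zero exit status means the upload failed.
    `runner` can be swapped for a stand-in function when testing.
    """
    failed = []
    for command in commands:
        if runner(command) != 0:
            failed.append(command)
    return failed
```

Passing the command as a list of arguments (rather than a single string) means that file names containing spaces need no extra quoting. The list returned by run_uploads can be passed straight back in to retry just the failed uploads.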

If you're uploading a lot of files, you may wish to consider using something like the screen tool so that you can log off your terminal session and come back to it later.

And there we go! Multiple file uploads, all registered in the DIRAC File Catalog, using a DIRAC command line tool and a bit of (admittedly slightly clumsy) coding.

Replicating files

Now, as we did with the DFC CLI, we can also make replicas of files, list information about the replicas of a given file, and remove replicas with the following command line tools:

dirac-dms-replicate-lfn <LFN> <SE>
dirac-dms-lfn-replicas <LFN>
dirac-dms-remove-replicas <LFN> <SE>
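
The script-generating approach used for uploads carries over to replication. As a rough sketch (Python 3, since it relies on shlex.quote), the function below builds a bash script with one dirac-dms-replicate-lfn line per LFN. The function name make_replication_script is invented for this example, and the destination SE used in the usage note is only an illustration: check dirac-dms-show-se-status for the SEs actually available to you.

```python
import shlex


def make_replication_script(lfns, se):
    """Return the text of a bash script that replicates each LFN to `se`."""
    lines = ["#!/bin/bash"]
    for lfn in lfns:
        # shlex.quote protects names containing spaces or shell metacharacters.
        lines.append(
            "dirac-dms-replicate-lfn %s %s" % (shlex.quote(lfn), shlex.quote(se))
        )
    return "\n".join(lines) + "\n"
```

For example, make_replication_script(['/gridpp/user/a/ada.lovelace/mydata/a.txt'], 'UKI-LT2-IC-HEP-disk') returns the script text, which can be written to a file, made executable and run in the same way as the upload script.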

Likewise, we can take the same approach with...

Downloading and removing files

dirac-dms-get-file <LFN>
dirac-dms-remove-files <LFN>

In other words, DIRAC command line tools exist for these operations too. However, finding out from the DFC which files you would like to replicate, download or remove is non-trivial when taking the command line approach, especially if you're writing scripts.

One approach is to use the metadata functionality the DIRAC File Catalog provides to find files of interest.

Metadata is data about the data. By assigning metadata to the files we upload to the DIRAC File Catalog, we can perform queries that will select only the files we are interested in. It also helps us to manage our data. We'll find out more about the DFC's metadata functionality later.

The dirac-dms-find-lfns command finds LFNs based on the DFC path and metadata query supplied as options. For example, to find all files in the DFC that have been assigned to the experiment UserGuide, we can type:

$ dirac-dms-find-lfns Path=/ "experiment=UserGuide"
{'experiment': 'UserGuide'}
/gridpp/userguide/WELCOME.md

experiment here is the metadata element or index. This is a string assigned to the file's LFN that, in this case, has the value UserGuide. We can use the results of this to download the files we want.

$ dirac-dms-get-file /gridpp/userguide/WELCOME.md
{'Failed': {},
 'Successful': {'/gridpp/userguide/WELCOME.md': '/home/gridpp/tmp/WELCOME.md'}}
$ cat WELCOME.md
#Welcome to GridPP!

It looks like your download has worked. Congratulations!
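
To script this find-then-download step, the LFNs need to be picked out of the text that dirac-dms-find-lfns prints. The sketch below assumes the output looks like the example above: a line echoing the metadata query, followed by one LFN per line, each starting with a forward slash. The function names parse_lfns and download_commands are hypothetical, and the parsing would need adjusting if your DIRAC version prints something different.

```python
def parse_lfns(find_output):
    """Return the LFNs found in the text printed by dirac-dms-find-lfns.

    ASSUMPTION: every LFN sits on its own line and starts with '/';
    all other lines (such as the echoed query) are ignored.
    """
    lfns = []
    for line in find_output.splitlines():
        line = line.strip()
        if line.startswith("/"):
            lfns.append(line)
    return lfns


def download_commands(find_output):
    """Return one dirac-dms-get-file argument list per LFN found."""
    return [["dirac-dms-get-file", lfn] for lfn in parse_lfns(find_output)]
```

The text to parse could be captured with Python's subprocess.check_output (passing universal_newlines=True so that it is returned as a string), and each resulting argument list handed to subprocess.call to perform the download.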

Let's take a closer look at the DFC's metadata functionality using the DFC CLI.