The GridPP UserGuide

First steps with the DIRAC metadata functionality

Finding files using metadata

When you upload large amounts of data, you need to be able to find it again later. Metadata (data about the data) helps with this. DIRAC lets you assign metadata values such as strings, integers, and floating-point numbers to files and directories via their Logical File Names (LFNs) in the DIRAC File Catalog (DFC). You can then query the DFC to return a list of the files you want.

For example, once you have sourced your DIRAC environment, generated a proxy, and started the DFC CLI, you can find all files associated with the UserGuide experiment like so:

FC:/> find / experiment=UserGuide
Query: {'experiment': 'UserGuide'}
/gridpp/userguide/WELCOME.md
QueryTime 0.98 sec

Here the file WELCOME.md has the value UserGuide assigned to its experiment metadata index. The find command in the DFC CLI performs the query for us.

FC:/> help find
 Find all files satisfying the given metadata information 

        usage: find [-q] [-D] <path> <meta_name>=<meta_value> [<meta_name>=<meta_value>]

FC:/> exit

In our query above, <path> was / (i.e. search the entire catalog from the base directory), <meta_name> was experiment (i.e. a metadata string index indicating to which experiment the data belongs), and <meta_value> was UserGuide (OK, so the UserGuide isn't really an experiment - at least not in the scientific sense - but you get the idea!).
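To make the roles of <path>, <meta_name>, and <meta_value> concrete, here is a minimal Python sketch of the same idea. It is a toy, in-memory model and not the DIRAC API: the helper names parse_find_args and find, the catalog dictionary, and the second file (notes.txt, with experiment set to Engine) are all assumptions made up for illustration. It only shows how the arguments become the query dictionary that the CLI echoes back, and how that dictionary selects files beneath a path.

```python
# Toy, in-memory illustration only: this is NOT the DIRAC API.
# The catalog contents and helper names below are hypothetical.


def parse_find_args(args):
    """Split 'find' arguments into a search path and a metadata query dict."""
    path, *conditions = args.split()
    query = {}
    for condition in conditions:
        name, value = condition.split("=", 1)
        query[name] = value
    return path, query


def find(catalog, path, query):
    """Return the LFNs under `path` whose metadata match every query entry."""
    prefix = path if path.endswith("/") else path + "/"
    return sorted(
        lfn
        for lfn, metadata in catalog.items()
        if lfn.startswith(prefix)
        and all(metadata.get(name) == value for name, value in query.items())
    )


# Hypothetical catalog: LFN -> metadata attached to that file.
catalog = {
    "/gridpp/userguide/WELCOME.md": {"experiment": "UserGuide"},
    "/gridpp/user/a/ada.lovelace/notes.txt": {"experiment": "Engine"},
}

path, query = parse_find_args("/ experiment=UserGuide")
print("Query:", query)
for lfn in find(catalog, path, query):
    print(lfn)
```

In this sketch, narrowing <path> (for example to /gridpp/user) restricts the search to that part of the tree, so the WELCOME.md entry would no longer be returned.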

You can get a list of all of the available commands in the DFC CLI by using the help command. To list the instructions for a given command (as above), type help [command].

There is only one file belonging to the UserGuide experiment in the DFC, and it's a pretty harmless Markdown file. But you can hopefully see how DIRAC's metadata functionality is going to be pretty useful, particularly when we start using multiple metadata indices with different types.

Assigning metadata to a file

We can also use the DFC CLI to assign metadata to our files. Let's create a file with our favourite text editor and upload it to the grid using the DFC CLI:

$ vim TODO.md
$ cat TODO.md
ToDo
====
* Email Charles re. engine
* Re-do punchcards
* Write to Dad
$ dirac-dms-filecatalog-cli 
Starting FileCatalog client

File Catalog Client $Revision: 1.17 $Date: 

FC:/> add /gridpp/user/a/ada.lovelace/TODO.md TODO.md UKI-LT2-QMUL2-disk
File /gridpp/user/a/ada.lovelace/TODO.md successfully uploaded to the UKI-LT2-QMUL2-disk SE

We can now set the owner index for the LFN using the meta set command:

FC:/> meta set /gridpp/user/a/ada.lovelace/TODO.md owner ada.lovelace
/gridpp/user/a/ada.lovelace/TODO.md owner ada.lovelace

Again, use help meta to see the syntax for the meta commands.

We can now find the file again using the find command:

FC:/> find / owner=ada.lovelace
Query: {'owner': 'ada.lovelace'}
/gridpp/user/a/ada.lovelace/TODO.md
QueryTime 0.01 sec
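The add, meta set, and find steps above can be pictured with another small Python sketch. Again, this is a toy in-memory model under stated assumptions and not the DIRAC API: the functions add, meta_set, and find are hypothetical stand-ins for the CLI commands, and the integer-valued year index is invented purely to show a second index of a different type being combined in one query.

```python
# Toy, in-memory illustration only: this is NOT the DIRAC API.
# The helper names and the 'year' index are hypothetical.

catalog = {}  # LFN -> {metadata index name: value}


def add(lfn):
    """Register an LFN with no metadata yet (stands in for an upload)."""
    catalog.setdefault(lfn, {})


def meta_set(lfn, name, value):
    """Attach, or overwrite, one metadata value on a registered LFN."""
    catalog[lfn][name] = value


def find(query):
    """Return the LFNs whose metadata match every entry in the query."""
    return sorted(
        lfn
        for lfn, metadata in catalog.items()
        if all(metadata.get(name) == value for name, value in query.items())
    )


todo = "/gridpp/user/a/ada.lovelace/TODO.md"
add(todo)
before = find({"owner": "ada.lovelace"})   # no metadata set yet

meta_set(todo, "owner", "ada.lovelace")    # string-valued index
meta_set(todo, "year", 1843)               # integer-valued index (hypothetical)
after = find({"owner": "ada.lovelace", "year": 1843})

print(before)
print(after)
```

The point of the sketch is that a file only becomes discoverable through an index once a value has been set for it, and that a query with several conditions returns only the files matching all of them.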

As we've said before, the DFC CLI is useful for small-scale operations on your data. Hopefully, though, you can start to appreciate the power of metadata when it comes to organising your data and performing analyses on it.

The most important thing for the moment, though, is that we can now put data on the Grid (i.e. on a Storage Element). This means we can use it in our Grid jobs without having to upload it with each job as an input file. In the next section, Using Grid-based data in your workflow, we'll finish making our example workflow fully Grid-enabled.