Storage and Data Management Testing

This page gives an overview of the test suite used to test Grid Storage and Data Management in roughly incremental order of functionality. The main aim of this page is to document the test suites that an SE will have to pass before it goes into production. A secondary aim is to give the admin a toolbox to test and debug from the client side.

A Storage Element (SE) provides control, information, and file transfer protocols.

Part I: Functional Tests

SRM interface

Generic SRM testing with the dCache client

The dCache srmcp tool (sometimes called "SRM copy", but not to be confused with srmCopy() in the SRM API) is a command-line utility for copying files into and out of SEs. It is normally installed on the UIs; the RPM is called dcache-srmclient (not dcache-client).

srmcp -debug -2 file:///`pwd`/localfile srm://


  • Note the extra / in the local file URL: file:// plus an absolute path gives three slashes.
  • You need a valid grid proxy ready.
  • The -2 switch says to use SRM version 2 (i.e. 2.2).
  • The SRM port number is added to the normal SURL. For DPM, the standard port number for SRM is 8446. Again, this information is in the information system.
  • The path, if you do not know it already, can be queried from the information system.
  • The transfer depends on GridFTP working; you may need globus-url-copy installed locally.
  • -debug is helpful and tells you what is going on, but of course you don't need it.

The dCache test suite contains many other utilities which may be useful, e.g. srmrm to remove the file. The usefulness of the test suite is that it does not depend on the information system: once these tests are working, you can move on to the data management tools which do depend on the information system.
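The upload, list, and remove steps can be sketched as one round-trip script. This is a minimal sketch: the SE hostname and SFN path in the SURL are placeholder assumptions, and DRYRUN=1 (the default here) prints each command instead of executing it, so the sequence can be read without a working SE.

```shell
# Round-trip test with the dCache client tools (sketch).
# se.example.com and the SFN path are assumptions, not real endpoints.
DRYRUN=${DRYRUN:-1}
run() { if [ "$DRYRUN" = 1 ]; then echo "$@"; else "$@"; fi; }

SURL="srm://se.example.com:8446/srm/managerv2?SFN=/dpm/example.com/home/dteam/testfile"

run srmcp -debug -2 "file:///$(pwd)/localfile" "$SURL"   # copy the file in
run srmls "$SURL"                                        # check it arrived
run srmrm "$SURL"                                        # clean up
```

Unset DRYRUN (or set it to anything other than 1) on a UI with a valid proxy to run the commands for real.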

Information System

gLite Testing - lcg-utils

Basic usage of the gLite utilities:

Copying a file to an SE

This command copies the file into the SE given by -d and registers it in the catalogue (only the hostname is required; the rest of the information is discovered from the information system).

lcg-cr --vo dteam -d file://`pwd`/localfile

As with srmcp's -debug, the -v switch (verbose) may be useful. Note carefully the GUID returned by the above command: it is your "handle" on the file.

Creating another replica of the file

Using the GUID from above, we now wish to replicate the file to another site:

lcg-rep --vo dteam -d guid:d1b65e01-67fc-443e-b564-492d918b6719

and another one for good measure:

lcg-rep --vo dteam -d guid:d1b65e01-67fc-443e-b564-492d918b6719

Listing Replicas

lcg-lr --vo dteam guid:d1b65e01-67fc-443e-b564-492d918b6719

This command will list the three replicas: the original copy and the two replicas made above.

Deleting a replica

Now we delete the original one. We cannot delete by GUID (well, we can, but we don't want to), so we need to instruct the command which of the three replicas to delete.

lcg-del --vo dteam srm://

Here, the filename is one of the replicas listed with lcg-lr (it is in fact the first copy of the file, the one created with lcg-cr). Now, another lcg-lr on the GUID will confirm that one of the replicas has been deleted.

Copying the file to a local file

We now want to access the file locally.

lcg-cp --vo dteam guid:d1b65e01-67fc-443e-b564-492d918b6719 file://`pwd`/tmpfile

Note that it is like a 'cp': we copy the remote file (specified by GUID, so we don't need to know where it is) to a local file. In fact, if you run this with -v you can see which of the replicas was chosen for transfer, but otherwise you will not know.

Note also that the command is lcg-cp, not lcg-cr. The difference is (in this context) that lcg-cp does not register the new copy in the file catalogue: the local copy we have just created is not available for the grid to access as another replica of the file, it is a private copy.

Deleting all replicas

Finally, we clean up all copies known to the grid by deleting the GUID:

lcg-del -a guid:d1b65e01-67fc-443e-b564-492d918b6719

The -a switch says to delete all. This command will query the catalogue for all copies of the file, and delete them all.
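The whole lifecycle above can be sketched as one script. The SE hostnames (se1/se2.example.com) are assumptions, the GUID is the example one from the text, and DRYRUN=1 (the default here) prints each command instead of executing it.

```shell
# Full lcg-utils lifecycle (sketch): create, replicate, list, copy, delete.
DRYRUN=${DRYRUN:-1}
run() { if [ "$DRYRUN" = 1 ]; then echo "$@"; else "$@"; fi; }

VO=dteam
GUID=guid:d1b65e01-67fc-443e-b564-492d918b6719   # in reality, use the GUID printed by lcg-cr

run lcg-cr --vo $VO -d se1.example.com "file://$(pwd)/localfile"   # copy and register
run lcg-rep --vo $VO -d se2.example.com $GUID                      # extra replica
run lcg-lr --vo $VO $GUID                                          # list replicas
run lcg-cp --vo $VO $GUID "file://$(pwd)/tmpfile"                  # private local copy
run lcg-del --vo $VO -a $GUID                                      # delete all replicas
```

In real use you would run lcg-cr first, capture the GUID it prints, and substitute it before the remaining steps.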

Working without the information system

If the information system is not available, or you want to test an SE which is not in the information system, here is how to do it:

lcg-cp -b -D SRMv2 file:///home/tier1/jensen/castor.ppt srm://\?SFN=/castor/

Notice the -b which says not to use the BDII, and the -D SRMv2 which says that the SRMs are version 2 (you are extremely unlikely to encounter any other flavour).

Finally note the special form of the SURL which:

  • Has the port number of the SRM web service (usually 8443, except 8446 for DPM).
  • Has the web service (control) path, which by remarkably uniform convention across implementations is /srm/managerv2.
  • Contains the SE path as an SFN parameter, after a '?'.
  • Escapes the '?' with a backslash so the shell does not treat it as a glob wildcard; the backslash is consumed by the shell and is not passed to the command.
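The rules above can be condensed into a small helper that assembles an information-system-free SURL from its parts. The hostname and path passed to it below are examples, not real endpoints.

```shell
# Build a full SURL of the form srm://<host>:<port>/srm/managerv2?SFN=<path>
build_surl() {  # usage: build_surl <host> <port> <path-on-SE>
    echo "srm://$1:$2/srm/managerv2?SFN=$3"
}

build_surl dpm.example.com 8446 /dpm/example.com/home/dteam/testfile
```

Quoting the result (or escaping the '?') is still needed when you paste it back onto an lcg-cp command line.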

The corresponding command for deleting a file is

lcg-del -l -b -D SRMv2 srm://\?SFN=/castor/

where -l (which is minus ell, not minus one) says not to update the LFC (which was never used), and as usual -b to not use the BDII.

Experiment specific testing



ATLAS HammerCloud tests send a series of Athena jobs to a site using the Ganga grid job submission tool and either the WLCG or PANDA backends, and can therefore establish whether a site can run ATLAS user analysis jobs. Multiple jobs can be submitted, providing a stress test of the site and, in particular, of the storage system. The test site also provides statistics on CPU efficiency and event rates, as well as access to log files and other diagnostic information.

Jobs can be configured by Tier 2 coordinators, ATLAS UK experts, or various other GridPP staff, but if you expect to run a lot of tests you may want to request an account following the instructions on the page. Currently (May 2010), to accept jobs you will need a CE (and an ANALY queue specified within ATLAS if you want to run PANDA jobs). You will also need your storage element space tokens to be defined within TiersOfAtlas; ask the ATLAS UK experts to get this defined. Shortly it will be possible to run these tests on datasets provided by a list of physical file names, which will allow testing on Tier 3 sites or on test storage elements not officially known to ATLAS.

Existing tests can be cloned to configure them easily; the most important configuration variables (other than start/end time and site) are given below.

Input type: Choose from DQ2_LOCAL, DQ2_COPY, FILESTAGER, PANDA. Of these, DQ2_LOCAL uses the site's local file access method (i.e. rfio for DPM), while the others copy the files to the WN. FILESTAGER copies subsequent files while the job is running and is currently (Mar 2010) the most effective method for DPM sites.

Dataset string: This contains wildcards which should match enough datasets on the site for the scale of the test (i.e. so that jobs do not all run on the same files). Note that it will match a different set of data each time the test is run if more datasets have been copied to the site. If a MuonAnalysis-type job is run, the string should ensure that datasets containing muons are included (one way is for it to contain *mu*), or at least a broad range of files.

Resubmit enabled: continues to submit jobs even once all the datasets have been run over once.

Athena User Area, Athena Option file and Ganga Job Template: Specify the type of job to be run. Popular ones include MuonAnalysis (for which there are plenty of existing tests to compare against) and D3PDMaker which has the advantage of "doing something" for every container in a dataset.

Max Running, Min Queue: The test will try to maintain the specified number of running jobs. If resubmit is enabled, the number queued is below min_queue, and the number running is below max_running, it submits a bulk of jobs to process the specified "Number of datasets per bulk".




Other experiments tend to use the gLite utilities described above.

Part II: Performance Testing


SRM Performance Testing

Running Parametric Jobs

Parametric jobs are submitted in gLite with a JDL file like this one:

  JobType = "Parametric";
  Executable = "a.out";
  StdInput = "input._PARAM_.txt";
  StdOutput = "outfile._PARAM_.txt";
  StdError = "error._PARAM_.txt";
  Requirements = (other.GlueHostArchitecturePlatformType == "x86_64");
  Parameters = 10;
  ParameterStart = 1;
  ParameterStep = 1;
  InputSandbox = { "a.out", "input._PARAM_.txt" };
  OutputSandbox = { "outfile._PARAM_.txt", "error._PARAM_.txt" };

In this case, the executable (which should be in the current directory and is copied along with the job, because it is in the input sandbox) is called a.out and was compiled on x86_64, so the WMS will need to match this architecture. If your job is a shell script, or runs a command already installed on the WN, you will not need this.

As you can see, 10 jobs are run, each with an input file input.1.txt to input.10.txt; these can be replaced with a single file if all the input files are identical (e.g. each contains the name of a single file to access).
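The ten input files the JDL expects can be generated in one loop. This is a sketch: here each file just names a (hypothetical) SE file for the job to access, and the path is an example.

```shell
# Generate input.1.txt .. input.10.txt for the parametric job above.
# Each file contains one example SE path for the job to read.
i=1
while [ "$i" -le 10 ]; do
    echo "/dpm/example.com/home/dteam/testfile.$i" > "input.$i.txt"
    i=$((i + 1))
done
```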

One caveat: the WMS requires old-style proxies (the ones with CN=proxy in the name), so make sure you create one of these at the start. E.g. if you use MyProxy, first set GT_PROXY_MODE=old.

One more caveat: the WMS will refuse to send parametric jobs to a single CE. If you want that, you will need to play with the Requirements expression.

For stress testing an SRM, you'd probably replace a.out with a shell script which calls lcg-cp on a filename found in the input file. Doing this is an easy exercise left to the reader.
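One possible answer to the exercise, sketched under stated assumptions: stress_one is an illustrative name, the SURL below is a placeholder (in the parametric setup it would come from the job's input file), and DRYRUN=1 (the default here) prints the lcg-cp command instead of running it.

```shell
# Stress-test worker (sketch): copy one SURL to local scratch space.
DRYRUN=${DRYRUN:-1}
run() { if [ "$DRYRUN" = 1 ]; then echo "$@"; else "$@"; fi; }

stress_one() {  # usage: stress_one <surl>
    run lcg-cp -v --vo dteam "$1" "file:///tmp/stress.$$"
}

stress_one "srm://dpm.example.com:8446/srm/managerv2?SFN=/dpm/example.com/home/dteam/testfile.1"
```

As the job's a.out, this would read the SURL from its input file (e.g. `stress_one "$(cat)"` on the StdInput) and run with DRYRUN unset.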
