Summary of HEPCAL-II for Metadata Group
Steven Hanlon, May 2004
Introduction
HEPCAL-I focused on the problems of well-organised activities such as reconstruction and Monte Carlo production. HEPCAL-II supplements this with an examination of the problems and requirements of analysis activity.
Analysis Characteristics
Analysis consists of iterative loops over some subset of the following steps:
1. Perform queries on dataset metadata to identify datasets of interest;
2. Query datasets with event-level metadata to identify events of interest;
3. Save results of step 2 for future use;
4. Perform analysis activity on selected events. Such activity could be:
   - additional filtering;
   - reprocessing;
   - addition of new information on events;
5. Save results of step 4 and publish them somehow. Output can be:
   - shallow copy, e.g. tag dataset;
   - deep copy, for more efficient access;
   - new dataset; necessary for adding new information to a dataset.
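The difference between the shallow-copy and deep-copy outputs of the last step can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the `Event` class, the in-memory `source` dataset and the function names are hypothetical and do not come from HEPCAL-II.

```python
import copy
from dataclasses import dataclass


@dataclass
class Event:
    event_id: int
    payload: dict


# Hypothetical in-memory stand-in for a stored dataset: event_id -> Event.
source = {i: Event(i, {"n_tracks": i % 5}) for i in range(10)}


def select(dataset, predicate):
    """Return the ids of the events that satisfy the predicate."""
    return [eid for eid, ev in dataset.items() if predicate(ev)]


def shallow_copy(source_name, event_ids):
    """Tag dataset: keeps only references back into the source dataset."""
    return {"source": source_name, "event_ids": list(event_ids)}


def deep_copy(dataset, event_ids):
    """Deep copy: duplicates the selected events so they can be read alone."""
    return {eid: copy.deepcopy(dataset[eid]) for eid in event_ids}


selected = select(source, lambda ev: ev.payload["n_tracks"] >= 3)
tag_dataset = shallow_copy("source", selected)
copied = deep_copy(source, selected)
```

The tag dataset is cheap to store but every later read goes back to the source; the deep copy costs storage up front in exchange for more efficient access.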
It is important that a user should be able to see some estimate of the resource cost of a procedure before committing to it.
HEPCAL-II identifies the following important characteristics of analysis activity, from the computing point of view:
- non-standard software - user code;
- input not known a priori;
- sparse data access pattern;
- large amount of concurrent, uncoordinated submission;
- interactive jobs;
- requirements on response time (as opposed to total throughput);
- detailed provenance information required;
- resource estimator required; and
- resources must be limited and accounted for.
Notes on Metadata
- Distinguish between event-level metadata and dataset-level metadata.
- Can also have event component metadata e.g. sub-detector or physics channel the component is relevant to.
- Metadata for provenance and book-keeping.
- Queries can be split into two sub-queries, one on the dataset-level catalogue, the other at event level, carried out in experiment-dependent software.
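The split into a dataset-level sub-query and an event-level sub-query can be sketched as follows. This is only an illustration under stated assumptions: the catalogue records, the field names (`trigger`, `pt`) and the functions are hypothetical, and in practice the event-level part would run in experiment-dependent software.

```python
# Hypothetical dataset-level catalogue: one metadata record per dataset.
dataset_catalogue = [
    {"name": "run1", "year": 2003, "trigger": "muon"},
    {"name": "run2", "year": 2004, "trigger": "muon"},
    {"name": "run3", "year": 2004, "trigger": "electron"},
]

# Hypothetical event-level metadata, keyed by dataset name.
event_metadata = {
    "run1": [{"id": 1, "pt": 12.0}, {"id": 2, "pt": 40.0}],
    "run2": [{"id": 3, "pt": 55.0}, {"id": 4, "pt": 8.0}],
    "run3": [{"id": 5, "pt": 70.0}],
}


def query_datasets(catalogue, predicate):
    """Sub-query 1: select dataset names from the dataset-level catalogue."""
    return [rec["name"] for rec in catalogue if predicate(rec)]


def query_events(dataset_names, predicate):
    """Sub-query 2: select events, but only inside the chosen datasets."""
    return [
        (name, ev["id"])
        for name in dataset_names
        for ev in event_metadata[name]
        if predicate(ev)
    ]


def split_query(dataset_predicate, event_predicate):
    """Run the dataset-level sub-query first, then the event-level one."""
    names = query_datasets(dataset_catalogue, dataset_predicate)
    return query_events(names, event_predicate)


hits = split_query(
    lambda d: d["trigger"] == "muon",
    lambda e: e["pt"] > 30.0,
)
```

Running the dataset-level sub-query first means the event-level sub-query never touches `run3`, even though it contains a high-`pt` event.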
Workload Management
Four workload management concepts are discussed:
- No workload management - user program sends the job to a computing element, which runs the query, accesses the data and runs analysis code.
- Single job at best computing element - a query on the dataset catalogue is made to select the best computing element. That element then executes the event-level query on a supplied list of input datasets and executes analysis code.
- Multiple jobs with merged output - a query is made on the dataset catalogue. The job is split between multiple computing elements, and the outputs of these sub-jobs are merged at the end of execution.
- Multiple queries with merged input - a query is made on the dataset catalogue. The event-level query is then distributed to multiple computing elements which select events of interest. Selected events are then merged into an input dataset for the analysis code, running on a single computing element.
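As an illustration of the "multiple jobs with merged output" concept, the sketch below splits a list of datasets between computing elements, runs one sub-job per element and merges the partial results. It is a simplified, sequential stand-in: the round-robin split, the histogram output and all names are assumptions, and a real system would dispatch the sub-jobs to remote computing elements.

```python
from collections import Counter

# Hypothetical datasets, each holding a few toy "events" (plain integers).
events_by_dataset = {"a": [1, 2, 3], "b": [4, 5], "c": [6]}


def split(datasets, n_elements):
    """Share the datasets out round-robin between n computing elements."""
    return [datasets[i::n_elements] for i in range(n_elements)]


def run_sub_job(datasets, analysis):
    """Stand-in for one sub-job: histogram the analysis result per event."""
    histogram = Counter()
    for name in datasets:
        for event in events_by_dataset[name]:
            histogram[analysis(event)] += 1
    return histogram


def merge(partials):
    """Merge the sub-job outputs at the end of execution."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def analysis(event):
    """Toy analysis code: classify an event as even or odd."""
    return "even" if event % 2 == 0 else "odd"


assignments = split(["a", "b", "c"], 2)
partials = [run_sub_job(chunk, analysis) for chunk in assignments]
merged = merge(partials)
```

The merged result should match what a single job over all datasets would produce, which is the property that makes this kind of splitting safe for outputs such as histograms.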
It is also important to note that analysis puts demands on response time. The iterative nature of the activity makes it important to be able to get fast responses to test jobs etc.
Requirements
HEPCAL-II describes in detail four requirements specific to analysis work: availability of the provenance of datasets, the logging of analysis activities, persistent interactive sessions and deployment of user analysis code.
Provenance
It should be possible to relate a dataset to entries in the job metadata catalogue regarding the jobs used to produce it. One should be able to retrieve information on the executable name, environment settings and input files. This should allow the output to be reproduced. It should also make it possible to ban a particular dataset and thereby implicitly ban every dataset derived from it.
One problem with this is determining the minimal amount of data which must be stored to allow the above. It is suggested that it may be best done by pooling metadata from various sources e.g. job catalogue, dataset catalogue, computing element configuration log.
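The implicit banning of derived datasets amounts to a traversal of the provenance records. A minimal sketch, assuming a hypothetical `provenance` mapping from each output dataset to the job that produced it (the dataset names, executables and record layout are invented for illustration):

```python
from collections import deque

# Hypothetical provenance records: output dataset -> producing job.
provenance = {
    "reco_v1": {"executable": "reco", "inputs": ["raw"]},
    "tags_v1": {"executable": "tagger", "inputs": ["reco_v1"]},
    "ntuple_a": {"executable": "skim", "inputs": ["tags_v1", "calib"]},
    "reco_v2": {"executable": "reco", "inputs": ["raw2"]},
}


def banned_set(banned, records):
    """Return the banned dataset plus every dataset derived from it."""
    # Invert the records: input dataset -> datasets produced from it.
    children = {}
    for output, job in records.items():
        for inp in job["inputs"]:
            children.setdefault(inp, []).append(output)

    # Breadth-first walk down the derivation chain.
    seen = {banned}
    queue = deque([banned])
    while queue:
        current = queue.popleft()
        for child in children.get(current, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


affected = banned_set("reco_v1", provenance)
```

Banning `reco_v1` also bans `tags_v1` and `ntuple_a`, while `reco_v2` is unaffected because it shares no ancestry with the banned dataset.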
Logging
The system should log the results of any task not immediately discarded as useless, in enough detail to make them repeatable. If the provenance information requirements above are fulfilled, this should just consist of compiling and annotating that information. The system should also allow querying of the status of current tasks. These logs should be usable both by groups working together and by individuals.
Persistent Interactive Sessions
It should be possible to save an interactive session to the Grid and restore it later from elsewhere, maintaining all open datasets, loaded libraries and environment settings, query results, histograms etc.
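A rough sketch of what saving and restoring a session could involve is given below. It assumes the session state can be reduced to a serialisable record and uses a local temporary file as a stand-in for Grid storage; the field names and values are hypothetical, and re-opening datasets or re-loading libraries from the restored record is not shown.

```python
import json
import os
import tempfile


def save_session(session, path):
    """Write the session state to storage (a local file in this sketch)."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(session, handle, indent=2, sort_keys=True)


def restore_session(path):
    """Read the session state back, possibly from a different machine."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


# Hypothetical session state covering the items listed above.
session = {
    "open_datasets": ["tags_v1"],
    "loaded_libraries": ["libAnalysis.so"],
    "environment": {"ANALYSIS_MODE": "test"},
    "query_results": {"selected_event_ids": [3, 4, 8, 9]},
    "histograms": {"pt": [0, 2, 5, 1]},
}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "session.json")
    save_session(session, path)
    restored = restore_session(path)
```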
Analysis Software Deployment
User code may rely on particular libraries, compilers, linkers, environment variables, shells or other factors. There needs to be some mechanism to make sure that the computing element that will run the job has the proper set up. A possibility is to consider dependencies as input datasets so that the workload management system can take them into account when allocating resources to the job.
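The idea of treating dependencies as input datasets can be sketched as a simple matching step. The computing element names, the dependency labels and the `eligible_elements` function are all hypothetical; the point is only that data and software requirements go through the same check.

```python
def eligible_elements(job_inputs, elements):
    """Return the computing elements that hold every required input."""
    return [
        name
        for name, available in elements.items()
        if job_inputs <= available  # subset test: all inputs present
    ]


# Hypothetical computing elements and what each one has available.
elements = {
    "ce1": {"tags_v1", "gcc-3.2", "libAnalysis"},
    "ce2": {"tags_v1", "gcc-3.2"},
    "ce3": {"gcc-3.2", "libAnalysis"},
}

data_inputs = {"tags_v1"}
dependencies = {"gcc-3.2", "libAnalysis"}

# Dependencies are simply added to the job's list of inputs.
job_inputs = data_inputs | dependencies
candidates = eligible_elements(job_inputs, elements)
```

Here only `ce1` qualifies: `ce2` lacks the library and `ce3` lacks the data, and both cases are caught by the same test.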
Use Cases
Use cases in HEPCAL-II are presented differently to those in HEPCAL-I. Whereas, in HEPCAL-I, the tendency was toward detail, HEPCAL-II presents only three use cases:
- Production Analysis;
- Group-level Analysis; and
- End user Analysis.
These differ in terms of permissions and in what is done with the output: end user analysis output is downloaded to the user's local store, whereas production and group-level analyses result in new, metadata-described datasets for input into further analyses.
Future Work
Two areas identified for future work are:
- requirements on private metadata; and
- resource allocation to individuals and groups.
s.hanlon@physics.gla.ac.uk
Last modified Tue 8 June 2004