Summary of CDF5858 for Metadata Group

Steven Hanlon, May 2004


Introduction

CDF5858 describes CDF's requirements of a Grid-like system for data access and resource sharing. Use cases are presented.

CDF Data Model

CDF data are divided into data sets, organised by trigger type. The analysis chain starts with Raw Data, which are processed to produce Secondary Data Sets. These contain higher-level objects, such as tracks and jets, in addition to the raw data. The Secondary Data Sets are then filtered to produce Tertiary Data Sets, which are subsets of the secondary data sets and may contain less information per event. Finally, Ntuples, which can be processed on local machines, are produced for individual analyses.
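As an illustration only, the chain of data tiers described above could be modelled as an ordered enumeration. The names DataTier and derived_from are hypothetical and do not come from CDF5858 or any CDF software.

```python
from enum import IntEnum
from typing import Optional


class DataTier(IntEnum):
    """Tiers of the analysis chain, in the order they are produced."""
    RAW = 0        # raw data
    SECONDARY = 1  # raw data plus higher-level objects (tracks, jets)
    TERTIARY = 2   # filtered subset of a secondary data set
    NTUPLE = 3     # per-analysis ntuple, small enough for local machines


def derived_from(tier: DataTier) -> Optional[DataTier]:
    """Return the tier that the given tier is produced from, or None for raw data."""
    return None if tier is DataTier.RAW else DataTier(tier - 1)
```

For example, derived_from(DataTier.NTUPLE) gives DataTier.TERTIARY, while derived_from(DataTier.RAW) gives None.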

Requirements

Requirements on the middleware and data file catalogue are outlined. A basic assumption is that the standard CDF software will be available at all sites, though the possibility of packaging dependencies with the binary for distribution is noted.

Middleware Requirements

  • A web-based tool for generating reports on datasets.
  • The ability to add a new dataset to the catalogue.
  • That data may be copied to/from a particular site, and the appropriate metadata be updated.
  • That the database may be used to track where data is available.
  • That sites publish available system resources.
  • A facility which determines the cost of an operation to the end user.
  • Determination of priority by matching resources needed on a given computing node to resources available.
  • Job status monitoring.
  • A method of determining the most cost-effective manner to store data across multiple sites or at one site.
  • That storage resource permissions may be set up.
  • A database of simulation parameters.
  • Security!
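
Two of the requirements above, sites publishing their available resources and jobs being matched against them, can be sketched roughly as follows. This is a minimal sketch under stated assumptions: the classes SiteResources and JobRequest, their fields, and the ranking rule (prefer sites that already hold the dataset, then those with the most free CPUs) are all invented for illustration and are not taken from CDF5858.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class SiteResources:
    """Resources a site publishes (hypothetical fields)."""
    name: str
    free_cpus: int
    free_disk_gb: float
    datasets: FrozenSet[str]  # datasets already held at the site


@dataclass(frozen=True)
class JobRequest:
    """Resources a job needs (hypothetical fields)."""
    dataset: str
    cpus: int
    scratch_gb: float


def can_run(site: SiteResources, job: JobRequest) -> bool:
    """A site is eligible if it has enough free CPUs and disk for the job."""
    return site.free_cpus >= job.cpus and site.free_disk_gb >= job.scratch_gb


def rank_sites(sites: List[SiteResources], job: JobRequest) -> List[SiteResources]:
    """Return eligible sites, best first.

    Assumed ranking: sites that already hold the dataset come first
    (no copy needed), then sites with more free CPUs.
    """
    candidates = [s for s in sites if can_run(s, job)]
    return sorted(
        candidates,
        key=lambda s: (job.dataset not in s.datasets, -s.free_cpus),
    )
```

A real system would also need to fold in the cost, priority and permission requirements listed above; this sketch covers only the matching step.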

Data File Catalogue Requirements

The data file catalogue must contain:

  • data location information;
  • persistency state, e.g. disk or tape;
  • reprocessing information, e.g. time and type of processing, software version;
  • provenance; and
  • all generator parameters for Monte Carlo files.
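
The catalogue contents listed above could be represented by a record along the following lines. This is only a sketch: the class and field names (CatalogueEntry, ProcessingStep, Persistency and so on) are assumptions made for illustration, not a schema defined in CDF5858 or used by SAM.

```python
from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum
from typing import Dict, List, Optional


class Persistency(Enum):
    """Where a file currently resides."""
    DISK = "disk"
    TAPE = "tape"


@dataclass
class ProcessingStep:
    """One (re)processing pass applied to a file."""
    time: datetime
    kind: str               # type of processing
    software_version: str


@dataclass
class CatalogueEntry:
    """One file in the data file catalogue (hypothetical layout)."""
    file_name: str
    locations: List[str]                  # data location information
    persistency: Persistency              # disk or tape
    processing_history: List[ProcessingStep] = field(default_factory=list)
    parents: List[str] = field(default_factory=list)  # provenance: parent files
    # Generator parameters are recorded for Monte Carlo files only.
    generator_parameters: Optional[Dict[str, str]] = None

    def is_monte_carlo(self) -> bool:
        """Treat a file as Monte Carlo if generator parameters were recorded."""
        return self.generator_parameters is not None
```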

Use Cases

High-level use cases are presented based on the actions that an end user would wish to initiate. These are:

  • Identify a dataset
  • Specify a new dataset
  • Submit a job
  • Specify simulation parameters
  • Populate local disks with data
  • Reprocess data
  • Skim to create a compressed sample
  • Create an ntuple
  • Use an ntuple locally
  • Run simulation
  • Run fast simulation (with only a very simple detector simulation)

The CDF L3 Trigger

It is noted that the CDF L3 Trigger contains functionality useful to a Grid-based system. It includes a database relating software versions to executables. Libraries and dependencies can be packaged for distribution. Finally, database constants can be exported to a flat-file format for local use.

CDF Goals

The document describes three stages of functionality. At stage 1, job submission and data handling are largely manual. At stage 2, data handling is more automated, and there exist tools for determining CPU availability and job monitoring. At stage 3, the system is Grid-like, with jobs being automatically submitted to computing nodes depending on resource conditions.

Plans for the use of the SAM system in CDF, and for Grid enhancements to SAM, are then described, in more detail than this summary can cover.


s.hanlon@physics.gla.ac.uk


Last modified Tue 8 June 2004