File Integrity Testing


Executive Summary

The following is the GridPP storage and data management group plan for dealing with file corruption.

Fewer corrupted files were found than expected. We conclude that it is not worth doing anything special about the problem, but that it is worth making use of the checksumming features in FTS to keep an eye on it. If integrity turns out to become a serious problem later, we can revisit this plan.

Plan

The purpose of file integrity testing is, in approximate order of importance, to:

  1. Determine how many files get corrupted;
  2. Attempt to determine why they get (got) corrupted;
  3. Test procedures for detecting file corruption;
  4. Implement toolkit procedures to help fill gaps, when feasible;
  5. Suggest how to deal with corrupted files - the appropriate response may depend on the experiments' data models;
  6. Suggest how to prevent file corruption before it happens, where possible.

File corruption

Files can get corrupted during transfers ("in flight") or while stored somewhere on disk ("at rest"). In-flight corruption can happen if the file gets truncated in transfer: the most common case is that the file got "created" but the transfer never started, resulting in a zero-length file. Faulty memory can cause more subtle corruption when the file "passes through" the memory on its way from the network to the disk or vice versa.

At-rest corruption can happen if files get corrupted on a disk server, e.g. if the hard disk is dying. The T1 runs a disk-checking utility originally developed by CERN (fsprobe; see also the Results section below) to continuously check CASTOR disk servers.

Checksumming

Recent versions of Storage Elements have support for checksumming: the SRM interface provides a getFileMetadata checksum feature, and the GridFTP protocol also provides a checksumming feature.

  • GridFTP servers often do not implement the checksumming feature.
  • Clearly, all clients must support and use a common checksum algorithm, otherwise the values cannot be compared (a small sketch of such a guard follows this list).
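
A minimal sketch of such a comparison guard in Python. The leading-zero normalisation is an assumption, not documented behaviour: some tools print ADLER32 values without leading zeros.

  def checksums_match(type_a, value_a, type_b, value_b):
      """Compare two (algorithm, value) checksum pairs.

      Values are only comparable if the algorithm names agree
      (case-insensitively). Leading zeros are stripped because some
      tools omit them when printing ADLER32 values (an assumption)."""
      if type_a.lower() != type_b.lower():
          raise ValueError("incomparable checksum types: %s vs %s"
                           % (type_a, type_b))
      return value_a.lower().lstrip("0") == value_b.lower().lstrip("0")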

How Checksumming is Done

srmLs with fullDetailedList = TRUE is used to request checksums of files via the SRM interface - see GFD.129 page 49 for srmLs, and the file metadata block defined on p. 19. There is no way to ask for a "deep" checksum.
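
As a concrete illustration (hedged: this assumes the lcg_utils command-line tools are installed and that a long listing via lcg-ls -l triggers an srmLs with fullDetailedList=TRUE and prints the stored checksum; the parsing is purely illustrative, since output formats vary between SE implementations):

  import subprocess

  def srm_stored_checksum(surl):
      """Request a detailed SRM listing of a SURL and grep the output
      for a checksum field. The parsing is illustrative only."""
      out = subprocess.check_output(["lcg-ls", "-l", surl])
      for line in out.decode("utf-8", "replace").splitlines():
          if "checksum" in line.lower():
              return line.strip()
      return None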

Deep Checksumming

Sometimes we want to make absolutely certain we checksum the file as it is stored in the SE, rather than trust a checksum the SE may have cached. We refer to this case as "deep checksumming". Obviously srmGetFileMetadata may return the cached checksum, if any. Indeed, for performance reasons it is best to return the cached checksum, because checksumming files can be expensive (see Checksumming and Performance below).

For deep checksumming, we could query srmGetFileMetadata and time how long it takes: if it returns instantly, it is probably a cached checksum (a sketch of this heuristic follows). We could also use the lcg_utils API to call lcg_get_checksum with the force bit set; however, this may fail if GridFTP checksums are not properly supported on the SE. A much better approach may be to run a job which fetches the file, checksums it, and returns the checksum to a central database.
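
A rough sketch of the timing heuristic, assuming a command-line wrapper (here called lcg-get-checksum) exists for the lcg_get_checksum API call mentioned above; the two-second threshold is an arbitrary illustration, not a calibrated value:

  import subprocess
  import time

  def probably_cached(surl, threshold_seconds=2.0):
      """Time a checksum query against the SE: a near-instant answer
      probably came from the SE's metadata cache rather than a fresh
      read of the whole file."""
      start = time.time()
      subprocess.check_output(["lcg-get-checksum", surl])
      return (time.time() - start) < threshold_seconds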

Checksumming and Performance

The first aspect of checksumming is the choice of checksumming algorithm. The tradeoff is between performance and the ability to detect changes. SHA1 for example is excellent at detecting changes: in fact, two different files having the same checksum ("collisions") would be publishable as research! However, SHA1 was intended to detect intentional changes, changes made by a malicious and resourceful adversary. For our purposes where changes are usually unintentional, arising from simple failures, a simpler and faster checksum like ADLER32 is considered the best compromise. It would be quicker still to just add all words in the file, modulo word size, but there are unintentional changes occurring in practice which are not caught by this type of checksum.
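
A quick micro-benchmark illustrates the tradeoff; both functions are in the Python standard library, and the exact numbers depend on the host:

  import hashlib
  import os
  import time
  import zlib

  data = os.urandom(64 * 1024 * 1024)  # 64 MiB of random test data

  start = time.time()
  zlib.adler32(data)
  print("ADLER32: %.2f s" % (time.time() - start))

  start = time.time()
  hashlib.sha1(data).hexdigest()
  print("SHA1:    %.2f s" % (time.time() - start))

On typical hardware ADLER32 comes out several times faster than SHA1, which is why it is the usual compromise for transfer verification.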

The second aspect of performance is when the checksum is performed. If the checksum is done in flight, the implementation can keep feeding data into the checksumming algorithm as it comes in. Contrast this with an at-rest checksum, where a separate process must read the file and feed the data into the algorithm: this means an extra process on the system with potentially high CPU usage, and a read stream from a disk which may already be busy doing other things. Trying to read one file may significantly impact the performance of a process which is simultaneously writing a different file to the same disk.
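
The in-flight case can be sketched as follows: the checksum state is updated as each chunk arrives, so no second pass over the file is needed. The chunks iterable is a stand-in for whatever the transfer layer actually provides.

  import zlib

  def write_with_inflight_checksum(dest_path, chunks):
      """Write incoming chunks to disk while feeding them into
      ADLER32, avoiding a separate read-back pass later."""
      checksum = 1  # ADLER32 seed value
      with open(dest_path, "wb") as f:
          for chunk in chunks:
              f.write(chunk)
              checksum = zlib.adler32(chunk, checksum)
      return "%08x" % (checksum & 0xFFFFFFFF)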

Goal 1: Determine how many files get corrupted

We cannot usually check all files, so it makes sense to check representative samples. There are two cases:

  • Files written prior to implementations supporting checksumming
  • Files written after implementations support checksumming

In the latter case, (we expect) the file is checksummed as it comes in, and the checksum is stored in the Storage Element's own file metadata.
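
Sampling can be as simple as the sketch below; the 2.5% fraction mirrors the 6250-of-~250k ATLAS check reported under Results, and the (surl, known_checksum) pair format is an assumption about how the catalogue dump is represented:

  import random

  def sample_files(catalogue_entries, fraction=0.025, seed=None):
      """Draw a simple random sample from a list of
      (surl, known_checksum) pairs for re-checksumming."""
      rng = random.Random(seed)
      n = max(1, int(len(catalogue_entries) * fraction))
      return rng.sample(catalogue_entries, n)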

Things to check:

  • Datasets
  • Everything stored on a single disk server
  • Everything within a given spacetoken (esp. for ATLAS)

Goal 2: Why files get corrupted

Short answer: we don't know; first we have to figure out whether they get corrupted at all.

We can draft some possible options based on the T1 experiences:

  • Dodgy memory in a disk server
  • Dodgy disk
  • Timeouts causing files to be truncated or dropped
  • Credential problems - if you authenticate OK to SRM to "create" the file but fail to authenticate correctly to GridFTP for the actual transfer, the result is a zero-length file. This can also happen for other reasons.
  • Actual corruptions in transfer(?)

Goal 3: Detecting file corruption

  • Document support for checksumming in GridFTP implementations
  • Document when and whether srmGetFileMetadata returns a cached checksum or a freshly generated one - does it ever time out?
  • What does FTS do when it detects a checksum mismatch?

Goal 4: Detecting file corruption: toolkit tasks

Post-facto (files that already exist):

  • Tool to check file checksum consistency between LFC and SRM.
    • Tool to check file checksum consistency between replicas on the same or different SRMs.
      • lcg_get_checksum --force forces recalculation of a checksum against a SURL.
      • Can copy files locally and checksum them. Note: Python's adler32 implementation (zlib.adler32) is not standards compliant for versions <= 3.0, and is inconsistent in how it fails: versions 2.6.3 and 2.6.4 return signed longs rather than unsigned longs, while versions < 2.6.3 return signed or unsigned longs depending on whether the interpreter is 32- or 64-bit. Masking the result works around this; see the sketch after this list.
  • Tool to check file checksum consistency between VO specific catalogues and LFC / SRM (on top of above).
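
A portable at-rest ADLER32 for locally copied files is a short wrapper around zlib; masking to an unsigned 32-bit value works around the signedness differences noted above:

  import zlib

  def adler32_file(path, chunk_size=1 << 20):
      """Stream a file through zlib.adler32 in 1 MiB chunks and
      return the value as an 8-digit hex string; the & 0xFFFFFFFF
      mask makes the result unsigned on all Python versions."""
      checksum = 1  # ADLER32 seed value
      with open(path, "rb") as f:
          while True:
              chunk = f.read(chunk_size)
              if not chunk:
                  break
              checksum = zlib.adler32(chunk, checksum)
      return "%08x" % (checksum & 0xFFFFFFFF)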

Open question: what should be done with files whose stored checksums are of the "wrong" type? (We encounter MD5 checksums for some ATLAS files in our current testing, for example.)

New files:

  • FTS checksumming should ensure that (up-to-date releases of) the big SE implementations store checksums for new files.
    • Ensure that FTS checksums happen and work.
  • Detecting file corruption after transfer - random checks? (Equivalent to a sparse version of the post-facto case, with forced checksum calculation on the SE side.)
    • Some VOs (ATLAS) do checksums of files staged to WNs (vs known checksums for those files).


Dumps of SE file data need to be in a consistent format: we have chosen SynCat as the XML schema to use.

Goal 5: How to deal with corrupted files

If you find corrupted files:

  • Make a note of the SURL of the file.
  • It may be worth checking for other corrupted files on the same disk or disk server.
  • Contact the VO (representative) with the list of corrupted files.
  • They should be able to check whether they have other replicas of the file and let you know what to do with them.
  • If possible, in collaboration with the VO rep, compare the original file to the corrupted file - is it clear how it was corrupted?
  • Share your information with the mailing list...!

Goal 6: How to prevent file corruption

Results

In a check of 6250 ATLAS files (a sample of the ~250k files in the SE), no files with bad checksums were found.

In 520,000 FTS transfers (reported 31 March 2010), there were 13,000 failures (about 2.5%), of which only 7 (roughly 1 in 74,000 transfers) were due to checksum problems.

We conclude that file integrity is not a big problem, but it is worth watching the FTS alerts if/when it reports transfer failures due to checksum problems. We should also monitor the situation regularly.

We do not have enough data to reliably identify the cause of the checksum failures. Thus, we are unable to recommend a strategy for dealing with these. However, it may be worth considering running fsprobe on T2s.

The Maths

This ought to be Known, but we could not easily find it in the Literature. Our back-of-envelope calculation (an admittedly fairly large envelope...) indicates that the expected number of bad files in the original population occurs in the same ratio (bad/all) as in the sample - as one might expect, the estimate is k·N/n_s. But the confidence interval is broad: it would seem to be about √k·N/n_s (where N is the population size, n_s is the sample size, and k is the number of dodgy files in the sample). See [1] for the back-of-envelope.
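
As a worked illustration of the formula (with a hypothetical k, since our actual sample found none):

  import math

  def estimate_bad_files(N, n_s, k):
      """Scale the sample result to the population: estimate
      k*N/n_s bad files, with a rough sqrt(k)*N/n_s spread."""
      estimate = k * N / float(n_s)
      spread = math.sqrt(k) * N / float(n_s)
      return estimate, spread

  # e.g. 3 bad files in a 6250-file sample from a 250k population:
  print(estimate_bad_files(250000, 6250, 3))  # -> (120.0, ~69.3)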

TODO List

Task no | Owner | Description | Deadline | Status | Comment
1 | BD | Checksum a sample of ATLAS files at Lancaster against checksums kept in DQ2 (using srmLs) | 2010.02.17 | Closed |
2 | BD | Checksum sample files locally to get a deep checksum, and compare against the srmLs checksum | 2010.02.17 | Open | Needs permission at Lancaster
3 | NN | Figure out when srmLs is deep and when it is shallow (using the database-stored checksum) | | Open | Andrea Sciaba documenting SRM behaviour - could provide input to this
4 | NN | Evaluate fsprobe at T2s | | Open | Some enterprising T2...
5 | MH | Report on FTS failures due to file integrity problems | | Open | Assign to Matt Hodges?

Work by other Groups

Cédric Serfon, and others, are developing multiple tools for whole-system file integrity checking for the ATLAS VO. Their Consistency Service allows users to declare files Suspicious (potentially corrupt or missing); such files are then automatically checked, and their replacement is managed either by copying from existing replicas elsewhere, or by correcting the inconsistency in the catalog hierarchy so that the file is "really deleted". This, of course, does not provide automated *detection* of catalog inconsistencies, nor does it extend to VOs other than ATLAS (it being somewhat DQ2/DDM based).