ATLAS pCache study

From GridPP Wiki

This page is a study of pCache usage within the UK.

pCache at the Tier 1

  • The Tier 1 did run pCache for a while, but not for long.
  • It broke, and ATLAS did not have the effort available to continue development.
  • The new version has not been tried.
  • The Tier 1 created a directory on each WN for the cache and then left it to ATLAS to use.
  • ATLAS reported an improvement in jobs (documentation unavailable at the moment); "cache hit" entries were seen in the log files.
  • (pCache also worked at Lancaster, Glasgow and RAL-PPD.)

pCache at Glasgow

pCache installation and configuration

pCache works by intercepting calls to the standard data copy commands (dccp, rfcp) and replacing them with a check against a cache. If the cache contains a copy of the file, a hard-link is made to it at the "destination" of the copy command. If the cache does not contain a copy, the relevant copy command is run to copy it into the cache, and a hard-link is then made. pCache manages the size of its cache with a usage check on each invocation, and uses the implicit link-counting garbage collection of the filesystem (a file with no hard-links is, by definition, deleted) to avoid deleting files that are still in use by a user process.
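The hit/miss and garbage-collection behaviour described above can be sketched in a few lines of shell. This is purely illustrative: the paths are demo paths, not pCache's real layout, and a real miss would run rfcp/dccp rather than writing a file.

```shell
#!/bin/sh
# Minimal sketch of the hard-link cache scheme (demo paths, not pCache's layout).
CACHE=/tmp/pcache-demo/cache
DEST=/tmp/pcache-demo/dest
mkdir -p "$CACHE" "$DEST"
rm -f "$CACHE/myfile" "$DEST/myfile"

if [ -f "$CACHE/myfile" ]; then
    ln "$CACHE/myfile" "$DEST/myfile"   # cache hit: hard-link instead of copying
else
    echo "payload" > "$CACHE/myfile"    # cache miss: fetch into the cache (rfcp/dccp in reality)
    ln "$CACHE/myfile" "$DEST/myfile"
fi

# Garbage collection via link counting: removing the cache copy drops only one
# link, so a file still held at a destination survives until its last link goes.
rm "$CACHE/myfile"
cat "$DEST/myfile"
```

Because both names are hard-links to the same inode, evicting the cached copy never deletes data out from under a running job.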

At Glasgow, the following was done to implement this:

  1. Add the pcache Python script to the ATLAS experiment software area (ensuring that it is transparently available to all worker nodes).
  2. Add a wrapper script called "rfcp" to the cfengine config for all worker nodes, in the path /opt/pcache/ .
    • This script is configured differently for different classes of worker node, but in all cases it invokes pcache, passing its commandline arguments through to the pcache instance.
    • This wrapper is what actually performs the interception for us.
  3. Add /opt/pcache/ to the $PATH on the worker nodes, via a line in the ATLAS setup-local.sh script provided for local customisation.
    • Thus /opt/pcache/rfcp is the first rfcp found on the WNs and is the one invoked.
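The $PATH trick in step 3 can be demonstrated in isolation. The snippet below uses throwaway demo paths (the real wrapper lives in /opt/pcache/ and calls pcache rather than echoing):

```shell
#!/bin/sh
# Sketch of the $PATH interception: a wrapper directory prepended to $PATH
# shadows any later binary of the same name. Demo paths only.
mkdir -p /tmp/pcache-demo/bin
printf '#!/bin/sh\necho wrapper\n' > /tmp/pcache-demo/bin/rfcp
chmod +x /tmp/pcache-demo/bin/rfcp

export PATH=/tmp/pcache-demo/bin:$PATH
rfcp    # resolves to the wrapper first, just as /opt/pcache/rfcp does on the WNs
```

No job-side changes are needed: any code that invokes a bare "rfcp" picks up the wrapper automatically.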

Configuration

Configuration of pCache is done simply by choosing the commandline options passed to it in the wrapper script (in our case, /opt/pcache/rfcp ). As we have different sizes of hard disk in our different generations of worker node, we had to set up several versions of the rfcp script to support differing cache sizes.

  • Old CV worker nodes ( GB hard disk):
    #! /bin/sh
    # Rewrite any occurrence of "rfcp" in the arguments to the full path of the real binary
    ARGS=$(echo $@ | sed 's/rfcp/\/opt\/lcg\/bin\/rfcp/')
    # Invoke pcache with a 3-hour timeout and a 25G cache, wrapping the real rfcp
    /expsoft/atlas/pcache.py --timeout=10800 --scratch-dir=/tmp/ --storage-root=/dpm/ --max-space=25G /opt/lcg/bin/rfcp $ARGS
  • Newer Viglen worker nodes ( GB hard disk):
    #! /bin/sh
    # As above, but with a 200G cache to match the larger local disks
    ARGS=$(echo $@ | sed 's/rfcp/\/opt\/lcg\/bin\/rfcp/')
    /expsoft/atlas/pcache.py --timeout=10800 --scratch-dir=/tmp/ --storage-root=/dpm/ --max-space=200G /opt/lcg/bin/rfcp $ARGS

In addition, some of the Viglen worker nodes were retrofitted with SSDs (Intel 250G X25-M SATA), requiring a third rfcp script to take their reduced capacity into account.

As is mentioned later, we also include a long explicit timeout for copies in these configurations, due to issues with the default timeout.

pCache issues

  • The default copy timeout is too short under high storage load.
  • The cache competes for space with the scratch space used by user jobs.
    • This cannot be fixed by moving the cache to a separate filesystem, as pcache depends on hard-links for cache control (and hard-links can only be made within a single filesystem).
    • This is especially an issue with SSD-based caches, which would otherwise be a potential solution for storage bottlenecks on WNs.

Some statistics on pCache

Average hits/misses for pCache on the WNs, aggregated across nodes (gathered via pdsh and awk):
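The exact pdsh command and log format are not recorded on this page, but the aggregation step can be sketched with awk over hypothetical "host: count" lines of the kind pdsh emits:

```shell
#!/bin/sh
# Hypothetical per-node hit counts (the real counts came from pdsh across the
# WNs; the pcache log location and format are not recorded here).
printf 'wn1: 100\nwn2: 200\nwn3: 300\n' |
  awk -F': ' '{ s += $2; ss += $2 * $2; n++ }
    END { m = s / n; printf "mean=%.0f sd=%.0f n=%d\n", m, sqrt(ss / n - m * m), n }'
# prints "mean=200 sd=82 n=3"
```

The same one-liner, fed the real pdsh output, yields the μ, σ and n figures below.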

10 August 2010:

hits:

  • μ = 199
  • σ = 120
  • n = 252

misses:

  • μ = 622
  • σ = 458
  • n = 267

Fraction of mean hits relative to mean total requests: 0.24 ± 0.03 (approximately).
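The quoted fraction follows directly from the means above:

```shell
# hit fraction = mean hits / (mean hits + mean misses), using the means quoted above
awk 'BEGIN { printf "%.2f\n", 199 / (199 + 622) }'   # prints 0.24
```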

This is for a workload consisting of mixed ATLAS production and analysis, with no other VOs using pcache. So, the average (local) request load on the DPM should be reduced by around 24%. However, this does not take into account the size of the files receiving the hits: external knowledge about the ATLAS hot files suggests that the hits are most likely to be on smaller files (DBRelease files, for example), and therefore the reduction in network load will be less than 24%.