Optimising dCache Performance
This page aims to provide information for Tier-2 system administrators on how to optimally set up their dCache system in order to participate successfully in the LCG service challenges and in LCG proper. Since it is primarily for use by Tier-2s, it is assumed that there is no tape backend and that the dCache system is composed of a set of nodes providing the central dCache services along with a set of disk pool nodes for storage. Comments are welcome and can be directed towards Greig Cowan.
Before trying to optimise your dCache setup, it is a good idea to check that you actually have a working system:
- All nodes installed with latest versions of the dCache software.
- Grid certificates installed.
- Correct firewall ports open. See dCache FAQ.
- Ability to use the srmcp and globus-url-copy clients to copy files into and out of the dCache (see the example below).
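As a quick check of the last point, transfers like the following should succeed. The hostname and pnfs path are placeholders for your own GridFTP/SRM door and VO directory, and the ports shown are the usual defaults:
# copy a local file into dCache through the GridFTP door (hypothetical host and path)
globus-url-copy file:///tmp/testfile gsiftp://dcache-door.example.ac.uk:2811/pnfs/example.ac.uk/data/dteam/testfile
# the same transfer via SRM
srmcp file:////tmp/testfile srm://dcache-door.example.ac.uk:8443/pnfs/example.ac.uk/data/dteam/testfile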
Doors
dCache uses the concept of doors to specify which nodes present interfaces to the outside world that allow access to the dCache system. To reduce load on the system, it is best to separate door nodes from pool nodes. However, this requires more hardware than is often available, particularly at Tier-2 sites, in which case doors and pools will have to coexist on the same machine. Note that if the doors and pools are separated, additional internal traffic is required to send the data from the pool to the door node before it is read over the WAN. This additional transfer step does not necessarily occur if the read goes through a door running on the pool node where the data resides.
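As a rough illustration, the set of doors started on a node is normally chosen at install time in /opt/d-cache/etc/node_config. The variable names below are taken from a typical 1.6/1.7-era installation and may differ in your release, so treat this as a sketch rather than a definitive recipe:
# /opt/d-cache/etc/node_config (sketch -- names vary between dCache versions)
NODE_TYPE=custom
GRIDFTP=yes      # start a GridFTP door on this pool node
GSIDCAP=yes      # start a GSI dcap door on this pool node
SRM=no           # the SRM door usually runs only on the admin node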
PNFS databases
Each VO should use its own PNFS database to prevent bottlenecks occurring due to the single-access restriction on each database.
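A new database is normally created with the PNFS mdb tool and then attached to a directory in the namespace. The database name, file location, database ID and directory below are purely illustrative; check the dCache Book for the exact procedure on your version:
# create a separate PNFS database for (for example) the atlas VO
/opt/pnfs/tools/mdb create atlas /opt/pnfsdb/pnfs/databases/atlas
/opt/pnfs/tools/mdb update
# note the ID assigned to the new database
/opt/pnfs/tools/mdb show
# attach it to the namespace (here assuming the new database was given ID 5)
cd /pnfs/fs/usr/data
mkdir '.(5)(atlas)'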
Pool setup
dCache is a highly configurable system. In order to effectively manage the set of available disk pools, it is advisable to create a set of pool groups to which individual pools belong. A set of rules (called links) can then be applied to pools and pool groups to control how dCache uses them (e.g. a link can be created which only allows read access to the pools in the group it points to, or which only allows writes from hosts in a particular domain). dCache uses the combination of pool groups and links to find the most suitable pool for any particular operation. Sites will find links extremely useful when it comes to limiting which VOs can write to which pools.
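Pool groups are created in the PoolManager cell of the admin interface along the following lines; the pool and group names are placeholders, and the groups created here are the ones referred to by the links set up in the next step:
# in the PoolManager cell of the admin interface (names are placeholders)
psu create pgroup read-pools
psu addto pgroup read-pools pool1_01
psu addto pgroup read-pools pool2_01
psu create pgroup write-pools
psu addto pgroup write-pools pool3_01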
The dCache developers suggest that optimal performance will be achieved when separate read and write pools are used since this will spread the load when simultaneous read and write requests come into the dCache. One way of setting this up is to do the following in the PoolManager of the dCache admin interface:
psu create unit -net <0.0.0.0/0.0.0.0>
psu create ugroup <allnet-cond>
psu addto ugroup <allnet-cond> <0.0.0.0/0.0.0.0>
psu create link <read-link> <allnet-cond>
psu set link <read-link> -readpref=<10> -writepref=<0> -cachepref=<10>
psu add link <read-link> <read-pools>
psu create link <write-link> <allnet-cond>
psu set link <write-link> -readpref=<0> -writepref=<10> -cachepref=<0>
psu add link <write-link> <write-pools>
This step sets up separate read and write pool groups. If you want to continue using the dCache as an LCG storage element (which requires pool groups with the names of the supported VOs), then you will need to modify the above commands to reflect that. You could potentially have separate write pools for each VO and a set of generic read pools that all VOs would use. When a file is copied into a read pool in order to be read, this cached copy will subsequently be removed if space runs low.
dCacheSetup file
Buffer sizes
The main dCache configuration file can be found at /opt/d-cache/config/dCacheSetup. This contains various parameters that can control the behaviour of your dCache instance. Two of these parameters are
# ---- Transfer / TCP Buffer Size
bufferSize=1048576
tcpBufferSize=1048576
The first controls the buffer used when reading from or writing to disk (to keep the disk rate up) and the second controls the TCP buffer used for the network transfer. The values above are the defaults. Both affect the rate of GridFTP transfers: you need to keep enough bytes in flight to maintain your rate, and this depends on your bandwidth delay product:
BDP (bits) = total_available_bandwidth (bits/sec) x round_trip_time (sec)
hence there is no single value that suits every dCache instance (hardware configuration also plays a part). As an example, FNAL-CERN transfers use an optimal value of 2 MB for each buffer. Tuning the values has been found to help single transfers, but with multiple simultaneous transfers, varying the two buffers tended to thrash the system. The bandwidth delay product for intra-European transfers is correspondingly smaller, which is (probably) why the default was set to 1 MB. Of course, you need to have allowed the system TCP buffers to be as large as the values you specified in dCacheSetup. Finally, srmcp exposes these buffers as client-side options as well, to help tune specific transfers.
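As a worked example, a 1 Gbit/s path with a 20 ms round-trip time gives a BDP of 10^9 x 0.02 = 2 x 10^7 bits, i.e. about 2.5 MB. For buffers of that size to be usable, the kernel TCP limits on the door and pool nodes must be at least as large as the values in dCacheSetup. Something along the following lines (the figures are illustrative) can be placed in /etc/sysctl.conf and applied with sysctl -p:
# /etc/sysctl.conf -- illustrative limits, sized to a ~2.5 MB BDP
net.core.rmem_max = 2621440
net.core.wmem_max = 2621440
net.ipv4.tcp_rmem = 4096 87380 2621440
net.ipv4.tcp_wmem = 4096 65536 2621440
On the client side, the corresponding srmcp options are -buffer_size and -tcp_buffer_size (check the srmcp help output for your client version).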
GridFTP Performance Markers
Ensure that the value of performanceMarker in the dCacheSetup file is less than 180, otherwise transfers initiated with FTS will time out because the GridFTP performance markers will not arrive quickly enough.
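For instance, any value comfortably below 180 seconds will do; the parameter name is as given above, but check the dCacheSetup template shipped with your release for the exact spelling:
# send GridFTP performance markers well within the FTS timeout
performanceMarker=70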
Storage hardware
dCache works best in a distributed environment, where it collects together the available storage resources from a number of pool nodes. If a dCache door is opened on each of these pool nodes, then each node can serve the data that it holds, reducing the possibility of bottlenecks that would occur if data had to be routed through another node.
Filesystems
Filesystem comparisons have already been carried out at Karlsruhe and SARA, with the conclusion that XFS came out on top. The use of NFS-mounted storage and of RAID configurations also merits investigation. Optimisation of mount parameters is really a local issue; benchmarking programs such as bonnie++ and IOzone can help to determine the best settings for your hardware (see the sketch below).
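As a sketch, a pool filesystem mounted at the hypothetical path /pool1 could be benchmarked with bonnie++ along the following lines:
# benchmark a pool filesystem (hypothetical mount point /pool1)
# -d: directory to test in, -s: file size in MB (use at least twice the node's RAM)
# -u: non-root user to run as
bonnie++ -d /pool1/bonnie-test -s 8192 -u dcache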