Performance and Tuning
Contents
- 1 Common Issues
- 2 Implementation Specific Issues
Common Issues
Network
Network issues, tcp buffer size
Generally follow http://fasterdata.es.net/fasterdata/host-tuning/linux/, with a few tweaks, as these are quite aggressive settings.
Recently it has been found that these settings were not optimal for some UK DPM sites in transfers to BNL (quite specifically). The UKTcpTuning page lists current tunings used at sites.
In particular the default TCP buffer size was increased to around 1M.
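As a rough guide to why a buffer of around 1M can matter, a common rule of thumb is that a TCP buffer should hold at least the bandwidth-delay product of the path. The sketch below only illustrates that arithmetic; the 1 Gb/s bandwidth and 8 ms round-trip time are hypothetical values, not measurements from any site.

```python
def bdp_bytes(bandwidth_bits_per_s: float, rtt_seconds: float) -> int:
    """Bandwidth-delay product: bytes that must be in flight to keep the path full."""
    return int(round(bandwidth_bits_per_s * rtt_seconds / 8))


# Hypothetical path: 1 Gb/s link with an 8 ms round-trip time.
example_buffer = bdp_bytes(1e9, 0.008)
print(f"suggested minimum buffer: {example_buffer} bytes")
```

With these assumed numbers the result is about 1 MB; longer paths or faster links need proportionally larger buffers.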
Filesystem
First, avoid using NFS-mounted partitions. Performance testing has shown that NFS mounts have dire write performance. It is far better to configure each of the disk servers offering space to the SE as nodes in the DPM/dCache setup.
Secondly, use xfs filesystems if possible. Tests at Glasgow showed an almost twofold performance increase using xfs filesystems. xfs also produces the least system load and is a very reliable filesystem. ext4 is also now an option. Specific (though perhaps old now) advice is available on:
Recent work on ZFS hints that this may be a more appropriate file system. See the Wikipedia page for ZFS: https://en.wikipedia.org/wiki/ZFS
FTS
With FTS you can set how many files are transferred per channel, and how many GridFTP streams are used to transfer each file. For example, if you are transferring 3 files and specify 5 streams per file, a total of 15 streams are written concurrently. It turns out that this can generate a high load on the disk if all streams go to a single disk, because there are lots of concurrent competing (non-sequential) writes to the same disk.
Conversely, transferring a single file with a single stream will write nicely to the remote disk but will usually not saturate the network.
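The trade-off above can be sketched with a little arithmetic. The function names are illustrative only, and the sketch assumes that all streams of one file land on the same disk and that files are spread evenly over the available disks.

```python
import math


def total_streams(active_files: int, streams_per_file: int) -> int:
    """Number of concurrent GridFTP streams on a channel."""
    return active_files * streams_per_file


def writers_on_busiest_disk(active_files: int, streams_per_file: int, disks: int) -> int:
    """Concurrent writers seen by the busiest disk.

    Assumption: every stream of a given file writes to the same disk, so it
    is files (not streams) that get distributed across disks.
    """
    files_on_busiest_disk = math.ceil(active_files / disks)
    return files_on_busiest_disk * streams_per_file


print(total_streams(3, 5))               # the example from the text
print(writers_on_busiest_disk(3, 5, 1))  # everything lands on one disk
print(writers_on_busiest_disk(3, 5, 3))  # one file per disk
```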
In general the best option has been to allow the in-built optimizer to adjust settings for each link as appropriate, and to intervene only in exceptional circumstances and for defined periods of time. The level setting for the optimizer needs further investigation. Note that the optimizer is sometimes slow to recover from a badly behaving SE unless enough files are submitted to allow it to recover.
XRootD
XRootD usage is becoming more prevalent. It is also being used internally, both for the control layer of a transfer and as the transport layer. There are VO-specific settings needed (in both versioning and functionality), so which VOs are supported needs to be taken into account. Usage of XRootD for WAN transfers, both for FTS whole-file transfers and for sparse reading from Wide Area Worker Nodes (WAWN), may also have effects on tuning.
Current transfer performance can be monitored by:
But this will be replaced by:
https://monit-grafana.cern.ch/dashboard/db/xrootd-dashboard?orgId=1
Implementation Specific Issues
DPM
Summary
- Make sure your site has a decent network connection onto the MAN. 1Gb should be considered a minimum these days, ideally 10Gb. Certain sites may need more.
- Purchase a number of disk servers to spread the i/o load across your SE - modern RAID disk systems should get at least 300Mb/s write speed, so you need at least 3 to saturate your 1Gb link.
- Format the filesystem with xfs. Never use filesystems mounted using NFS.
- Consider optimal configuration of the MySQL database backend.
- Settings as sites change to DOME may change considerably.
Filesystem Choice for Disk Servers
See Glasgow DPM Performance Tests.
The best filesystem to use for DPM used to appear to be xfs or ext4; see the XFS Kernel Howto. To format a filesystem used by DPM as xfs without losing data, see the DPM Filesystem XFS Formatting Howto. ZFS has also been shown to have some success: https://en.wikipedia.org/wiki/ZFS
Load Balancing Servers
DPM's filesystem selection "algorithm", for writes, is very basic: each filesystem is weighted equally. This means that it's very important to ensure that you have about an equal number of filesystems on each server (or, more correctly, a number of filesystems scaling roughly with the disk server's network and i/o capabilities).
As an example, if you have two disk servers, and they are of equal speed, then generate the same number of filesystems on each: svr1:/fs1, svr1:/fs2, svr2:/fs1, svr2:/fs2. (Why not just one filesystem on each? Imagine what happens if you then add a less performant piece of hardware.)
If you have one new disk server and one old, then create more filesystems on the new server than on the old.
In this way the load between the servers should be balanced properly.
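Because every filesystem is weighted equally, the expected share of new writes a server receives is simply its fraction of the total filesystem count. The following is a minimal sketch of that reasoning, not DPM's actual code; the server names "new" and "old" are hypothetical.

```python
from collections import Counter


def expected_write_share(filesystems: list[str]) -> dict[str, float]:
    """Fraction of new writes per server when each filesystem is equally likely.

    Entries are given in the 'server:/path' form used in the text.
    """
    per_server = Counter(fs.split(":", 1)[0] for fs in filesystems)
    total = len(filesystems)
    return {server: count / total for server, count in per_server.items()}


# Two servers of equal speed, two filesystems each (the example from the text).
balanced = expected_write_share(["svr1:/fs1", "svr1:/fs2", "svr2:/fs1", "svr2:/fs2"])

# Hypothetical mix: a newer, faster server is given three filesystems, the older one.
skewed = expected_write_share(["new:/fs1", "new:/fs2", "new:/fs3", "old:/fs1"])

print(balanced)
print(skewed)
```

In the second case the newer server is expected to take three quarters of the writes, which is the intended effect of giving it more filesystems.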
MySQL Configuration
As the database backend needs to be written to for every transaction that DPM performs, optimisation of the MySQL configuration can give significant benefits. Any changes to the MySQL master configuration file (/etc/my.cnf or equivalent) require the MySQL server daemon to be restarted.
Binary Logging and --single-transaction dumps
By default, MySQL only logs events to a human-readable log which does not contain enough information to reconstruct the database state. One can activate "binary logging" in MySQL, which writes (non-human-readable) logs of all database changes as they happen. This is sufficient to allow the database state at any point to be reconstructed from the state when logging began plus the list of changes in the binary log.
This has an additional benefit: normally, MySQL must lock tables when dumping their state for backup (as their state cannot change whilst being dumped); with binary logging enabled, MySQL can perform dumps without locking (by marking the dump start time in the logs, and using them to maintain the consistent state from that period whilst the dump continues). This requires that the dump be performed in "single transaction" mode. The advantage here is that one can now perform dumps with almost no visible effect on database availability, as the tables are always available to be written to.
To enable this, in /etc/my.cnf, add:
log-bin
expire_logs_days = 3
(the latter option sets a reasonable log retention period to save disk space; as long as this is longer than the period between dumps, there is no issue)
and make sure that your mysql dump script calls mysqldump with the --single-transaction switch set.
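A dump script along these lines might look like the sketch below. The database name, user name and the idea of reading credentials from the user's own MySQL option file are assumptions made for illustration; adapt them to the local setup.

```python
import subprocess


def build_dump_command(database: str, user: str) -> list[str]:
    """Build a mysqldump command line that dumps without locking InnoDB tables."""
    return [
        "mysqldump",
        "--single-transaction",  # consistent snapshot instead of table locks
        f"--user={user}",        # password assumed to come from an option file
        database,
    ]


def dump_to_file(database: str, user: str, out_path: str) -> None:
    """Run the dump and write the SQL to out_path; raises if mysqldump fails."""
    with open(out_path, "wb") as out:
        subprocess.run(build_dump_command(database, user), stdout=out, check=True)


# Hypothetical names, printed only to show the resulting command line.
print(" ".join(build_dump_command("dpm_db", "backup")))
```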
If you wish to avoid making explicit backups, relying on the binary logs themselves, then this is not a sufficiently safe configuration. You will probably want to enable explicit binary log syncing and other features to ensure that crashes do not leave logs in an inconsistent state.
Indexing heavily used tables
Increasing buffer capacity
The databases used by DPM are created as "InnoDB" databases within MySQL itself. These have various configuration options which can be set in the MySQL configuration file (/etc/my.cnf, for example).
Whilst the OS can cache writes and reads to disk in memory, it is usually more efficient to let MySQL cache its operations. For InnoDB databases, the size of the in-memory cache used is set by the innodb_buffer_pool_size directive. By default, this is rather small, especially for a large and complex database. Generally, one should alter this to make it as large as possible without squeezing out other processes. On a modern server, 1GB or more will be very helpful, and should not impact other memory uses much. (Glasgow runs with almost two thirds of its database server's memory allocated to the cache, but we have a dedicated database server separate from the DPM instance.)
On increasing the size of the InnoDB pool, one should also prevent the OS from being allowed to cache those operations as well (this is just a waste of memory). The interaction between OS and the InnoDB engine is controlled with the innodb_flush_method option, which should be set to O_DIRECT to disable OS-level caching.
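The two settings discussed above can be generated from the amount of memory you are prepared to give to MySQL. This is only a sketch: the 8 GiB server and the one-quarter fraction are hypothetical, and the right fraction depends on what else runs on the host.

```python
def innodb_settings(total_ram_bytes: int, fraction: float) -> str:
    """Return a my.cnf fragment sizing the InnoDB buffer pool.

    fraction is the share of RAM handed to the buffer pool (0 < fraction < 1).
    """
    if not 0 < fraction < 1:
        raise ValueError("fraction must be between 0 and 1")
    pool_mb = int(total_ram_bytes * fraction) // (1024 * 1024)
    return "\n".join([
        "[mysqld]",
        f"innodb_buffer_pool_size = {pool_mb}M",
        "innodb_flush_method = O_DIRECT",  # avoid double caching in the OS
    ])


# Hypothetical head node with 8 GiB of RAM, a quarter of it given to InnoDB.
print(innodb_settings(8 * 1024**3, 0.25))
```

As noted earlier, the MySQL server daemon has to be restarted for changes like these to take effect.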
Separation of DPM and MySQL backend
In order to reduce the load on the head node, the DPM and MySQL services can be separated. How to do this is documented in the Dpm Admin Guide, and the improvements seen at UKI-SCOTGRID-GLASGOW are documented on the scotgrid wiki.
Tuning RFIO Buffer size
The size of the buffer used by rfio can be set by a variable in /etc/shift.conf on the client (worker node):
RFIO IOBUFSIZE XXX
where XXX is in bytes.
There are a number of factors that determine the best setting for this, including the number of concurrent jobs, the file access type (random or sequential), the size of file and amount of data being processed. The buffering should maximise the amount of data being accessed from worker node RAM, minimise the number of requests to the storage servers over the network and minimise the number of IO operations the storage has to process.
For an ideal sequential data file the buffer size would be large enough to buffer at least one data chunk/event so that all access for that chunk/event is to local RAM. Reading directly from the network with no buffering incurs a heavy penalty in overhead (TCP/IP headers, network latency and IO operations) compared to reading from RAM; network latency is typically in ms, RAM in ns. Conversely, reading more of the file into the buffer than is needed also incurs an overhead in wasted network bandwidth and wasted CPU time while waiting for the read to complete.
If the chunks are particularly small, the buffer should be big enough to read in a number of events such that the total size is a multiple of the stripe/readahead size on the storage server. For example, if the storage system reads in at least 256kB for each request, it is more efficient to make a single request for 256kB than four requests of 64kB for the same data, particularly when there are many jobs/requests.
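The rounding described here can be sketched as follows. The function name is illustrative, and the event and readahead sizes are example values rather than recommendations.

```python
KIB = 1024


def aligned_buffer_size(event_bytes: int, readahead_bytes: int) -> int:
    """Smallest multiple of the server readahead that holds at least one event."""
    if event_bytes <= 0 or readahead_bytes <= 0:
        raise ValueError("sizes must be positive")
    chunks = -(-event_bytes // readahead_bytes)  # ceiling division
    return chunks * readahead_bytes


# 64 kB events against a 256 kB readahead: buffer one full readahead unit.
print(aligned_buffer_size(64 * KIB, 256 * KIB))

# An event larger than one readahead unit rounds up to the next multiple.
print(aligned_buffer_size(300 * KIB, 256 * KIB))
```

The resulting value, in bytes, is what would go in the RFIO IOBUFSIZE setting.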
Most sites have found that LHC data analysis (particularly for ATLAS) presents a particular problem due to the large size of the data files, the pseudo-random nature of access and the large numbers of concurrent jobs. The files are too large to buffer completely in RAM and, due to the random access, any buffered data almost immediately needs to be replaced. This means that any level of buffering is wasteful of bandwidth and CPU time while data is buffered.
Generally sites in the UK have found that setting a low value of this variable, at around 4kB, provides better efficiency and lower bandwidth requirements for an LHC data analysis type job with large (~2GB) files. Unfortunately, this does not scale particularly well, as it creates a very high number of inefficient IOPS for the storage servers. Current storage servers have only a limited number of IOPS and quickly saturate, leading to overloaded storage and inefficient jobs that have to wait for data.
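To get a feel for the scale of the problem, the sketch below counts how many buffer-sized requests it takes to read a given amount of data. It is an upper-bound illustration that assumes a job reads the whole file through the buffer; real analysis jobs with random access will read only part of it.

```python
def requests_needed(bytes_read: int, buffer_bytes: int) -> int:
    """Number of buffer-sized requests needed to read bytes_read."""
    if buffer_bytes <= 0:
        raise ValueError("buffer size must be positive")
    return -(-bytes_read // buffer_bytes)  # ceiling division


file_bytes = 2 * 1024**3  # a ~2GB data file, as in the text

for buffer_bytes in (4 * 1024, 256 * 1024):
    count = requests_needed(file_bytes, buffer_bytes)
    print(f"{buffer_bytes} byte buffer: {count} requests")
```

Multiplying the request count by the number of concurrent jobs gives a rough idea of the request rate the storage servers have to sustain.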
Currently (Feb 2010) most sites get higher efficiency and higher scaling by using file stager (copying the data files to local worker node disk) as at least some of the file data is cached using standard Linux page cache and there are a smaller number of IOPS to the storage servers. Due to the low load on the storage this can scale much higher at a steady efficiency (the best efficiency a saturated worker node can achieve). File stager performance can be increased by using RAID0, separate system disks, short stroking large disks and tuning the page cache flushing.
This may change soon as experiments reorder data to reduce or eliminate much of the random access, meaning more ideal buffering can be used. Tagged analysis jobs that only access a fraction of the events in a dataset will also benefit from direct POSIX access, rather than copying GBs of data that isn't needed. Sites should run tests (e.g. ATLAS HammerCloud) to determine which method and which amount of buffering provides the best throughput, particularly if there are changes in experiment data format or in the number/nature of storage servers.
Tuning block device readahead
Similarly to the above many sites have found it useful to increase the size of the readahead in the block device.
For example: http://northgrid-tech.blogspot.com/2010/08/tuning-areca-raid-controllers-for-xfs.html
This is done with:
blockdev --setra 16384 /dev/$RAIDDEVICE
This sets the readahead to 8 MB (16384 sectors of 512 bytes), which seems a reasonable starting point (more reasonable than the defaults).
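Since blockdev --setra takes its value in 512-byte sectors, the conversion from a target readahead size is worth spelling out. A minimal sketch follows; the device name /dev/sdX is a placeholder, and the command is only built here, not executed.

```python
SECTOR_BYTES = 512  # blockdev --setra counts 512-byte sectors


def readahead_sectors(readahead_bytes: int) -> int:
    """Convert a readahead size in bytes to the sector count for --setra."""
    if readahead_bytes <= 0 or readahead_bytes % SECTOR_BYTES:
        raise ValueError("readahead must be a positive multiple of 512 bytes")
    return readahead_bytes // SECTOR_BYTES


def setra_command(device: str, readahead_bytes: int) -> list[str]:
    """Build (but do not run) the blockdev command for the given device."""
    return ["blockdev", "--setra", str(readahead_sectors(readahead_bytes)), device]


# 8 MB readahead on a placeholder device.
print(" ".join(setra_command("/dev/sdX", 8 * 1024 * 1024)))
```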
dCache
See Optimising dCache Performance.
See also the description in the dCache book, in the section called Advanced Tuning. It currently only describes configuring multiple queues for movers on a pool (in case a pool has both GridFTP and dcap doors).
This page is a Key Document, and is the responsibility of Brian Davies. It was last reviewed on 2019-01-02 when it was considered to be 61% complete. It was last judged to be accurate on 2018-07-25.