RAL Tier1 Disk Server Tuning

From GridPP Wiki

Performance tuning on the disk servers

This is, and always will be, work in progress, as network and disk tuning varies with motherboard, NIC and disk storage architecture. Currently (15th Oct 2007) there are three types of tuning: wanin, wanout and farmread. The wanin and wanout tuning is identical.

Filesystem Tuning

Change file system to writeback journaling. This changes the ext3 journaling mode from the default, ordered, in which all data is forced out to the main file system before its metadata is committed to the journal, to writeback, in which data ordering is not preserved; that is, data may be written into the main file system after its metadata has been committed to the journal. This is rumoured to be the highest-throughput option. It guarantees internal file system integrity, but it can allow old data to appear in files after a crash and journal recovery.

Dynamically, this can be achieved by running the following commands, which record writeback as a default mount option in the file system superblock. However, this change is easy to lose track of: nothing is written to /etc/fstab, and the mount command will not display the tune2fs change. It is preferable to add the change to /etc/fstab and reboot the system.

# tune2fs -o journal_data_writeback /dev/sdb1
# tune2fs -o journal_data_writeback /dev/sdb2
# tune2fs -o journal_data_writeback /dev/sdb3
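Since mount does not show the tune2fs change, one way to confirm it is to look at the "Default mount options" line printed by tune2fs -l. The helper below is a minimal sketch; the function name has_writeback_default is hypothetical, and it only inspects text passed on standard input.

```shell
# Reads `tune2fs -l` output on stdin and succeeds (exit status 0) if the
# superblock's default mount options include journal_data_writeback.
# The function name is illustrative, not part of any tool.
has_writeback_default() {
    grep '^Default mount options:' | grep -q 'journal_data_writeback'
}

# Example (needs root and a real device):
#   tune2fs -l /dev/sdb1 | has_writeback_default && echo "writeback is the default"
```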

Statically, change /etc/fstab:-

/dev/sdb1  /exportstage/castor1  ext3  defaults,noatime,nodiratime,data=writeback  0 2
/dev/sdb2  /exportstage/castor2  ext3  defaults,noatime,nodiratime,data=writeback  0 2
/dev/sdb3  /exportstage/castor3  ext3  defaults,noatime,nodiratime,data=writeback  0 2

noatime prevents the access time being updated on files, whilst nodiratime does the same for directories. data=writeback changes the journaling from ordered to writeback.
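As a quick audit of the static configuration, the sketch below lists any ext3 entry in an fstab-format file whose options do not include data=writeback. The function name ext3_missing_writeback is hypothetical; it only reads the file and changes nothing.

```shell
# Prints the mount point of every ext3 entry in the given fstab-format file
# whose option field lacks data=writeback. Prints nothing if all entries
# have it. Comment lines and blank lines are ignored.
ext3_missing_writeback() {
    awk '$1 !~ /^#/ && $3 == "ext3" && $4 !~ /(^|,)data=writeback(,|$)/ { print $2 }' "$1"
}

# Example:
#   ext3_missing_writeback /etc/fstab
```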

Wanin and Wanout Network Tunables to maximize throughput over the WAN

Add to /etc/sysctl.conf, dropping the sysctl -w prefix and the quotes (the sysctl -w form shown applies a setting immediately on the running system):

Tunable                                           Description
sysctl -w net.ipv4.tcp_rmem="4096 65536 1048576"  Set min, default, max receive window (buffer size in bytes)
sysctl -w net.ipv4.tcp_wmem="4096 65536 1048576"  Set min, default, max transmit window (buffer size in bytes)
sysctl -w net.ipv4.tcp_mem="65536 87380 98304"    Set low, pressure and maximum thresholds for total TCP buffer space, in pages
sysctl -w net.core.rmem_max=1048576               Set maximum size of socket receive buffer
sysctl -w net.core.wmem_max=1048576               Set maximum size of socket transmit buffer
sysctl -w net.core.rmem_default=65536             Set default size of socket receive buffer
sysctl -w net.core.wmem_default=65536             Set default size of socket transmit buffer
sysctl -w net.core.somaxconn=512                  Socket accept backlog: the number of connections that can be in the 3-way handshake plus the socket resource allocation process
sysctl -w net.core.netdev_max_backlog=3000        Maximum number of received packets the kernel can queue before it starts dropping them
sysctl -w net.core.optmem_max=20480               Socket options memory buffer: any call to the sock_kmalloc() function allocates memory from here
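The tunables above are written as sysctl -w commands, whereas /etc/sysctl.conf expects plain key = value lines. The filter below is a rough sketch of that conversion; the function name to_sysctl_conf and the input file tunables.txt are hypothetical.

```shell
# Reads lines such as
#   sysctl -w net.ipv4.tcp_rmem="4096 65536 1048576"
# on stdin and prints them in /etc/sysctl.conf form:
#   net.ipv4.tcp_rmem = 4096 65536 1048576
to_sysctl_conf() {
    sed -E 's/^[[:space:]]*sysctl[[:space:]]+-w[[:space:]]+//; s/^([^=[:space:]]+)[[:space:]]*=[[:space:]]*"?([^"]*)"?[[:space:]]*$/\1 = \2/'
}

# Example (as root; tunables.txt is an assumed file holding the commands):
#   to_sysctl_conf < tunables.txt >> /etc/sysctl.conf
#   sysctl -p    # reload /etc/sysctl.conf
```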


Farmread Network Tunables

Add to /etc/sysctl.conf:

Tunable                               Description
net.ipv4.tcp_mem = 65536 87380 98304  Set low, pressure and maximum thresholds for total TCP buffer space, in pages
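Unlike tcp_rmem and tcp_wmem, the three net.ipv4.tcp_mem values are counted in memory pages rather than bytes. As a worked example, the sketch below converts a page count to MiB; the function name pages_to_mib is hypothetical, and the 4096-byte page size used in the comments is an assumption that should be checked with getconf PAGESIZE on the server in question.

```shell
# pages_to_mib PAGES PAGE_SIZE_BYTES
# Prints the memory covered by PAGES pages, in whole MiB (rounded down).
pages_to_mib() {
    echo $(( $1 * $2 / 1048576 ))
}

# With 4096-byte pages, the maximum of 98304 pages is 384 MiB:
#   pages_to_mib 98304 4096
# Using the page size of the running system instead:
#   pages_to_mib 98304 "$(getconf PAGESIZE)"
```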


VM Tunables for Wanin, Wanout and Farmread

Add to /etc/sysctl.conf, dropping the sysctl -w prefix (the sysctl -w form shown applies a setting immediately on the running system):

Tunable                                     Description
sysctl -w vm.min_free_kbytes=102400         Force the Linux VM to keep a minimum number of kilobytes free
sysctl -w vm.dirty_background_ratio=5       Percentage of main memory filled with dirty data at which pdflush starts writing it to disk in the background
sysctl -w vm.dirty_expire_centisecs=500     Age, in 100ths of a second, at which dirty data becomes eligible for writeout by the pdflush daemons
sysctl -w vm.dirty_writeback_centisecs=250  Interval, in 100ths of a second, at which pdflush wakes to write "old" data out to disk
sysctl -w vm.dirty_ratio=10                 Percentage of main memory filled with dirty data at which a writing process must itself start writing data to disk
sysctl -w vm.lower_zone_protection=100      Increase the free page threshold by 100, starting page reclamation earlier
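After a reboot it is worth confirming that the running kernel actually holds the intended values. The sketch below compares one tunable against an expected value by reading it from under /proc/sys; the function name check_sysctl and its optional third argument (an alternative root directory, useful for testing) are assumptions made for this example.

```shell
# check_sysctl KEY EXPECTED [ROOT]
# Reads KEY (e.g. vm.dirty_ratio) from ROOT (default /proc/sys), normalises
# whitespace, and compares it with EXPECTED. Returns non-zero on a mismatch.
check_sysctl() {
    key=$1
    expected=$2
    root=${3:-/proc/sys}
    # Dotted keys map to paths: vm.dirty_ratio -> vm/dirty_ratio
    path="$root/$(printf '%s' "$key" | tr . /)"
    actual=$(tr -s '[:space:]' ' ' < "$path" | sed 's/^ //; s/ $//')
    if [ "$actual" = "$expected" ]; then
        echo "OK   $key = $actual"
    else
        echo "DIFF $key = $actual (expected $expected)"
        return 1
    fi
}

# Example:
#   check_sysctl vm.dirty_ratio 10
#   check_sysctl net.ipv4.tcp_mem "65536 87380 98304"
```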


NIC Tunables

This will increase the receive ring buffer size. The achievable value depends on the hardware and device driver: it cannot always be set to 511, and values greater than 511 may become possible in the future. This tuning is for the SuperMicro H8DA8 motherboards, which have a Broadcom BCM5704 NIC chip with the tg3 v3.43-rh device driver. The default setting is 200.

ethtool -G eth0 rx 511
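Because the maximum ring size differs between NICs and drivers, it can be read from ethtool -g rather than hard-coded. The helper below is a sketch that assumes the usual ethtool -g layout, with a "Pre-set maximums:" section followed by an "RX:" line; the function name max_rx_ring is hypothetical.

```shell
# Reads `ethtool -g <interface>` output on stdin and prints the pre-set
# maximum RX ring size, i.e. the "RX:" value in the "Pre-set maximums:"
# section (not the one under "Current hardware settings:").
max_rx_ring() {
    awk '/^Pre-set maximums:/ { in_max = 1; next }
         /^Current hardware settings:/ { in_max = 0 }
         in_max && $1 == "RX:" { print $2; exit }'
}

# Example (as root):
#   ethtool -G eth0 rx "$(ethtool -g eth0 | max_rx_ring)"
```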

Transmit Queue Tunable

Set the interface txqueuelen with ifconfig. This will increase the send queue length from the default of 1000 to 2000 packets, the idea being to ensure that the buffer is kept full. This is the dynamic change.

ifconfig eth0 txqueuelen 2000

To make the NIC and Transmit Queue changes permanent across reboots, append them to /etc/rc.d/rc.local by running the following:

cat <<EOFRC >> /etc/rc.d/rc.local
ethtool -G eth0 rx 511
ifconfig eth0 txqueuelen 2000
EOFRC
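The here-document above appends to rc.local every time it is run, so running it twice would duplicate the two commands. A more defensive variant, sketched below, adds a line only if it is not already present; the function name append_once is hypothetical.

```shell
# append_once FILE LINE
# Appends LINE to FILE unless an identical line is already there.
append_once() {
    file=$1
    line=$2
    grep -qxF -- "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}

# Example (as root):
#   append_once /etc/rc.d/rc.local 'ethtool -G eth0 rx 511'
#   append_once /etc/rc.d/rc.local 'ifconfig eth0 txqueuelen 2000'
```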