RAL Tier1 Disk Server Tuning
Performance tuning on the disk servers
This is a work in progress and always will be, because network and disk tuning varies with the motherboard, NIC and disk storage architecture. Currently (15th Oct 2007), there are 3 types of tuning:- wanin, wanout and farmread. The wanin and wanout tuning is identical.
Filesystem Tuning
Change file system to writeback journaling - This changes the ext3 journaling mode from the default, ordered, in which all data is forced out to the main file system before its metadata is committed to the journal, to writeback, in which data ordering is not preserved. That is, data may be written into the main file system after its metadata has been committed to the journal. This is rumoured to be the highest-throughput option. It guarantees internal file system integrity, but it can allow old data to appear in files after a crash and journal recovery.
Dynamically, this can be achieved by running the following commands. However, a change made this way is easy to lose track of, because nothing is recorded in /etc/fstab and the mount command will not display the tune2fs change. It is preferable to add the change to /etc/fstab and reboot the system.
# tune2fs -o journal_data_writeback /dev/sdb1
# tune2fs -o journal_data_writeback /dev/sdb2
# tune2fs -o journal_data_writeback /dev/sdb3
Statically, change /etc/fstab:-
/dev/sdb1 /exportstage/castor1 ext3 defaults,noatime,nodiratime,data=writeback 0 2
/dev/sdb2 /exportstage/castor2 ext3 defaults,noatime,nodiratime,data=writeback 0 2
/dev/sdb3 /exportstage/castor3 ext3 defaults,noatime,nodiratime,data=writeback 0 2
noatime prevents the access time being updated on files, whilst nodiratime does the same for directories. data=writeback changes the journaling mode from ordered to writeback.
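Since the fstab entry is the only visible record of the writeback setting, a quick audit of that file can be useful. The following is a minimal sketch, assuming the standard six-field fstab layout; the function name `fstab_missing_writeback` is hypothetical and not part of any existing tool.

```shell
#!/bin/sh
# Hypothetical helper: list ext3 mount points in an fstab-style file
# whose mount options do not include data=writeback.
fstab_missing_writeback() {
    # $1: path to the fstab file (defaults to /etc/fstab)
    awk '$1 !~ /^#/ && $3 == "ext3" && $4 !~ /(^|,)data=writeback(,|$)/ { print $2 }' \
        "${1:-/etc/fstab}"
}
```

Running `fstab_missing_writeback` with no argument reads /etc/fstab and prints nothing when every ext3 entry already carries data=writeback. For a file system changed only with tune2fs, `tune2fs -l /dev/sdb1` should list the setting under its default mount options.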
Wanin and Wanout Network Tunables to maximize throughput over the WAN
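Run the commands below to apply the settings immediately. To make them permanent, add each setting to /etc/sysctl.conf as key = value, without the sysctl -w prefix or the quotes: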
Tunable | Description |
---|---|
sysctl -w net.ipv4.tcp_rmem="4096 65536 1048576" | Set min, default, max receive window (bytes) |
sysctl -w net.ipv4.tcp_wmem="4096 65536 1048576" | Set min, default, max transmit window (bytes) |
sysctl -w net.ipv4.tcp_mem="65536 87380 98304" | Set maximum total TCP buffer-space allocatable (in pages) |
sysctl -w net.core.rmem_max=1048576 | Set maximum size of TCP receive window (bytes) |
sysctl -w net.core.wmem_max=1048576 | Set maximum size of TCP transmit window (bytes) |
sysctl -w net.core.rmem_default=65536 | Set default size of TCP receive window (bytes) |
sysctl -w net.core.wmem_default=65536 | Set default size of TCP transmit window (bytes) |
sysctl -w net.core.somaxconn=512 | Socket accept backlog. The number of connections that can be in the 3-way handshake + the socket resource allocation process |
sysctl -w net.core.netdev_max_backlog=3000 | Maximum number of packets that can be queued by the kernel without the kernel dropping packets |
sysctl -w net.core.optmem_max=20480 | Socket options memory buffer. Any call to the sock_kmalloc() function results in memory being allocated from this place |
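The table above lists each setting as a `sysctl -w` command, whereas /etc/sysctl.conf expects plain `key = value` lines. As a rough sketch, the conversion could be scripted as follows; the function name `to_sysctl_conf` is hypothetical, and the pattern assumes one command per line with at most one pair of double quotes around the value.

```shell
#!/bin/sh
# Hypothetical helper: read "sysctl -w key=value" lines on stdin and print
# "key = value" lines in the form /etc/sysctl.conf expects.
# Lines that do not look like a "sysctl -w" command are dropped.
to_sysctl_conf() {
    sed -n -e 's/^[[:space:]]*sysctl -w[[:space:]]\{1,\}\([^=]*\)="\{0,1\}\([^"]*\)"\{0,1\}[[:space:]]*$/\1 = \2/p'
}

# Example: convert two of the commands from the table.
to_sysctl_conf <<'EOF'
sysctl -w net.ipv4.tcp_rmem="4096 65536 1048576"
sysctl -w net.core.somaxconn=512
EOF
```

The converted lines can be appended to /etc/sysctl.conf and loaded with `sysctl -p`, which avoids waiting for a reboot.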
Farmread Network Tunables
Add to /etc/sysctl.conf:
Tunable | Description |
---|---|
net.ipv4.tcp_mem="65536 87380 98304" | Set maximum total TCP buffer-space allocatable. |
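The kernel counts tcp_mem in memory pages rather than bytes, so the actual ceiling depends on the page size. The sketch below converts a page count to MiB; the function name `pages_to_mib` is hypothetical, and the 4096-byte page size in the example is an assumption that should be checked with `getconf PAGESIZE` on the server in question.

```shell
#!/bin/sh
# Hypothetical helper: convert a tcp_mem value (counted in pages) to MiB.
# $1: number of pages
# $2: page size in bytes (defaults to this host's page size)
pages_to_mib() {
    page_size=${2:-$(getconf PAGESIZE)}
    echo $(( $1 * page_size / 1048576 ))
}

# Upper tcp_mem threshold from the table, assuming 4 KiB pages: prints 384
pages_to_mib 98304 4096
```

Under the same assumption, the first value (65536 pages) works out to 256 MiB.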
VM Tunables for Wanin, Wanout and Farmread
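Run the commands below to apply the settings immediately; to make them permanent, add them to /etc/sysctl.conf in key = value form: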
Tunable | Description |
---|---|
sysctl -w vm.min_free_kbytes=102400 | Forces the Linux VM to keep a minimum number of kilobytes free |
sysctl -w vm.dirty_background_ratio=5 | Percentage of main memory holding dirty data at which pdflush starts writing it to disk in the background |
sysctl -w vm.dirty_expire_centisecs=500 | Age, in 100ths of a second, at which dirty data becomes eligible for writeout by the pdflush daemons |
sysctl -w vm.dirty_writeback_centisecs=250 | Interval, in 100ths of a second, at which pdflush periodically writes "old" data out to disk |
sysctl -w vm.dirty_ratio=10 | Percentage of main memory that may hold dirty data before a process generating writes must itself start writing data out to disk |
sysctl -w vm.lower_zone_protection=100 | Increase the free page threshold by 100, starting page reclamation earlier |
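After applying the tunables it can be worth confirming that the running kernel actually holds the intended values. The following is a minimal sketch that reads the value straight from /proc/sys; the function name `check_tunable` and its OK/DIFF/MISSING output format are hypothetical, and the optional third argument exists only so the check can be pointed at a different directory tree.

```shell
#!/bin/sh
# Hypothetical helper: compare a live kernel tunable with an expected value.
# $1: dotted sysctl name, e.g. vm.dirty_ratio
# $2: expected value, with single spaces between multiple numbers
# $3: root of the sysctl tree (defaults to /proc/sys)
# Returns 0 on a match, 1 on a difference, 2 if the tunable does not exist.
check_tunable() {
    path="${3:-/proc/sys}/$(printf '%s' "$1" | tr . /)"
    if [ ! -r "$path" ]; then
        echo "MISSING $1"
        return 2
    fi
    # Collapse tabs, newlines and repeated spaces so that multi-value
    # entries such as tcp_rmem compare cleanly.
    actual=$(tr -s '[:space:]' ' ' < "$path" | sed -e 's/^ //' -e 's/ $//')
    if [ "$actual" = "$2" ]; then
        echo "OK $1 = $actual"
    else
        echo "DIFF $1: expected '$2', found '$actual'"
        return 1
    fi
}

# Example use on a tuned server:
#   check_tunable vm.dirty_ratio 10
#   check_tunable net.ipv4.tcp_rmem "4096 65536 1048576"
```

A tunable reported as MISSING is not provided by the running kernel, which is worth checking before relying on a setting copied from another kernel version.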
NIC Tunables
This increases the receive ring buffer size. The permitted value depends on the hardware and device driver: it cannot always be set to 511, and it may become possible to set it higher than 511 in the future. This tuning is for the SuperMicro H8DA8 motherboards, which have a Broadcom BCM5704 NIC chip with the tg3 v3.43-rh device driver. The default setting is 200.
ethtool -G eth0 rx 511
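Because the maximum ring size differs between NICs and drivers, it may be safer to read the limit from `ethtool -g` than to hard-code 511. The sketch below parses that output; the function name `rx_ring` is hypothetical, and it assumes the usual layout with "Pre-set maximums:" and "Current hardware settings:" sections, which may differ between ethtool versions.

```shell
#!/bin/sh
# Hypothetical helper: read "ethtool -g <iface>" output on stdin and print
# the RX ring size from one of its two sections.
# $1: "max" for the pre-set maximum, "cur" for the current setting
rx_ring() {
    awk -v want="$1" '
        /^Pre-set maximums:/          { section = "max"; next }
        /^Current hardware settings:/ { section = "cur"; next }
        $1 == "RX:" && section == want { print $2; exit }'
}

# Example use on a disk server (needs root for the -G call):
#   max=$(ethtool -g eth0 | rx_ring max)
#   ethtool -G eth0 rx "$max"
```

Comparing `rx_ring cur` before and after the change gives a quick confirmation that the driver accepted the new value.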
Transmit Queue Tunable
Set the ifconfig txqueuelen - This increases the send queue length from the default of 1000 to 2000, the idea being to ensure that the buffer is kept full. This is the dynamic change.
ifconfig eth0 txqueuelen 2000
To make the NIC and Transmit Queue changes permanent across reboots, append them to /etc/rc.d/rc.local:
cat <<EOFRC >> /etc/rc.d/rc.local
ethtool -G eth0 rx 511
ifconfig eth0 txqueuelen 2000
EOFRC
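Appending to rc.local more than once leaves duplicate commands in the file. As a hedged alternative, the sketch below adds a line only when that exact line is not already present; the function name `append_once` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: append a line to a file only if that exact line is
# not already there, so re-running the setup does not duplicate it.
# $1: line to add; $2: target file (created if it does not exist)
append_once() {
    grep -qxF -- "$1" "$2" 2>/dev/null || printf '%s\n' "$1" >> "$2"
}

# Example use (as root):
#   append_once 'ethtool -G eth0 rx 511' /etc/rc.d/rc.local
#   append_once 'ifconfig eth0 txqueuelen 2000' /etc/rc.d/rc.local
```

The match is on the whole line, so a line with a different value, such as an older `rx 200` entry, is not detected and would need to be removed by hand.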