Improving Samba write performance on Linux
------------------------------------------
Abstract
--------
Samba performance is good in most circumstances, but modern Linux
distributions have improved file systems since Samba was first
developed. In particular, they have a feature that Samba does not take
advantage of by default. In recent work I found that making some
simple changes to Samba could significantly improve the Samba write
performance on Linux from a modern Windows client (Windows 7). This
document will explain how to take advantage of this to increase the
speed of writes on your file server, and how to measure the changes to
ensure this has the desired effect.
Linux and Windows File systems
------------------------------
Linux file systems, like the file systems on UNIX before it, are
designed around the notion of "sparse" files. On such file systems, if
you create a file, and then write one byte at position 500MB from the
start of the file, the underlying file system will only allocate one
single block to store the one byte that was written. Even though the
size of the file on disk will be reported as 500MB+1 bytes, the actual
space used on the disk will only be a single block. The block size of
a file system is fixed when the file system is first created. For
modern disk sizes (1TB or more), the larger the block size the
better. For the two file systems I'm discussing here, ext4 and XFS,
the standard block size is 4KB. The XFS on-disk format allows blocks
of up to 64KB, and both XFS and ext4 on Linux do support larger block
sizes if the page size of the running kernel is larger than 4KB, but
for most distributions the page size and maximum block size is 4KB.
Such a file is called a "sparse" file, and has the great advantage
that disk space can be over committed. The sparse ranges of the file
are simply replaced with zero bytes when read, and only committed onto
disk when an application actually does a write into that range.
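To make the sparse file behaviour concrete, here is a minimal Python
sketch that writes one byte far into a new file and compares the
reported size with the space actually allocated. The helper name
make_sparse_file and the file name are my own inventions for
illustration, and the result depends on the file system the temporary
directory lives on.

```python
import os
import tempfile


def make_sparse_file(path, offset):
    """Write a single byte at `offset` in a new file and report its sizes."""
    fd = os.open(path, os.O_CREAT | os.O_TRUNC | os.O_WRONLY, 0o600)
    try:
        os.pwrite(fd, b"x", offset)  # one byte, far past the start
        os.fsync(fd)                 # flush so the block count is up to date
        st = os.fstat(fd)
    finally:
        os.close(fd)
    # st_size is the logical length of the file. On Linux, st_blocks
    # counts the 512-byte units actually allocated on disk.
    return st.st_size, st.st_blocks * 512


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        size, on_disk = make_sparse_file(
            os.path.join(d, "sparse.bin"), 500 * 1024 * 1024)
        print("logical size:", size, "bytes; allocated:", on_disk, "bytes")
```

On a file system with sparse file support the allocated figure should
be a tiny fraction of the logical size.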
The Windows NTFS file system, although able to support "sparse" files,
is a more traditional file system in that writing one byte at position
500MB will force the file system to immediately allocate the
intermediate blocks. This can take some time, and so SMB and SMB2
network traffic uses the strategy described below to avoid request
timeouts.
SMB/SMB2 write activities
-------------------------
When a Windows client application sends a request to write one byte at
position 500MB on a newly opened (empty) file, the SMB/SMB2 client
redirector has to ensure that 500MB+1 bytes are really allocated on
the target system. However, sending a simple SMBwriteX/SMB2_WRITE
request with an offset of 500MB could easily cause a client
timeout. An assumption built into the SMB/SMB2 protocol is that the
target file system behaves like a Windows server, so the NTFS driver
on the server would have to allocate 500MB worth of file system blocks
in order to complete this request, which may take longer than the 30
seconds usually allowed for an SMB request.
What the Windows client redirector does in this case is to send a
sequence of 1 byte requests, to cover the extension needed on the open
file. In the reply to an NTCreateX/SMB2_CREATE call, the SMB server
returns a value called the "allocation size", which is equivalent to
the file system block size on a UNIX/Linux style file system. The
"allocation size" is more flexible than the underlying file system
block size, as (at least for Samba) it can be specified on a per-share
basis.
This allocation size is used by the client redirector to specify the
space between each 1 byte write call used to pre-allocate the empty
space when a file is being extended. For example, if the allocation
size is set to 1MB (the default in Samba) then when extending a file
by 500MB the client redirector will issue 500 intermediate 1-byte
SMBwriteX/SMB2_WRITE requests before issuing the real write request
(at 500MB+1 byte) to complete the application write request.
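The arithmetic in that example can be sketched in a few lines of
Python. This is a simplified model only: the function name and the
assumption that the probes start at the current end of file and step
by exactly one allocation size are mine, not a description of the
redirector's real implementation.

```python
MB = 1024 * 1024


def extension_probe_offsets(current_size, target_offset, allocation_size):
    """Offsets of the 1-byte extension writes in a simplified model.

    Assumes one probe per allocation-size step between the current
    end of file and the offset of the real write.
    """
    return list(range(current_size, target_offset, allocation_size))


if __name__ == "__main__":
    probes = extension_probe_offsets(0, 500 * MB, 1 * MB)
    print(len(probes), "one-byte writes before the real write")
```

With a 1MB allocation size and a 500MB extension this gives the 500
intermediate requests mentioned above.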
Each of these one byte writes is unlikely to time out, thus allowing
SMB/SMB2 to deal with writes that extend a file to an arbitrary size
(within the limits of the file system and the protocol) without having
to worry about network time outs.
Making writes efficient on Linux
--------------------------------
By default, when Samba receives these 1 byte "extension" write
requests, it simply does a normal one-byte "sparse" write at the
required position in the file. This is very fast, but only causes one
file system block (the block "dirtied" by the one byte write) to be
allocated. When the real data is finally written into the file, the
blocks then have to be allocated for real on the file system. Because
these blocks are not then allocated "in order" on the file system, as
it were, these actual writes can be quite slow.
The most efficient way to allocate file system blocks when data is to
be written into all of the file (for example, a streaming video write)
is to allocate what is called an "extent" on the file system. The
requested blocks are then laid out by the underlying file system (ext4
or XFS) in a very efficient way which causes the actual writes to be
much faster than having to allocate them dynamically.
How does a Linux application (Samba) get access to this new
extent-based allocation call ? Simple, it's built into glibc on modern
Linux distributions via the posix_fallocate() call. In the new patch
that has gone into Samba 3.5.7 (and also all future versions of
Samba), when the smb.conf parameter:
"strict allocate = yes"
is set on a share, whenever a file is extended by an
SMBwriteX/SMB2_WRITE call, Samba calls posix_fallocate() to ensure
the file extends to at least the size given as the offset in the
SMBwriteX/SMB2_WRITE request, then does the actual write.
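The "allocate first, then write" sequence can be illustrated with the
following Python sketch. It is not Samba's code; the function name
write_with_strict_allocate is hypothetical, and it simply uses
Python's os.posix_fallocate() wrapper around the same libc call.

```python
import os


def write_with_strict_allocate(fd, data, offset):
    """Extend the file with posix_fallocate() if needed, then write."""
    current_size = os.fstat(fd).st_size
    if offset > current_size:
        # Reserve the gap between the old end of file and the write
        # offset, so the blocks are allocated in one request.
        os.posix_fallocate(fd, current_size, offset - current_size)
    # Now do the real write at the requested position.
    return os.pwrite(fd, data, offset)
```

After the call the gap reads back as zero bytes, exactly as it would
for a sparse file, but the space behind it has been allocated rather
than left as a hole.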
In tests done on an ext4 file system, changing to "strict allocate =
yes" and using the posix_fallocate() call in this way increased the
write performance of Samba by 2/3 on a NETGEAR ReadyNAS box as
tested by the Intel NASPT test tool, available here:
http://software.intel.com/en-us/articles/intel-nas-performance-toolkit/
The specific tests used to measure the performance increase were the
"File Copy to NAS" and the "HD Video Record" tests.
How do I get the patch ?
------------------------
This patch has been added by default into 3.5.7 and all versions of
Samba subsequent to this. Back ports of this fix are available here:
For Samba 3.5.0 - 3.5.6:
http://samba.org/~jra/fallocate/samba-3-5-x-posix_fallocate.patch
For Samba 3.4.x:
http://samba.org/~jra/fallocate/samba-3-4-x-posix_fallocate.patch
For Samba 3.3.x:
http://samba.org/~jra/fallocate/samba-3-3-x-posix_fallocate.patch
For Samba 3.2.x:
http://samba.org/~jra/fallocate/samba-3-2-x-posix_fallocate.patch
Note that this must be run on a file system that supports extents,
with a kernel modern enough to support the posix_fallocate() call
working directly on the underlying file system (2.6.23 or later), and
a glibc that supports it (glibc 2.7 should be modern enough). Also
note that "strict allocate = yes" *must* be set on the exported share.
As Linux is our primary deployment platform and most Linux
distributions are now using ext4, I'm proposing to change the default
of the "strict allocate" smb.conf parameter from "no" to "yes" for the
3.6.0 release.
What if I'm using a file system that doesn't support extent allocation ?
------------------------------------------------------------------------
Never fear. You can safely leave "strict allocate = yes" set. Samba will
call posix_fallocate(), and glibc has fall back code within it which
uses a technique very similar to the one the Windows redirector uses
to emulate the extent-based allocation. glibc first calls the kernel
fallocate system call, and if that fails with ENOSYS (not supported)
glibc will call statvfs() to find out the block size on the
underlying file system, and then write one byte per f_bsize
bytes. This is as efficient as can be done without an extent based
allocation system call.
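The one-byte-per-block technique can be sketched in Python as below.
This is an illustration of the idea only, not glibc's actual code; the
function name emulated_fallocate is hypothetical, and I have assumed
it should only touch the range beyond the current end of file so that
existing data is never overwritten.

```python
import os


def emulated_fallocate(fd, offset, length):
    """Extend a file by writing one zero byte per file system block.

    Only the part of the range beyond the current end of file is
    touched. Returns the number of one-byte writes issued.
    """
    end = offset + length
    start = max(offset, os.fstat(fd).st_size)
    if start >= end:
        return 0  # the file already covers the requested range
    block_size = os.fstatvfs(fd).f_bsize
    writes = 0
    for pos in range(start, end, block_size):
        os.pwrite(fd, b"\0", pos)  # dirties (and so allocates) one block
        writes += 1
    # Make sure the file reaches exactly offset + length.
    if os.fstat(fd).st_size < end:
        os.pwrite(fd, b"\0", end - 1)
        writes += 1
    return writes
```

The cost is one small write per block, which is why this path is
slower than a single extent allocation but still far cheaper than
writing every byte.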
Even if the glibc on the system is old enough to not have this call,
Samba will detect this and has fall back code built in to manually
allocate space in writes of 32K bytes to extend a file to the required
size. This is the slowest fall back option, but it is equivalent
to the pre-3.5.7 code in Samba, so it's still pretty fast.
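For comparison, the slowest fall back amounts to filling the gap with
real zero bytes, 32K at a time. The Python sketch below shows the
general shape; it is not taken from the Samba source, and the name
fill_with_zeros is made up for this example.

```python
import os

CHUNK = 32 * 1024  # 32K bytes per write, as described in the text


def fill_with_zeros(fd, new_size):
    """Extend a file to new_size by appending zero-filled 32K writes."""
    size = os.fstat(fd).st_size
    zeros = bytes(CHUNK)
    while size < new_size:
        n = min(CHUNK, new_size - size)
        # pwrite() may write fewer bytes than asked, so advance by
        # the number it actually reports.
        size += os.pwrite(fd, zeros[:n], size)
    return size
```

Unlike the one-byte-per-block approach, every byte of the gap is
written here, so the work grows with the size of the extension.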
How do I know this is working ?
-------------------------------
Samba will run faster :-). A *lot* faster. Note this will be seen
mostly on loads heavily dependent on write performance, such as file
copies or streaming writes.
For users, set debug level 10 on smbd and look for messages of the
form:
vfs_fill_sparse: sys_posix_fallocate failed with error XXXX. Falling
back to slow manual allocation
(where XXXX will most commonly be 38, which corresponds to ENOSYS on
Linux). This will not be printed in the case where glibc emulation via
statvfs() is being used to do the allocation. As this is being done
inside glibc there is no way for Samba to know if the fast (system
call) fallocate is being used, or the slower statvfs() code is being
used - as far as Samba can tell, the posix_fallocate() call succeeds in
both cases. If you suspect statvfs() emulation is being used, you'll
need to investigate via the "strace" method for developers described
below.
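If you have a large debug log to search, a small script can pull out
the error numbers for you. The Python sketch below looks for the
message text quoted above and maps each number to its symbolic name on
the machine running the script; the function name is hypothetical, and
I am assuming the message appears in the log with exactly this
wording.

```python
import errno
import re

# The fixed part of the message quoted in the text above.
FALLBACK_RE = re.compile(r"sys_posix_fallocate failed with error (\d+)")


def find_fallocate_failures(log_text):
    """Return (errno, symbolic name) pairs for each failure message."""
    results = []
    for match in FALLBACK_RE.finditer(log_text):
        code = int(match.group(1))
        # errno.errorcode maps numbers to names for the local platform.
        results.append((code, errno.errorcode.get(code, "unknown")))
    return results
```

An empty result only tells you that Samba's own fall back was not
triggered. It does not distinguish the kernel call from the glibc
emulation.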
For developers the way to confirm this is to examine the strace output
from a running smbd process. Use:
strace -p <smbd pid> -o /tmp/log
If the "fast" fallocate method is being used, you will see fallocate()
system calls being listed in the log. If glibc statvfs emulation is
being used, you will see a statfs() or fstatfs() system call (the
kernel interface that statvfs() is built on) followed by a series of
one byte pwrite() system calls.
Finally, if the Samba fall back code is being used you will see a
sequence of pwrite() system calls each writing 32k with no such
statfs() call before them.
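The three patterns can be counted mechanically. The Python sketch
below is a rough heuristic, not a reliable detector: it assumes
strace's usual "name(args) = result" line layout, assumes pwrite() may
be logged as pwrite64(), and treats statfs()/fstatfs() as the system
calls behind statvfs(). All function and key names are my own.
Ordinary client writes of 1 byte or 32K will be counted too, so run it
on a trace taken while a file is being extended.

```python
import re
from collections import Counter

# Matches "name(args) = ret", with an optional "[pid N]" prefix.
CALL_RE = re.compile(r"^(?:\[pid\s+\d+\]\s+)?(\w+)\(.*\)\s*=\s*(-?\d+)")

STATFS_CALLS = {"statfs", "fstatfs", "statfs64", "fstatfs64"}
PWRITE_CALLS = {"pwrite", "pwrite64"}


def summarize_strace(lines):
    """Count the system calls that matter for the three allocation paths."""
    counts = Counter()
    for line in lines:
        match = CALL_RE.match(line.strip())
        if not match:
            continue
        name, ret = match.group(1), int(match.group(2))
        if name == "fallocate" and ret == 0:
            counts["fallocate_ok"] += 1
        elif name in STATFS_CALLS:
            counts["statfs"] += 1
        elif name in PWRITE_CALLS and ret == 1:
            counts["pwrite_1_byte"] += 1
        elif name in PWRITE_CALLS and ret == 32 * 1024:
            counts["pwrite_32k"] += 1
    return counts


def likely_method(counts):
    """Guess which allocation path produced the trace."""
    if counts["fallocate_ok"]:
        return "kernel fallocate"
    if counts["statfs"] and counts["pwrite_1_byte"]:
        return "glibc emulation"
    if counts["pwrite_32k"] and not counts["statfs"]:
        return "samba fallback"
    return "unknown"
```

A typical use would be likely_method(summarize_strace(open("/tmp/log")))
on the log file produced by strace.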
But I want more..
-----------------
As the Windows client issues one byte writes to extend a file
every "allocation size" bytes, we can cheat by changing the allocation
size we return on a per-share basis. For example, if you're mostly
writing large video files onto a share, you can change the allocation
size reported to the Windows client by setting the "allocation roundup
size" smb.conf parameter to something like 100MB, for example:
"allocation roundup size = 104857600"
instead of the default 1MB size. This can gain a few percent extra
performance but may cause applications that use the allocation size to
behave oddly, or even fail, as Windows never uses a size this
large. As always, be careful and test your workload.
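The parameter takes a value in bytes, so it is easy to get the number
of zeros wrong. The short Python sketch below builds the smb.conf line
from a size in MB and shows how the number of extension writes shrinks
as the allocation size grows. Both helper names are hypothetical, and
the count uses the simplified assumption of one write per allocation
size step.

```python
MB = 1024 * 1024


def allocation_roundup_line(size_mb):
    """Build the smb.conf line for an allocation size given in MB."""
    return "allocation roundup size = %d" % (size_mb * MB)


def probes_needed(extend_by_bytes, allocation_size_bytes):
    """One-byte extension writes needed, one per allocation size step."""
    return extend_by_bytes // allocation_size_bytes


if __name__ == "__main__":
    print(allocation_roundup_line(100))
    for size_mb in (1, 100):
        print(size_mb, "MB allocation size:",
              probes_needed(500 * MB, size_mb * MB),
              "writes to extend by 500MB")
```

Going from 1MB to 100MB cuts the extension writes for a 500MB file
from 500 to 5, which is where the extra few percent comes from.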
However the underlying change to "strict allocate" is safe, and with
the patch (or new Samba version) will safely improve the write speed
of your servers.
Jeremy Allison,
Samba Team.
jra at samba.org
6th Dec. 2010.
Thanks to Justin Maggard at NETGEAR for helping with this work, and
for proof-reading this paper.