John Feuerstein
2010-Feb-27 17:03 UTC
[Gluster-users] GlusterFS 3.0.2 small file read performance benchmark
Greetings,

in contrast to some performance tips regarding small file *read*
performance, I want to share these results. The test is rather simple
but yields some very remarkable results: roughly 4x ("400%") improved
read performance by simply dropping some of the so-called "performance
translators"!

Please note that this test resembles a simplified version of our
workload, which is more or less sequential, read-only small file
serving with an average of 100 concurrent clients. (We use GlusterFS
as a flat-file backend to a cluster of webservers, which is hit only
after missing some caches in a more sophisticated caching
infrastructure on top of it.)

The test setup is a 3-node AFR cluster, with server+client on each
one, in the single-process model (one volfile; the local volume is
attached to within the same process to save overhead), connected via
1 Gbit Ethernet. This way each node can continue to operate on its
own, even if the whole internal network for GlusterFS is down.

We used commodity hardware for the test. Each node is identical:
- Intel Core i7
- 12G RAM
- 500GB filesystem
- 1 Gbit NIC dedicated for GlusterFS

Software:
- Linux 2.6.32.8
- GlusterFS 3.0.2
- FUSE inited with protocol versions: glusterfs 7.13 kernel 7.13
- Filesystem / Storage Backend:
  - LVM2 on top of software RAID 1
  - ext4 with noatime

I will paste the configurations inline, so people can comment on them.

/etc/fstab:
-------------------------------------------------------------------------
/dev/data/test /mnt/brick/test ext4 noatime 0 2

/etc/glusterfs/test.vol /mnt/glusterfs/test glusterfs noauto,noatime,log-level=NORMAL,log-file=/var/log/glusterfs/test.log 0 0
-------------------------------------------------------------------------

***
Please note: this is the final configuration with the best results.
All translators are numbered to make the explanation easier later on.
Unused translators are commented out. The volume spec is identical on
all nodes, except that the bind-address option in the server volume
[*4*] is adjusted.
***

/etc/glusterfs/test.vol
-------------------------------------------------------------------------
# Sat Feb 27 16:53:00 CET 2010 John Feuerstein <john at feurix.com>
#
# Single Process Model with AFR (Automatic File Replication).
##
## Storage backend
##

#
# POSIX STORAGE [*1*]
#
volume posix
  type storage/posix
  option directory /mnt/brick/test/glusterfs
end-volume

#
# POSIX LOCKS [*2*]
#
#volume locks
volume brick
  type features/locks
  subvolumes posix
end-volume


##
## Performance translators (server side)
##

#
# IO-Threads [*3*]
#
#volume brick
#  type performance/io-threads
#  subvolumes locks
#  option thread-count 8
#end-volume

### End of performance translators


#
# TCP/IP server [*4*]
#
volume server
  type protocol/server
  subvolumes brick
  option transport-type tcp
  option transport.socket.bind-address 10.1.0.1  # FIXME
  option transport.socket.listen-port 820
  option transport.socket.nodelay on
  option auth.addr.brick.allow 127.0.0.1,10.1.0.1,10.1.0.2,10.1.0.3
end-volume


#
# TCP/IP clients [*5*]
#
volume node1
  type protocol/client
  option remote-subvolume brick
  option transport-type tcp/client
  option remote-host 10.1.0.1
  option remote-port 820
  option transport.socket.nodelay on
end-volume

volume node2
  type protocol/client
  option remote-subvolume brick
  option transport-type tcp/client
  option remote-host 10.1.0.2
  option remote-port 820
  option transport.socket.nodelay on
end-volume

volume node3
  type protocol/client
  option remote-subvolume brick
  option transport-type tcp/client
  option remote-host 10.1.0.3
  option remote-port 820
  option transport.socket.nodelay on
end-volume


#
# Automatic File Replication Translator (AFR) [*6*]
#
# NOTE: "node3" is the primary metadata node, so this one *must*
# be listed first in all volume specs! Also, node3 is the global
# favorite-child with the definite file version if any conflict
# arises while self-healing...
#
volume afr
  type cluster/replicate
  subvolumes node3 node1 node2
  option read-subvolume node2
  option favorite-child node3
end-volume


##
## Performance translators (client side)
##

#
# IO-Threads [*7*]
#
#volume client-threads-1
#  type performance/io-threads
#  subvolumes afr
#  option thread-count 8
#end-volume

#
# Write-Behind [*8*]
#
volume wb
  type performance/write-behind
  subvolumes afr
  option cache-size 4MB
end-volume

#
# Read-Ahead [*9*]
#
#volume ra
#  type performance/read-ahead
#  subvolumes wb
#  option page-count 2
#end-volume

#
# IO-Cache [*10*]
#
volume cache
  type performance/io-cache
  subvolumes wb
  option cache-size 1024MB
  option cache-timeout 60
end-volume

#
# Quick-Read for small files [*11*]
#
#volume qr
#  type performance/quick-read
#  subvolumes cache
#  option cache-timeout 60
#end-volume

#
# Metadata prefetch [*12*]
#
#volume sp
#  type performance/stat-prefetch
#  subvolumes qr
#end-volume

#
# IO-Threads [*13*]
#
#volume client-threads-2
#  type performance/io-threads
#  subvolumes sp
#  option thread-count 16
#end-volume

### End of performance translators.
-------------------------------------------------------------------------


So let's start now. If not explicitly stated, perform on all nodes:

# Prepare filesystem mountpoints
$ mkdir -p /mnt/brick/test

# Mount bricks
$ mount /mnt/brick/test

# Prepare brick roots (so lost+found won't end up in the volume)
$ mkdir -p /mnt/brick/test/glusterfs

# Load FUSE
$ modprobe fuse

# Prepare GlusterFS mountpoints
$ mkdir -p /mnt/glusterfs/test

# Mount GlusterFS
# (we start with node3, which should become the metadata master)
node3 $ mount /mnt/glusterfs/test
node1 $ mount /mnt/glusterfs/test
node2 $ mount /mnt/glusterfs/test

# While doing the tests, we watch the logs on all nodes for errors:
$ tail -f /var/log/glusterfs/test.log

For each volume spec change, you have to unmount GlusterFS, change the
vol file, and mount GlusterFS again.
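For completeness, that reload cycle looks like this (an editorial
sketch; the paths follow from the fstab entry above):

$ umount /mnt/glusterfs/test
$ $EDITOR /etc/glusterfs/test.vol    # comment translators in or out
$ mount /mnt/glusterfs/test          # picks up the noauto fstab entry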
Before starting tests, make sure everything is running and the volumes
on all nodes are attached (watch the log files!).

Write the test data for the read-only tests. These are lots of 20K
files, which resemble most of our css/js/php/python files. You should
adjust this to match your workload...
-------------------------------------------------------------------------
#!/bin/bash
mkdir -p /mnt/glusterfs/test/data
cd /mnt/glusterfs/test/data
for topdir in x{1..100}
do
    mkdir -p $topdir
    cd $topdir
    for subdir in y{1..10}
    do
        mkdir $subdir
        cd $subdir
        for file in z{1..10}
        do
            dd if=/dev/zero of=20K-$RANDOM \
                bs=4K count=5 &> /dev/null && echo -n .
        done
        cd ..
    done
    cd ..
done
-------------------------------------------------------------------------

OK, in our case /mnt/glusterfs/test/data is now populated with around
~240M of data... enough for some simple tests.

Each test run consists of this simplified simulation of sequentially
reading all files, listing dirs and probably doing a stat():

-------------------------------------------------------------------------
$ cd /mnt/glusterfs/test/data

# Always populate the io-cache first:
$ time tar cf - . > /dev/null

# Simulate and time 100 concurrent data consumers:
$ for ((i=0;i<100;i++)); do tar cf - . > /dev/null & done; time wait
-------------------------------------------------------------------------

OK, so here are the results. As stated, take them with a grain of
salt, and make sure the test resembles your workload. For example,
read-ahead is, as we can see, useless in this case, but it might
improve performance for files of a different size... :)

# All translators active except *7* (client io-threads after AFR)
real    2m27.555s
user    0m3.536s
sys     0m6.888s

# All translators active except *13* (client io-threads at the end)
real    2m23.779s
user    0m2.824s
sys     0m5.604s

# All translators active except *7* and *13* (no client io-threads!)
real    0m53.097s
user    0m3.512s
sys     0m6.436s

# All translators active except *7*, *13* and only 8 io-threads in *3*
# instead of the default of 16 (server side io-threads)
real    0m45.942s
user    0m3.472s
sys     0m6.612s

# All translators active except *3*, *7*, *13* (no io-threads at all!)
real    0m40.332s
user    0m3.776s
sys     0m6.424s

# All translators active except *3*, *7*, *12*, *13* (no stat-prefetch)
real    0m39.205s
user    0m3.672s
sys     0m6.084s

# All translators active except *3*, *7*, *11*, *12*, *13*
# (no quick-read)
real    0m39.116s
user    0m3.652s
sys     0m5.816s

# All translators active except *3*, *7*, *11*, *12*, *13* and
# with page-count = 2 in *9* instead of 4
real    0m38.851s
user    0m3.492s
sys     0m5.796s

# All translators active except *3*, *7*, *9*, *11*, *12*, *13*
# (no read-ahead)
real    0m38.576s
user    0m3.356s
sys     0m6.076s

OK, that's it. Compare the results with all performance translators
against the final basic setup without any of the magic:

with all performance translators:     real 2m27.555s
without most performance translators: real 0m38.576s

This is a _HUGE_ improvement!

(disregard user and sys, they were practically the same in all tests)
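For reference, that works out to a factor of roughly 3.8x, which is
where the "400%" figure at the top of this post comes from. A quick
check with bc (an editorial addition, not part of the original runs):

$ echo 'scale=2; 147.555 / 38.576' | bc   # 2m27.555s vs 0m38.576s
3.82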
Some final words:

- don't add performance translators blindly (!)
- always test with a workload similar to the one you will use in
  production
- never copy and paste a volume spec, then moan about bad performance
- don't rely on "glusterfs-volgen", it gives you just a starting point!
- fewer translators == less overhead
- read the documentation for all options of all translators and get an
  idea:
  http://www.gluster.com/community/documentation/index.php/Translators
  (some stuff is still undocumented, but this is open source... so have
  a look)

Best regards,
John Feuerstein
miloska at gmail.com
2010-Feb-27 18:00 UTC
[Gluster-users] GlusterFS 3.0.2 small file read performance benchmark
> Some final words:
>
> - don't add performance translators blindly (!)
> - always test with a workload similar to the one you will use in
>   production
> - never copy and paste a volume spec, then moan about bad performance
> - don't rely on "glusterfs-volgen", it gives you just a starting point!
> - fewer translators == less overhead
> - read the documentation for all options of all translators and get an
>   idea:
>   http://www.gluster.com/community/documentation/index.php/Translators
>   (some stuff is still undocumented, but this is open source... so
>   have a look)

Thanks for sharing your results.

During my tests I found something similar, although I didn't go into
as much detail with GlusterFS as you did; I was more interested in
tuning my RAID controller and ext3 filesystem.

Also note - it has been reported on the list before - that with the
default configuration created by glusterfs-volgen, 'ls -lR' does NOT
re-sync the content between AFR nodes (at least not with CentOS 5.4
x86_64). 'ls -lR' works without the translators.

It seems to me it's safer and faster to use a stripped-down config.

"Your mileage may vary"
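(Editorial note: the re-sync walk referred to here is simply stat()ing
every file through the GlusterFS mountpoint to trigger AFR's on-demand
self-heal, e.g.:

$ ls -lR /mnt/glusterfs/test > /dev/null

whether it actually heals depends on the translator stack, as reported
above.)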
John Feuerstein
2010-Feb-27 18:09 UTC
[Gluster-users] GlusterFS 3.0.2 small file read performance benchmark
After reading the mail again, I'm under the impression that I didn't
make it clear enough: we don't have a pure read-only workload, but a
mostly read-only one. This is the reason why we've tried GlusterFS
with AFR, so we can have a multi-master read/write filesystem with a
persistent copy on each node. If we didn't need write access every now
and then, we could have gone with plain copies of the data.

Now another idea is the following, based on the fact that the local
ext4 filesystem + VFS cache is *much* faster:

> GlusterFS with populated IO-Cache:
> real    0m38.576s
> user    0m3.356s
> sys     0m6.076s

# Work directly on the back-end (this is read-only...)
$ cd /mnt/brick/test/glusterfs/data

# Ext4 without VFS Cache:
$ echo 3 > /proc/sys/vm/drop_caches
$ for ((i=0;i<100;i++)); do tar cf - . > /dev/null & done; time wait

real    0m1.598s
user    0m2.136s
sys     0m3.696s

# Ext4 with VFS Cache:
$ for ((i=0;i<100;i++)); do tar cf - . > /dev/null & done; time wait

real    0m1.312s
user    0m2.264s
sys     0m3.256s

So the idea now is to bind-mount the backend filesystem *read-only*
and use it for all read operations. For all write operations, use the
GlusterFS mountpoint, which provides locking etc. (This implies some
sort of read/write splitting, but we can do that...)

The downside is that the backend read operations won't make use of the
GlusterFS on-demand self-healing. But since 99% of our read-only files
are "write once, read a lot of times..." - this could work out. After
a node failure, a simple "ls -lR" should self-heal everything, and
then the backend is fine too. The chance of reading a broken file
should be very low?

Any comments on this idea? Is there something else that could go wrong
when using the backend in a pure read-only fashion that I've missed?

Any ideas why the GlusterFS performance/io-cache translator with a
cache-timeout of 60 is still so slow? Is there any way to *really*
cache meta- and filedata on GlusterFS _without_ hitting the network,
and thus without the very poor small file performance introduced by
network latency?

Are there any plans to implement support for FS-Cache [1] (CacheFS,
Cachefiles), shipped with recent Linux kernels? Or to improve io-cache
likewise?

[1] http://people.redhat.com/steved/fscache/docs/FS-Cache.pdf

Lots of questions... :)

Best regards,
John
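(Editorial sketch of the read/write split described above: the brick
path is from this thread, while the /srv/www-data target and the
two-step read-only bind mount are assumptions - older mount/kernel
versions ignore 'ro' on the initial bind:

# Read path: expose the local brick read-only
$ mount --bind /mnt/brick/test/glusterfs/data /srv/www-data
$ mount -o remount,ro,bind /srv/www-data

# Reads:  /srv/www-data/...          (local ext4, no network round-trip)
# Writes: /mnt/glusterfs/test/...    (GlusterFS mountpoint, AFR + locking)
)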
John Feuerstein
2010-Feb-27 18:56 UTC
[Gluster-users] GlusterFS 3.0.2 small file read performance benchmark
Another thing that makes me wonder is the read-subvolume setting:

> volume afr
>   type cluster/replicate
>   ...
>   option read-subvolume node2
>   ...
> end-volume

Even if we play around and set this to the local node or some remote
node respectively, it doesn't gain any performance for small files.
It looks like the whole bottleneck for small files is meta-data and
the global namespace lookup.

It would be really great if all of this could be cached within
io-cache, only falling back to a namespace query (and probably
locking) if something wants to write to the file, or if the entry has
been in the cache for longer than cache-timeout seconds. So even if
the file has been renamed, unlinked, or has changed permissions /
metadata - simply take the version from the io-cache until it's
invalidated. At least that is what I would expect the io-cache to do.
This will introduce a discrepancy between the cached file version and
the real version in the global namespace, but isn't that what one
would expect from caching...?

Note that the cache-size was 1024MB in all tests on all nodes, and the
whole set of test data was ~240MB. Add some meta-data and it's
probably at 250MB. In addition, cache-timeout was 60 seconds, while
the whole test took around 40 seconds. So *all* of the read-only test
could have been served completely by the io-cache... or am I mistaken
here?

I'm trying to understand the poor performance, because network latency
should be eliminated by the cache. Could some Gluster-Dev please
elaborate a bit on that one?

Best Regards,
John
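(Editorial note: a quick way to verify that working-set arithmetic on
a live setup - the path is from the earlier posts:

$ du -sh /mnt/glusterfs/test/data   # ~240M, well below the 1024MB cache-size
)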
Ed W
2010-Mar-01 20:44 UTC
[Gluster-users] GlusterFS 3.0.2 small file read performance benchmark
On 27/02/2010 18:56, John Feuerstein wrote:

> It would be really great if all of this could be cached within
> io-cache, only falling back to a namespace query (and probably
> locking) if something wants to write to the file, or if the entry has
> been in the cache for longer than cache-timeout seconds. So even if
> the file has been renamed, unlinked, or has changed permissions /
> metadata - simply take the version from the io-cache until it's
> invalidated. At least that is what I would expect the io-cache to do.
> This will introduce a discrepancy between the cached file version and
> the real version in the global namespace, but isn't that what one
> would expect from caching...?

I believe samba (and probably others) uses a two-way lock escalation
facility to mitigate a similar problem. So you can "read-lock", or,
phrased differently, "express your interest in caching some
files/metadata", and then if someone changes what you are watching,
the lock break is pushed to you to invalidate your cache.

It seems like something similar would be a candidate for
implementation in the gluster native clients?

You still have performance issues with random reads, because when you
try to open some file you still need to check that it's not
open/locked/in need of replication from some other brick. However,
what you can do is have proactive caching with active notification of
any cache invalidation; this benefits the situation where you re-read
stuff you already read, and/or where you have an effective read-ahead
grabbing stuff for you.

Interesting problem

Ed W
Raghavendra G
2010-Mar-17 12:40 UTC
[Gluster-users] GlusterFS 3.0.2 small file read performance benchmark
Hi John,

when stdout is redirected to /dev/null, tar on my laptop is not doing
any reads (tar cf - . > /dev/null). Can you confirm whether tar shows
the same behaviour on your test setup? When redirected to any file
other than /dev/null, tar does do reads. Can you attach an strace of
tar?
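(Editorial sketch, not from the thread: GNU tar detects that its
archive output is /dev/null and then skips reading file contents, so
the optimization is easy to defeat if a benchmark should really read
data:

# tar's output fd is a pipe here, not /dev/null, so data is really read:
$ tar cf - . | cat > /dev/null

# Or read the file data directly, without tar:
$ find . -type f -exec cat {} + > /dev/null
)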
regards,

On Sat, Feb 27, 2010 at 9:03 PM, John Feuerstein <john at feurix.com> wrote:

> [... full original post quoted verbatim; see the first message in
> this thread ...]

--
Raghavendra G
Raghavendra G
2010-Mar-17 17:33 UTC
[Gluster-users] Fwd: GlusterFS 3.0.2 small file read performance benchmark
sending to gluster-users.

---------- Forwarded message ----------
From: Raghavendra G <raghavendra at gluster.com>
Date: Wed, Mar 17, 2010 at 9:07 PM
Subject: Re: GlusterFS 3.0.2 small file read performance benchmark
To: John Feuerstein <john at feurix.com>

Hi John,

please find the comments inline:

On Wed, Mar 17, 2010 at 8:21 PM, John Feuerstein <john at feurix.com> wrote:

> Hello Raghavendra,
>
> > when stdout is redirected to /dev/null, tar on my laptop is not doing
> > any reads (tar cf - . > /dev/null). Can you confirm whether tar shows
> > the same behaviour on your test setup? When redirected to any file
> > other than /dev/null, tar does do reads. Can you attach an strace of
> > tar?
>
> indeed, you are right. I don't have the test systems at hand any more,
> but I have just confirmed it here on my local machine. I am sorry.
>
> This is when stdout goes to a file:
>
> > lstat("./20K-AAA", {st_mode=S_IFREG|0644, st_size=20480, ...}) = 0
> > open("./20K-AAA", O_RDONLY) = 3
> > read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 9216) = 9216
> > write(1, "./\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 10240) = 10240
> > [... more reads and writes here ...]
> > fstat(3, {st_mode=S_IFREG|0644, st_size=20480, ...}) = 0
> > close(3) = 0
>
> ... and this when it goes to /dev/null:
>
> > lstat("./20K-AAA", {st_mode=S_IFREG|0644, st_size=20480, ...}) = 0
> > lstat("./20K-AAA", {st_mode=S_IFREG|0644, st_size=20480, ...}) = 0
>
> So the tests actually did not measure meta-data+data read performance,
> but only meta-data read performance.
>
> I can understand now that io-cache, read-ahead and quick-read could
> not possibly help here (since the design of these translators does not
> affect fetching meta-data?).

yes.

> But still, it's weird that stat-prefetch makes this test slower. It
> looks like the more translators I've used, the more they worked
> "against" each other, possibly fighting for locks...?

As far as io-threads is concerned, it is not recommended for use on
the client side, since there is no blocking layer there (sockets are
non-blocking). The only reason to use it is to make use of multiple
processing units; in that case, it might be helpful on top of caching
translators, since searching through the cache may be computationally
intensive.

As you've said in your previous post, it is indeed true that
performance translators should be used judiciously, depending on the
use case. For example, in the case of random reads, read-ahead will
not be useful. In such cases they may even degrade performance, since
the performance translators themselves do some housekeeping work.

> After knowing the fact that this was a meta-data-only test, the only
> interesting measurement left is the final test run (basic config
> without unneeded performance translators) compared to the local test
> on ext4+VFS:
>
> real    0m38.576s
> user    0m3.356s
> sys     0m6.076s
>
> vs
>
> real    0m1.312s
> user    0m2.264s
> sys     0m3.256s
>
> So an io-cache for meta-data could be great? It was just ~250MB of
> data, so even if this test had read it all, the ~40 second difference
> would still be meta-data.

Though it is unfair to compare a network file system with a local file
system, I get the crux of what you are saying. stat-prefetch does do
metadata caching, but the metadata (corresponding to the dentries of a
directory) is cached when the directory is read, and the lifetime of
the cache is from the time the dentries are read until the fd
corresponding to the directory is closed. The targeted use cases were
ls -l on a huge directory, samba, etc. As far as tar is concerned, it
does do a readdir, but stat on the dentries is not sent before the
directory fd is closed. Instead, the stats are sent after the fd is
closed, hence stat-prefetch is not helping here.
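(Editorial sketch, not a command from the thread: that ordering can be
observed directly with strace; syscall names as commonly seen on
x86_64 Linux:

# If the per-entry lstat()s appear only after close() of the directory
# fd, stat-prefetch cannot serve them, exactly as described above:
$ strace -o tar.trace -e trace=open,getdents,close,lstat tar cf - . > /dev/null
$ less tar.trace
)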
Thanks for the detailed tests :)

> Best regards,
> John

regards,

--
Raghavendra G

--
Raghavendra G