I appear to be hitting a limitation in either the glusterfs FUSE client or
the glusterfsd daemon, and I wonder if there are some knobs I can tweak.

I have a 12-disk RAID10 array. If I access it locally I get the following
figures (#p = number of concurrent reader processes):

 #p  files/sec
  1      35.52
  2      66.13
  5     137.73
 10     215.51
 20     291.45
 30     337.01

If I access it as a single-brick distributed glusterfs volume over 10GE I
get the following figures:

 #p  files/sec
  1      39.09
  2      70.44
  5     135.79
 10     157.48
 20     179.75
 30     206.34

The performance tracks the raw RAID10 performance very closely at 1, 2 and
5 concurrent readers. However, at 10+ concurrent readers it falls well
below what the RAID10 volume is capable of.

These files are an average of 650K each, so 200 files/sec = 134MB/s, a
little over 10% of the 10GE bandwidth.

I am guessing there is either a limit on the number of concurrent
operations on the same filesystem/brick, or some sort of window limit I've
hit.

I have tried:

    gluster volume set raid10 performance.io-thread-count 64

and restarted glusterd and remounted the filesystem, but it didn't seem to
make any difference.

I have also tried two separate client machines, each running 15 concurrent
readers, but the aggregate throughput is no more than 30 processes on a
single client. This suggests to me that glusterfsd (the brick) is the
bottleneck. If I attach strace to this process it tells me:

    Process 1835 attached with 11 threads - interrupt to quit

Can I increase that number of threads? Is there anything else I can try?

Regards,

Brian.

Test Methodology:

I am using the measurement script below, pointing it either at /data/raid10
on the server (the raw brick) or at /mnt/raid10 on the client. The corpus
of 100K files between 500KB and 800KB was created using

    bonnie++ -d /mnt/raid10 -n 98:800k:500k:1000:1024k -s 0 -u root

and then killing it after the file creation phase.

------- 8< --------------------------------------------------------------
#!/usr/bin/ruby -w

FILEGROUPS = {
  "sdb" => "/data/sdb/Bonnie.26384/*/*",
  "sdc" => "/data/sdc/Bonnie.26384/*/*",
  "sdd" => "/data/sdd/Bonnie.26384/*/*",
  "sde" => "/data/sde/Bonnie.26384/*/*",
  "replic" => "/mnt/replic/Bonnie.3385/*/*",
  "raid10-direct" => "/data/raid10/Bonnie.5021/*/*",
  "raid10-gluster" => "/mnt/raid10/Bonnie.5021/*/*",
}

class Perftest
  attr_accessor :offset

  def initialize(filenames)
    @offset = 0
    @filenames = filenames
    @pids = []
  end

  def run(n_files, n_procs=1, dd_args="", random=false)
    # Drop the page cache so reads hit the disks, not RAM
    system("echo 3 >/proc/sys/vm/drop_caches")
    if random
      files = @filenames.sort_by { rand }[0, n_files]
    else
      files = (@filenames + @filenames)[@offset, n_files]
      @offset = (offset + n_files) % @filenames.size
    end
    chunks = files.each_slice(n_files/n_procs).to_a[0, n_procs]
    n_files = chunks.map { |chunk| chunk.size }.inject(:+)
    timed(n_files, n_procs, "#{dd_args} #{"[random]" if random}") do
      # One reader process per chunk; wait for all of them to finish
      @pids = chunks.map { |chunk| fork { run_single(chunk, dd_args); exit! } }
      @pids.delete_if { |pid| Process.waitpid(pid) }
    end
  end

  def timed(n_files, n_procs=1, args="")
    t1 = Time.now
    yield
    t2 = Time.now
    printf "%3d %10.2f  %s\n", n_procs, n_files/(t2-t1), args
  end

  def run_single(files, dd_args)
    files.each do |f|
      system("dd if='#{f}' of=/dev/null #{dd_args} 2>/dev/null")
    end
  end

  def kill_all(sig="TERM")
    @pids.each { |pid| Process.kill(sig, pid) rescue nil }
  end
end

label = ARGV[0]
unless glob = FILEGROUPS[label]
  STDERR.puts "Usage: #{$0} <filegroup>"
  exit 1
end
perftest = Perftest.new(Dir[glob].freeze)

# Remember the offset for sequential tests, so that re-runs don't use
# cached data at the server. Still better to drop vm caches at the server.
memo = "/var/tmp/perftest.offset"
perftest.offset = File.read(memo).to_i rescue 0
at_exit do
  perftest.kill_all
  File.open(memo,"w") { |f| f.puts perftest.offset }
end

puts " #p  files/sec  dd_args"
[1,2,5].each do |nprocs|
  perftest.run(10000, nprocs, "bs=1024k")
  perftest.run(4000, nprocs, "bs=1024k", 1)
end
[10,20,30].each do |nprocs|
  perftest.run(10000, nprocs, "bs=1024k")
  perftest.run(10000, nprocs, "bs=1024k", 1)
end
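As a side note on the strace output above: the thread count can also be
read straight out of /proc without attaching to the process. A minimal
Ruby sketch, assuming a Linux /proc layout; the default PID 1835 is only a
placeholder taken from the strace line.

    #!/usr/bin/ruby -w
    # Count the threads of a running process (e.g. glusterfsd) without strace.
    # On Linux, each thread appears as a directory under /proc/<pid>/task.

    pid = ARGV[0] || "1835"            # placeholder PID from the strace output
    tasks = Dir.glob("/proc/#{pid}/task/*")
    abort "no such process: #{pid}" if tasks.empty?
    puts "#{tasks.size} threads in process #{pid}"

    # Per-thread names, if the kernel exposes /proc/<pid>/task/<tid>/comm
    tasks.each do |t|
      tid  = File.basename(t)
      name = (File.read("#{t}/comm").chomp rescue "?")
      puts "  tid #{tid}  #{name}"
    end

Run as "ruby threads.rb <pid>"; re-running it under load shows whether the
brick ever grows beyond the 11 threads reported above.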
Concurrency will be affected by your underlying filesystem as well. Which
filesystem are you using? In our environment we have ext4 and xfs: ext4 is
a touch faster but doesn't handle as much concurrency, while xfs handles
lots of streams fairly well but runs slower. Aside from that, though,
don't expect much from me ;)

-greg
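For reference, a quick way to answer the "which filesystem" question for a
brick directory is to look it up in /proc/mounts. A minimal Ruby sketch,
assuming Linux; /data/raid10 is used purely as an example path from the
earlier post.

    #!/usr/bin/ruby -w
    # Print the filesystem type behind a given path by finding the longest
    # matching mount point in /proc/mounts.

    path = ARGV[0] || "/data/raid10"   # example path only

    mounts = File.readlines("/proc/mounts").map { |line| line.split }
    match  = mounts.select { |_dev, mnt, _fstype|
               mnt == "/" || path == mnt || path.start_with?(mnt + "/")
             }.max_by { |_dev, mnt, _fstype| mnt.length }

    if match
      dev, mnt, fstype = match
      puts "#{path} is on #{fstype} (#{dev} mounted at #{mnt})"
    else
      puts "no mount found for #{path}"
    end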
Hi,

On Tuesday 07 February 2012 14:09:33 Brian Candler wrote:
> I appear to be hitting a limitation in either the glusterfs FUSE client
> or the glusterfsd daemon, and I wonder if there are some knobs I can
> tweak.
[...]
> The performance tracks the raw RAID10 performance very closely at 1, 2
> and 5 concurrent readers. However, at 10+ concurrent readers it falls
> well below what the RAID10 volume is capable of.

I did some similar tests, but with a slower machine and slower disks. The
"problem" with distributed filesystems is the distributed locking. Even
though your test volume is on one system only, access locking is not done
only by the filesystem in the kernel but additionally by the FUSE client
and/or glusterd. That imposes a limit. And when the volume stretches
across several machines, even though the reading might be done from the
local disk, the locking has to be synchronized across all brick machines.
Another limit.

With gluster it is the client that does all the synchronization; that's
why there is one FUSE thread on the client that will max out your CPU when
running dbench and the like, no matter whether the volume is local,
distributed/replicated or remote.

My conclusion from my tests so far is that the most rewarding target for
optimization is the FUSE client of glusterfs, and maybe the way it talks
to its companions and to the bricks. But I am not yet finished with my
tests, and I still hope that glusterfs proves usable for distributed
vm-image storage.

Have fun,

Arnold
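A rough way to check the claim about a single FUSE thread maxing out a CPU
is to sample per-thread CPU time from /proc while the read test runs. A
sketch, assuming Linux /proc, a clock tick rate (USER_HZ) of 100, and a
PID supplied on the command line (e.g. whatever pgrep glusterfs reports).

    #!/usr/bin/ruby -w
    # Rough per-thread CPU usage for a process, sampled over a few seconds.

    pid = ARGV[0]
    abort "usage: #{$0} <pid>" unless pid
    interval = 5  # seconds

    def cpu_ticks(pid)
      Dir.glob("/proc/#{pid}/task/*/stat").each_with_object({}) do |path, h|
        tid = path.split("/")[-2]
        # After the "(comm)" field, fields 12 and 13 (1-based: utime, stime)
        # are CPU time in clock ticks.
        rest = File.read(path).split(") ").last.split
        h[tid] = rest[11].to_i + rest[12].to_i
      end
    rescue Errno::ENOENT
      {}   # a thread exited while we were reading; ignore this sample
    end

    before = cpu_ticks(pid)
    sleep interval
    after  = cpu_ticks(pid)

    hz = 100.0  # assumes the common USER_HZ of 100 ticks/second
    after.sort_by { |tid, t| -(t - (before[tid] || 0)) }.each do |tid, t|
      delta = t - (before[tid] || 0)
      printf "tid %-8s %5.1f%% cpu\n", tid, 100.0 * delta / hz / interval
    end

If one thread sits near 100% while the others are idle, that thread is the
serialization point Arnold describes.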
Fuse has a limitation wherein requests become single-threaded at the
server layer. I had opened a bug but have not got any traction so far.
Adding more concurrent connections starts to cause a lot of context
switching at the FUSE layer. This is an inherent limitation.

On Tue, Feb 7, 2012 at 6:09 AM, Brian Candler <B.Candler at pobox.com> wrote:
> I appear to be hitting a limitation in either the glusterfs FUSE client
> or the glusterfsd daemon, and I wonder if there are some knobs I can
> tweak.
[...]
> Can I increase that number of threads? Is there anything else I can try?
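One way to put a number on the context switching mentioned above is to sum
the per-thread counters that Linux exposes in /proc/<pid>/task/<tid>/status
before and after a test run. A sketch, assuming those status fields
(voluntary_ctxt_switches / nonvoluntary_ctxt_switches) are present and
that the read workload is driven separately while the script sleeps.

    #!/usr/bin/ruby -w
    # Sum context switches across all threads of a process over a window.

    def ctxt_switches(pid)
      Dir.glob("/proc/#{pid}/task/*/status").inject(0) do |total, path|
        lines = (File.readlines(path) rescue []).grep(/ctxt_switches/)
        total + lines.inject(0) { |t, l| t + l.split.last.to_i }
      end
    end

    pid = ARGV[0]
    abort "usage: #{$0} <pid>" unless pid

    before = ctxt_switches(pid)
    sleep 10                      # run the read workload during this window
    after  = ctxt_switches(pid)
    puts "#{after - before} context switches in 10s for pid #{pid}"

Comparing the figure for the 5-reader and 30-reader runs would show how
quickly the switching overhead grows with concurrency.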
On Tue, Feb 07, 2012 at 06:30:56PM +0100, Arnold Krille wrote:
> > GlusterFS is a file-level protocol, more like NFS, and as far as I know
> > there is no inherent locking between clients.
>
> There has to be locking. Otherwise two apps on two machines opening the
> same file for writing would destroy each other's changes. Therefore one
> client has to gather locks on all brick filesystems (which is the same as
> synchronizing access with all other clients).

If two clients do a write() on the same area of the file, then one will
get there first, and the second will overwrite the first. And if there
were a lock, how would it help? Someone else please correct me if I'm
wrong.

> One client opens the file for reading, the other opens the file for
> trunc|write. What do you get on the first client? How could this scenario
> be safe without some kind of locking?

If you want *useful* semantics in that situation then the clients can
explicitly request an advisory lock on the file, or on ranges of the file,
if they so wish. But this is not done for them by the filesystem.

> I might be wrong and glusterfs really doesn't do any locking to prevent
> concurrent write accesses. But if that's true, I think this rules out
> glusterfs for any usage above "proof-of-concept".

No such locking occurs when two concurrent processes on the same machine
read and write a file, so why should it take place when the operation is
over a network?

Regards,

Brian.
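For completeness, this is the kind of explicit advisory locking meant
above, sketched with Ruby's File#flock (whole-file flock(2) locks rather
than fcntl byte-range locks). The path is only an example, and whether a
given glusterfs version propagates such locks between clients is something
to verify rather than assume.

    #!/usr/bin/ruby -w
    # Advisory whole-file locking: take an exclusive lock, write, release.

    path = ARGV[0] || "/mnt/raid10/locktest"   # example path only

    File.open(path, File::RDWR | File::CREAT, 0644) do |f|
      if f.flock(File::LOCK_EX | File::LOCK_NB)
        puts "got exclusive lock, writing..."
        f.truncate(0)
        f.write("#{Process.pid} was here at #{Time.now}\n")
        sleep 5                   # hold the lock so a second run can observe it
        f.flock(File::LOCK_UN)
      else
        puts "file is locked by another process; not touching it"
      end
    end

Running this concurrently from two clients against the same file shows
whether the second run is refused by (or, without LOCK_NB, blocks on) the
first one's lock.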