I appear to be hitting a limitation in either the glusterfs FUSE client or
the glusterfsd daemon, and I wonder if there are some knobs I can tweak.

I have a 12-disk RAID10 array. If I access it locally I get the following
figures (#p = number of concurrent reader processes):

 #p  files/sec
  1      35.52
  2      66.13
  5     137.73
 10     215.51
 20     291.45
 30     337.01

If I access it as a single-brick distributed glusterfs volume over 10GE I
get the following figures:

 #p  files/sec
  1      39.09
  2      70.44
  5     135.79
 10     157.48
 20     179.75
 30     206.34

The performance tracks the raw RAID10 performance very closely at 1, 2 and
5 concurrent readers. However, at 10+ concurrent readers it falls well
below what the RAID10 volume is capable of.

These files are an average of 650K each, so 200 files/sec = 134MB/s, a
little over 10% of the 10GE bandwidth.

I am guessing there is either a limit on the number of concurrent
operations on the same filesystem/brick, or some sort of window limit I've
hit.

I have tried:

    gluster volume set raid10 performance.io-thread-count 64

and restarted glusterd and remounted the filesystem, but it didn't seem to
make any difference.

I have also tried two separate client machines, each running 15 concurrent
readers, but the aggregate throughput is no more than 30 processes on a
single client. This suggests to me that glusterfsd (the brick) is the
bottleneck. If I attach strace to this process it tells me:

    Process 1835 attached with 11 threads - interrupt to quit

Can I increase that number of threads? Is there anything else I can try?

Regards,

Brian.

Test Methodology:

I am using the measurement script below, pointing it either at /data/raid10
on the server (the raw brick) or at /mnt/raid10 on the client. The corpus
of 100K files between 500KB and 800KB was created using

    bonnie++ -d /mnt/raid10 -n 98:800k:500k:1000:1024k -s 0 -u root

and then killing it after the file creation phase.

------- 8< --------------------------------------------------------------
#!/usr/bin/ruby -w

FILEGROUPS = {
  "sdb" => "/data/sdb/Bonnie.26384/*/*",
  "sdc" => "/data/sdc/Bonnie.26384/*/*",
  "sdd" => "/data/sdd/Bonnie.26384/*/*",
  "sde" => "/data/sde/Bonnie.26384/*/*",
  "replic" => "/mnt/replic/Bonnie.3385/*/*",
  "raid10-direct" => "/data/raid10/Bonnie.5021/*/*",
  "raid10-gluster" => "/mnt/raid10/Bonnie.5021/*/*",
}

class Perftest
  attr_accessor :offset

  def initialize(filenames)
    @offset = 0
    @filenames = filenames
    @pids = []
  end

  def run(n_files, n_procs=1, dd_args="", random=false)
    # Drop the page cache so reads hit the disks, not RAM
    system("echo 3 >/proc/sys/vm/drop_caches")
    if random
      files = @filenames.sort_by { rand }[0, n_files]
    else
      files = (@filenames + @filenames)[@offset, n_files]
      @offset = (offset + n_files) % @filenames.size
    end
    chunks = files.each_slice(n_files/n_procs).to_a[0, n_procs]
    n_files = chunks.map { |chunk| chunk.size }.inject(:+)
    timed(n_files, n_procs, "#{dd_args} #{"[random]" if random}") do
      # One reader process per chunk; wait for all of them to finish
      @pids = chunks.map { |chunk| fork { run_single(chunk, dd_args); exit! } }
      @pids.delete_if { |pid| Process.waitpid(pid) }
    end
  end

  def timed(n_files, n_procs=1, args="")
    t1 = Time.now
    yield
    t2 = Time.now
    printf "%3d %10.2f  %s\n", n_procs, n_files/(t2-t1), args
  end

  def run_single(files, dd_args)
    files.each do |f|
      system("dd if='#{f}' of=/dev/null #{dd_args} 2>/dev/null")
    end
  end

  def kill_all(sig="TERM")
    @pids.each { |pid| Process.kill(sig, pid) rescue nil }
  end
end

label = ARGV[0]
unless glob = FILEGROUPS[label]
  STDERR.puts "Usage: #{$0} <filegroup>"
  exit 1
end
perftest = Perftest.new(Dir[glob].freeze)

# Remember the offset for sequential tests, so that re-runs don't use
# cached data at the server. Still better to drop vm caches at the server.
memo = "/var/tmp/perftest.offset"
perftest.offset = File.read(memo).to_i rescue 0
at_exit do
  perftest.kill_all
  File.open(memo,"w") { |f| f.puts perftest.offset }
end

puts " #p  files/sec  dd_args"
[1,2,5].each do |nprocs|
  perftest.run(10000, nprocs, "bs=1024k")
  perftest.run(4000, nprocs, "bs=1024k", 1)
end
[10,20,30].each do |nprocs|
  perftest.run(10000, nprocs, "bs=1024k")
  perftest.run(10000, nprocs, "bs=1024k", 1)
end
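As a side note on the strace output above: the thread count can also be
read straight out of /proc without attaching to the process. A minimal
Ruby sketch, assuming a Linux /proc layout; the default PID 1835 is only a
placeholder taken from the strace line.

    #!/usr/bin/ruby -w
    # Count the threads of a running process (e.g. glusterfsd) without strace.
    # On Linux, each thread appears as a directory under /proc/<pid>/task.

    pid = ARGV[0] || "1835"            # placeholder PID from the strace output
    tasks = Dir.glob("/proc/#{pid}/task/*")
    abort "no such process: #{pid}" if tasks.empty?
    puts "#{tasks.size} threads in process #{pid}"

    # Per-thread names, if the kernel exposes /proc/<pid>/task/<tid>/comm
    tasks.each do |t|
      tid  = File.basename(t)
      name = (File.read("#{t}/comm").chomp rescue "?")
      puts "  tid #{tid}  #{name}"
    end

Run as "ruby threads.rb <pid>"; re-running it under load shows whether the
brick ever grows beyond the 11 threads reported above.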
Concurrency will be affected by your underlying filesystem as well. Which
filesystem are you using? In our environment we have ext4 and xfs: ext4 is
a touch faster but doesn't handle as much concurrency, while xfs handles
lots of streams fairly well but runs slower. Aside from that, though,
don't expect much from me ;)

-greg
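For reference, a quick way to answer the "which filesystem" question for a
brick directory is to look it up in /proc/mounts. A minimal Ruby sketch,
assuming Linux; /data/raid10 is used purely as an example path from the
earlier post.

    #!/usr/bin/ruby -w
    # Print the filesystem type behind a given path by finding the longest
    # matching mount point in /proc/mounts.

    path = ARGV[0] || "/data/raid10"   # example path only

    mounts = File.readlines("/proc/mounts").map { |line| line.split }
    match  = mounts.select { |_dev, mnt, _fstype|
               mnt == "/" || path == mnt || path.start_with?(mnt + "/")
             }.max_by { |_dev, mnt, _fstype| mnt.length }

    if match
      dev, mnt, fstype = match
      puts "#{path} is on #{fstype} (#{dev} mounted at #{mnt})"
    else
      puts "no mount found for #{path}"
    end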
Hi,

On Tuesday 07 February 2012 14:09:33 Brian Candler wrote:
> I appear to be hitting a limitation in either the glusterfs FUSE client
> or the glusterfsd daemon, and I wonder if there are some knobs I can
> tweak.
[...]
> The performance tracks the raw RAID10 performance very closely at 1, 2
> and 5 concurrent readers. However, at 10+ concurrent readers it falls
> well below what the RAID10 volume is capable of.

I did some similar tests, but with a slower machine and slower disks. The
"problem" with distributed filesystems is the distributed locking. Even
though your test volume is on one system only, access locking is not done
only by the filesystem in the kernel but additionally by the FUSE client
and/or glusterd. That imposes a limit. And when the volume stretches
across several machines, even though the reading might be done from the
local disk, the locking has to be synchronized across all brick machines.
Another limit.

With gluster it is the client that does all the synchronization; that's
why there is one FUSE thread on the client that will max out your CPU when
running dbench and the like, no matter whether the volume is local,
distributed/replicated or remote.

My conclusion from my tests so far is that the most rewarding target for
optimization is the FUSE client of glusterfs, and maybe the way it talks
to its companions and to the bricks. But I am not yet finished with my
tests, and I still hope that glusterfs proves usable for distributed
vm-image storage.

Have fun,

Arnold
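A rough way to check the claim about a single FUSE thread maxing out a CPU
is to sample per-thread CPU time from /proc while the read test runs. A
sketch, assuming Linux /proc, a clock tick rate (USER_HZ) of 100, and a
PID supplied on the command line (e.g. whatever pgrep glusterfs reports).

    #!/usr/bin/ruby -w
    # Rough per-thread CPU usage for a process, sampled over a few seconds.

    pid = ARGV[0]
    abort "usage: #{$0} <pid>" unless pid
    interval = 5  # seconds

    def cpu_ticks(pid)
      Dir.glob("/proc/#{pid}/task/*/stat").each_with_object({}) do |path, h|
        tid = path.split("/")[-2]
        # After the "(comm)" field, fields 12 and 13 (1-based: utime, stime)
        # are CPU time in clock ticks.
        rest = File.read(path).split(") ").last.split
        h[tid] = rest[11].to_i + rest[12].to_i
      end
    rescue Errno::ENOENT
      {}   # a thread exited while we were reading; ignore this sample
    end

    before = cpu_ticks(pid)
    sleep interval
    after  = cpu_ticks(pid)

    hz = 100.0  # assumes the common USER_HZ of 100 ticks/second
    after.sort_by { |tid, t| -(t - (before[tid] || 0)) }.each do |tid, t|
      delta = t - (before[tid] || 0)
      printf "tid %-8s %5.1f%% cpu\n", tid, 100.0 * delta / hz / interval
    end

If one thread sits near 100% while the others are idle, that thread is the
serialization point Arnold describes.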
Fuse has a limitation wherein requests become single-threaded at the
server layer. I had opened a bug but have not got any traction so far.
Adding more concurrent connections starts to cause a lot of context
switching at the FUSE layer. This is an inherent limitation.

On Tue, Feb 7, 2012 at 6:09 AM, Brian Candler <B.Candler at pobox.com> wrote:
> I appear to be hitting a limitation in either the glusterfs FUSE client
> or the glusterfsd daemon, and I wonder if there are some knobs I can
> tweak.
[...]
> Can I increase that number of threads? Is there anything else I can try?
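One way to put a number on the context switching mentioned above is to sum
the per-thread counters that Linux exposes in /proc/<pid>/task/<tid>/status
before and after a test run. A sketch, assuming those status fields
(voluntary_ctxt_switches / nonvoluntary_ctxt_switches) are present and
that the read workload is driven separately while the script sleeps.

    #!/usr/bin/ruby -w
    # Sum context switches across all threads of a process over a window.

    def ctxt_switches(pid)
      Dir.glob("/proc/#{pid}/task/*/status").inject(0) do |total, path|
        lines = (File.readlines(path) rescue []).grep(/ctxt_switches/)
        total + lines.inject(0) { |t, l| t + l.split.last.to_i }
      end
    end

    pid = ARGV[0]
    abort "usage: #{$0} <pid>" unless pid

    before = ctxt_switches(pid)
    sleep 10                      # run the read workload during this window
    after  = ctxt_switches(pid)
    puts "#{after - before} context switches in 10s for pid #{pid}"

Comparing the figure for the 5-reader and 30-reader runs would show how
quickly the switching overhead grows with concurrency.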
On Tue, Feb 07, 2012 at 06:30:56PM +0100, Arnold Krille wrote:
> > GlusterFS is a file-level protocol, more like NFS, and as far as I know
> > there is no inherent locking between clients.
>
> There has to be locking. Otherwise two apps on two machines opening the
> same file for writing would destroy each other's changes. Therefore one
> client has to gather locks on all brick filesystems (which is the same as
> synchronizing access with all other clients).

If two clients do a write() on the same area of the file, then one will
get there first, and the second will overwrite the first. And if there
were a lock, how would it help? Someone else please correct me if I'm
wrong.

> One client opens the file for reading, the other opens the file for
> trunc|write. What do you get on the first client? How could this scenario
> be safe without some kind of locking?

If you want *useful* semantics in that situation then the clients can
explicitly request an advisory lock on the file, or on ranges of the file,
if they so wish. But this is not done for them by the filesystem.

> I might be wrong and glusterfs really doesn't do any locking to prevent
> concurrent write accesses. But if that's true, I think this rules out
> glusterfs for any usage above "proof-of-concept".

No such locking occurs when two concurrent processes on the same machine
read and write a file, so why should it take place when the operation is
over a network?

Regards,

Brian.
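For completeness, this is the kind of explicit advisory locking meant
above, sketched with Ruby's File#flock (whole-file flock(2) locks rather
than fcntl byte-range locks). The path is only an example, and whether a
given glusterfs version propagates such locks between clients is something
to verify rather than assume.

    #!/usr/bin/ruby -w
    # Advisory whole-file locking: take an exclusive lock, write, release.

    path = ARGV[0] || "/mnt/raid10/locktest"   # example path only

    File.open(path, File::RDWR | File::CREAT, 0644) do |f|
      if f.flock(File::LOCK_EX | File::LOCK_NB)
        puts "got exclusive lock, writing..."
        f.truncate(0)
        f.write("#{Process.pid} was here at #{Time.now}\n")
        sleep 5                   # hold the lock so a second run can observe it
        f.flock(File::LOCK_UN)
      else
        puts "file is locked by another process; not touching it"
      end
    end

Running this concurrently from two clients against the same file shows
whether the second run is refused by (or, without LOCK_NB, blocks on) the
first one's lock.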