Willem
2013-Apr-18  18:28 UTC
[Gluster-users] Low (<0.2ms) latency reads, is it possible at all?
I'm testing GlusterFS viability for use with a typical PHP webapp (i.e. lots
of small files). I don't care much about the C in the CAP theorem, as I
have very few writes. I could live with a write propagation delay of 5
minutes (or dirty caches for up to 5 minutes).
So I'm optimizing for low-latency reads of small files. My test setup is
2-node replication. Each node is both server and gluster client. Both are in
sync. I stop glusterfs-server on node2. On node1, I run a simple benchmark:
repeatedly (to prime the cache) open & close 1000 small files. I have
enabled the client-side io-cache and quick-read translators (see below for
config).
The results are consistently 2 ms per open (O_RDONLY) call, which is too
slow, unfortunately, as I need < 0.2ms.
With the same test against a local Gluster server over an NFS mount, I get
somewhat better performance, but still 0.6ms.
With the same test against a Linux NFS server (v3) and a local mount, I get
0.12ms per open.
I can't explain the lag using Gluster, because I can't see any traffic
being sent to node2. I would expect that using the io-cache translator and
local-only operation, the performance would approach that of the kernel FS
cache.
Is this assumption correct? If yes, how would I profile the client
subsystem to detect the bottleneck?
If no, then I have to accept that 0.8ms open calls are the best that I
could squeeze out of this system. Then I'll probably look into AFS,
userspace async replication or gluster NFS mount with cachefilesd. Which
would you recommend?
Thanks a lot!
BTW I like Gluster a lot, and hope that it is also suitable for this small
files use case ;)
//Willem
PS Am testing with kernel 3.5.0-17-generic 64bit and gluster 3.2.5-1ubuntu1.
Client volfile:
+------------------------------------------------------------------------------+
volume testvol-client-0
    type protocol/client
    option remote-host g1
    option remote-subvolume /data
    option transport-type tcp
end-volume

volume testvol-client-1
    type protocol/client
    option remote-host g2
    option remote-subvolume /data
    option transport-type tcp
end-volume

volume testvol-replicate-0
    type cluster/replicate
    subvolumes testvol-client-0 testvol-client-1
end-volume

volume testvol-write-behind
    type performance/write-behind
    option flush-behind on
    subvolumes testvol-replicate-0
end-volume

volume testvol-io-cache
    type performance/io-cache
    option max-file-size 256KB
    option cache-timeout 60
    option priority *.php:3,*:0
    option cache-size 256MB
    subvolumes testvol-write-behind
end-volume

volume testvol-quick-read
    type performance/quick-read
    option cache-size 256MB
    subvolumes testvol-io-cache
end-volume

volume testvol
    type debug/io-stats
    option latency-measurement off
    option count-fop-hits off
    subvolumes testvol-quick-read
end-volume
Server volfile:
+------------------------------------------------------------------------------+
volume testvol-posix
    type storage/posix
    option directory /data
end-volume

volume testvol-access-control
    type features/access-control
    subvolumes testvol-posix
end-volume

volume testvol-locks
    type features/locks
    subvolumes testvol-access-control
end-volume

volume testvol-io-threads
    type performance/io-threads
    subvolumes testvol-locks
end-volume

volume testvol-marker
    type features/marker
    option volume-uuid bc89684f-569c-48b0-bc67-09bfd30ba253
    option timestamp-file /etc/glusterd/vols/testvol/marker.tstamp
    option xtime off
    option quota off
    subvolumes testvol-io-threads
end-volume

volume /data
    type debug/io-stats
    option latency-measurement off
    option count-fop-hits off
    subvolumes testvol-marker
end-volume

volume testvol-server
    type protocol/server
    option transport-type tcp
    option auth.addr./data.allow *
    subvolumes /data
end-volume
My benchmark to simulate PHP webapp i/o:
#!/usr/bin/env python
import sys
import os
import time
import optparse
def print_timing(func):
    def wrapper(*arg):
        t1 = time.time()
        res = func(*arg)
        t2 = time.time()
        # Parenthesized print and __name__ run under both Python 2 and Python 3.
        print('%-15.15s %6d ms' % (func.__name__, int((t2 - t1) * 1000.0)))
        return res
    return wrapper
def parse_options():
    parser = optparse.OptionParser()
    parser.add_option("--path", '-p', default="/mnt/glusterfs",
        help="Base directory for running tests (default: /mnt/glusterfs)",
    )
    parser.add_option("--num", '-n', type="int", default=100,
        help="Number of files per test (default: 100)",
    )
    (options, args) = parser.parse_args()
    return options
class FSBench():
    def __init__(self,path="/tmp",num=100):
        self.path = path
        self.num  = num
    @print_timing
    def test_open_read(self):
        for filename in self.get_files():
            f = open(filename)
            data = f.read()
            f.close()
    def get_files(self):
        for i in range(self.num):
            filename = self.path + "/test_%03d" % i
            yield filename
    @print_timing
    def test_stat(self):
        for filename in self.get_files():
            os.stat(filename)
    @print_timing
    def test_stat_nonexist(self):
        for filename in self.get_files():
            try:
                os.stat(filename+"blkdsflskdf")
            except OSError:
                pass
    @print_timing
    def test_write(self):
        for filename in self.get_files():
            f = open(filename,'w')
            f.write('hi there\n')
            f.close()
    @print_timing
    def test_delete(self):
        for filename in self.get_files():
            os.unlink(filename)
if __name__ == '__main__':
    options = parse_options()
    bench   = FSBench(path=options.path, num=options.num)
    bench.test_write()
    bench.test_open_read()
    bench.test_stat()
    bench.test_stat_nonexist()
    bench.test_delete()
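
The benchmark above prints one total per test, rounded to whole milliseconds, so sub-millisecond differences per call are hard to see. The following is a minimal sketch, not part of the original post, of timing each open/close pair individually and reporting the median, 99th percentile and maximum. The names `time_opens` and `summarize` are hypothetical, it assumes Python 3 (for `time.perf_counter`), and the demo runs against a temporary directory rather than a Gluster mount.

```python
import os
import statistics
import tempfile
import time


def time_opens(paths, rounds=3):
    """Return per-call open()+close() latencies in milliseconds.

    Every path is opened `rounds` times; only the last round is recorded,
    so the earlier rounds warm whatever caches sit in the I/O path.
    """
    latencies = []
    for round_no in range(rounds):
        record = round_no == rounds - 1
        for path in paths:
            start = time.perf_counter()
            fd = os.open(path, os.O_RDONLY)
            os.close(fd)
            elapsed_ms = (time.perf_counter() - start) * 1000.0
            if record:
                latencies.append(elapsed_ms)
    return latencies


def summarize(latencies):
    """Return (median, 99th percentile, max) of a list of latencies."""
    ordered = sorted(latencies)
    p99_index = min(len(ordered) - 1, int(len(ordered) * 0.99))
    return statistics.median(ordered), ordered[p99_index], ordered[-1]


if __name__ == "__main__":
    # Demo on a temporary directory; point `base` at the mount under test instead.
    with tempfile.TemporaryDirectory() as base:
        paths = []
        for i in range(100):
            path = os.path.join(base, "test_%03d" % i)
            with open(path, "w") as handle:
                handle.write("hi there\n")
            paths.append(path)
        median, p99, worst = summarize(time_opens(paths))
        print("median %.3f ms  p99 %.3f ms  max %.3f ms" % (median, p99, worst))
```

Looking at the median separately from the tail shows whether a high average comes from every call being slow or from a few outliers.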
Raghavendra Gowdappa
2013-Apr-24  03:31 UTC
[Gluster-users] Low (<0.2ms) latency reads, is it possible at all?
Hi Willem,
Please find the inlined comments:

----- Original Message -----
> From: "Willem" <gwillem at gmail.com>
> To: gluster-users at gluster.org
> Sent: Thursday, April 18, 2013 11:58:46 PM
> Subject: [Gluster-users] Low (<0.2ms) latency reads, is it possible at all?
>
> [...]
> volume testvol-quick-read
>     type performance/quick-read

The default value for option "max-file-size" is 64KB. Seems like your files
are bigger than 64KB. Can you add this option and rerun the tests?
Also, can you rerun the tests by disabling quick-read and compare the results?
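
Whether files exceed the 64KB default mentioned above can be checked directly. The sketch below is an illustration rather than anything posted in the thread: `files_over_limit` is a hypothetical helper that lists regular files in a directory larger than a given limit. Note that `test_write` in the posted benchmark writes only the 9-byte string 'hi there\n', so a check like this matters mainly for the real application files.

```python
import os
import tempfile

# 64KB, the quick-read "max-file-size" default quoted in the reply above.
QUICK_READ_DEFAULT_LIMIT = 64 * 1024


def files_over_limit(directory, limit=QUICK_READ_DEFAULT_LIMIT):
    """Return sorted (name, size) pairs for regular files larger than `limit` bytes."""
    oversized = []
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        if os.path.isfile(path):
            size = os.path.getsize(path)
            if size > limit:
                oversized.append((name, size))
    return sorted(oversized)


if __name__ == "__main__":
    # Demo on a temporary directory; pass the real document root instead.
    with tempfile.TemporaryDirectory() as base:
        with open(os.path.join(base, "small.php"), "w") as handle:
            handle.write("hi there\n")
        with open(os.path.join(base, "large.php"), "wb") as handle:
            handle.write(b"x" * 70000)
        for name, size in files_over_limit(base):
            print("%s: %d bytes exceeds the limit" % (name, size))
```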
Marcus Bointon
2013-Apr-24  12:18 UTC
[Gluster-users] Low (<0.2ms) latency reads, is it possible at all?
On 24 Apr 2013, at 14:00, "Willem" <gwillem at gmail.com> wrote:
>> I'm testing GlusterFS viability for use with a typical PHP webapp (ie. lots
>> of small files). I don't care so much for the C in the CAP theorem, as I
>> have very few writes. I could live with a write propagation delay of 5
>> minutes (or dirty caches for up to 5 minutes).

We know that gluster's small-file performance isn't good, and since you can
live with such long write propagation, reciprocal rsync could be a better
and simpler solution. That way you'd get much faster local performance. The
only issue really is that deletes aren't possible to propagate correctly
with 2-way rsync (because a delete at one end is indistinguishable from an
add at the other), but you may be able to live with it. csync2 aims to solve
the delete issue with a transaction database, but I could never make it work.

To get a consistent 0.2ms off anything you're going to need to be on SSDs -
you can't fit everything in your cache.

Marcus
--
Marcus Bointon
Synchromedia Limited: Creators of http://www.smartmessages.net/
UK info at hand CRM solutions
marcus at synchromedia.co.uk | http://www.synchromedia.co.uk/
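
The delete ambiguity described above can be made concrete with a small sketch. This is an illustration under simplifying assumptions, not how rsync or csync2 is actually implemented: each side is reduced to a set of file names, modifications and conflicts are ignored, and `one_sided_files` and `plan_sync` are hypothetical names. With only the two current listings, a file present on one side could be new there or deleted on the other; a record of the last synced state resolves it.

```python
def one_sided_files(side_a, side_b):
    """Files present on exactly one side; the listings alone cannot say why."""
    return sorted(set(side_a) ^ set(side_b))


def plan_sync(side_a, side_b, last_synced):
    """Decide per file whether to copy or delete, given the last synced listing.

    A file that existed at the last sync but is now missing on one side was
    deleted there; a file that did not exist at the last sync is new on the
    side that has it.
    """
    a, b, previous = set(side_a), set(side_b), set(last_synced)
    plan = {}
    for name in sorted(a ^ b):
        holder, other = ("A", "B") if name in a else ("B", "A")
        if name in previous:
            plan[name] = "delete on %s (removed on %s)" % (holder, other)
        else:
            plan[name] = "copy %s -> %s (new on %s)" % (holder, other, holder)
    return plan


if __name__ == "__main__":
    last_synced = ["index.php", "old.php"]
    node_a = ["index.php", "new.php"]   # old.php deleted and new.php added on A
    node_b = ["index.php", "old.php"]   # unchanged since the last sync
    print(one_sided_files(node_a, node_b))
    for name, action in plan_sync(node_a, node_b, last_synced).items():
        print("%s: %s" % (name, action))
```

Without `last_synced`, both `new.php` and `old.php` look the same (present on one side only), and a plain two-way copy would resurrect `old.php` on node A.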