Michael Richardson
2020-Jan-05 01:05 UTC
[Gluster-users] Performance tuning suggestions for nvme on aws
Hi all!

I'm experimenting with GlusterFS for the first time and have built a simple three-node cluster using AWS 'i3en' type instances. These instances provide raw nvme devices that are incredibly fast. What I'm finding in these tests is that gluster is offering only a fraction of the raw nvme performance in a 3-replica set (ie, 3 nodes with 1 brick each). I'm wondering if there is anything I can do to squeeze more performance out.

For testing, I'm running fio using a 16GB test file with a 75/25 read/write split. Basically I'm trying to replicate the I/O pattern of a MySQL database, which is what I'd ideally like to host here (which I realise is probably not practical). My fio test command is:

$ fio --name=fio-test2 --filename=fio-test \
      --randrepeat=1 \
      --ioengine=libaio \
      --direct=1 \
      --runtime=300 \
      --bs=16k \
      --iodepth=64 \
      --size=16G \
      --readwrite=randrw \
      --rwmixread=75 \
      --group_reporting \
      --numjobs=4

When I run this test directly on the nvme disk, I get:

   READ: bw=313MiB/s (328MB/s), 313MiB/s-313MiB/s (328MB/s-328MB/s), io=47.0GiB (51.5GB), run=156806-156806msec
  WRITE: bw=105MiB/s (110MB/s), 105MiB/s-105MiB/s (110MB/s-110MB/s), io=16.0GiB (17.2GB), run=156806-156806msec

When I put the disk into a gluster 3-replica volume, I get:

   READ: bw=86.3MiB/s (90.5MB/s), 86.3MiB/s-86.3MiB/s (90.5MB/s-90.5MB/s), io=25.3GiB (27.2GB), run=300002-300002msec
  WRITE: bw=28.9MiB/s (30.3MB/s), 28.9MiB/s-28.9MiB/s (30.3MB/s-30.3MB/s), io=8676MiB (9098MB), run=300002-300002msec

If I do the same but with only 2 replicas, I get the same performance results. I also get roughly the same values when doing 'read', 'randread', 'write', and 'randwrite' tests. I'm testing directly on one of the storage nodes, so there are no variables like client/server network performance in the mix.

I ran the same test with EBS volumes and saw similar performance drops when offering up the volume using gluster. A "Provisioned IOPS" EBS volume that could offer 10,000 IOPS directly was getting only about 3,500 IOPS when running as part of a gluster volume.

We're using TLS on the management and volume connections, but I'm not seeing any CPU or memory constraint when using these volumes, so I don't believe that is the bottleneck. Similarly, when I try with SSL turned off, I see no change in performance.

Does anyone have any suggestions on things I might try to increase performance when using these very fast disks as part of a gluster volume, or is this to be expected when factoring in all the extra work that gluster needs to do when replicating data around volumes?

Thanks very much for your time! I'll put the two full fio outputs below if anyone wants more details.
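For reference, the volume under test was set up along these lines (a rough sketch; the volume name, hostnames, brick paths, and mount point below are placeholders, not the exact values I used):

  # replica-3 volume, one NVMe-backed brick per node (names are illustrative)
  gluster volume create gv0 replica 3 \
      node1:/bricks/nvme0/brick \
      node2:/bricks/nvme0/brick \
      node3:/bricks/nvme0/brick
  gluster volume start gv0

  # fio runs against a FUSE mount on one of the storage nodes themselves
  mount -t glusterfs localhost:/gv0 /mnt/gv0

  # for the "SSL off" comparison, TLS on the I/O path was disabled with the
  # standard volume options (management TLS is controlled separately via
  # /var/lib/glusterd/secure-access)
  gluster volume set gv0 client.ssl off
  gluster volume set gv0 server.ssl off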
Mike

- First full fio test, nvme device without gluster

fio-test: (groupid=0, jobs=4): err= 0: pid=5636: Sat Jan 4 23:09:18 2020
  read: IOPS=20.0k, BW=313MiB/s (328MB/s)(47.0GiB/156806msec)
    slat (usec): min=3, max=6476, avg=88.44, stdev=326.96
    clat (usec): min=218, max=89292, avg=11141.58, stdev=1871.14
     lat (usec): min=226, max=89311, avg=11230.16, stdev=1883.88
    clat percentiles (usec):
     |  1.00th=[ 3654],  5.00th=[ 8455], 10.00th=[ 9372], 20.00th=[10159],
     | 30.00th=[10552], 40.00th=[10814], 50.00th=[11076], 60.00th=[11338],
     | 70.00th=[11731], 80.00th=[12256], 90.00th=[13042], 95.00th=[13960],
     | 99.00th=[15795], 99.50th=[16581], 99.90th=[19268], 99.95th=[23200],
     | 99.99th=[36439]
   bw (  KiB/s): min=75904, max=257120, per=25.00%, avg=80178.59, stdev=9421.58, samples=1252
   iops        : min= 4744, max=16070, avg=5011.15, stdev=588.85, samples=1252
  write: IOPS=6702, BW=105MiB/s (110MB/s)(16.0GiB/156806msec); 0 zone resets
    slat (usec): min=4, max=5587, avg=88.52, stdev=325.86
    clat (usec): min=54, max=29847, avg=4491.18, stdev=1481.06
     lat (usec): min=63, max=29859, avg=4579.83, stdev=1508.50
    clat percentiles (usec):
     |  1.00th=[  947],  5.00th=[ 1975], 10.00th=[ 2737], 20.00th=[ 3458],
     | 30.00th=[ 3916], 40.00th=[ 4178], 50.00th=[ 4424], 60.00th=[ 4686],
     | 70.00th=[ 5014], 80.00th=[ 5473], 90.00th=[ 6259], 95.00th=[ 6980],
     | 99.00th=[ 8717], 99.50th=[ 9503], 99.90th=[10945], 99.95th=[11600],
     | 99.99th=[13698]
   bw (  KiB/s): min=23296, max=86432, per=25.00%, avg=26812.24, stdev=3375.69, samples=1252
   iops        : min= 1456, max= 5402, avg=1675.75, stdev=210.98, samples=1252
  lat (usec)   : 100=0.01%, 250=0.01%, 500=0.06%, 750=0.11%, 1000=0.10%
  lat (msec)   : 2=1.12%, 4=7.69%, 10=28.88%, 20=61.95%, 50=0.06%
  lat (msec)   : 100=0.01%
  cpu          : usr=1.56%, sys=7.85%, ctx=1905114, majf=0, minf=56
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued rwts: total=3143262,1051042,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
   READ: bw=313MiB/s (328MB/s), 313MiB/s-313MiB/s (328MB/s-328MB/s), io=47.0GiB (51.5GB), run=156806-156806msec
  WRITE: bw=105MiB/s (110MB/s), 105MiB/s-105MiB/s (110MB/s-110MB/s), io=16.0GiB (17.2GB), run=156806-156806msec

Disk stats (read/write):
    dm-4: ios=3455484/1154933, merge=0/0, ticks=35815316/4420412, in_queue=40257384, util=100.00%, aggrios=3456894/1155354, aggrmerge=0/0, aggrticks=35806896/4414972, aggrin_queue=40309192, aggrutil=99.99%
  dm-2: ios=3456894/1155354, merge=0/0, ticks=35806896/4414972, in_queue=40309192, util=99.99%, aggrios=1728447/577677, aggrmerge=0/0, aggrticks=17902352/2207092, aggrin_queue=20122108, aggrutil=100.00%
    dm-1: ios=3456894/1155354, merge=0/0, ticks=35804704/4414184, in_queue=40244216, util=100.00%, aggrios=3143273/1051086, aggrmerge=313621/104268, aggrticks=32277972/3937619, aggrin_queue=36289488, aggrutil=100.00%
  nvme0n1: ios=3143273/1051086, merge=313621/104268, ticks=32277972/3937619, in_queue=36289488, util=100.00%
    dm-0: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%

- Second full fio test, nvme device as part of a gluster volume

fio-test2: (groupid=0, jobs=4): err= 0: pid=5537: Sat Jan 4 23:30:28 2020
  read: IOPS=5525, BW=86.3MiB/s (90.5MB/s)(25.3GiB/300002msec)
    slat (nsec): min=1159, max=894687k, avg=9822.60, stdev=990825.87
    clat (usec): min=963, max=3141.5k, avg=37455.28, stdev=123109.88
     lat (usec): min=968, max=3141.5k, avg=37465.21, stdev=123121.94
    clat percentiles (msec):
     |  1.00th=[    7],  5.00th=[    8], 10.00th=[    8], 20.00th=[    9],
     | 30.00th=[    9], 40.00th=[    9], 50.00th=[   10], 60.00th=[   10],
     | 70.00th=[   11], 80.00th=[   12], 90.00th=[   48], 95.00th=[  180],
     | 99.00th=[  642], 99.50th=[  860], 99.90th=[ 1435], 99.95th=[ 1687],
     | 99.99th=[ 2022]
   bw (  KiB/s): min=   31, max=93248, per=26.30%, avg=23247.24, stdev=20716.86, samples=2280
   iops        : min=    1, max= 5828, avg=1452.92, stdev=1294.81, samples=2280
  write: IOPS=1850, BW=28.9MiB/s (30.3MB/s)(8676MiB/300002msec); 0 zone resets
    slat (usec): min=21, max=1586.3k, avg=2117.71, stdev=23082.86
    clat (usec): min=20, max=2614.0k, avg=23888.03, stdev=99651.34
     lat (usec): min=225, max=3141.2k, avg=26006.49, stdev=104758.57
    clat percentiles (usec):
     |  1.00th=[    889],  5.00th=[   2343], 10.00th=[   3654],
     | 20.00th=[   5276], 30.00th=[   5997], 40.00th=[   6456],
     | 50.00th=[   6849], 60.00th=[   7177], 70.00th=[   7504],
     | 80.00th=[   7963], 90.00th=[   8979], 95.00th=[  74974],
     | 99.00th=[ 513803], 99.50th=[ 717226], 99.90th=[1333789],
     | 99.95th=[1518339], 99.99th=[1803551]
   bw (  KiB/s): min=   31, max=30240, per=27.05%, avg=8009.39, stdev=6912.26, samples=2217
   iops        : min=    1, max= 1890, avg=500.56, stdev=432.02, samples=2217
  lat (usec)   : 50=0.03%, 100=0.02%, 250=0.01%, 500=0.06%, 750=0.08%
  lat (usec)   : 1000=0.11%
  lat (msec)   : 2=0.66%, 4=1.97%, 10=71.07%, 20=14.47%, 50=2.69%
  lat (msec)   : 100=2.23%, 250=3.21%, 500=1.94%, 750=0.82%, 1000=0.31%
  cpu          : usr=0.59%, sys=1.19%, ctx=1172180, majf=0, minf=56
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued rwts: total=1657579,555275,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
   READ: bw=86.3MiB/s (90.5MB/s), 86.3MiB/s-86.3MiB/s (90.5MB/s-90.5MB/s), io=25.3GiB (27.2GB), run=300002-300002msec
  WRITE: bw=28.9MiB/s (30.3MB/s), 28.9MiB/s-28.9MiB/s (30.3MB/s-30.3MB/s), io=8676MiB (9098MB), run=300002-300002msec