Artem Russakovskii
2020-Jul-24 23:05 UTC
[Gluster-users] Gluster linear scale-out performance
Speaking of fio, could the gluster team please help me understand something?

We've been having lots of performance issues related to gluster using attached block storage on Linode. At some point, I figured out that Linode has a cap of 500 IOPS on their block storage <https://www.linode.com/community/questions/19437/does-a-dedicated-cpu-or-high-memory-plan-improve-disk-io-performance#answer-72142> (with spikes to 1500 IOPS). The block storage we use is formatted xfs with a 4KB bsize (block size).

I then ran a bunch of fio tests on the block storage itself (not the gluster fuse mount), which performed horribly when the bs parameter was set to 4k:

fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=test --bs=4k --iodepth=64 --size=4G --readwrite=randwrite --ramp_time=4

During these tests, the fio ETA crawled to over an hour, at some point dropped to 45min, and I did see 500-1500 IOPS flash by briefly, then it went back down to 0. I/O seems majorly choked for some reason, likely because gluster is using some of it. Transfer speed with such a 4k block size is 2 MB/s with spikes to 6 MB/s. This causes the load on the server to spike up to 100+ and brings down all our servers.

Jobs: 1 (f=1): [w(1)][20.3%][r=0KiB/s,w=5908KiB/s][r=0,w=1477 IOPS][eta 43m:00s]
Jobs: 1 (f=1): [w(1)][21.5%][r=0KiB/s,w=0KiB/s][r=0,w=0 IOPS][eta 44m:54s]

xfs_info /mnt/citadel_block1
meta-data=/dev/sdc               isize=512    agcount=103, agsize=26214400 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=0, rmapbt=0
         =                       reflink=0
data     =                       bsize=4096   blocks=2684354560, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=51200, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

When I increase the --bs param to fio from 4k to, say, 64k, transfer speed goes up significantly and is more like 50 MB/s, and at 256k, it's 200 MB/s.

So what I'm trying to understand is:

1. How does the xfs block size (4KB) relate to the block size in fio tests? If we're limited by IOPS, and the xfs block size is 4KB, how can fio produce better results with a varying --bs param? (Some back-of-envelope math in the P.S. below.)

2. Would increasing the xfs data block size to something like 64-256KB help with our issue of choking IO and skyrocketing load?

3. The worst hangs and load spikes happen when we reboot one of the gluster servers - not when it goes down, but when it comes back online. Even with gluster not showing anything pending heal, my guess is it's still trying to do lots of IO between the 4 nodes for some reason, but I don't understand why.

I've been banging my head on the wall with this problem for months. Appreciate any feedback here.

Thank you.
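P.S. The numbers above roughly track a simple "throughput = IOPS x request size" estimate. Back-of-envelope math, under the assumption that the IOPS cap applies per request regardless of its size:

  500 IOPS x 4 KiB   = ~2 MB/s    (observed: 2 MB/s)
  500 IOPS x 64 KiB  = ~33 MB/s   (observed: ~50 MB/s)
  500 IOPS x 256 KiB = ~131 MB/s  (observed: ~200 MB/s)

The higher observed figures would fit within the 1500 IOPS spikes. If that's right, the xfs bsize only sets the on-disk allocation unit, while fio's --bs sets the size of each submitted I/O request, which would be why a larger --bs moves more data within the same IOPS budget.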
gluster volume info below:

Volume Name: SNIP_data1
Type: Replicate
Volume ID: SNIP
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 4 = 4
Transport-type: tcp
Bricks:
Brick1: nexus2:/mnt/SNIP_block1/SNIP_data1
Brick2: forge:/mnt/SNIP_block1/SNIP_data1
Brick3: hive:/mnt/SNIP_block1/SNIP_data1
Brick4: citadel:/mnt/SNIP_block1/SNIP_data1
Options Reconfigured:
cluster.quorum-count: 1
cluster.quorum-type: fixed
network.ping-timeout: 5
network.remote-dio: enable
performance.rda-cache-limit: 256MB
performance.readdir-ahead: on
performance.parallel-readdir: on
network.inode-lru-limit: 500000
performance.md-cache-timeout: 600
performance.cache-invalidation: on
performance.stat-prefetch: on
features.cache-invalidation-timeout: 600
features.cache-invalidation: on
cluster.readdir-optimize: on
performance.io-thread-count: 32
server.event-threads: 4
client.event-threads: 4
performance.read-ahead: off
cluster.lookup-optimize: on
performance.cache-size: 1GB
cluster.self-heal-daemon: enable
transport.address-family: inet
nfs.disable: on
performance.client-io-threads: on
cluster.granular-entry-heal: enable
cluster.data-self-heal-algorithm: full

Sincerely,
Artem

--
Founder, Android Police <http://www.androidpolice.com>, APK Mirror <http://www.apkmirror.com/>, Illogical Robot LLC
beerpla.net | @ArtemR <http://twitter.com/ArtemR>

On Thu, Jul 23, 2020 at 12:08 AM Qing Wang <qw at g.clemson.edu> wrote:

> Hi,
>
> I have one more question about the Gluster linear scale-out performance,
> regarding the "write-behind off" case specifically -- when "write-behind"
> is off, with the stripe volumes and other settings as posted earlier in
> the thread, the storage I/O seems not to relate to the number of storage
> nodes. In my experiment, no matter whether I have 2 brick server nodes or
> 8 brick server nodes, the aggregated gluster I/O performance is
> ~100 MB/sec, and fio benchmark measurements give the same result. If
> "write-behind" is on, then the storage performance does scale out
> linearly as the number of brick server nodes increases.
>
> No matter whether the write-behind option is on or off, I thought the
> gluster I/O performance should be pooled and aggregated together as a
> whole. If that is the case, why do I get a constant gluster performance
> (~100 MB/sec) when "write-behind" is off? Please advise me if I
> misunderstood something.
>
> Thanks,
> Qing
>
> On Tue, Jul 21, 2020 at 7:29 PM Qing Wang <qw at g.clemson.edu> wrote:
>
>> fio gives me the correct linear scale-out results, and you're right, the
>> storage cache is the root cause that makes the dd measurement results
>> not accurate at all.
>>
>> Thanks,
>> Qing
>>
>> On Tue, Jul 21, 2020 at 2:53 PM Yaniv Kaul <ykaul at redhat.com> wrote:
>>
>>> On Tue, 21 Jul 2020, 21:43 Qing Wang <qw at g.clemson.edu> wrote:
>>>
>>>> Hi Yaniv,
>>>>
>>>> Thanks for the quick response. I forgot to mention that I am testing
>>>> write performance, not read. In this case, would the client cache hit
>>>> rate still be a big issue?
>>>
>>> It's not hitting the storage directly. Since it's also single-threaded,
>>> it may also not saturate it. I highly recommend testing properly.
>>> Y.
>>>
>>>> I'll use fio to run my test once again, thanks for the suggestion.
>>>>
>>>> Thanks,
>>>> Qing
>>>>
>>>> On Tue, Jul 21, 2020 at 2:38 PM Yaniv Kaul <ykaul at redhat.com> wrote:
>>>>
>>>>> On Tue, 21 Jul 2020, 21:30 Qing Wang <qw at g.clemson.edu> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I am trying to test Gluster linear scale-out performance by adding
>>>>>> more storage servers/bricks and measuring the storage I/O
>>>>>> performance. To vary the number of storage servers, I create several
>>>>>> "stripe" volumes that contain 2 brick servers, 3 brick servers, 4
>>>>>> brick servers, and so on. On the gluster client side, I used "dd
>>>>>> if=/dev/zero of=/mnt/glusterfs/dns_test_data_26g bs=1M count=26000"
>>>>>> to create 26G of data (or a larger size); that data gets distributed
>>>>>> to the corresponding gluster servers (each has a gluster brick on
>>>>>> it), and "dd" returns the final I/O throughput. The interconnect is
>>>>>> 40G InfiniBand, although I didn't do any specific configuration to
>>>>>> use advanced features.
>>>>>
>>>>> Your dd command is inaccurate, as it'll hit the client cache. It is
>>>>> also single-threaded. I suggest switching to fio.
>>>>> Y.
>>>>>
>>>>>> What confuses me is that the storage I/O seems not to relate to the
>>>>>> number of storage nodes, but the Gluster documentation says it
>>>>>> should scale linearly. For example, when "write-behind" is on and
>>>>>> the InfiniBand "jumbo frame" (connected mode) is on, I can get ~800
>>>>>> MB/sec reported by "dd" no matter whether I have 2 brick servers or
>>>>>> 8 brick servers -- in the 2-server case, each server does ~400
>>>>>> MB/sec; in the 4-server case, each server does ~200 MB/sec. That is,
>>>>>> each server's I/O does aggregate into the final storage I/O (800
>>>>>> MB/sec), but this is not "linear scale-out".
>>>>>>
>>>>>> Can somebody help me understand why this is the case? I certainly
>>>>>> may have some misunderstanding/misconfiguration here. Please correct
>>>>>> me if I do, thanks!
>>>>>>
>>>>>> Best,
>>>>>> Qing
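(For what it's worth, a fio invocation along these lines sidesteps both of the dd pitfalls Yaniv points out - a rough sketch, with the mount path, sizes, and job count as placeholder assumptions:

  fio --name=writetest --directory=/mnt/glusterfs --ioengine=libaio --direct=1 --rw=write --bs=1M --size=4G --numjobs=4 --iodepth=16 --group_reporting

--direct=1 bypasses the client page cache that inflates dd numbers, and numjobs=4 issues parallel writers so a single thread isn't the bottleneck.)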
On 25/07/20 4:35 am, Artem Russakovskii wrote:

> [...]
>
> 1. How does the xfs block size (4KB) relate to the block size in fio
>    tests? If we're limited by IOPS, and the xfs block size is 4KB, how
>    can fio produce better results with a varying --bs param?
>
> 2. Would increasing the xfs data block size to something like 64-256KB
>    help with our issue of choking IO and skyrocketing load?

I have experienced similar behavior when running fio tests with bs=4k on a gluster volume backed by XFS under high load (numjobs=32). When I looked at the strace of the brick processes (strace -f -T -p $PID), I saw fsync system calls taking around 2500 seconds, which is insane. I'm not sure if this is specific to the way fio does its I/O pattern and the way XFS handles it. When I used 64k block sizes, the fio tests completed just fine.

> 3. The worst hangs and load spikes happen when we reboot one of the
>    gluster servers - not when it goes down, but when it comes back
>    online. Even with gluster not showing anything pending heal, my guess
>    is it's still trying to do lots of IO between the 4 nodes for some
>    reason, but I don't understand why.

Do you kill all gluster processes (not just glusterd but also the brick processes) before issuing the reboot? This is necessary to prevent I/O stalls. There is a stop-all-gluster-processes.sh script which should be available as part of the gluster installation (maybe in /usr/share/glusterfs/scripts/) which you can use. Can you check if this helps?
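Something like this before the reboot - a minimal sketch; the script's location varies by distro and package version, so adjust the path as needed:

  # on the node that is about to be rebooted
  /usr/share/glusterfs/scripts/stop-all-gluster-processes.sh
  reboot

The point is that the bricks go down cleanly before the node does, so clients fail over immediately instead of stalling on in-flight I/O until network.ping-timeout expires.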
Regards,
Ravi
Artem Russakovskii
2020-Aug-03 17:54 UTC
[Gluster-users] Gluster linear scale-out performance
> Do you kill all gluster processes (not just glusterd but also the brick
> processes) before issuing the reboot? This is necessary to prevent I/O
> stalls. There is a stop-all-gluster-processes.sh script which should be
> available as part of the gluster installation (maybe in
> /usr/share/glusterfs/scripts/) which you can use. Can you check if this
> helps?

A reboot shuts down gracefully, so those processes are shut down before the reboot begins.

We've moved on to discussing this matter in the gluster Slack; there's a lot more info there now about the above. The gist is that heavy xfs fragmentation when the bricks are almost full (95-96%) made healing, as well as disk accesses, a lot more expensive, slow, and prone to hanging.

What's still not clear is why a slowdown of one brick/gluster instance similarly affects all bricks/gluster instances on the other servers, and how that can be optimized/mitigated.
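For anyone else chasing this, the fragmentation level can be checked with xfs_db, and xfs_fsr can defragment in place - a sketch, using one of our brick devices as an assumed example target:

  # report the fragmentation factor (opens the device read-only)
  xfs_db -r -c frag /dev/sdc

  # defragment files on the mounted filesystem, running for at most 2 hours
  xfs_fsr -t 7200 /mnt/citadel_block1

Whether xfs_fsr can do much on a 95-96% full brick is questionable, though, since it needs contiguous free extents to copy files into.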
Sincerely,
Artem

--
Founder, Android Police <http://www.androidpolice.com>, APK Mirror <http://www.apkmirror.com/>, Illogical Robot LLC
beerpla.net | @ArtemR <http://twitter.com/ArtemR>

On Thu, Jul 30, 2020 at 8:21 PM Ravishankar N <ravishankar at redhat.com> wrote:

> [...]