thr3ads.net - Gluster users - [Gluster-users] Brick Reboot => VMs slowdown, client crashes [Aug 2019]


Carl Sirotic

2019-Aug-23 22:06 UTC

[Gluster-users] Brick Reboot => VMs slowdown, client crashes

Okay,

so it means, at least I am not getting the expected behavior and there 
is hope.

I applied the quorum settings I was given a couple of emails ago.

After applying the virt group, they are:

cluster.quorum-type auto
cluster.quorum-count (null)
cluster.server-quorum-type server
cluster.server-quorum-ratio 0
cluster.quorum-reads no

Also,

I have also just set the ping timeout to 5 seconds.


Carl

On 2019-08-23 5:45 p.m., Ingo Fischer wrote:
> Hi Carl,
>
> In my understanding and experience (I have a replica 3 system running 
> too) this should not happen. Can you tell me your client and server 
> quorum settings?
>
> Ingo
>
> On 23.08.2019 at 15:53, Carl Sirotic 
> <csirotic at evoqarchitecture.com> wrote:
>
>> However,
>>
>> I must have misunderstood the whole concept of gluster.
>>
>> In a replica 3, for me, it's completely unacceptable, regardless of
>> the options, that all my VMs go down when I reboot one node.
>>
>> The whole purpose of having 3 full copies of my data on the fly is 
>> supposed to be this.
>>
>> I am in the process of sharding every file.
>>
>> But even if the healing time would be longer, I would still expect a 
>> non-sharded replica 3 brick with a VM boot disk not to go down if I 
>> reboot one of its copies.
>>
>>
>> I am not very impressed by gluster so far.
>>
>> Carl
>>
>> On 2019-08-19 4:15 p.m., Darrell Budic wrote:
>>> /var/lib/glusterd/groups/virt is a good start for ideas, notably 
>>> some thread settings and choose-local=off to improve read 
>>> performance. If you don't have at least 10 cores on your servers, 
>>> you may want to lower the recommended shd-max-threads=8 to no more 
>>> than half your CPU cores to keep healing from swamping out regular 
>>> work.
>>>
>>> It's also starting to depend on what your backing store and 
>>> networking setup are, so you're going to want to test changes and 
>>> find what works best for your setup.
>>>
>>> In addition to the virt group settings, I use these on most of my 
>>> volumes, SSD or HDD backed, with the default 64M shard size:
>>>
>>> performance.io-thread-count: 32           # seemed good for my system, particularly a ZFS backed volume with lots of spindles
>>> client.event-threads: 8
>>> cluster.data-self-heal-algorithm: full    # 10G networking, uses more net/less cpu to heal. probably don't use this for 1G networking?
>>> performance.stat-prefetch: on
>>> cluster.read-hash-mode: 3                 # distribute reads to least loaded server (by read queue depth)
>>>
>>> and these two only on my HDD backed volume:
>>>
>>> performance.cache-size: 1G
>>> performance.write-behind-window-size: 64MB
>>>
>>> but I suspect these two need another round or six of tuning to tell
>>> if they are making a difference.
>>>
>>> I use the throughput-performance tuned profile on my servers, so you
>>> should be in good shape there.
>>>
>>>> On Aug 19, 2019, at 12:22 PM, Guy Boisvert 
>>>> <guy.boisvert at ingtegration.com> wrote:
>>>>
>>>> On 2019-08-19 12:08 p.m., Darrell Budic wrote:
>>>>> You also need to make sure your volume is set up properly for best
>>>>> performance. Did you apply the gluster virt group to your volumes,
>>>>> or at least features.shard = on on your VM volume?
>>>>
>>>> That's what we did here:
>>>>
>>>>
>>>> gluster volume set W2K16_Rhenium cluster.quorum-type auto
>>>> gluster volume set W2K16_Rhenium network.ping-timeout 10
>>>> gluster volume set W2K16_Rhenium auth.allow \*
>>>> gluster volume set W2K16_Rhenium group virt
>>>> gluster volume set W2K16_Rhenium storage.owner-uid 36
>>>> gluster volume set W2K16_Rhenium storage.owner-gid 36
>>>> gluster volume set W2K16_Rhenium features.shard on
>>>> gluster volume set W2K16_Rhenium features.shard-block-size 256MB
>>>> gluster volume set W2K16_Rhenium cluster.data-self-heal-algorithm full
>>>> gluster volume set W2K16_Rhenium performance.low-prio-threads 32
>>>>
>>>> tuned-adm profile random-io        (a profile I added in CentOS 7)
>>>>
>>>>
>>>> cat /usr/lib/tuned/random-io/tuned.conf
>>>> ==========================================
>>>> [main]
>>>> summary=Optimize for Gluster virtual machine storage
>>>> include=throughput-performance
>>>>
>>>> [sysctl]
>>>>
>>>> vm.dirty_ratio = 5
>>>> vm.dirty_background_ratio = 2
>>>>
>>>>
>>>> Any more optimization to add to this?
>>>>
>>>>
>>>> Guy
>>>>
>>>> -- 
>>>> Guy Boisvert, ing.
>>>> IngTegration inc.
>>>> http://www.ingtegration.com
>>>> https://www.linkedin.com/in/guy-boisvert-8990487
>>>>
>>>> CONFIDENTIALITY NOTICE : Proprietary/Confidential Information
>>>> belonging to IngTegration Inc. and its affiliates may be
>>>> contained in this message. If you are not a recipient
>>>> indicated or intended in this message (or responsible for
>>>> delivery of this message to such person), or you think for
>>>> any reason that this message may have been addressed to you
>>>> in error, you may not use or copy or deliver this message to
>>>> anyone else. In such case, you should destroy this message
>>>> and are asked to notify the sender by reply email.
>>>>
>>>
>> _______________________________________________
>> Gluster-users mailing list
>> Gluster-users at gluster.org
>> https://lists.gluster.org/mailman/listinfo/gluster-users

Darrell Budic

2019-Aug-29 19:58 UTC

[Gluster-users] Brick Reboot => VMs slowdown, client crashes

You may be misunderstanding how gluster works in detail here, but you've got
the right idea overall. Since gluster is maintaining 3 copies of your data, you
can lose a drive or a whole system and things will keep going without
interruption (well, mostly: if a host node was using the system that just died,
it may pause briefly before reconnecting to one that is still running, via a
backup-server setting or your DNS configs). While the system keeps going with
one node down, that node is falling behind on new disk writes, and the
remaining ones are keeping track of what's changing.

Once you repair/recover/reboot the down node, it will rejoin the cluster. Now
the recovered system has to catch up, and it does this by having the other two
nodes send it the changes. In the meantime, gluster serves any reads for that
data from one of the up-to-date nodes, even if you ask the one you just
restarted. In order to do this healing, it has to lock the files to ensure no
changes are made while it copies a chunk of them over to the recovered node.

When it locks them, your hypervisor notices they have gone read-only and,
especially if it has a pending write for that file, may pause the VM because
this looks like a storage issue to it. Once the file gets unlocked, it can be
written again, and your hypervisor notices and will generally reactivate your
VM. You may see delays too while everything is getting copied around,
especially if you only have 1G networking between your host nodes. And your
files could be locked, updated, unlocked, then locked again a few seconds or
minutes later, etc.
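
For anyone following along, the catch-up phase described above can be watched
from any server while the recovered node heals. This is only a minimal sketch:
the volume name W2K16_Rhenium is borrowed from earlier in the thread, and the
run wrapper and DRY_RUN variable are hypothetical helpers that just print the
commands unless DRY_RUN=0 is set on a host with the gluster CLI installed.

```shell
#!/bin/sh
# Assumption: volume name taken from earlier in this thread; replace with yours.
VOLNAME="${VOLNAME:-W2K16_Rhenium}"
# Assumption: DRY_RUN=1 (default) only prints commands; set DRY_RUN=0 to execute.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: print the command in dry-run mode, otherwise run it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

heal_status() {
    # List the entries each brick still has pending heal.
    run gluster volume heal "$VOLNAME" info
    # Show only the number of pending entries per brick.
    run gluster volume heal "$VOLNAME" statistics heal-count
}

heal_status
```

When the pending counts drop to zero on every brick, the recovered node has
caught up and the lock/unlock cycles described above should stop.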

That's where sharding comes into play. Once you have a file broken up into
shards, gluster can get away with locking only the particular shard it needs to
heal, leaving the rest of the disk image unlocked. You may still catch a brief
pause if you try to write the specific segment of the file gluster is healing
at the moment, but the heal is also going to be much faster because it's a small
chunk of the file and copies quickly.
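
To put rough numbers on that, here is a back-of-the-envelope sketch of how long
one copy takes on the wire. It assumes an ideal link with no protocol or disk
overhead, and the 51200 MB (about 50 GB) disk image is a hypothetical size
picked only for comparison; copy_ms is an illustrative helper, not a gluster
tool.

```shell
#!/bin/sh
# copy_ms SIZE_MB LINK_MBIT
# Milliseconds needed to move SIZE_MB megabytes over a link of LINK_MBIT
# megabits per second. Integer arithmetic, so results are rounded down.
# Assumption: ideal link, no protocol, disk, or locking overhead.
copy_ms() {
    echo $(( $1 * 8 * 1000 / $2 ))
}

echo "64 MB shard over 1G:     $(copy_ms 64 1000) ms"
echo "64 MB shard over 10G:    $(copy_ms 64 10000) ms"
echo "51200 MB image over 1G:  $(copy_ms 51200 1000) ms"
```

Under these assumptions a default 64 MB shard moves in about half a second on
1G networking, while an unsharded image of that size needs several minutes,
which is the window in which a VM write can run into the heal.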

Also, check out
https://staged-gluster-docs.readthedocs.io/en/release3.7.0beta1/Features/server-quorum/
You probably want to set cluster.server-quorum-ratio to 50 for a replica-3 setup
to avoid the possibility of split-brains. Your cluster will go read-only if it
loses two nodes though, but you can always change the server-quorum-ratio later
if you need to keep it running temporarily.
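
As a worked example of what a ratio of 50 means for three servers, the sketch
below counts how many servers have to stay up. The rounding rule it uses
(quorum holds when at least ceil(total * ratio / 100) servers are active) is an
assumption made for illustration, so check the server-quorum documentation for
your gluster version; servers_needed and quorum_met are hypothetical helper
names.

```shell
#!/bin/sh
# servers_needed TOTAL RATIO
# Minimum number of active servers for quorum.
# Assumption: quorum needs at least ceil(TOTAL * RATIO / 100) servers.
servers_needed() {
    echo $(( ($1 * $2 + 99) / 100 ))
}

# quorum_met ACTIVE TOTAL RATIO
# Succeeds (exit status 0) when ACTIVE servers are enough for quorum.
quorum_met() {
    [ "$1" -ge "$(servers_needed "$2" "$3")" ]
}

for active in 3 2 1; do
    if quorum_met "$active" 3 50; then
        state="quorum met"
    else
        state="quorum lost"
    fi
    echo "3 servers, ratio 50, $active up: $state"
done

# The ratio is a pool-wide option; per the Gluster administration guide it is
# set with:  gluster volume set all cluster.server-quorum-ratio 50
```

Under that assumption, one server can be rebooted without losing quorum, and
quorum is lost only when two of the three are down at the same time.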

Hope that makes sense of what's going on for you,

  -Darrell
