thr3ads.net - Gluster users - [Gluster-users] Exact purpose of network.ping-timeout [Dec 2017]

Omar Kohl

2017-Dec-27 11:17 UTC

[Gluster-users] Exact purpose of network.ping-timeout

Hi,

> If you set it to 10 seconds, and a node goes down, you'll see a 10-second
> freeze in all I/O for the volume.

Exactly! ONLY 10 seconds instead of the default 42 seconds :-)

As I said before the problem with the 42 seconds is that a Windows Samba Client
will disconnect (and therefore interrupt any read/write operation) after waiting
for about 25 seconds. So 42 seconds is too high. In this case it would therefore
make more sense to reduce the ping-timeout, right?
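
As a rough illustration of the reasoning above, the sketch below compares a candidate ping-timeout with a client-side timeout and builds the matching `gluster volume set` command line. The volume name `myvol`, the helper names, and the 25-second client figure (taken from this message, not from Samba documentation) are assumptions for illustration; the command is only constructed, never executed.

```python
from typing import List

GLUSTER_DEFAULT_PING_TIMEOUT_S = 42  # default value cited in this thread
SMB_CLIENT_TIMEOUT_S = 25            # approximate value reported in this thread


def outlasts_client(ping_timeout_s: int, client_timeout_s: int) -> bool:
    """Return True if the I/O freeze could last as long as the client waits."""
    return ping_timeout_s >= client_timeout_s


def build_set_command(volume: str, ping_timeout_s: int) -> List[str]:
    """Build (but do not run) the CLI call that sets network.ping-timeout."""
    if ping_timeout_s <= 0:
        raise ValueError("ping-timeout must be a positive number of seconds")
    return ["gluster", "volume", "set", volume,
            "network.ping-timeout", str(ping_timeout_s)]


if __name__ == "__main__":
    for candidate in (GLUSTER_DEFAULT_PING_TIMEOUT_S, 10):
        too_long = outlasts_client(candidate, SMB_CLIENT_TIMEOUT_S)
        verdict = "client may disconnect first" if too_long else "fits"
        print(f"ping-timeout={candidate}s, client timeout="
              f"{SMB_CLIENT_TIMEOUT_S}s: {verdict}")
    # "myvol" is a hypothetical volume name.
    print(" ".join(build_set_command("myvol", 10)))
```

The comparison says nothing about the cost of more frequent disconnects; it only captures the constraint that the freeze must end before the client gives up.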

Has anyone done any performance measurements on what the implications of a low
ping-timeout are? What are the costs of "triggering heals all the
time"?

On a related note I found the extras/hook-scripts/start/post/S29CTDBsetup.sh
script that mounts a CTDB (Samba) share and explicitly sets the ping-timeout to
10 seconds. There is a comment saying: "Make sure ping-timeout is not
default for CTDB volume". Unfortunately there is no explanation in the
script, in the commit or in the Gerrit review history
(https://review.gluster.org/#/c/7569/, https://review.gluster.org/#/c/8007/) for
WHY you make sure ping-timeout is not default. Can anyone tell me the reason?

Kind regards,
Omar

-----Original Message-----
From: gluster-users-bounces at gluster.org [mailto:gluster-users-bounces at
gluster.org] On behalf of lemonnierk at ulrar.net
Sent: Tuesday, 26 December 2017 22:05
To: gluster-users at gluster.org
Subject: Re: [Gluster-users] Exact purpose of network.ping-timeout

Hi,

It's just the delay for which a node can stop responding before being marked
as down.
Basically that's how long a node can go down before a heal becomes necessary
to bring it back.

If you set it to 10 seconds, and a node goes down, you'll see a 10-second
freeze in all I/O for the volume. That's why you don't want it too high
(having a 2-minute freeze on I/O, for example, would be pretty bad, depending on
what you host), but you don't want it too low either (to avoid triggering
heals all the time).

You can configure it because it depends on what you host. You might be okay with
a freeze of a few minutes to avoid a heal, or you might not care about heals at
all and prefer a very low value to avoid freezes.
The default value should work pretty well for most things, though.
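
The trade-off described above can be sketched as a tiny model. This is a simplified reading of the explanation in this message, not Gluster's actual implementation: the names `OutageOutcome` and `simulate_outage` are hypothetical, and the model assumes that writes happened while the node was away, so that a node marked down always needs a heal when it returns.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OutageOutcome:
    io_freeze_s: float   # how long clients see I/O stall
    marked_down: bool    # whether the node was declared down
    heal_needed: bool    # whether a heal is needed once it is back


def simulate_outage(outage_s: float, ping_timeout_s: float) -> OutageOutcome:
    """Model one node outage under a given ping-timeout (simplified)."""
    if outage_s < 0 or ping_timeout_s <= 0:
        raise ValueError("durations must be positive")
    if outage_s < ping_timeout_s:
        # The node answers again before the timeout expires:
        # I/O stalls for the length of the outage, then resumes.
        return OutageOutcome(outage_s, marked_down=False, heal_needed=False)
    # The timeout expires first: the node is marked down, I/O resumes
    # on the remaining nodes, and the node must be healed later.
    return OutageOutcome(ping_timeout_s, marked_down=True, heal_needed=True)


if __name__ == "__main__":
    for timeout in (10, 42):
        print(timeout, simulate_outage(outage_s=11, ping_timeout_s=timeout))
```

With an 11-second outage, the 10-second setting ends the freeze one second earlier at the price of a heal, while the 42-second default rides out the outage with no heal at all.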

On Tue, Dec 26, 2017 at 01:11:48PM +0000, Omar Kohl wrote:
> Hi,
>
> I have a question regarding the "ping-timeout" option. I have been
> researching its purpose for a few days and it is not completely clear to me,
> especially since the Gluster community apparently strongly discourages
> changing, or at least decreasing, this value!
>
> Assuming that I set ping-timeout to 10 seconds (instead of the default 42),
> this would mean that if I have a network outage of 11 seconds then Gluster
> internally would have to re-allocate some resources that it freed after the
> 10 seconds, correct? But apart from that there are no negative implications,
> are there? For instance, if I'm copying files during the network outage then
> those files will continue copying after those 11 seconds.
>
> This means that the only purpose of ping-timeout is to save those extra
> resources that are used by "short" network outages. Is that correct?
>
> If I am confident that my network will not have many 11-second outages, and
> if they do occur I am willing to incur those extra costs due to resource
> allocation, is there any reason not to set ping-timeout to 10 seconds?
>
> The problem I have with a long ping-timeout is that the Windows Samba Client
> disconnects after 25 seconds. So if one of the nodes of a Gluster cluster
> shuts down ungracefully then the Samba Client disconnects and the file that
> was being copied is incomplete on the server. These "costs" seem to be much
> higher than the potential costs of those Gluster resource re-allocations.
> But it is hard to estimate because there is no clear documentation of what
> exactly those Gluster costs are.
>
> In general I would be very interested in a comprehensive explanation of
> ping-timeout and the up- and downsides of setting high or low values for it.
>
> Kind regards,
> Omar
> _______________________________________________
> Gluster-users mailing list
> Gluster-users at gluster.org
> http://lists.gluster.org/mailman/listinfo/gluster-users

Sam McLeod

2017-Dec-28 01:57 UTC

[Gluster-users] Exact purpose of network.ping-timeout

10 seconds is a very long time for files to go away for applications used at any
scale; it is, however, what I've set our failover time to after being shocked
by the default of 42 seconds.

--
Sam McLeod
https://smcleod.net
https://twitter.com/s_mcleod

lemonnierk at ulrar.net

2017-Dec-28 09:03 UTC

[Gluster-users] Exact purpose of network.ping-timeout

Can't tell you, I only use gluster for VM disks.
The heal will hammer performance pretty badly, but that really depends on
what you do, so I'd say test it a bunch and use whatever works best.

I think they advise a high value to make sure you don't have two
nodes marked down in close succession, which could either cause a
split-brain or make your volume read-only for a while, depending on your
config and number of nodes.


lemonnierk at ulrar.net

2017-Dec-28 23:00 UTC

[Gluster-users] Exact purpose of network.ping-timeout

I/O is frozen, so you don't get errors, just a delay when accessing.
It's completely transparent, and for VM disks at least even 40 seconds is
fine: it's not long enough for a web server to time out, so the visitor just
thinks the site was slow for a minute.

It really hasn't been that bad here, but I guess it all depends on what
the files are.


Joe Julian

2017-Dec-29 00:08 UTC

[Gluster-users] Exact purpose of network.ping-timeout

The reason for the long (42-second) ping-timeout is that re-establishing
file descriptors (fds) and locks can be a very expensive operation. With an
average MTBF of 45000 hours for a server, even just a replica 2 would result in
a 42-second MTTR every 2.6 years, or 6 nines of uptime.
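
The arithmetic behind those figures can be reproduced with a short sketch. It assumes that the two servers of a replica 2 fail independently, so a failure occurs on average every MTBF / 2 hours, and that each failure costs exactly one ping-timeout (42 seconds) of frozen I/O. The function names are hypothetical and the model is a reading of the numbers in this message, not an official Gluster calculation.

```python
import math

HOURS_PER_YEAR = 24 * 365.25
SECONDS_PER_HOUR = 3600


def years_between_failures(mtbf_hours: float, servers: int) -> float:
    """Mean time between failures of any one of `servers` machines, in years."""
    return mtbf_hours / servers / HOURS_PER_YEAR


def unavailability(mtbf_hours: float, servers: int, outage_s: float) -> float:
    """Fraction of time lost if every failure costs `outage_s` seconds."""
    interval_s = mtbf_hours / servers * SECONDS_PER_HOUR
    return outage_s / (interval_s + outage_s)


def nines(unavailable_fraction: float) -> int:
    """Number of leading nines in the corresponding availability figure."""
    return math.floor(-math.log10(unavailable_fraction))


if __name__ == "__main__":
    years = years_between_failures(45000, servers=2)
    lost = unavailability(45000, servers=2, outage_s=42)
    print(f"one failure every {years:.1f} years")
    print(f"availability {1 - lost:.8f} ({nines(lost)} nines)")
```

The same functions can be called with a 10-second outage or a different server count to compare settings under the same assumptions.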


Sam McLeod

2017-Dec-29 04:19 UTC

[Gluster-users] Exact purpose of network.ping-timeout

Sure, if you never restart or autoscale anything and your use case isn't
bothered by up to 42 seconds of downtime. For us, 42 seconds is a really long
time for something like a patient management system to refuse file attachment
uploads, etc.

We apply a strict patching policy for security and kernel updates. We also often
load balance between underlying physical hosts, and if the virtual hosts have
lots of storage it can be quicker to let them shut down and start on another
host.

So for us, gone are the old Unix days of caring about uptime. A huge part of how
we measure success and risk reduction has become how quickly we can not just
deploy our software and web apps into production, but also how quickly and
effectively our platform can be reformed, patched and migrated.

So in reality, I'd probably do a rolling restart of our three-node Gluster
clusters every few weeks or so, depending on what patches have been released.

