<BUMP> so I can get some sort of resolution on the issue (i.e. is it hardware, Gluster, etc.)

I guess what I really need to know is:

1) Node 2 complains that it can't reach node 1 and node 3. If this was an OS/hardware networking issue and not internal to Gluster, then why didn't node 1 and node 3 have error messages complaining about not reaching node 2?

2) How significant is it that the node was running 6.5 while node 1 and node 3 were running 6.4?

-wk

-------- Forwarded Message --------
Subject: VM freeze issue on simple gluster setup.
Date: Thu, 5 Dec 2019 16:23:35 -0800
From: WK <wkmail at bneit.com>
To: Gluster Users <gluster-users at gluster.org>

I have a replica 2 + arbiter setup that is used for VMs.

ip #.1 is the arb; ip #.2 and #.3 are the kvm hosts.

Two volumes are involved, and it's gluster 6.5 / Ubuntu 18.04 / fuse. The Gluster networking uses a two-ethernet-card teamd/round-robin setup which *should* have stayed up if one of the ports had failed.

I just had a number of VMs go read-only due to the communication failure below at 22:00, but only on kvm host #2. VMs on the same gluster volumes on kvm host #3 were unaffected.

The logs on host #2 show the following:

[2019-12-05 22:00:43.739804] C [rpc-clnt-ping.c:155:rpc_clnt_ping_timer_expired] 0-GL1image-client-2: server 10.255.1.1:49153 has not responded in the last 21 seconds, disconnecting.
[2019-12-05 22:00:43.757095] C [rpc-clnt-ping.c:155:rpc_clnt_ping_timer_expired] 0-GL1image-client-1: server 10.255.1.3:49152 has not responded in the last 21 seconds, disconnecting.
[2019-12-05 22:00:43.757191] I [MSGID: 114018] [client.c:2323:client_rpc_notify] 0-GL1image-client-2: disconnected from GL1image-client-2. Client process will keep trying to connect to glusterd until brick's port is available
[2019-12-05 22:00:43.757246] I [MSGID: 114018] [client.c:2323:client_rpc_notify] 0-GL1image-client-1: disconnected from GL1image-client-1. Client process will keep trying to connect to glusterd until brick's port is available
[2019-12-05 22:00:43.757266] W [MSGID: 108001] [afr-common.c:5608:afr_notify] 0-GL1image-replicate-0: Client-quorum is not met
[2019-12-05 22:00:43.790639] E [rpc-clnt.c:346:saved_frames_unwind] (--> /usr/lib/x86_64-linux-gnu/libglusterfs.so.0(_gf_log_callingfn+0x139)[0x7f030d045f59] (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(+0xcbb0)[0x7f030cdf0bb0] (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(+0xccce)[0x7f030cdf0cce] (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x95)[0x7f030cdf1c45] (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(+0xe890)[0x7f030cdf2890] ))))) 0-GL1image-client-2: forced unwinding frame type(GlusterFS 4.x v1) op(FXATTROP(34)) called at 2019-12-05 22:00:19.736456 (xid=0x825bffb)
[2019-12-05 22:00:43.790655] W [MSGID: 114031] [client-rpc-fops_v2.c:1614:client4_0_fxattrop_cbk] 0-GL1image-client-2: remote operation failed
[2019-12-05 22:00:43.790686] E [rpc-clnt.c:346:saved_frames_unwind] (--> /usr/lib/x86_64-linux-gnu/libglusterfs.so.0(_gf_log_callingfn+0x139)[0x7f030d045f59] (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(+0xcbb0)[0x7f030cdf0bb0] (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(+0xccce)[0x7f030cdf0cce] (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x95)[0x7f030cdf1c45] (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(+0xe890)[0x7f030cdf2890] ))))) 0-GL1image-client-1: forced unwinding frame type(GlusterFS 4.x v1) op(FXATTROP(34)) called at 2019-12-05 22:00:19.736428 (xid=0x89fee01)
[2019-12-05 22:00:43.790703] W [MSGID: 114031] [client-rpc-fops_v2.c:1614:client4_0_fxattrop_cbk] 0-GL1image-client-1: remote operation failed
[2019-12-05 22:00:43.790774] E [MSGID: 114031] [client-rpc-fops_v2.c:1393:client4_0_finodelk_cbk] 0-GL1image-client-1: remote operation failed [Transport endpoint is not connected]
[2019-12-05 22:00:43.790777] E [rpc-clnt.c:346:saved_frames_unwind] (--> /usr/lib/x86_64-linux-gnu/libglusterfs.so.0(_gf_log_callingfn+0x139)[0x7f030d045f59] (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(+0xcbb0)[0x7f030cdf0bb0] (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(+0xccce)[0x7f030cdf0cce] (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x95)[0x7f030cdf1c45] (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(+0xe890)[0x7f030cdf2890] ))))) 0-GL1image-client-2: forced unwinding frame type(GlusterFS 4.x v1) op(FXATTROP(34)) called at 2019-12-05 22:00:19.736542 (xid=0x825bffc)
[2019-12-05 22:00:43.790794] W [MSGID: 114029] [client-rpc-fops_v2.c:4873:client4_0_finodelk] 0-GL1image-client-1: failed to send the fop
[2019-12-05 22:00:43.790806] W [MSGID: 114031] [client-rpc-fops_v2.c:1614:client4_0_fxattrop_cbk] 0-GL1image-client-2: remote operation failed
[2019-12-05 22:00:43.790825] E [MSGID: 114031] [client-rpc-fops_v2.c:1393:client4_0_finodelk_cbk] 0-GL1image-client-2: remote operation failed [Transport endpoint is not connected]
[2019-12-05 22:00:43.790842] W [MSGID: 114029] [client-rpc-fops_v2.c:4873:client4_0_finodelk] 0-GL1image-client-2: failed to send the fop

The fop / "transport not connected" errors just repeat for another 50 lines or so until 22:00:46, at which point the volumes appear to be fine (though the VMs were still read-only until I rebooted them).
[2019-12-05 22:00:46.987242] W [fuse-bridge.c:2827:fuse_readv_cbk] 0-glusterfs-fuse: 91701328: READ => -1 gfid=d883b7c4-97f5-4f12-9373-7987cfc7dee4 fd=0x7f02f005b708 (Transport endpoint is not connected)
[2019-12-05 22:00:47.029947] W [fuse-bridge.c:2827:fuse_readv_cbk] 0-glusterfs-fuse: 91701329: READ => -1 gfid=d883b7c4-97f5-4f12-9373-7987cfc7dee4 fd=0x7f02f005b708 (Transport endpoint is not connected)
[2019-12-05 22:00:49.901075] W [fuse-bridge.c:2827:fuse_readv_cbk] 0-glusterfs-fuse: 91701330: READ => -1 gfid=c342dba6-a2a2-49a8-be3f-cd320e90c956 fd=0x7f02f002bee8 (Transport endpoint is not connected)
[2019-12-05 22:00:49.923525] W [fuse-bridge.c:2827:fuse_readv_cbk] 0-glusterfs-fuse: 91701331: READ => -1 gfid=c342dba6-a2a2-49a8-be3f-cd320e90c956 fd=0x7f02f002bee8 (Transport endpoint is not connected)
[2019-12-05 22:00:49.970219] W [fuse-bridge.c:2827:fuse_readv_cbk] 0-glusterfs-fuse: 91701332: READ => -1 gfid=fcec6b7a-ad23-4449-aa09-107e113877a1 fd=0x7f02f008dd58 (Transport endpoint is not connected)
[2019-12-05 22:00:50.023932] W [fuse-bridge.c:2827:fuse_readv_cbk] 0-glusterfs-fuse: 91701333: READ => -1 gfid=fcec6b7a-ad23-4449-aa09-107e113877a1 fd=0x7f02f008dd58 (Transport endpoint is not connected)
[2019-12-05 22:00:54.807833] I [rpc-clnt.c:2028:rpc_clnt_reconfig] 0-GL1image-client-2: changing port to 49153 (from 0)
[2019-12-05 22:00:54.808043] I [rpc-clnt.c:2028:rpc_clnt_reconfig] 0-GL1image-client-1: changing port to 49152 (from 0)
[2019-12-05 22:00:46.115076] E [MSGID: 133014] [shard.c:1799:shard_common_stat_cbk] 0-GL1image-shard: stat failed: 7a5959d6-75fc-411d-8831-57a744776ed3 [Transport endpoint is not connected]
[2019-12-05 22:00:54.820394] I [MSGID: 114046] [client-handshake.c:1106:client_setvolume_cbk] 0-GL1image-client-1: Connected to GL1image-client-1, attached to remote volume '/GLUSTER/GL1image'.
[2019-12-05 22:00:54.820447] I [MSGID: 114042] [client-handshake.c:930:client_post_handshake] 0-GL1image-client-1: 10 fds open - Delaying child_up until they are re-opened
[2019-12-05 22:00:54.820549] I [MSGID: 114046] [client-handshake.c:1106:client_setvolume_cbk] 0-GL1image-client-2: Connected to GL1image-client-2, attached to remote volume '/GLUSTER/GL1image'.
[2019-12-05 22:00:54.820568] I [MSGID: 114042] [client-handshake.c:930:client_post_handshake] 0-GL1image-client-2: 10 fds open - Delaying child_up until they are re-opened
[2019-12-05 22:00:54.821381] I [MSGID: 114041] [client-handshake.c:318:client_child_up_reopen_done] 0-GL1image-client-1: last fd open'd/lock-self-heal'd - notifying CHILD-UP
[2019-12-05 22:00:54.821406] I [MSGID: 108002] [afr-common.c:5602:afr_notify] 0-GL1image-replicate-0: Client-quorum is met
[2019-12-05 22:00:54.821446] I [MSGID: 114041] [client-handshake.c:318:client_child_up_reopen_done] 0-GL1image-client-2: last fd open'd/lock-self-heal'd - notifying CHILD-UP

What is odd is that the gluster logs on nodes #3 and #1 show absolutely ZERO gluster errors around that time, nor do I see any network/teamd errors on any of the 3 nodes (including the problem node #2). I've checked dmesg/syslog and every other log file on the box.

According to a staff member, this same kvm host had the same problem about 3 weeks ago. It was written up as a fluke, possibly due to excess disk I/O, since we have been using gluster for years and have rarely seen issues, especially with very basic gluster usage. In this case those VMs weren't overly busy, and now we have a repeat problem.
So I am wondering where else I can look to diagnose the problem, or should I abandon the hardware/setup?

I assume it's a networking issue and not in gluster, but I am confused why gluster nodes #1 and #3 didn't complain about not seeing #2. If the networking did drop out, shouldn't they have noticed?

There also don't appear to be any visible hard disk issues (smartd is running).

Side note: I have reset the tcp-timeout back to 42 seconds and will look at upgrading to 6.6. I also see that the arb and the unaffected gluster node were running Gluster 6.4 (I don't know why #2 is on 6.5, but I am checking on that as well; we turn off auto-upgrade). Maybe the mismatched versions are the culprit?

Also, we have a large number of these replica 2+1 gluster setups running gluster versions from 5.x up, and none of the others have had this issue.

Any advice would be appreciated.

Sincerely,

Wk
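P.S. For reference, this is roughly how I am double-checking the timeout, the versions, and the teamd state on each node. I'm assuming the 21-second timer in the log corresponds to network.ping-timeout, and the team device name (team0) is just a guess for this write-up; substitute whatever the host actually uses:

# gluster version installed on this node
gluster --version

# current ping timeout for the volume (42 is the default)
gluster volume get GL1image network.ping-timeout

# set it back to 42 seconds
gluster volume set GL1image network.ping-timeout 42

# teamd runner/port status on the two-NIC team
teamdctl team0 state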
Ravishankar N
2019-Dec-12 12:34 UTC
[Gluster-users] Fwd: VM freeze issue on simple gluster setup.
On 12/12/19 4:01 am, WK wrote:
> <BUMP> so I can get some sort of resolution on the issue (i.e. is it
> hardware, Gluster, etc.)
>
> I guess what I really need to know is:
>
> 1) Node 2 complains that it can't reach node 1 and node 3. If this was
> an OS/hardware networking issue and not internal to Gluster, then why
> didn't node 1 and node 3 have error messages complaining about not
> reaching node 2?

[2019-12-05 22:00:43.739804] C [rpc-clnt-ping.c:155:rpc_clnt_ping_timer_expired] 0-GL1image-client-2: server 10.255.1.1:49153 has not responded in the last 21 seconds, disconnecting.
[2019-12-05 22:00:43.757095] C [rpc-clnt-ping.c:155:rpc_clnt_ping_timer_expired] 0-GL1image-client-1: server 10.255.1.3:49152 has not responded in the last 21 seconds, disconnecting.
[2019-12-05 22:00:43.757191] I [MSGID: 114018] [client.c:2323:client_rpc_notify] 0-GL1image-client-2: disconnected from GL1image-client-2. Client process will keep trying to connect to glusterd until brick's port is available
[2019-12-05 22:00:43.757246] I [MSGID: 114018] [client.c:2323:client_rpc_notify] 0-GL1image-client-1: disconnected from GL1image-client-1. Client process will keep trying to connect to glusterd until brick's port is available
[2019-12-05 22:00:43.757266] W [MSGID: 108001] [afr-common.c:5608:afr_notify] 0-GL1image-replicate-0: Client-quorum is not met

This seems to indicate the mount on node 2 cannot reach 2 bricks. If quorum is not met, you will get ENOTCONN on the mount. Maybe check if the mount is still disconnected from the bricks (either a statedump or looking at the .meta folder)?

> 2) How significant is it that the node was running 6.5 while node 1
> and node 3 were running 6.4?

Minor versions should be fine, but it is always a good idea to have all nodes on the same version.

HTH,
Ravi
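P.S. A rough sketch of both checks, in case it helps. The mount path /mnt/GL1image and the pgrep pattern are only placeholders; adjust them to the actual fuse mount on kvm host #2:

# 1) .meta folder: each client xlator exposes its state under the virtual
#    .meta directory of the fuse mount; look at the "connected" field for
#    GL1image-client-1 and GL1image-client-2.
grep connected /mnt/GL1image/.meta/graphs/active/GL1image-client-*/private

# 2) statedump: send SIGUSR1 to the fuse client process; the dump should
#    land under /var/run/gluster/ by default and also has a "connected"
#    line per client xlator. Check the newest dump file.
kill -USR1 $(pgrep -f 'glusterfs.*GL1image')
grep -i connected $(ls -t /var/run/gluster/*dump* | head -1)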