On 01/12/2016 at 13:12, Yannick Perret wrote:
> Hello,
> I have a client machine that mounts a replica-2 volume over NFS.
> Practically, this is configured with automount such as:
> DIR-NAME -rw,soft,intr server1,server2:/VOLUME
>
> Gluster servers are using 3.6.7.
> Sometimes NFS blocks on the client with
> "server server2 not responding, timed out" (here it was connected to
> server2),
> but network communication is fine between the two machines (they are
> connected to the same switch, I can ssh into each, and they ping each other…).
>
> I can also see a few "xs_tcp_setup_socket: connect returned unhandled
> error -107" messages on the client.
> On the 'server2' side, I can see the following in the gluster NFS logs:
>
> [2016-12-01 10:50:15.887927] W [rpcsvc.c:261:rpcsvc_program_actor]
> 0-rpc-service: RPC program version not available (req 100003 2)
> [2016-12-01 10:50:15.887965] E
> [rpcsvc.c:544:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed
> to complete successfully
> [2016-12-01 10:50:15.901880] W [rpcsvc.c:261:rpcsvc_program_actor]
> 0-rpc-service: RPC program version not available (req 100003 4)
> [2016-12-01 10:50:15.901900] E
> [rpcsvc.c:544:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed
> to complete successfully
> [2016-12-01 10:51:03.777145] W [rpcsvc.c:261:rpcsvc_program_actor]
> 0-rpc-service: RPC program version not available (req 100003 2)
> [2016-12-01 10:51:03.777191] E
> [rpcsvc.c:544:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed
> to complete successfully
> [2016-12-01 10:51:03.790561] W [rpcsvc.c:261:rpcsvc_program_actor]
> 0-rpc-service: RPC program version not available (req 100003 4)
> [2016-12-01 10:51:03.790580] E
> [rpcsvc.c:544:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed
> to complete successfully
>
It looks like these correspond to NFS re-connection attempts (the client
probing NFSv2 and NFSv4, I think); a possible mitigation is sketched below.
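If it really is version probing, one thing that might help (untested here,
just a sketch reusing the placeholder names from the automount map above)
is to pin the mount to NFSv3, since the Gluster built-in NFS server only
serves v3, and to check which versions the server actually registers:

  # which NFS versions does server2 register for RPC program 100003?
  rpcinfo -p server2 | grep 100003

  # automount map entry pinned to NFSv3 over TCP
  DIR-NAME -rw,soft,intr,vers=3,proto=tcp server1,server2:/VOLUME

With vers=3 forced, the client should no longer probe v2/v4 on reconnect,
which seems to be what triggers those "RPC program version not available"
warnings.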
Just before those rpcsvc errors, here are the logs:
l_layout_new_directory] 0-HOME-LIRIS-dht: assigning range size
0xffe76e40 to HOME-LIRIS-replicate-0
[2016-12-01 10:48:36.990028] W
[client-rpc-fops.c:2145:client3_3_setattr_cbk] 0-HOME-LIRIS-client-1:
remote operation failed: Opération non permise (Operation not permitted)
[2016-12-01 10:48:36.990303] W
[client-rpc-fops.c:2145:client3_3_setattr_cbk] 0-HOME-LIRIS-client-0:
remote operation failed: Opération non permise
The message "I [MSGID: 109036]
[dht-common.c:6296:dht_log_new_layout_for_dir_selfheal]
0-HOME-LIRIS-dht: Setting layout of
<gfid:6f8bb427-eea5-4dd5-b004-9db8582bdda2>/_indexer.lock with
[Subvol_name: HOME-LIRIS-replicate-0, Err: -1 , Start: 0 , Stop:
4294967295 ], " repeated 2 times between [2016-12-01 10:48:36.404738]
and [2016-12-01 10:48:36.949907]
[2016-12-01 10:48:36.990728] I [MSGID: 109036]
[dht-common.c:6296:dht_log_new_layout_for_dir_selfheal]
0-HOME-LIRIS-dht: Setting layout of
<gfid:6f8bb427-eea5-4dd5-b004-9db8582bdda2>/39132555496bb098708af2d5e7b56d67
with [Subvol_name: HOME-LIRIS-replicate-0, Err: -1 , Start: 0 , Stop:
4294967295 ],
[2016-12-01 10:50:10.360020] I [dht-rename.c:1344:dht_rename]
0-HOME-LIRIS-dht: renaming
<gfid:2a1f640e-ff3e-4a56-8019-64ec6d803fc1>/tmp_km1NUe
(hash=HOME-LIRIS-replicate-0/cache=HOME-LIRIS-replicate-0) =>
<gfid:2a1f640e-ff3e-4a56-8019-64ec6d803fc1>/general.php
(hash=HOME-LIRIS-replicate-0/cache=HOME-LIRIS-replicate-0)
[2016-12-01 10:50:10.423561] I [dht-rename.c:1344:dht_rename]
0-HOME-LIRIS-dht: renaming
<gfid:2a1f640e-ff3e-4a56-8019-64ec6d803fc1>/tmp_2pOZ5T
(hash=HOME-LIRIS-replicate-0/cache=HOME-LIRIS-replicate-0) =>
<gfid:2a1f640e-ff3e-4a56-8019-64ec6d803fc1>/1.php
(hash=HOME-LIRIS-replicate-0/cache=HOME-LIRIS-replicate-0)
[2016-12-01 10:50:10.485882] I [dht-rename.c:1344:dht_rename]
0-HOME-LIRIS-dht: renaming
<gfid:2a1f640e-ff3e-4a56-8019-64ec6d803fc1>/tmp_86Lmpz
(hash=HOME-LIRIS-replicate-0/cache=HOME-LIRIS-replicate-0) =>
<gfid:2a1f640e-ff3e-4a56-8019-64ec6d803fc1>/general.php
(hash=HOME-LIRIS-replicate-0/cache=HOME-LIRIS-replicate-0)
I also tried to set "nfs.mount-rmtab /dev/shm/glusterfs.rmtab" as I read
in an old thread. I will check whether it changes anything.
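For reference, assuming the usual volume-set syntax (with VOLUME standing
in for the real volume name), that option is set with:

  gluster volume set VOLUME nfs.mount-rmtab /dev/shm/glusterfs.rmtab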
Regards,
--
Y.
> at a time that corresponds to the NFS timeouts.
>
> This problem occurs "often" (at least once every day or two), and
> neither the client nor the servers are under heavy load (memory and CPU
> are far from full).
>
> Any idea what the reason could be, and how to prevent it from occurring?
> I reduced the autofs timeout in order to limit the impact, but that is not a
> very nice solution… Note: I can't use the glusterfs client instead of
> NFS because of the memory leaks that still exist in it.
>
> Thanks.
>
> Regards,
> --
> Y.