It might be due to fragmentation of the extent map in the XFS
filesystem; see https://bugzilla.kernel.org/show_bug.cgi?id=73831
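
If you want to gauge how bad the fragmentation is, the standard XFS
tools will show it. The brick path below is only an example (substitute
your own), /dev/sdX is a placeholder, and xfs_db is best run against an
idle or unmounted filesystem:

    # rough extent count for one image file (header lines inflate it
    # slightly; tens of thousands of rows means a heavily fragmented
    # extent map)
    xfs_bmap -v /bricks/brick4/images/d8fcfdb97bc462aca502d5fe703afc66 | wc -l

    # overall fragmentation factor for the brick filesystem
    xfs_db -r -c frag /dev/sdX

If the counts are large, xfs_fsr can defragment the files in place.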
Regards
Rafi KC
On 12/19/2016 01:22 AM, Gustave Dahl wrote:
> I have been running gluster as a storage backend to OpenNebula for
> about a year and it has been running great. I have had an intermittent
> problem that has gotten worse over the last couple of days and I could
> use some help.
>
>
>
> Setup
>
> =====
> Gluster: 3.7.11
>
> Hyper-converged setup - Gluster with KVMs on the same machines, with
> Gluster in a slice on each server.
>
>
>
> Four Servers - Each with 4 Bricks
>
>
>
> Type: Distributed-Replicate
>
> Number of Bricks: 4 x 3 = 12
>
>
>
> Bricks are 1TB SSDs
>
>
>
> Gluster Status: http://pastebin.com/Nux7VB4b
>
> Gluster Info: http://pastebin.com/G5qR0kZq
>
>
>
> Gluster hosts the qcow2 images that the KVMs use. Image
> sizes: 10GB up to 300GB.
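
Note that sparse qcow2 images grow through many small allocations, which
is exactly the pattern that fragments the XFS extent map beneath them.
If fragmentation turns out to be the cause, preallocating new images
limits it; the path and size here are purely illustrative, and
preallocation=falloc needs a reasonably recent qemu-img:

    qemu-img create -f qcow2 -o preallocation=falloc /var/lib/one/datastores/1/example.qcow2 100G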
>
>
>
> The volume is mounted on each node with glusterfs as a shared file
> system. The KVMs access the images via libgfapi (e.g.
> file=gluster://shchst01:24007/shchst01/d8fcfdb97bc462aca502d5fe703afc66 )
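
For completeness, the full -drive spec qemu ends up with is along these
lines; the format and if= flags here are illustrative, not taken from
your setup:

    -drive file=gluster://shchst01:24007/shchst01/d8fcfdb97bc462aca502d5fe703afc66,format=qcow2,if=virtio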
>
>
>
> Issue
>
> =====
> This setup has been running well, with the exception of this
> intermittent problem. It only happens on one node; it has occurred
> on other bricks (all on the same node) but most frequently on Node 2:
> Brick 4
>
>
>
> It starts here: http://pastebin.com/YgeJ5VA9
>
>
>
> Dec 18 02:08:54 shchhv02 kernel: XFS: possible memory allocation
> deadlock in kmem_alloc (mode:0x250)
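
To catch the next occurrence as it begins, something like this flags the
message in real time (assuming journald; tailing /var/log/messages works
just as well):

    journalctl -k -f | grep --line-buffered 'possible memory allocation deadlock'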
>
>
>
> This continues until:
>
>
>
> Dec 18 02:11:10 shchhv02 storage-shchst01[14728]: [2016-12-18
> 08:11:10.428138] C [rpc-clnt-ping.c:165:rpc_clnt_ping_timer_expired]
> 4-shchst01-client-11: server xxx.xx.xx.11:49155 has not responded in
> the last 42 seconds, disconnecting.
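
That 42-second window is GlusterFS's default network.ping-timeout, so
the client gives up after the brick stalls for the full default period.
If the brick is stalling rather than dying, raising the timeout can ride
out short allocation stalls while the root cause is chased down; 60 is
only an example value, and the trade-off is that clients hang longer on
a genuinely dead brick:

    gluster volume set shchst01 network.ping-timeout 60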
>
>
>
> storage log: http://pastebin.com/vxCdRnEg
>
>
>
> [2016-12-18 08:11:10.435927] E [MSGID: 114031]
> [client-rpc-fops.c:2886:client3_3_opendir_cbk] 4-shchst01-client-11:
> remote operation failed. Path: /
> (00000000-0000-0000-0000-000000000001) [Transport endpoint is not
> connected]
>
> [2016-12-18 08:11:10.436240] E [rpc-clnt.c:362:saved_frames_unwind]
> (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7f06efbaeae2]
> (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7f06ef97990e]
> (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f06ef979a1e]
> (-->
> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x7a)[0x7f06ef97b40a]
> (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x88)[0x7f06ef97bc38] )))))
> 4-shchst01-client-11: forced unwinding frame type(GF-DUMP) op(NULL(2))
> called at 2016-12-18 08:10:28.424311 (xid=0x36883d)
>
> [2016-12-18 08:11:10.436255] W [rpc-clnt-ping.c:208:rpc_clnt_ping_cbk]
> 4-shchst01-client-11: socket disconnected
>
> [2016-12-18 08:11:10.436369] E [rpc-clnt.c:362:saved_frames_unwind]
> (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7f06efbaeae2]
> (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7f06ef97990e]
> (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f06ef979a1e]
> (-->
> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x7a)[0x7f06ef97b40a]
> (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x88)[0x7f06ef97bc38] )))))
> 4-shchst01-client-11: forced unwinding frame type(GlusterFS 3.3)
> op(LOOKUP(27)) called at 2016-12-18 08:10:38.370507 (xid=0x36883e)
>
> [2016-12-18 08:11:10.436388] W [MSGID: 114031]
> [client-rpc-fops.c:2974:client3_3_lookup_cbk] 4-shchst01-client-11:
> remote operation failed. Path: /
> (00000000-0000-0000-0000-000000000001) [Transport endpoint is not
> connected]
>
> [2016-12-18 08:11:10.436488] I [MSGID: 114018]
> [client.c:2030:client_rpc_notify] 4-shchst01-client-11: disconnected
> from shchst01-client-11. Client process will keep trying to connect to
> glusterd until brick's port is available
>
> The message "W [MSGID: 114031]
> [client-rpc-fops.c:1572:client3_3_fstat_cbk] 4-shchst01-client-11:
> remote operation failed [Transport endpoint is not connected]"
> repeated 3 times between [2016-12-18 08:11:10.432640] and [2016-12-18
> 08:11:10.433530]
>
> The message "W [MSGID: 114031]
> [client-rpc-fops.c:2669:client3_3_readdirp_cbk] 4-shchst01-client-11:
> remote operation failed [Transport endpoint is not connected]"
> repeated 15 times between [2016-12-18 08:11:10.428844] and [2016-12-18
> 08:11:10.435727]
>
> The message "W [MSGID: 114061]
> [client-rpc-fops.c:4560:client3_3_fstat] 4-shchst01-client-11:
> (00000000-0000-0000-0000-000000000001) remote_fd is -1. EBADFD [File
> descriptor in bad state]" repeated 11 times between [2016-12-18
> 08:11:10.433598] and [2016-12-18 08:11:10.435742]
>
>
>
> brick 4 log: http://pastebin.com/kQcNyGk2
>
>
>
> [2016-12-18 08:08:33.000483] I [dict.c:473:dict_get]
> (-->/lib64/libglusterfs.so.0(default_getxattr_cbk+0xac)
> [0x7f8504feccbc]
>
> -->/usr/lib64/glusterfs/3.7.11/xlator/features/marker.so(marker_getxattr_cbk+0xa7)
> [0x7f84f5734917] -->/lib64/libglusterfs.so.0(dict_get+0xac)
> [0x7f8504fdd0fc] ) 0-dict: !this || key=() [Invalid argument]
>
> [2016-12-18 08:08:33.003178] I [dict.c:473:dict_get]
> (-->/lib64/libglusterfs.so.0(default_getxattr_cbk+0xac)
> [0x7f8504feccbc]
>
> -->/usr/lib64/glusterfs/3.7.11/xlator/features/marker.so(marker_getxattr_cbk+0xa7)
> [0x7f84f5734917] -->/lib64/libglusterfs.so.0(dict_get+0xac)
> [0x7f8504fdd0fc] ) 0-dict: !this || key=() [Invalid argument]
>
> [2016-12-18 08:08:34.021937] I [dict.c:473:dict_get]
> (-->/lib64/libglusterfs.so.0(default_getxattr_cbk+0xac)
> [0x7f8504feccbc]
>
> -->/usr/lib64/glusterfs/3.7.11/xlator/features/marker.so(marker_getxattr_cbk+0xa7)
> [0x7f84f5734917] -->/lib64/libglusterfs.so.0(dict_get+0xac)
> [0x7f8504fdd0fc] ) 0-dict: !this || key=() [Invalid argument]
>
> [2016-12-18 08:10:11.671642] E
> [server-helpers.c:390:server_alloc_frame]
> (-->/lib64/libgfrpc.so.0(rpcsvc_handle_rpc_call+0x2fb)
> [0x7f8504dad73b]
>
> -->/usr/lib64/glusterfs/3.7.11/xlator/protocol/server.so(server3_3_fxattrop+0x86)
> [0x7f84f48a9a76]
>
> -->/usr/lib64/glusterfs/3.7.11/xlator/protocol/server.so(get_frame_from_request+0x2fb)
> [0x7f84f489eedb] ) 0-server: invalid argument: client [Invalid argument]
>
> [2016-12-18 08:10:11.671689] E
> [rpcsvc.c:565:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed
> to complete successfully
>
> [2016-12-18 08:10:11.671808] I [login.c:81:gf_auth] 0-auth/login:
> allowed user names: b7391aaa-d0cb-4db6-9e4c-999310c97eb6
>
> [2016-12-18 08:10:11.671820] I [MSGID: 115029]
> [server-handshake.c:690:server_setvolume] 0-shchst01-server: accepted
> client from
> shchhv03-13679-2016/12/17-22:57:24:920194-shchst01-client-11-0-2
> (version: 3.7.11)
>
> [2016-12-18 08:13:31.526854] W [socket.c:589:__socket_rwv]
> 0-tcp.shchst01-server: writev on xxx.xxx.xx.12:65527 failed (Broken pipe)
>
> [2016-12-18 08:13:31.526909] I [socket.c:2356:socket_event_handler]
> 0-transport: disconnecting now
>
> [2016-12-18 08:13:31.526935] I [MSGID: 115036]
> [server.c:552:server_rpc_notify] 0-shchst01-server: disconnecting
> connection from
> shchhv03-10686-2016/12/16-06:08:16:797591-shchst01-client-11-0-6
>
> [2016-12-18 08:13:31.526976] I [MSGID: 115013]
> [server-helpers.c:294:do_fd_cleanup] 0-shchst01-server: fd cleanup on
> /b40877dae051c076b95c160f2f639e45
>
> [2016-12-18 08:13:31.527008] W [socket.c:589:__socket_rwv]
> 0-tcp.shchst01-server: writev on xxx.xxx.xx.12:65525 failed (Broken pipe)
>
> [2016-12-18 08:13:31.527009] I [socket.c:3378:socket_submit_reply]
> 0-tcp.shchst01-server: not connected (priv->connected = -1)
>
> [2016-12-18 08:13:31.527040] E [rpcsvc.c:1314:rpcsvc_submit_generic]
> 0-rpc-service: failed to submit message (XID: 0x309470, Program:
> GlusterFS 3.3, ProgVers: 330, Proc: 16) to rpc-transport
> (tcp.shchst01-server)
>
> [2016-12-18 08:13:31.527099] I [socket.c:2356:socket_event_handler]
> 0-transport: disconnecting now
>
> [2016-12-18 08:13:31.527114] E [server.c:205:server_submit_reply]
>
> (-->/usr/lib64/glusterfs/3.7.11/xlator/debug/io-stats.so(io_stats_fsync_cbk+0xc8)
> [0x7f84f4ada308]
>
> -->/usr/lib64/glusterfs/3.7.11/xlator/protocol/server.so(server_fsync_cbk+0x384)
> [0x7f84f48b0444]
>
> -->/usr/lib64/glusterfs/3.7.11/xlator/protocol/server.so(server_submit_reply+0x2f6)
> [0x7f84f489b086] ) 0-: Reply submission failed
>
> [2016-12-18 08:13:31.527121] I [MSGID: 115036]
> [server.c:552:server_rpc_notify] 0-shchst01-server: disconnecting
> connection from
> shchhv02-15410-2016/12/17-06:07:39:376627-shchst01-client-11-0-6
>
>
>
> statedump (brick 4), taken later in the day: http://pastebin.com/DEE3RbT8
>
>
>
> Temp Resolution Path
>
> ===================
> There is a rise in load on the node, as well as on one particular KVM
> (on another node). If we catch the load rise and clear the page cache,
> the problem seems to clear and resolve. I have not been able to catch
> it often enough to provide more details.
>
>
>
> echo 1 > /proc/sys/vm/drop_caches
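
Two notes on that workaround: echo 1 frees only the clean page cache, so
running sync first lets dirty pages be written out (echo 3 would also
drop dentries and inodes), and if you want to catch the load spike
automatically, a crude sketch follows; the 30-second poll and threshold
of 20 are arbitrary:

    while sleep 30; do
        load=$(cut -d. -f1 /proc/loadavg)
        if [ "$load" -ge 20 ]; then
            sync
            echo 1 > /proc/sys/vm/drop_caches
            logger "dropped page cache at load $load"
        fi
    done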
>
>
>
> There is something I am missing. I would appreciate any help getting
> to the root cause and a resolution.
>
>
>
> Thanks,
>
>
>
> Gustave
>
>
>