It might be due to fragmentation of the extent map in the XFS
filesystem; see https://bugzilla.kernel.org/show_bug.cgi?id=73831
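
If you want to gauge how bad the fragmentation is, the standard XFS
tools will show it. The brick path below is only an example (substitute
your own), /dev/sdX is a placeholder, and xfs_db is best run against an
idle or unmounted filesystem:

    # rough extent count for one image file (header lines inflate it
    # slightly; tens of thousands of rows means a heavily fragmented
    # extent map)
    xfs_bmap -v /bricks/brick4/images/d8fcfdb97bc462aca502d5fe703afc66 | wc -l

    # overall fragmentation factor for the brick filesystem
    xfs_db -r -c frag /dev/sdX

If the counts are large, xfs_fsr can defragment the files in place.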
Regards
Rafi KC
On 12/19/2016 01:22 AM, Gustave Dahl wrote:
> I have been running gluster as a storage backend to OpenNebula for
> about a year and it has been running great. I have had an intermittent
> problem that has gotten worse over the last couple of days and I could
> use some help.
>
>
>
> Setup
>
> =====
> Gluster: 3.7.11
>
> Hyper-converged setup - Gluster with KVMs on the same machines, with
> Gluster in a slice on each server.
>
>
>
> Four Servers - Each with 4 Bricks
>
>
>
> Type: Distributed-Replicate
>
> Number of Bricks: 4 x 3 = 12
>
>
>
> Bricks are 1TB SSDs
>
>
>
> Gluster Status: http://pastebin.com/Nux7VB4b
>
> Gluster Info: http://pastebin.com/G5qR0kZq
>
>
>
> Gluster hosts the qcow2 images that the KVMs use. Image
> sizes: 10GB up to 300GB.
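
Note that sparse qcow2 images grow through many small allocations, which
is exactly the pattern that fragments the XFS extent map beneath them.
If fragmentation turns out to be the cause, preallocating new images
limits it; the path and size here are purely illustrative, and
preallocation=falloc needs a reasonably recent qemu-img:

    qemu-img create -f qcow2 -o preallocation=falloc /var/lib/one/datastores/1/example.qcow2 100G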
>
>
>
> The volume is mounted on each node with glusterfs as a shared file
> system. The KVMs access the images via libgfapi (e.g.
> file=gluster://shchst01:24007/shchst01/d8fcfdb97bc462aca502d5fe703afc66 )
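
For completeness, the full -drive spec qemu ends up with is along these
lines; the format and if= flags here are illustrative, not taken from
your setup:

    -drive file=gluster://shchst01:24007/shchst01/d8fcfdb97bc462aca502d5fe703afc66,format=qcow2,if=virtio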
>
>
>
> Issue
>
> =====
> This setup has been running well, with the exception of this
> intermittent problem. It only happens on one node; it has occurred
> on other bricks (all on the same node) but most frequently on Node 2:
> Brick 4
>
>
>
> It starts here: http://pastebin.com/YgeJ5VA9
>
>
>
> Dec 18 02:08:54 shchhv02 kernel: XFS: possible memory allocation
> deadlock in kmem_alloc (mode:0x250)
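
To catch the next occurrence as it begins, something like this flags the
message in real time (assuming journald; tailing /var/log/messages works
just as well):

    journalctl -k -f | grep --line-buffered 'possible memory allocation deadlock'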
>
>
>
> This continues until:
>
>
>
> Dec 18 02:11:10 shchhv02 storage-shchst01[14728]: [2016-12-18
> 08:11:10.428138] C [rpc-clnt-ping.c:165:rpc_clnt_ping_timer_expired]
> 4-shchst01-client-11: server xxx.xx.xx.11:49155 has not responded in
> the last 42 seconds, disconnecting.
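
That 42-second window is GlusterFS's default network.ping-timeout, so
the client gives up after the brick stalls for the full default period.
If the brick is stalling rather than dying, raising the timeout can ride
out short allocation stalls while the root cause is chased down; 60 is
only an example value, and the trade-off is that clients hang longer on
a genuinely dead brick:

    gluster volume set shchst01 network.ping-timeout 60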
>
>
>
> storage log: http://pastebin.com/vxCdRnEg
>
>
>
> [2016-12-18 08:11:10.435927] E [MSGID: 114031]
> [client-rpc-fops.c:2886:client3_3_opendir_cbk] 4-shchst01-client-11:
> remote operation failed. Path: /
> (00000000-0000-0000-0000-000000000001) [Transport endpoint is not
> connected]
>
> [2016-12-18 08:11:10.436240] E [rpc-clnt.c:362:saved_frames_unwind]
> (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7f06efbaeae2]
> (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7f06ef97990e]
> (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f06ef979a1e]
> (-->
> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x7a)[0x7f06ef97b40a]
> (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x88)[0x7f06ef97bc38] )))))
> 4-shchst01-client-11: forced unwinding frame type(GF-DUMP) op(NULL(2))
> called at 2016-12-18 08:10:28.424311 (xid=0x36883d)
>
> [2016-12-18 08:11:10.436255] W [rpc-clnt-ping.c:208:rpc_clnt_ping_cbk]
> 4-shchst01-client-11: socket disconnected
>
> [2016-12-18 08:11:10.436369] E [rpc-clnt.c:362:saved_frames_unwind]
> (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7f06efbaeae2]
> (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7f06ef97990e]
> (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f06ef979a1e]
> (-->
> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x7a)[0x7f06ef97b40a]
> (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x88)[0x7f06ef97bc38] )))))
> 4-shchst01-client-11: forced unwinding frame type(GlusterFS 3.3)
> op(LOOKUP(27)) called at 2016-12-18 08:10:38.370507 (xid=0x36883e)
>
> [2016-12-18 08:11:10.436388] W [MSGID: 114031]
> [client-rpc-fops.c:2974:client3_3_lookup_cbk] 4-shchst01-client-11:
> remote operation failed. Path: /
> (00000000-0000-0000-0000-000000000001) [Transport endpoint is not
> connected]
>
> [2016-12-18 08:11:10.436488] I [MSGID: 114018]
> [client.c:2030:client_rpc_notify] 4-shchst01-client-11: disconnected
> from shchst01-client-11. Client process will keep trying to connect to
> glusterd until brick's port is available
>
> The message "W [MSGID: 114031]
> [client-rpc-fops.c:1572:client3_3_fstat_cbk] 4-shchst01-client-11:
> remote operation failed [Transport endpoint is not connected]"
> repeated 3 times between [2016-12-18 08:11:10.432640] and [2016-12-18
> 08:11:10.433530]
>
> The message "W [MSGID: 114031]
> [client-rpc-fops.c:2669:client3_3_readdirp_cbk] 4-shchst01-client-11:
> remote operation failed [Transport endpoint is not connected]"
> repeated 15 times between [2016-12-18 08:11:10.428844] and [2016-12-18
> 08:11:10.435727]
>
> The message "W [MSGID: 114061]
> [client-rpc-fops.c:4560:client3_3_fstat] 4-shchst01-client-11:
> (00000000-0000-0000-0000-000000000001) remote_fd is -1. EBADFD [File
> descriptor in bad state]" repeated 11 times between [2016-12-18
> 08:11:10.433598] and [2016-12-18 08:11:10.435742]
>
>
>
> brick 4 log: http://pastebin.com/kQcNyGk2
>
>
>
> [2016-12-18 08:08:33.000483] I [dict.c:473:dict_get]
> (-->/lib64/libglusterfs.so.0(default_getxattr_cbk+0xac)
> [0x7f8504feccbc]
>
> -->/usr/lib64/glusterfs/3.7.11/xlator/features/marker.so(marker_getxattr_cbk+0xa7)
> [0x7f84f5734917] -->/lib64/libglusterfs.so.0(dict_get+0xac)
> [0x7f8504fdd0fc] ) 0-dict: !this || key=() [Invalid argument]
>
> [2016-12-18 08:08:33.003178] I [dict.c:473:dict_get]
> (-->/lib64/libglusterfs.so.0(default_getxattr_cbk+0xac)
> [0x7f8504feccbc]
>
> -->/usr/lib64/glusterfs/3.7.11/xlator/features/marker.so(marker_getxattr_cbk+0xa7)
> [0x7f84f5734917] -->/lib64/libglusterfs.so.0(dict_get+0xac)
> [0x7f8504fdd0fc] ) 0-dict: !this || key=() [Invalid argument]
>
> [2016-12-18 08:08:34.021937] I [dict.c:473:dict_get]
> (-->/lib64/libglusterfs.so.0(default_getxattr_cbk+0xac)
> [0x7f8504feccbc]
>
> -->/usr/lib64/glusterfs/3.7.11/xlator/features/marker.so(marker_getxattr_cbk+0xa7)
> [0x7f84f5734917] -->/lib64/libglusterfs.so.0(dict_get+0xac)
> [0x7f8504fdd0fc] ) 0-dict: !this || key=() [Invalid argument]
>
> [2016-12-18 08:10:11.671642] E
> [server-helpers.c:390:server_alloc_frame]
> (-->/lib64/libgfrpc.so.0(rpcsvc_handle_rpc_call+0x2fb)
> [0x7f8504dad73b]
>
> -->/usr/lib64/glusterfs/3.7.11/xlator/protocol/server.so(server3_3_fxattrop+0x86)
> [0x7f84f48a9a76]
>
> -->/usr/lib64/glusterfs/3.7.11/xlator/protocol/server.so(get_frame_from_request+0x2fb)
> [0x7f84f489eedb] ) 0-server: invalid argument: client [Invalid argument]
>
> [2016-12-18 08:10:11.671689] E
> [rpcsvc.c:565:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed
> to complete successfully
>
> [2016-12-18 08:10:11.671808] I [login.c:81:gf_auth] 0-auth/login:
> allowed user names: b7391aaa-d0cb-4db6-9e4c-999310c97eb6
>
> [2016-12-18 08:10:11.671820] I [MSGID: 115029]
> [server-handshake.c:690:server_setvolume] 0-shchst01-server: accepted
> client from
> shchhv03-13679-2016/12/17-22:57:24:920194-shchst01-client-11-0-2
> (version: 3.7.11)
>
> [2016-12-18 08:13:31.526854] W [socket.c:589:__socket_rwv]
> 0-tcp.shchst01-server: writev on xxx.xxx.xx.12:65527 failed (Broken pipe)
>
> [2016-12-18 08:13:31.526909] I [socket.c:2356:socket_event_handler]
> 0-transport: disconnecting now
>
> [2016-12-18 08:13:31.526935] I [MSGID: 115036]
> [server.c:552:server_rpc_notify] 0-shchst01-server: disconnecting
> connection from
> shchhv03-10686-2016/12/16-06:08:16:797591-shchst01-client-11-0-6
>
> [2016-12-18 08:13:31.526976] I [MSGID: 115013]
> [server-helpers.c:294:do_fd_cleanup] 0-shchst01-server: fd cleanup on
> /b40877dae051c076b95c160f2f639e45
>
> [2016-12-18 08:13:31.527008] W [socket.c:589:__socket_rwv]
> 0-tcp.shchst01-server: writev on xxx.xxx.xx.12:65525 failed (Broken pipe)
>
> [2016-12-18 08:13:31.527009] I [socket.c:3378:socket_submit_reply]
> 0-tcp.shchst01-server: not connected (priv->connected = -1)
>
> [2016-12-18 08:13:31.527040] E [rpcsvc.c:1314:rpcsvc_submit_generic]
> 0-rpc-service: failed to submit message (XID: 0x309470, Program:
> GlusterFS 3.3, ProgVers: 330, Proc: 16) to rpc-transport
> (tcp.shchst01-server)
>
> [2016-12-18 08:13:31.527099] I [socket.c:2356:socket_event_handler]
> 0-transport: disconnecting now
>
> [2016-12-18 08:13:31.527114] E [server.c:205:server_submit_reply]
>
> (-->/usr/lib64/glusterfs/3.7.11/xlator/debug/io-stats.so(io_stats_fsync_cbk+0xc8)
> [0x7f84f4ada308]
>
> -->/usr/lib64/glusterfs/3.7.11/xlator/protocol/server.so(server_fsync_cbk+0x384)
> [0x7f84f48b0444]
>
> -->/usr/lib64/glusterfs/3.7.11/xlator/protocol/server.so(server_submit_reply+0x2f6)
> [0x7f84f489b086] ) 0-: Reply submission failed
>
> [2016-12-18 08:13:31.527121] I [MSGID: 115036]
> [server.c:552:server_rpc_notify] 0-shchst01-server: disconnecting
> connection from
> shchhv02-15410-2016/12/17-06:07:39:376627-shchst01-client-11-0-6
>
>
>
> statedump (brick 4), taken later in the day: http://pastebin.com/DEE3RbT8
>
>
>
> Temp Resolution Path
>
> ===================
> There is a rise in load on the node, as well as on one particular KVM
> (on another node). If we catch the load rise and clear the page cache,
> the problem seems to clear and resolve. I have not been able to catch
> it often enough to provide more details.
>
>
>
> echo 1 > /proc/sys/vm/drop_caches
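
Two notes on that workaround: echo 1 frees only the clean page cache, so
running sync first lets dirty pages be written out (echo 3 would also
drop dentries and inodes), and if you want to catch the load spike
automatically, a crude sketch follows; the 30-second poll and threshold
of 20 are arbitrary:

    while sleep 30; do
        load=$(cut -d. -f1 /proc/loadavg)
        if [ "$load" -ge 20 ]; then
            sync
            echo 1 > /proc/sys/vm/drop_caches
            logger "dropped page cache at load $load"
        fi
    done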
>
>
>
> There is something I am missing. I would appreciate any help getting
> to the root cause and a resolution.
>
>
>
> Thanks,
>
>
>
> Gustave
>
>
>