Hi,

I would really appreciate it if someone could help me fix my NFS crash. It's happening a lot and causing many issues for my VMs: every few hours the native NFS server crashes and the volume becomes unavailable from the affected node unless I restart glusterd. The volume is used by VMware ESXi as a datastore for its VMs, with the following options:

OS: CentOS 7.2
Gluster: 3.7.13

Volume Name: vlm01
Type: Distributed-Replicate
Volume ID: eacd8248-dca3-4530-9aed-7714a5a114f2
Status: Started
Number of Bricks: 7 x 3 = 21
Transport-type: tcp
Bricks:
Brick1: gfs01:/bricks/b01/vlm01
Brick2: gfs02:/bricks/b01/vlm01
Brick3: gfs03:/bricks/b01/vlm01
Brick4: gfs01:/bricks/b02/vlm01
Brick5: gfs02:/bricks/b02/vlm01
Brick6: gfs03:/bricks/b02/vlm01
Brick7: gfs01:/bricks/b03/vlm01
Brick8: gfs02:/bricks/b03/vlm01
Brick9: gfs03:/bricks/b03/vlm01
Brick10: gfs01:/bricks/b04/vlm01
Brick11: gfs02:/bricks/b04/vlm01
Brick12: gfs03:/bricks/b04/vlm01
Brick13: gfs01:/bricks/b05/vlm01
Brick14: gfs02:/bricks/b05/vlm01
Brick15: gfs03:/bricks/b05/vlm01
Brick16: gfs01:/bricks/b06/vlm01
Brick17: gfs02:/bricks/b06/vlm01
Brick18: gfs03:/bricks/b06/vlm01
Brick19: gfs01:/bricks/b07/vlm01
Brick20: gfs02:/bricks/b07/vlm01
Brick21: gfs03:/bricks/b07/vlm01
Options Reconfigured:
performance.readdir-ahead: off
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.stat-prefetch: off
cluster.eager-lock: enable
network.remote-dio: enable
cluster.quorum-type: auto
cluster.server-quorum-type: server
performance.strict-write-ordering: on
performance.write-behind: off
cluster.data-self-heal-algorithm: full
cluster.self-heal-window-size: 128
features.shard-block-size: 16MB
features.shard: on
auth.allow: 192.168.221.50,192.168.221.51,192.168.221.52,192.168.221.56,192.168.208.130,192.168.208.131,192.168.208.132,192.168.208.89,192.168.208.85,192.168.208.208.86
network.ping-timeout: 10

latest bt:

(gdb) bt
#0  0x00007f196acab210 in pthread_spin_lock () from /lib64/libpthread.so.0
#1  0x00007f196be7bcd5 in fd_anonymous (inode=0x0) at fd.c:804
#2  0x00007f195deb1787 in shard_common_inode_write_do (frame=0x7f19699f1164, this=0x7f195802ac10) at shard.c:3716
#3  0x00007f195deb1a53 in shard_common_inode_write_post_lookup_shards_handler (frame=<optimized out>, this=<optimized out>) at shard.c:3769
#4  0x00007f195deaaff5 in shard_common_lookup_shards_cbk (frame=0x7f19699f1164, cookie=<optimized out>, this=0x7f195802ac10, op_ret=0, op_errno=<optimized out>, inode=<optimized out>, buf=0x7f194970bc40, xdata=0x7f196c15451c, postparent=0x7f194970bcb0) at shard.c:1601
#5  0x00007f195e10a141 in dht_lookup_cbk (frame=0x7f196998e7d4, cookie=<optimized out>, this=<optimized out>, op_ret=0, op_errno=0, inode=0x7f195c532b18, stbuf=0x7f194970bc40, xattr=0x7f196c15451c, postparent=0x7f194970bcb0) at dht-common.c:2174
#6  0x00007f195e3931f3 in afr_lookup_done (frame=frame@entry=0x7f196997f8a4, this=this@entry=0x7f1958022a20) at afr-common.c:1825
#7  0x00007f195e393b84 in afr_lookup_metadata_heal_check (frame=frame@entry=0x7f196997f8a4, this=0x7f1958022a20, this@entry=0xe3a929e0b67fa500) at afr-common.c:2068
#8  0x00007f195e39434f in afr_lookup_entry_heal (frame=frame@entry=0x7f196997f8a4, this=0xe3a929e0b67fa500, this@entry=0x7f1958022a20) at afr-common.c:2157
#9  0x00007f195e39467d in afr_lookup_cbk (frame=0x7f196997f8a4, cookie=<optimized out>, this=0x7f1958022a20, op_ret=<optimized out>, op_errno=<optimized out>, inode=<optimized out>, buf=0x7f195effa940, xdata=0x7f196c1853b0, postparent=0x7f195effa9b0) at afr-common.c:2205
#10 0x00007f195e5e2e42 in client3_3_lookup_cbk (req=<optimized out>, iov=<optimized out>, count=<optimized out>, myframe=0x7f196999952c) at client-rpc-fops.c:2981
#11 0x00007f196bc0ca30 in rpc_clnt_handle_reply (clnt=clnt@entry=0x7f19583adaf0, pollin=pollin@entry=0x7f195907f930) at rpc-clnt.c:764
#12 0x00007f196bc0ccef in rpc_clnt_notify (trans=<optimized out>, mydata=0x7f19583adb20, event=<optimized out>, data=0x7f195907f930) at rpc-clnt.c:925
#13 0x00007f196bc087c3 in rpc_transport_notify (this=this@entry=0x7f19583bd770, event=event@entry=RPC_TRANSPORT_MSG_RECEIVED, data=data@entry=0x7f195907f930) at rpc-transport.c:546
#14 0x00007f1960acf9a4 in socket_event_poll_in (this=this@entry=0x7f19583bd770) at socket.c:2353
#15 0x00007f1960ad25e4 in socket_event_handler (fd=fd@entry=25, idx=idx@entry=14, data=0x7f19583bd770, poll_in=1, poll_out=0, poll_err=0) at socket.c:2466
#16 0x00007f196beacf7a in event_dispatch_epoll_handler (event=0x7f195effae80, event_pool=0x7f196dbf5f20) at event-epoll.c:575
#17 event_dispatch_epoll_worker (data=0x7f196dc41e10) at event-epoll.c:678
#18 0x00007f196aca6dc5 in start_thread () from /lib64/libpthread.so.0
#19 0x00007f196a5ebced in clone () from /lib64/libc.so.6

The NFS logs and the core dump can be found at the Dropbox link below:
https://db.tt/rZrC9d7f

Thanks in advance.

Respectfully,
Mahdi A. Mahdi
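A quick triage sketch for the "volume unavailable until glusterd restarts" symptom described above. The commands and paths are the stock CentOS 7 / Gluster 3.7 defaults; adjust them if your install differs:

    gluster volume status vlm01 nfs          # is "NFS Server on localhost" still Online?
    tail -n 100 /var/log/glusterfs/nfs.log   # gluster NFS crash messages land here
    systemctl restart glusterd               # respawns the crashed gluster NFS process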
The inode stored in the shard xlator's local is NULL. CCing Krutika to comment.

Thanks,
Soumya

> (gdb) bt
> #0  0x00007f196acab210 in pthread_spin_lock () from /lib64/libpthread.so.0
> #1  0x00007f196be7bcd5 in fd_anonymous (inode=0x0) at fd.c:804
> #2  0x00007f195deb1787 in shard_common_inode_write_do (frame=0x7f19699f1164, this=0x7f195802ac10) at shard.c:3716
> [...]
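For context on frames #0 and #1: fd_anonymous() takes the inode's spinlock before doing anything else, so a NULL inode handed down by the shard translator dies inside pthread_spin_lock(). A simplified sketch of the relevant path (approximate, not the verbatim 3.7.13 source):

    /* Simplified sketch of fd_anonymous() from fd.c -- approximate, not the
     * verbatim 3.7.13 source.  shard_common_inode_write_do() calls this with
     * an inode taken from the shard translator's local; when that inode is
     * NULL, the LOCK() below dereferences a NULL pointer inside
     * pthread_spin_lock(), matching frames #0 and #1 of the backtrace. */
    fd_t *
    fd_anonymous (inode_t *inode)
    {
            fd_t *fd = NULL;

            LOCK (&inode->lock);    /* inode == 0x0 -> SIGSEGV here (fd.c:804) */
            {
                    fd = __fd_anonymous (inode);  /* allocate the anonymous fd */
            }
            UNLOCK (&inode->lock);

            return fd;
    }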
Could you also print and share the values of the following variables from the backtrace, please:

i.   cur_block
ii.  last_block
iii. local->first_block
iv.  odirect
v.   fd->flags
vi.  local->call_count

-Krutika
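Assuming all six are visible in frame #2 (shard_common_inode_write_do), a gdb session against the core along these lines should print them; with an optimized build some may come back as <optimized out>:

    (gdb) frame 2                    # shard_common_inode_write_do, shard.c:3716
    (gdb) print cur_block
    (gdb) print last_block
    (gdb) print local->first_block
    (gdb) print odirect
    (gdb) print fd->flags
    (gdb) print local->call_count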
On Sat, Jul 30, 2016 at 5:04 PM, Mahdi Adnan <mahdi.adnan@outlook.com> wrote:

> Hi,
>
> I would really appreciate it if someone could help me fix my NFS crash.
> It's happening a lot and causing many issues for my VMs: every few hours
> the native NFS server crashes and the volume becomes unavailable from the
> affected node unless I restart glusterd.
> [...]
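For background on the variables requested above: with features.shard-block-size: 16MB, each write is mapped onto a range of 16 MB shards, and the crash happens while creating an anonymous fd for one of those shards. A small illustrative program follows; the arithmetic is the usual lowest/highest-block math, and the variable names only mirror the ones above rather than quoting shard.c:

    /* Illustrative shard-range arithmetic for a 16 MB shard-block-size.
     * Assumption: the first/last shard of a write are derived as
     * offset / block_size and (offset + size - 1) / block_size. */
    #include <stdio.h>
    #include <inttypes.h>

    int main (void)
    {
            uint64_t block_size  = 16ULL * 1024 * 1024;  /* features.shard-block-size */
            uint64_t offset      = 50ULL * 1024 * 1024;  /* example write offset */
            uint64_t size        = 20ULL * 1024 * 1024;  /* example write length */

            uint64_t first_block = offset / block_size;               /* -> 3 */
            uint64_t last_block  = (offset + size - 1) / block_size;  /* -> 4 */

            /* the write path needs one (possibly anonymous) fd per shard in
             * [first_block, last_block]; a NULL inode for any shard in that
             * range reproduces the fd_anonymous(inode=0x0) crash above */
            printf ("write spans shards %" PRIu64 "..%" PRIu64 " (%" PRIu64 " shards)\n",
                    first_block, last_block, last_block - first_block + 1);
            return 0;
    }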