Glomski, Patrick
2015-Dec-21 20:55 UTC
[Gluster-users] glusterfsd crash due to page allocation failure
Hello,

We've recently upgraded from gluster 3.6.6 to 3.7.6 and have started encountering dmesg page allocation errors (stack trace is appended). It appears that glusterfsd now sometimes fills up the cache completely and crashes with a page allocation failure. I *believe* it mainly happens when copying lots of new data to the system, running a 'find', or similar. Hosts are all Scientific Linux 6.6 and these errors occur consistently on two separate gluster pools.

Has anyone else seen this issue and are there any known fixes for it via sysctl kernel parameters or other means? Please let me know of any other diagnostic information that would help.

Thanks,
Patrick

[1458118.134697] glusterfsd: page allocation failure. order:5, mode:0x20
[1458118.134701] Pid: 6010, comm: glusterfsd Not tainted 2.6.32-573.3.1.el6.x86_64 #1
[1458118.134702] Call Trace:
[1458118.134714] [<ffffffff8113770c>] ? __alloc_pages_nodemask+0x7dc/0x950
[1458118.134728] [<ffffffffa0321800>] ? mlx4_ib_post_send+0x680/0x1f90 [mlx4_ib]
[1458118.134733] [<ffffffff81176e92>] ? kmem_getpages+0x62/0x170
[1458118.134735] [<ffffffff81177aaa>] ? fallback_alloc+0x1ba/0x270
[1458118.134736] [<ffffffff811774ff>] ? cache_grow+0x2cf/0x320
[1458118.134738] [<ffffffff81177829>] ? ____cache_alloc_node+0x99/0x160
[1458118.134743] [<ffffffff8145f732>] ? pskb_expand_head+0x62/0x280
[1458118.134744] [<ffffffff81178479>] ? __kmalloc+0x199/0x230
[1458118.134746] [<ffffffff8145f732>] ? pskb_expand_head+0x62/0x280
[1458118.134748] [<ffffffff8146001a>] ? __pskb_pull_tail+0x2aa/0x360
[1458118.134751] [<ffffffff8146f389>] ? harmonize_features+0x29/0x70
[1458118.134753] [<ffffffff8146f9f4>] ? dev_hard_start_xmit+0x1c4/0x490
[1458118.134758] [<ffffffff8148cf8a>] ? sch_direct_xmit+0x15a/0x1c0
[1458118.134759] [<ffffffff8146ff68>] ? dev_queue_xmit+0x228/0x320
[1458118.134762] [<ffffffff8147665d>] ? neigh_connected_output+0xbd/0x100
[1458118.134766] [<ffffffff814abc67>] ? ip_finish_output+0x287/0x360
[1458118.134767] [<ffffffff814abdf8>] ? ip_output+0xb8/0xc0
[1458118.134769] [<ffffffff814ab04f>] ? __ip_local_out+0x9f/0xb0
[1458118.134770] [<ffffffff814ab085>] ? ip_local_out+0x25/0x30
[1458118.134772] [<ffffffff814ab580>] ? ip_queue_xmit+0x190/0x420
[1458118.134773] [<ffffffff81137059>] ? __alloc_pages_nodemask+0x129/0x950
[1458118.134776] [<ffffffff814c0c54>] ? tcp_transmit_skb+0x4b4/0x8b0
[1458118.134778] [<ffffffff814c319a>] ? tcp_write_xmit+0x1da/0xa90
[1458118.134779] [<ffffffff81178cbd>] ? __kmalloc_node+0x4d/0x60
[1458118.134780] [<ffffffff814c3a80>] ? tcp_push_one+0x30/0x40
[1458118.134782] [<ffffffff814b410c>] ? tcp_sendmsg+0x9cc/0xa20
[1458118.134786] [<ffffffff8145836b>] ? sock_aio_write+0x19b/0x1c0
[1458118.134788] [<ffffffff814581d0>] ? sock_aio_write+0x0/0x1c0
[1458118.134791] [<ffffffff8119169b>] ? do_sync_readv_writev+0xfb/0x140
[1458118.134797] [<ffffffff810a14b0>] ? autoremove_wake_function+0x0/0x40
[1458118.134801] [<ffffffff8123e92f>] ? selinux_file_permission+0xbf/0x150
[1458118.134804] [<ffffffff812316d6>] ? security_file_permission+0x16/0x20
[1458118.134806] [<ffffffff81192746>] ? do_readv_writev+0xd6/0x1f0
[1458118.134807] [<ffffffff811928a6>] ? vfs_writev+0x46/0x60
[1458118.134809] [<ffffffff811929d1>] ? sys_writev+0x51/0xd0
[1458118.134812] [<ffffffff810e88ae>] ? __audit_syscall_exit+0x25e/0x290
[1458118.134816] [<ffffffff8100b0d2>] ? system_call_fastpath+0x16/0x1b
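The failing request above is an order:5 allocation (2^5 pages, i.e. 128 KiB of physically contiguous memory) made from the network transmit path, so it can fail when memory is fragmented even though plenty of memory is free overall. What follows is a minimal sketch of the checks and sysctl tuning commonly tried for this class of failure; the specific values are illustrative assumptions, not settings recommended anywhere in this thread.

    # Show free pages per order on each node; sparse counts in the higher-order
    # columns indicate the fragmentation that makes order:5 requests fail.
    cat /proc/buddyinfo

    # Keep a larger reserve of free memory so reclaim/compaction kicks in
    # earlier (256 MiB is an assumed value; size it to the host's RAM).
    sysctl -w vm.min_free_kbytes=262144

    # Reclaim dentry/inode caches more aggressively; metadata-heavy jobs such
    # as the 'find' mentioned above tend to grow them.
    sysctl -w vm.vfs_cache_pressure=200

    # Persist the settings across reboots.
    echo 'vm.min_free_kbytes = 262144'  >> /etc/sysctl.conf
    echo 'vm.vfs_cache_pressure = 200'  >> /etc/sysctl.conf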
Pranith Kumar Karampuri
2015-Dec-22 04:59 UTC
[Gluster-users] [Gluster-devel] glusterfsd crash due to page allocation failure
Hi Glomski,

This is the second time I am hearing about memory allocation problems in 3.7.6, but this time on the brick side. Are you able to recreate this issue? Would it be possible to get statedumps of the brick processes just before they crash?

Pranith

On 12/22/2015 02:25 AM, Glomski, Patrick wrote:
> Hello,
>
> We've recently upgraded from gluster 3.6.6 to 3.7.6 and have started
> encountering dmesg page allocation errors (stack trace is appended).
>
> It appears that glusterfsd now sometimes fills up the cache completely
> and crashes with a page allocation failure. I *believe* it mainly
> happens when copying lots of new data to the system, running a 'find',
> or similar. Hosts are all Scientific Linux 6.6 and these errors occur
> consistently on two separate gluster pools.
>
> Has anyone else seen this issue and are there any known fixes for it
> via sysctl kernel parameters or other means?
>
> Please let me know of any other diagnostic information that would help.
>
> Thanks,
> Patrick
>
> [kernel stack trace snipped]
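For reference, statedumps of the brick processes can be generated on demand without stopping them. A minimal sketch, assuming a hypothetical volume name VOLNAME and the default dump directory:

    # Ask every brick process of the volume to write a statedump; the files
    # land in /var/run/gluster by default (server.statedump-path changes this).
    gluster volume statedump VOLNAME

    # Or target a single brick: SIGUSR1 makes a gluster process dump its state.
    pgrep -f 'glusterfsd.*VOLNAME'   # find the brick PID
    kill -USR1 <PID>

    # The newest *.dump.* files are the statedumps; capturing them shortly
    # before a crash is what is being asked for above.
    ls -lt /var/run/gluster/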
Niels de Vos
2015-Dec-22 16:38 UTC
[Gluster-users] [Gluster-devel] glusterfsd crash due to page allocation failure
On Mon, Dec 21, 2015 at 03:55:08PM -0500, Glomski, Patrick wrote:
> Hello,
>
> We've recently upgraded from gluster 3.6.6 to 3.7.6 and have started
> encountering dmesg page allocation errors (stack trace is appended).
>
> It appears that glusterfsd now sometimes fills up the cache completely and
> crashes with a page allocation failure. I *believe* it mainly happens when
> copying lots of new data to the system, running a 'find', or similar. Hosts
> are all Scientific Linux 6.6 and these errors occur consistently on two
> separate gluster pools.
>
> Has anyone else seen this issue and are there any known fixes for it via
> sysctl kernel parameters or other means?
>
> Please let me know of any other diagnostic information that would help.

Could you explain a little more about this? The trace below is a message from the kernel telling you that the mlx4_ib (Mellanox Infiniband?) driver is requesting more contiguous memory than is immediately available. So, the questions I have regarding this:

1. How is Infiniband involved/configured in this environment?
2. Was there a change/update of the driver (a kernel update, maybe)?
3. Do you get a coredump of the glusterfsd process when this happens?
4. Is this a FUSE mount process or a brick process? (check by PID?)

Thanks,
Niels

> > [1458118.134697] glusterfsd: page allocation failure. order:5, mode:0x20
> > [kernel stack trace snipped]
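The transport and process-type questions can be answered from the shell. A sketch follows; PID 6010 is taken from the kernel message above, and the core locations are typical EL6 defaults, not paths confirmed in this thread.

    # Whether the volumes run over RDMA/Infiniband or plain TCP.
    gluster volume info | grep -i transport

    # The command line distinguishes the two process types: a brick typically
    # runs as glusterfsd with --volfile-id <vol>.<host>.<brick-path>, while a
    # FUSE client runs as glusterfs with --volfile-server=... and the mount
    # point as its last argument.
    ps -fp 6010

    # If the process really crashed, see where cores are written and whether
    # one exists from around the time of the dmesg entry.
    cat /proc/sys/kernel/core_pattern
    ls -l /core* /var/spool/abrt 2>/dev/null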