thr3ads.net - Gluster users - [Gluster-users] [Gluster-devel] glusterfsd crash due to page allocation failure [Dec 2015]

Pranith Kumar Karampuri

2015-Dec-22 04:59 UTC

[Gluster-users] [Gluster-devel] glusterfsd crash due to page allocation failure

Hi Glomski,
         This is the second time I am hearing about memory allocation
problems in 3.7.6, but this time on the brick side. Are you able to
recreate this issue? Would it be possible to get statedumps of the brick
processes just before they crash?

Pranith

On 12/22/2015 02:25 AM, Glomski, Patrick wrote:
> Hello,
>
> We've recently upgraded from gluster 3.6.6 to 3.7.6 and have started 
> encountering dmesg page allocation errors (stack trace is appended).
>
> It appears that glusterfsd now sometimes fills up the cache completely 
> and crashes with a page allocation failure. I *believe* it mainly 
> happens when copying lots of new data to the system, running a 'find',
> or similar. Hosts are all Scientific Linux 6.6 and these errors occur 
> consistently on two separate gluster pools.
>
> Has anyone else seen this issue and are there any known fixes for it 
> via sysctl kernel parameters or other means?
>
> Please let me know of any other diagnostic information that would help.
>
> Thanks,
> Patrick
>
>
>     [1458118.134697] glusterfsd: page allocation failure. order:5, mode:0x20
>     [1458118.134701] Pid: 6010, comm: glusterfsd Not tainted 2.6.32-573.3.1.el6.x86_64 #1
>     [1458118.134702] Call Trace:
>     [1458118.134714]  [<ffffffff8113770c>] ? __alloc_pages_nodemask+0x7dc/0x950
>     [1458118.134728]  [<ffffffffa0321800>] ? mlx4_ib_post_send+0x680/0x1f90 [mlx4_ib]
>     [1458118.134733]  [<ffffffff81176e92>] ? kmem_getpages+0x62/0x170
>     [1458118.134735]  [<ffffffff81177aaa>] ? fallback_alloc+0x1ba/0x270
>     [1458118.134736]  [<ffffffff811774ff>] ? cache_grow+0x2cf/0x320
>     [1458118.134738]  [<ffffffff81177829>] ? ____cache_alloc_node+0x99/0x160
>     [1458118.134743]  [<ffffffff8145f732>] ? pskb_expand_head+0x62/0x280
>     [1458118.134744]  [<ffffffff81178479>] ? __kmalloc+0x199/0x230
>     [1458118.134746]  [<ffffffff8145f732>] ? pskb_expand_head+0x62/0x280
>     [1458118.134748]  [<ffffffff8146001a>] ? __pskb_pull_tail+0x2aa/0x360
>     [1458118.134751]  [<ffffffff8146f389>] ? harmonize_features+0x29/0x70
>     [1458118.134753]  [<ffffffff8146f9f4>] ? dev_hard_start_xmit+0x1c4/0x490
>     [1458118.134758]  [<ffffffff8148cf8a>] ? sch_direct_xmit+0x15a/0x1c0
>     [1458118.134759]  [<ffffffff8146ff68>] ? dev_queue_xmit+0x228/0x320
>     [1458118.134762]  [<ffffffff8147665d>] ? neigh_connected_output+0xbd/0x100
>     [1458118.134766]  [<ffffffff814abc67>] ? ip_finish_output+0x287/0x360
>     [1458118.134767]  [<ffffffff814abdf8>] ? ip_output+0xb8/0xc0
>     [1458118.134769]  [<ffffffff814ab04f>] ? __ip_local_out+0x9f/0xb0
>     [1458118.134770]  [<ffffffff814ab085>] ? ip_local_out+0x25/0x30
>     [1458118.134772]  [<ffffffff814ab580>] ? ip_queue_xmit+0x190/0x420
>     [1458118.134773]  [<ffffffff81137059>] ? __alloc_pages_nodemask+0x129/0x950
>     [1458118.134776]  [<ffffffff814c0c54>] ? tcp_transmit_skb+0x4b4/0x8b0
>     [1458118.134778]  [<ffffffff814c319a>] ? tcp_write_xmit+0x1da/0xa90
>     [1458118.134779]  [<ffffffff81178cbd>] ? __kmalloc_node+0x4d/0x60
>     [1458118.134780]  [<ffffffff814c3a80>] ? tcp_push_one+0x30/0x40
>     [1458118.134782]  [<ffffffff814b410c>] ? tcp_sendmsg+0x9cc/0xa20
>     [1458118.134786]  [<ffffffff8145836b>] ? sock_aio_write+0x19b/0x1c0
>     [1458118.134788]  [<ffffffff814581d0>] ? sock_aio_write+0x0/0x1c0
>     [1458118.134791]  [<ffffffff8119169b>] ? do_sync_readv_writev+0xfb/0x140
>     [1458118.134797]  [<ffffffff810a14b0>] ? autoremove_wake_function+0x0/0x40
>     [1458118.134801]  [<ffffffff8123e92f>] ? selinux_file_permission+0xbf/0x150
>     [1458118.134804]  [<ffffffff812316d6>] ? security_file_permission+0x16/0x20
>     [1458118.134806]  [<ffffffff81192746>] ? do_readv_writev+0xd6/0x1f0
>     [1458118.134807]  [<ffffffff811928a6>] ? vfs_writev+0x46/0x60
>     [1458118.134809]  [<ffffffff811929d1>] ? sys_writev+0x51/0xd0
>     [1458118.134812]  [<ffffffff810e88ae>] ? __audit_syscall_exit+0x25e/0x290
>     [1458118.134816]  [<ffffffff8100b0d2>] ? system_call_fastpath+0x16/0x1b
>
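A note for readers decoding the first line of the trace above: the order is a power-of-two page count, so order:5 is a request for 32 physically contiguous pages (128 KiB, assuming the 4 KiB base page size of x86_64). The sketch below parses such a line and names the mode bits. The flag table is an assumption taken from the mainline 2.6.32 gfp.h layout, where 0x20 is __GFP_HIGH (the value of GFP_ATOMIC) and a clear __GFP_WAIT bit means the allocation was not allowed to sleep while memory was reclaimed; a vendor kernel may differ, so check its headers. The helper name decode_failure is hypothetical.

```python
import re

PAGE_SIZE = 4096  # bytes; assumption: x86_64 base page size

# Assumption: low GFP bits as laid out in mainline 2.6.32 include/linux/gfp.h.
GFP_FLAGS = {
    0x10: "__GFP_WAIT",   # caller may sleep / reclaim
    0x20: "__GFP_HIGH",   # may dip into emergency reserves (GFP_ATOMIC)
    0x40: "__GFP_IO",
    0x80: "__GFP_FS",
}

LINE_RE = re.compile(
    r"(\S+): page allocation failure\. order:(\d+),\s*mode:(0x[0-9a-fA-F]+)"
)


def decode_failure(line):
    """Decode a 'page allocation failure' dmesg line into sizes and flag names."""
    match = LINE_RE.search(line)
    if match is None:
        raise ValueError("not a page allocation failure line")
    order = int(match.group(2))
    mode = int(match.group(3), 16)
    pages = 1 << order  # order N means 2**N contiguous pages
    return {
        "comm": match.group(1),
        "order": order,
        "pages": pages,
        "bytes": pages * PAGE_SIZE,
        "flags": [name for bit, name in sorted(GFP_FLAGS.items()) if mode & bit],
        "may_sleep": bool(mode & 0x10),
    }


if __name__ == "__main__":
    line = ("[1458118.134697] glusterfsd: page allocation failure. "
            "order:5, mode:0x20")
    print(decode_failure(line))
```

Read this way, the failure is an atomic request for a large contiguous block made on the network transmit path, which can fail on a fragmented system even when plenty of memory is free in smaller pieces.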
>
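On the question of other diagnostic information: since the failing request is order:5, one thing that might be worth capturing on an affected host is /proc/buddyinfo, which lists per zone how many free blocks exist at each order (column 0 is order 0, and so on). The sketch below parses that format and counts free blocks at or above a given order. The SAMPLE text is made-up illustrative data, not output from the hosts in this thread, and the function names are hypothetical.

```python
# Illustrative data only; real input would come from /proc/buddyinfo.
SAMPLE = """\
Node 0, zone      DMA      2      1      1      0      2      1      1      0      1      1      3
Node 0, zone   Normal   9000   4000    800     60      5      0      0      0      0      0      0
"""


def parse_buddyinfo(text):
    """Map (node, zone) to the list of free-block counts, indexed by order."""
    zones = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 5 or parts[0] != "Node" or parts[2] != "zone":
            continue  # skip anything that is not a zone line
        node = int(parts[1].rstrip(","))
        zones[(node, parts[3])] = [int(count) for count in parts[4:]]
    return zones


def free_blocks_at_or_above(counts, order):
    """Number of free blocks large enough to satisfy a request of this order."""
    return sum(counts[order:])


def read_buddyinfo(path="/proc/buddyinfo"):
    """Read and parse the live file (Linux only)."""
    with open(path) as handle:
        return parse_buddyinfo(handle.read())


if __name__ == "__main__":
    for (node, zone), counts in parse_buddyinfo(SAMPLE).items():
        print(node, zone, free_blocks_at_or_above(counts, 5))
```

A zone that shows many free low-order blocks but none at order 5 or higher would be consistent with fragmentation rather than outright memory exhaustion, though that alone does not identify the cause.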

David Robinson

2015-Dec-22 15:40 UTC

[Gluster-users] [Gluster-devel] glusterfsd crash due to page allocation failure

Pranith,

This issue continues to happen.  If you could provide instructions for 
getting you the statedump, I would be happy to send that information.
I am not sure how to get a statedump just before the crash as the crash 
is intermittent.

David
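Because the crash is intermittent, one possible approach is to trigger statedumps on a timer so that a recent one exists whenever the failure happens. This is a minimal sketch, not an official procedure: it assumes the `gluster volume statedump <VOLNAME>` CLI command is available and run with sufficient privileges on a server node. The volume name gv0, the interval, and the helper names are hypothetical. Dump files are written on each brick host, commonly under /var/run/gluster unless the server.statedump-path volume option says otherwise; treat that location as an assumption to verify.

```python
import subprocess
import time


def statedump_command(volume):
    """Build the gluster CLI call that asks a volume's bricks to dump state."""
    return ["gluster", "volume", "statedump", volume]


def capture_periodically(volume, interval_s, count,
                         run=subprocess.run, sleep=time.sleep):
    """Trigger `count` statedumps, `interval_s` seconds apart.

    `run` and `sleep` are injectable so the loop can be exercised
    without a gluster installation. Returns the commands issued.
    """
    issued = []
    for i in range(count):
        cmd = statedump_command(volume)
        run(cmd, check=True)  # raises CalledProcessError if the CLI fails
        issued.append(cmd)
        if i < count - 1:
            sleep(interval_s)
    return issued


if __name__ == "__main__":
    # Hypothetical values: volume "gv0", one dump every 5 minutes for an hour.
    capture_periodically("gv0", interval_s=300, count=12)
```

Each dump adds files on the brick hosts, so old ones would need pruning if this runs for long.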


