Hi,

I've been suffering from a strange NFS-related network issue for a while.

The issue shows up when copying from dom0 to domU through an NFS mount.
After a short while the transfer suddenly freezes and the domU network
stops responding entirely. Force-unmounting the NFS mount generally
resolves the freeze, but sometimes you are out of luck and the trick
does not work.

Luckily, I captured the following log in a recent instance. It appears
to be a deadlock when netback tries to get some free pages from NFS.
I'm not sure if this is the whole story. Any suggestion on how to solve
the issue?

Thanks,
Timothy

Apr 11 21:22:27 gaia kernel: [429242.015643] INFO: task netback/0:2255 blocked for more than 120 seconds.
Apr 11 21:22:27 gaia kernel: [429242.015665] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr 11 21:22:27 gaia kernel: [429242.015690] netback/0 D ffff880210213900 0 2255 2 0x00000000
Apr 11 21:22:27 gaia kernel: [429242.015693] ffff8801fee04ea0 0000000000000246 0000000000000000 ffffffff818133f0
Apr 11 21:22:27 gaia kernel: [429242.015697] 0000000000013900 ffff8801fed87fd8 ffff8801fed87fd8 ffff8801fee04ea0
Apr 11 21:22:27 gaia kernel: [429242.015700] ffff8801fed87488 ffff880210213900 ffff8801fee04ea0 ffff8801fed87488
Apr 11 21:22:27 gaia kernel: [429242.015703] Call Trace:
Apr 11 21:22:27 gaia kernel: [429242.015711] [<ffffffff810c1bb5>] ? __lock_page+0x66/0x66
Apr 11 21:22:27 gaia kernel: [429242.015715] [<ffffffff814d06cb>] ? io_schedule+0x55/0x6b
Apr 11 21:22:27 gaia kernel: [429242.015718] [<ffffffff810c1bbc>] ? sleep_on_page+0x7/0xc
Apr 11 21:22:27 gaia kernel: [429242.015720] [<ffffffff814cf6c0>] ? __wait_on_bit_lock+0x3c/0x85
Apr 11 21:22:27 gaia kernel: [429242.015723] [<ffffffff810c3f7a>] ? find_get_pages+0xea/0x100
Apr 11 21:22:27 gaia kernel: [429242.015726] [<ffffffff810c1bb0>] ? __lock_page+0x61/0x66
Apr 11 21:22:27 gaia kernel: [429242.015729] [<ffffffff81058364>] ? autoremove_wake_function+0x2a/0x2a
Apr 11 21:22:27 gaia kernel: [429242.015732] [<ffffffff810cd110>] ? truncate_inode_pages_range+0x28b/0x2f8
Apr 11 21:22:27 gaia kernel: [429242.015737] [<ffffffff811c91d2>] ? nfs_evict_inode+0x12/0x23
Apr 11 21:22:27 gaia kernel: [429242.015740] [<ffffffff8111cdae>] ? evict+0xa3/0x153
Apr 11 21:22:27 gaia kernel: [429242.015743] [<ffffffff8111ce85>] ? dispose_list+0x27/0x31
Apr 11 21:22:27 gaia kernel: [429242.015746] [<ffffffff8111db6b>] ? evict_inodes+0xe7/0xf4
Apr 11 21:22:27 gaia kernel: [429242.015749] [<ffffffff8110b3af>] ? generic_shutdown_super+0x3e/0xc5
Apr 11 21:22:27 gaia kernel: [429242.015752] [<ffffffff8110b49e>] ? kill_anon_super+0x9/0x11
Apr 11 21:22:27 gaia kernel: [429242.015755] [<ffffffff811ca7b0>] ? nfs_kill_super+0xd/0x16
Apr 11 21:22:27 gaia kernel: [429242.015758] [<ffffffff8110b717>] ? deactivate_locked_super+0x2c/0x5c
Apr 11 21:22:27 gaia kernel: [429242.015761] [<ffffffff811c901d>] ? __put_nfs_open_context+0xbf/0xe1
Apr 11 21:22:27 gaia kernel: [429242.015764] [<ffffffff811d07db>] ? nfs_commitdata_release+0x10/0x19
Apr 11 21:22:27 gaia kernel: [429242.015766] [<ffffffff811d0f8c>] ? nfs_initiate_commit+0xd9/0xe4
Apr 11 21:22:27 gaia kernel: [429242.015769] [<ffffffff811d1bae>] ? nfs_commit_inode+0x81/0x111
Apr 11 21:22:27 gaia kernel: [429242.015772] [<ffffffff811c86f4>] ? nfs_release_page+0x40/0x4f
Apr 11 21:22:27 gaia kernel: [429242.015775] [<ffffffff810d0940>] ? shrink_page_list+0x4f5/0x6d8
Apr 11 21:22:27 gaia kernel: [429242.015780] [<ffffffff810d0f03>] ? shrink_inactive_list+0x1dd/0x33f
Apr 11 21:22:27 gaia kernel: [429242.015783] [<ffffffff810d15fa>] ? shrink_lruvec+0x2e0/0x44d
Apr 11 21:22:27 gaia kernel: [429242.015787] [<ffffffff810d17ba>] ? shrink_zone+0x53/0x8a
Apr 11 21:22:27 gaia kernel: [429242.015790] [<ffffffff810d1bcd>] ? do_try_to_free_pages+0x1c6/0x3f4
Apr 11 21:22:27 gaia kernel: [429242.015794] [<ffffffff810d20a3>] ? try_to_free_pages+0xc4/0x11e
Apr 11 21:22:27 gaia kernel: [429242.015797] [<ffffffff810c9018>] ? __alloc_pages_nodemask+0x440/0x72f
Apr 11 21:22:27 gaia kernel: [429242.015801] [<ffffffff810f592d>] ? alloc_pages_current+0xb2/0xcd
Apr 11 21:22:27 gaia kernel: [429242.015805] [<ffffffffa01be188>] ? xen_netbk_alloc_page.isra.17+0x15/0x4d [xen_netback]
Apr 11 21:22:27 gaia kernel: [429242.015808] [<ffffffffa01bf30c>] ? xen_netbk_tx_build_gops+0x3fc/0x7ab [xen_netback]
Apr 11 21:22:27 gaia kernel: [429242.015812] [<ffffffff812c0068>] ? pnp_show_options+0x43e/0x482
Apr 11 21:22:27 gaia kernel: [429242.015816] [<ffffffffa01beb3f>] ? xen_netbk_rx_action+0x688/0x6f8 [xen_netback]
Apr 11 21:22:27 gaia kernel: [429242.015819] [<ffffffff81061198>] ? finish_task_switch+0x4c/0x7c
Apr 11 21:22:27 gaia kernel: [429242.015823] [<ffffffffa01bf7e9>] ? xen_netbk_kthread+0x12e/0x790 [xen_netback]
Apr 11 21:22:27 gaia kernel: [429242.015826] [<ffffffff8105833a>] ? abort_exclusive_wait+0x79/0x79
Apr 11 21:22:27 gaia kernel: [429242.015829] [<ffffffffa01bf6bb>] ? xen_netbk_tx_build_gops+0x7ab/0x7ab [xen_netback]
Apr 11 21:22:27 gaia kernel: [429242.015832] [<ffffffffa01bf6bb>] ? xen_netbk_tx_build_gops+0x7ab/0x7ab [xen_netback]
Apr 11 21:22:27 gaia kernel: [429242.015834] [<ffffffff81057a31>] ? kthread+0x67/0x6f
Apr 11 21:22:27 gaia kernel: [429242.015838] [<ffffffff814d6904>] ? kernel_thread_helper+0x4/0x10
Apr 11 21:22:27 gaia kernel: [429242.015841] [<ffffffff814d143c>] ? retint_restore_args+0x5/0x6
Apr 11 21:22:27 gaia kernel: [429242.015844] [<ffffffff814d6900>] ? gs_change+0x13/0x13
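For reference, reports like the one above come from the kernel's hung-task watchdog, which flags tasks stuck in uninterruptible sleep for longer than the configured timeout. If more instances need to be captured from dom0, something along these lines should work (the sysctl and sysrq facilities are standard, but treat the exact values as illustrative, and note that sysrq must be enabled):

  # dump all blocked (D-state) tasks on demand, if sysrq is enabled
  echo w > /proc/sysrq-trigger
  # report hung tasks more often than the default 120 seconds
  echo 60 > /proc/sys/kernel/hung_task_timeout_secs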
Hi Konrad,

Do you have any solution to this?

Thanks,
Timothy

On Thu, Apr 11, 2013 at 9:55 PM, G.R. <firemeteor@users.sourceforge.net> wrote:
> Hi,
> I've been suffering from a strange NFS-related network issue for a while.
>
> The issue shows up when copying from dom0 to domU through an NFS mount.
> After a short while the transfer suddenly freezes and the domU network
> stops responding entirely. Force-unmounting the NFS mount generally
> resolves the freeze, but sometimes you are out of luck and the trick
> does not work.
>
> Luckily, I captured the following log in a recent instance. It appears
> to be a deadlock when netback tries to get some free pages from NFS.
> I'm not sure if this is the whole story. Any suggestion on how to
> solve the issue?
>
> Thanks,
> Timothy
>
> [full hung-task trace quoted above]
On Thu, Apr 11, 2013 at 02:55:48PM +0100, G.R. wrote:
> Hi,
> I've been suffering from a strange NFS-related network issue for a while.
>
> The issue shows up when copying from dom0 to domU through an NFS mount.
> After a short while the transfer suddenly freezes and the domU network
> stops responding entirely. Force-unmounting the NFS mount generally
> resolves the freeze, but sometimes you are out of luck and the trick
> does not work.
>
> Luckily, I captured the following log in a recent instance. It appears
> to be a deadlock when netback tries to get some free pages from NFS.
> I'm not sure if this is the whole story. Any suggestion on how to
> solve the issue?
>

Ian, does this look familiar to you? IIRC you once discovered some NFS
bug related to SKB life cycle tracking.

Wei.
On Thu, 2013-04-11 at 17:39 +0100, Wei Liu wrote:
> On Thu, Apr 11, 2013 at 02:55:48PM +0100, G.R. wrote:
> > Hi,
> > I've been suffering from a strange NFS-related network issue for a while.
> >
> > The issue shows up when copying from dom0 to domU through an NFS mount.
> > After a short while the transfer suddenly freezes and the domU network
> > stops responding entirely. Force-unmounting the NFS mount generally
> > resolves the freeze, but sometimes you are out of luck and the trick
> > does not work.
> >
> > Luckily, I captured the following log in a recent instance. It appears
> > to be a deadlock when netback tries to get some free pages from NFS.
> > I'm not sure if this is the whole story. Any suggestion on how to
> > solve the issue?
>
> Ian, does this look familiar to you? IIRC you once discovered some NFS
> bug related to SKB life cycle tracking.

Not with upstream netback, which does grant copy, not grant map.

Ian.
On Thu, Apr 11, 2013 at 02:55:48PM +0100, G.R. wrote:
> Hi,
> I've been suffering from a strange NFS-related network issue for a while.
>
> The issue shows up when copying from dom0 to domU through an NFS mount.
> After a short while the transfer suddenly freezes and the domU network
> stops responding entirely. Force-unmounting the NFS mount generally
> resolves the freeze, but sometimes you are out of luck and the trick
> does not work.
>
> Luckily, I captured the following log in a recent instance. It appears
> to be a deadlock when netback tries to get some free pages from NFS.
> I'm not sure if this is the whole story. Any suggestion on how to
> solve the issue?
>

Unfortunately I cannot reproduce this at the moment. Please provide
detailed configuration and steps.

I use NFS to test my netback / netfront patches, but I've never seen
anything like this. In my case I mostly export a dir in Dom0, then DomU
writes to it. Presumably in your case DomU exports a dir, then Dom0
writes to it? I did a simple test for both cases and couldn't see this
problem.

Wei.
On Thu, Apr 11, 2013 at 02:55:48PM +0100, G.R. wrote:
> Hi,
> I've been suffering from a strange NFS-related network issue for a while.
>
> The issue shows up when copying from dom0 to domU through an NFS mount.
> After a short while the transfer suddenly freezes and the domU network
> stops responding entirely. Force-unmounting the NFS mount generally
> resolves the freeze, but sometimes you are out of luck and the trick
> does not work.
>
> Luckily, I captured the following log in a recent instance. It appears
> to be a deadlock when netback tries to get some free pages from NFS.
> I'm not sure if this is the whole story. Any suggestion on how to
> solve the issue?
>

BTW, xen_netbk_alloc_page tries to allocate a page from the generic page
pool. It is not specific to NFS.

> Apr 11 21:22:27 gaia kernel: [429242.015766] [<ffffffff811d0f8c>] ? nfs_initiate_commit+0xd9/0xe4
> Apr 11 21:22:27 gaia kernel: [429242.015769] [<ffffffff811d1bae>] ? nfs_commit_inode+0x81/0x111
> Apr 11 21:22:27 gaia kernel: [429242.015772] [<ffffffff811c86f4>] ? nfs_release_page+0x40/0x4f
> Apr 11 21:22:27 gaia kernel: [429242.015775] [<ffffffff810d0940>] ? shrink_page_list+0x4f5/0x6d8
> Apr 11 21:22:27 gaia kernel: [429242.015780] [<ffffffff810d0f03>] ? shrink_inactive_list+0x1dd/0x33f
> Apr 11 21:22:27 gaia kernel: [429242.015783] [<ffffffff810d15fa>] ? shrink_lruvec+0x2e0/0x44d
> Apr 11 21:22:27 gaia kernel: [429242.015787] [<ffffffff810d17ba>] ? shrink_zone+0x53/0x8a
> Apr 11 21:22:27 gaia kernel: [429242.015790] [<ffffffff810d1bcd>] ? do_try_to_free_pages+0x1c6/0x3f4
> Apr 11 21:22:27 gaia kernel: [429242.015794] [<ffffffff810d20a3>] ? try_to_free_pages+0xc4/0x11e
> Apr 11 21:22:27 gaia kernel: [429242.015797] [<ffffffff810c9018>] ? __alloc_pages_nodemask+0x440/0x72f
> Apr 11 21:22:27 gaia kernel: [429242.015801] [<ffffffff810f592d>] ? alloc_pages_current+0xb2/0xcd

Judging from the stack trace above, it looks like the system is trying
to squeeze some memory out of NFS. Probably it is just that your system
is suffering from OOM? Then NFS failed to commit its changes to disk for
some reason and hung.

Wei.
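One way to test the memory-pressure theory is simply to watch dom0's memory counters while the copy runs; the /proc/meminfo fields below are present on 3.x kernels, and the one-second interval is arbitrary:

  # watch free memory and outstanding NFS writeback in dom0 during the copy
  watch -n1 'grep -E "MemFree|Dirty|Writeback|NFS_Unstable" /proc/meminfo'
  # or log general VM activity alongside it
  vmstat 1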
On Fri, Apr 12, 2013 at 1:38 AM, Wei Liu <wei.liu2@citrix.com> wrote:
> On Thu, Apr 11, 2013 at 02:55:48PM +0100, G.R. wrote:
>> Hi,
>> I've been suffering from a strange NFS-related network issue for a while.
>>
>> The issue shows up when copying from dom0 to domU through an NFS mount.
>> After a short while the transfer suddenly freezes and the domU network
>> stops responding entirely. Force-unmounting the NFS mount generally
>> resolves the freeze, but sometimes you are out of luck and the trick
>> does not work.
>>
>> Luckily, I captured the following log in a recent instance. It appears
>> to be a deadlock when netback tries to get some free pages from NFS.
>> I'm not sure if this is the whole story. Any suggestion on how to
>> solve the issue?
>>
>
> BTW, xen_netbk_alloc_page tries to allocate a page from the generic page
> pool. It is not specific to NFS.
>
> Judging from the stack trace above, it looks like the system is trying
> to squeeze some memory out of NFS. Probably it is just that your system
> is suffering from OOM? Then NFS failed to commit its changes to disk for
> some reason and hung.
>

Yes, it is not specific to NFS pages; I'm just unlucky enough that the
reclaim path ran into them. I agree with your suspicion: the chance
depends on the memory pressure in dom0.
So here is a proper setup to reproduce the issue:
1. dom0 with swap disabled and with limited memory allocated.
2. domU serves the storage and exports it over NFS.
3. dom0 mounts the domU storage and writes to it.
4. You need to achieve high speed to expose this issue.

In my case, domU owns a dedicated SATA controller, so there is no
blkback overhead. I am not sure whether this is an important factor in
achieving high speed. The transfer is a normal file copy rather than
O_SYNC / O_DIRECT access, so the data can be cached on the client side
for a short period. Finally, the transfer speed and the memory size are
crucial.

With 4GB of memory allocated to dom0, I can copy a file (> 2GB) from a
USB2 port without problem at about 32MB/s. But using a USB3 port, the
same copy generally gets stuck at around 1.2GB, and a 'dd if=/dev/zero'
gets stuck even quicker. With around 1-2GB of memory for dom0 the
freeze happens much earlier, but I did not check the exact point.

I'm on a custom build of the Xen 4.2.1 testing release (built around
Jan this year?), with some patches related to graphics pass-through,
but I guess those patches are not relevant. The dom0 kernel is version
3.6.11, 64-bit.

One thing I forgot to mention is a possible sign of memory leakage. I'm
not very sure about it, but my dom0 reported OOM several days ago. I
typically don't use dom0 for anything other than serving backends, and
the allocated memory should be around 2GB, which should be plenty for a
dom0. Are there any known leak bugs out there?

Thanks,
Timothy
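For anyone trying to reproduce this, the setup described above roughly translates to the following; the memory size, export address and paths are placeholders, and dom0_mem on the Xen command line is the usual way to cap dom0's memory:

  # Xen command line (in the bootloader entry): dom0_mem=2048M,max:2048M
  # in dom0: disable swap so reclaim has nowhere else to go
  swapoff -a
  # mount the NFS export served by the domU (address and path are made up)
  mount -t nfs 192.168.0.2:/export /mnt/domu
  # write a file larger than dom0's memory to pile up dirty NFS pages
  dd if=/dev/zero of=/mnt/domu/bigfile bs=1M count=8192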
On Fri, Apr 12, 2013 at 04:10:34AM +0100, G.R. wrote:
>
> Yes, it is not specific to NFS pages; I'm just unlucky enough that the
> reclaim path ran into them. I agree with your suspicion: the chance
> depends on the memory pressure in dom0.
> So here is a proper setup to reproduce the issue:
> 1. dom0 with swap disabled and with limited memory allocated.
> 2. domU serves the storage and exports it over NFS.
> 3. dom0 mounts the domU storage and writes to it.
> 4. You need to achieve high speed to expose this issue.
>

My setup is almost the same. The write speed is around 35-45MB/s if I do:

dd if=/dev/zero of=/mnt/t bs=1 count=200

However, if I do count=2000, the speed slows down to 24MB/s. I suspect
that's the memory pressure in Dom0 -- my Dom0 only has 1024MB RAM. But I
still didn't see any error.

> In my case, domU owns a dedicated SATA controller, so there is no
> blkback overhead. I am not sure whether this is an important factor in
> achieving high speed. The transfer is a normal file copy rather than
> O_SYNC / O_DIRECT access, so the data can be cached on the client side
> for a short period. Finally, the transfer speed and the memory size are
> crucial.
>
> With 4GB of memory allocated to dom0, I can copy a file (> 2GB) from a
> USB2 port without problem at about 32MB/s. But using a USB3 port, the
> same copy generally gets stuck at around 1.2GB, and a 'dd if=/dev/zero'
> gets stuck even quicker. With around 1-2GB of memory for dom0 the
> freeze happens much earlier, but I did not check the exact point.
>

And I also use iperf, which can achieve 7GB/s transfer between Dom0 and
DomU -- presumably that's fast enough?

> I'm on a custom build of the Xen 4.2.1 testing release (built around
> Jan this year?), with some patches related to graphics pass-through,
> but I guess those patches are not relevant. The dom0 kernel is version
> 3.6.11, 64-bit.
>
> One thing I forgot to mention is a possible sign of memory leakage. I'm
> not very sure about it, but my dom0 reported OOM several days ago. I
> typically don't use dom0 for anything other than serving backends, and
> the allocated memory should be around 2GB, which should be plenty for a
> dom0. Are there any known leak bugs out there?
>

Not that I know of; page allocation / deallocation in netback is quite
simple.

Wei.
On Fri, Apr 12, 2013 at 5:27 PM, Wei Liu <wei.liu2@citrix.com> wrote:
> On Fri, Apr 12, 2013 at 04:10:34AM +0100, G.R. wrote:
>>
>> Yes, it is not specific to NFS pages; I'm just unlucky enough that the
>> reclaim path ran into them. I agree with your suspicion: the chance
>> depends on the memory pressure in dom0.
>> So here is a proper setup to reproduce the issue:
>> 1. dom0 with swap disabled and with limited memory allocated.
>> 2. domU serves the storage and exports it over NFS.
>> 3. dom0 mounts the domU storage and writes to it.
>> 4. You need to achieve high speed to expose this issue.
>>
>
> My setup is almost the same. The write speed is around 35-45MB/s if I do:
>
> dd if=/dev/zero of=/mnt/t bs=1 count=200
>
> However, if I do count=2000, the speed slows down to 24MB/s. I suspect
> that's the memory pressure in Dom0 -- my Dom0 only has 1024MB RAM. But I
> still didn't see any error.
>

That's weird -- the stack trace proves that the issue exists, and the
issue stands theoretically. But why is it so common in my build and yet
cannot be reproduced in yours? There must be some factor we are missing
here. Is there any kernel config that affects the memory management
behavior in dom0? What's your dom0 kernel version? Is there anything
that could matter in the NFS config? Do you enable memory ballooning for
dom0? I do -- but does it matter?

I still believe the key factor is to stress the memory. Maybe you can
try further limiting the memory size and using a larger file.

I have become uncertain about how the transfer speed affects this. I can
achieve 10GB/s in an iperf test without issue, and an FTP transfer also
works without problem at 50MB/s. But maybe the higher network speed is a
negative factor here -- NFS may be able to commit changes faster.
Probably we should feed data faster than NFS can handle so that memory
is used up quickly? But the back pressure from downstream should slow
down the speed at which upstream eats the memory. How does the
throttling work? Is there any way to control it?

I'll check why my dom0 reported OOM; maybe that's one factor too.

Thanks,
Timothy
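On the throttling question above: how many dirty NFS pages dom0 is allowed to accumulate before the writer is throttled is governed by the ordinary dirty-writeback sysctls, so one way to experiment is to force writeback to start earlier and block the writer sooner. The values below are only examples, not recommendations:

  # start background writeback sooner and throttle writers at a lower threshold
  sysctl -w vm.dirty_background_ratio=2
  sysctl -w vm.dirty_ratio=5
  # keep a larger reserve of free pages for critical allocations
  sysctl -w vm.min_free_kbytes=65536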
On Fri, Apr 12, 2013 at 11:34:50AM +0100, G.R. wrote:
> On Fri, Apr 12, 2013 at 5:27 PM, Wei Liu <wei.liu2@citrix.com> wrote:
> > On Fri, Apr 12, 2013 at 04:10:34AM +0100, G.R. wrote:
> >>
> >> Yes, it is not specific to NFS pages; I'm just unlucky enough that the
> >> reclaim path ran into them. I agree with your suspicion: the chance
> >> depends on the memory pressure in dom0.
> >> So here is a proper setup to reproduce the issue:
> >> 1. dom0 with swap disabled and with limited memory allocated.
> >> 2. domU serves the storage and exports it over NFS.
> >> 3. dom0 mounts the domU storage and writes to it.
> >> 4. You need to achieve high speed to expose this issue.
> >>
> >
> > My setup is almost the same. The write speed is around 35-45MB/s if I do:
> >
> > dd if=/dev/zero of=/mnt/t bs=1 count=200
> >

Oh, this is in fact bs=1M.

> > However, if I do count=2000, the speed slows down to 24MB/s. I suspect
> > that's the memory pressure in Dom0 -- my Dom0 only has 1024MB RAM. But
> > I still didn't see any error.
> >
> That's weird -- the stack trace proves that the issue exists, and the
> issue stands theoretically. But why is it so common in my build and yet
> cannot be reproduced in yours? There must be some factor we are missing
> here. Is there any kernel config that affects the memory management
> behavior in dom0? What's your dom0 kernel version?

I use the default memory management options, i.e. I didn't touch any of
those. I use 3.8-rc7.

> Is there anything that could matter in the NFS config?

I use the following line in /etc/exports:

/ *(rw)

> Do you enable memory ballooning for dom0? I do -- but does it matter?

CONFIG_XEN_BALLOON=y

I have my config file attached.

> I still believe the key factor is to stress the memory. Maybe you can
> try further limiting the memory size and using a larger file.
>
> I have become uncertain about how the transfer speed affects this. I
> can achieve 10GB/s in an iperf test without issue, and an FTP transfer
> also works without problem at 50MB/s. But maybe the higher network
> speed is a negative factor here -- NFS may be able to commit changes
> faster. Probably we should feed data faster than NFS can handle so that
> memory is used up quickly? But the back pressure from downstream should
> slow down the speed at which upstream eats the memory. How does the
> throttling work? Is there any way to control it?
>
> I'll check why my dom0 reported OOM; maybe that's one factor too.
>

This is a good starting point. :-)

Wei.
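Whether the OOM killer or allocation stalls actually fired in dom0 can be checked from the kernel log, assuming the log survives the freeze; the patterns below match the usual messages:

  dmesg | grep -iE 'out of memory|oom-killer|page allocation failure'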
>> I still believe the key factor is to stress the memory. Maybe you can
>> try further limiting the memory size and using a larger file.
>>
>> I have become uncertain about how the transfer speed affects this. I
>> can achieve 10GB/s in an iperf test without issue, and an FTP transfer
>> also works without problem at 50MB/s. But maybe the higher network
>> speed is a negative factor here -- NFS may be able to commit changes
>> faster. Probably we should feed data faster than NFS can handle so that
>> memory is used up quickly? But the back pressure from downstream should
>> slow down the speed at which upstream eats the memory. How does the
>> throttling work? Is there any way to control it?
>>
>> I'll check why my dom0 reported OOM; maybe that's one factor too.
>>
>
> This is a good starting point. :-)
>

It seems that the OOM killer was only triggered on kernel version 3.6.9;
it does not show up on 3.6.11, while the issue still exists. So I guess
there are some changes in mm behavior in recent kernels. Probably I
should try with your kernel version.

But anyway, let's look at some existing data first. Please find the full
oom_kill log in the attached file. It seems that the OOM kill was caused
by the freeze issue, since the NFS server (domU) becomes unresponsive
first. There are about 900MB of dirty pages (writeback:152893
unstable:72860 -- can they simply be added?). I don't remember the total
memory at that time due to the ballooning; the total DRAM in the host is
8GB.

Thanks,
Timothy
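On whether the two counters can simply be added: both are reported in 4 KiB pages and describe different states of a dirty NFS page (under writeback vs. written but not yet committed), so adding them gives a reasonable total. A quick check of the arithmetic:

  # (writeback + unstable) pages * 4 KiB, expressed in MiB
  echo $(( (152893 + 72860) * 4 / 1024 ))    # prints 881, i.e. roughly the 900MB mentioned above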
On Sat, Apr 13, 2013 at 2:19 PM, G.R. <firemeteor@users.sourceforge.net> wrote:
>>>> I still believe the key factor is to stress the memory. Maybe you
>>>> can try further limiting the memory size and using a larger file.
>>>>
>>>> I have become uncertain about how the transfer speed affects this.
>>>> I can achieve 10GB/s in an iperf test without issue, and an FTP
>>>> transfer also works without problem at 50MB/s. But maybe the higher
>>>> network speed is a negative factor here -- NFS may be able to commit
>>>> changes faster. Probably we should feed data faster than NFS can
>>>> handle so that memory is used up quickly? But the back pressure from
>>>> downstream should slow down the speed at which upstream eats the
>>>> memory. How does the throttling work? Is there any way to control it?
>>>>
>>>> I'll check why my dom0 reported OOM; maybe that's one factor too.
>>>>
>>>
>>> This is a good starting point. :-)
>>>
>>
>> It seems that the OOM killer was only triggered on kernel version
>> 3.6.9; it does not show up on 3.6.11, while the issue still exists.
>> So I guess there are some changes in mm behavior in recent kernels.
>> Probably I should try with your kernel version.
>>
>> But anyway, let's look at some existing data first. Please find the
>> full oom_kill log in the attached file. It seems that the OOM kill was
>> caused by the freeze issue, since the NFS server (domU) becomes
>> unresponsive first. There are about 900MB of dirty pages
>> (writeback:152893 unstable:72860 -- can they simply be added?). I
>> don't remember the total memory at that time due to the ballooning;
>> the total DRAM in the host is 8GB.
>>
>> Thanks,
>> Timothy

I did another experiment, dumping live statistics while reproducing the
problem.

The command line used:

dd if=/dev/zero of=foobar bs=1M count=4096

Output from vmstat -m, sampled with:

while true; do sleep 1; date; vmstat -m | grep -i nfs; done

Sat Apr 13 13:35:31 CST 2013
nfs_direct_cache       0      0    200   19
nfs_commit_data        0      0    704   11
nfs_write_data        36     36    960    4
nfs_read_data          0      0    896    4
nfs_inode_cache       16     16   1008    4
nfs_page               0      0    128   30
Sat Apr 13 13:35:32 CST 2013
nfs_direct_cache       0      0    200   19
nfs_commit_data        0      0    704   11
nfs_write_data      7836   7836    960    4
nfs_read_data          0      0    896    4
nfs_inode_cache       20     20   1008    4
nfs_page           91080  91080    128   30
Sat Apr 13 13:35:33 CST 2013
nfs_direct_cache       0      0    200   19
nfs_commit_data        0      0    704   11
nfs_write_data     15920  15920    960    4
nfs_read_data          0      0    896    4
nfs_inode_cache       20     20   1008    4
nfs_page          134220 134220    128   30
Sat Apr 13 13:35:34 CST 2013
nfs_direct_cache       0      0    200   19
nfs_commit_data        0      0    704   11
nfs_write_data     23596  23596    960    4
nfs_read_data          0      0    896    4
nfs_inode_cache       18     20   1008    4
nfs_page          167730 167730    128   30
Sat Apr 13 13:35:35 CST 2013
nfs_direct_cache       0      0    200   19
nfs_commit_data        0      0    704   11
nfs_write_data     30228  30228    960    4
nfs_read_data          0      0    896    4
nfs_inode_cache       18     20   1008    4
nfs_page          196650 196650    128   30
Sat Apr 13 13:35:36 CST 2013
nfs_direct_cache       0      0    200   19
nfs_commit_data        0      0    704   11
nfs_write_data     33124  33124    960    4
nfs_read_data          0      0    896    4
nfs_inode_cache       17     20   1008    4
nfs_page          209096 209100    128   30
...

later, when I tried umount -f:

Sat Apr 13 13:38:04 CST 2013
nfs_direct_cache       0      0    200   19
nfs_commit_data        0      0    704   11
nfs_write_data     33119  33124    960    4
nfs_read_data          0      0    896    4
nfs_inode_cache       17     20   1008    4
nfs_page          209019 209100    128   30
Sat Apr 13 13:38:05 CST 2013
nfs_direct_cache       0      0    200   19
nfs_commit_data        0      0    704   11
nfs_write_data       314    568    960    4
nfs_read_data          0      0    896    4
nfs_inode_cache       17     20   1008    4
nfs_page           34959  35610    128   30
Sat Apr 13 13:38:06 CST 2013
nfs_direct_cache       0      0    200   19
nfs_commit_data        0      0    704   11
nfs_write_data       314    568    960    4
nfs_read_data          0      0    896    4
nfs_inode_cache       17     20   1008    4
nfs_page           34959  35610    128   30

And I made a table for the nfs_page item:

Seconds   nfs_page   delta nfs_page   delta MB
1         0          N/A              N/A
2         91080      91080            355.7812
3         134220     43140            168.5156
4         167730     33510            130.8984
5         196650     28920            112.9688
6         209096     12446            48.61719
n         209019     -77              -0.3008
n+1       34959      -174060          -679.92
n+2       34959      0                0

Output from vmstat 1:

procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo    in    cs us sy id wa
 0  0      0 906256  11720 110764    0    0     0     4   215   623  0  0 100  0
 0  0      0 906504  11720 110764    0    0     0     4   185   513  0  0 100  0
 1  0      0 858888  11728 155728    0    0     0    32   239   524  0  1 99  0
 2  1      0 480316  10368 509768    0    0    40     4 55192 130021  0  2 76 21
 0  2      0 308452   9344 667840    0    0   124     4 36367 59147  0  0 78 21
 0  2      0 170548   8860 793952    0    0    32     4 27040 42421  0  1 77 22
 0  2      0  52020   8400 903524    0    0     0     4 22961 36318  0  0 77 22
 0  0      0  42160   4780 916328    0    0     4    48  7544 12293  0  0 90  9
 0  0      0  41796   4780 916160    0    0     0     4   141   398  0  0 100  0
 0  0      0  41292   4780 916160    0    0     0     4   132   463  0  0 100  0
 0  0      0  41168   4780 916164    0    0     0     4   146   461  0  0 100  0
 0  0      0  40796   4780 916156    0    0     0     4   142   447  0  0 100  0
 0  0      0  41176   4788 916160    0    0     0    36   170   536  0  0 100  0

later, when I tried umount -f:

 0  1      0  44360   4748 911660    0    0     0    32   205   493  0  0 87 13
 0  1      0  44724   4748 911668    0    0     0     4   143   429  0  0 87 13
 0  1      0  44616   4748 911672    0    0     0     4   187   515  0  0 87 13
 0  0      0 103616   4828 911936    0    0   384     4  2572 41347  0  1 94  4
 0  0      0 103772   4828 911936    0    0     0     4   120   321  0  0 100  0
 0  0      0 104632   4836 911968    0    0     0    40   155   438  0  0 100  0

Finally, I have attached a file that holds /proc/meminfo sampled at
one-second intervals. This is from a different capture though, because I
used the wrong command at first.
On Sat, Apr 13, 2013 at 3:06 PM, G.R. <firemeteor@users.sourceforge.net> wrote:
> On Sat, Apr 13, 2013 at 2:19 PM, G.R. <firemeteor@users.sourceforge.net> wrote:
>>>>> I still believe the key factor is to stress the memory. Maybe you
>>>>> can try further limiting the memory size and using a larger file.
>>>>>
>>>>> I have become uncertain about how the transfer speed affects this.
>>>>> I can achieve 10GB/s in an iperf test without issue, and an FTP
>>>>> transfer also works without problem at 50MB/s. But maybe the higher
>>>>> network speed is a negative factor here -- NFS may be able to
>>>>> commit changes faster. Probably we should feed data faster than NFS
>>>>> can handle so that memory is used up quickly? But the back pressure
>>>>> from downstream should slow down the speed at which upstream eats
>>>>> the memory. How does the throttling work? Is there any way to
>>>>> control it?
>>>>>
>>>>> I'll check why my dom0 reported OOM; maybe that's one factor too.
>>>>>
>>>>
>>>> This is a good starting point. :-)
>>>>
>>>
>>> It seems that the OOM killer was only triggered on kernel version
>>> 3.6.9; it does not show up on 3.6.11, while the issue still exists.
>>> So I guess there are some changes in mm behavior in recent kernels.
>>> Probably I should try with your kernel version.
>>>
>>> But anyway, let's look at some existing data first. Please find the
>>> full oom_kill log in the attached file. It seems that the OOM kill
>>> was caused by the freeze issue, since the NFS server (domU) becomes
>>> unresponsive first. There are about 900MB of dirty pages
>>> (writeback:152893 unstable:72860 -- can they simply be added?). I
>>> don't remember the total memory at that time due to the ballooning;
>>> the total DRAM in the host is 8GB.

Hi Wei,

I think I have found something important for this issue. I moved to a
brand-new kernel version -- 3.8.7 -- yesterday and tried both your
configuration and mine. The configuration really matters: my
configuration fails while yours does not.

I made a quick comparison and found a key difference -- kernel
preemption. I enable kernel preemption and use 1000 Hz time slices. I
have my config attached; maybe there is something else that matters, so
you can simply check it out.

To fix the issue, I wonder if it is feasible to reserve some pages for
low-memory situations? Also, the performance drop on large file
transfers does not make sense. With a raw network speed of about 7Gbps,
why do we end up at about 30MB/s with large files? A current HDD should
be able to sustain 100MB/s in sequential writes.

Thanks,
Timothy
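A quick way to confirm exactly which options differ between the two dom0 kernels is the diffconfig helper shipped in the kernel source tree; the file names below are placeholders and the option names are only a guess at what "kernel preemption and 1000 Hz" corresponds to, so verify against the attached configs:

  # run from a kernel source tree; the two .config file names are made up
  scripts/diffconfig config-3.8.7-mine config-3.8.7-wei | grep -E 'PREEMPT|HZ'
  # likely suspects: CONFIG_PREEMPT=y / CONFIG_HZ_1000=y in the failing config
  # versus CONFIG_PREEMPT_VOLUNTARY=y / a lower CONFIG_HZ in the working one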
Hi Wei,

Could you take a look?

Thanks,
Timothy

> Hi Wei,
>
> I think I have found something important for this issue. I moved to a
> brand-new kernel version -- 3.8.7 -- yesterday and tried both your
> configuration and mine. The configuration really matters: my
> configuration fails while yours does not.
>
> I made a quick comparison and found a key difference -- kernel
> preemption. I enable kernel preemption and use 1000 Hz time slices. I
> have my config attached; maybe there is something else that matters, so
> you can simply check it out.
>
> To fix the issue, I wonder if it is feasible to reserve some pages for
> low-memory situations? Also, the performance drop on large file
> transfers does not make sense. With a raw network speed of about 7Gbps,
> why do we end up at about 30MB/s with large files? A current HDD should
> be able to sustain 100MB/s in sequential writes.