Hi,

I've been suffering from a strange NFS-related network issue for a while.

The issue shows up when copying from dom0 to domU through an NFS mount.
After a short while the transfer suddenly freezes and the domU network
stops responding entirely. Force-unmounting the NFS mount generally
resolves the freeze, but sometimes you are out of luck and the trick
does not work.

Luckily, I captured the following log in a recent instance. It appears
to be a deadlock when netback tries to get some free pages from NFS.
I'm not sure if this is the whole story. Any suggestion on how to solve
the issue?

Thanks,
Timothy

Apr 11 21:22:27 gaia kernel: [429242.015643] INFO: task netback/0:2255 blocked for more than 120 seconds.
Apr 11 21:22:27 gaia kernel: [429242.015665] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr 11 21:22:27 gaia kernel: [429242.015690] netback/0 D ffff880210213900 0 2255 2 0x00000000
Apr 11 21:22:27 gaia kernel: [429242.015693] ffff8801fee04ea0 0000000000000246 0000000000000000 ffffffff818133f0
Apr 11 21:22:27 gaia kernel: [429242.015697] 0000000000013900 ffff8801fed87fd8 ffff8801fed87fd8 ffff8801fee04ea0
Apr 11 21:22:27 gaia kernel: [429242.015700] ffff8801fed87488 ffff880210213900 ffff8801fee04ea0 ffff8801fed87488
Apr 11 21:22:27 gaia kernel: [429242.015703] Call Trace:
Apr 11 21:22:27 gaia kernel: [429242.015711] [<ffffffff810c1bb5>] ? __lock_page+0x66/0x66
Apr 11 21:22:27 gaia kernel: [429242.015715] [<ffffffff814d06cb>] ? io_schedule+0x55/0x6b
Apr 11 21:22:27 gaia kernel: [429242.015718] [<ffffffff810c1bbc>] ? sleep_on_page+0x7/0xc
Apr 11 21:22:27 gaia kernel: [429242.015720] [<ffffffff814cf6c0>] ? __wait_on_bit_lock+0x3c/0x85
Apr 11 21:22:27 gaia kernel: [429242.015723] [<ffffffff810c3f7a>] ? find_get_pages+0xea/0x100
Apr 11 21:22:27 gaia kernel: [429242.015726] [<ffffffff810c1bb0>] ? __lock_page+0x61/0x66
Apr 11 21:22:27 gaia kernel: [429242.015729] [<ffffffff81058364>] ? autoremove_wake_function+0x2a/0x2a
Apr 11 21:22:27 gaia kernel: [429242.015732] [<ffffffff810cd110>] ? truncate_inode_pages_range+0x28b/0x2f8
Apr 11 21:22:27 gaia kernel: [429242.015737] [<ffffffff811c91d2>] ? nfs_evict_inode+0x12/0x23
Apr 11 21:22:27 gaia kernel: [429242.015740] [<ffffffff8111cdae>] ? evict+0xa3/0x153
Apr 11 21:22:27 gaia kernel: [429242.015743] [<ffffffff8111ce85>] ? dispose_list+0x27/0x31
Apr 11 21:22:27 gaia kernel: [429242.015746] [<ffffffff8111db6b>] ? evict_inodes+0xe7/0xf4
Apr 11 21:22:27 gaia kernel: [429242.015749] [<ffffffff8110b3af>] ? generic_shutdown_super+0x3e/0xc5
Apr 11 21:22:27 gaia kernel: [429242.015752] [<ffffffff8110b49e>] ? kill_anon_super+0x9/0x11
Apr 11 21:22:27 gaia kernel: [429242.015755] [<ffffffff811ca7b0>] ? nfs_kill_super+0xd/0x16
Apr 11 21:22:27 gaia kernel: [429242.015758] [<ffffffff8110b717>] ? deactivate_locked_super+0x2c/0x5c
Apr 11 21:22:27 gaia kernel: [429242.015761] [<ffffffff811c901d>] ? __put_nfs_open_context+0xbf/0xe1
Apr 11 21:22:27 gaia kernel: [429242.015764] [<ffffffff811d07db>] ? nfs_commitdata_release+0x10/0x19
Apr 11 21:22:27 gaia kernel: [429242.015766] [<ffffffff811d0f8c>] ? nfs_initiate_commit+0xd9/0xe4
Apr 11 21:22:27 gaia kernel: [429242.015769] [<ffffffff811d1bae>] ? nfs_commit_inode+0x81/0x111
Apr 11 21:22:27 gaia kernel: [429242.015772] [<ffffffff811c86f4>] ? nfs_release_page+0x40/0x4f
Apr 11 21:22:27 gaia kernel: [429242.015775] [<ffffffff810d0940>] ? shrink_page_list+0x4f5/0x6d8
Apr 11 21:22:27 gaia kernel: [429242.015780] [<ffffffff810d0f03>] ? shrink_inactive_list+0x1dd/0x33f
Apr 11 21:22:27 gaia kernel: [429242.015783] [<ffffffff810d15fa>] ? shrink_lruvec+0x2e0/0x44d
Apr 11 21:22:27 gaia kernel: [429242.015787] [<ffffffff810d17ba>] ? shrink_zone+0x53/0x8a
Apr 11 21:22:27 gaia kernel: [429242.015790] [<ffffffff810d1bcd>] ? do_try_to_free_pages+0x1c6/0x3f4
Apr 11 21:22:27 gaia kernel: [429242.015794] [<ffffffff810d20a3>] ? try_to_free_pages+0xc4/0x11e
Apr 11 21:22:27 gaia kernel: [429242.015797] [<ffffffff810c9018>] ? __alloc_pages_nodemask+0x440/0x72f
Apr 11 21:22:27 gaia kernel: [429242.015801] [<ffffffff810f592d>] ? alloc_pages_current+0xb2/0xcd
Apr 11 21:22:27 gaia kernel: [429242.015805] [<ffffffffa01be188>] ? xen_netbk_alloc_page.isra.17+0x15/0x4d [xen_netback]
Apr 11 21:22:27 gaia kernel: [429242.015808] [<ffffffffa01bf30c>] ? xen_netbk_tx_build_gops+0x3fc/0x7ab [xen_netback]
Apr 11 21:22:27 gaia kernel: [429242.015812] [<ffffffff812c0068>] ? pnp_show_options+0x43e/0x482
Apr 11 21:22:27 gaia kernel: [429242.015816] [<ffffffffa01beb3f>] ? xen_netbk_rx_action+0x688/0x6f8 [xen_netback]
Apr 11 21:22:27 gaia kernel: [429242.015819] [<ffffffff81061198>] ? finish_task_switch+0x4c/0x7c
Apr 11 21:22:27 gaia kernel: [429242.015823] [<ffffffffa01bf7e9>] ? xen_netbk_kthread+0x12e/0x790 [xen_netback]
Apr 11 21:22:27 gaia kernel: [429242.015826] [<ffffffff8105833a>] ? abort_exclusive_wait+0x79/0x79
Apr 11 21:22:27 gaia kernel: [429242.015829] [<ffffffffa01bf6bb>] ? xen_netbk_tx_build_gops+0x7ab/0x7ab [xen_netback]
Apr 11 21:22:27 gaia kernel: [429242.015832] [<ffffffffa01bf6bb>] ? xen_netbk_tx_build_gops+0x7ab/0x7ab [xen_netback]
Apr 11 21:22:27 gaia kernel: [429242.015834] [<ffffffff81057a31>] ? kthread+0x67/0x6f
Apr 11 21:22:27 gaia kernel: [429242.015838] [<ffffffff814d6904>] ? kernel_thread_helper+0x4/0x10
Apr 11 21:22:27 gaia kernel: [429242.015841] [<ffffffff814d143c>] ? retint_restore_args+0x5/0x6
Apr 11 21:22:27 gaia kernel: [429242.015844] [<ffffffff814d6900>] ? gs_change+0x13/0x13
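For reference, reports like the one above come from the kernel's hung-task watchdog, which flags tasks stuck in uninterruptible sleep for longer than the configured timeout. If more instances need to be captured from dom0, something along these lines should work (the sysctl and sysrq facilities are standard, but treat the exact values as illustrative, and note that sysrq must be enabled):

  # dump all blocked (D-state) tasks on demand, if sysrq is enabled
  echo w > /proc/sysrq-trigger
  # report hung tasks more often than the default 120 seconds
  echo 60 > /proc/sys/kernel/hung_task_timeout_secs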
Hi Konrad,

Do you have any solution to this?

Thanks,
Timothy

On Thu, Apr 11, 2013 at 9:55 PM, G.R. <firemeteor@users.sourceforge.net> wrote:
> Hi,
> I've been suffering from a strange NFS-related network issue for a while.
>
> The issue shows up when copying from dom0 to domU through an NFS mount.
> After a short while the transfer suddenly freezes and the domU network
> stops responding entirely. Force-unmounting the NFS mount generally
> resolves the freeze, but sometimes you are out of luck and the trick
> does not work.
>
> Luckily, I captured the following log in a recent instance. It appears
> to be a deadlock when netback tries to get some free pages from NFS.
> I'm not sure if this is the whole story. Any suggestion on how to
> solve the issue?
>
> Thanks,
> Timothy
>
> [full hung-task trace quoted above]
On Thu, Apr 11, 2013 at 02:55:48PM +0100, G.R. wrote:
> Hi,
> I've been suffering from a strange NFS-related network issue for a while.
>
> The issue shows up when copying from dom0 to domU through an NFS mount.
> After a short while the transfer suddenly freezes and the domU network
> stops responding entirely. Force-unmounting the NFS mount generally
> resolves the freeze, but sometimes you are out of luck and the trick
> does not work.
>
> Luckily, I captured the following log in a recent instance. It appears
> to be a deadlock when netback tries to get some free pages from NFS.
> I'm not sure if this is the whole story. Any suggestion on how to
> solve the issue?
>

Ian, does this look familiar to you? IIRC you once discovered some NFS
bug related to SKB life cycle tracking.

Wei.
On Thu, 2013-04-11 at 17:39 +0100, Wei Liu wrote:
> On Thu, Apr 11, 2013 at 02:55:48PM +0100, G.R. wrote:
> > Hi,
> > I've been suffering from a strange NFS-related network issue for a while.
> >
> > The issue shows up when copying from dom0 to domU through an NFS mount.
> > After a short while the transfer suddenly freezes and the domU network
> > stops responding entirely. Force-unmounting the NFS mount generally
> > resolves the freeze, but sometimes you are out of luck and the trick
> > does not work.
> >
> > Luckily, I captured the following log in a recent instance. It appears
> > to be a deadlock when netback tries to get some free pages from NFS.
> > I'm not sure if this is the whole story. Any suggestion on how to
> > solve the issue?
>
> Ian, does this look familiar to you? IIRC you once discovered some NFS
> bug related to SKB life cycle tracking.

Not with upstream netback, which does grant copy, not grant map.

Ian.
On Thu, Apr 11, 2013 at 02:55:48PM +0100, G.R. wrote:
> Hi,
> I've been suffering from a strange NFS-related network issue for a while.
>
> The issue shows up when copying from dom0 to domU through an NFS mount.
> After a short while the transfer suddenly freezes and the domU network
> stops responding entirely. Force-unmounting the NFS mount generally
> resolves the freeze, but sometimes you are out of luck and the trick
> does not work.
>
> Luckily, I captured the following log in a recent instance. It appears
> to be a deadlock when netback tries to get some free pages from NFS.
> I'm not sure if this is the whole story. Any suggestion on how to
> solve the issue?
>

Unfortunately I cannot reproduce this at the moment. Please provide
detailed configuration and steps.

I use NFS to test my netback / netfront patches, but I've never seen
anything like this. In my case I mostly export a dir in Dom0, then DomU
writes to it. Presumably in your case DomU exports a dir, then Dom0
writes to it? I did a simple test for both cases and couldn't see this
problem.

Wei.
On Thu, Apr 11, 2013 at 02:55:48PM +0100, G.R. wrote:
> Hi,
> I've been suffering from a strange NFS-related network issue for a while.
>
> The issue shows up when copying from dom0 to domU through an NFS mount.
> After a short while the transfer suddenly freezes and the domU network
> stops responding entirely. Force-unmounting the NFS mount generally
> resolves the freeze, but sometimes you are out of luck and the trick
> does not work.
>
> Luckily, I captured the following log in a recent instance. It appears
> to be a deadlock when netback tries to get some free pages from NFS.
> I'm not sure if this is the whole story. Any suggestion on how to
> solve the issue?
>

BTW, xen_netbk_alloc_page tries to allocate a page from the generic page
pool. It is not specific to NFS.

> Apr 11 21:22:27 gaia kernel: [429242.015766] [<ffffffff811d0f8c>] ? nfs_initiate_commit+0xd9/0xe4
> Apr 11 21:22:27 gaia kernel: [429242.015769] [<ffffffff811d1bae>] ? nfs_commit_inode+0x81/0x111
> Apr 11 21:22:27 gaia kernel: [429242.015772] [<ffffffff811c86f4>] ? nfs_release_page+0x40/0x4f
> Apr 11 21:22:27 gaia kernel: [429242.015775] [<ffffffff810d0940>] ? shrink_page_list+0x4f5/0x6d8
> Apr 11 21:22:27 gaia kernel: [429242.015780] [<ffffffff810d0f03>] ? shrink_inactive_list+0x1dd/0x33f
> Apr 11 21:22:27 gaia kernel: [429242.015783] [<ffffffff810d15fa>] ? shrink_lruvec+0x2e0/0x44d
> Apr 11 21:22:27 gaia kernel: [429242.015787] [<ffffffff810d17ba>] ? shrink_zone+0x53/0x8a
> Apr 11 21:22:27 gaia kernel: [429242.015790] [<ffffffff810d1bcd>] ? do_try_to_free_pages+0x1c6/0x3f4
> Apr 11 21:22:27 gaia kernel: [429242.015794] [<ffffffff810d20a3>] ? try_to_free_pages+0xc4/0x11e
> Apr 11 21:22:27 gaia kernel: [429242.015797] [<ffffffff810c9018>] ? __alloc_pages_nodemask+0x440/0x72f
> Apr 11 21:22:27 gaia kernel: [429242.015801] [<ffffffff810f592d>] ? alloc_pages_current+0xb2/0xcd

Judging from the stack trace above, it looks like the system is trying
to squeeze some memory out of NFS. Probably it is just that your system
is suffering from OOM? Then NFS failed to commit its changes to disk for
some reason and hung.

Wei.
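One way to test the memory-pressure theory is simply to watch dom0's memory counters while the copy runs; the /proc/meminfo fields below are present on 3.x kernels, and the one-second interval is arbitrary:

  # watch free memory and outstanding NFS writeback in dom0 during the copy
  watch -n1 'grep -E "MemFree|Dirty|Writeback|NFS_Unstable" /proc/meminfo'
  # or log general VM activity alongside it
  vmstat 1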
On Fri, Apr 12, 2013 at 1:38 AM, Wei Liu <wei.liu2@citrix.com> wrote:
> On Thu, Apr 11, 2013 at 02:55:48PM +0100, G.R. wrote:
>> Hi,
>> I've been suffering from a strange NFS-related network issue for a while.
>>
>> The issue shows up when copying from dom0 to domU through an NFS mount.
>> After a short while the transfer suddenly freezes and the domU network
>> stops responding entirely. Force-unmounting the NFS mount generally
>> resolves the freeze, but sometimes you are out of luck and the trick
>> does not work.
>>
>> Luckily, I captured the following log in a recent instance. It appears
>> to be a deadlock when netback tries to get some free pages from NFS.
>> I'm not sure if this is the whole story. Any suggestion on how to
>> solve the issue?
>>
>
> BTW, xen_netbk_alloc_page tries to allocate a page from the generic page
> pool. It is not specific to NFS.
>
> Judging from the stack trace above, it looks like the system is trying
> to squeeze some memory out of NFS. Probably it is just that your system
> is suffering from OOM? Then NFS failed to commit its changes to disk for
> some reason and hung.
>

Yes, it is not specific to NFS pages; I'm just unlucky enough that the
reclaim path ran into them. I agree with your suspicion: the chance
depends on the memory pressure in dom0.
So here is a proper setup to reproduce the issue:
1. dom0 with swap disabled and with limited memory allocated.
2. domU serves the storage and exports it over NFS.
3. dom0 mounts the domU storage and writes to it.
4. You need to achieve high speed to expose this issue.

In my case, domU owns a dedicated SATA controller, so there is no
blkback overhead. I am not sure whether this is an important factor in
achieving high speed. The transfer is a normal file copy rather than
O_SYNC / O_DIRECT access, so the data can be cached on the client side
for a short period. Finally, the transfer speed and the memory size are
crucial.

With 4GB of memory allocated to dom0, I can copy a file (> 2GB) from a
USB2 port without problem at about 32MB/s. But using a USB3 port, the
same copy generally gets stuck at around 1.2GB, and a 'dd if=/dev/zero'
gets stuck even quicker. With around 1-2GB of memory for dom0 the
freeze happens much earlier, but I did not check the exact point.

I'm on a custom build of the Xen 4.2.1 testing release (built around
Jan this year?), with some patches related to graphics pass-through,
but I guess those patches are not relevant. The dom0 kernel is version
3.6.11, 64-bit.

One thing I forgot to mention is a possible sign of memory leakage. I'm
not very sure about it, but my dom0 reported OOM several days ago. I
typically don't use dom0 for anything other than serving backends, and
the allocated memory should be around 2GB, which should be plenty for a
dom0. Are there any known leak bugs out there?

Thanks,
Timothy
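For anyone trying to reproduce this, the setup described above roughly translates to the following; the memory size, export address and paths are placeholders, and dom0_mem on the Xen command line is the usual way to cap dom0's memory:

  # Xen command line (in the bootloader entry): dom0_mem=2048M,max:2048M
  # in dom0: disable swap so reclaim has nowhere else to go
  swapoff -a
  # mount the NFS export served by the domU (address and path are made up)
  mount -t nfs 192.168.0.2:/export /mnt/domu
  # write a file larger than dom0's memory to pile up dirty NFS pages
  dd if=/dev/zero of=/mnt/domu/bigfile bs=1M count=8192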
On Fri, Apr 12, 2013 at 04:10:34AM +0100, G.R. wrote:
>
> Yes, it is not specific to NFS pages; I'm just unlucky enough that the
> reclaim path ran into them. I agree with your suspicion: the chance
> depends on the memory pressure in dom0.
> So here is a proper setup to reproduce the issue:
> 1. dom0 with swap disabled and with limited memory allocated.
> 2. domU serves the storage and exports it over NFS.
> 3. dom0 mounts the domU storage and writes to it.
> 4. You need to achieve high speed to expose this issue.
>

My setup is almost the same. The write speed is around 35-45MB/s if I do:

dd if=/dev/zero of=/mnt/t bs=1 count=200

However, if I do count=2000, the speed slows down to 24MB/s. I suspect
that's the memory pressure in Dom0 -- my Dom0 only has 1024MB RAM. But I
still didn't see any error.

> In my case, domU owns a dedicated SATA controller, so there is no
> blkback overhead. I am not sure whether this is an important factor in
> achieving high speed. The transfer is a normal file copy rather than
> O_SYNC / O_DIRECT access, so the data can be cached on the client side
> for a short period. Finally, the transfer speed and the memory size are
> crucial.
>
> With 4GB of memory allocated to dom0, I can copy a file (> 2GB) from a
> USB2 port without problem at about 32MB/s. But using a USB3 port, the
> same copy generally gets stuck at around 1.2GB, and a 'dd if=/dev/zero'
> gets stuck even quicker. With around 1-2GB of memory for dom0 the
> freeze happens much earlier, but I did not check the exact point.
>

And I also use iperf, which can achieve 7GB/s transfer between Dom0 and
DomU -- presumably that's fast enough?

> I'm on a custom build of the Xen 4.2.1 testing release (built around
> Jan this year?), with some patches related to graphics pass-through,
> but I guess those patches are not relevant. The dom0 kernel is version
> 3.6.11, 64-bit.
>
> One thing I forgot to mention is a possible sign of memory leakage. I'm
> not very sure about it, but my dom0 reported OOM several days ago. I
> typically don't use dom0 for anything other than serving backends, and
> the allocated memory should be around 2GB, which should be plenty for a
> dom0. Are there any known leak bugs out there?
>

Not that I know of; page allocation / deallocation in netback is quite
simple.

Wei.
On Fri, Apr 12, 2013 at 5:27 PM, Wei Liu <wei.liu2@citrix.com> wrote:
> On Fri, Apr 12, 2013 at 04:10:34AM +0100, G.R. wrote:
>>
>> Yes, it is not specific to NFS pages; I'm just unlucky enough that the
>> reclaim path ran into them. I agree with your suspicion: the chance
>> depends on the memory pressure in dom0.
>> So here is a proper setup to reproduce the issue:
>> 1. dom0 with swap disabled and with limited memory allocated.
>> 2. domU serves the storage and exports it over NFS.
>> 3. dom0 mounts the domU storage and writes to it.
>> 4. You need to achieve high speed to expose this issue.
>>
>
> My setup is almost the same. The write speed is around 35-45MB/s if I do:
>
> dd if=/dev/zero of=/mnt/t bs=1 count=200
>
> However, if I do count=2000, the speed slows down to 24MB/s. I suspect
> that's the memory pressure in Dom0 -- my Dom0 only has 1024MB RAM. But I
> still didn't see any error.
>

That's weird -- the stack trace proves that the issue exists, and the
issue stands theoretically. But why is it so common in my build and yet
cannot be reproduced in yours? There must be some factor we are missing
here. Is there any kernel config that affects the memory management
behavior in dom0? What's your dom0 kernel version? Is there anything
that could matter in the NFS config? Do you enable memory ballooning for
dom0? I do -- but does it matter?

I still believe the key factor is to stress the memory. Maybe you can
try further limiting the memory size and using a larger file.

I have become uncertain about how the transfer speed affects this. I can
achieve 10GB/s in an iperf test without issue, and an FTP transfer also
works without problem at 50MB/s. But maybe the higher network speed is a
negative factor here -- NFS may be able to commit changes faster.
Probably we should feed data faster than NFS can handle so that memory
is used up quickly? But the back pressure from downstream should slow
down the speed at which upstream eats the memory. How does the
throttling work? Is there any way to control it?

I'll check why my dom0 reported OOM; maybe that's one factor too.

Thanks,
Timothy
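On the throttling question above: how many dirty NFS pages dom0 is allowed to accumulate before the writer is throttled is governed by the ordinary dirty-writeback sysctls, so one way to experiment is to force writeback to start earlier and block the writer sooner. The values below are only examples, not recommendations:

  # start background writeback sooner and throttle writers at a lower threshold
  sysctl -w vm.dirty_background_ratio=2
  sysctl -w vm.dirty_ratio=5
  # keep a larger reserve of free pages for critical allocations
  sysctl -w vm.min_free_kbytes=65536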
On Fri, Apr 12, 2013 at 11:34:50AM +0100, G.R. wrote:
> On Fri, Apr 12, 2013 at 5:27 PM, Wei Liu <wei.liu2@citrix.com> wrote:
> > On Fri, Apr 12, 2013 at 04:10:34AM +0100, G.R. wrote:
> >>
> >> Yes, it is not specific to NFS pages; I'm just unlucky enough that the
> >> reclaim path ran into them. I agree with your suspicion: the chance
> >> depends on the memory pressure in dom0.
> >> So here is a proper setup to reproduce the issue:
> >> 1. dom0 with swap disabled and with limited memory allocated.
> >> 2. domU serves the storage and exports it over NFS.
> >> 3. dom0 mounts the domU storage and writes to it.
> >> 4. You need to achieve high speed to expose this issue.
> >>
> >
> > My setup is almost the same. The write speed is around 35-45MB/s if I do:
> >
> > dd if=/dev/zero of=/mnt/t bs=1 count=200
> >

Oh, this is in fact bs=1M.

> > However, if I do count=2000, the speed slows down to 24MB/s. I suspect
> > that's the memory pressure in Dom0 -- my Dom0 only has 1024MB RAM. But
> > I still didn't see any error.
> >
> That's weird -- the stack trace proves that the issue exists, and the
> issue stands theoretically. But why is it so common in my build and yet
> cannot be reproduced in yours? There must be some factor we are missing
> here. Is there any kernel config that affects the memory management
> behavior in dom0? What's your dom0 kernel version?

I use the default memory management options, i.e. I didn't touch any of
those. I use 3.8-rc7.

> Is there anything that could matter in the NFS config?

I use the following line in /etc/exports:

/ *(rw)

> Do you enable memory ballooning for dom0? I do -- but does it matter?

CONFIG_XEN_BALLOON=y

I have my config file attached.

> I still believe the key factor is to stress the memory. Maybe you can
> try further limiting the memory size and using a larger file.
>
> I have become uncertain about how the transfer speed affects this. I
> can achieve 10GB/s in an iperf test without issue, and an FTP transfer
> also works without problem at 50MB/s. But maybe the higher network
> speed is a negative factor here -- NFS may be able to commit changes
> faster. Probably we should feed data faster than NFS can handle so that
> memory is used up quickly? But the back pressure from downstream should
> slow down the speed at which upstream eats the memory. How does the
> throttling work? Is there any way to control it?
>
> I'll check why my dom0 reported OOM; maybe that's one factor too.
>

This is a good starting point. :-)

Wei.
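Whether the OOM killer or allocation stalls actually fired in dom0 can be checked from the kernel log, assuming the log survives the freeze; the patterns below match the usual messages:

  dmesg | grep -iE 'out of memory|oom-killer|page allocation failure'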
>> I still believe the key factor is to stress the memory. Maybe you can
>> try further limiting the memory size and using a larger file.
>>
>> I have become uncertain about how the transfer speed affects this. I
>> can achieve 10GB/s in an iperf test without issue, and an FTP transfer
>> also works without problem at 50MB/s. But maybe the higher network
>> speed is a negative factor here -- NFS may be able to commit changes
>> faster. Probably we should feed data faster than NFS can handle so that
>> memory is used up quickly? But the back pressure from downstream should
>> slow down the speed at which upstream eats the memory. How does the
>> throttling work? Is there any way to control it?
>>
>> I'll check why my dom0 reported OOM; maybe that's one factor too.
>>
>
> This is a good starting point. :-)
>

It seems that the OOM killer was only triggered on kernel version 3.6.9;
it does not show up on 3.6.11, while the issue still exists. So I guess
there are some changes in mm behavior in recent kernels. Probably I
should try with your kernel version.

But anyway, let's look at some existing data first. Please find the full
oom_kill log in the attached file. It seems that the OOM kill was caused
by the freeze issue, since the NFS server (domU) becomes unresponsive
first. There are about 900MB of dirty pages (writeback:152893
unstable:72860 -- can they simply be added?). I don't remember the total
memory at that time due to the ballooning; the total DRAM in the host is
8GB.

Thanks,
Timothy
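On whether the two counters can simply be added: both are reported in 4 KiB pages and describe different states of a dirty NFS page (under writeback vs. written but not yet committed), so adding them gives a reasonable total. A quick check of the arithmetic:

  # (writeback + unstable) pages * 4 KiB, expressed in MiB
  echo $(( (152893 + 72860) * 4 / 1024 ))    # prints 881, i.e. roughly the 900MB mentioned above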
On Sat, Apr 13, 2013 at 2:19 PM, G.R. <firemeteor@users.sourceforge.net> wrote:
>>>> I still believe the key factor is to stress the memory. Maybe you
>>>> can try further limiting the memory size and using a larger file.
>>>>
>>>> I have become uncertain about how the transfer speed affects this.
>>>> I can achieve 10GB/s in an iperf test without issue, and an FTP
>>>> transfer also works without problem at 50MB/s. But maybe the higher
>>>> network speed is a negative factor here -- NFS may be able to commit
>>>> changes faster. Probably we should feed data faster than NFS can
>>>> handle so that memory is used up quickly? But the back pressure from
>>>> downstream should slow down the speed at which upstream eats the
>>>> memory. How does the throttling work? Is there any way to control it?
>>>>
>>>> I'll check why my dom0 reported OOM; maybe that's one factor too.
>>>>
>>>
>>> This is a good starting point. :-)
>>>
>>
>> It seems that the OOM killer was only triggered on kernel version
>> 3.6.9; it does not show up on 3.6.11, while the issue still exists.
>> So I guess there are some changes in mm behavior in recent kernels.
>> Probably I should try with your kernel version.
>>
>> But anyway, let's look at some existing data first. Please find the
>> full oom_kill log in the attached file. It seems that the OOM kill was
>> caused by the freeze issue, since the NFS server (domU) becomes
>> unresponsive first. There are about 900MB of dirty pages
>> (writeback:152893 unstable:72860 -- can they simply be added?). I
>> don't remember the total memory at that time due to the ballooning;
>> the total DRAM in the host is 8GB.
>>
>> Thanks,
>> Timothy

I did another experiment, dumping live statistics while reproducing the
problem.

The command line used:

dd if=/dev/zero of=foobar bs=1M count=4096

Output from vmstat -m, sampled with:

while true; do sleep 1; date; vmstat -m | grep -i nfs; done

Sat Apr 13 13:35:31 CST 2013
nfs_direct_cache       0      0    200   19
nfs_commit_data        0      0    704   11
nfs_write_data        36     36    960    4
nfs_read_data          0      0    896    4
nfs_inode_cache       16     16   1008    4
nfs_page               0      0    128   30
Sat Apr 13 13:35:32 CST 2013
nfs_direct_cache       0      0    200   19
nfs_commit_data        0      0    704   11
nfs_write_data      7836   7836    960    4
nfs_read_data          0      0    896    4
nfs_inode_cache       20     20   1008    4
nfs_page           91080  91080    128   30
Sat Apr 13 13:35:33 CST 2013
nfs_direct_cache       0      0    200   19
nfs_commit_data        0      0    704   11
nfs_write_data     15920  15920    960    4
nfs_read_data          0      0    896    4
nfs_inode_cache       20     20   1008    4
nfs_page          134220 134220    128   30
Sat Apr 13 13:35:34 CST 2013
nfs_direct_cache       0      0    200   19
nfs_commit_data        0      0    704   11
nfs_write_data     23596  23596    960    4
nfs_read_data          0      0    896    4
nfs_inode_cache       18     20   1008    4
nfs_page          167730 167730    128   30
Sat Apr 13 13:35:35 CST 2013
nfs_direct_cache       0      0    200   19
nfs_commit_data        0      0    704   11
nfs_write_data     30228  30228    960    4
nfs_read_data          0      0    896    4
nfs_inode_cache       18     20   1008    4
nfs_page          196650 196650    128   30
Sat Apr 13 13:35:36 CST 2013
nfs_direct_cache       0      0    200   19
nfs_commit_data        0      0    704   11
nfs_write_data     33124  33124    960    4
nfs_read_data          0      0    896    4
nfs_inode_cache       17     20   1008    4
nfs_page          209096 209100    128   30
...

later, when I tried umount -f:

Sat Apr 13 13:38:04 CST 2013
nfs_direct_cache       0      0    200   19
nfs_commit_data        0      0    704   11
nfs_write_data     33119  33124    960    4
nfs_read_data          0      0    896    4
nfs_inode_cache       17     20   1008    4
nfs_page          209019 209100    128   30
Sat Apr 13 13:38:05 CST 2013
nfs_direct_cache       0      0    200   19
nfs_commit_data        0      0    704   11
nfs_write_data       314    568    960    4
nfs_read_data          0      0    896    4
nfs_inode_cache       17     20   1008    4
nfs_page           34959  35610    128   30
Sat Apr 13 13:38:06 CST 2013
nfs_direct_cache       0      0    200   19
nfs_commit_data        0      0    704   11
nfs_write_data       314    568    960    4
nfs_read_data          0      0    896    4
nfs_inode_cache       17     20   1008    4
nfs_page           34959  35610    128   30

And I made a table for the nfs_page item:

Seconds   nfs_page   delta nfs_page   delta MB
1         0          N/A              N/A
2         91080      91080            355.7812
3         134220     43140            168.5156
4         167730     33510            130.8984
5         196650     28920            112.9688
6         209096     12446            48.61719
n         209019     -77              -0.3008
n+1       34959      -174060          -679.92
n+2       34959      0                0

Output from vmstat 1:

procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo    in    cs us sy id wa
 0  0      0 906256  11720 110764    0    0     0     4   215   623  0  0 100  0
 0  0      0 906504  11720 110764    0    0     0     4   185   513  0  0 100  0
 1  0      0 858888  11728 155728    0    0     0    32   239   524  0  1 99  0
 2  1      0 480316  10368 509768    0    0    40     4 55192 130021  0  2 76 21
 0  2      0 308452   9344 667840    0    0   124     4 36367 59147  0  0 78 21
 0  2      0 170548   8860 793952    0    0    32     4 27040 42421  0  1 77 22
 0  2      0  52020   8400 903524    0    0     0     4 22961 36318  0  0 77 22
 0  0      0  42160   4780 916328    0    0     4    48  7544 12293  0  0 90  9
 0  0      0  41796   4780 916160    0    0     0     4   141   398  0  0 100  0
 0  0      0  41292   4780 916160    0    0     0     4   132   463  0  0 100  0
 0  0      0  41168   4780 916164    0    0     0     4   146   461  0  0 100  0
 0  0      0  40796   4780 916156    0    0     0     4   142   447  0  0 100  0
 0  0      0  41176   4788 916160    0    0     0    36   170   536  0  0 100  0

later, when I tried umount -f:

 0  1      0  44360   4748 911660    0    0     0    32   205   493  0  0 87 13
 0  1      0  44724   4748 911668    0    0     0     4   143   429  0  0 87 13
 0  1      0  44616   4748 911672    0    0     0     4   187   515  0  0 87 13
 0  0      0 103616   4828 911936    0    0   384     4  2572 41347  0  1 94  4
 0  0      0 103772   4828 911936    0    0     0     4   120   321  0  0 100  0
 0  0      0 104632   4836 911968    0    0     0    40   155   438  0  0 100  0

Finally, I have attached a file that holds /proc/meminfo sampled at
one-second intervals. This is from a different capture though, because I
used the wrong command at first.
On Sat, Apr 13, 2013 at 3:06 PM, G.R. <firemeteor@users.sourceforge.net> wrote:
> On Sat, Apr 13, 2013 at 2:19 PM, G.R. <firemeteor@users.sourceforge.net> wrote:
>>>>> I still believe the key factor is to stress the memory. Maybe you
>>>>> can try further limiting the memory size and using a larger file.
>>>>>
>>>>> I have become uncertain about how the transfer speed affects this.
>>>>> I can achieve 10GB/s in an iperf test without issue, and an FTP
>>>>> transfer also works without problem at 50MB/s. But maybe the higher
>>>>> network speed is a negative factor here -- NFS may be able to
>>>>> commit changes faster. Probably we should feed data faster than NFS
>>>>> can handle so that memory is used up quickly? But the back pressure
>>>>> from downstream should slow down the speed at which upstream eats
>>>>> the memory. How does the throttling work? Is there any way to
>>>>> control it?
>>>>>
>>>>> I'll check why my dom0 reported OOM; maybe that's one factor too.
>>>>>
>>>>
>>>> This is a good starting point. :-)
>>>>
>>>
>>> It seems that the OOM killer was only triggered on kernel version
>>> 3.6.9; it does not show up on 3.6.11, while the issue still exists.
>>> So I guess there are some changes in mm behavior in recent kernels.
>>> Probably I should try with your kernel version.
>>>
>>> But anyway, let's look at some existing data first. Please find the
>>> full oom_kill log in the attached file. It seems that the OOM kill
>>> was caused by the freeze issue, since the NFS server (domU) becomes
>>> unresponsive first. There are about 900MB of dirty pages
>>> (writeback:152893 unstable:72860 -- can they simply be added?). I
>>> don't remember the total memory at that time due to the ballooning;
>>> the total DRAM in the host is 8GB.

Hi Wei,

I think I have found something important for this issue. I moved to a
brand-new kernel version -- 3.8.7 -- yesterday and tried both your
configuration and mine. The configuration really matters: my
configuration fails while yours does not.

I made a quick comparison and found a key difference -- kernel
preemption. I enable kernel preemption and use 1000 Hz time slices. I
have my config attached; maybe there is something else that matters, so
you can simply check it out.

To fix the issue, I wonder if it is feasible to reserve some pages for
low-memory situations? Also, the performance drop on large file
transfers does not make sense. With a raw network speed of about 7Gbps,
why do we end up at about 30MB/s with large files? A current HDD should
be able to sustain 100MB/s in sequential writes.

Thanks,
Timothy
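A quick way to confirm exactly which options differ between the two dom0 kernels is the diffconfig helper shipped in the kernel source tree; the file names below are placeholders and the option names are only a guess at what "kernel preemption and 1000 Hz" corresponds to, so verify against the attached configs:

  # run from a kernel source tree; the two .config file names are made up
  scripts/diffconfig config-3.8.7-mine config-3.8.7-wei | grep -E 'PREEMPT|HZ'
  # likely suspects: CONFIG_PREEMPT=y / CONFIG_HZ_1000=y in the failing config
  # versus CONFIG_PREEMPT_VOLUNTARY=y / a lower CONFIG_HZ in the working one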
Hi Wei,

Could you take a look?

Thanks,
Timothy

> Hi Wei,
>
> I think I have found something important for this issue. I moved to a
> brand-new kernel version -- 3.8.7 -- yesterday and tried both your
> configuration and mine. The configuration really matters: my
> configuration fails while yours does not.
>
> I made a quick comparison and found a key difference -- kernel
> preemption. I enable kernel preemption and use 1000 Hz time slices. I
> have my config attached; maybe there is something else that matters, so
> you can simply check it out.
>
> To fix the issue, I wonder if it is feasible to reserve some pages for
> low-memory situations? Also, the performance drop on large file
> transfers does not make sense. With a raw network speed of about 7Gbps,
> why do we end up at about 30MB/s with large files? A current HDD should
> be able to sustain 100MB/s in sequential writes.