Hi Dongxiao,

We're seeing a crash in netbk_gop_frag_copy and I was wondering if it might be related to the multithreaded/tasklet netback patches. This is in a 2.6.27 traditional Xen kernel, but the most recent netback change made in that tree was the backport of the multithread patches.

The crash appears to correspond to the "copy_gop->source.domid = src_pend->netif->domid" line of netbk_gop_frag_copy; specifically, src_pend->netif seems to be NULL.

Have you seen anything like this?

One thing I did notice is that 020ba906 "xen/netback: Multiple tasklets support." includes this hunk:

@@ -321,7 +331,8 @@ static u16 netbk_gop_frag(struct xen_netif *netif, struct netbk_rx_meta *meta,
 
 	copy_gop = npo->copy + npo->copy_prod++;
 	copy_gop->flags = GNTCOPY_dest_gref;
-	if (idx > -1) {
+	if (PageForeign(page)) {
+		struct xen_netbk *netbk = &xen_netbk[group];
 		struct pending_tx_info *src_pend = &netbk->pending_tx_info[idx];
 		copy_gop->source.domid = src_pend->netif->domid;
 		copy_gop->source.u.ref = src_pend->req.gref;

I'm not sure it is guaranteed that all foreign pages which reach this point are netback pages, is it? gnttab_copy_grant_page also sets PageForeign, for example.

Does this change relate to the removal of the

	if ((idx >= MAX_PENDING_REQS) || (netbk->mmap_pages[idx] != pg))
		return -1;

check from netif_page_index in a3031942 "xen/netback: Introduce a new struct type page_ext."?

Do you think we perhaps need to reinstate a similar check to this, as well as first checking that group is a sensible offset into the xen_netbk array?

Thanks,
Ian.

[2010-07-11 03:50:28 UTC] BUG: unable to handle kernel paging request at f0052dac
[2010-07-11 03:50:28 UTC] IP: [<c0284ed8>] netbk_gop_frag_copy+0x78/0x200
[2010-07-11 03:50:28 UTC] Oops: 0000 [#1] SMP
[2010-07-11 03:50:28 UTC] last sysfs file: /sys/devices/xen-backend/vbd-1669-51712/statistics/rd_usecs
[2010-07-11 03:50:28 UTC] Modules linked in: tun nfs nfs_acl dm_round_robin scsi_dh_emc dm_multipath scsi_dh bonding hfsplus lockd sunrpc bridge stp llc(N) ipt_REJECT nf_conntrack_ipv4 xt_state nf_conntrack xt_tcpudp iptable_filter ip_tables x_tables binfmt_misc nls_utf8 isofs(N) sbs sbshc fan battery ac parport_pc lp parport nvram sg evdev(N) usb_storage libusual(N) container bnx2 zlib_inflate(N) usbhid thermal ff_memless qla2xxx processor button thermal_sys hpilo scsi_transport_fc e1000e piix serio_raw 8250_pnp ide_cd_mod 8250 cdrom serial_core ata_piix libata dock rtc_cmos rtc_core rtc_lib pcspkr ide_generic dm_snapshot dm_zero dm_mirror dm_log dm_mod ide_disk cciss sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd usbcore fbcon(N) font(N) tileblit(N) bitblit(N) softcursor(N)
[2010-07-11 03:50:28 UTC] Supported: No, Unsupported modules are loaded
[2010-07-11 03:50:28 UTC]
[2010-07-11 03:50:28 UTC] Pid: 1173, comm: netback/2 Tainted: G (2.6.27.45-0.1.1.xs5.6.900.128.111247xen #1)
[2010-07-11 03:50:28 UTC] EIP: 0061:[<c0284ed8>] EFLAGS: 00010246 CPU: 0
[2010-07-11 03:50:28 UTC] EIP is at netbk_gop_frag_copy+0x78/0x200
[2010-07-11 03:50:28 UTC] EAX: f0052d9c EBX: f0087384 ECX: 00000000 EDX: f0052cfc
[2010-07-11 03:50:28 UTC] ESI: 00000006 EDI: eccfbf44 EBP: eccfbee0 ESP: eccfbec0
[2010-07-11 03:50:28 UTC] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0069
[2010-07-11 03:50:28 UTC] Process netback/2 (pid: 1173, ti=eccfa000 task=ed81d030 task.ti=eccfa000)
[2010-07-11 03:50:28 UTC] Stack: c15a8ba0 cdfb2480 ffff1cfc f008ad3c 00000000 00000006 0000000c 00000001
[2010-07-11 03:50:28 UTC] eccfbfa4 c02852b4 00000006 000000d2 00000000 c16bdb34 eccfbf88 eccfbf20
[2010-07-11 03:50:28 UTC] f008730c f008630c eca73bdc f008530c 00000001 f007d630 eea80200 ed81d030
[2010-07-11 03:50:28 UTC] Call Trace:
[2010-07-11 03:50:28 UTC] [<c02852b4>] ? net_rx_action+0x254/0x920
[2010-07-11 03:50:28 UTC] [<c0287407>] ? netbk_action_thread+0x97/0x170
[2010-07-11 03:50:28 UTC] [<c013de00>] ? autoremove_wake_function+0x0/0x50
[2010-07-11 03:50:28 UTC] [<c0287370>] ? netbk_action_thread+0x0/0x170
[2010-07-11 03:50:28 UTC] [<c013daa2>] ? kthread+0x42/0x70
[2010-07-11 03:50:28 UTC] [<c013da60>] ? kthread+0x0/0x70
[2010-07-11 03:50:28 UTC] [<c010569b>] ? kernel_thread_helper+0x7/0x10
[2010-07-11 03:50:28 UTC] ======================
[2010-07-11 03:50:28 UTC] Code: 69 db 04 e3 00 00 8d 04 40 8d 54 82 f4 89 55 ec 89 5d e8 eb 65 8b 45 f0 8b 55 e8 03 15 a8 94 53 c0 c1 e0 04 8d 84 10 a0 00 00 00 <8b> 50 10 0f b7 12 66 89 53 04 8b 40 04 66 c7 43 12 03 00 89 03
[2010-07-11 03:50:28 UTC] EIP: [<c0284ed8>] netbk_gop_frag_copy+0x78/0x200 SS:ESP 0069:eccfbec0
[2010-07-11 03:50:28 UTC] ---[ end trace cf7f02bf1fe43242 ]---

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

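A note on reading the oops, offered as a sketch rather than a confirmed diagnosis. The marked bytes `<8b> 50 10` in the Code: line decode as `mov 0x10(%eax),%edx`, so the faulting load is at EAX + 0x10, which matches the reported fault address. Earlier in the same byte stream, `69 db 04 e3 00 00` is `imul $0xe304,%ebx,%ebx`, and the product is reloaded from -0x18(%ebp), which is the third word of the stack dump (ffff1cfc). If 0xe304 is the size of one xen_netbk array element (an assumption, nothing in the thread confirms it), the index used would be -1. The helper names below are hypothetical and exist only to check the arithmetic.

```c
#include <assert.h>
#include <stdint.h>

/* Effective address of a load of the form "mov disp(%reg), ...". */
uint32_t effective_address(uint32_t base_reg, uint32_t disp)
{
    return base_reg + disp;
}

/*
 * Recover the multiplier from a 32-bit product saved on the stack,
 * treating the product as a signed value. 'stride' is the constant
 * operand of the imul; interpreting it as an array element size is
 * an assumption of this sketch.
 */
int32_t implied_multiplier(uint32_t product, int32_t stride)
{
    return (int32_t)product / stride;
}

/* Plug in the values printed in the oops. */
void check_oops_arithmetic(void)
{
    /* EAX = f0052d9c, displacement 0x10 -> fault address f0052dac */
    assert(effective_address(0xf0052d9cu, 0x10u) == 0xf0052dacu);
    /* saved product ffff1cfc, imul constant 0xe304 -> multiplier -1 */
    assert(implied_multiplier(0xffff1cfcu, 0xe304) == -1);
}
```

If that reading holds, the load that faulted is the read of src_pend->netif itself, through a src_pend computed from an out-of-range group, rather than a dereference of a NULL netif.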
Ian Campbell wrote:
> Hi Dongxiao,
>
> We're seeing a crash in netbk_gop_frag_copy and I was wondering if it
> might be related to the multithreaded/tasklet netback patches. This is
> in a 2.6.27 traditional Xen kernel, but the most recent netback change
> made in that tree was the backport of the multithread patches.
>
> The crash appears to correspond to the "copy_gop->source.domid =
> src_pend->netif->domid" line of netbk_gop_frag_copy; specifically,
> src_pend->netif seems to be NULL.
>
> Have you seen anything like this?

No, I haven't come across this kind of error before.
Did you encounter this issue while doing inter-domain communication?
I am a bit suspicious about how you get the group number in netbk_gop_frag_copy(); could you paste your rebased patch?

> One thing I did notice is that 020ba906 "xen/netback: Multiple tasklets
> support." includes this hunk:
> @@ -321,7 +331,8 @@ static u16 netbk_gop_frag(struct xen_netif *netif, struct netbk_rx_meta *meta,
>
> 	copy_gop = npo->copy + npo->copy_prod++;
> 	copy_gop->flags = GNTCOPY_dest_gref;
> -	if (idx > -1) {
> +	if (PageForeign(page)) {
> +		struct xen_netbk *netbk = &xen_netbk[group];
> 		struct pending_tx_info *src_pend = &netbk->pending_tx_info[idx];
> 		copy_gop->source.domid = src_pend->netif->domid;
> 		copy_gop->source.u.ref = src_pend->req.gref;
>
> I'm not sure it is guaranteed that all foreign pages which reach this
> point are netback pages, is it? gnttab_copy_grant_page also sets
> PageForeign, for example.

Here the pages are guaranteed to be netback pages.
See the callers of netbk_gop_frag_copy().

> Does this change relate to the removal of the
> 	if ((idx >= MAX_PENDING_REQS) || (netbk->mmap_pages[idx] != pg))
> 		return -1;
> check from netif_page_index in a3031942 "xen/netback: Introduce a new
> struct type page_ext."?

Actually this logic still exists in the code, see the lines:

+	BUG_ON(idx < 0 || idx >= MAX_PENDING_REQS);
+	BUG_ON(netbk->mmap_pages[idx] != page);

Thanks,
Dongxiao

> Do you think we perhaps need to reinstate a similar check to this, as
> well as first checking that group is a sensible offset into the
> xen_netbk array?
>
> Thanks,
> Ian.

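The check under discussion in this exchange can be sketched in user-space C as follows. This is only an illustration of the idea (validate group, then idx, then confirm the page is the one netback recorded for that slot); it is not the code from either tree. The struct layouts, the function name netbk_page_lookup, the nr_groups parameter and the value of MAX_PENDING_REQS are all assumptions made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_PENDING_REQS 256   /* assumed value for this sketch */

/* Hypothetical stand-in for a page plus its (group, idx) bookkeeping. */
struct page {
    int group;
    int idx;
};

/* Hypothetical, heavily reduced stand-in for struct xen_netbk. */
struct xen_netbk {
    const struct page *mmap_pages[MAX_PENDING_REQS];
};

/*
 * Return true, and fill in *group and *idx, only if 'pg' is a page that
 * one of the netback groups actually tracks. Any other foreign page
 * (for example one owned by another backend) is rejected instead of
 * being used to index the arrays.
 */
bool netbk_page_lookup(const struct xen_netbk *netbk, int nr_groups,
                       const struct page *pg, int *group, int *idx)
{
    assert(netbk != NULL && pg != NULL && group != NULL && idx != NULL);

    if (pg->group < 0 || pg->group >= nr_groups)
        return false;                      /* group is not a valid offset */
    if (pg->idx < 0 || pg->idx >= MAX_PENDING_REQS)
        return false;                      /* idx is out of range */
    if (netbk[pg->group].mmap_pages[pg->idx] != pg)
        return false;                      /* slot belongs to another page */

    *group = pg->group;
    *idx = pg->idx;
    return true;
}
```

Returning false rather than hitting a BUG_ON lets the receive path fall back to treating the page as an ordinary local page, which is the behaviour the old "return -1" gave.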
On Wed, 2010-07-21 at 08:56 +0100, Xu, Dongxiao wrote:
> Ian Campbell wrote:
> > Hi Dongxiao,
> >
> > We're seeing a crash in netbk_gop_frag_copy and I was wondering if it
> > might be related to the multithreaded/tasklet netback patches. This is
> > in a 2.6.27 traditional Xen kernel, but the most recent netback change
> > made in that tree was the backport of the multithread patches.
> >
> > The crash appears to correspond to the "copy_gop->source.domid =
> > src_pend->netif->domid" line of netbk_gop_frag_copy; specifically,
> > src_pend->netif seems to be NULL.
> >
> > Have you seen anything like this?
>
> No, I haven't come across this kind of error before.
> Did you encounter this issue while doing inter-domain communication?
> I am a bit suspicious about how you get the group number in
> netbk_gop_frag_copy(); could you paste your rebased patch?

The full patch queue is available at http://xenbits.xen.org/XCP/linux-2.6.27.pq.hg which applies to pristine 2.6.27 from kernel.org, for example the v2.6.27 tag in http://kernel.org/hg/linux-2.6

The issue was originally observed at revision 128:dd3f71e86327.

Note that netbk_gop_frag_copy() is the same in this tree as in xen.git and that netback generally is pretty much in sync with xen.git.

> > One thing I did notice is that 020ba906 "xen/netback: Multiple tasklets
> > support." includes this hunk:
> > @@ -321,7 +331,8 @@ static u16 netbk_gop_frag(struct xen_netif *netif, struct netbk_rx_meta *meta,
> >
> > 	copy_gop = npo->copy + npo->copy_prod++;
> > 	copy_gop->flags = GNTCOPY_dest_gref;
> > -	if (idx > -1) {
> > +	if (PageForeign(page)) {
> > +		struct xen_netbk *netbk = &xen_netbk[group];
> > 		struct pending_tx_info *src_pend = &netbk->pending_tx_info[idx];
> > 		copy_gop->source.domid = src_pend->netif->domid;
> > 		copy_gop->source.u.ref = src_pend->req.gref;
> >
> > I'm not sure it is guaranteed that all foreign pages which reach this
> > point are netback pages, is it? gnttab_copy_grant_page also sets
> > PageForeign, for example.
>
> Here the pages are guaranteed to be netback pages.
> See the callers of netbk_gop_frag_copy().

The only caller is netbk_gop_skb (twice), which is itself called from net_rx_action(), which takes SKBs from netbk->rx_queue. SKBs are put on netbk->rx_queue by netif_be_start_xmit, which is called by the generic kernel infrastructure with SKBs that can have come either from other VIFs or from the physical network etc., and hence the pages therein could be either foreign or local. Therefore netbk_gop_frag_copy needs to know whether a page is foreign or not.

Remember that this function is called on the guest receive path, i.e. pages going from domain 0 to domain U (so the domain 0 transmit path; yes, the nomenclature used in netback is backwards and a bit confusing).

Note that foreign pages in this context will (most likely) belong to a different netif instance, i.e. not the one passed to netbk_gop_skb.

> > Does this change relate to the removal of the
> > 	if ((idx >= MAX_PENDING_REQS) || (netbk->mmap_pages[idx] != pg))
> > 		return -1;
> > check from netif_page_index in a3031942 "xen/netback: Introduce a new
> > struct type page_ext."?
>
> Actually this logic still exists in the code, see the lines:
>
> +	BUG_ON(idx < 0 || idx >= MAX_PENDING_REQS);
> +	BUG_ON(netbk->mmap_pages[idx] != page);

That's only in netif_page_release though, which would come later than this crash I think, and only on the guest TX path in any case.

Ian.

> Thanks,
> Dongxiao
>
> > Do you think we perhaps need to reinstate a similar check to this, as
> > well as first checking that group is a sensible offset into the
> > xen_netbk array?
> >
> > Thanks,
> > Ian.

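To make the point about the guest receive path concrete, here is a minimal user-space sketch of how the source of a grant copy might be chosen once a page has been classified. It assumes the caller has already established whether the page is tracked by netback; the types tx_origin and copy_source and the function pick_copy_source are hypothetical simplifications, not the kernel's structures. DOMID_SELF uses the value from Xen's public headers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DOMID_SELF 0x7FF0u   /* value from Xen's public xen.h */

/* Simplified stand-in for pending_tx_info: who granted the page. */
struct tx_origin {
    uint16_t domid;   /* domain of the vif that transmitted the packet */
    uint32_t gref;    /* grant reference supplied by that domain */
};

/* Simplified stand-in for the source half of a grant-copy operation. */
struct copy_source {
    uint16_t domid;
    bool     by_gref; /* true: 'ref' is a grant ref; false: a local frame */
    uint32_t ref;
};

/*
 * 'origin' is non-NULL only for pages that netback itself mapped from
 * another guest. Note that the source domain is the *originating* vif's
 * domain, not the destination netif handed to the receive path.
 */
struct copy_source pick_copy_source(const struct tx_origin *origin,
                                    uint32_t local_frame)
{
    struct copy_source src;

    if (origin != NULL) {
        src.domid = origin->domid;
        src.by_gref = true;
        src.ref = origin->gref;
    } else {
        src.domid = DOMID_SELF;   /* page is local to domain 0 */
        src.by_gref = false;
        src.ref = local_frame;
    }
    return src;
}

/* Usage: one packet forwarded from another vif, one from the physical NIC. */
void demo_pick_copy_source(void)
{
    struct tx_origin from_vif = { .domid = 7, .gref = 42 };
    struct copy_source a = pick_copy_source(&from_vif, 0);
    struct copy_source b = pick_copy_source(NULL, 0x1234);

    assert(a.domid == 7 && a.by_gref && a.ref == 42);
    assert(b.domid == DOMID_SELF && !b.by_gref && b.ref == 0x1234);
}
```

The design point is that the foreign branch is only safe when the ownership test is reliable; a foreign page that netback does not track has to take the local branch or be rejected, never be used to look up a tx_origin.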