Hello, I'm playing with the InfiniBand driver in Xen, both Dom0 and DomU. I found out that the driver is somewhat unstable in Dom0, and in DomU it does not work at all, which is something I would like to fix.

In Dom0, the following sequence kills the machine:

modprobe ib_mthca
rmmod ib_mthca
insmod ib_mthca debug_level=1

It produces an oops in a random process. If I turn on SLAB DEBUG, I get the following report:

Redzone: 0x1600000016/0x1700000017.
Last user: <0000001800000018>(0x1800000018)
000: 17 00 00 00 17 00 00 00 18 00 00 00 18 00 00 00
010: 19 00 00 00 19 00 00 00 1a 00 00 00 1a 00 00 00
020: 1b 00 00 00 1b 00 00 00 1c 00 00 00 1c 00 00 00
030: 1d 00 00 00 1d 00 00 00 1e 00 00 00 1e 00 00 00
040: 1f 00 00 00 1f 00 00 00 00 00 00 00 00 00 00 00
050: 01 00 00 00 01 00 00 00 02 00 00 00 02 00 00 00
Prev obj: start=0000000398f5120b, len=256

The InfiniBand developers told me that they are using SLAB_DEBUG and modprobe/rmmod works OK in a non-Xen environment.

So, any thoughts or suggestions on how to catch the bug?

-- 
Lukáš Hejtmánek

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
On 8/7/07 01:54, "Lukas Hejtmanek" <xhejtman@ics.muni.cz> wrote:

> The InfiniBand developers told me that they are using SLAB_DEBUG and
> modprobe/rmmod works OK in a non-Xen environment.
>
> So, any thoughts or suggestions on how to catch the bug?

It could be DMAing randomly. Is there a debug flag in the driver to provide verbose tracing? Or you could add tracing to (or audit) the places where it initiates DMA and see whether the DMA addresses are correctly calculated.

-- Keir
On Sun, Jul 08, 2007 at 10:59:20AM +0100, Keir Fraser wrote:

> It could be DMAing randomly. Is there a debug flag in the driver to provide
> verbose tracing? Or you could add tracing to (or audit) the places where it
> initiates DMA and see whether the DMA addresses are correctly calculated.

Yeah, I will try to add tracing to the DMA actions. However, can the address BARs influence DMA, mainly if they are provided by Dom0 for DomU?

More precisely, if I bypass the BAR logic (in pciback) and write the addresses directly into the PCI config space, isn't Xen/Dom0 confused, possibly setting up DMA transfers incorrectly? I.e., which part physically sets up the DMA -- Dom0 or DomU?

-- 
Lukáš Hejtmánek
On 8/7/07 11:05, "Lukas Hejtmanek" <xhejtman@ics.muni.cz> wrote:

>> It could be DMAing randomly. Is there a debug flag in the driver to provide
>> verbose tracing? Or you could add tracing to (or audit) the places where it
>> initiates DMA and see whether the DMA addresses are correctly calculated.
>
> Yeah, I will try to add tracing to the DMA actions. However, can the address
> BARs influence DMA, mainly if they are provided by Dom0 for DomU?
>
> More precisely, if I bypass the BAR logic (in pciback) and write the
> addresses directly into the PCI config space, isn't Xen/Dom0 confused,
> possibly setting up DMA transfers incorrectly? I.e., which part physically
> sets up the DMA -- Dom0 or DomU?

DomU sets up its own DMAs. The BARs do not come into it (except that they will indicate where the device registers are, and that's how the DMA is programmed).

It sounds like you have this SLAB bug in both dom0 and domU right now. You should get your driver working properly in dom0 before taking the next step of debugging it in domU. Otherwise you are trying to track down multiple problems simultaneously.

-- Keir
On Sun, Jul 08, 2007 at 11:11:58AM +0100, Keir Fraser wrote:

> DomU sets up its own DMAs. The BARs do not come into it (except that they
> will indicate where the device registers are, and that's how the DMA is
> programmed).

One of the maintainers of the IB driver suggested a theory. How does Xen allocate memory? Are there guarantees about the allocation of physically contiguous memory? E.g., if I request alloc_pages(GFP_KERNEL, 6) (256 kB), are these pages physically contiguous? Or are they only guest-contiguous?

-- 
Lukáš Hejtmánek
> From: Lukas Hejtmanek
> Sent: 9 July 2007 15:38
>
> On Sun, Jul 08, 2007 at 11:11:58AM +0100, Keir Fraser wrote:
> > DomU sets up its own DMAs. The BARs do not come into it (except that they
> > will indicate where the device registers are, and that's how the DMA is
> > programmed).
>
> One of the maintainers of the IB driver suggested a theory. How does Xen
> allocate memory? Are there guarantees about the allocation of physically
> contiguous memory? E.g., if I request alloc_pages(GFP_KERNEL, 6) (256 kB),
> are these pages physically contiguous? Or are they only guest-contiguous?

alloc_pages doesn't guarantee physically contiguous memory. But you can get it if you turn to the standard DMA allocation interfaces.

Thanks,
Kevin
On 9/7/07 08:38, "Lukas Hejtmanek" <xhejtman@ics.muni.cz> wrote:

> One of the maintainers of the IB driver suggested a theory. How does Xen
> allocate memory? Are there guarantees about the allocation of physically
> contiguous memory? E.g., if I request alloc_pages(GFP_KERNEL, 6) (256 kB),
> are these pages physically contiguous? Or are they only guest-contiguous?

Nope. To get contiguous pages on Xen you need to use pci_alloc_consistent, or pci_map_single, or similar. Most drivers are okay with this, because you are not supposed to program DMA without having got a dma_addr_t from Linux's DMA API, and we hook into the DMA API to serve up contiguous regions (via the swiotlb [i.e. bounce buffers], if necessary).

-- Keir
On Mon, Jul 09, 2007 at 09:31:16AM +0100, Keir Fraser wrote:

> Nope. To get contiguous pages on Xen you need to use pci_alloc_consistent,
> or pci_map_single, or similar. Most drivers are okay with this, because you
> are not supposed to program DMA without having got a dma_addr_t from Linux's
> DMA API, and we hook into the DMA API to serve up contiguous regions (via
> the swiotlb [i.e. bounce buffers], if necessary).

According to the IB developers, it uses alloc_pages and then pci_map_sg, which they state is the DMA API. So, is this a correct approach?

I added debugging like this, so it prints physical and virtual addresses, and they seem to be perfectly contiguous. Are the physical addresses real physical addresses, or are they virtualized by the Xen hypervisor?

static int mthca_alloc_icm_pages(struct mthca_dev *pdev, struct device *dev,
                                 struct scatterlist *mem, int order,
                                 gfp_t gfp_mask)
{
        int o;
        void *page;

        mem->page = alloc_pages(gfp_mask, order);
        if (!mem->page)
                return -ENOMEM;

        mthca_err(pdev, "Alloc pages starts\n");
        page = page_address(mem->page);
        o = PAGE_SIZE << order;
        while (o > 0) {
                /* virt_to_phys() returns a number, so print it as one */
                mthca_err(pdev, "Page phys. addr %lx, virt %p\n",
                          (unsigned long)virt_to_phys(page), page);
                o -= PAGE_SIZE;
                page += PAGE_SIZE;
        }

        mem->length = PAGE_SIZE << order;
        mem->offset = 0;
        return 0;
}

-- 
Lukáš Hejtmánek
On 9/7/07 13:07, "Lukas Hejtmanek" <xhejtman@ics.muni.cz> wrote:

> According to the IB developers, it uses alloc_pages and then pci_map_sg,
> which they state is the DMA API. So, is this a correct approach?

Well, we currently assume that segments of a scatterlist (as passed to pci_map_sg) do not cross page boundaries. It looks like this assumption is broken (certainly by the infiniband driver!).

I've just checked in the attached patch to fix this issue. Please give it a try in your dom0 tree.

> I added debugging like this, so it prints physical and virtual addresses,
> and they seem to be perfectly contiguous. Are the physical addresses real
> physical addresses, or are they virtualized by the Xen hypervisor?

You want to use virt_to_machine(). As you suspect, the addresses returned by virt_to_phys() are virtual-physical, which is not what you want.

-- Keir
On Mon, Jul 09, 2007 at 01:47:34PM +0100, Keir Fraser wrote:

> Well, we currently assume that segments of a scatterlist (as passed to
> pci_map_sg) do not cross page boundaries. It looks like this assumption is
> broken (certainly by the infiniband driver!).
>
> I've just checked in the attached patch to fix this issue. Please give it a
> try in your dom0 tree.

For some reason, insmod/rmmod works fine even without the patch in Dom0. The patch didn't apply cleanly anyway; I'm using stock Xen 3.1, which does not have the code:

dma = gnttab_dma_map_page(virt_to_page(ptr)) + offset_in_page(ptr);

Anyway, in DomU it produced this message (but also an oops):

Fatal DMA error! Please use 'swiotlb=force'
----------- cut here --------- please bite here ---------
Kernel BUG at arch/x86_64/kernel/../../i386/kernel/pci-dma-xen.c:100

> > I added debugging like this, so it prints physical and virtual addresses,
> > and they seem to be perfectly contiguous. Are the physical addresses real
> > physical addresses, or are they virtualized by the Xen hypervisor?
>
> You want to use virt_to_machine(). As you suspect, the addresses returned by
> virt_to_phys() are virtual-physical, which is not what you want.

Well, we got it. virt_to_machine() shows that although the pages are contiguous, they are in reversed order, as can be seen below. Should swiotlb=force solve the problem?

ib_mthca 0000:08:00.0: Alloc pages starts
ib_mthca 0000:08:00.0: Page phys. addr 0000000025455000, virt ffff880099d40000
ib_mthca 0000:08:00.0: Page phys. addr 0000000025454000, virt ffff880099d41000
ib_mthca 0000:08:00.0: Page phys. addr 0000000025453000, virt ffff880099d42000
ib_mthca 0000:08:00.0: Page phys. addr 0000000025452000, virt ffff880099d43000
ib_mthca 0000:08:00.0: Page phys. addr 0000000025451000, virt ffff880099d44000
ib_mthca 0000:08:00.0: Page phys. addr 0000000025450000, virt ffff880099d45000
ib_mthca 0000:08:00.0: Page phys. addr 000000002544f000, virt ffff880099d46000
ib_mthca 0000:08:00.0: Page phys. addr 000000002544e000, virt ffff880099d47000
ib_mthca 0000:08:00.0: Page phys. addr 000000002544d000, virt ffff880099d48000
ib_mthca 0000:08:00.0: Page phys. addr 000000002544c000, virt ffff880099d49000
ib_mthca 0000:08:00.0: Page phys. addr 000000002544b000, virt ffff880099d4a000
ib_mthca 0000:08:00.0: Page phys. addr 000000002544a000, virt ffff880099d4b000

-- 
Lukáš Hejtmánek
On 9/7/07 14:24, "Lukas Hejtmanek" <xhejtman@ics.muni.cz> wrote:

> Well, we got it. virt_to_machine() shows that although the pages are
> contiguous, they are in reversed order, as can be seen below. Should
> swiotlb=force solve the problem?

Yes, although that won't be needed in dom0 (which always has a swiotlb). But clearly it is *definitely* needed for a domU driving the infiniband device.

You definitely need the patch that I posted. If it won't apply cleanly to your kernel tree then you'll have to apply it manually, or move to the current 'unstable' linux-2.6.18-xen.hg tree.

-- Keir
On Mon, Jul 09, 2007 at 02:42:34PM +0100, Keir Fraser wrote:

> Yes, although that won't be needed in dom0 (which always has a swiotlb). But
> clearly it is *definitely* needed for a domU driving the infiniband device.
>
> You definitely need the patch that I posted. If it won't apply cleanly to
> your kernel tree then you'll have to apply it manually, or move to the
> current 'unstable' linux-2.6.18-xen.hg tree.

Yeah, it solved the oops, thanks! However, I got another oops in __sync_single because the host addr is invalid.

I suppose it is because sync_single picks up an invalid entry from io_tlb_orig_addr. It uses index 3332, which was never inserted by map_page. The invalid address is 0x0021d1242de00000, which is strange because I added a memset to zero io_tlb_orig_addr at the beginning, yet such an address is still there even though that index was never filled in by map_page.

-- 
Lukáš Hejtmánek
On 9/7/07 16:39, "Lukas Hejtmanek" <xhejtman@ics.muni.cz> wrote:

>> You definitely need the patch that I posted. If it won't apply cleanly to
>> your kernel tree then you'll have to apply it manually, or move to the
>> current 'unstable' linux-2.6.18-xen.hg tree.
>
> Yeah, it solved the oops, thanks! However, I got another oops in
> __sync_single because the host addr is invalid.
>
> I suppose it is because sync_single picks up an invalid entry from
> io_tlb_orig_addr. It uses index 3332, which was never inserted by map_page.
> The invalid address is 0x0021d1242de00000, which is strange because I added
> a memset to zero io_tlb_orig_addr at the beginning, yet such an address is
> still there even though that index was never filled in by map_page.

Nothing should read from an io_tlb_orig_addr[] slot that hasn't been initialised by map_single(). That's because sync_single() is only valid to be called on a memory region that was previously map_single()'d. So what you're seeing is rather odd.

-- Keir
On Mon, Jul 09, 2007 at 05:13:10PM +0100, Keir Fraser wrote:

> Nothing should read from an io_tlb_orig_addr[] slot that hasn't been
> initialised by map_single(). That's because sync_single() is only valid to
> be called on a memory region that was previously map_single()'d. So what
> you're seeing is rather odd.

Well, it looks like __sync_single is called on the first page of the order-6 batch that has been allocated. So, are you saying that this is incorrect? Because order>0 allocations do not pass through map_single(), if I understand correctly.

-- 
Lukáš Hejtmánek
On 9/7/07 18:11, "Lukas Hejtmanek" <xhejtman@ics.muni.cz> wrote:

> Well, it looks like __sync_single is called on the first page of the order-6
> batch that has been allocated. So, are you saying that this is incorrect?
>
> Because order>0 allocations do not pass through map_single(), if I
> understand correctly.

By my understanding, the infiniband driver is doing an order-6 allocation, then stuffing that multi-page region into a single element of a scatterlist. It is then calling dma_map_sg(), which [on Xen] calls swiotlb_map_sg(), which calls map_single() on that multi-page extent. Am I misunderstanding something?

-- Keir
On Mon, Jul 09, 2007 at 06:21:24PM +0100, Keir Fraser wrote:

> By my understanding, the infiniband driver is doing an order-6 allocation,
> then stuffing that multi-page region into a single element of a scatterlist.
> It is then calling dma_map_sg(), which [on Xen] calls swiotlb_map_sg(),
> which calls map_single() on that multi-page extent. Am I misunderstanding
> something?

Not exactly.

The IB driver does alloc_pages (order-6). Then it calls pci_map_sg(), which is basically dma_map_sg(). This one invokes swiotlb_map_sg(), which possibly calls map_single() (right now I'm not sure, I must check it; but if yes, it creates index 3328).

Then later, the IB driver invokes dma_sync_single on the first page of that order-6 allocation via the dma_handle. And __sync_single crashes because sync_single picks up index 3332 instead of 3328.

Btw, please keep Roland in Cc.

-- 
Lukáš Hejtmánek
On 9/7/07 18:42, "Lukas Hejtmanek" <xhejtman@ics.muni.cz> wrote:

> The IB driver does alloc_pages (order-6). Then it calls pci_map_sg(), which
> is basically dma_map_sg(). This one invokes swiotlb_map_sg(), which possibly
> calls map_single() (right now I'm not sure, I must check it; but if yes, it
> creates index 3328).
>
> Then later, the IB driver invokes dma_sync_single on the first page of that
> order-6 allocation via the dma_handle. And __sync_single crashes because
> sync_single picks up index 3332 instead of 3328.
>
> Btw, please keep Roland in Cc.

Oh! I take it then that the infiniband driver will call sync_single() on subsections of a mapped region? I haven't seen that behaviour before, and it will kill lib/swiotlb.c (the generic Linux swiotlb implementation) just as surely as it does the Xen-specific swiotlb!

We could make the swiotlb robust to this treatment, I guess. It will involve initialising all covered io_tlb_orig_addr[] slots rather than just the first. You could even have a go at this yourself if you like: rather than initialising a single slot at the end of map_single(), you'd have a for-loop to iterate over each allocated swiotlb slab.

-- Keir
On Mon, Jul 09, 2007 at 07:29:59PM +0100, Keir Fraser wrote:

> Oh! I take it then that the infiniband driver will call sync_single() on
> subsections of a mapped region? I haven't seen that behaviour before, and it
> will kill lib/swiotlb.c (the generic Linux swiotlb implementation) just as
> surely as it does the Xen-specific swiotlb!
>
> We could make the swiotlb robust to this treatment, I guess. It will involve
> initialising all covered io_tlb_orig_addr[] slots rather than just the
> first. You could even have a go at this yourself if you like: rather than
> initialising a single slot at the end of map_single(), you'd have a for-loop
> to iterate over each allocated swiotlb slab.

Is there a reason why this is not an issue in Dom0?

-- 
Lukáš Hejtmánek
On 9/7/07 19:37, "Lukas Hejtmanek" <xhejtman@ics.muni.cz> wrote:

> Is there a reason why this is not an issue in Dom0?

It should be an issue in dom0 just the same as in domU. You may get lucky in dom0 and not use the swiotlb as much, as your dom0 is more likely to have a physically contiguous memory map to start with.

Anyway, attached is a patch I just applied to our unstable Linux tree. It should fix this issue.

-- Keir
> By my understanding, the infiniband driver is doing an order-6 allocation,
> then stuffing that multi-page region into a single element of a scatterlist.
> It is then calling dma_map_sg(), which [on Xen] calls swiotlb_map_sg(),
> which calls map_single() on that multi-page extent. Am I misunderstanding
> something?

Don't know about the Xen part, but you are correct about the mthca driver part. And as far as I know it is perfectly fine to put order>0 pages into a scatterlist...

 - R.
On 9/7/07 19:46, "Keir Fraser" <keir@xensource.com> wrote:

> Anyway, attached is a patch I just applied to our unstable Linux tree. It
> should fix this issue.

And here's *another* patch. It applies on top of the previous one (i.e. you need both).

-- Keir
> Oh! I take it then that the infiniband driver will call sync_single() on
> subsections of a mapped region? I haven't seen that behaviour before, and it
> will kill lib/swiotlb.c (the generic Linux swiotlb implementation) just as
> surely as it does the Xen-specific swiotlb!

I guess it is not being hit on non-Xen Linux because mthca sets a 64-bit DMA mask and hence bypasses the swiotlb.

However, what you point out does seem to be a bug: Documentation/DMA-API.txt says:

    void dma_sync_single(struct device *dev, dma_addr_t dma_handle,
                         size_t size, enum dma_data_direction direction)

    ... synchronise a single contiguous or scatter/gather mapping.  All
    the parameters must be the same as those passed into the single
    mapping API.

so it seems not really kosher to do it on a subsection of a mapping...

I'll take a look at how we can do better here.

 - R.
On 9/7/07 20:25, "Roland Dreier" <rdreier@cisco.com> wrote:

> so it seems not really kosher to do it on a subsection of a mapping...
>
> I'll take a look at how we can do better here.

Thanks. I don't expect my workaround fixes would be accepted into the mainline lib/swiotlb.c, since they work around disallowed behaviour. We want to get rid of the Xen-specific swiotlb in the near future, so the fixes are a temporary measure.

-- Keir
On Mon, Jul 09, 2007 at 08:22:05PM +0100, Keir Fraser wrote:

> On 9/7/07 19:46, "Keir Fraser" <keir@xensource.com> wrote:
>
> > Anyway, attached is a patch I just applied to our unstable Linux tree. It
> > should fix this issue.
>
> And here's *another* patch. It applies on top of the previous one (i.e. you
> need both).

OK, applied. However, the first patch already worked; at least there are no more oopses.

However, the driver complains that it cannot receive an IRQ. Is the IRQ handler registered by pciback, or should it be registered directly in DomU?

-- 
Lukáš Hejtmánek
On Mon, Jul 09, 2007 at 12:25:16PM -0700, Roland Dreier wrote:

> > Oh! I take it then that the infiniband driver will call sync_single() on
> > subsections of a mapped region? I haven't seen that behaviour before, and
> > it will kill lib/swiotlb.c (the generic Linux swiotlb implementation)
> > just as surely as it does the Xen-specific swiotlb!
>
> I guess it is not being hit on non-Xen Linux because mthca sets a
> 64-bit DMA mask and hence bypasses the swiotlb.
>
> However, what you point out does seem to be a bug:

Thanks for helping to solve this issue. IB under Xen is another step forward :)

However, the driver itself complains about not receiving an IRQ. But the IRQ *is* received: mthca_eq_int runs, but next_eqe_sw returns NULL because MTHCA_EQ_ENTRY_OWNER_HW is set. Any idea what could be wrong?

-- 
Lukáš Hejtmánek
> However, the driver itself complains about not receiving an IRQ. But the
> IRQ *is* received: mthca_eq_int runs, but next_eqe_sw returns NULL because
> MTHCA_EQ_ENTRY_OWNER_HW is set.

The EQ (event queue) is allocated with dma_alloc_coherent() in PAGE_SIZE chunks. My best guess would be that somehow the hardware is getting the wrong DMA address for the EQ and writing the EQE (event queue entry) to the wrong place.

This is probably related to the earlier problem, since that was the driver writing into some hardware data structures, which may not work if there's a bounce buffer involved.

What version of the kernel are you using for your domU? I may be able to give you a simple patch to confirm this theory.

 - R.
On Mon, Jul 09, 2007 at 02:07:13PM -0700, Roland Dreier wrote:

> This is probably related to the earlier problem, since that was the
> driver writing into some hardware data structures, which may not work
> if there's a bounce buffer involved.
>
> What version of the kernel are you using for your domU? I may be able
> to give you a simple patch to confirm this theory.

2.6.18 kernel + Xen 3.1 + OFED 1.2.

I added some printks into OFED, so it may be better to grab it from my web site; it's quite small, about 2.5 MB:

http://undomiel.ics.muni.cz/tmp/ofa_kernel-1.2.tar.bz2

-- 
Lukáš Hejtmánek
On 9/7/07 20:43, "Lukas Hejtmanek" <xhejtman@ics.muni.cz> wrote:

> > And here's *another* patch. It applies on top of the previous one (i.e.
> > you need both).
>
> OK, applied. However, the first patch already worked; at least there are no
> more oopses.
>
> However, the driver complains that it cannot receive an IRQ. Is the IRQ
> handler registered by pciback, or should it be registered directly in DomU?

Does the driver now work in dom0?

-- Keir
> 2.6.18 kernel + Xen 3.1 + OFED 1.2

OK, if it's OFED 1.2 then you have a pretty recent mthca driver. See what happens if you comment out the line

    dev->mthca_flags |= MTHCA_FLAG_FMR;

in mthca_mr.c. (It's in an if/else statement, so obviously you have to get rid of the line in a way that leaves the driver compiling etc...)

That should make the driver stop writing directly into the memory translation table.

 - R.
On Mon, Jul 09, 2007 at 10:18:53PM +0100, Keir Fraser wrote:

> > OK, applied. However, the first patch already worked; at least there are
> > no more oopses.
> >
> > However, the driver complains that it cannot receive an IRQ. Is the IRQ
> > handler registered by pciback, or should it be registered directly in
> > DomU?
>
> Does the driver now work in dom0?

Hmm, interesting: it does *not*. The IRQ *is* invoked, but there is some other issue, still related to the bounce buffers. See the mail from Roland.

-- 
Lukáš Hejtmánek
> Oh! I take it then that the infiniband driver will call sync_single() on
> subsections of a mapped region? I haven't seen that behaviour before, and it
> will kill lib/swiotlb.c (the generic Linux swiotlb implementation) just as
> surely as it does the Xen-specific swiotlb!
>
> We could make the swiotlb robust to this treatment, I guess. It will involve
> initialising all covered io_tlb_orig_addr[] slots rather than just the
> first.

Does this mean that lib/swiotlb.c's swiotlb_sync_single_range_for_cpu() and swiotlb_sync_single_range_for_device() are broken? Given that (as you say) io_tlb_orig_addr[] only gets one slot filled in at the end of map_single(), I don't see any way it could work if more than one page is mapped.

 - R.
On 9/7/07 22:26, "Roland Dreier" <rdreier@cisco.com> wrote:

> Does this mean that lib/swiotlb.c's swiotlb_sync_single_range_for_cpu()
> and swiotlb_sync_single_range_for_device() are broken? Given that (as
> you say) io_tlb_orig_addr[] only gets one slot filled in at the end of
> map_single(), I don't see any way it could work if more than one page
> is mapped.

Yeah, see the email I posted just one second ago. :-) So my workaround patches are quite applicable to lib/swiotlb.c as a genuine bug fix!

-- Keir
On Mon, Jul 09, 2007 at 02:18:57PM -0700, Roland Dreier wrote:

> OK, if it's OFED 1.2 then you have a pretty recent mthca driver. See what
> happens if you comment out the line
>
>     dev->mthca_flags |= MTHCA_FLAG_FMR;
>
> in mthca_mr.c. (It's in an if/else statement, so obviously you have to get
> rid of the line in a way that leaves the driver compiling etc...)
>
> That should make the driver stop writing directly into the memory
> translation table.

Great! With this "fix", it brings up the ib0 interface. Many thanks to all!

Anyway, does this fix limit any features, latency, or throughput?

-- 
Lukáš Hejtmánek
On Mon, Jul 09, 2007 at 10:18:53PM +0100, Keir Fraser wrote:

> On 9/7/07 20:43, "Lukas Hejtmanek" <xhejtman@ics.muni.cz> wrote:
>
>> OK, applied. However, the first patch already worked; at least no more
>> oopses.
>>
>> However, the driver complains that it cannot receive an IRQ. Is the IRQ
>> handler registered by pciback, or should it be registered directly in
>> DomU?
>
> Does the driver now work in dom0?

With the latest fix from Roland, the driver works in both dom0 and domU
(or at least, it establishes a link and brings up the ib0 device).

 -- Lukáš Hejtmánek
> Great! With this "fix", it turns on interface ib0. Many thanks to all!

In domU? That is cool.

> Anyway, is this fix limiting some features, latencies or throughput?

It gets rid of the FMR feature entirely and also affects memory
registration speed. Although going through a swiotlb memcpy may eliminate
the memory registration speed advantage anyway. Things are still quite
useful without the direct write to the MTT -- I think Lustre may be the
only thing that *has* to have FMRs, and the memory registration
optimization is not that important; we didn't have it for quite a while
and it wasn't *that* big a deal.

Anyway I think the real fix is to switch to dma_sync_single_range in
mthca along with Keir's fixes to swiotlb.

Although I'm a little confused about the earlier parts of the story.
Why was it necessary to force the use of swiotlb? Shouldn't things
work by default?

And is there any more intelligent way to give big chunks of system
memory to a PCI device for exclusive use?

 - R.
On Mon, Jul 09, 2007 at 02:53:53PM -0700, Roland Dreier wrote:

>> Great! With this "fix", it turns on interface ib0. Many thanks to all!
>
> In domU? That is cool.

Yes! I'm able to ping another machine over the IB from DomU. In Dom0, the
driver works as well. (It stopped working after the swiotlb patches to
Xen.)

> Anyway I think the real fix is to switch to dma_sync_single_range in
> mthca along with Keir's fixes to swiotlb.

OK, I see. I would be happy to test any further patches if you are about
to do them.

> Although I'm a little confused about the earlier parts of the story.
> Why was it necessary to force the use of swiotlb? Shouldn't things
> work by default?

I guess it's because swiotlb is ON by default in Dom0, so forcing it also
in DomU makes it equivalent to Dom0. And because both domains are virtual,
i.e., the domains do not see *real* physical memory, the standard memory
mapping routines do not work. That's my guess.

> And is there any more intelligent way to give big chunks of system
> memory to a PCI device for exclusive use?

Maybe some support in Xen -- a call that is able to map contiguous areas.
But even in this case, we have a problem with migration and with reporting
physical addresses to the PCI device. It's similar to the FTP and NAT
problem: NAT basically virtualizes the network, but FTP needs to pass the
data-connection IP, so hooks in NAT are needed. We would need similar
hooks here, or we would have to forbid passing 'virtual physical'
addresses to the devices.

 -- Lukáš Hejtmánek
On 9/7/07 22:53, "Roland Dreier" <rdreier@cisco.com> wrote:

> Although I'm a little confused about the earlier parts of the story.
> Why was it necessary to force the use of swiotlb? Shouldn't things
> work by default?

A swiotlb is a pre-allocated bounce-buffer region, so it has a memory cost
even if it's not actually used. Hence we do not create one by default for
a domU -- it has to be forced. Perhaps we could work out some way to
detect whether a swiotlb is likely to be needed, but our BUG_ON() messages
are pretty clear about why they are BUGging, and I considered that good
enough.

> And is there any more intelligent way to give big chunks of system
> memory to a PCI device for exclusive use?

Perhaps dma_alloc_coherent/pci_alloc_consistent? These always return
machine-contiguous memory. I'm not sure if their use in this way would be
an abuse of the DMA API, though.

 -- Keir
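For reference, the allocation pattern being suggested looks roughly like this in a Linux driver (a hedged sketch of standard DMA-API usage, not code from mthca; function and parameter names other than the DMA-API calls themselves are made up for illustration, and error handling is trimmed):

```c
#include <linux/dma-mapping.h>

/* Sketch: allocate a buffer the device can DMA into. 'dev' is the
 * struct device embedded in the PCI function; 'size' is illustrative. */
static void *alloc_device_region(struct device *dev, size_t size,
                                 dma_addr_t *dma_handle)
{
    /* dma_alloc_coherent returns a CPU virtual address and fills in
     * *dma_handle with the bus address; under Xen the returned region
     * is machine-contiguous, which is the property discussed above. */
    return dma_alloc_coherent(dev, size, dma_handle, GFP_KERNEL);
}

/* Matching free. */
static void free_device_region(struct device *dev, size_t size,
                               void *cpu_addr, dma_addr_t dma_handle)
{
    dma_free_coherent(dev, size, cpu_addr, dma_handle);
}
```

As the follow-up messages note, whether this is appropriate depends on why contiguity is wanted, and on 32-bit it spends scarce kernel address space.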
>> And is there any more intelligent way to give big chunks of system
>> memory to a PCI device for exclusive use?
>
> Perhaps dma_alloc_coherent/pci_alloc_consistent? These always return
> machine-contiguous memory. I'm not sure if their use in this way would be
> an abuse of the DMA API, though.

It's not an abuse, but it uses kernel address space unnecessarily on
32-bit (non-Xen) architectures. If I'm just giving a chunk of memory to
the device, I might as well allocate pages with GFP_HIGHUSER and save
scarce kernel address space.

As a Xen-specific change it might make sense to convert some of the
allocations from alloc_pages to dma_alloc_coherent, but I'm not likely to
make the change in the mainline driver.

 - R.
On 9/7/07 23:30, "Keir Fraser" <keir@xensource.com> wrote:

>> And is there any more intelligent way to give big chunks of system
>> memory to a PCI device for exclusive use?
>
> Perhaps dma_alloc_coherent/pci_alloc_consistent? These always return
> machine-contiguous memory. I'm not sure if their use in this way would be
> an abuse of the DMA API, though.

By which I mean: if you were using them instead of alloc_pages *only*
because they guarantee you machine-contiguous memory when running on Xen,
then that would probably be an abuse of the DMA API. OTOH, it may just be
pointing out places in the driver where pci_alloc_consistent would be a
very suitable API function to be using.

 -- Keir
Santos, Jose Renato G
2007-Jul-10 18:47 UTC
[Xen-devel] Question: Dynamic code in x86_64 Xen
(Sorry, hit the send button too soon in the last message.)

Keir,

Could you please help me understand some Xen code profile that I cannot
explain? When running netperf on x86_64 Xen, oprofile reports that 2% of
the PC samples are not recognized (i.e. they correspond to code outside
the ".text" section reported in the Xen image file). I added some
instrumentation to oprofile and observed that these PC samples are
located:
1) for dom0: in the Xen BSS section (more specifically in cpu0_stack)
2) for domU: outside any section specified in the Xen image file (i.e. in
dynamically allocated memory)

I suspect case 2 is also a stack for a different CPU which is dynamically
allocated (and used on the CPU that the guest is executing on), but I am
not sure...

Anyway, why would Xen execute any code from the stack? Is this expected
Xen behavior, or is this a bug somewhere (probably in xenoprofile)? I did
not see this behavior in the past when I was using x86_32 Xen. Could you
please shed some light on this...

Thanks
Renato
On 10/7/07 19:47, "Santos, Jose Renato G" <joserenato.santos@hp.com>
wrote:

> Could you please help me understand some Xen code profile that I cannot
> explain?
> When running netperf on x86_64 Xen, oprofile reports that 2% of the PC
> samples are not recognized (i.e. they correspond to code outside the
> ".text" section reported on xen image file). I added some
> instrumentation in oprofile and observed that these PC samples are
> located in:
> 1) for dom0: in Xen BSS section (more specifically on cpu0_stack)
> 2) for domU: outside any section specified in Xen image file (i.e.
> dynamically allocated memory)

Yes, these are executions on the stack. This is expected behaviour for
x86_64. The syscall instruction enters Xen via a stack trampoline. This
is because syscall does not switch %rsp for us. Hence using a stack
trampoline cunningly lets us compute %rsp from %rip.

If you see 2% of your samples in the syscall trampoline, this probably
indicates mainly that the processor is spending ~2% of its time doing a
syscall transition (and then the NMI occurs on the very first instruction
executed in Xen).

 -- Keir
Santos, Jose Renato G
2007-Jul-10 22:06 UTC
[Xen-devel] RE: Question: Dynamic code in x86_64 Xen
Thanks, Keir. That explains it... Glad to know this is not a xenoprof
bug :)

Regards
Renato
Hi,

I have been reading the VM-creation code, and I am confused about the
function setup_pgtables_x86_64(), which is called by xc_dom_boot_image().

Does setup_pgtables_x86_64() set up the kernel page tables for the guest
OS kernel -- that is, the page tables the guest OS will run with? If so,
why is this done here, and why is it not done in the guest OS itself?

One more question: does setup_pgtables_x86_64() set up PTEs for all of the
MFNs allocated to the VM, or not?

I am really confused about this; could you help me?

Thanks in advance