Hi Keir,

Does xen and/or the xen console depend on physical cpu 0 ?

I'm still trying to solve the mystery of my machine freezing when doing:

- videograbbing in a domU with a usb3 pci-express controller passed through (seems to cause quite a few interrupts)
- compiling a linux kernel with "make -j 6"

It's a 6 core AMD phenom x6.

Without cpu pinning:
I can freeze the machine easily within a minute after starting the compile, at first xen serial console also slows down under the load (slow updates).
When the machine freezes i can't do anything with xen serial console.

With cpu pinning:
By not using the pcpu 0 at all for any domain, and pinning the domain with the videograbber to its own pcpu (pcpu 5) it seems the machine keeps running after 20 "make -j6" iterations of kernel compilation.
Xen serial console stays responsive and doesn't slow down during the kernel compilation. The videograbber shows no problem grabbing video.

Name       ID  VCPU  CPU  State  Time(s)  CPU Affinity
Domain-0    0     0    3  r--     2169.7  1-4
Domain-0    0     1    1  -b-     2339.3  1-4
Domain-0    0     2    2  -b-     2358.9  1-4
Domain-0    0     3    3  -b-     2298.2  1-4
Domain-0    0     4    1  -b-     2221.9  1-4
Domain-0    0     5    4  -b-     2287.7  1-4
backup      9     0    4  -b-       10.6  1-4
database    1     0    4  -b-       45.3  1-4
davical     5     0    3  -b-        8.7  1-4
git         8     0    2  -b-        7.9  1-4
mail        2     0    4  -b-        8.0  1-4
samba       3     0    3  -b-       11.1  1-4
security    7     0    5  r--     1433.2  5
www         4     0    1  -b-       10.2  1-4
zabbix      6     0    3  -b-       21.2  1-4

Is there a way a deadlock could occur between hypervisor <-> dom0 <-> domU especially related to passthrough/interrupts in the context of pcpu 0 ?

--
Sander

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
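For anyone wanting to double-check a pinning setup like the one in the vcpu list above, here is a minimal Python sketch that reads such a listing and reports which domains are still allowed on a given pcpu. It is only an illustration: the sample rows are a subset of the table above, parse_affinity() and domains_allowed_on() are hypothetical helper names (not part of any Xen tool), and the "any cpu" string for unpinned vcpus is an assumption about the tool output.

```python
# Sketch only: flag domains whose vcpus are allowed to run on a given pCPU.
# SAMPLE is a subset of the table quoted above; parse_affinity() and
# domains_allowed_on() are hypothetical helper names, not part of any Xen tool.

SAMPLE = """\
Name       ID  VCPU  CPU  State  Time(s)  CPU Affinity
Domain-0    0     0    3  r--     2169.7  1-4
Domain-0    0     1    1  -b-     2339.3  1-4
database    1     0    4  -b-       45.3  1-4
security    7     0    5  r--     1433.2  5
"""


def parse_affinity(spec):
    """Expand an affinity string such as '1-4' or '0,2-3' into a set of ints."""
    cpus = set()
    for part in spec.split(","):
        if "-" in part:
            low, high = part.split("-")
            cpus.update(range(int(low), int(high) + 1))
        else:
            cpus.add(int(part))
    return cpus


def domains_allowed_on(text, pcpu):
    """Return sorted names of domains with at least one vcpu allowed on pcpu."""
    names = set()
    for line in text.splitlines()[1:]:      # skip the header row
        fields = line.split()
        if len(fields) < 7:
            continue                        # not a vcpu row
        name, affinity = fields[0], " ".join(fields[6:])
        # Assumption: an unpinned vcpu is reported as "any cpu".
        if affinity == "any cpu" or pcpu in parse_affinity(affinity):
            names.add(name)
    return sorted(names)


print(domains_allowed_on(SAMPLE, 0))   # domains that may still use pCPU 0
print(domains_allowed_on(SAMPLE, 5))   # domains that may use pCPU 5
```

With the pinning described above, the first call should come back empty, which matches the intent of keeping pcpu 0 free for the hypervisor.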
On Tue, Oct 12, 2010 at 06:28:13PM +0200, Sander Eikelenboom wrote:
> Hi Keir,
>
> Does xen and/or the xen console depend on physical cpu 0 ?

Usually the console for Dom0, and I think all other domains go
through CPU0. Let me CC Ian here, who has been mucking in this
area and found some bugs (and produced fixes).

Ian, that bug you found with not clearing the eventchannel - that
wouldn't have an impact here, right?

>
> I'm still trying to solve the mystery of my machine freezing when doing:
>
> - videograbbing in a domU with a usb3 pci-express controller passed through (seems to cause quite a few interrupts)
> - compiling a linux kernel with "make -j 6"
>
> It's a 6 core AMD phenom x6.
>
> Without cpu pinning:
> I can freeze the machine easily within a minute after starting the compile, at first xen serial console also slows down under the load (slow updates).
> When the machine freezes i can't do anything with xen serial console.
>
> With cpu pinning:
> By not using the pcpu 0 at all for any domain, and pinning the domain with the videograbber to its own pcpu (pcpu 5) it seems the machine keeps running after 20 "make -j6" iterations of kernel compilation.
> Xen serial console stays responsive and doesn't slow down during the kernel compilation. The videograbber shows no problem grabbing video.
>

AHA! So finally closer to the mystery.

Can you provide the /proc/interrupts of the Dom0?

I wonder if this is related to the issue I had some time ago, and never got
to look at. The problem was that during heavy compilation (this is a 2-socket
Nehalem box, just running Dom0 - no guests), the keyboard and USB driver would
stop getting interrupts. So the drivers would start polling which is quite slow,
albeit serviceable, and then at some point it would pick up again.

The weirdness was that the /proc/interrupts showed absolutely _no_ interrupts on CPU0
during that time - as if Xen just forgot to update them. Jeremy suggested I try to
disable Xen IRQ balance (noirqbalance on Xen command line) in case that is it, and to my
embarrassment I haven't tried that yet.

Did you try that? I think somebody suggested that but I can't recall whether it
was for this issue?

>
> Name       ID  VCPU  CPU  State  Time(s)  CPU Affinity
> Domain-0    0     0    3  r--     2169.7  1-4
> Domain-0    0     1    1  -b-     2339.3  1-4
> Domain-0    0     2    2  -b-     2358.9  1-4
> Domain-0    0     3    3  -b-     2298.2  1-4
> Domain-0    0     4    1  -b-     2221.9  1-4
> Domain-0    0     5    4  -b-     2287.7  1-4
> backup      9     0    4  -b-       10.6  1-4
> database    1     0    4  -b-       45.3  1-4
> davical     5     0    3  -b-        8.7  1-4
> git         8     0    2  -b-        7.9  1-4
> mail        2     0    4  -b-        8.0  1-4
> samba       3     0    3  -b-       11.1  1-4
> security    7     0    5  r--     1433.2  5
> www         4     0    1  -b-       10.2  1-4
> zabbix      6     0    3  -b-       21.2  1-4
>
>
> Is there a way a deadlock could occur between hypervisor <-> dom0 <-> domU especially related to passthrough/interrupts in the context of pcpu 0 ?

I don't know, but I do know that the IRQ handling in Xen 4.0 changed significantly compared
to 3.4. I don't remember if you ever ran this setup under 3.4?

>
> --
> Sander
On Tue, 2010-10-12 at 17:44 +0100, Konrad Rzeszutek Wilk wrote:
> On Tue, Oct 12, 2010 at 06:28:13PM +0200, Sander Eikelenboom wrote:
> > Hi Keir,
> >
> > Does xen and/or the xen console depend on physical cpu 0 ?
>
> Usually the console for Dom0, and I think all other domains go
> through CPU0. Let me CC Ian here, who has been mucking in this
> area and found some bugs (and produced fixes).
>
> Ian, that bug you found with not clearing the eventchannel - that
> wouldn't have an impact here, right?

I don't think so. That issue was related to evtchn delivery which is to
VCPUs not PCPUs. I don't think it was specific to VCPU0 either -- it
just happened that the particular evtchn was generally tied to VCPU0 by
default.

I don't think the problem would happen for PIRQs anyway since the
->startup method for that IRQ chip includes an explicit rebind of the
evtchn to a VCPU, it's only dynirqs which have the issue.

Ian.
Tuesday, October 12, 2010, 6:44:33 PM, you wrote:

> On Tue, Oct 12, 2010 at 06:28:13PM +0200, Sander Eikelenboom wrote:
>> Hi Keir,
>>
>> Does xen and/or the xen console depend on physical cpu 0 ?

> Usually the console for Dom0, and I think all other domains go
> through CPU0. Let me CC Ian here, who has been mucking in this
> area and found some bugs (and produced fixes).

> Ian, that bug you found with not clearing the eventchannel - that
> wouldn't have an impact here, right?

>>
>> I'm still trying to solve the mystery of my machine freezing when doing:
>>
>> - videograbbing in a domU with a usb3 pci-express controller passed through (seems to cause quite a few interrupts)
>> - compiling a linux kernel with "make -j 6"
>>
>> It's a 6 core AMD phenom x6.
>>
>> Without cpu pinning:
>> I can freeze the machine easily within a minute after starting the compile, at first xen serial console also slows down under the load (slow updates).
>> When the machine freezes i can't do anything with xen serial console.
>>
>> With cpu pinning:
>> By not using the pcpu 0 at all for any domain, and pinning the domain with the videograbber to its own pcpu (pcpu 5) it seems the machine keeps running after 20 "make -j6" iterations of kernel compilation.
>> Xen serial console stays responsive and doesn't slow down during the kernel compilation. The videograbber shows no problem grabbing video.
>>

> AHA! So finally closer to the mystery.

So i thought ... but although it survived 20 iterations of kernel compiling, it still froze while the dom0 was relatively idle and the domU was still grabbing video.
This time it gave the "RCU detected CPU stalls" message again, for cpu 0; since that is dom0, it should be vcpu0=pcpu1.
My xen serial console was frozen again, so i can't dump anything.

But:
- the hypervisor should still have pcpu0 available
- dom0 has pcpu1-4, although shared with some other mostly idle domains
- domU with videograbbing has pcpu5

So the cpu pinning seems to change things a bit, but only in the sense that it survives somewhat longer ...

Another thing i'm wondering about is that xentop reports that dom0 consumes about 50% cpu, while top inside dom0 shows nowhere near 50% when using the 2.6.31 pvops kernel.
With the latest 2.6.32-pvops there is a problem that events/0 consumes a lot of cpu related to xenconsoled (Jeremy already has a thread running on that). That's why i now tested 2.6.31-pvops, which hasn't got that issue.

> Can you provide the /proc/interrupts of the Dom0?

Just when running for some time, or try to get it under load / just before freeze ?

> I wonder if this is related to the issue I had some time ago, and never got
> to look at. The problem was that during heavy compilation (this is a 2-socket
> Nehalem box, just running Dom0 - no guests), the keyboard and USB driver would
> stop getting interrupts. So the drivers would start polling which is quite slow,
> albeit serviceable, and then at some point it would pick up again.

> The weirdness was that the /proc/interrupts showed absolutely _no_ interrupts on CPU0
> during that time - as if Xen just forgot to update them. Jeremy suggested I try to
> disable Xen IRQ balance (noirqbalance on Xen command line) in case that is it, and to my
> embarrassment I haven't tried that yet.

I did try that before, didn't seem to make a difference, but i will try again just to be sure.

> Did you try that? I think somebody suggested that but I can't recall whether it
> was for this issue?

>>
>> Name       ID  VCPU  CPU  State  Time(s)  CPU Affinity
>> Domain-0    0     0    3  r--     2169.7  1-4
>> Domain-0    0     1    1  -b-     2339.3  1-4
>> Domain-0    0     2    2  -b-     2358.9  1-4
>> Domain-0    0     3    3  -b-     2298.2  1-4
>> Domain-0    0     4    1  -b-     2221.9  1-4
>> Domain-0    0     5    4  -b-     2287.7  1-4
>> backup      9     0    4  -b-       10.6  1-4
>> database    1     0    4  -b-       45.3  1-4
>> davical     5     0    3  -b-        8.7  1-4
>> git         8     0    2  -b-        7.9  1-4
>> mail        2     0    4  -b-        8.0  1-4
>> samba       3     0    3  -b-       11.1  1-4
>> security    7     0    5  r--     1433.2  5
>> www         4     0    1  -b-       10.2  1-4
>> zabbix      6     0    3  -b-       21.2  1-4
>>
>>
>> Is there a way a deadlock could occur between hypervisor <-> dom0 <-> domU especially related to passthrough/interrupts in the context of pcpu 0 ?

> I don't know, but I do know that the IRQ handling in Xen 4.0 changed significantly compared
> to 3.4. I don't remember if you ever ran this setup under 3.4?

I tried xen 3.4-testing as well today (in combination with 2.6.31-pvops as dom0), but that resulted in the videograbbing domU going berserk: the xhci driver complains about "spurious interrupts" multiple times a second.

>>
>> --
>> Sander
Hi Konrad,

Here are the /proc/interrupts, without any cpu pinning, with normal interrupts i saw a pciback entry in /proc/interrupts in dom0, but with msi-x these seem to be missing ?

/proc/interrupts dom0 (2.6.31-pvops), running domU with videograbber active, little other activity in dom0

         CPU0    CPU1    CPU2    CPU3    CPU4    CPU5
   1:       2       0       0       0       0       0  xen-pirq-ioapic-edge   i8042
   8:       0       0       0       0       0       0  xen-pirq-ioapic-edge   rtc0
   9:       0       0       0       0       0       0  xen-pirq-ioapic-level  acpi
  12:       4       0       0       0       0       0  xen-pirq-ioapic-edge   i8042
  17:       2       0       0       0       0       0  xen-pirq-ioapic-level  ehci_hcd:usb1, ehci_hcd:usb2, ehci_hcd:usb3
  18:      33       0       0       0       0       0  xen-pirq-ioapic-level  ohci_hcd:usb4, ohci_hcd:usb5, ohci_hcd:usb6, ohci_hcd:usb7
  25:      18       0       0       0       0       0  xen-pirq-ioapic-level  HDA Intel
 903:      38       0       0       0       0       0  xen-dyn-event          vif9.0
 904:     912       0       0       0       0       0  xen-dyn-event          blkif-backend
 905:      18       0       0       0       0       0  xen-dyn-event          blkif-backend
 906:     409       0       0       0       0       0  xen-dyn-event          evtchn:xenconsoled
 907:     285       0       0       0       0       0  xen-dyn-event          evtchn:xenstored
 908:      12       0       0       0       0       0  xen-dyn-event          vif8.0
 909:    4882       0       0       0       0       0  xen-dyn-event          blkif-backend
 910:      19       0       0       0       0       0  xen-dyn-event          blkif-backend
 911:     465       0       0       0       0       0  xen-dyn-event          evtchn:xenconsoled
 912:     426       0       0       0       0       0  xen-dyn-event          evtchn:xenstored
 913:     252       0       0       0       0       0  xen-dyn-event          vif7.0
 916:     135       0       0       0       0       0  xen-dyn-event          blkif-backend
 917:    1822       0       0       0       0       0  xen-dyn-event          blkif-backend
 918:      25       0       0       0       0       0  xen-dyn-event          blkif-backend
 919:    1021       0       0       0       0       0  xen-dyn-event          pciback
 920:     947       0       0       0       0       0  xen-dyn-event          evtchn:xenconsoled
 921:     357       0       0       0       0       0  xen-dyn-event          evtchn:xenstored
 922:   61440       0       0       0       0       0  xen-dyn-event          vif6.0
 923:    3065       0       0       0       0       0  xen-dyn-event          blkif-backend
 924:      25       0       0       0       0       0  xen-dyn-event          blkif-backend
 925:     236       0       0       0       0       0  xen-dyn-event          evtchn:xenconsoled
 926:     262       0       0       0       0       0  xen-dyn-event          evtchn:xenstored
 927:      12       0       0       0       0       0  xen-dyn-event          vif5.0
 928:     932       0       0       0       0       0  xen-dyn-event          blkif-backend
 929:      19       0       0       0       0       0  xen-dyn-event          blkif-backend
 930:     272       0       0       0       0       0  xen-dyn-event          evtchn:xenconsoled
 931:     288       0       0       0       0       0  xen-dyn-event          evtchn:xenstored
 932:      59       0       0       0       0       0  xen-dyn-event          vif4.0
 933:    1263       0       0       0       0       0  xen-dyn-event          blkif-backend
 934:      23       0       0       0       0       0  xen-dyn-event          blkif-backend
 935:     282       0       0       0       0       0  xen-dyn-event          evtchn:xenconsoled
 936:     201       0       0       0       0       0  xen-dyn-event          vif3.0
 937:     286       0       0       0       0       0  xen-dyn-event          evtchn:xenstored
 938:    1082       0       0       0       0       0  xen-dyn-event          blkif-backend
 939:      19       0       0       0       0       0  xen-dyn-event          blkif-backend
 940:     301       0       0       0       0       0  xen-dyn-event          evtchn:xenconsoled
 941:      18       0       0       0       0       0  xen-dyn-event          vif2.0
 942:     810       0       0       0       0       0  xen-dyn-event          blkif-backend
 943:      19       0       0       0       0       0  xen-dyn-event          blkif-backend
 944:     280       0       0       0       0       0  xen-dyn-event          evtchn:xenstored
 945:     281       0       0       0       0       0  xen-dyn-event          evtchn:xenconsoled
 946:   42169       0       0       0       0       0  xen-dyn-event          vif1.0
 947:     279       0       0       0       0       0  xen-dyn-event          evtchn:xenstored
 948:   24824       0       0       0       0       0  xen-dyn-event          blkif-backend
 949:      19       0       0       0       0       0  xen-dyn-event          blkif-backend
 950:     285       0       0       0       0       0  xen-dyn-event          evtchn:xenconsoled
 951:     282       0       0       0       0       0  xen-dyn-event          evtchn:xenstored
 952:       0       0       0       0       0       0  xen-dyn-event          evtchn:xenstored
 953:    5922       0       0       0       0       0  xen-dyn-event          evtchn:xenstored
 954:   22988       0       0       0       0       0  xen-pirq-msi           eth1
 955:   21566       0       0       0       0       0  xen-pirq-msi           eth0
 956:       0       0       0       0       0       0  xen-pirq-msi           ahci
 957:   72163       0       0       0       0       0  xen-pirq-msi           ahci
 968:       0       0       0       0       0       0  xen-dyn-virq           pcpu
 969:   11384       0       0       0       0       0  xen-dyn-event          xenbus
 970:       0       0       0       0       0    1347  xen-dyn-ipi            callfuncsingle5
 971:       0       0       0       0       0       0  xen-dyn-virq           debug5
 972:       0       0       0       0       0     390  xen-dyn-ipi            callfunc5
 973:       0       0       0       0       0   10300  xen-dyn-ipi            resched5
 974:       0       0       0       0       0  137425  xen-dyn-virq           timer5
 975:       0       0       0       0    1504       0  xen-dyn-ipi            callfuncsingle4
 976:       0       0       0       0       0       0  xen-dyn-virq           debug4
 977:       0       0       0       0     394       0  xen-dyn-ipi            callfunc4
 978:       0       0       0       0   20872       0  xen-dyn-ipi            resched4
 979:       0       0       0       0  254028       0  xen-dyn-virq           timer4
 980:       0       0       0    1560       0       0  xen-dyn-ipi            callfuncsingle3
 981:       0       0       0       0       0       0  xen-dyn-virq           debug3
 982:       0       0       0     309       0       0  xen-dyn-ipi            callfunc3
 983:       0       0       0   23055       0       0  xen-dyn-ipi            resched3
 984:       0       0       0  348681       0       0  xen-dyn-virq           timer3
 985:       0       0    1252       0       0       0  xen-dyn-ipi            callfuncsingle2
 986:       0       0       0       0       0       0  xen-dyn-virq           debug2
 987:       0       0     415       0       0       0  xen-dyn-ipi            callfunc2
 988:       0       0   24141       0       0       0  xen-dyn-ipi            resched2
 989:       0       0  415948       0       0       0  xen-dyn-virq           timer2
 990:       0    1847       0       0       0       0  xen-dyn-ipi            callfuncsingle1
 991:       0       0       0       0       0       0  xen-dyn-virq           debug1
 992:       0     404       0       0       0       0  xen-dyn-ipi            callfunc1
 993:       0   22844       0       0       0       0  xen-dyn-ipi            resched1
 994:       0  484202       0       0       0       0  xen-dyn-virq           timer1
 995:    1148       0       0       0       0       0  xen-dyn-ipi            callfuncsingle0
 996:       0       0       0       0       0       0  xen-dyn-virq           debug0
 997:     276       0       0       0       0       0  xen-dyn-ipi            callfunc0
 998:   17435       0       0       0       0       0  xen-dyn-ipi            resched0
 999: 1296208       0       0       0       0       0  xen-dyn-virq           timer0
 NMI:       0       0       0       0       0       0  Non-maskable interrupts
 LOC:       0       0       0       0       0       0  Local timer interrupts
 SPU:       0       0       0       0       0       0  Spurious interrupts
 CNT:       0       0       0       0       0       0  Performance counter interrupts
 PND:       0       0       0       0       0       0  Performance pending work
 RES:   17435   22844   24141   23055   20872   10300  Rescheduling interrupts
 CAL:    1424    2251    1667    1869    1898    1737  Function call interrupts
 TLB:       0       0       0       0       0       0  TLB shootdowns
 TRM:       0       0       0       0       0       0  Thermal event interrupts
 MCE:       0       0       0       0       0       0  Machine check exceptions
 MCP:       5       5       5       5       5       5  Machine check polls
 ERR:       0
 MIS:       0

/proc/interrupts in videograbbing domU (2.6.36 pci-front0.7)

         CPU0
  44:       0  xen-pirq-pcifront        ohci_hcd:usb2
  45:       0  xen-pirq-pcifront        ohci_hcd:usb3
  46:       0  xen-pirq-pcifront        ehci_hcd:usb1
  86:       0  xen-pirq-pcifront-msi-x  xhci_hcd
  87:  147452  xen-pirq-pcifront-msi-x  xhci_hcd
 244:     461  xen-dyn-event            eth0
 245:     151  xen-dyn-event            blkif
 246:    2720  xen-dyn-event            blkif
 247:      30  xen-dyn-event            blkif
 248:     309  xen-dyn-event            hvc_console
 249:       2  xen-dyn-event            pcifront
 250:     603  xen-dyn-event            xenbus
 251:       0  xen-percpu-ipi           callfuncsingle0
 252:       0  xen-percpu-virq          debug0
 253:       0  xen-percpu-ipi           callfunc0
 254:       0  xen-percpu-ipi           resched0
 255:  193070  xen-percpu-virq          timer0
 NMI:       0  Non-maskable interrupts
 LOC:       0  Local timer interrupts
 SPU:       0  Spurious interrupts
 PMI:       0  Performance monitoring interrupts
 PND:       0  Performance pending work
 RES:       0  Rescheduling interrupts
 CAL:       0  Function call interrupts
 TLB:       0  TLB shootdowns
 MCE:       0  Machine check exceptions
 MCP:       0  Machine check polls
 ERR:       0
 MIS:       0

--
Best regards,
 Sander                            mailto:linux@eikelenboom.it
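One thing that stands out in the dom0 dump above is that every xen-pirq-* and xen-dyn-event line has counts on CPU0 only, while the per-cpu IPI and timer lines are spread over all six cpus. A rough Python sketch for summing that up from a /proc/interrupts style dump follows. The sample rows are copied from the dump above; per_cpu_totals() is a hypothetical helper name, and the parsing assumes the column layout shown here.

```python
# Sketch only: sum per-CPU interrupt counts from a /proc/interrupts style dump,
# optionally restricted to one IRQ chip prefix. per_cpu_totals() is a
# hypothetical helper; the rows below are copied from the dom0 dump above.

SAMPLE = """\
         CPU0    CPU1    CPU2    CPU3    CPU4    CPU5
 919:    1021       0       0       0       0       0  xen-dyn-event  pciback
 954:   22988       0       0       0       0       0  xen-pirq-msi   eth1
 957:   72163       0       0       0       0       0  xen-pirq-msi   ahci
 994:       0  484202       0       0       0       0  xen-dyn-virq   timer1
 999: 1296208       0       0       0       0       0  xen-dyn-virq   timer0
 RES:   17435   22844   24141   23055   20872   10300  Rescheduling interrupts
 ERR:       0
"""


def per_cpu_totals(text, chip_prefix=""):
    """Return a list with the summed count per CPU for numbered IRQ rows."""
    lines = text.splitlines()
    ncpu = len(lines[0].split())            # header row: CPU0 CPU1 ...
    totals = [0] * ncpu
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        if not label.strip().isdigit():
            continue                        # skip summary rows (RES, ERR, ...)
        fields = rest.split()
        if len(fields) <= ncpu:
            continue                        # need counts plus a chip name
        if not fields[ncpu].startswith(chip_prefix):
            continue
        for cpu, count in enumerate(fields[:ncpu]):
            totals[cpu] += int(count)
    return totals


print("pirq :", per_cpu_totals(SAMPLE, "xen-pirq"))
print("event:", per_cpu_totals(SAMPLE, "xen-dyn-event"))
print("all  :", per_cpu_totals(SAMPLE))
```

Running the same function over two dumps taken a few seconds apart and subtracting the totals would give a crude interrupt rate per cpu, which may be more useful than the absolute counts when trying to catch the state just before a freeze.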
By messing a bit with printk's and debug settings, a WARN_ON in the hypervisor is being triggered when starting the videograbbing domU:

mapping kernel into physical memory
about to get started...
(XEN) [2010-10-13 13:30:44] Xen WARN at msi.c:636
(XEN) [2010-10-13 13:30:44] ----[ Xen-4.1-unstable  x86_64  debug=y  Tainted:    C ]----
(XEN) [2010-10-13 13:30:44] CPU:    2
(XEN) [2010-10-13 13:30:44] RIP:    e008:[<ffff82c48015d797>] pci_enable_msi+0x48a/0x9d5
(XEN) [2010-10-13 13:30:44] RFLAGS: 0000000000010216   CONTEXT: hypervisor
(XEN) [2010-10-13 13:30:44] rax: 0000000000000004   rbx: 00000000fe5fe000   rcx: 0000000000000001
(XEN) [2010-10-13 13:30:44] rdx: 0000000000000004   rsi: 0000000000000282   rdi: ffff82c48024e940
(XEN) [2010-10-13 13:30:44] rbp: ffff830237e57dc8   rsp: ffff830237e57cf8   r8:  0000000000000009
(XEN) [2010-10-13 13:30:44] r9:  000000000000003a   r10: 0000000000000092   r11: 0000000000000213
(XEN) [2010-10-13 13:30:44] r12: 0000000000000000   r13: ffff830237e57ea8   r14: ffff83020211ed10
(XEN) [2010-10-13 13:30:44] r15: 0000000000000008   cr0: 000000008005003b   cr4: 00000000000006f0
(XEN) [2010-10-13 13:30:44] cr3: 0000000225f0e000   cr2: ffff880004e93d68
(XEN) [2010-10-13 13:30:44] ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: e010   cs: e008
(XEN) [2010-10-13 13:30:44] Xen stack trace from rsp=ffff830237e57cf8:
(XEN) [2010-10-13 13:30:44]    ffff830237e57d38 ffff82c480126b66 ffff830237e57e18 0700000000000010
(XEN) [2010-10-13 13:30:44]    0000000000001000 0000000000000030 00000000fe5ff000 00000000fe5ff000
(XEN) [2010-10-13 13:30:44]    0000009000077d68 ffff83014601ad10 0000000700000246 0000000000000000
(XEN) [2010-10-13 13:30:44]    0000000700000092 0000000000000000 ffff83020211eda8 00000000000fe5ff
(XEN) [2010-10-13 13:30:44]    00000000000fe5ff ffff8301622fde28 0000000000000202 ffff830237e57da8
(XEN) [2010-10-13 13:30:44]    ffff82c480120680 ffff830237e57ea8 00000000ffffffed ffff830146a24000
(XEN) [2010-10-13 13:30:44]    0000000000000057 0000000000000048 ffff830237e57e48 ffff82c48015f16e
(XEN) [2010-10-13 13:30:44]    0000000025dfc910 000000000000015c 0000000000000048 0000000000000120
(XEN) [2010-10-13 13:30:44]    ffff830237e82480 0000000000000282 ffff83020211ed10 ffff830237e57e28
(XEN) [2010-10-13 13:30:44]    ffff82c480120680 ffff88002df4bb30 0000000000000057 ffff830146a24000
(XEN) [2010-10-13 13:30:44]    0000000000000048 ffff830146a24190 ffff830237e57ef8 ffff82c480172806
(XEN) [2010-10-13 13:30:44]    0000000180196b1a ffff830237e5a020 ffff830200000004 ffff830237e57ea8
(XEN) [2010-10-13 13:30:44]    000000000000000b ffffffffffffffff 0000000000000007 0000000000000000
(XEN) [2010-10-13 13:30:44]    00000000fe5fe000 aaaaaaaaaaaaaaaa 0000000000000007 0000000000000048
(XEN) [2010-10-13 13:30:44]    00000000fe5fe000 0000000000000000 0000000000000246 ffff8300c7e88000
(XEN) [2010-10-13 13:30:44]    000000000000000b ffff8800278c4400 0000000000000011 ffff88002ffea700
(XEN) [2010-10-13 13:30:44]    00007cfdc81a80c7 ffff82c480202a82 ffffffff8100942a 0000000000000021
(XEN) [2010-10-13 13:30:44]    ffff88002ffea700 0000000000000011 ffff8800278c4400 000000000000000b
(XEN) [2010-10-13 13:30:44]    ffff88002df4bbd0 00000000000006a1 0000000000000213 ffff88002fc20200
(XEN) [2010-10-13 13:30:44]    ffffffff810df6ea 0000000000000011 0000000000000021 ffffffff8100942a
(XEN) [2010-10-13 13:30:44] Xen call trace:
(XEN) [2010-10-13 13:30:44]    [<ffff82c48015d797>] pci_enable_msi+0x48a/0x9d5
(XEN) [2010-10-13 13:30:44]    [<ffff82c48015f16e>] map_domain_pirq+0x275/0x363
(XEN) [2010-10-13 13:30:44]    [<ffff82c480172806>] do_physdev_op+0x826/0x10b0
(XEN) [2010-10-13 13:30:44]    [<ffff82c480202a82>] syscall_enter+0xf2/0x14c
(XEN) [2010-10-13 13:30:44]
(XEN) [2010-10-13 13:30:44] SEIK bus: 7 slot: 0 func:0 msi->table_base: fe5fe000 read_pci_mem_bar: 4
(XEN) [2010-10-13 13:30:44] SEIK pba_paddr: 4

it's this one:

    WARN_ON(msi->table_base != read_pci_mem_bar(bus, slot, func, bir));

I have added some printk's .. and read_pci_mem_bar seems to return a bogus value .. the pba_paddr is used later in the function, but i can't tell if and when this could have implications.

This also occurs when disabling the pci=resource_alignment option on the kernel line.

lspci on dom0 shows:

07:00.0 USB Controller: NEC Corporation Device 0194 (rev 03) (prog-if 30)
        Subsystem: ASUSTeK Computer Inc. Device 8413
        Flags: bus master, fast devsel, latency 0, IRQ 33
        Memory at fe5fe000 (64-bit, non-prefetchable) [size=8K]
        Capabilities: [50] Power Management version 3
        Capabilities: [70] Message Signalled Interrupts: Mask- 64bit+ Queue=0/3 Enable-
        Capabilities: [90] MSI-X: Enable+ Mask- TabSize=8
        Capabilities: [a0] Express Endpoint, MSI 00
        Capabilities: [100] Advanced Error Reporting <?>
        Capabilities: [140] Device Serial Number ff-ff-ff-ff-ff-ff-ff-ff
        Capabilities: [150] #18
        Kernel driver in use: pciback

In the same function it seems to trigger

    if ( d )
    {
        /* XXX How to deal with existing mappings? */
    }

Which seems to be a bit odd for a freshly booted system with no domU restarts ?

grub menu.lst:

title           xen-4.1-unstable.gz / Debian GNU/Linux, 2.6.32.23-xen-next-2.6.32.x-generaldebug-20101002
root            (hd0,0)
kernel          /xen-4.1-unstable.gz dom0_mem=768M loglvl=all loglvl_guest=all com1=115200,8n1 sync_console console_to_ring console_timestamps console=vga,com1 iommu=off debug lapic=debug apic_verbosity=debug apic=debug noirqbalance
module          /vmlinuz-2.6.32.24-xen-next-2.6.32.x-tracing-20101013 root=/dev/mapper/serveerstertje-root ro earlyprintk=xen max_loop=255 loop_max_part=63 libata.noacpi=1 debug loglevel=10 noirqbalance irqbalance=off iommu=soft xen-pciback.hide=(03:06.0)(07:00.0)(09:01.0)(09:01.1)(09:01.2) pci=resource_alignment=03:06.0;07:00.0;09:01.0;09:01.1;09:01.2;
module          /initrd.img-2.6.32.24-xen-next-2.6.32.x-tracing-20101013

--
Sander
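The thread does not establish why read_pci_mem_bar printed 4 in the WARN_ON report above. Purely as an illustration of the standard PCI memory BAR bit layout (bit 0 selects I/O vs memory, bits 2:1 give the type, bit 3 is the prefetchable flag, the remaining bits hold the address), here is a small Python sketch that decodes a raw BAR dword. decode_mem_bar() is a hypothetical helper, and reading the printed 4 as a raw BAR value is an assumption, not something the log confirms. Under that assumption, 0x4 would be the type bits of a 64-bit memory BAR with an all-zero address, whereas the lspci line (64-bit, non-prefetchable, at fe5fe000) corresponds to a raw value of 0xfe5fe004.

```python
# Sketch only: decode the low dword of a PCI memory BAR (plus the upper dword
# for 64-bit BARs). decode_mem_bar() is a hypothetical helper, not Xen code.

def decode_mem_bar(low, high=0):
    """Split a raw memory BAR into address, 64-bit flag and prefetchable flag."""
    if low & 0x1:
        raise ValueError("bit 0 set: this is an I/O BAR, not a memory BAR")
    is_64bit = ((low >> 1) & 0x3) == 0x2    # type field: 00 = 32-bit, 10 = 64-bit
    address = low & 0xFFFFFFF0              # low 4 bits are flags, not address
    if is_64bit:
        address |= (high & 0xFFFFFFFF) << 32
    return {
        "address": address,
        "is_64bit": is_64bit,
        "prefetchable": bool(low & 0x8),
    }


# 0xFE5FE004 is what the lspci line would look like as a raw BAR;
# 0x00000004 is the value from the debug printk, read as a raw BAR (assumption).
for raw in (0xFE5FE004, 0x00000004):
    info = decode_mem_bar(raw)
    print(hex(raw), "->", hex(info["address"]), info["is_64bit"], info["prefetchable"])
```

If that reading is right, it would point at the BAR address bits being zero at the time of the read rather than at a decoding problem, but that is speculation until someone checks the config space of 07:00.0 at that moment.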
This code was changed in changeset "x86: protect MSI-X table and pending bit array from guest writes" 22182:68cc3c514a0a Besides ... returning a bogus address in this piece of code: if ( !dev->domain || !paging_mode_translate(dev->domain) ) { struct domain *d = dev->domain; if ( !d ) for_each_domain(d) if ( !paging_mode_translate(d) && (iomem_access_permitted(d, dev->msix_table.first, dev->msix_table.last) || iomem_access_permitted(d, dev->msix_pba.first, dev->msix_pba.last)) ) break; if ( d ) { /* XXX How to deal with existing mappings? */ printk("SEIK: err what am i doing here ?? d=%d \n",d->domain_id); } } On a freshly booted machine, d seems to be 0 ... that would mean the ( !d ) code path will never be followed since all devices will belong to dom0 at first ? -- Sander Wednesday, October 13, 2010, 3:36:41 PM, you wrote:> By messing a bit with printk''s and debug settings a warn_on in the hypervisor is being triggered when starting the videograbbing domU:> mapping kernel into physical memory > about to get started... 
> (XEN) [2010-10-13 13:30:44] Xen WARN at msi.c:636 > (XEN) [2010-10-13 13:30:44] ----[ Xen-4.1-unstable x86_64 debug=y Tainted: C ]---- > (XEN) [2010-10-13 13:30:44] CPU: 2 > (XEN) [2010-10-13 13:30:44] RIP: e008:[<ffff82c48015d797>] pci_enable_msi+0x48a/0x9d5 > (XEN) [2010-10-13 13:30:44] RFLAGS: 0000000000010216 CONTEXT: hypervisor > (XEN) [2010-10-13 13:30:44] rax: 0000000000000004 rbx: 00000000fe5fe000 rcx: 0000000000000001 > (XEN) [2010-10-13 13:30:44] rdx: 0000000000000004 rsi: 0000000000000282 rdi: ffff82c48024e940 > (XEN) [2010-10-13 13:30:44] rbp: ffff830237e57dc8 rsp: ffff830237e57cf8 r8: 0000000000000009 > (XEN) [2010-10-13 13:30:44] r9: 000000000000003a r10: 0000000000000092 r11: 0000000000000213 > (XEN) [2010-10-13 13:30:44] r12: 0000000000000000 r13: ffff830237e57ea8 r14: ffff83020211ed10 > (XEN) [2010-10-13 13:30:44] r15: 0000000000000008 cr0: 000000008005003b cr4: 00000000000006f0 > (XEN) [2010-10-13 13:30:44] cr3: 0000000225f0e000 cr2: ffff880004e93d68 > (XEN) [2010-10-13 13:30:44] ds: 0000 es: 0000 fs: 0000 gs: 0000 ss: e010 cs: e008 > (XEN) [2010-10-13 13:30:44] Xen stack trace from rsp=ffff830237e57cf8: > (XEN) [2010-10-13 13:30:44] ffff830237e57d38 ffff82c480126b66 ffff830237e57e18 0700000000000010 > (XEN) [2010-10-13 13:30:44] 0000000000001000 0000000000000030 00000000fe5ff000 00000000fe5ff000 > (XEN) [2010-10-13 13:30:44] 0000009000077d68 ffff83014601ad10 0000000700000246 0000000000000000 > (XEN) [2010-10-13 13:30:44] 0000000700000092 0000000000000000 ffff83020211eda8 00000000000fe5ff > (XEN) [2010-10-13 13:30:44] 00000000000fe5ff ffff8301622fde28 0000000000000202 ffff830237e57da8 > (XEN) [2010-10-13 13:30:44] ffff82c480120680 ffff830237e57ea8 00000000ffffffed ffff830146a24000 > (XEN) [2010-10-13 13:30:44] 0000000000000057 0000000000000048 ffff830237e57e48 ffff82c48015f16e > (XEN) [2010-10-13 13:30:44] 0000000025dfc910 000000000000015c 0000000000000048 0000000000000120 > (XEN) [2010-10-13 13:30:44] ffff830237e82480 0000000000000282 
ffff83020211ed10 ffff830237e57e28 > (XEN) [2010-10-13 13:30:44] ffff82c480120680 ffff88002df4bb30 0000000000000057 ffff830146a24000 > (XEN) [2010-10-13 13:30:44] 0000000000000048 ffff830146a24190 ffff830237e57ef8 ffff82c480172806 > (XEN) [2010-10-13 13:30:44] 0000000180196b1a ffff830237e5a020 ffff830200000004 ffff830237e57ea8 > (XEN) [2010-10-13 13:30:44] 000000000000000b ffffffffffffffff 0000000000000007 0000000000000000 > (XEN) [2010-10-13 13:30:44] 00000000fe5fe000 aaaaaaaaaaaaaaaa 0000000000000007 0000000000000048 > (XEN) [2010-10-13 13:30:44] 00000000fe5fe000 0000000000000000 0000000000000246 ffff8300c7e88000 > (XEN) [2010-10-13 13:30:44] 000000000000000b ffff8800278c4400 0000000000000011 ffff88002ffea700 > (XEN) [2010-10-13 13:30:44] 00007cfdc81a80c7 ffff82c480202a82 ffffffff8100942a 0000000000000021 > (XEN) [2010-10-13 13:30:44] ffff88002ffea700 0000000000000011 ffff8800278c4400 000000000000000b > (XEN) [2010-10-13 13:30:44] ffff88002df4bbd0 00000000000006a1 0000000000000213 ffff88002fc20200 > (XEN) [2010-10-13 13:30:44] ffffffff810df6ea 0000000000000011 0000000000000021 ffffffff8100942a > (XEN) [2010-10-13 13:30:44] Xen call trace: > (XEN) [2010-10-13 13:30:44] [<ffff82c48015d797>] pci_enable_msi+0x48a/0x9d5 > (XEN) [2010-10-13 13:30:44] [<ffff82c48015f16e>] map_domain_pirq+0x275/0x363 > (XEN) [2010-10-13 13:30:44] [<ffff82c480172806>] do_physdev_op+0x826/0x10b0 > (XEN) [2010-10-13 13:30:44] [<ffff82c480202a82>] syscall_enter+0xf2/0x14c > (XEN) [2010-10-13 13:30:44] > (XEN) [2010-10-13 13:30:44] SEIK bus: 7 slot: 0 func:0 msi->table_base: fe5fe000 read_pci_mem_bar: 4 > (XEN) [2010-10-13 13:30:44] SEIK pba_paddr: 4> it''s this one:WARN_ON(msi->>table_base != read_pci_mem_bar(bus, slot, func, bir));> I have added some printk''s .. and read_pci_mem_bar seems to return a bogus value .. the pba_addr is used later in the function, but i can''t oversee if and when this could have implications. 
> This also occurs when disabling the pci_resource_align on the kernel line.
>
> lspci on dom0 shows:
>
> 07:00.0 USB Controller: NEC Corporation Device 0194 (rev 03) (prog-if 30)
>         Subsystem: ASUSTeK Computer Inc. Device 8413
>         Flags: bus master, fast devsel, latency 0, IRQ 33
>         Memory at fe5fe000 (64-bit, non-prefetchable) [size=8K]
>         Capabilities: [50] Power Management version 3
>         Capabilities: [70] Message Signalled Interrupts: Mask- 64bit+ Queue=0/3 Enable-
>         Capabilities: [90] MSI-X: Enable+ Mask- TabSize=8
>         Capabilities: [a0] Express Endpoint, MSI 00
>         Capabilities: [100] Advanced Error Reporting <?>
>         Capabilities: [140] Device Serial Number ff-ff-ff-ff-ff-ff-ff-ff
>         Capabilities: [150] #18
>         Kernel driver in use: pciback
>
> In the same function it seems to trigger
>     if ( d )
>     {
>         /* XXX How to deal with existing mappings? */
>     }
>
> Which seems to be a bit odd for a freshly booted system with no domU restarts ?
>
> grub menu.lst:
>
> title xen-4.1-unstable.gz / Debian GNU/Linux, 2.6.32.23-xen-next-2.6.32.x-generaldebug-20101002
> root (hd0,0)
> kernel /xen-4.1-unstable.gz dom0_mem=768M loglvl=all loglvl_guest=all com1=115200,8n1 sync_console console_to_ring console_timestamps console=vga,com1 iommu=off debug lapic=debug apic_verbosity=debug apic=debug noirqbalance
> module /vmlinuz-2.6.32.24-xen-next-2.6.32.x-tracing-20101013 root=/dev/mapper/serveerstertje-root ro earlyprintk=xen max_loop=255 loop_max_part=63 libata.noacpi=1 debug loglevel=10 noirqbalance irqbalance=off iommu=soft xen-pciback.hide=(03:06.0)(07:00.0)(09:01.0)(09:01.1)(09:01.2) pci=resource_alignment=03:06.0;07:00.0;09:01.0;09:01.1;09:01.2;
> module /initrd.img-2.6.32.24-xen-next-2.6.32.x-tracing-20101013
>
> --
> Sander
>
> Tuesday, October 12, 2010, 6:44:33 PM, you wrote:
>
>> On Tue, Oct 12, 2010 at 06:28:13PM +0200, Sander Eikelenboom wrote:
>>> Hi Keir,
>>>
>>> Does xen and/or the xen console depend on physical cpu 0 ?
>
>> Usually the console for Dom0, and I think all other domains go
>> through CPU0. Let me CC Ian here, who has been mucking in this
>> area and found some bugs (and produced fixes).
>
>> Ian, that bug you found with not clearing the eventchannel - that
>> wouldn't have an impact here, right?
>
>>>
>>> I'm still trying to solve the mystery of my machine freezing when doing:
>>>
>>> - videograbbing in a domU with a usb3 pci-express controller passed through (seems to cause quite a few interrupts)
>>> - compiling a linux kernel with "make -j 6"
>>>
>>> It's a 6 core AMD phenom x6.
>>>
>>> Without cpu pinning:
>>> I can freeze the machine easily within a minute after starting the compile, at first xen serial console also slows down under the load (slow updates).
>>> When the machine freezes i can't do anything with xen serial console.
>>>
>>> With cpu pinning:
>>> By not using the pcpu 0 at all for any domain, and pinning the domain with the videograbber to its own pcpu (pcpu 5) it seems the machine keeps running after 20 "make -j6" iterations of kernel compilation.
>>> Xen serial console stays responsive and doesn't slow down during the kernel compilation. The videograbber shows no problem grabbing video.
>>>
>
>> AHA! So finally closer to the mystery.
>
>> Can you provide the /proc/interrupts of the Dom0?
>
>> I wonder if this is related to the issue I had some time ago, and never got
>> to look at. The problem was that during heavy compilation (this is a 2 Nehalem
>> socket box, just running Dom0 - no guests), the keyboard and USB driver would
>> stop getting interrupts. So the drivers would start polling which is quite slow,
>> albeit serviceable, and then at some point it would pick up again.
>
>> The weirdness was that the /proc/interrupts showed absolutely _no_ interrupts on CPU0
>> during that time - as if Xen just forgot to update them. Jeremy suggested I try to
>> disable Xen IRQ balance (noirqbalance on Xen command line) in case that is it, and to my
>> embarrassment I haven't tried that yet.
>
>> Did you try that? I think somebody suggested that but I can't recall whether it
>> was for this issue?
>
>>>
>>> Name        ID  VCPU  CPU  State  Time(s)  CPU Affinity
>>> Domain-0     0     0    3  r--     2169.7  1-4
>>> Domain-0     0     1    1  -b-     2339.3  1-4
>>> Domain-0     0     2    2  -b-     2358.9  1-4
>>> Domain-0     0     3    3  -b-     2298.2  1-4
>>> Domain-0     0     4    1  -b-     2221.9  1-4
>>> Domain-0     0     5    4  -b-     2287.7  1-4
>>> backup       9     0    4  -b-       10.6  1-4
>>> database     1     0    4  -b-       45.3  1-4
>>> davical      5     0    3  -b-        8.7  1-4
>>> git          8     0    2  -b-        7.9  1-4
>>> mail         2     0    4  -b-        8.0  1-4
>>> samba        3     0    3  -b-       11.1  1-4
>>> security     7     0    5  r--     1433.2  5
>>> www          4     0    1  -b-       10.2  1-4
>>> zabbix       6     0    3  -b-       21.2  1-4
>>>
>>>
>>> Is there a way a deadlock could occur between hypervisor <-> dom0 <-> domU especially related to passthrough/interrupts in the context of pcpu 0 ?
>
>> I don't know, but I do know that the IRQ handling in Xen 4.0 changed significantly compared
>> to 3.4. I don't remember if you ever ran this setup under 3.4?
>
>>>
>>> --
>>> Sander

--
Best regards,
 Sander                            mailto:linux@eikelenboom.it
>>> On 13.10.10 at 15:36, Sander Eikelenboom <linux@eikelenboom.it> wrote:
> it's this one:
> WARN_ON(msi->table_base != read_pci_mem_bar(bus, slot, func, bir));

Yeah, read_pci_mem_bar() uses an inverted mask in two places.
Would you remove the ~ from the two uses of PCI_BASE_ADDRESS_MEM_MASK
in that function and try again?

(Yunhong, you had tested the patch that introduced this, and this
warning would basically trigger unconditionally as it stands. Didn't
you notice that in your logs?)

The main thing however, if I correctly remember the context of this
thread, is that this code was only recently introduced and doesn't
exist in the 4.0 tree, so your original problem is unlikely caused by it.

> I have added some printk's .. and read_pci_mem_bar seems to return a bogus
> value .. the pba_addr is used later in the function, but i can't oversee if
> and when this could have implications.
> This also occurs when disabling the pci_resource_align on the kernel line.
> In the same function it seems to trigger
>     if ( d )
>     {
>         /* XXX How to deal with existing mappings? */
>     }
>
> Which seems to be a bit odd for a freshly booted system with no domU
> restarts ?

No, the comment refers to potentially existing mappings (which
would need to be actively searched for). It doesn't mean there have
to be any.

Jan
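
[Editor's illustration, not part of the original mail.] The effect of the stray ~ can be shown with a few lines of C. This is only a sketch of the masking step, not the actual read_pci_mem_bar() from Xen's msi.c: the two helper names are hypothetical, the mask definition is the one from Linux's pci_regs.h (assumed to match the Xen header), and the raw BAR value 0xfe5fe004 is inferred from the lspci output above (memory BAR at fe5fe000, 64-bit, non-prefetchable), not read from the device.

```c
#include <stdint.h>

/* Definition as in Linux's pci_regs.h; assumed identical in the Xen tree. */
#define PCI_BASE_ADDRESS_MEM_MASK (~0x0fUL)

/* Intended behaviour: keep the address bits, drop the four low flag bits. */
static uint64_t bar_to_addr(uint32_t bar)
{
    return bar & PCI_BASE_ADDRESS_MEM_MASK;
}

/* Behaviour with the extra ~: only the four low flag bits survive. */
static uint64_t bar_to_addr_inverted(uint32_t bar)
{
    return bar & ~PCI_BASE_ADDRESS_MEM_MASK;
}
```

For the assumed raw value 0xfe5fe004, bar_to_addr() yields 0xfe5fe000, which equals the msi->table_base printed in the log, while bar_to_addr_inverted() yields 0x4, which is consistent with the "read_pci_mem_bar: 4" and "pba_paddr: 4" debug lines. The real function also has to combine the upper half of a 64-bit BAR, which this sketch leaves out.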
>>> On 13.10.10 at 16:26, Sander Eikelenboom <linux@eikelenboom.it> wrote:
> Besides ... returning a bogus address in this piece of code:
>
>     if ( !dev->domain || !paging_mode_translate(dev->domain) )
>     {
>         struct domain *d = dev->domain;
>
>         if ( !d )
>             for_each_domain(d)
>                 if ( !paging_mode_translate(d) &&
>                      (iomem_access_permitted(d, dev->msix_table.first,
>                                              dev->msix_table.last) ||
>                       iomem_access_permitted(d, dev->msix_pba.first,
>                                              dev->msix_pba.last)) )
>                     break;
>         if ( d )
>         {
>             /* XXX How to deal with existing mappings? */
>             printk("SEIK: err what am i doing here ?? d=%d \n",d->domain_id);
>
>         }
>     }
>
> On a freshly booted machine, d seems to be 0 ... that would mean the ( !d )
> code path will never be followed since all devices will belong to dom0 at
> first ?

Not sure what you're trying to say. This code path will not only
get executed on a freshly booted system, but whenever MSI-X
gets first enabled for a device after it having been disabled
(perhaps because of it getting assigned to a guest). And for the
moment Dom0 is still considered an exception (i.e. may map this
space writable), so on initial boot it doesn't matter whether the
device is considered un-owned or owned by Dom0.

Jan
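
[Editor's illustration, not part of the original mail.] The control flow of the quoted fragment can be modelled in a small, self-contained form. Everything here is a hypothetical stand-in: the struct layouts, the boolean fields replacing paging_mode_translate() and iomem_access_permitted(), and the function name msix_candidate_owner() are assumptions made for the sketch and do not exist in Xen under these names.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for Xen's structures. */
struct domain {
    int  domain_id;
    bool translated;   /* models paging_mode_translate(d) */
    bool iomem_table;  /* models iomem_access_permitted() on the MSI-X table range */
    bool iomem_pba;    /* models iomem_access_permitted() on the PBA range */
};

struct pci_dev {
    struct domain *domain;  /* recorded owner, NULL if un-owned */
};

/*
 * Returns the domain that the "if ( d )" body of the quoted fragment would
 * see, or NULL if that body would be skipped.
 */
static struct domain *msix_candidate_owner(const struct pci_dev *dev,
                                           struct domain *domains, size_t n)
{
    struct domain *d;
    size_t i;

    /* Outer condition: only un-owned devices or non-translated owners. */
    if (dev->domain && dev->domain->translated)
        return NULL;

    d = dev->domain;
    if (!d) {
        /* The scan runs only when no owner is recorded for the device. */
        for (i = 0; i < n; i++) {
            if (!domains[i].translated &&
                (domains[i].iomem_table || domains[i].iomem_pba)) {
                d = &domains[i];
                break;
            }
        }
    }
    return d;
}
```

In this model, a device whose recorded owner is Dom0 reaches the "if ( d )" body with d pointing at Dom0 without ever running the scan, which matches the d=0 output of Sander's printk. The scan is only a fallback for devices with no recorded owner.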
>-----Original Message-----
>From: Jan Beulich [mailto:JBeulich@novell.com]
>Sent: Wednesday, October 13, 2010 10:27 PM
>To: Sander Eikelenboom
>Cc: Ian; Keir Fraser; Jeremy Fitzhardinge; Jiang, Yunhong;
>xen-devel@lists.xensource.com; Konrad Rzeszutek Wilk
>Subject: [Xen-devel] Re: xen dependant on pcpu 0 ?
>
>>>> On 13.10.10 at 15:36, Sander Eikelenboom <linux@eikelenboom.it> wrote:
>> it's this one:
>> WARN_ON(msi->table_base != read_pci_mem_bar(bus, slot, func, bir));
>
>Yeah, read_pci_mem_bar() uses an inverted mask in two places.
>Would you remove the ~ from the two uses of PCI_BASE_ADDRESS_MEM_MASK
>in that function and try again?
>
>(Yunhong, you had tested the patch that introduced this, and this
>warning would basically trigger unconditionally as it stands. Didn't
>you notice that in your logs?)

A bit amazing to me, but I do remember I didn't notice such log.
And seems with this bug, the patch itself should not work at all, since the PBA_addr is not correct, but I do remember with attached test module, and your patch, the write_vector() will cause fault.

--jyh

>
>The main thing however, if I correctly remember the context of this
>thread, is that this code was only recently introduced and doesn't
>exist in the 4.0 tree, so your original problem is unlikely caused by it.
>
>> I have added some printk's .. and read_pci_mem_bar seems to return a bogus
>> value .. the pba_addr is used later in the function, but i can't oversee if
>> and when this could have implications.
>> This also occurs when disabling the pci_resource_align on the kernel line.
>> In the same function it seems to trigger
>>     if ( d )
>>     {
>>         /* XXX How to deal with existing mappings? */
>>     }
>>
>> Which seems to be a bit odd for a freshly booted system with no domU
>> restarts ?
>
>No, the comment refers to potentially existing mappings (which
>would need to be actively searched for). It doesn't mean there have
>to be any.
>
>Jan
>>> On 13.10.10 at 17:00, "Jiang, Yunhong" <yunhong.jiang@intel.com> wrote:
>>From: Jan Beulich [mailto:JBeulich@novell.com]
>>(Yunhong, you had tested the patch that introduced this, and this
>>warning would basically trigger unconditionally as it stands. Didn't
>>you notice that in your logs?)
>
> A bit amazing to me, but I do remember I didn't notice such log.
> And seems with this bug, the patch itself should not work at all, since the
> PBA_addr is not correct, but I do remember with attached test module, and
> your patch, the write_vector() will cause fault.

Probably MSI-X table and PBA share a page for the device you
tested with? In that case, the code would still have worked as
is afaict.

Jan
Wednesday, October 13, 2010, 5:26:27 PM, you wrote:

>>>> On 13.10.10 at 17:03, Sander Eikelenboom <linux@eikelenboom.it> wrote:
>> Err yes i'm nor a kernel nor a xen hacker so i'm just trying not to speak
>> complete gibberish :-)
>> Well since the device when seized by pciback on boot, seems to be assigned
>> to dom0 and therefore d=0, the
>>             for_each_domain(d)
>>                 if ( !paging_mode_translate(d) &&
>>                      (iomem_access_permitted(d, dev->msix_table.first,
>>                                              dev->msix_table.last) ||
>>                       iomem_access_permitted(d, dev->msix_pba.first,
>>                                              dev->msix_pba.last)) )
>>                     break;
>>
>> part seems never to be run, because a device seems to always be assigned to
>> a domain.

> That code fragment sits inside a if (!d), i.e. if we can easily tell
> (by just looking at dev->domain) which domain owns the device.

>> So if it seems to be never run ... why is it there ?

> You're probably more after the subsequent if (d) with the comment
> somewhat confusing you in its body - again, the function gets
> executed (or is supposed to) when a domain enables MSI-X on the
> device. At that point, dev->domain should be non-NULL (and
> different from dom0), so the body (if there really was one) would
> get executed.

Thx for your patience .. just one more time ...

I saw a mistake in my explanation, i didn't mean d=0, but in my case (fresh boot, first time domain with passthrough is started) d is not NULL and d->domain_id = 0
So it seems it thinks it's still assigned to dom0 when the MSI-X gets enabled ?

But this all does get triggered when the domU is started to which the domain is passed through, and yes it enables MSI-X (when i look at lspci or /proc/interrupts in the domU)
but d->domain_id results in "0" and not in the domain id of domU.
So if in this case the code in ( !d ) should have been run, it didn't (have put a printk there to be sure)

You were right that it didn't fix my freeze problem, although the RCU detected CPU stall was now followed by the beginning of a trace although it doesn't provide much more info.
I attached a photo of it.

--
Sander

> Jan

--
Best regards,
 Sander                            mailto:linux@eikelenboom.it
>>> On 13.10.10 at 17:41, Sander Eikelenboom <linux@eikelenboom.it> wrote:
> I saw a mistake in my explanation, i didn't mean d=0, but in my case (fresh
> boot, first time domain with passthrough is started) d is not NULL and
> d->domain_id = 0
> So it seems it thinks it's still assigned to dom0 when the MSI-X gets enabled
> ?

That would be bad indeed, but would indicate a problem elsewhere.

> But this all does get triggered when the domU is started to which the domain
> is passed through, and yes it enables MSI-X (when i look at lspci or
> /proc/interrupts in the domU)
> but d->domain_id results in "0" and not in the domain id of domU.
> So if in this case the code in ( !d ) should have been run, it didn't (have
> put a printk there to be sure)

No, generally the !d case shouldn't get executed, the following d
case, however, would expect the correct domain to be used (if it
had a body implemented).

> You were right that it didn't fix my freeze problem, although the RCU
> detected CPU stall was now followed by the beginning of a trace although it
> doesn't provide much more info.
> I attached a photo of it.

Looks like an access to a not mapped (IO-?)APIC page - that code
path likely hasn't been tested so far in the pv-ops context.

Jan