Hi all,

For a couple of weeks I've been tracking an issue with Xen on ARM, with no luck.

I've run out of ideas, so I'm sending this email to ask the community for advice.

Most of the time bash aborts with a random error in dom0:
    - page fault (data and prefetch abort)
    - memory corruption (malloc corruption and invalid pointer)

It's easy to reproduce by running ./configure on the xen tree.

My environment is an arndale board:
    - linux linaro 13.05 (using arndale_xen_dom0_defconfig and exynos5250_arndale.dts)
    - opensuse 12.03 (http://en.opensuse.org/HCL:Arndale)
    - xen upstream

The linux tree can be retrieved from git://xenbits.xen.org/people/julieng/linux-arm.git
using the branch linaro-3.10.
That branch is based on the linaro tree with some patches for the dts and xen.

The issue also occurs on the versatile express, but it's harder to reproduce.
There the environment is:
    - linux linaro 13.05 (using vexpress_xen_dom0_defconfig and vexpress_v2p_ca15_a7.dtb)
    - ubuntu linaro 13.05
    - xen upstream

I have tried different distributions and linux versions; the issue was the same.
I did some testing to narrow down the bug and arrived at the following test case:

Only dom0 is running and each VCPU is pinned to a specific cpu
(vcpu0 -> cpu0 and vcpu1 -> cpu1).

The patch below removes the WFI trap and, as a consequence, prevents a VCPU
from moving to another physical CPU.

========================================
diff --git a/xen/arch/arm/traps.c b/xen/arch/arm/traps.c
index 6cfba1a..e89ca15 100644
--- a/xen/arch/arm/traps.c
+++ b/xen/arch/arm/traps.c
@@ -62,7 +62,7 @@ void __cpuinit init_traps(void)
     WRITE_SYSREG((vaddr_t)hyp_traps_vector, VBAR_EL2);
 
     /* Setup hypervisor traps */
-    WRITE_SYSREG(HCR_PTW|HCR_BSU_OUTER|HCR_AMO|HCR_IMO|HCR_VM|HCR_TWI|HCR_TSC, HCR_EL2);
+    WRITE_SYSREG(HCR_PTW|HCR_BSU_OUTER|HCR_AMO|HCR_IMO|HCR_VM|HCR_TSC, HCR_EL2);
     isb();
 }
========================================

If a bash process is assigned to a specific cpu with taskset, the process seems
to always run without any issue.

    taskset -c 0 ./configure

I guess it's a caching issue, but each time I tried to play with the caching
policy, Linux did not boot.

Thanks in advance for any advice.

Cheers,

--
Julien Grall
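The taskset observation above can be turned into a rough failure count. The sketch below is not part of the original report: `count_failures` is a hypothetical helper that runs a command a given number of times and prints how many runs exited non-zero, on the assumption that a crashing ./configure ends with a non-zero status.

```shell
#!/bin/sh
# count_failures N CMD [ARGS...]
# Hypothetical helper: run CMD N times and print how many runs failed.
count_failures() {
    runs=$1
    shift
    failures=0
    i=0
    while [ "$i" -lt "$runs" ]; do
        # A run killed by a signal (e.g. SIGSEGV) also has a non-zero status.
        "$@" >/dev/null 2>&1 || failures=$((failures + 1))
        i=$((i + 1))
    done
    echo "$failures"
}
```

Run from the xen tree, `count_failures 20 ./configure` and `count_failures 20 taskset -c 0 ./configure` would give two numbers to compare; the run count of 20 is arbitrary.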
Christoffer Dall
2013-Jun-05 01:38 UTC
Re: [ARM] Bash often segfaults in Dom0 with the latest Xen
On 4 June 2013 15:45, Julien Grall <julien.grall@linaro.org> wrote:
> If a bash process is assigned to a specific cpu with taskset, the process seems
> to always run without any issue.
>
> taskset -c 0 ./configure
>
> I guess it's a caching issue, but each time I've tried to play with the caching
> policy Linux was not booting.
>
> Thanks in advance for any advice.

Some thoughts:

- Does dom0 run with Stage-2 translation? If so, you should be able
to disable caches in both Hyp mode and for dom0 by manipulating the
hyp registers, to try to rule out the caches. If Linux doesn't boot under
such a configuration, something else is completely broken, as it must be
transparent to your dom0.

- Are you doing any swapping and/or page reclaiming? I wouldn't
assume so for dom0, but if you are, you need to maintain the icache
properly, since it can be aliasing, see
http://lxr.linux.no/linux+v3.9.4/arch/arm/kvm/mmu.c#L495 (I doubt this
is the case though)

- All other cache accesses should be coherent across cores and are
physically indexed/physically tagged so I don't see how this could be
your issue.

- Are you managing the VMID properly across physical CPU migration?
(ensure that dom0 always uses the same vmid regardless of the physical
cpu)

- Do you always see the crash in user space or kernel space in dom0 or
is it all over the map?

-Christoffer
Ian Campbell
2013-Jun-05 09:38 UTC
Re: [ARM] Bash often segfaults in Dom0 with the latest Xen
On Tue, 2013-06-04 at 23:45 +0100, Julien Grall wrote:
> The patch below removes WFI trap and by consequence avoid a VCPU to move to
> another physical CPU.

FWIW the dom0_vcpus_pin command line parameter should have achieved the
same thing without removing the WFI code paths. I very much doubt that
code path is to blame but it might be worth ruling it out.

Ian.
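For readers less familiar with the patch under discussion: it drops a single flag, HCR_TWI, from the value written to the hypervisor configuration register. The arithmetic below only illustrates that difference. The bit positions are an assumption taken from the ARMv7-A HCR layout, not from the Xen headers, which this thread does not quote.

```shell
#!/bin/sh
# Assumed bit positions (ARMv7-A HCR layout); not copied from the Xen source.
HCR_VM=$((1 << 0))          # stage-2 translation enable
HCR_PTW=$((1 << 2))         # protected table walk
HCR_IMO=$((1 << 4))         # route IRQs to Hyp mode
HCR_AMO=$((1 << 5))         # route asynchronous aborts to Hyp mode
HCR_BSU_OUTER=$((2 << 10))  # barrier shareability upgrade: outer shareable
HCR_TWI=$((1 << 13))        # trap guest WFI
HCR_TSC=$((1 << 19))        # trap guest SMC

# Value written before the patch, and the value written with HCR_TWI removed.
hcr_before=$((HCR_PTW | HCR_BSU_OUTER | HCR_AMO | HCR_IMO | HCR_VM | HCR_TWI | HCR_TSC))
hcr_after=$((HCR_PTW | HCR_BSU_OUTER | HCR_AMO | HCR_IMO | HCR_VM | HCR_TSC))

printf 'before:  0x%x\n' "$hcr_before"
printf 'after:   0x%x\n' "$hcr_after"
printf 'cleared: 0x%x\n' "$((hcr_before ^ hcr_after))"
```

Under these assumed positions the only bit that changes is bit 13, so with the patch applied a guest WFI is no longer trapped to the hypervisor.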
Ian Campbell
2013-Jun-05 09:52 UTC
Re: [ARM] Bash often segfaults in Dom0 with the latest Xen
On Tue, 2013-06-04 at 18:38 -0700, Christoffer Dall wrote:
> - Does dom0 run with Stage-2 translation?

Yes.

> If so, you should be able
> to disable caches in both Hyp mode and for dom0 by manipulating the
> hyp registers to try and exclude caches. If Linux doesn't boot under
> such configuration, something else is completely broken, as it must be
> transparent to your dom0.

For some reason I had it in my head that the monitor used by the
load/store exclusive instructions was somehow tied to the cache
controller (i.e. you can't use them with caching disabled), which makes
it impossible to disable caching if you are using them in your spinlock
routines. I can't actually find anything to that effect in the ARM ARM
now though -- am/was I imagining things?

> - Are you doing any swapping and/or page reclaiming?

At the hypervisor level you mean? No. dom0 might be swapping itself but
I don't think that is what you meant and I expect Julien doesn't have a
swap device configured in any case.

> - All other cache accesses should be coherent across cores and are
> physically indexed/physically tagged so I don't see how this could be
> your issue.

Agreed.

> - Are you managing the VMID properly across physical CPU migration?
> (ensure that dom0 always uses the same vmid regardless of the physical
> cpu)

Currently VMID = DOMID + 1 so yes.

Ian.
Julien Grall
2013-Jun-05 10:39 UTC
Re: [ARM] Bash often segfaults in Dom0 with the latest Xen
On 06/05/2013 10:38 AM, Ian Campbell wrote:
> On Tue, 2013-06-04 at 23:45 +0100, Julien Grall wrote:
>> The patch below removes WFI trap and by consequence avoid a VCPU to move to
>> another physical CPU.
>
> FWIW the dom0_vcpus_pin command line parameter should have achieved the
> same thing without removing the WFI code paths. I very much doubt that
> code path is to blame but it might be worth ruling it out.

Thanks for this option.

--
Julien
Julien Grall
2013-Jun-05 11:48 UTC
Re: [ARM] Bash often segfaults in Dom0 with the latest Xen
On 06/05/2013 02:38 AM, Christoffer Dall wrote:
> - All other cache accesses should be coherent across cores and are
> physically indexed/physically tagged so I don't see how this could be
> your issue.

It was only an idea because I have noticed the memory was often corrupted.

> - Do you always see the crash in user space or kernel space in dom0 or
> is it all over the map?

Only in user space in dom0.

--
Julien
Christoffer Dall
2013-Jun-05 14:30 UTC
Re: [ARM] Bash often segfaults in Dom0 with the latest Xen
On 5 June 2013 04:48, Julien Grall <julien.grall@linaro.org> wrote:
> On 06/05/2013 02:38 AM, Christoffer Dall wrote:
>> - Do you always see the crash in user space or kernel space in dom0 or
>> is it all over the map?
>
> Only in user space in dom0.

Hmm, which kernel version is dom0 based on? Can you bisect the dom0
source to make sure it's not something introduced during development?

You have this in your tree, right: "9d1f5c ARM: 7641/1: memory: fix
broken mmap..." ?

-Christoffer
Ian Campbell
2013-Jun-05 15:18 UTC
Re: [ARM] Bash often segfaults in Dom0 with the latest Xen
On Wed, 2013-06-05 at 07:30 -0700, Christoffer Dall wrote:
> You have this in your tree right: "9d1f5c ARM: 7641/1: memory: fix
> broken mmap..." ?

FYI 9d1f5c is ambiguous in my tree, 79d1f5c AKA
79d1f5c9acf9fc8d06e5537083b19114ce87159f is the unambiguous commit at
least for me.

Ian.
Julien Grall
2013-Jun-05 16:12 UTC
Re: [ARM] Bash often segfaults in Dom0 with the latest Xen
On 06/05/2013 03:30 PM, Christoffer Dall wrote:
> Hmm, which kernel version is dom0 based on? Can you bisect the dom0
> source to make sure it's not something introduced during development.

I'm using Linaro's branch ll_20130528.0, with only a few patches for
the dts and some patches that are not yet in the linaro tree.

I have the same issue with linux 3.9-rc4 with multiple CPUs, and I can't
really go back any earlier without carrying many xen patches to try it.

I have tried different configurations of the number of CPUs in Xen
(pCPU) and linux (vCPU):
    - 2 pCPU 2 vCPU : segfaulting
    - 2 pCPU 1 vCPU : working
    - 1 pCPU 1 vCPU : working
    - 1 pCPU 2 vCPU : very slow but working

> You have this in your tree right: "9d1f5c ARM: 7641/1: memory: fix
> broken mmap..." ?

Yes.

--
Julien
Stefano Stabellini
2013-Jun-05 16:46 UTC
Re: [ARM] Bash often segfaults in Dom0 with the latest Xen
On Wed, 5 Jun 2013, Julien Grall wrote:
> I have tried different configuration with the number of CPUs in Xen
> (pCPU) and linux (vCPU):
> - 2 pCPU 2 vCPU : segfaulting
> - 2 pCPU 1 vCPU : working
> - 1 pCPU 1 vCPU : working
> - 1 pCPU 2 vCPU : very slow but working

If you put it like that, it would seem to me that the most likely
candidate would be a bug in SMP support in Xen.
What happens if you have 2 pCPU, 1 vCPU but you keep moving the vCPU
between the two pCPUs?
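One way to try this from dom0, assuming the xl toolstack is usable on the board, is to re-pin the vCPU in a loop. This is only a sketch: `bounce_vcpu` and the `PIN_CMD` override are hypothetical names introduced here, and `xl vcpu-pin <domain> <vcpu> <cpus>` is the only real command it relies on. Nothing in the thread says this is how the vCPU was actually moved.

```shell
#!/bin/sh
# bounce_vcpu DOMAIN VCPU ROUNDS DELAY CPU...
# Hypothetical helper: pin VCPU of DOMAIN to each listed pCPU in turn,
# waiting DELAY seconds after every move, and repeat that ROUNDS times.
# PIN_CMD can be overridden (e.g. PIN_CMD=echo) for a dry run.
bounce_vcpu() {
    domain=$1
    vcpu=$2
    rounds=$3
    delay=$4
    shift 4
    n=0
    while [ "$n" -lt "$rounds" ]; do
        for cpu in "$@"; do
            # Left unquoted on purpose so "xl vcpu-pin" splits into two words.
            ${PIN_CMD:-xl vcpu-pin} "$domain" "$vcpu" "$cpu" || return 1
            sleep "$delay"
        done
        n=$((n + 1))
    done
}
```

For instance, `bounce_vcpu Domain-0 0 100 2 0 1` would move vCPU 0 of dom0 between pCPU 0 and pCPU 1 every 2 seconds, 100 times over, while ./configure runs in another shell.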
Christoffer Dall
2013-Jun-05 17:36 UTC
Re: [ARM] Bash often segfaults in Dom0 with the latest Xen
> I'm using the linaro's branch ll_20130528.0, I have only few patches for
> the dts and not yet in linaro tree patches.
>
> I have the same issue with linux 3.9-rc4 with multiple CPUs and I can't
> really go before without carrying many xen patches to try it.
>
> I have tried different configuration with the number of CPUs in Xen
> (pCPU) and linux (vCPU):
> - 2 pCPU 2 vCPU : segfaulting
> - 2 pCPU 1 vCPU : working
> - 1 pCPU 1 vCPU : working
> - 1 pCPU 2 vCPU : very slow but working

For the 2 pCPU 1 vCPU case: are you still compiling your dom0 as an SMP
kernel but only creating 1 vCPU, or are you actually compiling the dom0
as UP?

-Christoffer
Julien Grall
2013-Jun-05 17:53 UTC
Re: [ARM] Bash often segfaults in Dom0 with the latest Xen
On 06/05/2013 06:36 PM, Christoffer Dall wrote:
> 2 pCPU 1 vCPU are you still compiling your dom0 as an SMP kernel, but
> only creating 1 vCPU or are you actually compiling the dom0 as UP?

Yes. It's the same kernel with the same command line (i.e. without nosmp).
I have limited the number of dom0 vcpus with dom0_max_vcpus=1 on the xen
command line.

--
Julien
Christoffer Dall
2013-Jun-05 17:57 UTC
Re: [ARM] Bash often segfaults in Dom0 with the latest Xen
On 5 June 2013 10:53, Julien Grall <julien.grall@linaro.org> wrote:
> Yes. It's same kernel with the same command line (ie without nosmp).
> I have limited the number of dom0 vcpus with dom0_max_vcpus=1 on xen
> command line.

It indicates a bug in Xen then. Curious that it only happens for user
space in dom0, but perhaps you just haven't seen it in the kernel yet.
Bash scripts are pretty intensive on page faults so perhaps there's a
synchronization issue with some of your page fault handlers.

You could try to touch all the memory inside dom0 (dd to a ramfs for
example) and then run your bash script and see if the problem still
occurs; that should point you to whether it's a stage-2 fault handling
issue, but this is not a fool-proof approach. Maybe Xen can
pre-allocate all the stage-2 entries?

-Christoffer
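A minimal sketch of the "dd to a ramfs" suggestion, assuming a tmpfs or ramfs is already mounted at the directory passed in. `fill_ram` and the `fill` file name are hypothetical, and the size has to be chosen by hand to stay below dom0's free memory.

```shell
#!/bin/sh
# fill_ram DIR SIZE_MB
# Hypothetical helper: write SIZE_MB megabytes of zeros into DIR/fill.
# If DIR is backed by RAM (tmpfs/ramfs), this forces that much guest
# memory to be written at least once.
fill_ram() {
    dir=$1
    size_mb=$2
    dd if=/dev/zero of="$dir/fill" bs=1048576 count="$size_mb" 2>/dev/null
}
```

A possible sequence, as root in dom0: `mount -t tmpfs tmpfs /mnt/fill`, then `fill_ram /mnt/fill 512`, then `rm /mnt/fill/fill` to free the pages again before running ./configure. The mount point and the 512 MB figure are placeholders.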
Stefano Stabellini
2013-Jun-05 18:01 UTC
Re: [ARM] Bash often segfaults in Dom0 with the latest Xen
On Wed, 5 Jun 2013, Christoffer Dall wrote:
> You could try to touch all the memory inside dom0 (dd to a ramfs for
> example) and then run your bash script and see if the problem still
> occurs, that should point you to whether it's a stage-2 fault handling
> issue, but this is not a fool-proof approach. Maybe Xen can
> pre-allocate all the stage-2 entries?

Xen pre-allocates all the memory for stage-2 entries (no overcommit or
populate on demand by default).
Christoffer Dall
2013-Jun-05 18:17 UTC
Re: [ARM] Bash often segfaults in Dom0 with the latest Xen
On 5 June 2013 11:01, Stefano Stabellini <stefano.stabellini@eu.citrix.com> wrote:
> Xen pre-allocates all the memory for stage-2 entries (no overcommit or
> populate on demand by default)

What was the conclusion when pinning the vcpu to dedicated pcpus - did
the error still show up?
Julien Grall
2013-Jun-05 18:36 UTC
Re: [ARM] Bash often segfaults in Dom0 with the latest Xen
On 06/05/2013 07:17 PM, Christoffer Dall wrote:
> On 5 June 2013 11:01, Stefano Stabellini
> <stefano.stabellini@eu.citrix.com> wrote:
>> On Wed, 5 Jun 2013, Christoffer Dall wrote:
>>> On 5 June 2013 10:53, Julien Grall <julien.grall@linaro.org> wrote:
>>>> On 06/05/2013 06:36 PM, Christoffer Dall wrote:
>>>>
>>>>>>
>>>>>> I'm using the linaro's branch ll_20130528.0, I have only few patches for
>>>>>> the dts and not yet in linaro tree patches.
>>>>>>
>>>>>> I have the same issue with linux 3.9-rc4 with multiple CPUs and I can't
>>>>>> really go before without carrying many xen patches to try it.
>>>>>>
>>>>>> I have tried different configuration with the number of CPUs in Xen
>>>>>> (pCPU) and linux (vCPU):
>>>>>> - 2 pCPU 2 vCPU : segfaulting
>>>>>> - 2 pCPU 1 vCPU : working
>>>>>> - 1 pCPU 1 vCPU : working
>>>>>> - 1 pCPU 2 vCPU : very slow but working
>>>>>>
>>>>> 2 pCPU 1 vCPU are you still compiling your dom0 as an SMP kernel, but
>>>>> only creating 1 vCPU or are you actually compiling the dom0 as UP?
>>>>
>>>>
>>>> Yes. It's same kernel with the same command line (ie without nosmp).
>>>> I have limited the number of dom0 vcpus with dom0_max_vcpus=1 on xen
>>>> command line.
>>>>
>>> It indicates a bug in Xen then. Curious that it only happens for user
>>> space in dom0, but perhaps you just haven't seen it in the kernel yet.
>>> Bash scripts are pretty intensive on page faults so perhaps there's a
>>> synchronization issue with some of your page fault handlers.
>>>
>>> You could try to touch all the memory inside dom0 (dd to a ramfs for
>>> example) and then run your bash script and see if the problem still
>>> occurs, that should point you to whether it's a stage-2 fault handling
>>> issue, but this is not a fool-proof approach. Maybe Xen can
>>> pre-allocate all the stage-2 entries?
>>
>> Xen pre-allocates all the memory for stage-2 entries (no overcommit or
>> populate on demand by default)
>
> what was the conclusion when pinning the vcpu to dedicated pcpus - did
> the error still show up?

If I have 2 pCPU (CPU 0 and CPU 2) and 1 vCPU which moves between the
pCPUs every 2 seconds, Linux freezes each time the vcpu runs on CPU 1.
I'm not sure why; it is perhaps another issue.

As Stefano advised me, I will set up the debugger on the arndale
tomorrow and see if I can find something.

-- 
Julien
Julien Grall
2013-Jun-11 11:48 UTC
Re: [ARM] Bash often segfaults in Dom0 with the latest Xen
On 06/05/2013 02:38 AM, Christoffer Dall wrote:
> - All other cache accesses should be coherent across cores and are
> physically indexed/physically tagged so I don't see how this could be
> your issue.

I was looking at the KVM code and noticed that it traps all data cache
instructions. Is there any reason to trap them?

I wonder if this could be the issue on Xen, because we don't trap cache
instructions and the manual is not clear on that.

Thanks,

-- 
Julien
Christoffer Dall
2013-Jun-11 14:25 UTC
Re: [ARM] Bash often segfaults in Dom0 with the latest Xen
On 11 June 2013 04:48, Julien Grall <julien.grall@linaro.org> wrote:
> On 06/05/2013 02:38 AM, Christoffer Dall wrote:
>
>> - All other cache accesses should be coherent across cores and are
>> physically indexed/physically tagged so I don't see how this could be
>> your issue.
>
>
> I was looking on KVM code and I noticed that it traps all data cache
> instruction. Is there any reason to trap it?
>
> I wonder if it could be the issue on Xen because we don't trap cache
> instruction and the manual is not clear on that.
>

We don't trap all cache maintenance instructions, only those that work
by set/way, since if the vcpu gets migrated in the middle, it will
potentially clean one pcpu partially and another pcpu partially. So we
simply stop everything, clean everything, and carry on. Luckily, this
is not very common with linux guests.

This may be your issue if it really happens...
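
For readers following along, here is a rough sketch of what enabling a
set/way trap could look like in terms of HCR bits. It is an illustration
only, not code from Xen or KVM. The bit positions are the ones I
understand the ARMv7-A Architecture Reference Manual to give for HCR;
the macro names follow the init_traps() patch quoted at the start of
the thread, and HCR_TSW plus the two helper functions are hypothetical
additions made up for this example.

```c
#include <stdint.h>

/* HCR bit positions, assumed from the ARMv7-A Architecture Reference
 * Manual. Names follow the init_traps() patch earlier in the thread;
 * HCR_TSW is added here for illustration only. */
#define HCR_VM        (UINT32_C(1) << 0)   /* enable stage-2 translation */
#define HCR_PTW       (UINT32_C(1) << 2)   /* protected table walk */
#define HCR_IMO       (UINT32_C(1) << 4)   /* route IRQs to hyp mode */
#define HCR_AMO       (UINT32_C(1) << 5)   /* route async aborts to hyp mode */
#define HCR_BSU_OUTER (UINT32_C(2) << 10)  /* barrier shareability upgrade: outer */
#define HCR_TWI       (UINT32_C(1) << 13)  /* trap WFI */
#define HCR_TSC       (UINT32_C(1) << 19)  /* trap SMC */
#define HCR_TSW       (UINT32_C(1) << 22)  /* trap set/way cache maintenance */

/* Hypothetical helper: the mask written to HCR by the unpatched
 * init_traps() line quoted earlier in the thread. */
static inline uint32_t hcr_baseline(void)
{
    return HCR_PTW | HCR_BSU_OUTER | HCR_AMO | HCR_IMO | HCR_VM
         | HCR_TWI | HCR_TSC;
}

/* Hypothetical variant: additionally trap the guest's set/way data
 * cache operations (DCISW, DCCSW, DCCISW) to the hypervisor. */
static inline uint32_t hcr_with_set_way_trap(void)
{
    return hcr_baseline() | HCR_TSW;
}
```

Setting the bit only makes the guest's set/way operations trap. The
hypervisor still needs a handler that does the "stop everything, clean
everything, and carry on" part described above, which this sketch does
not attempt.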