thr3ads.net - Xen devel - [ARM] Bash often segfaults in Dom0 with the latest Xen [Jun 2013]

Julien Grall

2013-Jun-04 22:45 UTC

[ARM] Bash often segfaults in Dom0 with the latest Xen

Hi all,

For a couple of weeks I've been tracking an issue with Xen on ARM, with no
luck.

I've run out of ideas, so I'm sending this email to ask the community for
advice.

Most of the time bash aborts in dom0 with a random error:
  - page fault (data and prefetch abort)
  - memory corruption (malloc corruption and invalid pointer)

It's easy to reproduce by running ./configure in the xen tree.

My environment is an Arndale board:
  - linux linaro 13.05 (using arndale_xen_dom0_defconfig and
    exynos5250_arndale.dts)
  - opensuse 12.03 (http://en.opensuse.org/HCL:Arndale)
  - xen upstream

The linux tree can be retrieved from
git://xenbits.xen.org/people/julieng/linux-arm.git
using the branch linaro-3.10.
This branch is based on the linaro tree with some patches for the dts
and xen.

The issue also occurs on the Versatile Express, but it's harder to
reproduce there.
Here the environment is:
  - linux linaro 13.05 (using vexpress_xen_dom0_defconfig and
    vexpress_v2p_ca15_a7.dtb)
  - ubuntu linaro 13.05
  - xen upstream

I have tried different distributions and Linux versions; the issue was the same.
I did some testing to narrow down the bug and arrived at the following test
case:

Only dom0 is running and each VCPU is pinned to a specific CPU
(vcpu0 -> cpu0 and vcpu1 -> cpu1).

The patch below removes the WFI trap and consequently prevents a VCPU from
moving to another physical CPU.
========================================
diff --git a/xen/arch/arm/traps.c b/xen/arch/arm/traps.c
index 6cfba1a..e89ca15 100644
--- a/xen/arch/arm/traps.c
+++ b/xen/arch/arm/traps.c
@@ -62,7 +62,7 @@ void __cpuinit init_traps(void)
     WRITE_SYSREG((vaddr_t)hyp_traps_vector, VBAR_EL2);
 
     /* Setup hypervisor traps */
-    WRITE_SYSREG(HCR_PTW|HCR_BSU_OUTER|HCR_AMO|HCR_IMO|HCR_VM|HCR_TWI|HCR_TSC, HCR_EL2);
+    WRITE_SYSREG(HCR_PTW|HCR_BSU_OUTER|HCR_AMO|HCR_IMO|HCR_VM|HCR_TSC, HCR_EL2);
     isb();
 }
 
========================================
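To make the effect of the patch explicit, here is a minimal standalone sketch of the two HCR values. The bit positions are written from memory of the ARMv7-A HCR layout and are assumptions to be checked against the ARM ARM and Xen's own headers; the helper names hcr_default() and hcr_no_wfi_trap() are hypothetical and do not exist in Xen.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed ARMv7-A HCR bit positions; verify against the ARM ARM
 * before relying on them. */
#define HCR_VM        (UINT32_C(1) << 0)   /* enable stage-2 translation */
#define HCR_PTW       (UINT32_C(1) << 2)   /* protected table walk */
#define HCR_IMO       (UINT32_C(1) << 4)   /* route IRQs to Hyp mode */
#define HCR_AMO       (UINT32_C(1) << 5)   /* route async aborts to Hyp mode */
#define HCR_BSU_OUTER (UINT32_C(2) << 10)  /* barrier shareability upgrade: outer */
#define HCR_TWI       (UINT32_C(1) << 13)  /* trap WFI to Hyp mode */
#define HCR_TSC       (UINT32_C(1) << 19)  /* trap SMC to Hyp mode */

/* Value written by init_traps() before the patch. */
static uint32_t hcr_default(void)
{
    return HCR_PTW | HCR_BSU_OUTER | HCR_AMO | HCR_IMO | HCR_VM |
           HCR_TWI | HCR_TSC;
}

/* Value written after the patch: the same mask with the WFI trap cleared. */
static uint32_t hcr_no_wfi_trap(void)
{
    uint32_t hcr = hcr_default() & ~HCR_TWI;

    assert((hcr & HCR_TWI) == 0);
    return hcr;
}
```

The two values differ in the TWI bit only, so with the patch applied a guest WFI is no longer trapped to the hypervisor while every other trap stays configured as before.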
If a bash process is pinned to a specific CPU with taskset, it seems
to always run without any issue.

taskset -c 0 ./configure

I guess it's a caching issue, but each time I've tried to play with the
caching policy, Linux did not boot.

Thanks in advance for any advice.

Cheers,

--
Julien Grall

Christoffer Dall

2013-Jun-05 01:38 UTC

Re: [ARM] Bash often segfaults in Dom0 with the latest Xen

On 4 June 2013 15:45, Julien Grall <julien.grall@linaro.org> wrote:
> I guess it's a caching issue, but each time I've tried to play with the
> caching policy Linux was not booting.
>
> Thanks in advance for any advice.

Some thoughts:

- Does dom0 run with Stage-2 translation? If so, you should be able
to disable caches in both Hyp mode and for dom0 by manipulating the
hyp registers, to try to rule out the caches. If Linux doesn't boot under
such a configuration, something else is completely broken, as it must be
transparent to your dom0.

- Are you doing any swapping and/or page reclaiming? I wouldn't
assume so for dom0, but if you are, you need to maintain the icache
properly, since it can be aliasing, see
http://lxr.linux.no/linux+v3.9.4/arch/arm/kvm/mmu.c#L495 (I doubt this
is the case though)

- All other cache accesses should be coherent across cores and are
physically indexed/physically tagged so I don't see how this could be
your issue.

- Are you managing the VMID properly across physical CPU migration?
(ensure that dom0 always uses the same vmid regardless of the physical
cpu)

- Do you always see the crash in user space or kernel space in dom0 or
is it all over the map?

-Christoffer

Ian Campbell

2013-Jun-05 09:38 UTC

Re: [ARM] Bash often segfaults in Dom0 with the latest Xen

On Tue, 2013-06-04 at 23:45 +0100, Julien Grall wrote:
> The patch below removes the WFI trap and consequently prevents a VCPU from
> moving to another physical CPU.

FWIW the dom0_vcpus_pin command line parameter should have achieved the
same thing without removing the WFI code paths. I very much doubt that
code path is to blame but it might be worth ruling it out.

Ian.

Ian Campbell

2013-Jun-05 09:52 UTC

Re: [ARM] Bash often segfaults in Dom0 with the latest Xen

On Tue, 2013-06-04 at 18:38 -0700, Christoffer Dall wrote:
>  - Does dom0 run with Stage-2 translation?

Yes.

>  If so, you should be able
> to disable caches in both Hyp mode and for dom0 by manipulating the
> hyp registers to try and exclude caches. If Linux doesn't boot under
> such configuration, something else is completely broken, as it must be
> transparent to your dom0.

For some reason I had it in my head that the monitor used by the
load/store exclusive instructions was somehow tied to the cache
controller (i.e. you can't use them with caching disabled) which makes
it impossible to disable caching if you are using them in your spinlock
routines.

I can't actually find anything to that effect in the ARM ARM now though
-- Am/was I imagining things?

>  - Are you doing any swapping and/or page reclaiming?

At the hypervisor level you mean? No.

dom0 might be swapping itself but I don't think that is what you meant
and I expect Julien doesn't have a swap device configured in any case.

> - All other cache accesses should be coherent across cores and are
> physically indexed/physically tagged so I don't see how this could be
> your issue.

Agreed.

> - Are you managing the VMID properly across physical CPU migration?
> (ensure that dom0 always uses the same vmid regardless of the physical
> cpu)

Currently VMID = DOMID + 1 so yes.
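
As a rough sketch of why that mapping is safe across migration: the VMID is a pure function of the domain ID, so it cannot depend on which physical CPU a VCPU happens to run on. The helper name vmid_for_domain() is hypothetical, and the 8-bit VMID width is an assumption about the ARMv7 VTTBR layout.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper mirroring "VMID = DOMID + 1". The result depends
 * only on the domain ID, never on the physical CPU. */
static uint8_t vmid_for_domain(uint16_t domid)
{
    /* Assumes an 8-bit VMID field, so only domids 0..254 fit. */
    assert(domid < 255);
    return (uint8_t)(domid + 1);
}
```

With this scheme dom0 (domid 0) always runs with VMID 1.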

Ian.

Julien Grall

2013-Jun-05 10:39 UTC

Re: [ARM] Bash often segfaults in Dom0 with the latest Xen

On 06/05/2013 10:38 AM, Ian Campbell wrote:
> On Tue, 2013-06-04 at 23:45 +0100, Julien Grall wrote:
>> The patch below removes the WFI trap and consequently prevents a VCPU from
>> moving to another physical CPU.
>
> FWIW the dom0_vcpus_pin command line parameter should have achieved the
> same thing without removing the WFI code paths. I very much doubt that
> code path is to blame but it might be worth ruling it out.

Thanks for this option.

-- 
Julien

Julien Grall

2013-Jun-05 11:48 UTC

Re: [ARM] Bash often segfaults in Dom0 with the latest Xen

On 06/05/2013 02:38 AM, Christoffer Dall wrote:
> - All other cache accesses should be coherent across cores and are
> physically indexed/physically tagged so I don't see how this could be
> your issue.

It was only an idea because I have noticed the memory was often corrupted.

> - Do you always see the crash in user space or kernel space in dom0 or
> is it all over the map?

Only in user space in dom0.

-- 
Julien

Christoffer Dall

2013-Jun-05 14:30 UTC

Re: [ARM] Bash often segfaults in Dom0 with the latest Xen

On 5 June 2013 04:48, Julien Grall <julien.grall@linaro.org> wrote:
>> - Do you always see the crash in user space or kernel space in dom0 or
>> is it all over the map?
>
> Only in user space in dom0.

Hmm, which kernel version is dom0 based on? Can you bisect the dom0
source to make sure it's not something introduced during development?

You have this in your tree right: "9d1f5c ARM: 7641/1: memory: fix
broken mmap..." ?

-Christoffer

Ian Campbell

2013-Jun-05 15:18 UTC

Re: [ARM] Bash often segfaults in Dom0 with the latest Xen

On Wed, 2013-06-05 at 07:30 -0700, Christoffer Dall wrote:
> > Only in user space in dom0.
> >
> Hmm, which kernel version is dom0 based on? Can you bisect the dom0
> source to make sure it's not something introduced during development.
>
> You have this in your tree right: "9d1f5c ARM: 7641/1: memory: fix
> broken mmap..." ?

FYI 9d1f5c is ambiguous in my tree, 79d1f5c AKA
79d1f5c9acf9fc8d06e5537083b19114ce87159f is the unambiguous commit at
least for me.

Ian.

Julien Grall

2013-Jun-05 16:12 UTC

Re: [ARM] Bash often segfaults in Dom0 with the latest Xen

On 06/05/2013 03:30 PM, Christoffer Dall wrote:
> Hmm, which kernel version is dom0 based on? Can you bisect the dom0
> source to make sure it's not something introduced during development.

I'm using Linaro's branch ll_20130528.0; I only have a few patches for
the dts, plus some patches that are not yet in the linaro tree.

I have the same issue with linux 3.9-rc4 with multiple CPUs, and I can't
really go back any further without carrying many xen patches to try it.

I have tried different configurations of the number of CPUs in Xen
(pCPU) and linux (vCPU):
  - 2 pCPU 2 vCPU : segfaulting
  - 2 pCPU 1 vCPU : working
  - 1 pCPU 1 vCPU : working
  - 1 pCPU 2 vCPU : very slow but working

> You have this in your tree right: "9d1f5c ARM: 7641/1: memory: fix
> broken mmap..." ?

Yes.

-- 
Julien

Stefano Stabellini

2013-Jun-05 16:46 UTC

Re: [ARM] Bash often segfaults in Dom0 with the latest Xen

On Wed, 5 Jun 2013, Julien Grall wrote:
> I have tried different configuration with the number of CPUs in Xen
> (pCPU) and linux (vCPU):
>   - 2 pCPU 2 vCPU : segfaulting
>   - 2 pCPU 1 vCPU : working
>   - 1 pCPU 1 vCPU : working
>   - 1 pCPU 2 vCPU : very slow but working

If you put it like that, it would seem to me that the most likely
candidate would be a bug in SMP support in Xen.
What happens if you have 2 pCPU, 1 vCPU but you keep moving the vCPU
between the two pCPUs?

Christoffer Dall

2013-Jun-05 17:36 UTC

Re: [ARM] Bash often segfaults in Dom0 with the latest Xen

> I have tried different configurations of the number of CPUs in Xen
> (pCPU) and linux (vCPU):
>   - 2 pCPU 2 vCPU : segfaulting
>   - 2 pCPU 1 vCPU : working
>   - 1 pCPU 1 vCPU : working
>   - 1 pCPU 2 vCPU : very slow but working

With 2 pCPU and 1 vCPU, are you still compiling your dom0 as an SMP kernel
but only creating 1 vCPU, or are you actually compiling the dom0 as UP?

-Christoffer

Julien Grall

2013-Jun-05 17:53 UTC

Re: [ARM] Bash often segfaults in Dom0 with the latest Xen

On 06/05/2013 06:36 PM, Christoffer Dall wrote:
> 2 pCPU 1 vCPU are you still compiling your dom0 as an SMP kernel, but
> only creating 1 vCPU or are you actually compiling the dom0 as UP?

Yes. It's the same kernel with the same command line (i.e. without nosmp).
I have limited the number of dom0 vcpus with dom0_max_vcpus=1 on the xen
command line.

--
Julien

Christoffer Dall

2013-Jun-05 17:57 UTC

Re: [ARM] Bash often segfaults in Dom0 with the latest Xen

On 5 June 2013 10:53, Julien Grall <julien.grall@linaro.org> wrote:
> Yes. It's the same kernel with the same command line (i.e. without nosmp).
> I have limited the number of dom0 vcpus with dom0_max_vcpus=1 on the xen
> command line.

It indicates a bug in Xen then. Curious that it only happens for user
space in dom0, but perhaps you just haven't seen it in the kernel yet.
Bash scripts are pretty intensive on page faults so perhaps there's a
synchronization issue with some of your page fault handlers.

You could try to touch all the memory inside dom0 (dd to a ramfs for
example) and then run your bash script and see if the problem still
occurs. That should point you to whether it's a stage-2 fault handling
issue, but this is not a fool-proof approach. Maybe Xen can
pre-allocate all the stage-2 entries?
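
As an alternative to dd, the "touch all the memory" step could be sketched in C along these lines. This is only an illustration under assumptions: touch_pages() is a hypothetical helper, it only touches memory the allocator hands to this one process, and how much to allocate inside dom0 is left to the caller.

```c
#include <assert.h>
#include <stdlib.h>
#include <unistd.h>

/* Allocate `bytes` of memory and write one byte into every page, so that
 * each page is faulted in rather than merely reserved. Returns the number
 * of pages written, or 0 if the allocation failed. */
static size_t touch_pages(size_t bytes)
{
    long page = sysconf(_SC_PAGESIZE);
    volatile unsigned char *buf;
    size_t touched = 0;

    assert(page > 0);

    buf = malloc(bytes);
    if (buf == NULL)
        return 0;

    /* One write per page is enough to force the page to be mapped. */
    for (size_t off = 0; off < bytes; off += (size_t)page) {
        buf[off] = 1;
        touched++;
    }

    free((void *)buf);
    return touched;
}
```

Calling it with a size close to the free memory in dom0 before running ./configure would approximate the experiment, with the same caveat as above that it is not fool-proof.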

-Christoffer

Stefano Stabellini

2013-Jun-05 18:01 UTC

Re: [ARM] Bash often segfaults in Dom0 with the latest Xen

On Wed, 5 Jun 2013, Christoffer Dall wrote:
> You could try to touch all the memory inside dom0 (dd to a ramfs for
> example) and then run your bash script and see if the problem still
> occurs, that should point you to whether it's a stage-2 fault handling
> issue, but this is not a fool-proof approach. Maybe Xen can
> pre-allocate all the stage-2 entries?

Xen pre-allocates all the memory for stage-2 entries (no overcommit or
populate on demand by default)

Christoffer Dall

2013-Jun-05 18:17 UTC

Re: [ARM] Bash often segfaults in Dom0 with the latest Xen

On 5 June 2013 11:01, Stefano Stabellini <stefano.stabellini@eu.citrix.com> wrote:
> Xen pre-allocates all the memory for stage-2 entries (no overcommit or
> populate on demand by default)
what was the conclusion when pinning the vcpu to dedicated pcpus - did
the error still show up?

Julien Grall

2013-Jun-05 18:36 UTC

Re: [ARM] Bash often segfaults in Dom0 with the latest Xen

On 06/05/2013 07:17 PM, Christoffer Dall wrote:
> On 5 June 2013 11:01, Stefano Stabellini
> <stefano.stabellini@eu.citrix.com> wrote:
>> On Wed, 5 Jun 2013, Christoffer Dall wrote:
>>> On 5 June 2013 10:53, Julien Grall <julien.grall@linaro.org> wrote:
>>>> On 06/05/2013 06:36 PM, Christoffer Dall wrote:
>>>>
>>>>>>
>>>>>> I'm using the linaro's branch ll_20130528.0, I have only few patches for
>>>>>> the dts and not yet in linaro tree patches.
>>>>>>
>>>>>> I have the same issue with linux 3.9-rc4 with multiple CPUs and I can't
>>>>>> really go before without carrying many xen patches to try it.
>>>>>>
>>>>>> I have tried different configuration with the number of CPUs in Xen
>>>>>> (pCPU) and linux (vCPU):
>>>>>>   - 2 pCPU 2 vCPU : segfaulting
>>>>>>   - 2 pCPU 1 vCPU : working
>>>>>>   - 1 pCPU 1 vCPU : working
>>>>>>   - 1 pCPU 2 vCPU : very slow but working
>>>>>>
>>>>> 2 pCPU 1 vCPU are you still compiling your dom0 as an SMP kernel, but
>>>>> only creating 1 vCPU or are you actually compiling the dom0 as UP?
>>>>
>>>>
>>>> Yes. It's same kernel with the same command line (ie without nosmp).
>>>> I have limited the number of dom0 vcpus with dom0_max_vcpus=1 on xen
>>>> command line.
>>>>
>>> It indicates a bug in Xen then. Curious that it only happens for user
>>> space in dom0, but perhaps you just haven't seen it in the kernel yet.
>>> Bash scripts are pretty intensive on page faults so perhaps there's a
>>> synchronization issue with some of your page fault handlers.
>>>
>>> You could try to touch all the memory inside dom0 (dd to a ramfs for
>>> example) and then run your bash script and see if the problem still
>>> occurs, that should point you to whether it's a stage-2 fault handling
>>> issue, but this is not a fool-proof approach. Maybe Xen can
>>> pre-allocate all the stage-2 entries?
>>
>> Xen pre-allocates all the memory for stage-2 entries (no overcommit or
>> populate on demand by default)
>
> what was the conclusion when pinning the vcpu to dedicated pcpus - did
> the error still show up?

If I have 2 pCPUs (CPU 0 and CPU 1) and 1 vCPU which moves between the
pCPUs every 2 seconds, Linux freezes each time the vCPU is running on
CPU 1. I'm not sure why; it may be a separate issue.

As Stefano advised, I will set up the debugger on the arndale tomorrow
and see if I can find something.

-- 
Julien

Julien Grall

2013-Jun-11 11:48 UTC

Re: [ARM] Bash often segfaults in Dom0 with the latest Xen

On 06/05/2013 02:38 AM, Christoffer Dall wrote:
> - All other cache accesses should be coherent across cores and are
> physically indexed/physically tagged so I don't see how this could be
> your issue.

I was looking at the KVM code and I noticed that it traps all data cache
instructions. Is there any reason to trap them?

I wonder if it could be the issue on Xen because we don't trap cache
instructions and the manual is not clear on that.

Thanks,

-- 
Julien

Christoffer Dall

2013-Jun-11 14:25 UTC

Re: [ARM] Bash often segfaults in Dom0 with the latest Xen

On 11 June 2013 04:48, Julien Grall <julien.grall@linaro.org> wrote:
> On 06/05/2013 02:38 AM, Christoffer Dall wrote:
>
>> - All other cache accesses should be coherent across cores and are
>> physically indexed/physically tagged so I don't see how this could be
>> your issue.
>
>
> I was looking at the KVM code and I noticed that it traps all data cache
> instructions. Is there any reason to trap them?
>
> I wonder if it could be the issue on Xen because we don't trap cache
> instructions and the manual is not clear on that.
>

We don't trap all cache maintenance instructions, only those that work
by set/way, since if the vcpu gets migrated in the middle, it will
potentially clean one pcpu partially and another pcpu partially. So we
simply stop everything, clean everything, and carry on. Luckily, this
is not very common with linux guests.

This may be your issue if it really happens...
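
To make the set/way hazard described above concrete, here is a minimal toy model. It is not KVM or Xen code: the cache geometry, the names `make_cache`, `clean_all_by_set_way` and `dirty_lines`, and the migration point are all assumptions chosen for illustration. Each pCPU has a private cache, and a guest loop "cleans" one (set, way) at a time on whichever pCPU the vCPU is currently running on.

```python
from itertools import product

SETS, WAYS = 4, 2  # made-up cache geometry: 8 lines per pCPU


def make_cache(dirty=True):
    """Model one pCPU's private cache as {(set, way): is_dirty}."""
    return {(s, w): dirty for s, w in product(range(SETS), range(WAYS))}


def clean_all_by_set_way(caches, start_pcpu, migrate_after=None, target_pcpu=None):
    """Guest loop that cleans every (set, way), one operation at a time.

    Each operation only affects the cache of the pCPU the vCPU runs on
    at that moment. If migrate_after is given, the vCPU moves to
    target_pcpu after that many operations have been issued.
    """
    pcpu = start_pcpu
    for i, line in enumerate(product(range(SETS), range(WAYS))):
        if migrate_after is not None and i == migrate_after:
            pcpu = target_pcpu  # the hypervisor migrated the vCPU mid-loop
        caches[pcpu][line] = False


def dirty_lines(cache):
    """Count the lines still marked dirty in one cache."""
    return sum(cache.values())


if __name__ == "__main__":
    # No migration: the cache of pCPU 0 ends up fully clean.
    caches = [make_cache(), make_cache()]
    clean_all_by_set_way(caches, start_pcpu=0)
    print(dirty_lines(caches[0]), dirty_lines(caches[1]))

    # Migration after 3 of the 8 operations: pCPU 0 is only partly cleaned.
    caches = [make_cache(), make_cache()]
    clean_all_by_set_way(caches, start_pcpu=0, migrate_after=3, target_pcpu=1)
    print(dirty_lines(caches[0]), dirty_lines(caches[1]))
```

In the second run the guest believes it has cleaned the whole cache, yet pCPU 0 still holds dirty lines. That is the situation trapping set/way operations is meant to avoid.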
