thr3ads.net - llvm dev - [LLVMdev] Unaligned vector memory access for ARM/NEON. [Sep 2012]

Peter Couperus

2012-Sep-06 15:13 UTC

[LLVMdev] Unaligned vector memory access for ARM/NEON.

Hello,

Thanks again.  We did try overestimating the alignment, and saw the vldr 
you reference here.
It looks like a recent change (r161962?) did enable vld1 generation for 
this case (great!) on darwin, but not linux.
I'm not sure whether lowering load <4 x i16>* align 2 to vld1.16 was an 
intentional effect of this change or not.
If so, my question is what is the preferable way to inform the Subtarget 
that it is allowed to use unaligned vector loads/stores when NEON is 
available,
but can't use unaligned accesses generally speaking?
A new field in ARMSubtarget?
Should the -arm-strict-align flag force expansion even on unaligned 
vector loads/stores?
We got this working by adding a field to ARMSubtarget and changing logic 
in ARMTargetLowering::allowsUnalignedMemoryAccesses, but
I am admittedly not entirely sure of the downstream consequences of 
this, as we don't allow unaligned access generally.

Pete


On 09/05/2012 04:58 PM, Jim Grosbach wrote:
> Hmmm. Well, it's entirely possible that it's LLVM that's confused about
> the alignment requirements here. :)
>
> I think I see, in general, where. I twiddled the IR to give it higher
> alignment (16 bytes) and get:
> extend:                                 @ @extend
> @ BB#0:
> 	vldr	d16, [r0]
> 	vmovl.s16	q8, d16
> 	vstmia	r1, {d16, d17}
> 	vldr	d16, [r0, #8]
> 	add	r0, r1, #16
> 	vmovl.s16	q8, d16
> 	vstmia	r0, {d16, d17}
> 	bx	lr
>
> Note that we're using a plain vldr instruction here to load the d
> register, not a vld1 instruction. Similarly for the stores. According to the
> ARM ARM (DDI 0406C), you're correct about the element size alignment
> requirement for VLD1, but our isel isn't attempting to use that instruction,
> but rather VLDR, which has word alignment required, so it falls over.
>
> Given that, it seems that the answer to your original question is that to
> improve codegen for this case, the proper place to look is in instruction
> selection for loads and stores to the VFP/NEON registers. That code can be
> made smarter to better use the NEON instructions. I know Jakob has done some
> work related to better utilization of those for other constructs.
>
> -Jim
>
> On Sep 5, 2012, at 4:25 PM, Peter Couperus <peter.couperus at st.com> wrote:
>
>> Hello Jim,
>>
>> Thank you for the response.  I may be confused about the alignment rules
>> here.
>> I had been looking at the ARM RVCT Assembler Guide, which seems to indicate
>> vld1.16 operates on 16-bit aligned data, unless I am misinterpreting their
>> table (Table 5-11 in ARM DUI 0204H, pg 5-70,5-71).
>> Prior to the table, it does mention the accesses need to be "element"
>> aligned, where I took element in this case to mean i16.
>>
>> Anyhow, to make this a little more concrete:
>>
>> void extend(short* a, int* b) {
>>   for(int i = 0; i<  8; i++)
>>     b[i] = (int)a[i];
>> }
>>
>> When I compile this program with clang -O3 -ccc-host-triple
>> armv7-none-linux-gnueabi -mfpu=neon -mllvm -vectorize, the intermediate
>> LLVM assembly looks OK (and it has an align 2 vector load), but the
>> generated ARM assembly has the scalar loads.
>> When I compile with (4.6) gcc -std=c99 -ftree-vectorize -marm -mfpu=neon
>> -O3, it uses vld1.16 and vst1.32 regardless of the parameter alignment.
>> This is on armv7a.
>>
>> The gcc version (and the clang version with our modified backend) runs
>> fine, even on 2-byte aligned data.  Is this not a guarantee across
>> armv7/armv7a generally?
>>
>> Pete
>>
>>
>>
>>
>> On 09/05/2012 03:15 PM, Jim Grosbach wrote:
>>> VLD1 expects a 64-bit aligned address unless the target explicitly says
>>> that unaligned loads are OK.
>>>
>>> For your situation, either the subtarget should set AllowsUnalignedMem
>>> to true (if that's accurate), or the load address should be made 64-bit
>>> aligned.
>>>
>>> -Jim
>>>
>>> On Sep 5, 2012, at 2:42 PM, Peter Couperus <peter.couperus at st.com> wrote:
>>>
>>>> Hello all,
>>>>
>>>> I am a first time writer here, but am a happy LLVM tinkerer.  It is a
>>>> pleasure to use :).
>>>> We have come across some sub-optimal behavior when LLVM lowers loads for
>>>> vectors with small integers, i.e. load <4 x i16>* %a, align 2, using a
>>>> sequence of scalar loads rather than a single vld1 on armv7 linux with
>>>> NEON.
>>>> Looking at the code in svn, it appears the ARM backend is capable of
>>>> lowering these loads as desired, and will if we use an appropriate
>>>> darwin triple.
>>>> It appears this was actually enabled relatively recently.
>>>> Seemingly, the case where the Subtarget has NEON available should be
>>>> handled the same on Darwin and Linux.
>>>> Is this true, or am I missing something?
>>>> Do the regulars have an opinion on the best way to handle this?
>>>> Thanks!
>>>>
>>>> Pete
>>>>
>>>> _______________________________________________
>>>> LLVM Developers mailing list
>>>> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
>>>> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>> <extend.c>
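
For reference, the vector form that gcc is described as emitting above can be written by hand with NEON intrinsics. This is only a minimal sketch: the name extend8 is made up for illustration, it assumes <arm_neon.h> is available when the compiler defines __ARM_NEON, and whether the intrinsics actually become vld1.16 / vmovl.s16 / vst1.32 is up to the compiler. A scalar fallback is included so the sketch also builds on hosts without NEON.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Widen eight 16-bit values to 32 bits, like extend() in the thread.
   The name extend8 is hypothetical. */
void extend8(const int16_t *a, int32_t *b) {
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    for (size_t i = 0; i < 8; i += 4) {
        int16x4_t narrow = vld1_s16(a + i);  /* load four i16 elements */
        int32x4_t wide = vmovl_s16(narrow);  /* sign-extend to four i32 */
        vst1q_s32(b + i, wide);              /* store four i32 elements */
    }
#else
    /* Portable fallback for hosts without NEON. */
    for (size_t i = 0; i < 8; i++)
        b[i] = (int32_t)a[i];
#endif
}
```

Calling extend8(buf + 1, out) on an int16_t array passes a source pointer that is only guaranteed to be 2-byte aligned, which is the case the thread is about.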

David Peixotto

2012-Sep-06 21:48 UTC

[LLVMdev] Unaligned vector memory access for ARM/NEON.

Hi Pete,

We ran into the same issue with generating vector loads/stores for vectors
with less than word alignment. It seems we took a similar approach to
solving the problem by modifying the logic in allowsUnalignedMemoryAccesses.

As you and Jim mentioned, it looks like the vld1/vst1 instructions should
support element aligned access for any armv7 implementation (I'm looking at
Table A3-1 ARM Architecture Reference Manual - ARM DDI 0406C).

Right now I do not think we have the correct code setup in ARMSubtarget to
accurately represent this table. I would propose that we keep the existing
field for unaligned access and add a new field for element-aligned access.

The AllowsUnalignedMem field remains as is and it could be used to represent
the SCTLR.A column in Table A3-1. The AllowsElementAlignedNEON field would
be used to allow targets to generate vld1/vst1 instructions for
element-aligned accesses. By default it would be set to true for armv7
targets with NEON.

The -arm-strict-align would set both of the fields to false. This would
retain the behavior that seems to be desired from the
test/CodeGen/ARM/unaligned_load_store.ll test case.

A bit of a grey area is if we have an unaligned f64 store and
AllowsElementAlignedNEON is true. We can actually generate a vst1.8 to
support this store directly instead of using the target-independent method
and I think it would be good to do so.

I have some code to do this that I will likely be able to upstream.

-Dave

-- Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted
by The Linux Foundation
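
The two-field proposal above can be sketched as a small predicate. This is an illustration only, not the code Dave refers to and not LLVM's actual ARMSubtarget: the struct, make_flags and vector_access_ok are hypothetical names, and only the two field names and the -arm-strict-align behavior come from the message.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the two ARMSubtarget fields in the proposal. */
struct subtarget_flags {
    bool allows_unaligned_mem;        /* AllowsUnalignedMem (SCTLR.A column) */
    bool allows_element_aligned_neon; /* proposed AllowsElementAlignedNEON */
};

/* Defaults as proposed: element-aligned NEON access is on for ARMv7 targets
   with NEON, and -arm-strict-align clears both fields. */
struct subtarget_flags make_flags(bool is_armv7, bool has_neon,
                                  bool allows_unaligned_mem,
                                  bool strict_align) {
    struct subtarget_flags f;
    f.allows_unaligned_mem = allows_unaligned_mem && !strict_align;
    f.allows_element_aligned_neon = is_armv7 && has_neon && !strict_align;
    return f;
}

/* May a vector access with this element size and known alignment (both in
   bytes, both powers of two) be emitted as a single vld1/vst1? */
bool vector_access_ok(struct subtarget_flags f, unsigned elem_bytes,
                      unsigned align_bytes) {
    assert(elem_bytes > 0 && align_bytes > 0);
    if (f.allows_unaligned_mem)
        return true;
    return f.allows_element_aligned_neon && align_bytes >= elem_bytes;
}
```

Under these assumptions a <4 x i16> load with align 2 is accepted on an ARMv7 Linux target with NEON even when general unaligned access is off, while the same load is rejected (and so expanded) under -arm-strict-align.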


Bob Wilson

2012-Sep-06 22:38 UTC

[LLVMdev] Unaligned vector memory access for ARM/NEON.

On Sep 6, 2012, at 2:48 PM, David Peixotto <dpeixott at codeaurora.org> wrote:
> Hi Pete,
> 
> We ran into the same issue with generating vector loads/stores for vectors
> with less than word alignment. It seems we took a similar approach to
> solving the problem by modifying the logic in allowsUnalignedMemoryAccesses.
> 
> As you and Jim mentioned, it looks like the vld1/vst1 instructions should
> support element aligned access for any armv7 implementation (I'm looking at
> Table A3-1 ARM Architecture Reference Manual - ARM DDI 0406C).
> 
> Right now I do not think we have the correct code setup in ARMSubtarget to
> accurately represent this table. I would propose that we keep the existing
> field for unaligned access and add a new field for element-aligned access.
> 
> The AllowsUnalignedMem field remains as is and it could be used to represent
> the SCTLR.A column in Table A3-1. The AllowsElementAlignedNEON field would
> be used to allow targets to generate vld1/vst1 instructions for
> element-aligned accesses. By default it would be set to true for armv7
> targets with NEON.

That doesn't make sense to me.  Element-aligned Neon load/stores are always
valid.  The AllowsUnalignedMem setting is supposed to model the SCTLR.A setting,
which also applies to Neon load/stores with less-than-element alignment.  Why
would we need a new field?

> 
> The -arm-strict-align would set both of the fields to false. This would
> retain the behavior that seems to be desired from the
> test/CodeGen/ARM/unaligned_load_store.ll test case.
> 
> A bit of a grey area is if we have an unaligned f64 store and
> AllowsElementAlignedNEON is true. We can actually generate a vst1.8 to
> support this store directly instead of using the target-independent method
> and I think it would be good to do so.

Don't we already do this as of svn r161962?

> 
> I have some code to do this that I will likely be able to upstream.
> 
> -Dave
> 
> -- Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
> hosted by The Linux Foundation
> 
> 

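Bob's objection, combined with Jim's earlier note that VLDR needs word alignment, suggests a single flag is enough to choose a lowering. The following is a rough sketch under that reading, not LLVM's instruction selection: the enum and pick_d_reg_load are hypothetical names, and preferring vldr whenever the address is word aligned is an arbitrary choice made for the illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical lowering choices for a 64-bit (d register) vector load. */
enum vec_load_lowering { LOWER_VLDR, LOWER_VLD1, LOWER_EXPAND };

/* elem_bytes and align_bytes are powers of two, in bytes.
   allows_unaligned_mem models the single AllowsUnalignedMem setting. */
enum vec_load_lowering pick_d_reg_load(unsigned elem_bytes,
                                       unsigned align_bytes,
                                       bool allows_unaligned_mem) {
    assert(elem_bytes > 0 && align_bytes > 0);
    if (align_bytes >= 4)
        return LOWER_VLDR;   /* word aligned: plain vldr is usable */
    if (align_bytes >= elem_bytes)
        return LOWER_VLD1;   /* element aligned: treated as always valid */
    if (allows_unaligned_mem)
        return LOWER_VLD1;   /* below element alignment, but permitted */
    return LOWER_EXPAND;     /* fall back to scalar expansion */
}
```

With these rules, load <4 x i16> with align 2 selects vld1 regardless of the flag, and only accesses below element alignment depend on AllowsUnalignedMem.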