Peter Couperus
2012-Sep-05 23:25 UTC
[LLVMdev] Unaligned vector memory access for ARM/NEON.
Hello Jim,

Thank you for the response. I may be confused about the alignment rules here. I had been looking at the ARM RVCT Assembler Guide, which seems to indicate that vld1.16 operates on 16-bit aligned data, unless I am misinterpreting their table (Table 5-11 in ARM DUI 0204H, pg 5-70, 5-71). Prior to the table, it does mention that the accesses need to be "element" aligned, where I took element in this case to mean i16.

Anyhow, to make this a little more concrete:

    void extend(short* a, int* b) {
      for(int i = 0; i < 8; i++)
        b[i] = (int)a[i];
    }

When I compile this program with

    clang -O3 -ccc-host-triple armv7-none-linux-gnueabi -mfpu=neon -mllvm -vectorize

the intermediate LLVM assembly looks OK (and it has an align 2 vector load), but the generated ARM assembly has the scalar loads. When I compile with gcc 4.6,

    gcc -std=c99 -ftree-vectorize -marm -mfpu=neon -O3

it uses vld1.16 and vst1.32 regardless of the parameter alignment. This is on armv7a.

The gcc version (and the clang version with our modified backend) runs fine, even on 2-byte aligned data. Is this not a guarantee across armv7/armv7a generally?

Pete

On 09/05/2012 03:15 PM, Jim Grosbach wrote:
> VLD1 expects a 64-bit aligned address unless the target explicitly says that unaligned loads are OK.
>
> For your situation, either the subtarget should set AllowsUnalignedMem to true (if that's accurate), or the load address should be made 64-bit aligned.
>
> -Jim
>
> On Sep 5, 2012, at 2:42 PM, Peter Couperus <peter.couperus at st.com> wrote:
>
>> Hello all,
>>
>> I am a first time writer here, but am a happy LLVM tinkerer. It is a pleasure to use :).
>> We have come across some sub-optimal behavior when LLVM lowers loads for vectors with small integers, i.e. load <4 x i16>* %a, align 2, using a sequence of scalar loads rather than a single vld1 on armv7 linux with NEON.
>> Looking at the code in svn, it appears the ARM backend is capable of lowering these loads as desired, and will if we use an appropriate darwin triple. It appears this was actually enabled relatively recently.
>> Seemingly, the case where the Subtarget has NEON available should be handled the same on Darwin and Linux. Is this true, or am I missing something?
>> Do the regulars have an opinion on the best way to handle this?
>> Thanks!
>>
>> Pete

Attachment: extend.c (text/x-csrc, 92 bytes) <http://lists.llvm.org/pipermail/llvm-dev/attachments/20120905/3e81319f/attachment.c>
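The extend() test case in the message above can be exercised on input that is 2-byte aligned but not 8-byte aligned, which is the situation the question is about. The following is only a sketch: the helper name extend_ok_on_2byte_aligned_input and the buffer layout are assumptions made for illustration, and it assumes a C11 compiler with a 2-byte short. It checks the results of the conversion; it says nothing about which instructions a given compiler emits.

```c
#include <assert.h>
#include <stdint.h>

/* The function from the message above, unchanged in behavior. */
void extend(short *a, int *b) {
    for (int i = 0; i < 8; i++)
        b[i] = (int)a[i];
}

/* Hypothetical helper (not from the thread): run extend() on input that is
 * 2-byte aligned but deliberately not 8-byte aligned, and report whether
 * every widened value is correct. Returns 1 on success, 0 on mismatch. */
int extend_ok_on_2byte_aligned_input(void) {
    /* _Alignas(8) pins the start of the buffer to an 8-byte boundary, so
     * skipping one short yields an address 2 bytes past that boundary. */
    static _Alignas(8) short buf[9] = {0, -4, -3, -2, -1, 0, 1, 2, 3};
    int out[8];
    short *src = &buf[1];

    /* Assumes sizeof(short) == 2, as on the ARM targets discussed here. */
    assert(((uintptr_t)src % 8) == 2);

    extend(src, out);

    /* buf[1..8] holds -4..3, so out[i] must equal i - 4 (sign-extended). */
    for (int i = 0; i < 8; i++)
        if (out[i] != i - 4)
            return 0;
    return 1;
}
```

Calling extend_ok_on_2byte_aligned_input() from a small main() and building it with each of the two command lines quoted above would be one way to compare the behavior of the generated code on such input.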
Hmmm. Well, it's entirely possible that it's LLVM that's confused about the alignment requirements here. :)

I think I see, in general, where. I twiddled the IR to give it higher alignment (16 bytes) and get:

    extend:                                 @ @extend
    @ BB#0:
            vldr      d16, [r0]
            vmovl.s16 q8, d16
            vstmia    r1, {d16, d17}
            vldr      d16, [r0, #8]
            add       r0, r1, #16
            vmovl.s16 q8, d16
            vstmia    r0, {d16, d17}
            bx        lr

Note that we're using a plain vldr instruction here to load the d register, not a vld1 instruction. Similarly for the stores. According to the ARM ARM (DDI 0406C), you're correct about the element size alignment requirement for VLD1, but our isel isn't attempting to use that instruction, but rather VLDR, which requires word alignment, so it falls over.

Given that, it seems that the answer to your original question is that to improve codegen for this case, the proper place to look is in instruction selection for loads and stores to the VFP/NEON registers. That code can be made smarter to better use the NEON instructions. I know Jakob has done some work related to better utilization of those for other constructs.

-Jim
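The difference between the two alignment requirements described above (word alignment for VLDR, element alignment for VLD1.16) can be written down as a small check. This is a sketch only: the enum and function names are invented for illustration and are not LLVM identifiers, and it models the strict case in which the target does not permit unaligned accesses.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative labels for the two load forms discussed in the message.
 * These names are assumptions, not identifiers from LLVM. */
enum load_kind {
    LOAD_VLDR,    /* vldr dN, [rM]: word (4-byte) alignment, per the message */
    LOAD_VLD1_16  /* vld1.16 {dN}, [rM]: element (2-byte) alignment */
};

/* Minimum address alignment, in bytes, assumed for each load form. */
static unsigned required_alignment(enum load_kind kind) {
    switch (kind) {
    case LOAD_VLDR:    return 4;
    case LOAD_VLD1_16: return 2;
    }
    return 0; /* not reached for valid enum values */
}

/* True if a load of the given form at addr meets its alignment requirement. */
bool load_is_legal(enum load_kind kind, uintptr_t addr) {
    unsigned align = required_alignment(kind);
    assert(align != 0);
    return addr % align == 0;
}
```

Under this model an address such as 0x1002 is acceptable for the vld1.16 form but not for vldr, which matches the observation that selecting VLDR forces the fallback to scalar loads for an align 2 vector load.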
Jakob Stoklund Olesen
2012-Sep-06 00:03 UTC
[LLVMdev] Unaligned vector memory access for ARM/NEON.
On Sep 5, 2012, at 4:58 PM, Jim Grosbach <grosbach at apple.com> wrote:
> Given that, it seems that the answer to your original question is that to improve codegen for this case, the proper place to look is in instruction selection for loads and stores to the VFP/NEON registers. That code can be made smarter to better use the NEON instructions. I know Jakob has done some work related to better utilization of those for other constructs.

I don't think isel ever uses vld1.16, but I don't see anything wrong with it for 2-byte aligned vectors.

There is an issue with big-endian semantics, but I don't think we're seriously trying to support big-endian ARM?

/jakob
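The message above does not say what the big-endian issue is. One plausible reading, stated here as an assumption, is that on a big-endian target a whole-doubleword load (vldr style) and an element-wise load (vld1.16 style) place the same memory into the 16-bit lanes in opposite order. The sketch below models both loads on a byte array in portable C; all function names are hypothetical and nothing here is taken from LLVM.

```c
#include <assert.h>
#include <stdint.h>

/* Read one 16-bit big-endian element from memory. */
static uint16_t read_be16(const uint8_t *p) {
    return (uint16_t)((p[0] << 8) | p[1]);
}

/* Model of an element-wise load: memory element i goes to lane i. */
void load_elementwise(const uint8_t mem[8], uint16_t lanes[4]) {
    for (int i = 0; i < 4; i++)
        lanes[i] = read_be16(mem + 2 * i);
}

/* Model of a big-endian doubleword load: the 8 bytes form one 64-bit
 * value, and lane i is bits [16*i + 15 : 16*i] of that value. */
void load_doubleword(const uint8_t mem[8], uint16_t lanes[4]) {
    uint64_t d = 0;
    for (int i = 0; i < 8; i++)
        d = (d << 8) | mem[i];
    for (int i = 0; i < 4; i++)
        lanes[i] = (uint16_t)(d >> (16 * i));
}

/* Returns 1 if the two models produce exactly reversed lane orders for
 * the big-endian elements 1, 2, 3, 4, and 0 otherwise. */
int lane_orders_are_reversed(void) {
    const uint8_t mem[8] = {0x00, 0x01, 0x00, 0x02, 0x00, 0x03, 0x00, 0x04};
    uint16_t by_elem[4], by_dword[4];

    load_elementwise(mem, by_elem);
    load_doubleword(mem, by_dword);

    for (int i = 0; i < 4; i++)
        if (by_elem[i] != by_dword[3 - i])
            return 0;
    return 1;
}
```

If this reading is right, swapping one load form for the other would be transparent on little-endian targets but would reorder vector lanes on big-endian ones, which would explain why the concern is limited to big-endian ARM.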
Peter Couperus
2012-Sep-06 15:13 UTC
[LLVMdev] Unaligned vector memory access for ARM/NEON.
Hello,

Thanks again. We did try overestimating the alignment, and saw the vldr you reference here. It looks like a recent change (r161962?) did enable vld1 generation for this case (great!) on darwin, but not linux. I'm not sure whether lowering load <4 x i16>* align 2 to vld1.16 was an intentional effect of this change or not.

If so, my question is: what is the preferable way to inform the Subtarget that it is allowed to use unaligned vector loads/stores when NEON is available, but can't use unaligned accesses generally speaking? A new field in ARMSubtarget? Should the -arm-strict-align flag force expansion even on unaligned vector loads/stores?

We got this working by adding a field to ARMSubtarget and changing logic in ARMTargetLowering::allowsUnalignedMemoryAccesses, but I am admittedly not entirely sure of the downstream consequences of this, as we don't allow unaligned access generally.

Pete
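The kind of policy described in the message above (unaligned NEON vector accesses permitted, unaligned accesses in general not permitted) might look roughly like the following. This is a C model written for illustration, not the actual patch and not LLVM code: the struct, its fields, and the function name are all hypothetical, and the element-alignment rule is an assumption based on the VLD1 requirement discussed earlier in the thread.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the relevant subtarget state. The field names are
 * illustrative and are not taken from ARMSubtarget. */
struct subtarget_model {
    bool allows_unaligned_mem;       /* unaligned access allowed in general */
    bool has_neon;                   /* NEON is available */
    bool allows_unaligned_neon_vec;  /* hypothetical new field */
};

/* Decide whether an access with known alignment align_bytes may be emitted
 * without being expanded into smaller accesses. For vectors, elem_bytes is
 * the element size; for scalars it is the size of the value. */
bool allows_unaligned_access(const struct subtarget_model *st,
                             bool is_vector,
                             unsigned elem_bytes,
                             unsigned align_bytes) {
    assert(elem_bytes != 0);

    /* The general switch covers every access kind. */
    if (st->allows_unaligned_mem)
        return true;

    /* Assumed vector-only exception: allowed when the address is at least
     * element aligned, so an element-sized NEON load or store would fit. */
    if (is_vector && st->has_neon && st->allows_unaligned_neon_vec)
        return align_bytes >= elem_bytes;

    return false;
}
```

With the hypothetical field set and the general switch clear, a <4 x i16> load with align 2 is accepted while an unaligned scalar access is still rejected, which is the split the message asks about. How a flag such as -arm-strict-align should interact with the extra field is left open here, as it is in the message.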