thr3ads.net - llvm dev - [LLVMdev] speed up memcpy intrinsic using ARM Neon registers [Nov 2009]


Neel Nagar

2009-Nov-10 00:34 UTC

[LLVMdev] speed up memcpy intrinsic using ARM Neon registers

I tried to speed up Dhrystone on ARM Cortex-A8 by optimizing the
memcpy intrinsic. I used the Neon load multiple instruction to move up
to 48 bytes at a time. Over 15 scalar instructions collapsed down
into these 2 Neon instructions.

       fldmiad r3, {d0, d1, d2, d3, d4, d5}  @ SrcLine dhrystone.c 359
       fstmiad r1, {d0, d1, d2, d3, d4, d5}

It seems like this should be faster. But I did not see any appreciable speedup.

I think the patch is correct. The code runs fine.

I have attached my patch for "lib/Target/ARM/ARMISelLowering.cpp" to
this email.

Does this look like the right modification?

Does anyone have any insights into why this is not way faster than
using scalar registers?

I am using a BeagleBoard.

Thanks,
Neel Nagar
-------------- next part --------------
A non-text attachment was scrubbed...
Name: memcpy_neon_091109.patch
Type: application/octet-stream
Size: 2040 bytes
Desc: not available
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20091110/61765619/attachment.obj>

David Conrad

2009-Nov-10 01:59 UTC


On Nov 9, 2009, at 7:34 PM, Neel Nagar wrote:
> I tried to speed up Dhrystone on ARM Cortex-A8 by optimizing the
> memcpy intrinsic. I used the Neon load multiple instruction to move up
> to 48 bytes at a time. Over 15 scalar instructions collapsed down
> into these 2 Neon instructions.
>
>       fldmiad r3, {d0, d1, d2, d3, d4, d5}  @ SrcLine dhrystone.c 359
>       fstmiad r1, {d0, d1, d2, d3, d4, d5}
>
> It seems like this should be faster. But I did not see any  
> appreciable speedup.
>
> I think the patch is correct. The code runs fine.
>
> I have attached my patch for "lib/Target/ARM/ARMISelLowering.cpp" to
> this email.
>
> Does this look like the right modification?
>
> Does anyone have any insights into why this is not way faster than
> using scalar registers?
On the A8, an ARM store after NEON stores to the same 16-byte block
incurs a ~20 cycle penalty since the NEON unit executes behind ARM.
It's worse if the NEON store was split across a 16-byte boundary: then
there could be a 50 cycle stall.

See http://hardwarebug.org/2008/12/31/arm-neon-memory-hazards/ for  
some more details and benchmarks.
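
A small C sketch of the "split across a 16-byte boundary" condition, for readers who want to check an address by hand. It assumes a store can be modeled as one contiguous byte range, which is a simplification of how the load/store unit actually breaks up a store multiple. The helper names are made up for illustration and do not come from the patch or from LLVM.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not from the patch: index of the 16-byte
   aligned block that contains byte address `addr`. */
static uintptr_t block16(uintptr_t addr) {
    return addr >> 4;
}

/* True if a store of `len` bytes starting at `addr` touches more than
   one 16-byte block, i.e. it is split across a 16-byte boundary. */
static bool store_splits_16_byte_boundary(uintptr_t addr, size_t len) {
    if (len == 0)
        return false;
    return block16(addr) != block16(addr + len - 1);
}
```

For example, a 16-byte store at an address ending in 0x8 is split, while the same store at an address ending in 0x0 is not.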

Evan Cheng

2009-Nov-10 07:13 UTC


On Nov 9, 2009, at 5:59 PM, David Conrad wrote:
> On Nov 9, 2009, at 7:34 PM, Neel Nagar wrote:
>
>> I tried to speed up Dhrystone on ARM Cortex-A8 by optimizing the
>> memcpy intrinsic. I used the Neon load multiple instruction to move
>> up to 48 bytes at a time. Over 15 scalar instructions collapsed down
>> into these 2 Neon instructions.
Nice. Thanks for working on this. It has long been on my todo list.
>>
>>      fldmiad r3, {d0, d1, d2, d3, d4, d5}  @ SrcLine dhrystone.c 359
>>      fstmiad r1, {d0, d1, d2, d3, d4, d5}
>>
>> It seems like this should be faster. But I did not see any
>> appreciable speedup.
Even if it's not faster, it's still a code size win which is also  
important. Are we generating the right aligned NEON load / stores?
>>
>> I think the patch is correct. The code runs fine.
>>
>> I have attached my patch for "lib/Target/ARM/ARMISelLowering.cpp" to
>> this email.
>>
>> Does this look like the right modification?
>>
>> Does anyone have any insights into why this is not way faster than
>> using scalar registers?
>
> On the A8, an ARM store after NEON stores to the same 16-byte block
> incurs a ~20 cycle penalty since the NEON unit executes behind ARM.
> It's worse if the NEON store was split across a 16-byte boundary, then
> there could be a 50 cycle stall.
>
> See http://hardwarebug.org/2008/12/31/arm-neon-memory-hazards/ for
> some more details and benchmarks.
If that's the case, then for A8 we should only do this when there  
won't be trailing scalar load / stores.

Evan
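
The rule suggested above (on A8, use the D-register copy only when no scalar tail is needed) can be sketched in C roughly as follows. This is an illustration under stated assumptions, not the logic of the attached patch: the struct, its field names, and the function name are hypothetical, and the only facts taken from the thread are that each D register moves 8 bytes and that a trailing scalar store is what triggers the A8 penalty.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical summary of how an inline copy of `size` bytes splits up. */
struct copy_plan {
    size_t dreg_count;   /* 8-byte D registers moved by load/store multiple */
    size_t tail_bytes;   /* bytes left over for scalar ARM loads/stores */
    bool   use_neon_a8;  /* the rule proposed above for Cortex-A8 */
};

static struct copy_plan plan_inline_copy(size_t size) {
    struct copy_plan p;
    p.dreg_count = size / 8;
    p.tail_bytes = size % 8;
    /* Only use the D-register copy when no scalar tail would follow it. */
    p.use_neon_a8 = p.dreg_count > 0 && p.tail_bytes == 0;
    return p;
}
```

A 48-byte copy gives six D registers and no tail, so it qualifies; a 30-byte copy leaves a 6-byte tail and would stay on scalar instructions under this rule.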

Bob Wilson

2009-Nov-11 17:20 UTC


On Nov 11, 2009, at 3:27 AM, Rodolph Perfetta wrote:
> If you know about the alignment, maybe use structured load/store
> (vst1.64/vld1.64 {dn-dm}). You may also want to work on whole cache
> lines (64 bytes on A8). You can find more in this discussion:
> http://groups.google.com/group/beagleboard/browse_thread/thread/12c7bd415fbc0993/e382202f1a92b0f8?lnk=gst&q=memcpy&pli=1
>
>> Even if it's not faster, it's still a code size win which is also
>> important.
>
> Yes but NEON will drive up your power consumption, so if you are not
> faster you will drain your battery faster (assuming you care of course).
>
> In general we wouldn't recommend writing memcpy using NEON unless you
> can detect the exact core you will be running on: on A9 NEON will not
> give you any speed up, you'll just end up using more power. NEON is a
> SIMD engine.
>
> If one wanted to write memcpy on A9 we would recommend something like:
> * do not use NEON
> * use PLD (3-6 cache lines ahead, to be tuned)
> * ldm/stm whole cache lines (32 bytes on A9)
> * align destination
Thanks, Rodolph.  That is very helpful.

Can you comment on David Conrad's message in this thread regarding a  
~20 cycle penalty for an ARM store following a NEON store to the same  
16-byte block?  If the memcpy size is not a multiple of 8, we need  
some ARM load/store instructions to copy the tail end of it.  The  
context here is LLVM generating inline code for small copies, so if  
there is a penalty like that, it is probably not worthwhile to use  
NEON unless the alignment shows that the tail will be in a separate
16-byte block.  (And what's up with the 16-byte divisions?  I thought the
cache lines are 64 bytes....)
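
The alignment condition described above can be sketched in C as follows. The sketch assumes the copy moves the first `size & ~7` bytes with D registers and the remainder with ARM stores, as the message describes. The function name is hypothetical. Only the low four bits of `dst` affect the result, so a compiler that knows just the destination alignment could pass the known offset within a 16-byte block instead of a real address.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical check: does the first scalar tail store land in a
   16-byte block that the preceding D-register stores also wrote? */
static bool tail_shares_block_with_neon_store(uintptr_t dst, size_t size) {
    size_t neon_bytes = size & ~(size_t)7;  /* moved in 8-byte units */
    size_t tail_bytes = size - neon_bytes;  /* moved by ARM stores */

    if (neon_bytes == 0 || tail_bytes == 0)
        return false;  /* no mix of NEON and ARM stores at all */

    /* The tail starts right after the last NEON byte, so the two share
       a block unless the tail start is itself 16-byte aligned. */
    return ((dst + neon_bytes) & 15) != 0;
}
```

With a 16-byte aligned destination, a 50-byte copy puts its 2-byte tail at offset 48, the start of a fresh block, while a 44-byte copy puts its 4-byte tail at offset 40, inside a block the NEON stores just wrote.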

Rodolph Perfetta

2009-Nov-13 11:27 UTC


> Can you comment on David Conrad's message in this thread regarding
> a ~20 cycle penalty for an ARM store following a NEON store to the
> same 16-byte block?
It is correct for A8: a NEON store followed by an ARM store in the same
16-byte block will incur a penalty (20 cycles sounds about right) as the CPU
ensures there are no data hazards.

A9 does not have this penalty.
> If the memcpy size is not a multiple of 8, we need some ARM load/store
> instructions to copy the tail end of it. The context here is LLVM
> generating inline code for small copies, so if there is a penalty
> like that, it is probably not worthwhile to use NEON unless the
> alignment shows that the tail will be in a separate 16-byte block.
I agree it is probably not worthwhile (though I assume using NEON relieves
pressure on your register allocator). It is usually not recommended to mix
ARM and NEON memory operations.

Also the NEON engines tend to have a deeper pipeline than the ARM integer
cores, so the delay to store the first bytes is likely to be higher using
NEON (although it should be faster afterwards). So for very small memcpy (20
bytes or less) ARM will be faster. For best performance remember to use PLD.

For A9 you have more to take into account: A9 is a superscalar, dual-issue,
out-of-order and speculative CPU, but this only applies to the ARM integer
core; NEON and VFP are single-issue and in-order. However, an ARM instruction
can be issued with a NEON or VFP instruction. So if you have some VFP/NEON code
before the memcpy, by the time the CPU reaches the inline NEON memcpy it
might not have finished the previous NEON/VFP instruction and you'll have to
wait ...
> (And what's up with the 16-byte divisions? I thought the cache
> lines are 64 bytes....)
Cache line is 64 bytes on A8 and 32 bytes on A9. 16 bytes is the size of an
internal buffer used by the load/store unit.

Cheers,
Rodolph.
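
For illustration, the A9 recommendations quoted earlier in the thread (no NEON, PLD a few cache lines ahead, whole 32-byte cache lines, aligned destination) might look roughly like this in portable C. This is a sketch under assumptions, not a tuned routine: the function name and the two constants are hypothetical, `__builtin_prefetch` is the GCC/Clang builtin that is expected to become PLD on ARM, and whether the fixed-size `memcpy` calls actually turn into ldm/stm depends on the compiler.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CACHE_LINE 32           /* A9 line size, per the thread */
#define PREFETCH_LINES_AHEAD 4  /* thread suggests 3-6, to be tuned */

/* Hypothetical sketch: copy n bytes one destination-aligned cache
   line at a time, prefetching the source a few lines ahead. */
static void *copy_by_cache_line(void *dst, const void *src, size_t n) {
    unsigned char *d = dst;
    const unsigned char *s = src;

    /* Copy leading bytes until the destination is cache-line aligned. */
    size_t head = (size_t)(-(uintptr_t)d & (CACHE_LINE - 1));
    if (head > n)
        head = n;
    memcpy(d, s, head);
    d += head; s += head; n -= head;

    /* Main loop: one whole cache line per iteration. The prefetch
       address is computed as an integer so no out-of-range pointer
       is ever formed. */
    while (n >= CACHE_LINE) {
        __builtin_prefetch(
            (const void *)((uintptr_t)s + PREFETCH_LINES_AHEAD * CACHE_LINE),
            0, 3);
        memcpy(d, s, CACHE_LINE);
        d += CACHE_LINE; s += CACHE_LINE; n -= CACHE_LINE;
    }

    /* Whatever is left is shorter than one cache line. */
    memcpy(d, s, n);
    return dst;
}
```

The prefetch distance is the parameter most worth measuring on real hardware; the value here is only a starting point inside the suggested range.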
