Evan Cheng
2009-Nov-10 07:13 UTC
[LLVMdev] speed up memcpy intrinsic using ARM Neon registers
On Nov 9, 2009, at 5:59 PM, David Conrad wrote:

> On Nov 9, 2009, at 7:34 PM, Neel Nagar wrote:
>
>> I tried to speed up Dhrystone on ARM Cortex-A8 by optimizing the
>> memcpy intrinsic. I used the Neon load multiple instruction to move
>> up to 48 bytes at a time. Over 15 scalar instructions collapsed down
>> into these 2 Neon instructions.

Nice. Thanks for working on this. It has long been on my todo list.

>> fldmiad r3, {d0, d1, d2, d3, d4, d5} @ SrcLine dhrystone.c 359
>> fstmiad r1, {d0, d1, d2, d3, d4, d5}
>>
>> It seems like this should be faster, but I did not see any
>> appreciable speedup.

Even if it's not faster, it's still a code size win, which is also important. Are we generating the right aligned NEON loads / stores?

>> I think the patch is correct. The code runs fine.
>>
>> I have attached my patch for "lib/Target/ARM/ARMISelLowering.cpp" to
>> this email.
>>
>> Does this look like the right modification?
>>
>> Does anyone have any insights into why this is not way faster than
>> using scalar registers?
>
> On the A8, an ARM store after NEON stores to the same 16-byte block
> incurs a ~20 cycle penalty since the NEON unit executes behind ARM.
> It's worse if the NEON store was split across a 16-byte boundary;
> then there could be a 50 cycle stall.
>
> See http://hardwarebug.org/2008/12/31/arm-neon-memory-hazards/ for
> some more details and benchmarks.

If that's the case, then for A8 we should only do this when there won't be trailing scalar loads / stores.

Evan

> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu  http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
Chris Lattner
2009-Nov-10 07:25 UTC
[LLVMdev] speed up memcpy intrinsic using ARM Neon registers
On Nov 9, 2009, at 11:13 PM, Evan Cheng wrote:

>> On the A8, an ARM store after NEON stores to the same 16-byte block
>> incurs a ~20 cycle penalty since the NEON unit executes behind ARM.
>> It's worse if the NEON store was split across a 16-byte boundary,
>> then there could be a 50 cycle stall.
>>
>> See http://hardwarebug.org/2008/12/31/arm-neon-memory-hazards/ for
>> some more details and benchmarks.
>
> If that's the case, then for A8 we should only do this when there
> won't be trailing scalar loads / stores.

It should be safe if the start pointer is known to be 16-byte aligned. The trailing stores won't be in the same 16-byte chunk.

-Chris
Evan Cheng
2009-Nov-10 19:27 UTC
[LLVMdev] speed up memcpy intrinsic using ARM Neon registers
On Nov 9, 2009, at 11:25 PM, Chris Lattner wrote:

> On Nov 9, 2009, at 11:13 PM, Evan Cheng wrote:
>
>>> On the A8, an ARM store after NEON stores to the same 16-byte block
>>> incurs a ~20 cycle penalty since the NEON unit executes behind ARM.
>>> It's worse if the NEON store was split across a 16-byte boundary,
>>> then there could be a 50 cycle stall.
>>>
>>> See http://hardwarebug.org/2008/12/31/arm-neon-memory-hazards/ for
>>> some more details and benchmarks.
>>
>> If that's the case, then for A8 we should only do this when there
>> won't be trailing scalar loads / stores.
>
> It should be safe if the start pointer is known 16-byte aligned. The
> trailing stores won't be in the same 16-byte chunk.

According to http://hardwarebug.org/2008/12/31/arm-neon-memory-hazards/, there are secondary effects if the load / store fall within the same 64-byte block.

Evan

> -Chris
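The address arithmetic behind these two points can be sketched in a few lines of C. This is an illustrative helper, not LLVM code: `same_block` is a hypothetical name, and the constants model a 48-byte NEON copy from a 16-byte-aligned base, as in the Dhrystone example above.

```c
#include <stdint.h>
#include <stdbool.h>

/* Return true if the last byte written by the NEON stores and the first
   trailing scalar store fall into the same `block`-byte aligned chunk.
   (Hypothetical helper for illustration only.) */
static bool same_block(uintptr_t neon_last, uintptr_t tail, uintptr_t block)
{
    return (neon_last / block) == (tail / block);
}

/* For a 48-byte NEON copy from a 16-byte-aligned base:
   - the last NEON byte is at base + 47, the first trailing store at base + 48;
   - they are in different 16-byte chunks (Chris's point: no direct hazard),
   - but in the same 64-byte block (Evan's point: secondary effects remain). */
```

With a 16-byte-aligned start the trailing store indeed begins a fresh 16-byte chunk, but the 64-byte check shows why the blog post's secondary effects can still apply.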
Rodolph Perfetta
2009-Nov-11 11:27 UTC
[LLVMdev] speed up memcpy intrinsic using ARM Neon registers
>> I tried to speed up Dhrystone on ARM Cortex-A8 by optimizing the
>> memcpy intrinsic. I used the Neon load multiple instruction to move
>> up to 48 bytes at a time. Over 15 scalar instructions collapsed
>> down into these 2 Neon instructions.
>
> Nice. Thanks for working on this. It has long been on my todo list.
>
>> fldmiad r3, {d0, d1, d2, d3, d4, d5} @ SrcLine dhrystone.c 359
>> fstmiad r1, {d0, d1, d2, d3, d4, d5}
>>
>> It seems like this should be faster, but I did not see any
>> appreciable speedup.

If you know the alignment, maybe use structured loads/stores (vld1.64/vst1.64 {dn-dm}). You may also want to work on whole cache lines (64 bytes on A8). You can find more in this discussion: http://groups.google.com/group/beagleboard/browse_thread/thread/12c7bd415fbc0993/e382202f1a92b0f8?lnk=gst&q=memcpy&pli=1 .

> Even if it's not faster, it's still a code size win which is also
> important.

Yes, but NEON will drive up your power consumption, so if you are not faster you will drain your battery faster (assuming you care, of course). In general we wouldn't recommend writing memcpy using NEON unless you can detect the exact core you will be running on: on A9, NEON will not give you any speed up; you'll just end up using more power. NEON is a SIMD engine.

If one wanted to write memcpy on A9, we would recommend something like:

* do not use NEON
* use PLD (3-6 cache lines ahead, to be tuned)
* ldm/stm whole cache lines (32 bytes on A9)
* align the destination

Cheers,
Rodolph.
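The recommendations above can be sketched in plain C. This is a minimal illustration of the structure, not a tuned implementation: `sketch_memcpy`, `CACHE_LINE`, and `PLD_AHEAD` are names chosen here, and GCC/Clang's `__builtin_prefetch` stands in for the PLD instruction that hand-written assembly would use.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CACHE_LINE 32                 /* A9 cache line */
#define PLD_AHEAD  (4 * CACHE_LINE)   /* "3-6 cache lines ahead, to be tuned" */

/* Sketch of an A9-style memcpy: align the destination first, then move
   whole cache lines with integer registers only (no NEON), prefetching
   ahead of the read pointer. */
void *sketch_memcpy(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    /* Align the destination: byte-copy until d sits on a cache-line boundary. */
    while (n && ((uintptr_t)d % CACHE_LINE)) {
        *d++ = *s++;
        n--;
    }

    /* Main loop: one whole cache line per iteration (an ldm/stm pair in
       real assembly), with a software prefetch standing in for PLD. */
    while (n >= CACHE_LINE) {
        __builtin_prefetch(s + PLD_AHEAD);
        memcpy(d, s, CACHE_LINE);   /* fixed-size copy the compiler can expand inline */
        d += CACHE_LINE;
        s += CACHE_LINE;
        n -= CACHE_LINE;
    }

    /* Trailing bytes. */
    while (n--)
        *d++ = *s++;
    return dst;
}
```

The real win on A9 comes from the assembly-level details (ldm/stm register lists, PLD distance tuning), which this C sketch only gestures at.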