Displaying 4 results from an estimated 4 matches for "hardwarebug".
2009 Nov 10
[LLVMdev] speed up memcpy intrinsic using ARM Neon registers
...egisters?
>
> On the A8, an ARM store after NEON stores to the same 16-byte block
> incurs a ~20 cycle penalty since the NEON unit executes behind ARM.
> It's worse if the NEON store was split across a 16-byte boundary, then
> there could be a 50 cycle stall.
>
> See http://hardwarebug.org/2008/12/31/arm-neon-memory-hazards/ for
> some more details and benchmarks.
If that's the case, then for A8 we should only do this when there
won't be trailing scalar load / stores.
Evan
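
The condition suggested above can be sketched as a simple size check. This is only an illustration, not the actual LLVM lowering logic: the function name `neon_copy_has_no_scalar_tail` is hypothetical, and it assumes the copy is done in whole 8-byte D registers, so any remainder would have to be finished with scalar loads / stores.

```c
#include <stdbool.h>
#include <stddef.h>

/* Size of one NEON/VFP D register in bytes. */
#define D_REG_BYTES 8u

/*
 * Hypothetical helper: returns true when a copy of `size` bytes can be
 * done entirely with D-register loads/stores, leaving no trailing bytes
 * for scalar instructions. A zero-sized copy is rejected because there
 * is nothing to gain from NEON in that case.
 */
static bool neon_copy_has_no_scalar_tail(size_t size) {
    return size > 0 && (size % D_REG_BYTES) == 0;
}
```

For example, a 48-byte copy passes the check, while a 50-byte copy would leave 2 bytes for scalar stores and is rejected.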
2009 Nov 10
[LLVMdev] speed up memcpy intrinsic using ARM Neon registers
...an ARM store after NEON stores to the same 16-byte block
>> incurs a ~20 cycle penalty since the NEON unit executes behind ARM.
>> It's worse if the NEON store was split across a 16-byte boundary, then
>> there could be a 50 cycle stall.
>>
>> See http://hardwarebug.org/2008/12/31/arm-neon-memory-hazards/ for
>> some more details and benchmarks.
>
> If that's the case, then for A8 we should only do this when there
> won't be trailing scalar load / stores.
It should be safe if the start pointer is known 16-byte aligned. The
trailing s...
2009 Nov 10
[LLVMdev] speed up memcpy intrinsic using ARM Neon registers
...er than
> using scalar registers?
On the A8, an ARM store after NEON stores to the same 16-byte block
incurs a ~20 cycle penalty since the NEON unit executes behind ARM.
It's worse if the NEON store was split across a 16-byte boundary, then
there could be a 50 cycle stall.
See http://hardwarebug.org/2008/12/31/arm-neon-memory-hazards/ for
some more details and benchmarks.
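
The two cases described above depend only on address arithmetic, so they can be sketched in C. The helper names below are hypothetical and the code only classifies addresses; it does not measure or reproduce the stall itself.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: true if two byte addresses fall in the same
 * 16-byte block, i.e. an ARM store to `b` after a NEON store to `a`
 * would hit the case described above.
 */
static bool same_16_byte_block(uintptr_t a, uintptr_t b) {
    return (a >> 4) == (b >> 4);
}

/*
 * Hypothetical helper: true if a store of `len` bytes starting at
 * `addr` is split across a 16-byte boundary (the first and last byte
 * land in different 16-byte blocks).
 */
static bool crosses_16_byte_boundary(uintptr_t addr, size_t len) {
    return len > 0 && (addr >> 4) != ((addr + len - 1) >> 4);
}
```

A 16-byte store at a 16-byte aligned address stays within one block, while the same store at an address ending in 0x8 is split across two.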
2009 Nov 10
[LLVMdev] speed up memcpy intrinsic using ARM Neon registers
I tried to speed up Dhrystone on ARM Cortex-A8 by optimizing the
memcpy intrinsic. I used the Neon load multiple instruction to move up
to 48 bytes at a time. Over 15 scalar instructions collapsed down
into these 2 Neon instructions:
fldmiad r3, {d0, d1, d2, d3, d4, d5} @ SrcLine dhrystone.c 359
fstmiad r1, {d0, d1, d2, d3, d4, d5}
It seems like this should be faster. But I did