search for: hardwarebug

Displaying 4 results from an estimated 4 matches for "hardwarebug".

2009 Nov 10
3
[LLVMdev] speed up memcpy intrinsic using ARM Neon registers
...egisters?
>
> On the A8, an ARM store after NEON stores to the same 16-byte block
> incurs a ~20 cycle penalty since the NEON unit executes behind ARM.
> It's worse if the NEON store was split across a 16-byte boundary, then
> there could be a 50 cycle stall.
>
> See http://hardwarebug.org/2008/12/31/arm-neon-memory-hazards/ for
> some more details and benchmarks.
If that's the case, then for A8 we should only do this when there won't be trailing scalar load / stores.
Evan
> _______________________________________________
> LLVM Developers mailing list
>...
2009 Nov 10
0
[LLVMdev] speed up memcpy intrinsic using ARM Neon registers
...an ARM store after NEON stores to the same 16-byte block
>> incurs a ~20 cycle penalty since the NEON unit executes behind ARM.
>> It's worse if the NEON store was split across a 16-byte boundary, then
>> there could be a 50 cycle stall.
>>
>> See http://hardwarebug.org/2008/12/31/arm-neon-memory-hazards/ for
>> some more details and benchmarks.
>
> If that's the case, then for A8 we should only do this when there
> won't be trailing scalar load / stores.
It should be safe if the start pointer is known 16-byte aligned. The trailing s...
2009 Nov 10
0
[LLVMdev] speed up memcpy intrinsic using ARM Neon registers
...er than
> using scalar registers?
On the A8, an ARM store after NEON stores to the same 16-byte block incurs a ~20 cycle penalty since the NEON unit executes behind ARM. It's worse if the NEON store was split across a 16-byte boundary, then there could be a 50 cycle stall.
See http://hardwarebug.org/2008/12/31/arm-neon-memory-hazards/ for some more details and benchmarks.
2009 Nov 10
4
[LLVMdev] speed up memcpy intrinsic using ARM Neon registers
I tried to speed up Dhrystone on ARM Cortex-A8 by optimizing the memcpy intrinsic. I used the Neon load multiple instruction to move up to 48 bytes at a time. Over 15 scalar instructions collapsed down into these 2 Neon instructions.
    fldmiad r3, {d0, d1, d2, d3, d4, d5} @ SrcLine dhrystone.c 359
    fstmiad r1, {d0, d1, d2, d3, d4, d5}
It seems like this should be faster. But I did
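
The hazard described in the snippets above (an ARM store landing in the same 16-byte block as a preceding NEON store, and a NEON store split across a 16-byte boundary) can be sketched as two address checks. This is only an illustrative model, not code from the thread or from LLVM: the helper names `block_of`, `crosses_block`, and `tail_hits_neon_block`, and the idea of splitting a copy into a NEON part followed by a scalar tail, are assumptions made for the example. It models addresses only and says nothing about actual cycle counts.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Block size named in the thread: the penalty applies when an ARM store
   follows NEON stores "to the same 16-byte block". */
#define HAZARD_BLOCK 16u

/* Index of the 16-byte block that contains addr (hypothetical helper). */
static uintptr_t block_of(uintptr_t addr)
{
    return addr / HAZARD_BLOCK;
}

/* True if a store of n bytes starting at addr spans more than one
   16-byte block, i.e. it is split across a 16-byte boundary. */
static bool crosses_block(uintptr_t addr, size_t n)
{
    assert(n > 0);
    return block_of(addr) != block_of(addr + n - 1);
}

/* Assumed copy layout: the first neon_bytes of a len-byte copy to dst are
   stored with NEON, the remaining bytes with scalar ARM stores.
   Returns true if the first scalar store falls into the same 16-byte
   block as the last byte written by NEON. */
static bool tail_hits_neon_block(uintptr_t dst, size_t len, size_t neon_bytes)
{
    assert(neon_bytes <= len);
    if (neon_bytes == 0 || neon_bytes == len)
        return false;                            /* no NEON part, or no scalar tail */
    uintptr_t last_neon  = dst + neon_bytes - 1; /* last byte stored by NEON */
    uintptr_t first_tail = dst + neon_bytes;     /* first byte stored by ARM */
    return block_of(last_neon) == block_of(first_tail);
}
```

Under this model, a 48-byte NEON store (six 8-byte d registers, as in the fstmiad above) to a 16-byte-aligned destination ends exactly on a block boundary, so a scalar tail starts in a fresh block. With an unaligned destination, or a NEON part that is not a multiple of 16 bytes, the tail shares a block with the last NEON store.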