2009 Nov 11
[LLVMdev] speed up memcpy intrinsic using ARM Neon registers
...shows that the tail will be in a separate 16-byte block. (And what's
up with the 16-byte divisions? I thought the cache lines were 64
bytes...)
2009 Nov 10
[LLVMdev] speed up memcpy intrinsic using ARM Neon registers
I tried to speed up Dhrystone on ARM Cortex-A8 by optimizing the
memcpy intrinsic. I used the Neon load-multiple instruction to move up
to 48 bytes at a time. More than 15 scalar instructions collapsed
into these 2 Neon instructions:
fldmiad r3, {d0, d1, d2, d3, d4, d5} @ SrcLine dhrystone.c 359
fstmiad r1, {d0, d1, d2, d3, d4, d5}
It seems like this should be faster. But I did