Displaying 3 results from an estimated 3 matches for "fstmiad".
2009 Nov 10
[LLVMdev] speed up memcpy intrinsic using ARM Neon registers
...ex-A8 by optimizing the
> memcpy intrinsic. I used the Neon load multiple instruction to move up
> to 48 bytes at a time. Over 15 scalar instructions collapsed down
> into these 2 Neon instructions.
>
> fldmiad r3, {d0, d1, d2, d3, d4, d5} @ SrcLine dhrystone.c 359
> fstmiad r1, {d0, d1, d2, d3, d4, d5}
>
> It seems like this should be faster. But I did not see any
> appreciable speedup.
>
> I think the patch is correct. The code runs fine.
>
> I have attached my patch for "lib/Target/ARM/ARMISelLowering.cpp" to
> this email.
>...
2009 Nov 10
[LLVMdev] speed up memcpy intrinsic using ARM Neon registers
...ed up Dhrystone on ARM Cortex-A8 by optimizing the
memcpy intrinsic. I used the Neon load multiple instruction to move up
to 48 bytes at a time. Over 15 scalar instructions collapsed down
into these 2 Neon instructions.
fldmiad r3, {d0, d1, d2, d3, d4, d5} @ SrcLine dhrystone.c 359
fstmiad r1, {d0, d1, d2, d3, d4, d5}
It seems like this should be faster. But I did not see any appreciable speedup.
I think the patch is correct. The code runs fine.
I have attached my patch for "lib/Target/ARM/ARMISelLowering.cpp" to this email.
Does this look like the right modification?...
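
To spell out the data movement: the instruction pair above loads six 64-bit registers (d0 to d5, 6 x 8 = 48 bytes) from one address and then stores them to another. A rough portable C sketch of that single block copy follows. The names copy_block48, BLOCK_REGS and BLOCK_BYTES are hypothetical and used only for illustration; this is not the attached patch, and a compiler may lower it to different instructions.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical illustration only: six 64-bit slots stand in for d0..d5. */
enum { BLOCK_REGS = 6, BLOCK_BYTES = BLOCK_REGS * sizeof(uint64_t) };

/* Copy one 48-byte block as a "load multiple" followed by a "store multiple". */
static void copy_block48(void *dst, const void *src)
{
    uint64_t d[BLOCK_REGS];

    assert(dst != NULL && src != NULL);

    /* Load phase: read all six doublewords from src (like fldmiad). */
    memcpy(d, src, BLOCK_BYTES);

    /* Store phase: write all six doublewords to dst (like fstmiad). */
    memcpy(dst, d, BLOCK_BYTES);
}
```

Going through memcpy for the temporaries keeps the sketch free of alignment and aliasing assumptions about src and dst.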
2009 Nov 10
[LLVMdev] speed up memcpy intrinsic using ARM Neon registers
...>> up
>> to 48 bytes at a time. Over 15 scalar instructions collapsed down
>> into these 2 Neon instructions.
Nice. Thanks for working on this. It has long been on my todo list.
>>
>> fldmiad r3, {d0, d1, d2, d3, d4, d5} @ SrcLine dhrystone.c 359
>> fstmiad r1, {d0, d1, d2, d3, d4, d5}
>>
>> It seems like this should be faster. But I did not see any
>> appreciable speedup.
Even if it's not faster, it's still a code size win, which is also
important. Are we generating the right aligned NEON loads / stores?
>>
>>...