Hi Tim,

> > cortex-a15 vfpv4 : vmla instruction emitted (which is a NEON instruction)
>
> I get a VFP vmla here rather than a NEON one (clang -target
> armv7-linux-gnueabihf -mcpu=cortex-a15): "vmla.f32 s0, s1, s2". Are
> you seeing something different?

As per Renato's comment above, the vmla instruction is a NEON instruction while vmfa is a VFP instruction. Correct me if I am wrong on this.

> > However, if gcc emits the vmla (NEON) instruction with cortex-a8 then
> > shouldn't LLVM also emit the vmla (NEON) instruction?
>
> It appears we've decided in the past that vmla just isn't worth it on
> Cortex-A8. There's this comment in the source:
>
> // Some processors have FP multiply-accumulate instructions that don't
> // play nicely with other VFP / NEON instructions, and it's generally better
> // to just not use them.
>
> Sufficient benchmarking evidence could overturn that decision, but I
> assume the people who added it in the first place didn't do so on a whim.
>
> > The performance gain with the vmla instruction is huge.
>
> Is it, on Cortex-A8? The TRM refers to them jumping across pipelines
> in odd ways, and that was a very primitive core, so it's almost
> certainly not going to be just as good as a vmul (in fact, if I'm
> reading correctly, it takes pretty much exactly the same time as
> separate vmul and vadd instructions, 10 cycles vs 2 * 5).

It may seem that the total number of cycles is more or less the same for a single vmla and for vmul+vadd. However, when the vmul+vadd combination is used instead of vmla, intermediate results are generated which need to be stored in memory for future access. This leads to a lot of load/store ops being inserted, which degrades performance. Correct me if I am wrong on this, but my observations till date have shown this.

> Cheers.
>
> Tim.

--
With regards,
Suyog Sarda
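For concreteness, the pattern under discussion is a single floating-point multiply-accumulate. Below is a minimal sketch; the function name, variable names and the lowerings shown in the comments are illustrative assumptions, not taken from any benchmark or from actual compiler output.

    /* Minimal sketch of the multiply-accumulate pattern under discussion.
       Names are made up for illustration. */
    float madd(float acc, float x, float y)
    {
        /* Single-instruction lowering:  vmla.f32  <acc>, <x>, <y>
           Two-instruction lowering:     vmul.f32  <tmp>, <x>, <y>
                                         vadd.f32  <acc>, <acc>, <tmp>
           The split form needs an extra register for <tmp>; it only touches
           memory if register pressure forces a spill. */
        return acc + x * y;
    }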
> As per Renato's comment above, the vmla instruction is a NEON instruction
> while vmfa is a VFP instruction. Correct me if I am wrong on this.

My version of the ARM architecture reference manual (v7 A & R) lists versions requiring NEON and versions requiring VFP (section A8.8.337), split in just the way you'd expect (SIMD variants need NEON).

> It may seem that the total number of cycles is more or less the same for a
> single vmla and for vmul+vadd. However, when the vmul+vadd combination is
> used instead of vmla, intermediate results are generated which need to be
> stored in memory for future access.

Well, it increases register pressure slightly I suppose, but there's no need to store anything to memory unless that gets critical.

> Correct me if I am wrong on this, but my observations till date have shown this.

Perhaps. Actual data is needed, I think, if you seriously want to change this behaviour in LLVM. The test-suite might be a good place to start, though it'll give an incomplete picture without the externals (SPEC & other things).

Of course, if we're just speculating we can carry on.

Cheers.

Tim.
On Thu, Dec 19, 2013 at 2:43 PM, Tim Northover <t.p.northover at gmail.com> wrote:

> > As per Renato's comment above, the vmla instruction is a NEON instruction
> > while vmfa is a VFP instruction. Correct me if I am wrong on this.
>
> My version of the ARM architecture reference manual (v7 A & R) lists
> versions requiring NEON and versions requiring VFP (section A8.8.337),
> split in just the way you'd expect (SIMD variants need NEON).

I will check on this part.

> > It may seem that the total number of cycles is more or less the same for a
> > single vmla and for vmul+vadd. However, when the vmul+vadd combination is
> > used instead of vmla, intermediate results are generated which need to be
> > stored in memory for future access.
>
> Well, it increases register pressure slightly I suppose, but there's
> no need to store anything to memory unless that gets critical.
>
> > Correct me if I am wrong on this, but my observations till date have shown this.
>
> Perhaps. Actual data is needed, I think, if you seriously want to change
> this behaviour in LLVM. The test-suite might be a good place to start,
> though it'll give an incomplete picture without the externals (SPEC &
> other things).
>
> Of course, if we're just speculating we can carry on.

I wasn't speculating. Let's take the example of a simple 3x3 matrix multiplication with no loops, where all the multiplications and additions are hard coded - basically every operation is expanded, e.g.

Result[0][0] = A[0][0]*B[0][0] + A[0][1]*B[1][0] + A[0][2]*B[2][0]

and so on for all 9 elements of the result.

If I compile the above code with "clang -O3 -mcpu=cortex-a8 -mfpu=vfpv3-d16" (only 16 floating-point registers are present on my ARM, so I specify vfpv3-d16), there are 27 vmul, 18 vadd, 23 store and 30 load ops in total. If the same code is compiled with gcc using the same options, there are 9 vmul, 18 vmla, 9 store and 20 load ops.

So it is clear that extra load/store ops get added with clang because it does not emit the vmla instruction. Won't this lead to performance degradation?

I would also like to know about accuracy with vmla compared to a pair of vmul and vadd ops.

--
With regards,
Suyog Sarda
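For reference, a fully expanded 3x3 multiply of the kind described above would look roughly like this; the function name and the float element type are assumptions for the sketch, not the exact benchmark source.

    /* Sketch of a fully unrolled 3x3 matrix multiplication, no loops.
       Each result element is one leading multiply plus two accumulates. */
    void matmul3x3(const float A[3][3], const float B[3][3], float Result[3][3])
    {
        Result[0][0] = A[0][0]*B[0][0] + A[0][1]*B[1][0] + A[0][2]*B[2][0];
        Result[0][1] = A[0][0]*B[0][1] + A[0][1]*B[1][1] + A[0][2]*B[2][1];
        Result[0][2] = A[0][0]*B[0][2] + A[0][1]*B[1][2] + A[0][2]*B[2][2];
        Result[1][0] = A[1][0]*B[0][0] + A[1][1]*B[1][0] + A[1][2]*B[2][0];
        Result[1][1] = A[1][0]*B[0][1] + A[1][1]*B[1][1] + A[1][2]*B[2][1];
        Result[1][2] = A[1][0]*B[0][2] + A[1][1]*B[1][2] + A[1][2]*B[2][2];
        Result[2][0] = A[2][0]*B[0][0] + A[2][1]*B[1][0] + A[2][2]*B[2][0];
        Result[2][1] = A[2][0]*B[0][1] + A[2][1]*B[1][1] + A[2][2]*B[2][1];
        Result[2][2] = A[2][0]*B[0][2] + A[2][1]*B[1][2] + A[2][2]*B[2][2];
    }

Nine elements with one leading multiply and two accumulates each line up with the gcc counts quoted above (9 vmul + 18 vmla), while splitting every accumulate gives the 27 vmul + 18 vadd seen from clang.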
On 19 December 2013 08:50, suyog sarda <sardask01 at gmail.com> wrote:

> It may seem that the total number of cycles is more or less the same for a
> single vmla and for vmul+vadd. However, when the vmul+vadd combination is
> used instead of vmla, intermediate results are generated which need to be
> stored in memory for future access. This leads to a lot of load/store ops
> being inserted, which degrades performance. Correct me if I am wrong on
> this, but my observations till date have shown this.

VMLA.F can be either NEON or VFP on the A series, and the encoding determines which is used. In assembly files, the difference is mainly in the type vs. the registers used.

The problem we were trying to avoid a long time ago was well researched by Evan Cheng, and it showed that there is a pipeline stall between two sequential VMLAs (possibly due to the need to re-use some registers), which made that code much slower than a sequence of VMLA+VMUL+VADD.

Also, please note that, as far as cycle counts go, according to the A9 manual one VFP VMLA takes almost as long as a pair of VMUL+VADD to produce its result, so a sequence of VMUL+VADD might be faster, in some contexts or cores, than a sequence of half as many VMLAs.

As Tim and David said, and I agree, without hard data anything we say might be used against us. ;)

cheers,
--renato
On Thu, Dec 19, 2013 at 4:36 PM, Renato Golin <renato.golin at linaro.org> wrote:

> On 19 December 2013 08:50, suyog sarda <sardask01 at gmail.com> wrote:
>
> > It may seem that the total number of cycles is more or less the same for a
> > single vmla and for vmul+vadd. However, when the vmul+vadd combination is
> > used instead of vmla, intermediate results are generated which need to be
> > stored in memory for future access. This leads to a lot of load/store ops
> > being inserted, which degrades performance. Correct me if I am wrong on
> > this, but my observations till date have shown this.
>
> VMLA.F can be either NEON or VFP on the A series, and the encoding
> determines which is used. In assembly files, the difference is mainly in
> the type vs. the registers used.
>
> The problem we were trying to avoid a long time ago was well researched by
> Evan Cheng, and it showed that there is a pipeline stall between two
> sequential VMLAs (possibly due to the need to re-use some registers), which
> made that code much slower than a sequence of VMLA+VMUL+VADD.
>
> Also, please note that, as far as cycle counts go, according to the A9
> manual one VFP VMLA takes almost as long as a pair of VMUL+VADD to produce
> its result, so a sequence of VMUL+VADD might be faster, in some contexts or
> cores, than a sequence of half as many VMLAs.
>
> As Tim and David said, and I agree, without hard data anything we say
> might be used against us. ;)

Sorry folks, I didn't specify the actual test case and results in detail previously. The details are as follows:

Test case name: llvm/projects/test-suite/SingleSource/Benchmarks/Misc/matmul_f64_4x4.c
- This is a 4x4 matrix multiplication; we can make small changes to turn it into a 3x3 matrix multiplication to keep things simple to understand.

clang version: trunk (latest as of today, 19 Dec 2013)
GCC version: 4.5 (I checked with 4.8 as well)

Flags passed to both gcc and clang: -march=armv7-a -mfloat-abi=softfp -mfpu=vfpv3-d16 -mcpu=cortex-a8
Optimization level used: O3

No vmla instruction is emitted by clang, but GCC happily emits it. This was tested on real hardware.

Time taken for a 4x4 matrix multiplication:
clang: ~14 secs
gcc: ~9 secs

Time taken for a 3x3 matrix multiplication:
clang: ~6.5 secs
gcc: ~5 secs

When the flag -mcpu=cortex-a8 is changed to -mcpu=cortex-a15, clang emits vmla instructions (gcc emits them by default).

Time for a 4x4 matrix multiplication:
clang: ~8.5 secs
GCC: ~9 secs

Time for a 3x3 matrix multiplication:
clang: ~3.8 secs
GCC: ~5 secs

Please let me know if I am missing something. (The -ffast-math option doesn't help in this case.)

On examining the assembly code for the various scenarios above, I reached the conclusion stated earlier regarding the extra load/store ops.

Also, as stated by Renato - "there is a pipeline stall between two sequential VMLAs (possibly due to the need to re-use some registers), which made that code much slower than a sequence of VMLA+VMUL+VADD" - when I use -mcpu=cortex-a15, clang emits vmla instructions back to back (sequentially). Is there something different about cortex-a15 regarding pipeline stalls, such that back-to-back vmla hazards can be ignored there?

--
With regards,
Suyog Sarda
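To illustrate the back-to-back point, here is roughly what one result element of the double-precision 4x4 case looks like once expanded; the names are made up for this sketch and it is not the actual matmul_f64_4x4.c source.

    /* Sketch: one element of a 4x4 double-precision matrix product. After
       the first multiply, the three accumulations form a serial dependence
       chain; if the compiler turns each of them into a vmla, those vmla
       instructions come out back to back. */
    double dot_row0_col0(const double A[4][4], const double B[4][4])
    {
        double r00 = A[0][0] * B[0][0];   /* vmul */
        r00 += A[0][1] * B[1][0];         /* candidate vmla */
        r00 += A[0][2] * B[2][0];         /* candidate vmla, back to back */
        r00 += A[0][3] * B[3][0];         /* candidate vmla */
        return r00;
    }

Whether such a dependent chain suffers the stall Renato described on Cortex-A15 is presumably what the timings above are measuring.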