On Thu, Dec 19, 2013 at 4:36 PM, Renato Golin <renato.golin at linaro.org> wrote:

> On 19 December 2013 08:50, suyog sarda <sardask01 at gmail.com> wrote:
>
>> It may seem that the total number of cycles is more or less the same for
>> a single vmla and for vmul+vadd. However, when the vmul+vadd combination
>> is used instead of vmla, intermediate results are generated that need to
>> be stored in memory for later access. This leads to a lot of load/store
>> operations being inserted, which degrades performance. Correct me if I
>> am wrong on this, but my observations to date have shown this.
>
> VMLA.F can be either NEON or VFP on the A series, and the encoding
> determines which is used. In assembly files, the difference is mainly the
> type vs. the registers used.
>
> The problem we were trying to avoid a long time ago was well researched
> by Evan Cheng, and it showed that there is a pipeline stall between two
> sequential VMLAs (possibly due to the need to re-use some registers),
> which made that code much slower than a sequence of VMLA+VMUL+VADD.
>
> Also, please note that, as accurate as cycle counts go, according to the
> A9 manual, one VFP VMLA takes almost as long as a VMUL+VADD pair to
> produce its result, so a sequence of VMUL+VADD might be faster, on some
> contexts or cores, than half the sequence of VMLAs.
>
> As Tim and David said and I agree, without hard data, anything we say
> might be used against us. ;)

Sorry folks, I didn't specify the actual test case and results in detail
previously. The details are as follows:

Test case name:
llvm/projects/test-suite/SingleSource/Benchmarks/Misc/matmul_f64_4x4.c -
this is a 4x4 matrix multiplication; we can make small changes to turn it
into a 3x3 matrix multiplication to keep things simple to understand.

clang version: trunk (latest as of today, 19 Dec 2013)
GCC version: 4.5 (I checked with 4.8 as well)

Flags passed to both gcc and clang:
-march=armv7-a -mfloat-abi=softfp -mfpu=vfpv3-d16 -mcpu=cortex-a8
Optimization level used: O3

No vmla instruction is emitted by clang, but GCC happily emits it.

This was tested on real hardware. Time taken for a 4x4 matrix
multiplication:
clang : ~14 secs
gcc   : ~9 secs

Time taken for a 3x3 matrix multiplication:
clang : ~6.5 secs
gcc   : ~5 secs

When the flag -mcpu=cortex-a8 is changed to -mcpu=cortex-a15, clang emits
vmla instructions (gcc emits them by default).

Time for a 4x4 matrix multiplication:
clang : ~8.5 secs
GCC   : ~9 secs

Time for a 3x3 matrix multiplication:
clang : ~3.8 secs
GCC   : ~5 secs

Please let me know if I am missing something. (The -ffast-math option
doesn't help in this case.)

On examining the assembly code for the various scenarios above, I concluded
what I stated earlier regarding the extra load/store operations. Also, as
stated by Renato - "there is a pipeline stall between two sequential VMLAs
(possibly due to the need of re-use of some registers) and this made code
much slower than a sequence of VMLA+VMUL+VADD" - when I use
-mcpu=cortex-a15, clang emits vmla instructions back to back
(sequentially). Is there something different about cortex-a15 regarding
pipeline stalls, such that we are ignoring back-to-back vmla hazards there?

-- 
With regards,
Suyog Sarda
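For reference, the kind of kernel matmul_f64_4x4.c exercises looks roughly
like the sketch below (an illustrative reconstruction, not the exact
test-suite source; the function name and fixed size N are assumptions). The
multiply-accumulate in the inner loop is the pattern that can be selected
either as a single vmla.f64 or as a vmul.f64 followed by a vadd.f64:

    #define N 4

    /* Classic triple loop; the inner statement acc += a[i][k] * b[k][j]
     * is a double-precision multiply-accumulate. */
    static void matmul(const double a[N][N], const double b[N][N],
                       double c[N][N])
    {
        for (int i = 0; i < N; ++i) {
            for (int j = 0; j < N; ++j) {
                double acc = 0.0;
                for (int k = 0; k < N; ++k)
                    acc += a[i][k] * b[k][j];
                c[i][j] = acc;
            }
        }
    }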
On 19 December 2013 11:16, suyog sarda <sardask01 at gmail.com> wrote:

> Test case name:
> llvm/projects/test-suite/SingleSource/Benchmarks/Misc/matmul_f64_4x4.c -
> this is a 4x4 matrix multiplication; we can make small changes to turn it
> into a 3x3 matrix multiplication to keep things simple to understand.

This is one very specific case. How does it behave in all the other cases?
Normally, every big improvement comes with a cost, and if you only look at
the benchmark you're tuning for, you'll never see it. It may be that the
cost is small and that we decide to pay the price, but not until we know
what the cost is.

> This was tested on real hardware. Time taken for a 4x4 matrix
> multiplication:

What hardware? A7? A8? A9? A15?

> Also, as stated by Renato - "there is a pipeline stall between two
> sequential VMLAs (possibly due to the need of re-use of some registers)
> and this made code much slower than a sequence of VMLA+VMUL+VADD" - when
> I use -mcpu=cortex-a15, clang emits vmla instructions back to back
> (sequentially). Is there something different about cortex-a15 regarding
> pipeline stalls, such that we are ignoring back-to-back vmla hazards
> there?

A8 and A15 are quite different beasts. I haven't read about this hazard in
the A15 manual, so I suspect that they have fixed whatever was causing the
stall.

cheers,
--renato
>> Test case name:
>> llvm/projects/test-suite/SingleSource/Benchmarks/Misc/matmul_f64_4x4.c -
>> this is a 4x4 matrix multiplication; we can make small changes to turn
>> it into a 3x3 matrix multiplication to keep things simple to understand.
>
> This is one very specific case. How does it behave in all the other
> cases? Normally, every big improvement comes with a cost, and if you only
> look at the benchmark you're tuning for, you'll never see it. It may be
> that the cost is small and that we decide to pay the price, but not until
> we know what the cost is.

I agree that we should approach this as a whole rather than in bits and
pieces. I was basically comparing the performance of clang- and
gcc-generated code for the benchmarks listed in the llvm trunk. I found
that wherever there were floating-point operations (specifically
floating-point multiplication), performance with clang was bad. On
analyzing those cases further, I came across the vmla instruction emitted
by gcc. The test cases hit by clang's bad performance are listed below,
with the number of vmla instructions emitted by gcc (clang does not emit
vmla for cortex-a8):

  Test case                                                                 vmla count (gcc)
  ==========================================================================================
  llvm/projects/test-suite/SingleSource/Benchmarks/Misc-C++/Large/sphereflake            55
  llvm/projects/test-suite/SingleSource/Benchmarks/Misc-C++/Large/ray.cpp                40
  llvm/projects/test-suite/SingleSource/Benchmarks/Misc/ffbench.c                         8
  llvm/projects/test-suite/SingleSource/Benchmarks/Misc/matmul_f64_4x4.c                 18
  llvm/projects/test-suite/SingleSource/Benchmarks/BenchmarkGame/n-body.c                36

With the vmul+vadd instruction pair comes the extra overhead of load/store
operations, as seen in the generated assembly. With the -mcpu=cortex-a15
option clang performs better, as it emits vmla instructions.

>> This was tested on real hardware. Time taken for a 4x4 matrix
>> multiplication:
>
> What hardware? A7? A8? A9? A15?

I tested it on an A15; I don't have access to an A8 right now, but I intend
to test on A8 as well. I compiled the code for A8 and, as it was working
fine on the A15 without any crash, I went ahead with the cortex-a8 option.
I don't think I will get A8 hardware soon - can someone please check it on
A8 hardware as well (sorry for the trouble)?

>> Also, as stated by Renato - "there is a pipeline stall between two
>> sequential VMLAs (possibly due to the need of re-use of some registers)
>> and this made code much slower than a sequence of VMLA+VMUL+VADD" - when
>> I use -mcpu=cortex-a15, clang emits vmla instructions back to back
>> (sequentially). Is there something different about cortex-a15 regarding
>> pipeline stalls, such that we are ignoring back-to-back vmla hazards
>> there?
>
> A8 and A15 are quite different beasts. I haven't read about this hazard
> in the A15 manual, so I suspect that they have fixed whatever was causing
> the stall.

OK. I couldn't find a reference for this. If the pipeline-stall issue was
fixed in cortex-a15, then the LLVM developers will definitely know about it
(and hence we emit vmla for cortex-a15), but I couldn't find any comment
related to this in the code. Can someone please point it out? Also, I would
be glad to know where in the code we start differentiating between
cortex-a8 and cortex-a15 for code generation.

-- 
With regards,
Suyog Sarda
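To make the load/store observation above concrete, here is a hedged
illustration of the two possible lowerings of one double-precision
multiply-accumulate. The function name mac_step and the register names in
the comments are arbitrary and chosen only for illustration; this is a
sketch, not the exact code either compiler emits:

    /* One scalar multiply-accumulate step, as in the matmul inner loop. */
    double mac_step(double acc, double x, double y)
    {
        /* Fused form (gcc, and clang with -mcpu=cortex-a15):
         *     vmla.f64  d0, d1, d2      @ d0 += d1 * d2
         *
         * Split form (clang with -mcpu=cortex-a8):
         *     vmul.f64  d3, d1, d2      @ d3  = d1 * d2
         *     vadd.f64  d0, d0, d3      @ d0  = d0 + d3
         *
         * The split form needs an extra temporary (d3 here); in a larger
         * kernel, under register pressure, such temporaries may be
         * spilled, which is one plausible source of the extra loads and
         * stores seen in the generated assembly. */
        return acc + x * y;
    }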