>> Test case name:
>> llvm/projects/test-suite/SingleSource/Benchmarks/Misc/matmul_f64_4x4.c -
>> This is a 4x4 matrix multiplication; we can make small changes to turn it
>> into a 3x3 matrix multiplication to make things simpler to understand.
>
> This is one very specific case. How does that behave in all other cases?
> Normally, every big improvement comes with a cost, and if you only look at
> the benchmark you're tuning to, you'll never see it. It may be that the
> cost is small and that we decide to pay the price, but not until we know
> what the cost is.

I agree that we should approach this as a whole rather than in bits and
pieces. I was basically comparing the performance of clang- and
gcc-generated code for the benchmarks listed in LLVM trunk. I found that
wherever there were floating point ops (specifically floating point
multiplication), performance with clang was bad. On analyzing those cases
further, I came across the vmla instruction emitted by gcc. The test cases
hit by clang's poor performance are (clang does not emit vmla for
cortex-a8):

Test Case                                                                 vmla count (gcc)
==========================================================================================
llvm/projects/test-suite/SingleSource/Benchmarks/Misc-C++/Large/sphereflake     55
llvm/projects/test-suite/SingleSource/Benchmarks/Misc-C++/Large/ray.cpp         40
llvm/projects/test-suite/SingleSource/Benchmarks/Misc/ffbench.c                  8
llvm/projects/test-suite/SingleSource/Benchmarks/Misc/matmul_f64_4x4.c          18
llvm/projects/test-suite/SingleSource/Benchmarks/BenchmarkGame/n-body.c         36

With the vmul+vadd instruction pair comes extra load/store overhead, as
seen in the generated assembly. With the -mcpu=cortex-a15 option clang
performs better, as it emits vmla instructions.

>> This was tested on real hardware. Time taken for a 4x4 matrix
>> multiplication:
>
> What hardware? A7? A8? A9? A15?

I tested it on an A15. I don't have access to an A8 right now, but I
intend to test it on an A8 as well.
I compiled the code for the A8, and as it was working fine on the A15
without any crash, I went ahead with the cortex-a8 option. I don't think I
will get A8 hardware soon; can someone please check it on A8 hardware as
well? (Sorry for the trouble.)

>> Also, as stated by Renato - "there is a pipeline stall between two
>> sequential VMLAs (possibly due to the need of re-use of some registers)
>> and this made code much slower than a sequence of VMLA+VMUL+VADD" -
>> when I use -mcpu=cortex-a15 as an option, clang emits vmla instructions
>> back to back (sequentially). Is there something different about
>> cortex-a15 regarding pipeline stalls, such that we are ignoring
>> back-to-back vmla hazards?
>
> A8 and A15 are quite different beasts. I haven't read about this hazard
> in the A15 manual, so I suspect that they have fixed whatever was causing
> the stall.

OK. I couldn't find a reference for this. If the pipeline stall issue was
fixed in cortex-a15, then the LLVM developers will definitely know about
it (and hence we are emitting vmla for cortex-a15). I couldn't find any
comment related to this in the code. Can someone please point it out?
Also, I would be glad to know where in the code we start differentiating
between cortex-a8 and cortex-a15 for code generation.

--
With regards,
Suyog Sarda
On 19 December 2013 13:30, suyog sarda <sardask01 at gmail.com> wrote:
> I tested it on an A15. I don't have access to an A8 right now, but I
> intend to test it on an A8 as well. I compiled the code for the A8, and
> as it was working fine on the A15 without any crash, I went ahead with
> the cortex-a8 option. I don't think I will get A8 hardware soon; can
> someone please check it on A8 hardware as well? (Sorry for the trouble.)

It's not surprising that the -mcpu=cortex-a15 option performs better on an
A15 than -mcpu=cortex-a8. It's also not surprising that you don't see the
VMLA hazard we're talking about, since that was (if I recall correctly)
specific to the A8 (maybe the A9, too).

We can only talk about disabling the VMLX-fwd feature for the A8 once
substantial benchmarks have been run on a Cortex-A8. Not instruction
counts, but performance. Emitting more VMLAs doesn't mean the code will go
faster; in some cases we actually found quite the opposite.

In the meantime, if you're using an A15, just use -mcpu=cortex-a15 and
hopefully the generated code will be as fast as possible. Having Clang
detect that you have an A15 automatically is another topic that we could
get into, but it has nothing to do with VMLA.

> OK. I couldn't find a reference for this. If the pipeline stall issue
> was fixed in cortex-a15, then the LLVM developers will definitely know
> about it (and hence we are emitting vmla for cortex-a15). I couldn't
> find any comment related to this in the code. Can someone please point
> it out? Also, I would be glad to know where in the code we start
> differentiating between cortex-a8 and cortex-a15 for code generation.

The link below shows some fragments of the thread (I hate gmane), but it
includes Evan's benchmarks and assumptions.

http://comments.gmane.org/gmane.comp.compilers.llvm.devel/59458

cheers,
--renato
Hi Suyog,

> I tested it on an A15. I don't have access to an A8 right now, but I
> intend to test it on an A8 as well.

That's extremely dodgy; the two processors are very different.

> I don't think I will get A8 hardware soon; can someone please check it
> on A8 hardware as well? (Sorry for the trouble.)

I've got a BeagleBone hanging around, and tested Clang against a hacked
version of itself (without the VMLx disabling on Cortex-A8). The results
(for matmul_f64_4x4, -O3 -mcpu=cortex-a8) were:

1. vfpv3-d16, stock Clang: 96.2s
2. vfpv3-d16, Clang + vmla: 95.7s
3. vfpv3, stock Clang: 82.9s
4. vfpv3, Clang + vmla: 81.1s

Worth investigating more, but as the others have said, nowhere near enough
data on its own. Especially since Evan clearly did some benchmarking
himself before specifically disabling the vmla formation.

> Also, I would be glad to know where in the code we start differentiating
> between cortex-a8 and cortex-a15 for code generation.

Probably most relevant is the combination of features given to each
processor in lib/Target/ARM/ARM.td. This vmul/vmla difference comes from
"FeatureHasSlowFPVMLx", via ARMSubtarget.h's useFPVMLx and
ARMInstrInfo.td's UseFPVMLx.

Cheers.

Tim.
On 20 December 2013 13:00, Tim Northover <t.p.northover at gmail.com> wrote:
> Worth investigating more, but as the others have said, nowhere near
> enough data on its own. Especially since Evan clearly did some
> benchmarking himself before specifically disabling the vmla formation.

Indeed. Not just specific micro-benchmarks. I also did some testing and
found similar results.

> Probably most relevant is the combination of features given to each
> processor in lib/Target/ARM/ARM.td. This vmul/vmla difference comes
> from "FeatureHasSlowFPVMLx", via ARMSubtarget.h's useFPVMLx and
> ARMInstrInfo.td's UseFPVMLx.

Yes, there's no way to turn that on or off from the command line, but I
think this is a good thing, not a bad one. Ultimately, using the -mcpu
flag to choose the right CPU is the best thing you can do, and LLVM
should get it right.

Another thing that comes to mind: maybe it's time to set Cortex-A9 as the
default ARMv7 target... no?

cheers,
--renato