Displaying 19 results from an estimated 19 matches for "matmul_f64_4x4".
2013 Dec 19
0
[LLVMdev] LLVM ARM VMLA instruction
On 19 December 2013 11:16, suyog sarda <sardask01 at gmail.com> wrote:
> Test case name:
> llvm/projects/test-suite/SingleSource/Benchmarks/Misc/matmul_f64_4x4.c -
> This is a 4x4 matrix multiplication; we can make small changes to turn it into a
> 3x3 matrix multiplication to keep things simple to understand.
>
This is one very specific case. How does that behave on all other cases?
Normally, every big improvement comes with a cost, and if you...
2013 Dec 19
2
[LLVMdev] LLVM ARM VMLA instruction
...m and David said and I agree, without hard data, anything we say
> might be used against us. ;)
>
>
Sorry folks, I didn't specify the actual test case and results in detail
previously. The details are as follows:
Test case name:
llvm/projects/test-suite/SingleSource/Benchmarks/Misc/matmul_f64_4x4.c -
This is a 4x4 matrix multiplication; we can make small changes to turn it into a
3x3 matrix multiplication to keep things simple to understand.
clang version: trunk version (latest as of today, 19 Dec 2013)
GCC version: 4.5 (I checked with 4.8 as well)
flags passed to both gcc and clang : -m...
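For readers without the test-suite checked out, the kind of kernel under discussion can be sketched as below. This is a minimal illustration of a 4x4 double-precision matrix multiplication, not the actual contents of matmul_f64_4x4.c; the function name `matmul4x4` and the row-major layout are assumptions. The `sum += ...` statement in the inner loop is the multiply-accumulate pattern that a compiler may lower either to a single multiply-accumulate instruction or to a separate multiply and add, depending on the target and settings.

```c
#include <stddef.h>

#define N 4  /* matrix dimension; the thread suggests 3 for a simpler case */

/* Hypothetical helper (not taken from the benchmark source):
 * computes c = a * b for row-major N x N matrices of doubles. */
void matmul4x4(double a[N][N], double b[N][N], double c[N][N])
{
    for (size_t i = 0; i < N; ++i) {
        for (size_t j = 0; j < N; ++j) {
            double sum = 0.0;
            for (size_t k = 0; k < N; ++k) {
                /* multiply-accumulate: one product added to a running sum */
                sum += a[i][k] * b[k][j];
            }
            c[i][j] = sum;
        }
    }
}
```

Changing `N` to 3 gives the reduced 3x3 variant mentioned in the message.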
2012 Sep 22
2
[LLVMdev] Heads up! New SROA implementation is going on-by-default today!
...got by flipping it on and back off:
http://llvm.org/perf/db_default/v4/nts/3963
Most of this is very, very green. There are three somewhat worrisome
regressions in execution time:
1) sse_expandfft -- when I build this, the binaries have no differences
before and after
2) sse_stepfft -- ditto
3) matmul_f64_4x4 -- this one is interesting
The last one represents the only real regressions I expect to see with the
new pass. There is a helpful indicator about what caused it: the compile
time *improved* by 44%!!! This is because the benchmark was tickling the
bad behavior of the old SROA pass that inspired a...
2013 Jul 14
6
[LLVMdev] Enabling the SLP vectorizer by default for -O3
...erformance measurements (below) I would like to enable the SLP-vectorizer by default on -O3. I would like to hear what others in the community think about this and give other people the opportunity to perform their own performance measurements.
— Performance Gains —
SingleSource/Benchmarks/Misc/matmul_f64_4x4 -53.68%
MultiSource/Benchmarks/Olden/power/power -18.55%
MultiSource/Benchmarks/TSVC/LoopRerolling-flt/LoopRerolling-flt -14.71%
SingleSource/Benchmarks/Misc/flops-6 -11.02%
SingleSource/Benchmarks/Misc/flops-5 -10.03%
MultiSource/Benchmarks/TSVC/LinearDependence-flt/LinearDependence-flt -8.37%...
2013 Dec 19
3
[LLVMdev] LLVM ARM VMLA instruction
Test case name:
>> llvm/projects/test-suite/SingleSource/Benchmarks/Misc/matmul_f64_4x4.c -
>> This is a 4x4 matrix multiplication; we can make small changes to turn it into a
>> 3x3 matrix multiplication to keep things simple to understand.
>>
>
> This is one very specific case. How does that behave on all other cases?
> Normally, every big improvement come...
2013 Jul 15
0
[LLVMdev] Enabling the SLP vectorizer by default for -O3
...ullet regression though? We should at least understand what is going wrong there. bh is pretty tiny, so it should be straight-forward. It would also be really useful to see what the code size and compile time impact is.
-Chris
>
> — Performance Gains —
> SingleSource/Benchmarks/Misc/matmul_f64_4x4 -53.68%
> MultiSource/Benchmarks/Olden/power/power -18.55%
> MultiSource/Benchmarks/TSVC/LoopRerolling-flt/LoopRerolling-flt -14.71%
> SingleSource/Benchmarks/Misc/flops-6 -11.02%
> SingleSource/Benchmarks/Misc/flops-5 -10.03%
> MultiSource/Benchmarks/TSVC/LinearDependence-flt/Lin...
2013 Jul 28
2
[LLVMdev] Enabling the SLP-vectorizer by default for -O3
...ent σ
MultiSource/Benchmarks/Olden/bh/bh 19.24% 1.1551 1.3773 0.0021
SingleSource/Benchmarks/SmallPT/smallpt 3.75% 5.8779 6.0983 0.0146
SingleSource/Benchmarks/Misc-C++/Large/ray 1.08% 1.8194 1.8390 0.0009
Performance Improvements - Execution Time Δ Previous Current σ
SingleSource/Benchmarks/Misc/matmul_f64_4x4 -53.67% 1.4064 0.6516 0.0007
External/Nurbs/nurbs -19.47% 2.5389 2.0445 0.0029
MultiSource/Benchmarks/Olden/power/power -18.49% 1.2572 1.0248 0.0004
SingleSource/Benchmarks/Misc/flops-4 -15.93% 0.7767 0.6530 0.0348
MultiSource/Benchmarks/TSVC/LoopRerolling-flt/LoopRerolling-flt -14.72% 2.3925 2.040...
2012 Sep 24
1
[LLVMdev] Heads up! New SROA implementation is going on-by-default today!
...
>
> Most of this is very, very green. There are three somewhat worrisome
> regressions in execution time:
>
> 1) sse_expandfft -- when I build this, the binaries have no differences
> before and after
>
> 2) sse_stepfft -- ditto
>
> 3) matmul_f64_4x4 -- this one is interesting
>
> The last one represents the only real regressions I expect to see with the
> new pass. There is a helpful indicator about what caused it: the compile
> time *improved* by 44%!!! This is because the benchmark was tickling the
> bad be...
2013 Jul 15
3
[LLVMdev] Enabling the SLP vectorizer by default for -O3
...e should at least understand what is going wrong there. bh is pretty tiny, so it should be straight-forward. It would also be really useful to see what the code size and compile time impact is.
>
> -Chris
>
>>
>> — Performance Gains —
>> SingleSource/Benchmarks/Misc/matmul_f64_4x4 -53.68%
>> MultiSource/Benchmarks/Olden/power/power -18.55%
>> MultiSource/Benchmarks/TSVC/LoopRerolling-flt/LoopRerolling-flt -14.71%
>> SingleSource/Benchmarks/Misc/flops-6 -11.02%
>> SingleSource/Benchmarks/Misc/flops-5 -10.03%
>> MultiSource/Benchmarks/TSVC/Line...
2013 Dec 20
0
[LLVMdev] LLVM ARM VMLA instruction
...rent.
> I don't think i
> will get A8 hardware soon, can someone please check it on A8 hardware as
> well (Sorry for the trouble)?
I've got a BeagleBone hanging around, and tested Clang against a
hacked version of itself (without the VMLx disabling on Cortex-A8).
The results (for matmul_f64_4x4, -O3 -mcpu=cortex-a8) were:
1. vfpv3-d16, stock Clang: 96.2s
2. vfpv3-d16, clang + vmla: 95.7s
3. vfpv3, stock clang: 82.9s
4. vfpv3, clang + vmla: 81.1s
Worth investigating more, but as the others have said nowhere near
enough data on its own. Especially since Evan clearly did some
benchmarking h...
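To put the four timings above on a common scale, the relative change between each stock/patched pair can be computed directly. The sketch below is plain arithmetic on the numbers quoted in the message; the helper names `relative_change_percent` and `print_vmla_deltas` are assumptions for illustration, not something from the thread.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: percentage change from `before` to `after`.
 * Negative values mean the patched build ran faster. */
double relative_change_percent(double before, double after)
{
    assert(before != 0.0);
    return (after - before) / before * 100.0;
}

/* Prints the change for the two configurations quoted in the message. */
void print_vmla_deltas(void)
{
    printf("vfpv3-d16: %.2f%%\n", relative_change_percent(96.2, 95.7));
    printf("vfpv3:     %.2f%%\n", relative_change_percent(82.9, 81.1));
}
```

With the quoted timings this works out to a reduction of roughly 0.5% for vfpv3-d16 and roughly 2.2% for vfpv3.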
2013 Jul 23
0
[LLVMdev] Enabling the SLP vectorizer by default for -O3
...tand what is going wrong there. bh is pretty tiny, so it should be straight-forward. It would also be really useful to see what the code size and compile time impact is.
>>
>> -Chris
>>
>>>
>>> — Performance Gains —
>>> SingleSource/Benchmarks/Misc/matmul_f64_4x4 -53.68%
>>> MultiSource/Benchmarks/Olden/power/power -18.55%
>>> MultiSource/Benchmarks/TSVC/LoopRerolling-flt/LoopRerolling-flt -14.71%
>>> SingleSource/Benchmarks/Misc/flops-6 -11.02%
>>> SingleSource/Benchmarks/Misc/flops-5 -10.03%
>>> MultiSource/...
2013 Jul 14
0
[LLVMdev] Enabling the SLP vectorizer by default for -O3
...low) I would like to enable the
> SLP-vectorizer by default on -O3. I would like to hear what others in the
> community think about this and give other people the opportunity to perform
> their own performance measurements.
>
> — Performance Gains —
> SingleSource/Benchmarks/Misc/matmul_f64_4x4 -53.68%
> MultiSource/Benchmarks/Olden/power/power -18.55%
> MultiSource/Benchmarks/TSVC/LoopRerolling-flt/LoopRerolling-flt -14.71%
> SingleSource/Benchmarks/Misc/flops-6 -11.02%
> SingleSource/Benchmarks/Misc/flops-5 -10.03%
> MultiSource/Benchmarks/TSVC/LinearDependence-flt/Lin...
2013 Jul 28
0
[LLVMdev] Enabling the SLP-vectorizer by default for -O3
...ource/Benchmarks/Olden/bh/bh 19.24% 1.1551 1.3773 0.0021
> SingleSource/Benchmarks/SmallPT/smallpt 3.75% 5.8779 6.0983 0.0146
> SingleSource/Benchmarks/Misc-C++/Large/ray 1.08% 1.8194 1.8390 0.0009
>
>
> Performance Improvements - Execution Time Δ Previous Current σ
> SingleSource/Benchmarks/Misc/matmul_f64_4x4 -53.67% 1.4064 0.6516 0.0007
> External/Nurbs/nurbs -19.47% 2.5389 2.0445 0.0029
> MultiSource/Benchmarks/Olden/power/power -18.49% 1.2572 1.0248 0.0004
> SingleSource/Benchmarks/Misc/flops-4 -15.93% 0.7767 0.6530 0.0348
> MultiSource/Benchmarks/TSVC/LoopRerolling-flt/LoopRerolling-flt-14.72%
> 2.39...
2013 Dec 19
0
[LLVMdev] LLVM ARM VMLA instruction
On 19 December 2013 08:50, suyog sarda <sardask01 at gmail.com> wrote:
> It may seem that the total number of cycles is more or less the same for a single
> vmla and for vmul+vadd. However, when the vmul+vadd combination is used instead of
> vmla, intermediate results are generated that need to be stored
> in memory for future access. This will lead to a lot of load/store ops being
>
2013 Dec 19
4
[LLVMdev] LLVM ARM VMLA instruction
Hi Tim,
> > cortex-a15 vfpv4 : vmla instruction emitted (which is a NEON instruction)
>
> I get a VFP vmla here rather than a NEON one (clang -target
> armv7-linux-gnueabihf -mcpu=cortex-a15): "vmla.f32 s0, s1, s2". Are
> you seeing something different?
>
As per Renato's comment above, the vmla instruction is a NEON instruction while
vfma is a VFP instruction. Correct
2018 Apr 26
0
Compare test-suite benchmarks performance compiled without TBAA, with default TBAA and with new TBAA struct path
...1181| 0.56538205| -0.02| 3451931174| 0|0.565291149| 0| 3451931174| 0|
|SingleSource/Benchmarks/Misc/mandel.test | 87|0.384127781| 2765575042|0.384125541| 0| 2765575035| 0|0.384110649| 0| 2765575035| 0|
|SingleSource/Benchmarks/Misc/matmul_f64_4x4.test | 71|0.489191861| 6750041890|0.489258207| -0.01| 6750041883| 0|0.489120899| 0.01| 6750041883| 0|
|SingleSource/Benchmarks/Misc/oourafft.test | 40|2.067325053|26308061552|1.997489314| 3.5|25399500162| 3.58|1.99824...
2015 Feb 26
5
[LLVMdev] [RFC] AArch64: Should we disable GlobalMerge?
Hi all,
I've started looking at the GlobalMerge pass, enabled by default on
ARM and AArch64. I think we should reconsider that, at least for
AArch64.
As is, the pass just merges all globals together, in groups of 4KB
on AArch64 (128B on ARM).
At the time it was enabled, the general thinking was "it's almost
free, it doesn't affect performance much, we might as well use it".
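As a rough source-level analogy for what merging globals means (the pass itself works on LLVM IR, not on C source), the sketch below places two variables in one aggregate so that both can be reached from a single base address plus a constant offset. All names here (`merged_globals`, `counter`, `limit`, `bump_counter`, `limit_offset`) are hypothetical and chosen only for illustration.

```c
#include <stddef.h>

/* Before merging (conceptually): two independent globals, each with its
 * own address.
 *   static int counter;
 *   static int limit;
 */

/* After merging (conceptually): one aggregate; members sit at fixed offsets. */
struct merged_globals {
    int counter;
    int limit;
};

static struct merged_globals merged = { 0, 10 };

/* Both accesses go through the same base object. */
int bump_counter(void)
{
    if (merged.counter < merged.limit) {
        merged.counter += 1;
    }
    return merged.counter;
}

/* Offset of `limit` from the shared base, in bytes. */
size_t limit_offset(void)
{
    return offsetof(struct merged_globals, limit);
}
```

The group size limits quoted in the message would correspond to a cap on the size of such an aggregate.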
2014 Jan 28
3
[LLVMdev] MergeFunctions: reduce complexity to O(log(N))
Hi Stepan,
Sorry for the delay. It's great that you are working on MergeFunctions
as well and I agree, we should definitely try to combine our efforts to
improve MergeFunctions.
Just to give you some context, the pass (with the similar function
merging patch) is already being used in a production setting. From my
point of view, it would be better if we focus on improving its
capability
2014 Jan 30
3
[LLVMdev] MergeFunctions: reduce complexity to O(log(N))
...mandel-2.ll 4 9495 0 0.01 9480 0 0.01 9480
mandel.ll 4 10074 0 0.01 10059 0 0.01 10059
mandel-text.ll 1 9372 0 0.01 9351 0 0.01 9351
map.ll 1 7301 0 0.01 7270 0 0.01 7270
maskgen.ll 1 54157 0 0.01 54131 0 0.02 54131
mason.ll 10 49163 1 0.01 48786 * * *
matchpat.ll 1 30952 0 0.01 30926 0 0.01 30926
matmul_f64_4x4.ll 3 47703 0 0.01 47688 0 0.01 47688
matrix_dec.ll 9 60244 0 0.02 60221 0 0.02 59501
matrix_enc.ll 6 52336 0 0.01 52313 0 0.02 50612
matrix.ll 6 16895 0 0.01 16876 0 0.01 16876
matrixTranspose.ll 2 11809 0 0.01 11797 0 0.01 11797
maze.ll 18 150317 0 0.02 150294 0 0.02 150294
mb_access.ll 9 109296 0...