thr3ads.net - search: "matmul_f64

Displaying 19 results from an estimated 19 matches for "matmul_f64_4x4".

2013 Dec 19

[LLVMdev] LLVM ARM VMLA instruction

On 19 December 2013 11:16, suyog sarda <sardask01 at gmail.com> wrote: > Test case name: > llvm/projects/test-suite/SingleSource/Benchmarks/Misc/matmul_f64_4x4.c - > This is a 4x4 matrix multiplication; we can make small changes to turn it into a > 3x3 matrix multiplication to keep things simple to understand. > This is one very specific case. How does that behave on all other cases? Normally, every big improvement comes with a cost, and if you...

2013 Dec 19

[LLVMdev] LLVM ARM VMLA instruction

...m and David said and I agree, without hard data, anything we say > might be used against us. ;) > > Sorry folks, I didn't specify the actual test case and results in detail previously. The details are as follows: Test case name: llvm/projects/test-suite/SingleSource/Benchmarks/Misc/matmul_f64_4x4.c - This is a 4x4 matrix multiplication; we can make small changes to turn it into a 3x3 matrix multiplication to keep things simple to understand. clang version: trunk version (latest as of today, 19 Dec 2013) GCC version: 4.5 (I checked with 4.8 as well) flags passed to both gcc and clang: -m...
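For readers who have not seen the benchmark, the kind of kernel under discussion can be sketched as follows. This is a minimal illustration of a 4x4 double-precision matrix multiplication, not the actual test-suite source; the names `DIM` and `matmul_4x4` are hypothetical.

```c
#include <stddef.h>

#define DIM 4  /* hypothetical name; change to 3 for the 3x3 variant */

/* Computes c = a * b for DIM x DIM matrices of doubles (row-major).
   Illustrative sketch only; the real matmul_f64_4x4.c may differ. */
void matmul_4x4(double a[DIM][DIM], double b[DIM][DIM], double c[DIM][DIM])
{
    for (size_t i = 0; i < DIM; ++i) {
        for (size_t j = 0; j < DIM; ++j) {
            double sum = 0.0;
            for (size_t k = 0; k < DIM; ++k) {
                /* One multiply and one add per step: the pattern a
                   backend may or may not turn into a multiply-accumulate. */
                sum += a[i][k] * b[k][j];
            }
            c[i][j] = sum;
        }
    }
}
```

The inner statement `sum += a[i][k] * b[k][j]` is where the multiply-accumulate question in this thread arises.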

2012 Sep 22

[LLVMdev] Heads up! New SROA implementation is going on-by-default today!

...got by flipping it on and back off: http://llvm.org/perf/db_default/v4/nts/3963 Most of this is very, very green. There are three somewhat worrisome regressions in execution time: 1) sse_expandfft -- when I build this, the binaries have no differences before and after 2) sse_stepfft -- ditto 3) matmul_f64_4x4 -- this one is interesting The last one represents the only real regressions I expect to see with the new pass. There is a helpful indicator about what caused it: the compile time *improved* by 44%!!! This is because the benchmark was tickling the bad behavior of the old SROA pass that inspired a...

2013 Jul 14

[LLVMdev] Enabling the SLP vectorizer by default for -O3

...erformance measurements (below) I would like to enable the SLP-vectorizer by default on -O3. I would like to hear what others in the community think about this and give other people the opportunity to perform their own performance measurements. — Performance Gains — SingleSource/Benchmarks/Misc/matmul_f64_4x4 -53.68% MultiSource/Benchmarks/Olden/power/power -18.55% MultiSource/Benchmarks/TSVC/LoopRerolling-flt/LoopRerolling-flt -14.71% SingleSource/Benchmarks/Misc/flops-6 -11.02% SingleSource/Benchmarks/Misc/flops-5 -10.03% MultiSource/Benchmarks/TSVC/LinearDependence-flt/LinearDependence-flt -8.37%...

2013 Dec 19

[LLVMdev] LLVM ARM VMLA instruction

Test case name: >> llvm/projects/test-suite/SingleSource/Benchmarks/Misc/matmul_f64_4x4.c - >> This is a 4x4 matrix multiplication; we can make small changes to turn it into a >> 3x3 matrix multiplication to keep things simple to understand. >> > > This is one very specific case. How does that behave on all other cases? > Normally, every big improvement come...

2013 Jul 15

[LLVMdev] Enabling the SLP vectorizer by default for -O3

...ullet regression though? We should at least understand what is going wrong there. bh is pretty tiny, so it should be straight-forward. It would also be really useful to see what the code size and compile time impact is. -Chris > > — Performance Gains — > SingleSource/Benchmarks/Misc/matmul_f64_4x4 -53.68% > MultiSource/Benchmarks/Olden/power/power -18.55% > MultiSource/Benchmarks/TSVC/LoopRerolling-flt/LoopRerolling-flt -14.71% > SingleSource/Benchmarks/Misc/flops-6 -11.02% > SingleSource/Benchmarks/Misc/flops-5 -10.03% > MultiSource/Benchmarks/TSVC/LinearDependence-flt/Lin...

2013 Jul 28

[LLVMdev] Enabling the SLP-vectorizer by default for -O3

...ent σ MultiSource/Benchmarks/Olden/bh/bh 19.24% 1.1551 1.3773 0.0021 SingleSource/Benchmarks/SmallPT/smallpt 3.75% 5.8779 6.0983 0.0146 SingleSource/Benchmarks/Misc-C++/Large/ray 1.08% 1.8194 1.8390 0.0009 Performance Improvements - Execution Time Δ Previous Current σ SingleSource/Benchmarks/Misc/matmul_f64_4x4 -53.67% 1.4064 0.6516 0.0007 External/Nurbs/nurbs -19.47% 2.5389 2.0445 0.0029 MultiSource/Benchmarks/Olden/power/power -18.49% 1.2572 1.0248 0.0004 SingleSource/Benchmarks/Misc/flops-4 -15.93% 0.7767 0.6530 0.0348 MultiSource/Benchmarks/TSVC/LoopRerolling-flt/LoopRerolling-flt -14.72% 2.3925 2.040...
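The Δ column in this report appears consistent with the relative change between the Previous and Current execution times. A minimal sketch, assuming Δ = (Current − Previous) / Previous expressed in percent; the function name `percent_change` is hypothetical and not part of any LLVM tooling.

```c
/* Relative change of `current` versus `previous`, in percent.
   Assumption: this matches how the report's delta column is derived.
   Negative values mean the execution time went down. */
double percent_change(double previous, double current)
{
    return (current - previous) / previous * 100.0;
}
```

With the matmul_f64_4x4 row above, `percent_change(1.4064, 0.6516)` comes out at roughly -53.67, and the bh row (`1.1551` to `1.3773`) at roughly 19.24, matching the listed figures to two decimals.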

2012 Sep 24

[LLVMdev] Heads up! New SROA implementation is going on-by-default today!

...> > Most of this is very, very green. There are three somewhat worrisome > regressions in execution time: > > 1) sse_expandfft -- when I build this, the binaries have no differences > before and after > > 2) sse_stepfft -- ditto > > 3) matmul_f64_4x4 -- this one is interesting > > The last one represents the only real regressions I expect to see with the > new pass. There is a helpful indicator about what caused it: the compile > time *improved* by 44%!!! This is because the benchmark was tickling the > bad be...

2013 Jul 15

[LLVMdev] Enabling the SLP vectorizer by default for -O3

...e should at least understand what is going wrong there. bh is pretty tiny, so it should be straight-forward. It would also be really useful to see what the code size and compile time impact is. > > -Chris > >> >> — Performance Gains — >> SingleSource/Benchmarks/Misc/matmul_f64_4x4 -53.68% >> MultiSource/Benchmarks/Olden/power/power -18.55% >> MultiSource/Benchmarks/TSVC/LoopRerolling-flt/LoopRerolling-flt -14.71% >> SingleSource/Benchmarks/Misc/flops-6 -11.02% >> SingleSource/Benchmarks/Misc/flops-5 -10.03% >> MultiSource/Benchmarks/TSVC/Line...

2013 Dec 20

[LLVMdev] LLVM ARM VMLA instruction

...rent. > I don't think i > will get A8 hardware soon, can someone please check it on A8 hardware as > well (Sorry for the trouble)? I've got a BeagleBone hanging around, and tested Clang against a hacked version of itself (without the VMLx disabling on Cortex-A8). The results (for matmul_f64_4x4, -O3 -mcpu=cortex-a8) were: 1. vfpv3-d16, stock Clang: 96.2s 2. vfpv3-d16, clang + vmla: 95.7s 3. vfpv3, stock clang: 82.9s 4. vfpv3, clang + vmla: 81.1s Worth investigating more, but as the others have said nowhere near enough data on its own. Especially since Evan clearly did some benchmarking h...

2013 Jul 23

[LLVMdev] Enabling the SLP vectorizer by default for -O3

...tand what is going wrong there. bh is pretty tiny, so it should be straight-forward. It would also be really useful to see what the code size and compile time impact is. >> >> -Chris >> >>> >>> — Performance Gains — >>> SingleSource/Benchmarks/Misc/matmul_f64_4x4 -53.68% >>> MultiSource/Benchmarks/Olden/power/power -18.55% >>> MultiSource/Benchmarks/TSVC/LoopRerolling-flt/LoopRerolling-flt -14.71% >>> SingleSource/Benchmarks/Misc/flops-6 -11.02% >>> SingleSource/Benchmarks/Misc/flops-5 -10.03% >>> MultiSource/...

2013 Jul 14

[LLVMdev] Enabling the SLP vectorizer by default for -O3

...low) I would like to enable the > SLP-vectorizer by default on -O3. I would like to hear what others in the > community think about this and give other people the opportunity to perform > their own performance measurements. > > — Performance Gains — > SingleSource/Benchmarks/Misc/matmul_f64_4x4 -53.68% > MultiSource/Benchmarks/Olden/power/power -18.55% > MultiSource/Benchmarks/TSVC/LoopRerolling-flt/LoopRerolling-flt -14.71% > SingleSource/Benchmarks/Misc/flops-6 -11.02% > SingleSource/Benchmarks/Misc/flops-5 -10.03% > MultiSource/Benchmarks/TSVC/LinearDependence-flt/Lin...

2013 Jul 28

[LLVMdev] Enabling the SLP-vectorizer by default for -O3

...ource/Benchmarks/Olden/bh/bh 19.24% 1.1551 1.3773 0.0021 > SingleSource/Benchmarks/SmallPT/smallpt 3.75% 5.8779 6.0983 0.0146 > SingleSource/Benchmarks/Misc-C++/Large/ray 1.08% 1.8194 1.8390 0.0009 > > > Performance Improvements - Execution Time Δ Previous Current σ > SingleSource/Benchmarks/Misc/matmul_f64_4x4 -53.67% 1.4064 0.6516 0.0007 > External/Nurbs/nurbs -19.47% 2.5389 2.0445 0.0029 > MultiSource/Benchmarks/Olden/power/power -18.49% 1.2572 1.0248 0.0004 > SingleSource/Benchmarks/Misc/flops-4 -15.93% 0.7767 0.6530 0.0348 > MultiSource/Benchmarks/TSVC/LoopRerolling-flt/LoopRerolling-flt -14.72% > 2.39...

2013 Dec 19

[LLVMdev] LLVM ARM VMLA instruction

On 19 December 2013 08:50, suyog sarda <sardask01 at gmail.com> wrote: > It may seem that the total number of cycles is more or less the same for a single > vmla and for vmul+vadd. However, when the vmul+vadd combination is used instead of > vmla, intermediate results are generated which need to be stored > in memory for future access. This will lead to a lot of load/store ops being >
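The two instruction sequences being compared correspond, at the source level, to the same multiply-then-add arithmetic. The sketch below writes it both ways; the function names are hypothetical, and whether a given compiler emits vmla or vmul+vadd for either form depends on the target, the flags, and backend heuristics, not on how the C is spelled.

```c
#include <stddef.h>

/* Multiply-accumulate written as a single expression. A backend that
   chooses to may lower the loop body to one multiply-accumulate
   instruction; that choice is an assumption, not a guarantee. */
double dot_mac(const double *x, const double *y, size_t n)
{
    double acc = 0.0;
    for (size_t i = 0; i < n; ++i) {
        acc += x[i] * y[i];
    }
    return acc;
}

/* The same arithmetic with the product held in a named temporary,
   mirroring a separate multiply followed by an add. */
double dot_mul_add(const double *x, const double *y, size_t n)
{
    double acc = 0.0;
    for (size_t i = 0; i < n; ++i) {
        double prod = x[i] * y[i];  /* intermediate result */
        acc = acc + prod;
    }
    return acc;
}
```

An optimizing compiler is free to generate identical code for both functions; the intermediate `prod` only makes visible the extra value that the quoted message is concerned about.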

2013 Dec 19

[LLVMdev] LLVM ARM VMLA instruction

Hi Tim, > > cortex-a15 vfpv4 : vmla instruction emitted (which is a NEON instruction) > > I get a VFP vmla here rather than a NEON one (clang -target > armv7-linux-gnueabihf -mcpu=cortex-a15): "vmla.f32 s0, s1, s2". Are > you seeing something different? > As per Renato's comment above, the vmla instruction is a NEON instruction while vfma is a VFP instruction. Correct

2018 Apr 26

Compare test-suite benchmarks performance compiled without TBAA, with default TBAA and with new TBAA struct path

...1181| 0.56538205| -0.02| 3451931174| 0|0.565291149| 0| 3451931174| 0| |SingleSource/Benchmarks/Misc/mandel.test | 87|0.384127781| 2765575042|0.384125541| 0| 2765575035| 0|0.384110649| 0| 2765575035| 0| |SingleSource/Benchmarks/Misc/matmul_f64_4x4.test | 71|0.489191861| 6750041890|0.489258207| -0.01| 6750041883| 0|0.489120899| 0.01| 6750041883| 0| |SingleSource/Benchmarks/Misc/oourafft.test | 40|2.067325053|26308061552|1.997489314| 3.5|25399500162| 3.58|1.99824...

2015 Feb 26

[LLVMdev] [RFC] AArch64: Should we disable GlobalMerge?

Hi all, I've started looking at the GlobalMerge pass, enabled by default on ARM and AArch64. I think we should reconsider that, at least for AArch64. As is, the pass just merges all globals together, in groups of 4KB (AArch64, 128B on ARM). At the time it was enabled, the general thinking was "it's almost free, it doesn't affect performance much, we might as well use it".

2014 Jan 28

[LLVMdev] MergeFunctions: reduce complexity to O(log(N))

Hi Stepan, Sorry for the delay. It's great that you are working on MergeFunctions as well and I agree, we should definitely try to combine our efforts to improve MergeFunctions. Just to give you some context, the pass (with the similar function merging patch) is already being used in a production setting. From my point of view, it would be better if we focus on improving its capability

2014 Jan 30

[LLVMdev] MergeFunctions: reduce complexity to O(log(N))

...mandel-2.ll 4 9495 0 0.01 9480 0 0.01 9480 mandel.ll 4 10074 0 0.01 10059 0 0.01 10059 mandel-text.ll 1 9372 0 0.01 9351 0 0.01 9351 map.ll 1 7301 0 0.01 7270 0 0.01 7270 maskgen.ll 1 54157 0 0.01 54131 0 0.02 54131 mason.ll 10 49163 1 0.01 48786 * * * matchpat.ll 1 30952 0 0.01 30926 0 0.01 30926 matmul_f64_4x4.ll 3 47703 0 0.01 47688 0 0.01 47688 matrix_dec.ll 9 60244 0 0.02 60221 0 0.02 59501 matrix_enc.ll 6 52336 0 0.01 52313 0 0.02 50612 matrix.ll 6 16895 0 0.01 16876 0 0.01 16876 matrixTranspose.ll 2 11809 0 0.01 11797 0 0.01 11797 maze.ll 18 150317 0 0.02 150294 0 0.02 150294 mb_access.ll 9 109296 0...
