Displaying 4 results from an estimated 4 matches for "harness33".
2013 Jul 10
2
[LLVMdev] unaligned AVX store gets split into two instructions
...a single kernel (kernel.ll), which does a
fixed-size matrix-matrix multiply:
# ~/llvm-32-final/bin/llc kernel.ll -o kernel32.s
# ~/llvm-33-final/bin/llc kernel.ll -o kernel33.s
# ~/llvm-32-final/bin/clang++ harness.cpp kernel32.s -o harness32
# ~/llvm-32-final/bin/clang++ harness.cpp kernel33.s -o harness33
# time ./harness32
real 0m0.584s
user 0m0.581s
sys 0m0.001s
# time ./harness33
real 0m0.730s
user 0m0.725s
sys 0m0.001s
If you look at kernel33.s, it has a register spill/reload in the inner
loop. This doesn't appear in the llvm 3.2 version and disappears from the
3.3 version if you remove the...
2013 Sep 19
0
[LLVMdev] unaligned AVX store gets split into two instructions
...which does a
> fixed-size matrix-matrix multiply:
>
> # ~/llvm-32-final/bin/llc kernel.ll -o kernel32.s
> # ~/llvm-33-final/bin/llc kernel.ll -o kernel33.s
> # ~/llvm-32-final/bin/clang++ harness.cpp kernel32.s -o harness32
> # ~/llvm-32-final/bin/clang++ harness.cpp kernel33.s -o harness33
> # time ./harness32
> real 0m0.584s
> user 0m0.581s
> sys 0m0.001s
> # time ./harness33
> real 0m0.730s
> user 0m0.725s
> sys 0m0.001s
>
> If you look at kernel33.s, it has a register spill/reload in the inner
> loop. This doesn't appear in the llvm 3.2 version...
2013 Jul 10
0
[LLVMdev] unaligned AVX store gets split into two instructions
Thanks for all the info! I'm still in the process of narrowing down the
performance difference in my code. I'm no longer convinced it's related
only to the unaligned loads/stores, since extracting this part of the
kernel makes the performance difference disappear. I will try to narrow
down what is going on, and if it seems related to LLVM, I will post an example.
Thanks again,
Zach
2013 Jul 10
3
[LLVMdev] unaligned AVX store gets split into two instructions
Hi,
Yes. On Sandybridge, 256-bit loads/stores are double pumped. This means that they go in one after the other, in two cycles. On Haswell, the memory ports are wide enough to allow a 256-bit memory operation in one cycle. So, on Sandybridge we split unaligned memory operations into two 128-bit parts to allow them to execute on two separate ports. This is also what GCC and ICC do.
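For readers who want to see the two forms side by side, here is a rough source-level sketch of the split described above. It is only an illustration, not what the LLVM backend literally does: the real transformation happens during instruction selection. The function names (`copy8_single`, `copy8_split`, `split_matches_single`) are made up for this example, and the `target("avx")` attribute and `__builtin_cpu_supports` are GCC/Clang extensions assumed to be available on an x86-64 toolchain.

```c
#include <immintrin.h>
#include <string.h>

/* Hypothetical helper: copy 8 floats with one unaligned 256-bit store. */
__attribute__((target("avx")))
void copy8_single(float *dst, const float *src) {
    __m256 v = _mm256_loadu_ps(src);
    _mm256_storeu_ps(dst, v);
}

/* Hypothetical helper: the same copy, with the store written as two
 * unaligned 128-bit halves (low half first, then high half). */
__attribute__((target("avx")))
void copy8_split(float *dst, const float *src) {
    __m256 v = _mm256_loadu_ps(src);
    _mm_storeu_ps(dst, _mm256_castps256_ps128(v));       /* elements 0..3 */
    _mm_storeu_ps(dst + 4, _mm256_extractf128_ps(v, 1)); /* elements 4..7 */
}

/* Returns 1 if both variants write identical bytes, 0 if they differ,
 * and -1 if the CPU running this does not report AVX support. */
int split_matches_single(void) {
    float src[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    float a[9] = {0}, b[9] = {0};
    if (!__builtin_cpu_supports("avx"))
        return -1;
    /* Offset by one float so the destination is not 32-byte aligned
     * relative to the start of the array. */
    copy8_single(a + 1, src);
    copy8_split(b + 1, src);
    return memcmp(a, b, sizeof a) == 0;
}
```

Both variants store the same 32 bytes; the difference is only in how many store instructions are issued, which is the trade-off discussed in this thread.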
It is very