Zach Devito
2013-Jul-10 09:12 UTC
[LLVMdev] unaligned AVX store gets split into two instructions
I've narrowed this down to a single kernel (kernel.ll), which does a
fixed-size matrix-matrix multiply:

# ~/llvm-32-final/bin/llc kernel.ll -o kernel32.s
# ~/llvm-33-final/bin/llc kernel.ll -o kernel33.s
# ~/llvm-32-final/bin/clang++ harness.cpp kernel32.s -o harness32
# ~/llvm-32-final/bin/clang++ harness.cpp kernel33.s -o harness33
# time ./harness32
real 0m0.584s
user 0m0.581s
sys 0m0.001s
# time ./harness33
real 0m0.730s
user 0m0.725s
sys 0m0.001s

If you look at kernel33.s, it has a register spill/reload in the inner
loop. This doesn't appear in the LLVM 3.2 version, and it disappears from
the 3.3 version if you remove the "align 8"s from kernel.ll, which are what
make the accesses unaligned. Do the two-instruction unaligned loads
increase register pressure? Or is something else going on?

Zach

On Tue, Jul 9, 2013 at 11:33 PM, Zach Devito <zdevito at stanford.edu> wrote:

> Thanks for all the info! I'm still in the process of narrowing down
> the performance difference in my code. I'm no longer convinced it's
> related only to the unaligned loads/stores, since extracting this part
> of the kernel makes the performance difference disappear. I will try to
> narrow down what is going on, and if it seems related to LLVM, I will
> post an example.
>
> Thanks again,
>
> Zach
>
> On Tue, Jul 9, 2013 at 10:15 PM, Nadav Rotem <nrotem at apple.com> wrote:
>
>> Hi,
>>
>> Yes. On Sandybridge 256-bit loads/stores are double pumped. This means
>> that they go in one after the other in two cycles. On Haswell the memory
>> ports are wide enough to allow a 256-bit memory operation in one cycle.
>> So, on Sandybridge we split unaligned memory operations into two 128-bit
>> parts to allow them to execute in two separate ports. This is also what
>> GCC and ICC do.
>>
>> It is very possible that the decision to split the wide vectors causes a
>> regression. If the memory ports are busy it is better to double-pump them
>> and save the cost of the insert/extract subvector. Unfortunately, during
>> ISel we don’t have a good way to estimate port pressure. In any case, it
>> is a good idea to revise the heuristics that I put in and to see if they
>> match the Sandybridge optimization guide. If I remember correctly the
>> optimization guide does not have too much information on this, but Elena
>> looked over it and said that it made sense.
>>
>> BTW, you can validate that this is the problem using the IACA tool. It
>> performs static analysis on your binary and tells you where the critical
>> path is.
>> http://software.intel.com/en-us/articles/intel-architecture-code-analyzer
>>
>> Thanks,
>> Nadav
>>
>> On Jul 9, 2013, at 10:01 PM, Eli Friedman <eli.friedman at gmail.com> wrote:
>>
>> On Tue, Jul 9, 2013 at 9:01 PM, Zach Devito <zdevito at gmail.com> wrote:
>>
>> I'm seeing a difference in how LLVM 3.3 and 3.2 emit unaligned vector
>> loads on AVX.
>> 3.3 is splitting up an unaligned vector load but in 3.2, it was emitted
>> as a single instruction (details below).
>> In a matrix-matrix inner-kernel, I see a ~25% decrease in performance,
>> which seems to be due to this.
>>
>> Any ideas why this changed? Thanks!
>>
>> This was intentional; apparently doing it with two instructions is
>> supposed to be faster. See r172868/r172894.
>>
>> Adding Nadav in case he has anything more to say.
>>
>> -Eli

Attachments:
harness.cpp (text/x-c++src, 346 bytes): <http://lists.llvm.org/pipermail/llvm-dev/attachments/20130710/86bbc835/attachment.cpp>
kernel.ll (application/octet-stream, 6787 bytes): <http://lists.llvm.org/pipermail/llvm-dev/attachments/20130710/86bbc835/attachment.obj>
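For readers who want to see the two lowering strategies from Nadav's description side by side, here is a minimal C++ sketch using AVX intrinsics. It is not taken from kernel.ll or harness.cpp; the function names (load_whole, load_split, loads_agree, check_misaligned_offsets) are made up for illustration, and it assumes an x86-64 machine with GCC or Clang. The intrinsics only suggest an instruction sequence, so a compiler may still merge or re-split these loads according to its own heuristics.

```cpp
// Sketch only: x86-64 with GCC or Clang assumed. Function names are
// illustrative and do not come from the thread or from LLVM.
#include <immintrin.h>
#include <cassert>
#include <cstring>

// One 256-bit unaligned load, typically a single vmovups into a ymm register.
__attribute__((target("avx")))
static __m256 load_whole(const float *p) {
    return _mm256_loadu_ps(p);
}

// Two 128-bit unaligned loads joined by an insert, typically a vmovups
// into an xmm register followed by vinsertf128.
__attribute__((target("avx")))
static __m256 load_split(const float *p) {
    __m128 lo = _mm_loadu_ps(p);      // elements 0..3
    __m128 hi = _mm_loadu_ps(p + 4);  // elements 4..7
    return _mm256_insertf128_ps(_mm256_castps128_ps256(lo), hi, 1);
}

// Loads the 8 floats starting at p both ways and compares the bytes.
__attribute__((target("avx")))
static bool loads_agree_avx(const float *p) {
    float whole[8], split[8];
    _mm256_storeu_ps(whole, load_whole(p));
    _mm256_storeu_ps(split, load_split(p));
    return std::memcmp(whole, split, sizeof whole) == 0;
}

// Wrapper that is safe to call from code built without -mavx.
// On a CPU without AVX there is nothing to compare, so it returns true.
bool loads_agree(const float *p) {
    if (!__builtin_cpu_supports("avx")) return true;
    return loads_agree_avx(p);
}

// Checks several starting offsets, most of them not 32-byte aligned.
void check_misaligned_offsets() {
    alignas(32) float buf[16];
    for (int i = 0; i < 16; ++i) buf[i] = static_cast<float>(i);
    for (int off = 0; off <= 8; ++off) assert(loads_agree(buf + off));
}
```

Both forms produce the same register contents. The split form needs an extra 128-bit temporary and an insert instruction per load, which is consistent with the insert/extract cost mentioned in the thread; whether that explains the spill seen in kernel33.s is the question the thread leaves open.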
Dmitry Babokin
2013-Sep-19 16:19 UTC
[LLVMdev] unaligned AVX store gets split into two instructions
Nadav,

We see multiple regressions after r172868 in the ISPC compiler (based on the
LLVM optimizer). The regressions are due to spills/reloads, which are in turn
caused by increased register pressure. This matches Zach's analysis. We've
filed bug 17285 for this problem.

Is there any way to avoid the split when several loads occur together?

Dmitry.
Dmitry Babokin
2013-Sep-19 17:40 UTC
[LLVMdev] unaligned AVX store gets split into two instructions
Update: the problem seems to be fixed by r190916.