Hi Frank,

Thanks for working on this. Please look at vectorizeStoreChains. In this
function we process all of the stores in the function in buckets of 16
elements, because constructing consecutive stores is implemented using an
O(n^2) algorithm. You can try to increase this threshold to 128 and see if
it helps.

I also agree with Renato and Chad that adding a flag to tell the
SLP-vectorizer to put more effort (compile time) into the problem is a
good idea.

Thanks,
Nadav

> On Aug 8, 2014, at 8:27 AM, Frank Winter <fwinter at jlab.org> wrote:
>
> I changed the maximum recursion depth to 36 and then to 1000 (from the
> original value of 12), and it did not improve SLP's optimization
> capabilities on my input function. For example, the attached function is
> (by design) perfectly vectorizable into 4-packed single-precision SIMD
> load/add/store. The SLP vectorizer just does nothing with it.
>
> I ran
>
> opt -default-data-layout="e-m:e-i64:64-f80:128-n8:16:32:64-S128" -basicaa -slp-vectorizer -S < mod_vec_p_vec.ll
>
> with RecursionMaxDepth = 12, 36, and 1000.
>
> Thanks,
> Frank
>
>
> On 08/07/2014 12:57 PM, Renato Golin wrote:
>> On 7 August 2014 17:33, Chad Rosier <mcrosier at codeaurora.org> wrote:
>>> You might consider filing a bug (llvm.org/bugs) requesting a flag, but
>>> I don't know if the code owners want to expose such a flag.
>> I'm not sure that's a good idea as raw access to that limit, since there
>> are no guarantees that it will stay the same. But a flag turning on some
>> "aggressive" behaviour in SLP that would then change that threshold (and
>> maybe some others) would be a good flag to have, I think.
>>
>> This could be a way to deprecate the BB vectorizer faster than we would
>> otherwise, but that would depend on all missing BB features being
>> implemented in SLP.
>>
>> An item in bugzilla seems appropriate.
>>
>> cheers,
>> --renato
>
> <mod_vec_p_vec.ll>
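To make the cost of that bucketed scan concrete, here is a simplified,
self-contained sketch of the kind of loop Nadav is describing. The Store
struct and isConsecutive() are toy stand-ins for LLVM's StoreInst and the
SCEV-based adjacency check; this is not the actual vectorizeStoreChains
code.

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Store {
  const void *Base;  // base pointer of the access
  int64_t Offset;    // byte offset from that base
  uint64_t Size;     // size of the stored value in bytes
};

// Stand-in for the SCEV-based "are these two accesses adjacent?" query.
static bool isConsecutive(const Store &A, const Store &B) {
  return A.Base == B.Base && A.Offset + (int64_t)A.Size == B.Offset;
}

// Process the stores of a function in fixed-size buckets. Every pair inside
// a bucket is compared, so a bucket of size B costs O(B^2) adjacency
// queries -- which is why the real pass caps the bucket at 16, and why
// raising the cap toward 8192 blows up on a huge basic block.
static unsigned vectorizeStoreChainsSketch(const std::vector<Store> &Stores,
                                           size_t BucketSize = 16) {
  unsigned ConsecutivePairs = 0;
  for (size_t Begin = 0; Begin < Stores.size(); Begin += BucketSize) {
    size_t End = std::min(Stores.size(), Begin + BucketSize);
    for (size_t I = Begin; I < End; ++I)
      for (size_t J = Begin; J < End; ++J)
        if (I != J && isConsecutive(Stores[I], Stores[J]))
          ++ConsecutivePairs;  // the real pass would grow a chain here
  }
  return ConsecutivePairs;
}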
Hi Nadav,

Increasing the number of instructions per bucket indeed results in a
completely vectorized version of the given small function. For a
medium-sized function I had to increase the bucket size to 8192 to achieve
full vectorization.

If I then try this setup on one of my larger functions (containing one huge
basic block), it seems that the O(n^2) algorithm you were talking about
hits me hard: after 10 minutes in SLP I sent the termination signal by hand.

I remember the documentation saying that SLP is a quite versatile
vectorizer, so I was wondering: if I only need to vectorize the
loads/stores/arithmetic of a single basic block and nothing else, are there
any parts of SLP that I could deactivate in order to reduce the time needed
for optimization?

Thanks,
Frank


On 08/08/2014 01:13 PM, Nadav Rotem wrote:
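For reference, the hard-coded bucket size that is being edited by hand here
could in principle be exposed as a command-line knob with LLVM's cl::opt
machinery, along the lines of the flag Renato and Chad discussed. This is
only a sketch; the flag name slp-store-bucket-size is hypothetical and does
not exist in the tree.

// Hypothetical command-line knob for the store-bucket size.
#include "llvm/Support/CommandLine.h"
using namespace llvm;

static cl::opt<unsigned> SLPStoreBucketSize(
    "slp-store-bucket-size", cl::init(16), cl::Hidden,
    cl::desc("Number of stores per bucket when searching for consecutive "
             "store chains (larger values cost O(n^2) compile time)"));

// Inside the pass, the hard-coded 16 would be replaced by
// SLPStoreBucketSize, so experiments like the one above could be run as:
//   opt -basicaa -slp-vectorizer -slp-store-bucket-size=8192 -S < input.ll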
Hi Frank,

> On Aug 8, 2014, at 12:16 PM, Frank Winter <fwinter at jlab.org> wrote:
>
> Increasing the number of instructions per bucket indeed results in a
> completely vectorized version of the given small function. For a
> medium-sized function I had to increase the bucket size to 8192 to
> achieve full vectorization.

This means that you had at least 8192 stores in that medium-sized
function. :) I wonder why it is necessary to vectorize two stores with
thousands of other store instructions in between. Are you sure you need
such a high number?

> If I then try this setup on one of my larger functions (containing one
> huge basic block), it seems that the O(n^2) algorithm you were talking
> about hits me hard: after 10 minutes in SLP I sent the termination signal
> by hand.
>
> I remember the documentation saying that SLP is a quite versatile
> vectorizer, so I was wondering: if I only need to vectorize the
> loads/stores/arithmetic of a single basic block and nothing else, are
> there any parts of SLP that I could deactivate in order to reduce the
> time needed for optimization?

The SLP vectorizer starts by finding sequences of consecutive stores, and
it sounds like most of the time is spent in the code that finds consecutive
store sequences. You could probably verify this assumption using a profiler.

I think one way to move forward would be to improve the routine that finds
consecutive stores. At the moment we iterate over all possible combinations
and use SCEV to figure out whether two stores are consecutive. We can't
simply "sort" the stores by their distance from some location because there
could be multiple base indices. One way to improve the search is to sort
the store instructions into buckets according to their memory type.

Thanks,
Nadav
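A minimal sketch of that bucket-and-sort idea: group the stores by base
pointer and access size, sort each bucket by offset, and a single linear
walk then yields the consecutive chains in O(n log n) per bucket instead of
O(n^2) pairwise queries. The types are toy stand-ins; the real pass would
key the buckets on the underlying pointer object and the stored type, and
would still need SCEV for offsets that are not compile-time constants.

#include <algorithm>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

struct Store {
  const void *Base;  // base pointer of the access
  int64_t Offset;    // byte offset from that base
  uint64_t Size;     // size of the stored value in bytes
};

static std::vector<std::vector<Store>>
collectConsecutiveChains(const std::vector<Store> &Stores) {
  // Bucket by (base pointer, access size) so unrelated stores are never
  // compared against each other.
  std::map<std::pair<const void *, uint64_t>, std::vector<Store>> Buckets;
  for (const Store &S : Stores)
    Buckets[{S.Base, S.Size}].push_back(S);

  std::vector<std::vector<Store>> Chains;
  for (auto &Entry : Buckets) {
    std::vector<Store> &B = Entry.second;
    // O(m log m) per bucket.
    std::sort(B.begin(), B.end(),
              [](const Store &X, const Store &Y) { return X.Offset < Y.Offset; });
    Chains.push_back({B.front()});
    for (size_t I = 1; I < B.size(); ++I) {
      if (B[I].Offset == B[I - 1].Offset + (int64_t)B[I - 1].Size)
        Chains.back().push_back(B[I]);  // extends the current chain
      else
        Chains.push_back({B[I]});       // starts a new chain
    }
  }
  return Chains;
}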