Saito, Hideki via llvm-dev
2018-Sep-05 01:58 UTC
[llvm-dev] LoopVectorizer: shufflevectors
>> To me, this looks like something the LoopVectorizer is neglecting and
>> should be combining.
>
> It's not up to the vectoriser to combine code.
>
> But it could be up to the vectoriser to generate less bloated code,
> given it's a small change.
>
> That's my point above.

We should note that:

1) The Loop Vectorizer is not the only place that generates vectorized IR. For example, a programmer's intrinsic vector code, after inlining etc., might show the same problem. Any optimization added within LV won't be applied when other parts of the compiler generate vectorized IR.

2) The vectorizer's main job is generating widened vector code that is easier to optimize later on, not necessarily generating highly optimized vector code on its own.

3) Modeling the cost correctly (and, as a result, choosing a good VF) is a more important problem than performing the optimization within the vectorizer itself.

4) If the cost modeling takes the optimization into account, LV has a chance of generating optimized code. That doesn't necessarily mean LV should -- which brings us back to 1). The last thing we want is to make LV a gigantic monolithic optimizer that is too hard to maintain.

I think we should talk about how much complexity we would be adding for general "vectorized load/store optimization", and whether we should have a separate post-vectorizer optimizer doing it (while LV still needs to understand the cost modeling aspect of that optimization, in order to choose the right VF). This should include a discussion about moving the interleaved-memory-access optimization from LV to there. Adding a small new optimization here and there to LV can have a snowball effect.
Thanks,
Hideki

=============================

Date: Tue, 4 Sep 2018 18:57:17 +0100
From: Renato Golin via llvm-dev <llvm-dev at lists.llvm.org>
To:
Cc: LLVM Dev <llvm-dev at lists.llvm.org>, Ulrich Weigand <ulrich.weigand at de.ibm.com>
Subject: Re: [llvm-dev] LoopVectorizer: shufflevectors
Message-ID: <CAMSE1kcHuN4a-a1VTUdsyyVD_9aThZ6p_N8ZbPhW1H8KoxAJtg at mail.gmail.com>
Content-Type: text/plain; charset="UTF-8"

On Tue, 4 Sep 2018 at 17:35, Jonas Paulsson <paulsson at linux.vnet.ibm.com> wrote:
> > It's probably a lot simpler to improve the SystemZ model to "know" /
> > have the same arch flags / cost model completeness as the other
> > targets.
> I thought they were - anything particular in mind?

I have no idea about SystemZ, sorry. :)

From your post and response, it seems that both improving the target
info and the cost model are opening new ways to vectorise on SystemZ.
That's what I was referring to.

> This then made many more cases of interleaving happen (~450 cases on
> spec IIRC). Only problem was... the SystemZ backend could not handle
> those shuffles as well in all the cases. To me that looked like
> something to be fixed on the I/R level, and after discussions with
> Sanjay I got the impression that this was the case...

Right. Being fixed at IR level and that being done in the vectoriser
are two different things.

Our current implementation is too monolithic to be trying out branching
off the beaten path, and we're in the process of moving out (which can
still take years), so I don't recommend big refactorings on the code.

You could probably find a number of simplifications, taking target info
into consideration, that can later be ported to VPlan, but that will
require testing the vectorisation on the supported targets.
We don't need to re-benchmark everything again, just make sure the code
doesn't change for them, or if it does, to know why.

> To me, this looks like something the LoopVectorizer is neglecting and
> should be combining.

It's not up to the vectoriser to combine code.

But it could be up to the vectoriser to generate less bloated code,
given it's a small change.

That's my point above.

> I suppose with my patch for the Load -> Store
> groups, I could add also the handling of recomputed indices so that the
> load group produces a vector that fits the store group directly. But if
> I understand you correctly, even this is not so wise?

It will depend on how much that changes other targets, because what
looks less bloated can also mean patterns are not recognised any more
by other back-ends.

> And if so, then indeed improving the SystemZ DAGCombiner is the only
> alternative left, I guess...

You'll probably have to do that anyway, but I wouldn't try it unless I
had no other choice. :)

> But having the cost functions available is not enough to drive a later
> I/R pass to optimize the generated vector code? I mean if the target
> indicated which shuffles were expensive, that could then easily be avoided.

Sure, but "expensive" is a relative term and it's intimately linked to
what the back-end can combine.

If you're lucky enough that a mid-end change just happens to unbloat
shuffles and be correctly lowered, without breaking other targets, then
that's a big win.

--
cheers,
--renato
On Wed, 5 Sep 2018 at 02:58, Saito, Hideki <hideki.saito at intel.com> wrote:
> I think we should talk about how much complexity we would be adding
> for general "vectorized load/store optimization", and whether we
> should have a separate post-vectorizer optimizer doing it (while LV
> still needs to understand the cost modeling aspect of that
> optimization, in order to choose the right VF).

I imagine it would be a lot easier to plug loop-vectorisation-specific
clean-up passes into a VPlan model than it is today. But as you said,
this is only part of the vectorised code the middle end generates.

While LV could (potentially) generate less bloated code, which would
also help the clean-up passes do their jobs better, it will have to be
very conservative and extensively tested.

> This should include a discussion about moving interleave memory
> access optimization from LV to there. Adding a small new optimization
> here and there to LV can have a snowball effect.

I agree that interleaved access is not exclusive to loop vectorisation
and that it should be moved to a higher position (some of your patches
earlier this year come to mind).

But, as I said back then, before we do so, we need to understand
exactly where to put it. That will depend on which other passes will
actually use it, and on whether we want it to be a utility class, an
analysis pass, or both.

Have you compiled a list of passes that could benefit from such a move?

cheers,
--renato