Hi,

I found, with the help of the optimization remarks, a loop that could not be vectorized, but if loop distribution was enabled this may happen, which it in fact did, with a very significant benchmark improvement (~25%).

I tried (on SystemZ) to enable this pass, and found that it only affected a handful of files on SPEC. This means I could enable this without worrying about any regressions, on SystemZ at least, currently.

I wonder if there is something more to know about this. It seems that no other target has enabled this due to generally mixed results, or? Is this triggering much more on other targets, and if so, why?

/Jonas
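For readers following the thread, here is a minimal sketch of the kind of loop being discussed. It is an illustration only, not the loop from the benchmark (which is not shown in this thread), and the function names `fused` and `distributed` are made up for the example. The first loop mixes a sequential recurrence with an independent statement, so it cannot be vectorized as a whole. The second version is what loop distribution would conceptually produce: the recurrence stays scalar and the independent statement gets its own loop, which the vectorizer can then handle.

```cpp
#include <cstddef>
#include <vector>

// Original form: a[i] depends on a[i - 1] (a loop-carried dependence),
// which prevents vectorizing the loop as a whole, even though the
// statement writing c[] is independent of it.
void fused(std::vector<int> &a, const std::vector<int> &b,
           std::vector<int> &c, const std::vector<int> &d) {
  for (std::size_t i = 1; i < a.size(); ++i) {
    a[i] = a[i - 1] + b[i]; // sequential recurrence
    c[i] = d[i] * 2;        // independent of the recurrence
  }
}

// Distributed form (written by hand here): same results, but the second
// loop no longer contains the recurrence and is a vectorization candidate.
// Assumes a, b, c and d are distinct, equally sized arrays.
void distributed(std::vector<int> &a, const std::vector<int> &b,
                 std::vector<int> &c, const std::vector<int> &d) {
  for (std::size_t i = 1; i < a.size(); ++i)
    a[i] = a[i - 1] + b[i]; // stays scalar
  for (std::size_t i = 1; i < a.size(); ++i)
    c[i] = d[i] * 2;        // can now be vectorized on its own
}
```

Both functions compute the same values; only the loop structure differs, which is what makes the second loop visible to the vectorizer.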
On Thu, 13 Sep 2018 at 09:22, Jonas Paulsson <paulsson at linux.vnet.ibm.com> wrote:
> I found with the help of the optimization remarks a loop that could not
> be vectorized, but if loop distribution was enabled this may happen,
> which it in fact did with a very significant benchmark improvement (~25%).

Hi Jonas,

That's not surprising, given that LD only tries to enable vectorisation. Performance improvements of course depend on the target and on the quality of LLVM's lowering and further vectorisation.

> I tried (on SystemZ) to enable this pass, and found that it only
> affected a handful of files on SPEC. This means I could enable this
> without worrying about any regressions on SystemZ at least currently.

IIUC, it's all about compile time. Loop distribution analysis is not terribly complex, but it does have a cost (see [1]). I don't think it will cause many regressions because it's *very* conservative (see [2]), perhaps too much so.

It shouldn't be too much of a problem for SystemZ, but I'd wait for others closer to the LD pass to chime in before taking any decision. :)

> I wonder if there is something more to know about this. It seems that no
> other target has enabled this due to general mixed results, or? Is this
> triggering much more on other targets, and if so, why?

I think it's mostly about the success rate, given it's too conservative. But in the past 2 years, improvements in (and around) the LV have been slowed down a bit due to the move to VPlan.

Actually, I imagine LD would be a great candidate to be a VPlan-to-VPlan pass, so that it can be combined with others in the cost analysis, given that it's mostly meant to enable loop vectorisation.

Adding some VPlan folks in CC.

--
cheers,
--renato

[1] http://lists.llvm.org/pipermail/llvm-dev/2017-January/109188.html
[2] http://lists.llvm.org/pipermail/llvm-dev/2016-October/105766.html
Jonas/Renato,

> I think it's mostly about the success rate, given it's too conservative.
> But in the past 2 years, improvements in (and around) the LV have been
> slowed down a bit due to the move to VPlan.

It wasn't our intention to slow down LV improvements, but if the project ended up causing other developers to take a wait-and-see stance, that's an inevitable side effect of any infrastructure-level work. We welcome others to work with us to move things faster. I hope everyone will see that the end result is well worth the pain it has caused.

> Actually, I imagine LD would be a great candidate to be a VPlan-to-VPlan
> pass, so that it can be combined with others in the cost analysis, given
> that it's mostly meant to enable loop vectorisation.

There are other reasons why LD is good on its own, but I certainly agree that LD shines more when it enables vectorization. From my perspective, however, there is value in the standalone LD, and in many cases vectorization-oriented LD can still happen there. Performing LD in VPlan-to-VPlan would improve the precision of the cost modeling, but given that the vectorizer's cost model is "ball park"-based to begin with (we have a lot of optimizers running downstream!), the extra precision will only be worth so much.

I have a thought about moving the vectorizer's analysis part (all the way to the cost model) into Analysis. When extra precision is desired, we can utilize such a (heavier-weight) Analysis. In short, my preference is to make the vectorizer's analysis more usable by other xforms rather than making more and more loop xforms happen inside LV.

In the meantime, if those who are working on LD need our input in tuning the LD cost model, I'm more than happy to pitch in. We can also discuss which parts of the vectorizer analysis are helpful in LD at the same time.
Thanks,
Hideki

-----Original Message-----
From: Renato Golin [mailto:renato.golin at linaro.org]
Sent: Thursday, September 13, 2018 1:48 AM
To: Jonas Paulsson <paulsson at linux.vnet.ibm.com>
Cc: LLVM Dev <llvm-dev at lists.llvm.org>; Adam Nemet <anemet at apple.com>; Sanjay Patel <spatel at rotateright.com>; Ulrich Weigand <ulrich.weigand at de.ibm.com>; Saito, Hideki <hideki.saito at intel.com>; Zaks, Ayal <ayal.zaks at intel.com>; Caballero, Diego <diego.caballero at intel.com>; Florian Hahn <florian.hahn at arm.com>
Subject: Re: Loop Distribution pass
> On Sep 13, 2018, at 1:21 AM, Jonas Paulsson <paulsson at linux.vnet.ibm.com> wrote:
>
> Hi,
>
> I found with the help of the optimization remarks a loop that could not be vectorized, but if loop distribution was enabled this may happen, which it in fact did with a very significant benchmark improvement (~25%).
>
> I tried (on SystemZ) to enable this pass, and found that it only affected a handful of files on SPEC. This means I could enable this without worrying about any regressions on SystemZ at least currently.
>
> I wonder if there is something more to know about this. It seems that no other target has enabled this due to general mixed results, or? Is this triggering much more on other targets, and if so, why?

The main thing that is missing from the pass right now is a serious analysis of profitability as it affects instruction- and memory-level parallelism. The easiest way to see this is that LD is the reverse transformation of loop fusion, so where LF helps, LD may regress. MLP is the big one in my opinion, and it could totally reverse any gains from vectorization.

We would probably have to do things similar to the SW prefetch insertion pass in order to analyze accesses that are likely to be skipped by the HW prefetcher. Needless to say, this is a very micro-architecture-specific analysis/cost model. If we can establish that ILP/MLP is unaffected, even in the simplest cases, and vectorization is enabled, we could enable the transformation by default (in addition to the pragma-driven approach we have now).

Adam

> /Jonas
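As a hedged illustration of the pragma-driven approach mentioned above: Clang accepts `#pragma clang loop distribute(enable)` directly before a loop to request distribution for that loop only. The function name `kernel` and the loop body below are made up for the example; the `__clang__` guard is only there so that other compilers skip the pragma.

```cpp
#include <cstddef>

// Requests loop distribution for this one loop. The recurrence on a[]
// blocks vectorization of the loop as written; after distribution the
// statement writing c[] can end up in a loop of its own.
void kernel(int *a, const int *b, int *c, const int *d, std::size_t n) {
#if defined(__clang__)
#pragma clang loop distribute(enable)
#endif
  for (std::size_t i = 1; i < n; ++i) {
    a[i] = a[i - 1] + b[i]; // loop-carried dependence
    c[i] = d[i] * 2;        // independent statement
  }
}
```

To check whether the pass actually fired, the optimization remarks can be requested with something like `clang++ -O2 -Rpass=loop-distribute -Rpass-missed=loop-distribute -Rpass-analysis=loop-distribute file.cpp`. For experiments across a whole translation unit, the pass can also be switched on with `-mllvm -enable-loop-distribute`. Whether distribution then pays off is exactly the ILP/MLP question raised above, which the remarks do not answer.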
Hi Adam,

On 2018-09-19 19:26, Adam Nemet wrote:
>> On Sep 13, 2018, at 1:21 AM, Jonas Paulsson <paulsson at linux.vnet.ibm.com> wrote:
>>
>> Hi,
>>
>> I found with the help of the optimization remarks a loop that could not be vectorized, but if loop distribution was enabled this may happen, which it in fact did with a very significant benchmark improvement (~25%).
>>
>> I tried (on SystemZ) to enable this pass, and found that it only affected a handful of files on SPEC. This means I could enable this without worrying about any regressions on SystemZ at least currently.
>>
>> I wonder if there is something more to know about this. It seems that no other target has enabled this due to general mixed results, or? Is this triggering much more on other targets, and if so, why?
> The main thing that is missing from the pass right now is a serious analysis of profitability as it affects instruction- and memory-level parallelism. The easiest way to see this is that LD is the reverse transformation of loop fusion, so where LF helps, LD may regress. MLP is the big one in my opinion, and it could totally reverse any gains from vectorization.
>
> We would probably have to do things similar to the SW prefetch insertion pass in order to analyze accesses that are likely to be skipped by the HW prefetcher. Needless to say, this is a very micro-architecture-specific analysis/cost model. If we can establish that ILP/MLP is unaffected, even in the simplest cases, and vectorization is enabled, we could enable the transformation by default (in addition to the pragma-driven approach we have now).

Thanks for the reply. Since this is today extremely conservative and nearly never triggers, at least on SystemZ, while still being very beneficial when it does happen, it seems that this could be used as-is now on SystemZ, with a new TTI hook to enable it selectively per target. The question now is whether this is a wise idea.
Do you think the Loop Distribution pass will change significantly, so that it triggers much more often, which may then cause regressions on SystemZ? If that is the case, perhaps the idea for now is that nobody enables it by default until some initial reasonable cost modeling has been done?

/Jonas
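To make the per-target proposal concrete, here is a rough, self-contained sketch of how such a hook could gate the pass. Everything in it is hypothetical: `TargetHooks`, `enableLoopDistribution`, `SystemZLikeTarget`, `LoopHint` and `shouldRunLoopDistribute` are names invented for this illustration and are not existing LLVM APIs. It assumes that an explicit pragma on the loop should always win over the target's default.

```cpp
// Hypothetical stand-in for a TTI-style query interface.
struct TargetHooks {
  virtual ~TargetHooks() = default;
  // Default: targets do not opt in to loop distribution.
  virtual bool enableLoopDistribution() const { return false; }
};

// Hypothetical target that opts in, as proposed for SystemZ.
struct SystemZLikeTarget : TargetHooks {
  bool enableLoopDistribution() const override { return true; }
};

// What the user wrote on the loop, if anything.
enum class LoopHint { None, Enable, Disable };

// Decide whether to attempt distribution on a loop: an explicit hint
// overrides the target; otherwise the target's default applies.
bool shouldRunLoopDistribute(const TargetHooks &target, LoopHint hint) {
  if (hint == LoopHint::Enable)
    return true;
  if (hint == LoopHint::Disable)
    return false;
  return target.enableLoopDistribution();
}
```

With this shape, a target can turn the pass on by default without changing behaviour anywhere else, and the existing pragma keeps working as an override in both directions.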