Hal Finkel via llvm-dev
2018-Jul-24 12:45 UTC
[llvm-dev] [LoopVectorizer] Improving the performance of dot product reduction loop
On 07/24/2018 02:58 AM, Nema, Ashutosh wrote:
>
> *From:* Hal Finkel <hfinkel at anl.gov>
> *Sent:* Tuesday, July 24, 2018 5:05 AM
> *To:* Craig Topper <craig.topper at gmail.com>; hideki.saito at intel.com;
> estotzer at ti.com; Nemanja Ivanovic <nemanja.i.ibm at gmail.com>; Adam
> Nemet <anemet at apple.com>; graham.hunter at arm.com; Michael Kuperstein
> <mkuper at google.com>; Sanjay Patel <spatel at rotateright.com>; Simon
> Pilgrim <llvm-dev at redking.me.uk>; Nema, Ashutosh
> <Ashutosh.Nema at amd.com>; llvm-dev <llvm-dev at lists.llvm.org>
> *Subject:* Re: [llvm-dev] [LoopVectorizer] Improving the performance
> of dot product reduction loop
>
>> On 07/23/2018 06:23 PM, Hal Finkel via llvm-dev wrote:
>>
>>> On 07/23/2018 05:22 PM, Craig Topper wrote:
>>>
>>>> Hello all,
>>>>
>>>> This code https://godbolt.org/g/tTyxpf is a dot product reduction
>>>> loop multiplying sign-extended 16-bit values to produce a 32-bit
>>>> accumulated result. The x86 backend is currently not able to
>>>> optimize it as well as gcc and icc. The IR we are getting from the
>>>> loop vectorizer has several v8i32 adds and muls inside the loop.
>>>> These are fed by v8i16 loads and sexts from v8i16 to v8i32. The x86
>>>> backend recognizes that these are addition reductions of
>>>> multiplication, so we use the vpmaddwd instruction, which calculates
>>>> 32-bit products from 16-bit inputs and does a horizontal add of
>>>> adjacent pairs. A vpmaddwd given two v8i16 inputs will produce a
>>>> v4i32 result.
>>>>
>>>> In the example code, because we are reducing the number of elements
>>>> from 8 to 4 in the vpmaddwd step, we are left with a width mismatch
>>>> between vpmaddwd and the vpaddd instruction that we use to sum with
>>>> the results from the previous loop iterations. We rely on the fact
>>>> that a 128-bit vpmaddwd zeros the upper bits of the register, so we
>>>> can use a 256-bit vpaddd instruction and the upper elements can keep
>>>> going around the loop without being disturbed in case they weren't
>>>> initialized to 0. But this still means the vpmaddwd instruction is
>>>> doing half the work the CPU would be capable of had we been able to
>>>> use a 256-bit vpmaddwd instruction. Additionally, future x86 CPUs
>>>> will be gaining an instruction that can do VPMADDWD and VPADDD in
>>>> one instruction, but that width mismatch makes that instruction
>>>> difficult to utilize.
>>>>
>>>> In order for the backend to handle this better, it would be great if
>>>> we could have something like two v32i8 loads, two shufflevectors to
>>>> extract the even elements and the odd elements to create four v16i8
>>>> pieces.
>>>
>>> Why v*i8 loads? I thought that we have 16-bit and 32-bit types here?
>>>
>>>> Sign extend each of those pieces. Multiply the two even pieces and
>>>> the two odd pieces separately, and sum those results with a v8i32
>>>> add. Then another v8i32 add accumulates the previous loop
>>>> iterations. This ensures that no pieces exceed the target vector
>>>> width and that the final operation is correctly sized to go around
>>>> the loop in one register. All but the last add can then be pattern
>>>> matched to vpmaddwd as proposed in https://reviews.llvm.org/D49636.
>>>> And for the future CPU the whole thing can be matched to the new
>>>> instruction.
>>>>
>>>> Do other targets have a similar instruction or a similar issue to
>>>> this? Is this something we can solve in the loop vectorizer? Or
>>>> should we have a separate IR transformation that can recognize this
>>>> pattern and generate the new sequence? As a separate pass, we would
>>>> need to pair two vector loads together, remove a reduction step
>>>> outside the loop, and remove half the phis, assuming the loop was
>>>> partially unrolled. Or, if there was only one add/mul inside the
>>>> loop, we'd have to reduce its width and the width of the phi.
>>
>> Can you explain how the desired code from the vectorizer differs from
>> the code that the vectorizer produces if you add '#pragma clang loop
>> vectorize(enable) vectorize_width(16)' above the loop? I tried it in
>> your godbolt example and the generated code looks very similar to the
>> icc-generated code.
>
> The vectorizer considers the largest data type size in the loop body
> and so takes the maximum possible VF to be 8; hence, in this example it
> generates the 128-bit vpmaddwd.
>
> I'd like to share an observation: instead of forcing the vector factor
> to 16 (by using the pragma), I tried the option
> "vectorizer-maximize-bandwidth".
>
> "vectorizer-maximize-bandwidth" considers the smallest data type size
> in the loop body and allows a VF of up to 16, but LV still selects a VF
> of 8 even though VF 16 has the same cost:
>
>   LV: Vector loop of width 8 costs: 1.
>   LV: Vector loop of width 16 costs: 1.
>
> It's because of the below check in LV:
>
>   LoopVectorizationCostModel::selectVectorizationFactor() {
>     ...
>     if (VectorCost < Cost) {
>       Cost = VectorCost;
>       Width = i;
>     }
>
> I don't know the history behind this, but I'm wondering why it's like
> this; at least for "vectorizer-maximize-bandwidth" it should be
> (VectorCost <= Cost).

Ah, interesting. I was indeed wondering whether this was another case
where we'd benefit from having vectorizer-maximize-bandwidth on by
default. I recall that our defaults have long been to prefer a smaller
VF in cases where the costs are equal, to avoid extra legalization
costs (and, potentially, to retain more freedom for interleaving
later). Should the costs here be equal, or should we have some extra
target information to distinguish them? It sounds like they're not
really equal in practice.

 -Hal

> By forcing the vector factor to 16 it generates the 256-bit vpmaddwd.
>
> Regards,
> Ashutosh

-- 
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory
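(For reference, the sequence Craig describes above -- two wide loads,
even/odd shufflevectors, sign extensions, separate even and odd
multiplies, and then two v8i32 adds -- would look roughly as follows for
a single loop iteration. This is a hand-written sketch rather than
actual vectorizer output, written with 16-bit element types per Hal's
question; the standalone function, value names, and alignments are
illustrative only.)

    ; Sketch of the desired per-iteration pattern, not vectorizer output.
    define <8 x i32> @dot_step(<16 x i16>* %pa, <16 x i16>* %pb, <8 x i32> %acc) {
    entry:
      %a = load <16 x i16>, <16 x i16>* %pa, align 2
      %b = load <16 x i16>, <16 x i16>* %pb, align 2
      ; Two shufflevectors per load: extract the even and the odd elements,
      ; giving four v8i16 pieces.
      %a.even = shufflevector <16 x i16> %a, <16 x i16> undef, <8 x i32> <i32 0, i32 2, i32 4, i32 6, i32 8, i32 10, i32 12, i32 14>
      %a.odd  = shufflevector <16 x i16> %a, <16 x i16> undef, <8 x i32> <i32 1, i32 3, i32 5, i32 7, i32 9, i32 11, i32 13, i32 15>
      %b.even = shufflevector <16 x i16> %b, <16 x i16> undef, <8 x i32> <i32 0, i32 2, i32 4, i32 6, i32 8, i32 10, i32 12, i32 14>
      %b.odd  = shufflevector <16 x i16> %b, <16 x i16> undef, <8 x i32> <i32 1, i32 3, i32 5, i32 7, i32 9, i32 11, i32 13, i32 15>
      ; Sign extend each piece, then multiply the evens and odds separately.
      %a.even.s = sext <8 x i16> %a.even to <8 x i32>
      %a.odd.s  = sext <8 x i16> %a.odd to <8 x i32>
      %b.even.s = sext <8 x i16> %b.even to <8 x i32>
      %b.odd.s  = sext <8 x i16> %b.odd to <8 x i32>
      %mul.even = mul <8 x i32> %a.even.s, %b.even.s
      %mul.odd  = mul <8 x i32> %a.odd.s, %b.odd.s
      ; One v8i32 add sums the even/odd products (the portion matchable to
      ; vpmaddwd); a second v8i32 add accumulates across loop iterations.
      %madd     = add <8 x i32> %mul.even, %mul.odd
      %acc.next = add <8 x i32> %madd, %acc
      ret <8 x i32> %acc.next
    }

(With that shape, everything except the final accumulate could be
pattern-matched to vpmaddwd as proposed in D49636, and the whole body
stays at a single register width around the loop.)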
Renato Golin via llvm-dev
2018-Jul-24 12:58 UTC
[llvm-dev] [LoopVectorizer] Improving the performance of dot product reduction loop
On Tue, 24 Jul 2018 at 13:46, Hal Finkel via llvm-dev
<llvm-dev at lists.llvm.org> wrote:
> Ah, interesting. I was indeed wondering whether this was another case
> where we'd benefit from having vectorizer-maximize-bandwidth on by
> default. I recall that our defaults have long been to prefer a smaller
> VF in cases where the costs are equal, to avoid extra legalization
> costs (and, potentially, to retain more freedom for interleaving
> later). Should the costs here be equal, or should we have some extra
> target information to distinguish them? It sounds like they're not
> really equal in practice.

Hi Hal,

Enabling vectorizer-maximize-bandwidth by default has been explored
before (see the history in https://reviews.llvm.org/D46283) and it
created a handful of problems.

With the correctness problems fixed, we were still blocked on
significant performance regressions, which probably came down to the
cost model. Adhemerval sent this patch instead:
https://reviews.llvm.org/D48332, which dealt with the specific case at
hand.

I remember hearing that some complex loops get worse performance with
wide loads (e.g. AVX512) because they make the loop too short and the
shuffles in and out too long, or increase the number of shuffles if the
loads are not trivial.

So, while enabling it by default is probably a good idea in a lot of
cases, we probably need to be careful with its use as a wide-scope
default. I don't have more info on the individual cases, though, so
this is more of an FYI. :)

-- 
cheers,
--renato
Hal Finkel via llvm-dev
2018-Jul-24 13:27 UTC
[llvm-dev] [LoopVectorizer] Improving the performance of dot product reduction loop
On 07/24/2018 07:58 AM, Renato Golin wrote:
> On Tue, 24 Jul 2018 at 13:46, Hal Finkel via llvm-dev
> <llvm-dev at lists.llvm.org> wrote:
>> Ah, interesting. I was indeed wondering whether this was another case
>> where we'd benefit from having vectorizer-maximize-bandwidth on by
>> default. I recall that our defaults have long been to prefer a smaller
>> VF in cases where the costs are equal, to avoid extra legalization
>> costs (and, potentially, to retain more freedom for interleaving
>> later). Should the costs here be equal, or should we have some extra
>> target information to distinguish them? It sounds like they're not
>> really equal in practice.
>
> Hi Hal,
>
> Enabling vectorizer-maximize-bandwidth by default has been explored
> before (see the history in https://reviews.llvm.org/D46283) and it
> created a handful of problems.
>
> With the correctness problems fixed, we were still blocked on
> significant performance regressions, which probably came down to the
> cost model. Adhemerval sent this patch instead:
> https://reviews.llvm.org/D48332, which dealt with the specific case at
> hand.

This matches my recollection: the correctness problems had been
addressed and there were still some outstanding cost-modeling problems.

> I remember hearing that some complex loops get worse performance with
> wide loads (e.g. AVX512) because they make the loop too short and the
> shuffles in and out too long, or increase the number of shuffles if
> the loads are not trivial.

Sounds like another potential cost-modeling problem?

> So, while enabling it by default is probably a good idea in a lot of
> cases, we probably need to be careful with its use as a wide-scope
> default. I don't have more info on the individual cases, though, so
> this is more of an FYI. :)

I don't really see any long-term path to addressing these kinds of
problems aside from working through the cost-modeling issues and
getting this enabled. We'll see ;)

 -Hal

-- 
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory