Robin Kruppe via llvm-dev
2018-Jul-31 20:17 UTC
[llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths
On 31 July 2018 at 21:10, David A. Greene via llvm-dev <llvm-dev at lists.llvm.org> wrote:
> Renato Golin via llvm-dev <llvm-dev at lists.llvm.org> writes:
>
>> Hi David,
>>
>> Let me put the last two comments up:
>>
>>> > But we're trying to represent slightly different techniques
>>> > (predication, vscale change) which need to be tied down to only
>>> > exactly what they do.
>>>
>>> Wouldn't intrinsics to change vscale do exactly that?
>>
>> You're right. I've been using the same overloaded term and this is
>> probably what caused the confusion.
>
> Me too. Thanks Robin for clarifying this for all of us! I'll try to
> follow this terminology:
>
> VL/active vector length - The software notion of how many elements to
>                           operate on; a special case of predication
>
> vscale - The hardware notion of how big a vector register is
>
> TL;DR - Changing VL in a function doesn't affect anything about this
>         proposal, but changing vscale might. Changing VL shouldn't
>         impact things like ISel at all but changing vscale might.
>         Changing vscale is (much) more difficult than changing VL.

Great, seems like we're all in violent agreement that VL changes are a
non-issue for the discussion at hand.

>> In some cases, predicating and shortening the vectors are semantically
>> equivalent. In this case, the IR should also be equivalent.
>> Instructions/intrinsics that handle predication could be used by the
>> backend to simply change VL instead, as long as it's guaranteed that
>> the semantics are identical. There are no problems here.
>
> Right. Changing VL is no problem. I think even reducing vscale is ok
> from an IR perspective, if a little strange.
>
>> In other cases, for example widening or splitting the vector, or cases
>> we haven't thought of yet, the semantics are not the same, and having
>> them in IR would be bad. I think we're all in agreement on that.
>
> You mean going from a shorter active vector length to a longer active
> vector length? Or smaller vscale to larger vscale? The latter would be
> bad. The former seems ok if the dataflow is captured and the vectorizer
> generates correct code to account for it. Presumably it would if it is
> the thing changing the active vector length.
>
>> All I'm asking is that we make a list of what we want to happen and
>> disallow everything else explicitly, until someone comes with a strong
>> case for it. Makes sense?
>
> Yes.
>
>>> Ok, I think I am starting to grasp what you are saying. If a value
>>> flows from memory or some scalar computation to vector and then back to
>>> memory or scalar, VL should only ever be set at the start of the vector
>>> computation until it finishes and the value is deposited in memory or
>>> otherwise extracted. I think this is ok, but note that any vector
>>> functions called may change VL for the duration of the call. The change
>>> would not be visible to the caller.
>>
>> If a function is called and changes the length, does it restore back on return?
>
> If a function changes VL, it would typically restore it before return.
> This would be an ABI guarantee just like any other callee-save register.
>
> If a function changes vscale, I don't know. The RISC-V people seem to
> have thought the most about this. I have no point of reference here.
>
>> Right, so it's not as clear cut as I hoped. But we can start
>> implementing the basic idea and then expand as we go. I think trying
>> to hash out all potential scenarios now will drive us crazy.
>
> Sure.
>
>>> It seems strange to me for an optimizer to operate in such a way. The
>>> optimizer should be fully aware of the target's capabilities and use
>>> them accordingly.
>>
>> Mid-end optimisers tend to be fairly agnostic. And when not, they
>> usually ask "is this supported" instead of "which one is better".
>
> Yes, the "is this supported" question is common. Isn't the whole point
> of VPlan to get the "which one is better" question answered for
> vectorization? That would be necessarily tied to the target. The
> questions asked can be agnostic, like the target-agnostic bits of
> codegen use, but the answers would be target-specific.

Just like the old loop vectorizer, VPlan will need a cost model that
is based on properties of the target, exposed to the optimizer in the
form of e.g. TargetLowering hooks. But we should try really hard to
avoid having a hard distinction between e.g. predication- and VL-based
loops in the VPlan representation. Duplicating or triplicating
vectorization logic would be really bad, and there are a lot of
similarities that we can exploit to avoid that. For a simple example,
SVE and RVV both want the same basic loop skeleton: strip-mining with
predication of the loop body derived from the induction variable.
Hopefully we can have a 99% unified VPlan pipeline and most
differences can be delegated to the final VPlan->IR step and the
respective backends.

+ Diego, Florian and others that have been discussing this previously

>>> ARM seems to have no difficulty selecting instructions for it. Changing
>>> the value of vscale shouldn't impact ISel at all. The same instructions
>>> are selected.
>>
>> I may very well be getting lost in too many floating future ideas, atm. :)
>
> Given our clearer terminology, my statement above is maybe not correct.
> Changing vscale *would* impact the IR and codegen (stack allocation,
> etc.). Changing VL would not, other than adding some Instructions to
> capture the semantics. I suspect neither would change ISel (I know VL
> would not) but as you say I don't think we need concern ourselves with
> changing vscale right now, unless others have a dire need to support it.
>
>>> > It is, but IIGIR, changing vscale and predicating are similar
>>> > transformations to achieve the similar goals, but will not be
>>> > represented the same way in IR.
>>>
>>> They probably will not be represented the same way, though I think they
>>> could be (but probably shouldn't be).
>>
>> Maybe in the simple cases (like last iteration) they should be?
>
> Perhaps changing VL could be modeled the same way but I have a feeling
> it will be awkward. Changing vscale is something totally different and
> likely should be represented differently if allowed at all.
>
>>> Ok, but would the optimizer be prevented from introducing VL changes?
>>
>> In the case where they're represented in similar ways in IR, it
>> wouldn't need to.
>
> It would have to generate IR code to effect the software change in VL
> somehow, by altering predicates or by using special intrinsics or some
> other way.
>
>> Otherwise, we'd have to teach the two methods to IR optimisers that
>> are virtually identical in semantics. It'd be left for the back end to
>> implement the last iteration notation as a predicate fill or a vscale
>> change.
>
> I suspect that is too late. The vectorizer needs to account for the
> choice and pick the most profitable course. That's one of the reasons I
> think modeling VL changes like predicates is maybe unnecessarily
> complex. If VL is modeled as "just another predicate" then there's no
> guarantee that ISel will honor the choices the vectorizer made to use VL
> over predication. If it's modeled explicitly, ISel should have an
> easier time generating the code the vectorizer expects.
>
> VL changes aren't always on the last iteration. The Cray X1 had an
> instruction (I would have to dust off old manuals to remember the
> mnemonic) with somewhat strange semantics to get the desired VL for an
> iteration. Code would look something like this:
>
> loop top:
>   vl = getvl N  # N contains the number of iterations left
>   <do computation>
>   N = N - vl
>   branch N > 0, loop top
>
> The "getvl" instruction would usually return the full hardware vector
> register length (MAXVL), except on the 2nd-to-last iteration if N was
> larger than MAXVL but less than 2*MAXVL it would return something like
> <N % 2 == 0 ? N/2 : N/2 + 1>, so in the range (0, MAXVL). The last
> iteration would then run at the same VL or one less depending on whether
> N was odd or even. So the last two iterations would often run at less
> than MAXVL and often at different VLs from each other.

FWIW this is exactly how the RISC-V vector unit works -- unsurprisingly,
since it owes a lot to Cray-style processors :)

> And no, I don't know why the hardware operated this way. :)
>
>>> Being conservative is fine, but we should have a clear understanding of
>>> exactly what that means. I would not want to prohibit all VL changes
>>> now and forever, because I see that as unnecessarily restrictive and
>>> possibly damaging to supporting future architectures.
>>>
>>> If we don't want to provide intrinsics for changing VL right now, I'm
>>> all in favor. There would be no reason to add error checks because
>>> there would be no way within the IR to change VL.
>>
>> Right, I think we're converging.
>
> Agreed.

+1, there is no need to deal with VL at all at this point. I would
even say there isn't even any concept of VL in IR at all at this time.
At some point in the future I will propose something in this space to
support RISC-V vectors, but we'll cross that bridge when we come to
it.

>> How about we don't forbid changes in vscale, but we find a common
>> notation for all the cases where predicating and changing vscale would
>> be semantically identical, and implement those in the same way.
>>
>> Later on, if there are additional cases where changes in vscale would
>> be beneficial, we can discuss them independently.
>>
>> Makes sense?
>
> Again trying to use the VL/vscale terminology:
>
> Changing vscale - no IR support currently and less likely in the future
> Changing VL - no IR support currently but more likely in the future
>
> The second seems like a straightforward extension to me. There will be
> some questions about how to represent VL semantics in IR but those don't
> impact the proposal under discussion at all.
>
> The first seems much harder, at least within a function. It may or may
> not impact the proposal under discussion. It sounds like the RISC-V
> people have some use cases so those should probably be the focal point
> of this discussion.

Yes, for RISC-V we definitely need vscale to vary a bit, but are fine
with limiting that to function boundaries. The use case is *not*
"changing how large vectors are" in the middle of a loop or something
like that, which we all agree is very dubious at best. The RISC-V
vector unit is just very configurable (number of registers, vector
element sizes, etc.) and this configuration can impact how large the
vector registers are. For any given vectorized loop nest we want to
configure the vector unit to suit that piece of code and run the loop
with whatever register size that configuration yields. And when that
loop is done, we stop using the vector unit entirely and disable it,
so that the next loop can use it differently, possibly with a
different register size. For IR modeling purposes, I propose to
enlarge "loop nest" to "function" but the same principle applies, it
just means all vectorized loops in the function will have to share a
configuration.

Without getting too far into the details, does this make sense as a
use case?

Cheers,
Robin

> -David
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
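[Editorial sketch] The `getvl` loop quoted in the message above can be modelled in a few lines of Python. This is only a toy reading of David's description, not the real Cray X1 or RISC-V semantics: the function names `getvl` and `strip_mine` and the handling of the case where fewer than MAXVL elements remain (return all of them) are assumptions made for illustration.

```python
def getvl(remaining, maxvl):
    """Toy model of the described getvl behaviour (assumed, not the real ISA)."""
    if remaining <= 0:
        return 0
    if remaining >= 2 * maxvl:
        # Plenty of work left: use the full hardware vector length.
        return maxvl
    if remaining > maxvl:
        # Second-to-last iteration: split the tail roughly in half,
        # i.e. N/2 for even N and N/2 + 1 for odd N.
        return (remaining + 1) // 2
    # Last iteration: assumption, take whatever is left.
    return remaining


def strip_mine(n, maxvl):
    """Run the 'loop top' skeleton and record the VL used per iteration."""
    vls = []
    while n > 0:
        vl = getvl(n, maxvl)
        vls.append(vl)  # <do computation> on vl elements would go here
        n -= vl
    return vls


if __name__ == "__main__":
    print(strip_mine(10, 4))
    print(strip_mine(11, 4))
```

With an even tail the last two iterations run at the same VL; with an odd tail the last one runs one element shorter, which matches the behaviour described in the quoted text.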
David A. Greene via llvm-dev
2018-Jul-31 21:32 UTC
[llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths
Robin Kruppe <robin.kruppe at gmail.com> writes:

>> Yes, the "is this supported" question is common. Isn't the whole point
>> of VPlan to get the "which one is better" question answered for
>> vectorization? That would be necessarily tied to the target. The
>> questions asked can be agnostic, like the target-agnostic bits of
>> codegen use, but the answers would be target-specific.
>
> Just like the old loop vectorizer, VPlan will need a cost model that
> is based on properties of the target, exposed to the optimizer in the
> form of e.g. TargetLowering hooks. But we should try really hard to
> avoid having a hard distinction between e.g. predication- and VL-based
> loops in the VPlan representation. Duplicating or triplicating
> vectorization logic would be really bad, and there are a lot of
> similarities that we can exploit to avoid that. For a simple example,
> SVE and RVV both want the same basic loop skeleton: strip-mining with
> predication of the loop body derived from the induction variable.
> Hopefully we can have a 99% unified VPlan pipeline and most
> differences can be delegated to the final VPlan->IR step and the
> respective backends.
>
> + Diego, Florian and others that have been discussing this previously

If VL and predication are represented the same way, how does VPlan
distinguish between the two? How does it cost code generation just
using predication vs. code generation using a combination of predication
and VL?

Assuming it can do that, do you envision vector codegen would emit
different IR for VL+predication (say, using intrinsics to set VL) vs. a
strictly predication-only-based plan? If not, how does the LLVM backend
know to emit code to manipulate VL in the former case?

I don't need answers to these questions right now as VL is a separate
issue and I don't want this thread to get bogged down in it. But these
are questions that will come up if/when we tackle VL.

> At some point in the future I will propose something in this space to
> support RISC-V vectors, but we'll cross that bridge when we come to
> it.

Sounds good.

> Yes, for RISC-V we definitely need vscale to vary a bit, but are fine
> with limiting that to function boundaries. The use case is *not*
> "changing how large vectors are" in the middle of a loop or something
> like that, which we all agree is very dubious at best. The RISC-V
> vector unit is just very configurable (number of registers, vector
> element sizes, etc.) and this configuration can impact how large the
> vector registers are. For any given vectorized loop nest we want to
> configure the vector unit to suit that piece of code and run the loop
> with whatever register size that configuration yields. And when that
> loop is done, we stop using the vector unit entirely and disable it,
> so that the next loop can use it differently, possibly with a
> different register size. For IR modeling purposes, I propose to
> enlarge "loop nest" to "function" but the same principle applies, it
> just means all vectorized loops in the function will have to share a
> configuration.
>
> Without getting too far into the details, does this make sense as a
> use case?

I think so. If changing vscale has some important advantage (saving
power?), I wonder how the compiler will deal with very large functions.
I have seen some truly massive Fortran subroutines with hundreds of loop
nests in them, possibly with very different iteration counts for each
one.

-David
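[Editorial sketch] The shared loop skeleton mentioned in the quoted text (strip-mining with a predicate derived from the induction variable) and a VL-based variant can be put side by side in plain Python. This is a scalar simulation under stated assumptions, not LLVM IR or VPlan output: the function names, the `a*x + y` kernel, and the `vf`/`maxvl` parameters are all invented for illustration.

```python
def axpy_predicated(a, x, y, vf):
    """Strip-mined loop; each lane is guarded by a mask computed from i."""
    n = len(x)
    out = list(y)
    for i in range(0, n, vf):
        # Lane is active only while its index is still inside the array.
        mask = [i + lane < n for lane in range(vf)]
        for lane in range(vf):
            if mask[lane]:
                out[i + lane] = a * x[i + lane] + y[i + lane]
    return out


def axpy_vl(a, x, y, maxvl):
    """Same loop, but the tail is handled by shortening the active length."""
    n = len(x)
    out = list(y)
    i = 0
    while i < n:
        vl = min(maxvl, n - i)  # active vector length for this iteration
        for lane in range(vl):
            out[i + lane] = a * x[i + lane] + y[i + lane]
        i += vl
    return out


if __name__ == "__main__":
    xs, ys = list(range(10)), [1] * 10
    print(axpy_predicated(2, xs, ys, 4) == axpy_vl(2, xs, ys, 4))
```

In this simple case the two forms compute the same result, which is the situation where a single representation is plausible. The sketch says nothing about which form is cheaper on a given target; that is the costing question raised in the message above.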
Hal Finkel via llvm-dev
2018-Jul-31 22:46 UTC
[llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths
On 07/31/2018 04:32 PM, David A. Greene via llvm-dev wrote:
> Robin Kruppe <robin.kruppe at gmail.com> writes:
>
>>> Yes, the "is this supported" question is common. Isn't the whole point
>>> of VPlan to get the "which one is better" question answered for
>>> vectorization? That would be necessarily tied to the target. The
>>> questions asked can be agnostic, like the target-agnostic bits of
>>> codegen use, but the answers would be target-specific.
>> Just like the old loop vectorizer, VPlan will need a cost model that
>> is based on properties of the target, exposed to the optimizer in the
>> form of e.g. TargetLowering hooks. But we should try really hard to
>> avoid having a hard distinction between e.g. predication- and VL-based
>> loops in the VPlan representation. Duplicating or triplicating
>> vectorization logic would be really bad, and there are a lot of
>> similarities that we can exploit to avoid that. For a simple example,
>> SVE and RVV both want the same basic loop skeleton: strip-mining with
>> predication of the loop body derived from the induction variable.
>> Hopefully we can have a 99% unified VPlan pipeline and most
>> differences can be delegated to the final VPlan->IR step and the
>> respective backends.
>>
>> + Diego, Florian and others that have been discussing this previously
> If VL and predication are represented the same way, how does VPlan
> distinguish between the two? How does it cost code generation just
> using predication vs. code generation using a combination of predication
> and VL?
>
> Assuming it can do that, do you envision vector codegen would emit
> different IR for VL+predication (say, using intrinsics to set VL) vs. a
> strictly predication-only-based plan? If not, how does the LLVM backend
> know to emit code to manipulate VL in the former case?
>
> I don't need answers to these questions right now as VL is a separate
> issue and I don't want this thread to get bogged down in it. But these
> are questions that will come up if/when we tackle VL.
>
>> At some point in the future I will propose something in this space to
>> support RISC-V vectors, but we'll cross that bridge when we come to
>> it.
> Sounds good.
>
>> Yes, for RISC-V we definitely need vscale to vary a bit, but are fine
>> with limiting that to function boundaries. The use case is *not*
>> "changing how large vectors are" in the middle of a loop or something
>> like that, which we all agree is very dubious at best. The RISC-V
>> vector unit is just very configurable (number of registers, vector
>> element sizes, etc.) and this configuration can impact how large the
>> vector registers are. For any given vectorized loop nest we want to
>> configure the vector unit to suit that piece of code and run the loop
>> with whatever register size that configuration yields. And when that
>> loop is done, we stop using the vector unit entirely and disable it,
>> so that the next loop can use it differently, possibly with a
>> different register size. For IR modeling purposes, I propose to
>> enlarge "loop nest" to "function" but the same principle applies, it
>> just means all vectorized loops in the function will have to share a
>> configuration.
>>
>> Without getting too far into the details, does this make sense as a
>> use case?
> I think so. If changing vscale has some important advantage (saving
> power?), I wonder how the compiler will deal with very large functions.
> I have seen some truly massive Fortran subroutines with hundreds of loop
> nests in them, possibly with very different iteration counts for each
> one.

I have two concerns:

 1. If we change vscale in the middle of a function, then we have no
    way to introduce a dependence, or barrier, at the point where the
    change is made. Transformations (GVN, PRE, etc.) can move code
    around the place where the change is made, and I suspect that we'll
    have no good options to prevent it (this could include whole
    subloops, although we might not do that today). In some sense, if
    you make vscale dynamic, you've introduced dependent types into
    LLVM's type system, but you've done it in an implicit manner. It's
    not clear to me that works. If we need dependent types, then an
    explicit dependence seems better (e.g.,
    <scalable <n> x %vscale_var x <type>>).

 2. How would the function-call boundary work? Does the function itself
    have intrinsics that change the vscale? If so, then it's not clear
    that the function-call boundary makes sense unless you prevent
    inlining. If you prevent inlining, when does that decision get
    made? Will the vectorizer need to outline loops? If so, outlining
    can have a real cost that's difficult to model. How do return types
    work?

Two other thoughts:

 1. I can definitely see the use cases for changing vscale dynamically,
    and so I do suspect that we'll want that support.

 2. LLVM does not have loops as first-class constructs. We only have
    SSA (and, thus, dominance), and when specifying restrictions on
    placement of things in function bodies, we need to do so in terms
    of these constructs that we have (which don't include loops).

Thanks again,
Hal

>
> -David

-- 
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory
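[Editorial sketch] Hal's first concern contrasts an implicit, ambient vscale with a type that names the vscale value it depends on. The toy Python model below illustrates only the second idea: every vector value carries a token for the vscale it was created under, so combining values from two different configurations is rejected instead of silently miscompiled. The classes `VScale` and `ScalableVec` and the function `vadd` are hypothetical and are not LLVM types or APIs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VScale:
    """Hypothetical SSA-like token naming one vscale configuration."""
    name: str
    value: int


@dataclass(frozen=True)
class ScalableVec:
    """Toy stand-in for a type like <scalable n x %vscale_var x type>."""
    vscale: VScale
    min_elts: int
    data: tuple

    @staticmethod
    def splat(vscale, min_elts, x):
        # Total element count is vscale * min_elts, as for scalable vectors.
        return ScalableVec(vscale, min_elts, (x,) * (vscale.value * min_elts))


def vadd(a, b):
    """Elementwise add; operands must depend on the same vscale token."""
    if a.vscale != b.vscale or a.min_elts != b.min_elts:
        raise TypeError("operands depend on different vscale values")
    return ScalableVec(a.vscale, a.min_elts,
                       tuple(p + q for p, q in zip(a.data, b.data)))


if __name__ == "__main__":
    vs1, vs2 = VScale("%vs1", 2), VScale("%vs2", 4)
    ok = vadd(ScalableVec.splat(vs1, 4, 1), ScalableVec.splat(vs1, 4, 2))
    print(len(ok.data))
    try:
        vadd(ScalableVec.splat(vs1, 4, 1), ScalableVec.splat(vs2, 4, 2))
    except TypeError as err:
        print("rejected:", err)
```

Because the dependence is part of the value's type, a transformation that moved an operation across a vscale change would produce an ill-typed program in this model, which is the kind of explicit barrier the message says is missing when vscale is implicit.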
Robin Kruppe via llvm-dev
2018-Aug-01 20:28 UTC
[llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths
On 31 July 2018 at 23:32, David A. Greene <dag at cray.com> wrote:
> Robin Kruppe <robin.kruppe at gmail.com> writes:
>
>> Yes, for RISC-V we definitely need vscale to vary a bit, but are fine
>> with limiting that to function boundaries. The use case is *not*
>> "changing how large vectors are" in the middle of a loop or something
>> like that, which we all agree is very dubious at best. The RISC-V
>> vector unit is just very configurable (number of registers, vector
>> element sizes, etc.) and this configuration can impact how large the
>> vector registers are. For any given vectorized loop nest we want to
>> configure the vector unit to suit that piece of code and run the loop
>> with whatever register size that configuration yields. And when that
>> loop is done, we stop using the vector unit entirely and disable it,
>> so that the next loop can use it differently, possibly with a
>> different register size. For IR modeling purposes, I propose to
>> enlarge "loop nest" to "function" but the same principle applies, it
>> just means all vectorized loops in the function will have to share a
>> configuration.
>>
>> Without getting too far into the details, does this make sense as a
>> use case?
>
> I think so. If changing vscale has some important advantage (saving
> power?), I wonder how the compiler will deal with very large functions.
> I have seen some truly massive Fortran subroutines with hundreds of loop
> nests in them, possibly with very different iteration counts for each
> one.

Yeah, many loops with different demands on the vector unit in one
function is a problem for the "one vscale per function" approach.
Though for the record, the differences that matter here are not trip
count, but things like register pressure and the bit widths of the
vector elements.

There are some (fragile) workarounds for this problem, such as
splitting up the function. There's also the possibility of optimizing
for this case in the backend: trying to recognize when you can use
different configurations/vscales for two loops without changing
observable behavior (no vector values live between the loops, vscale
doesn't escape, etc.). In general this is of course extremely
difficult, but I hope it'll work well enough in practice to mitigate
this problem somewhat. This is just an educated guess at this point;
we'll have to wait and see how big the impact is on real applications
and real hardware (or simulations thereof).

But at the end of the day, sure, maybe we'll generate sub-optimal code
for some applications. That's still better than making the problem
intractable by being too greedy and ending up with either a broken
compiler or one that can't vary vscale at all.

Cheers,
Robin
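[Editorial sketch] The trade-off in the message above can be illustrated with a toy model of "one configuration per function". Everything numeric here is an assumption made for illustration: the 4096-bit storage budget, the rule that storage is split evenly among configured registers, and the names `LoopDemand`, `elements_per_register`, `shared_config`, and `may_reconfigure_between` do not come from the RISC-V specification or from LLVM.

```python
from dataclasses import dataclass

STORAGE_BITS = 4096  # hypothetical total vector register storage


@dataclass(frozen=True)
class LoopDemand:
    num_regs: int      # vector registers the loop needs at once
    max_elt_bits: int  # widest vector element type used in the loop


def elements_per_register(d):
    """Assumed model: storage is divided evenly among configured registers."""
    return STORAGE_BITS // (d.num_regs * d.max_elt_bits)


def shared_config(loops):
    """One configuration per function: it must satisfy every loop."""
    return LoopDemand(max(l.num_regs for l in loops),
                      max(l.max_elt_bits for l in loops))


def may_reconfigure_between(live_vector_values, vscale_escapes):
    """Conditions named in the text for giving two loops separate configs."""
    return not live_vector_values and not vscale_escapes


if __name__ == "__main__":
    small = LoopDemand(num_regs=4, max_elt_bits=8)
    big = LoopDemand(num_regs=32, max_elt_bits=64)
    print(elements_per_register(small))
    print(elements_per_register(shared_config([small, big])))
```

Under these assumed numbers, the undemanding loop could run with much longer registers on its own than it gets when forced to share a configuration with the register-hungry loop, which is why the backend check for "no vector values live between the loops, vscale doesn't escape" is attractive as a mitigation.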