Hal Finkel via llvm-dev
2018-Jul-31 22:46 UTC
[llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths
On 07/31/2018 04:32 PM, David A. Greene via llvm-dev wrote:
> Robin Kruppe <robin.kruppe at gmail.com> writes:
>
>>> Yes, the "is this supported" question is common. Isn't the whole point
>>> of VPlan to get the "which one is better" question answered for
>>> vectorization? That would be necessarily tied to the target. The
>>> questions asked can be agnostic, like the target-agnostic bits of
>>> codegen use, but the answers would be target-specific.
>>
>> Just like the old loop vectorizer, VPlan will need a cost model that
>> is based on properties of the target, exposed to the optimizer in the
>> form of e.g. TargetLowering hooks. But we should try really hard to
>> avoid having a hard distinction between e.g. predication- and VL-based
>> loops in the VPlan representation. Duplicating or triplicating
>> vectorization logic would be really bad, and there are a lot of
>> similarities that we can exploit to avoid that. For a simple example,
>> SVE and RVV both want the same basic loop skeleton: strip-mining with
>> predication of the loop body derived from the induction variable.
>> Hopefully we can have a 99% unified VPlan pipeline and most
>> differences can be delegated to the final VPlan->IR step and the
>> respective backends.
>>
>> + Diego, Florian and others that have been discussing this previously
>
> If VL and predication are represented the same way, how does VPlan
> distinguish between the two? How does it cost code generation just
> using predication vs. code generation using a combination of predication
> and VL?
>
> Assuming it can do that, do you envision vector codegen would emit
> different IR for VL+predication (say, using intrinsics to set VL) vs. a
> strictly predication-only-based plan? If not, how does the LLVM backend
> know to emit code to manipulate VL in the former case?
>
> I don't need answers to these questions right now as VL is a separate
> issue and I don't want this thread to get bogged down in it. But these
> are questions that will come up if/when we tackle VL.
>
>> At some point in the future I will propose something in this space to
>> support RISC-V vectors, but we'll cross that bridge when we come to
>> it.
>
> Sounds good.
>
>> Yes, for RISC-V we definitely need vscale to vary a bit, but are fine
>> with limiting that to function boundaries. The use case is *not*
>> "changing how large vectors are" in the middle of a loop or something
>> like that, which we all agree is very dubious at best. The RISC-V
>> vector unit is just very configurable (number of registers, vector
>> element sizes, etc.) and this configuration can impact how large the
>> vector registers are. For any given vectorized loop nest we want to
>> configure the vector unit to suit that piece of code and run the loop
>> with whatever register size that configuration yields. And when that
>> loop is done, we stop using the vector unit entirely and disable it,
>> so that the next loop can use it differently, possibly with a
>> different register size. For IR modeling purposes, I propose to
>> enlarge "loop nest" to "function", but the same principle applies; it
>> just means all vectorized loops in the function will have to share a
>> configuration.
>>
>> Without getting too far into the details, does this make sense as a
>> use case?
>
> I think so. If changing vscale has some important advantage (saving
> power?), I wonder how the compiler will deal with very large functions.
> I have seen some truly massive Fortran subroutines with hundreds of loop
> nests in them, possibly with very different iteration counts for each
> one.

I have two concerns:

1. If we change vscale in the middle of a function, then we have no way
to introduce a dependence, or barrier, at the point where the change is
made. Transformations (GVN/PRE, etc.), for example, can move code around
the place where the change is made, and I suspect that we'll have no
good options to prevent it (this could include whole subloops, although
we might not do that today). In some sense, if you make vscale dynamic,
you've introduced dependent types into LLVM's type system, but you've
done it in an implicit manner. It's not clear to me that works. If we
need dependent types, then an explicit dependence seems better (e.g.,
<scalable <n> x %vscale_var x <type>>).

2. How would the function-call boundary work? Does the function itself
have intrinsics that change the vscale? If so, then it's not clear that
the function-call boundary makes sense unless you prevent inlining. If
you prevent inlining, when does that decision get made? Will the
vectorizer need to outline loops? If so, outlining can have a real cost
that's difficult to model. How do return types work?

Two other thoughts:

1. I can definitely see the use cases for changing vscale dynamically,
and so I do suspect that we'll want that support.

2. LLVM does not have loops as first-class constructs. We only have SSA
(and, thus, dominance), and when specifying restrictions on placement of
things in function bodies, we need to do so in terms of these constructs
that we have (which don't include loops).

Thanks again,
Hal

> -David
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

-- 
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory
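[Editor's note: the loop skeleton Robin describes above (strip-mining with a predicate derived from the induction variable) can be sketched in plain, scalar C. This is an illustration only: the `VL` constant stands in for a runtime, vscale-dependent number of lanes, and the inner lane loop models one predicated vector operation.]

```c
#include <stddef.h>

/* Illustrative sketch of the strip-mined, predicated loop skeleton
   that SVE and RVV both target. VL is a placeholder for the runtime,
   vscale-derived vector length; the inner loop models one predicated
   vector operation, with the lane predicate (i + j < n) derived from
   the induction variable. */
enum { VL = 8 }; /* placeholder: a real target derives this from vscale */

void saxpy_stripmined(float *y, const float *x, float a, size_t n) {
    for (size_t i = 0; i < n; i += VL) {      /* strip-mined loop    */
        for (size_t j = 0; j < VL; ++j) {     /* one "vector" op     */
            if (i + j < n)                    /* predicate from IV   */
                y[i + j] += a * x[i + j];
        }
    }
}
```

Because the trip count `n` need not be a multiple of `VL`, the final partial chunk is handled by the predicate rather than by a scalar epilogue loop, which is exactly what makes the same skeleton reusable across targets.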
Renato Golin via llvm-dev
2018-Aug-01 11:15 UTC
[llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths
On Tue, 31 Jul 2018 at 23:46, Hal Finkel via llvm-dev
<llvm-dev at lists.llvm.org> wrote:
> In some sense, if you make vscale dynamic,
> you've introduced dependent types into LLVM's type system, but you've
> done it in an implicit manner. It's not clear to me that works. If we
> need dependent types, then an explicit dependence seems better. (e.g.,
> <scalable <n> x %vscale_var x <type>>)

That's a shift from the current proposal and I think we can think
about it after the current changes. For now, both SVE and RISC-V are
proposing function boundaries for changes in vscale.

> 2. How would the function-call boundary work? Does the function itself
> have intrinsics that change the vscale?

Functions may not know what their vscale is until they're actually
executed. They could even have different vscales for different call
sites.

AFAIK, it's not up to the compiled program (i.e. via a function
attribute or an inline asm call) to change the vscale, but the
kernel/hardware can impose dynamic restrictions on the process. But,
for now, only at (binary object) function boundaries.

I don't know how that works at the kernel level (how to detect those
boundaries? instrument every branch?), but this is what I understood
from the current discussion.

> If so, then it's not clear that
> the function-call boundary makes sense unless you prevent inlining. If
> you prevent inlining, when does that decision get made? Will the
> vectorizer need to outline loops? If so, outlining can have a real cost
> that's difficult to model. How do return types work?

The dynamic nature is not part of the program, so inlining can happen
as always. Given that the vectors are agnostic of size and work
regardless of what the kernel provides (within safety boundaries), the
code generation shouldn't change too much.

We may have to create artefacts to restrict the maximum vscale (for
safety), but others are better equipped to answer that question.

> 1. I can definitely see the use cases for changing vscale dynamically,
> and so I do suspect that we'll want that support.

At a process/function level, yes. Within the same self-contained
sub-graph, I don't know.

> 2. LLVM does not have loops as first-class constructs. We only have SSA
> (and, thus, dominance), and when specifying restrictions on placement of
> things in function bodies, we need to do so in terms of these constructs
> that we have (which don't include loops).

That's why I was trying to define the "self-contained sub-graph" above
(there must be a better term for that). It has to do with data
dependencies (scalar|memory -> vector -> scalar|memory), i.e. making
sure side-effects don't leak out.

A loop iteration is usually such a block, but not all are, and not all
such blocks are loops.

Changing vscale inside a function, but outside of those blocks, would
be "fine", as long as we made sure code movement respects those
boundaries and that context would be restored correctly on exceptions.
But that's not part of the current proposal.

Changing vscale inside one of those blocks would be madness. :)

cheers,
--renato
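[Editor's note: Renato's claim that "the vectors are agnostic of size and work regardless of what the kernel provides" can be illustrated with a scalar C sketch. The `hw_vl` parameter is a hypothetical stand-in for the vector length granted at function entry; a real implementation would query the hardware rather than take it as an argument.]

```c
#include <stddef.h>

/* Vector-length-agnostic strip-mining: hw_vl is a hypothetical
   stand-in for whatever vector length the kernel/hardware grants
   this invocation. The result is independent of hw_vl, which is
   why a callee can safely see a different vscale at each call
   site, and why inlining remains legal. */
static size_t sum_by_chunks(const int *x, size_t n, size_t hw_vl) {
    size_t total = 0;
    for (size_t i = 0; i < n; i += hw_vl)           /* strip-mined loop */
        for (size_t j = 0; j < hw_vl && i + j < n; ++j)
            total += (size_t)x[i + j];
    return total;
}
```

Calling this with `hw_vl` of 4 and of 16 yields identical sums; only performance, never correctness, depends on the granted length, which is the property the function-boundary proposal relies on.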
Hal Finkel via llvm-dev
2018-Aug-01 16:26 UTC
[llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths
On 08/01/2018 06:15 AM, Renato Golin wrote:
> On Tue, 31 Jul 2018 at 23:46, Hal Finkel via llvm-dev
> <llvm-dev at lists.llvm.org> wrote:
>> In some sense, if you make vscale dynamic,
>> you've introduced dependent types into LLVM's type system, but you've
>> done it in an implicit manner. It's not clear to me that works. If we
>> need dependent types, then an explicit dependence seems better. (e.g.,
>> <scalable <n> x %vscale_var x <type>>)
>
> That's a shift from the current proposal and I think we can think
> about it after the current changes. For now, both SVE and RISC-V are
> proposing function boundaries for changes in vscale.

I understand. I'm afraid, however, that the function-boundary idea
doesn't work well.

>> 2. How would the function-call boundary work? Does the function itself
>> have intrinsics that change the vscale?
>
> Functions may not know what their vscale is until they're actually
> executed. They could even have different vscales for different call
> sites.
>
> AFAIK, it's not up to the compiled program (i.e. via a function
> attribute or an inline asm call) to change the vscale, but the
> kernel/hardware can impose dynamic restrictions on the process. But,
> for now, only at (binary object) function boundaries.

I'm not sure if that's better or worse than the compiler putting in
code to indicate that the vscale might change. How do vector function
arguments work if vscale gets larger? Or smaller? So, if I have some
vectorized code, and we figure out that some of it is cold, so we
outline it, and then the kernel decides to decrease vscale for that
function, have I now broken the application? Storing a vector argument
in memory in that function now doesn't store as much data as it would
have in the caller?

> I don't know how that works at the kernel level (how to detect those
> boundaries? instrument every branch?), but this is what I understood
> from the current discussion.

Can we find out?

>> If so, then it's not clear that
>> the function-call boundary makes sense unless you prevent inlining. If
>> you prevent inlining, when does that decision get made? Will the
>> vectorizer need to outline loops? If so, outlining can have a real cost
>> that's difficult to model. How do return types work?
>
> The dynamic nature is not part of the program, so inlining can happen
> as always. Given that the vectors are agnostic of size and work
> regardless of what the kernel provides (within safety boundaries), the
> code generation shouldn't change too much.
>
> We may have to create artefacts to restrict the maximum vscale (for
> safety), but others are better equipped to answer that question.
>
>> 1. I can definitely see the use cases for changing vscale dynamically,
>> and so I do suspect that we'll want that support.
>
> At a process/function level, yes. Within the same self-contained
> sub-graph, I don't know.
>
>> 2. LLVM does not have loops as first-class constructs. We only have SSA
>> (and, thus, dominance), and when specifying restrictions on placement of
>> things in function bodies, we need to do so in terms of these constructs
>> that we have (which don't include loops).
>
> That's why I was trying to define the "self-contained sub-graph" above
> (there must be a better term for that). It has to do with data
> dependencies (scalar|memory -> vector -> scalar|memory), i.e. making
> sure side-effects don't leak out.
>
> A loop iteration is usually such a block, but not all are, and not all
> such blocks are loops.
>
> Changing vscale inside a function, but outside of those blocks, would
> be "fine", as long as we made sure code movement respects those
> boundaries and that context would be restored correctly on exceptions.
> But that's not part of the current proposal.

But I don't know how to implement that restriction without major
changes to the code base. Such a restriction doesn't follow from
use/def chains, and if we need a restriction that involves looking for
non-SSA dependencies (e.g., memory dependencies), then I think that we
need something different from the current proposal. Explicitly
dependent types might work, something like intrinsics might work, etc.

Thanks again,
Hal

> Changing vscale inside one of those blocks would be madness. :)
>
> cheers,
> --renato

-- 
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory
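[Editor's note: Hal's outlining hazard, where a callee granted a smaller vscale than its caller silently stores fewer vector lanes to memory, can be simulated in scalar C. Everything here is an illustration: the `caller_vl`/`callee_vl` parameters model the per-function vector lengths the kernel might grant, not any real ABI.]

```c
#include <stddef.h>

/* Simulates an outlined callee spilling a "vector" argument to memory
   while running with a smaller granted vector length than its caller.
   Lanes beyond callee_vl are silently dropped. */
static void outlined_spill(const int *vec_arg, int *buf, size_t callee_vl) {
    for (size_t lane = 0; lane < callee_vl; ++lane)
        buf[lane] = vec_arg[lane];   /* stores only callee_vl lanes */
}

/* Returns how many of the caller's lanes survive the round trip when
   the caller produced caller_vl lanes but the callee ran at callee_vl.
   (Both must be <= 16 in this toy model.) */
static size_t lanes_preserved(size_t caller_vl, size_t callee_vl) {
    int vec[16], buf[16];
    for (size_t i = 0; i < 16; ++i) { vec[i] = (int)(i + 1); buf[i] = 0; }
    outlined_spill(vec, buf, callee_vl);
    size_t ok = 0;
    for (size_t i = 0; i < caller_vl; ++i)
        if (buf[i] == vec[i]) ++ok;
    return ok;
}
```

When the granted lengths match, all lanes survive; when the kernel shrinks the callee's length, data is lost exactly as Hal describes, which is why a function-boundary vscale change interacts badly with outlining and with vector arguments in memory.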