Hal Finkel via llvm-dev
2018-Aug-01 16:26 UTC
[llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths
On 08/01/2018 06:15 AM, Renato Golin wrote:

> On Tue, 31 Jul 2018 at 23:46, Hal Finkel via llvm-dev
> <llvm-dev at lists.llvm.org> wrote:
>> In some sense, if you make vscale dynamic, you've introduced dependent
>> types into LLVM's type system, but you've done it in an implicit manner.
>> It's not clear to me that works. If we need dependent types, then an
>> explicit dependence seems better. (e.g., <scalable <n> x %vscale_var x <type>>)
>
> That's a shift from the current proposal and I think we can think about
> it after the current changes. For now, both SVE and RISC-V are proposing
> function boundaries for changes in vscale.

I understand. I'm afraid that the function-boundary idea doesn't work
reasonably.

>> 2. How would the function-call boundary work? Does the function itself
>> have intrinsics that change the vscale?
>
> Functions may not know what their vscale is until they're actually
> executed. They could even have different vscales for different call sites.
>
> AFAIK, it's not up to the compiled program (i.e. via a function attribute
> or an inline asm call) to change the vscale, but the kernel/hardware can
> impose dynamic restrictions on the process. But, for now, only at (binary
> object) function boundaries.

I'm not sure if that's better or worse than the compiler putting in code
to indicate that the vscale might change. How do vector function arguments
work if vscale gets larger? or smaller?

So, if I have some vectorized code, and we figure out that some of it is
cold, so we outline it, and then the kernel decides to decrease vscale for
that function, now I have broken the application? Storing a vector argument
in memory in that function now doesn't store as much data as it would have
in the caller?

> I don't know how that works at the kernel level (how to detect those
> boundaries? instrument every branch?) but this is what I understood from
> the current discussion.

Can we find out?

>> If so, then it's not clear that the function-call boundary makes sense
>> unless you prevent inlining. If you prevent inlining, when does that
>> decision get made? Will the vectorizer need to outline loops? If so,
>> outlining can have a real cost that's difficult to model. How do return
>> types work?
>
> The dynamic nature is not part of the program, so inlining can happen as
> always. Given that the vectors are agnostic of size and work regardless
> of what the kernel provides (within safety boundaries), the code
> generation shouldn't change too much.
>
> We may have to create artefacts to restrict the maximum vscale (for
> safety), but others are better equipped to answer that question.
>
>> 1. I can definitely see the use cases for changing vscale dynamically,
>> and so I do suspect that we'll want that support.
>
> At a process/function level, yes. Within the same self-contained
> sub-graph, I don't know.
>
>> 2. LLVM does not have loops as first-class constructs. We only have SSA
>> (and, thus, dominance), and when specifying restrictions on placement of
>> things in function bodies, we need to do so in terms of these constructs
>> that we have (which don't include loops).
>
> That's why I was trying to define the "self-contained sub-graph" above
> (there must be a better term for that). It has to do with data
> dependencies (scalar|memory -> vector -> scalar|memory), i.e. make sure
> side-effects don't leak out.
>
> A loop iteration is usually such a block, but not all are and not all
> such blocks are loops.
>
> Changing vscale inside a function, but outside of those blocks would be
> "fine", as long as we made sure code movement respects those boundaries
> and that context would be restored correctly on exceptions. But that's
> not part of the current proposal.
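To make the notion concrete, here is a minimal sketch of such a
self-contained block (the pointer operands %p, %q, %r are hypothetical;
the <scalable n x ty> syntax is the one proposed in this RFC). The block
only communicates with the rest of the function through memory, so no
scalable value is live across its edges:

  ; vscale could conceivably change here: no scalable values are live
  %a = load <scalable 4 x i32>, <scalable 4 x i32>* %p
  %b = load <scalable 4 x i32>, <scalable 4 x i32>* %q
  %c = add <scalable 4 x i32> %a, %b
  store <scalable 4 x i32> %c, <scalable 4 x i32>* %r
  ; ... and here again: the block only talks to the outside through memory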
But I don't know how to implement that restriction without major changes
to the code base. Such a restriction doesn't follow from use/def chains,
and if we need a restriction that involves looking for non-SSA dependencies
(e.g., memory dependencies), then I think that we need something different
from the current proposal. Explicitly dependent types might work, something
like intrinsics might work, etc.

Thanks again,
Hal

> Changing vscale inside one of those blocks would be madness. :)
>
> cheers,
> --renato

--
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory
Robin Kruppe via llvm-dev
2018-Aug-01 19:59 UTC
[llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths
On 1 August 2018 at 18:26, Hal Finkel <hfinkel at anl.gov> wrote:
>
> On 08/01/2018 06:15 AM, Renato Golin wrote:
>> On Tue, 31 Jul 2018 at 23:46, Hal Finkel via llvm-dev
>> <llvm-dev at lists.llvm.org> wrote:
>>> In some sense, if you make vscale dynamic, you've introduced dependent
>>> types into LLVM's type system, but you've done it in an implicit manner.
>>> It's not clear to me that works. If we need dependent types, then an
>>> explicit dependence seems better. (e.g., <scalable <n> x %vscale_var x <type>>)
>> That's a shift from the current proposal and I think we can think about
>> it after the current changes. For now, both SVE and RISC-V are proposing
>> function boundaries for changes in vscale.
>
> I understand. I'm afraid that the function-boundary idea doesn't work
> reasonably.

FWIW, I don't think dependent types really help with the code motion
problems. While using an SSA value in a type would presumably enforce that
instructions mentioning that type have to be dominated by the definition of
said value, the real problem is when you _stop_ using one vscale (and
presumably start using another). For example, we want to rule out the
following:

  %vscale.1 = call i32 @change_vscale(...)
  %v.1 = load <scalable 4 x %vscale.1 x i32> ...
  %vscale.2 = call i32 @change_vscale(...)
  ; vscale changed, but we're still doing things with the old one:
  %v.2 = load <scalable 4 x %vscale.1 x i32> ...

And of course, actually introducing this notion of types mentioning SSA
values into LLVM would be an extraordinarily huge and difficult step. I did
actually consider something along these lines (and even had a digression
about it in drafts of my RFC, but I cut it in the final version), but I
don't think it's viable.

Tying some values to the function they're in, on the other hand, even has
precedent in current LLVM: token values must be confined to one function
(intrinsics are special, of course), so most of the interprocedural passes
already must be careful with moving certain kinds of values between
functions. It's ad-hoc and requires auditing passes, yes, but it's something
we know and have some experience with. (The similarity to tokens is strong
enough that my original proposal heavily leaned on tokens to encode the
restrictions on the optimizer that are needed for
different-vscale-per-function, but I've been persuaded that it's more
trouble than it's worth, hence the "implicit" approach of this RFC.)
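For concreteness, a minimal sketch of that token precedent, using
llvm.coro.id as one intrinsic that produces a token value (the function @f
and its body are purely illustrative):

  declare token @llvm.coro.id(i32, i8*, i8*, i8*)

  define void @f() {
  entry:
    ; %id has token type: its uses must stay inside @f (only intrinsics may
    ; take it as an argument), so interprocedural passes already know not
    ; to move its users into another function.
    %id = call token @llvm.coro.id(i32 0, i8* null, i8* null, i8* null)
    ret void
  }

The per-function restriction on vscale would be enforced in a similarly
ad-hoc way: not through the type system, but through rules that passes have
to respect.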
>>> 2. How would the function-call boundary work? Does the function itself
>>> have intrinsics that change the vscale?
>> Functions may not know what their vscale is until they're actually
>> executed. They could even have different vscales for different call sites.
>>
>> AFAIK, it's not up to the compiled program (i.e. via a function attribute
>> or an inline asm call) to change the vscale, but the kernel/hardware can
>> impose dynamic restrictions on the process. But, for now, only at (binary
>> object) function boundaries.
>
> I'm not sure if that's better or worse than the compiler putting in code
> to indicate that the vscale might change. How do vector function arguments
> work if vscale gets larger? or smaller?

I don't see any way for the OS to change a running process's vscale without
a great amount of cooperation from the program and the compiler. In general,
the kernel has nowhere near enough information to identify spots where it's
safe to fiddle with vscale -- function call boundaries aren't safe in
general, as you point out.

FWIW, in the RISC-V vector task group we discussed migrating running
processes between cores in heterogeneous architectures (e.g., think
big.LITTLE) that may have different vector register sizes. We quickly agreed
that there's no way to make that work and dismissed the idea. The current
thinking is, if you want to migrate a process that's currently using the
vector unit, you can only migrate it between cores that have the same kind
of register file.

For the RISC-V backend I don't want anything to do with OS shenanigans, I'm
exclusively focused on codegen. The backend inserts machine code in the
prologue that configures the vector unit in whatever way the backend
considers best, and this configuration determines vscale (and some other
things that aren't visible to IR). The caller saves their vector unit state
before the call and restores it after the call returns, so their vscale is
not affected by the call either.

For SVE, I could imagine a function attribute that indicates it's OK to
change vscale at this point (this will probably have to be a very careful
and deliberate decision by a programmer). The backend could then change
vscale in the prologue, either setting it to a specific value (e.g., one
requested by the attribute) or making a libcall asking the kernel to adjust
vscale if it wants to. In both cases, the change happens after the caller
has saved all their state and before any of the callee's code runs.

That leaves arguments and return values, and more generally any vector
values that are shared (e.g., in memory) between caller and callee. Indeed,
it's not possible to share any vectors between two functions that disagree
on how large a vector is (sounds obvious when you put it that way). If you
need to pass vectors in any way, caller and callee have to agree on vscale
as part of the ABI, and the callee does *not* change vscale but "inherits"
it from the caller. On SVE that's the default ABI; on RISC-V there will be
one or multiple non-default "vector call" ABIs (as Bruce mentioned in an
earlier email). In IR we could represent these different ABIs through
calling convention numbers, function attributes, or a combination thereof.

With ABIs where caller and callee don't necessarily agree on vscale, it is
simply impossible to pass vector values (and while you can e.g. pass the
caller's vscale value, it probably isn't meaningful to the callee):

- it's a verifier error if such a function takes or returns scalable
  vectors directly
- a source program that e.g. tries to smuggle a vector from one function to
  another through heap memory is erroneous
- the optimizer must not introduce such errors into correct input programs

The last point means, for example, that partial inlining can't pull the
computation of a vector value into the caller and pass the result as a new
argument. Such optimizations wouldn't be correct anyway, regardless of ABI
concerns: the instructions that are affected all depend on vscale, and
therefore moving them to a different function changes their behavior. This
doesn't mean all interprocedural optimizations are invalid, though;
*complete* inlining, for example, is always valid. And all of this applies
only if caller and callee don't agree on vscale; with suitable ABIs, all
existing optimizations can be applied without problem.
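To make the verifier rule above concrete, here is a minimal sketch
(hypothetical function names, using the <scalable n x ty> syntax from this
RFC) of the kind of signature that is only legal when caller and callee are
guaranteed to agree on vscale:

  ; The store below writes vscale * 16 bytes. If @callee could run with a
  ; different vscale than its caller, the two functions would disagree on
  ; how large %v is, so signatures like this must be rejected for ABIs
  ; that don't make the callee inherit the caller's vscale.
  define void @callee(<scalable 4 x i32> %v, <scalable 4 x i32>* %p) {
    store <scalable 4 x i32> %v, <scalable 4 x i32>* %p
    ret void
  }

  define void @caller(<scalable 4 x i32>* %p, <scalable 4 x i32>* %q) {
    %v = load <scalable 4 x i32>, <scalable 4 x i32>* %q
    call void @callee(<scalable 4 x i32> %v, <scalable 4 x i32>* %p)
    ret void
  }

Under a vscale-inheriting ABI this is fine, and existing optimizations
(including inlining @callee into @caller) remain valid.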
> So, if I have some vectorized code, and we figure out that some of it is
> cold, so we outline it, and then the kernel decides to decrease vscale for
> that function, now I have broken the application? Storing a vector
> argument in memory in that function now doesn't store as much data as it
> would have in the caller?
>
>> I don't know how that works at the kernel level (how to detect those
>> boundaries? instrument every branch?) but this is what I understood from
>> the current discussion.
>
> Can we find out?
>
>>> If so, then it's not clear that the function-call boundary makes sense
>>> unless you prevent inlining. If you prevent inlining, when does that
>>> decision get made? Will the vectorizer need to outline loops? If so,
>>> outlining can have a real cost that's difficult to model. How do return
>>> types work?
>> The dynamic nature is not part of the program, so inlining can happen as
>> always. Given that the vectors are agnostic of size and work regardless
>> of what the kernel provides (within safety boundaries), the code
>> generation shouldn't change too much.
>>
>> We may have to create artefacts to restrict the maximum vscale (for
>> safety), but others are better equipped to answer that question.
>>
>>> 1. I can definitely see the use cases for changing vscale dynamically,
>>> and so I do suspect that we'll want that support.
>> At a process/function level, yes. Within the same self-contained
>> sub-graph, I don't know.
>>
>>> 2. LLVM does not have loops as first-class constructs. We only have SSA
>>> (and, thus, dominance), and when specifying restrictions on placement of
>>> things in function bodies, we need to do so in terms of these constructs
>>> that we have (which don't include loops).
>> That's why I was trying to define the "self-contained sub-graph" above
>> (there must be a better term for that). It has to do with data
>> dependencies (scalar|memory -> vector -> scalar|memory), i.e. make sure
>> side-effects don't leak out.
>>
>> A loop iteration is usually such a block, but not all are and not all
>> such blocks are loops.
>>
>> Changing vscale inside a function, but outside of those blocks would be
>> "fine", as long as we made sure code movement respects those boundaries
>> and that context would be restored correctly on exceptions. But that's
>> not part of the current proposal.
>
> But I don't know how to implement that restriction without major changes
> to the code base. Such a restriction doesn't follow from use/def chains,
> and if we need a restriction that involves looking for non-SSA
> dependencies (e.g., memory dependencies), then I think that we need
> something different from the current proposal. Explicitly dependent types
> might work, something like intrinsics might work, etc.

Seconded, this is an extraordinarily difficult problem. I've spent
unreasonable amounts of time thinking about ways to model changing vector
sizes and sketching countless designs for it. Multiple times I convinced
myself some clever setup would work, and every time I later discovered a
fatal flaw. Until I settled on "only at function boundaries", that is, and
even that took a few iterations.

Cheers,
Robin

> Thanks again,
> Hal
>
>> Changing vscale inside one of those blocks would be madness. :)
>>
>> cheers,
>> --renato
>
> --
> Hal Finkel
> Lead, Compiler Technology and Programming Languages
> Leadership Computing Facility
> Argonne National Laboratory
Hal Finkel via llvm-dev
2018-Aug-01 20:41 UTC
[llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths
On 08/01/2018 02:59 PM, Robin Kruppe wrote:
> On 1 August 2018 at 18:26, Hal Finkel <hfinkel at anl.gov> wrote:
>> On 08/01/2018 06:15 AM, Renato Golin wrote:
>>> On Tue, 31 Jul 2018 at 23:46, Hal Finkel via llvm-dev
>>> <llvm-dev at lists.llvm.org> wrote:
>>>> In some sense, if you make vscale dynamic, you've introduced dependent
>>>> types into LLVM's type system, but you've done it in an implicit
>>>> manner. It's not clear to me that works. If we need dependent types,
>>>> then an explicit dependence seems better. (e.g.,
>>>> <scalable <n> x %vscale_var x <type>>)
>>> That's a shift from the current proposal and I think we can think about
>>> it after the current changes. For now, both SVE and RISC-V are proposing
>>> function boundaries for changes in vscale.
>> I understand. I'm afraid that the function-boundary idea doesn't work
>> reasonably.
> FWIW, I don't think dependent types really help with the code motion
> problems.

Good point.

...

--
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory