On 2/5/19 1:27 AM, Philip Reames via llvm-dev wrote:
>
> On 1/31/19 4:57 PM, Bruce Hoult wrote:
>> On Thu, Jan 31, 2019 at 4:05 PM Philip Reames via llvm-dev
>> <llvm-dev at lists.llvm.org> wrote:
>>> Do such architectures frequently have arithmetic operations on the
>>> mask registers? (i.e. can I reasonably compute a conservative
>>> length given a mask register value) If I can, then having a mask as
>>> the canonical form and re-deriving the length register from a mask
>>> for a sequence of instructions which share a predicate seems fairly
>>> reasonable. Note that I'm assuming this as a fallback, and that the
>>> common case is handled via the equivalent of ComputeKnownBits on the
>>> mask itself at compile time.
>> If masking is used (which it is usually not for loops without control
>> flow inside the vectorised loop) then, yes, logical operations on the
>> mask registers will happen at every basic block boundary.
>>
>> But it is NOT the case that you can compute the active vector length
>> VL from an initial mask value. The active vector length is set by the
>> hardware based on the remaining application vector length. The VL can
>> change for each loop iteration -- the normal pattern is for VL to
>> equal VLMAX for initial executions of the loop, and then be less than
>> VLMAX for the final one or two iterations of the loop. For example if
>> VLMAX is 16 and there are 19 elements left in the application vector
>> then the hardware might choose to use 10 elements for the 2nd to last
>> iteration and 9 elements for the last iteration. Or not. Other
>> hardware might choose to perform the last three iterations as 12/12/11
>> instead of 16/10/9. (It is constrained to be monotonic).
>>
>> VL can also be dynamically shortened in the middle of a loop iteration
>> by an unaligned vector load that crosses a protection boundary if the
>> later elements are inaccessible.
> I can't reconcile this complexity with either the snippet on RISC-V
> which was shared, or the current EVL proposal. Doesn't this imply
> that the vector length can change between *every* pair of vector
> instructions? If so, how does having it as part of the EVL intrinsics
> work?

I think this is the usual mixup of AVL and MVL.

AVL: is part of the predicate and can change between vector operations just like a mask can (light weight).

MVL: is the physical vector register length and can be re-configured per function (RVV only atm) - (heavy weight, stop-the-world instruction).

The vectorlen parameter in EVL intrinsics is for the AVL.

>> I'm curious what SVE will do if there is an if/then/else in the middle
>> of a vectorised loop with a shorter-than-maximum vector length. You
>> can't just invert the mask when going from the then-part to the
>> else-part because that would re-enable elements past the end of the
>> vector. You'd need to invert the mask and then AND it with the mask
>> containing the (bitwise representation of) the vector length.

If folks have issues with carrying the vlen around even if the target only supports masking, we can rephrase EVL using higher-order functions with varargs (basically prefixing):

ARM SVE, AVX512 (mask only targets):

    llvm.evl.masked(<16 x i1> mask %M, ...)

    llvm.evl.fsub(<16 x float>, <16 x float>)  ; exists only to get a function handle

    call @llvm.evl.masked.v16f32(%M, @llvm.evl.fsub(v16f32, <16 x float>, <16 x float>)

RISC-V V, SX-Aurora:

    llvm.evl.pred(<16 x i1> mask %M, i32 vlen %VL, ...)

    llvm.evl.pred(%M, %vl, @llvm.evl.fsub, %a, %b)

The problem with this is mostly that the operand positions are now off compared to regular IR, and the API abstractions that accept both will have to account for that.

- Simon

-- 
Simon Moll
Researcher / PhD Student

Compiler Design Lab (Prof. Hack)
Saarland University, Computer Science
Building E1.3, Room 4.31

Tel. +49 (0)681 302-57521 : moll at cs.uni-saarland.de
Fax. +49 (0)681 302-3065  : compilers.cs.uni-saarland.de/people/moll
On Tue, Feb 5, 2019 at 1:23 AM Simon Moll <moll at cs.uni-saarland.de> wrote:
> I think this is the usual mixup of AVL and MVL.
>
> AVL: is part of the predicate and can change between vector operations
> just like a mask can (light weight).
>
> MVL: Is the physical vector register length and can be re-configured per
> function (RVV only atm) - (heavy weight, stop-the-world instruction).
>
> The vectorlen parameter in EVL intrinsics is for the AVL.

Unless I misunderstand, this doesn't describe RVV correctly, although this is understandable as the spec has moved around a bit in the last six or twelve months as it's gotten closer to being set in stone.

The way it has ended up (very unlikely to change now) is:

- any given RVV vector unit has 32 registers, each with the same fixed length in bits.

- the vector unit is configured by the VSETVL[I] instruction, which has two arguments: 1) the requested AVL, and 2) the vtype (vector type).

- the vtype is an integer with several small fields, of which two are currently defined (the other bits must be zero). The fields are the Standard Element Width (SEW) and VLMul. SEW can be any power of 2 from 8 bits up to some implementation-defined maximum (1024 bits absolute maximum). VLMul says that you don't actually need 32 distinct vector variables in your current loop/function and you're willing to trade number of registers for a larger MVL. So, you can gang together each even/odd register pair into 16 longer registers (named 0,2,4...30), or you can gang together groups of four or at most eight registers.

- the current MVL -- the maximum number of elements in a vector register -- is the hardware register length, multiplied by the VLMul field in vtype, divided by the SEW field in vtype.

- the AVL is the smaller of MVL and the requested AVL.

- only two things can change AVL: the VSETVL[I] instruction, and a special kind of memory load: "Unit-stride First-Fault Loads", if the load crosses a protection boundary and the tail of the vector is inaccessible. This kind of load is relatively uncommon and exists so you can vectorise things where the end of the application vector is data-dependent rather than counted. The canonical example is strlen()/strcpy(). For most code you can ignore it and say the AVL changes only when you execute VSETVL[I].

- any time the program uses VSETVL[I], *both* the MVL and the AVL can change.

- the common case is a loop with the vtype in an immediate VSETVLI at the head of the loop. In this case, the AVL potentially changes in every iteration of the loop (but usually only in the last one or two iterations). As the vtype is in an immediate it can't change from iteration to iteration. But it's common for two loops in the same function to use different vtype, and so different MVL, because the loops might either operate on different data types, or need a different number of vector variables in the loop, or both.

- VSETVL[I] is *not* heavyweight, even if it changes the MVL. It's quite ok to execute it as much as you want -- even before every vector instruction if you want. That would be pretty unusual, and I think falls more into the "clever hand-written code" area than into anything a compiler is likely to want to generate from C loops, although it's certainly possible.
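
The two formulas in the list above (MVL from register length, VLMul and SEW; AVL as the smaller of MVL and the requested AVL) can be written out as a minimal C sketch. The helper names `mvl_elems` and `avl_elems` are made up for illustration and are not part of any RVV toolchain. The sketch only models the simplest rule, and real hardware may return a smaller AVL near the end of a vector.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (illustration only).
 * MVL in elements = hardware register length * VLMul / SEW. */
static size_t mvl_elems(size_t vlen_bits, size_t vlmul, size_t sew_bits)
{
    /* VLMul gangs 1, 2, 4 or at most 8 registers together. */
    assert(vlmul == 1 || vlmul == 2 || vlmul == 4 || vlmul == 8);
    /* SEW is a power of 2, at least 8 bits. */
    assert(sew_bits >= 8 && (sew_bits & (sew_bits - 1)) == 0);
    return vlen_bits * vlmul / sew_bits;
}

/* Hypothetical helper: the simplest rule, AVL = min(MVL, requested AVL). */
static size_t avl_elems(size_t requested, size_t mvl)
{
    return requested < mvl ? requested : mvl;
}
```

For an assumed 512-bit register, `mvl_elems(512, 4, 32)` and `mvl_elems(512, 8, 64)` both evaluate to 64, so widening the elements while doubling VLMul leaves the element count unchanged.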

Here's an example:

    void foo(size_t n, int64_t *dst, int32_t *a, int32_t *b){
      for (size_t i=0; i<n; ++i)
        dst[i] += a[i] * b[i];
    }

If 32x32->64 multiplies are cheaper than 64x64->64 multiplies then you might want to compile this to:

        # args n in a0, dst in a1, a in a2, b in a3, AVL in a4
    foo:
        vsetvli a4, a0, vsew32,vlmul4   # vtype = 32-bit integer vectors, AVL in a4
        vlw.v v0, (a2)                  # Get 32b vector a into v0-v3
        vlw.v v4, (a3)                  # Get 32b vector b into v4-v7
        slli a5, a4, 2                  # multiply AVL by element size 4 bytes
        add a2, a2, a5                  # Bump pointer a
        add a3, a3, a5                  # Bump pointer b
        vwmul.vv v8, v0, v4             # 64b result in v8-v15

        vsetvli zero, a0, vsew64,vlmul8 # Operate on 64b values, discard new AVL as it's the same
        vld.v v16, (a1)                 # Get 64b vector dst into v16-v23
        vadd.vv v16, v16, v8            # add 64b elements in v8-v15 to v16-v23
        vsd.v v16, (a1)                 # Store vector of 64b
        slli a5, a4, 3                  # multiply AVL by element size 8 bytes
        add a1, a1, a5                  # Bump pointer dst
        sub a0, a0, a4                  # subtract AVL from n to get remaining count
        bnez a0, foo                    # Any more?
        ret

The alternative of course is to set up for 64 bit elements at the outset, let the two vlw.v's for a and b widen the 32 bit loads into 64 bit elements, then do 64x64->64 multiplies. The code would be two instructions shorter, saving one of the vsetvli (4 bytes) and one of the shifts (2 bytes).

Assume for the moment a 512 bit (64 byte) vector register size (total vector register file 2 KB). This function initially sets the MVL to 64 (2048 bits divided into 32-bit elements). The widening multiply produces 64 64-bit elements. The second half of the loop then sets the element size to 64 bits and doubles the vlmul, so the MVL is still 64 (4096 bits divided into 64-bit elements). The load, add, and store of dst then take place using 64 bit calculations.

Except on the last iteration [1] the AVL will be the same as the MVL. Both will change (in bits, not in number of elements in this case) twice in each loop.
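
To make the loop structure of the assembly easier to follow, here is a scalar C model of the same strip-mining pattern. It is a sketch under stated assumptions, not what any compiler emits. `MODEL_MVL`, `model_setvl` and `foo_model` are hypothetical names, the MVL of 64 comes from the assumed 512-bit register size, and `model_setvl` uses the plain min(requested, MVL) rule rather than any hardware-specific choice.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 512-bit registers, so vsew32,vlmul4 and vsew64,vlmul8
 * both give an MVL of 64 elements. */
enum { MODEL_MVL = 64 };

/* Hypothetical stand-in for vsetvli: grants min(requested, MVL). */
static size_t model_setvl(size_t requested)
{
    return requested < MODEL_MVL ? requested : MODEL_MVL;
}

/* Scalar model of the strip-mined loop. Returns the number of strips
 * (trips through the assembly loop body) that were executed. */
static size_t foo_model(size_t n, int64_t *dst,
                        const int32_t *a, const int32_t *b)
{
    size_t strips = 0;
    while (n != 0) {                     /* bnez a0, foo */
        size_t vl = model_setvl(n);      /* vsetvli a4, a0, ... */
        for (size_t i = 0; i < vl; ++i) {
            /* vwmul.vv (32x32->64 widening multiply), then vadd.vv */
            dst[i] += (int64_t)a[i] * (int64_t)b[i];
        }
        a += vl;                         /* bump pointers by the granted VL */
        b += vl;
        dst += vl;
        n -= vl;                         /* sub a0, a0, a4 */
        ++strips;
    }
    return strips;
}
```

The key point the model shows is that every pointer bump and the remaining-count update use the length that was granted, not the length that was requested.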
[1] if on the 2nd to last iteration there are, say, 72 elements left, the vsetvli instruction might choose to return an AVL of 36 elements, leaving 36 for the last iteration, rather than doing 64 and then leaving only 8 for the last iteration. Or maybe 48 and 24, or 40 and 32 depending on what suits that particular hardware. Or maybe it will equalise the last three or four or more iterations. The main rule is the AVL must decrease monotonically.
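
One possible tail-balancing policy consistent with this footnote can be sketched in C. This is purely illustrative: `balanced_setvl` is a hypothetical policy that splits the last two strips evenly, and actual hardware is free to choose differently as long as the AVL never increases.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical policy: full strips while at least 2*MVL elements remain,
 * then split the final two strips as evenly as possible. */
static size_t balanced_setvl(size_t remaining, size_t mvl)
{
    assert(mvl > 0);
    if (remaining <= mvl)
        return remaining;               /* last strip takes everything */
    if (remaining < 2 * mvl)
        return (remaining + 1) / 2;     /* second-to-last: half, rounded up */
    return mvl;                         /* plenty left: use the full MVL */
}

/* Runs the strip-mining loop and returns 1 if the granted VL never
 * increased from one strip to the next, 0 otherwise. */
static int vl_is_monotonic(size_t n, size_t mvl)
{
    size_t prev = mvl;
    while (n != 0) {
        size_t vl = balanced_setvl(n, mvl);
        if (vl == 0 || vl > prev)
            return 0;
        prev = vl;
        n -= vl;
    }
    return 1;
}
```

With 72 elements left and an MVL of 64 this policy grants 36 and then 36. With 19 left and an MVL of 16 it grants 10 and then 9.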
Luke Kenneth Casson Leighton via llvm-dev
2019-Feb-05 11:49 UTC
[llvm-dev] [RFC] Vector Predication
On Tuesday, February 5, 2019, Simon Moll <moll at cs.uni-saarland.de> wrote:

> I think this is the usual mixup of AVL and MVL.
>
> AVL: is part of the predicate

Mmm that's very confusing to say that AVL is part of the predicate. It's.... kinda true?

> and can change between vector operations just like a mask can (light
> weight).

Yes, ok, it's more that it is an "advisory". In RVV the program (the instruction) *requests* a specific AVL and the processor responds with an *actual* AVL of between 0 (yes really, zero) and MIN(MVL, requested_AVL).

To say that it's a predicate, well... a predicate mask, you set it, and the mask is obeyed, period. AVL, that just doesn't happen.

> MVL: Is the physical vector register length and can be re-configured per
> function (RVV only atm) - (heavy weight, stop-the-world instruction).

My understanding of RVV is that MVL is intended to be more of a hardcoded parameter that is part of the processor design. Any compiler should be generating code that really does not need to know what MVL is.

SV is slightly different, due to the fact that we use the *scalar* regfile as if it was a typecasted SRAM. The register number in any given instruction is just a pointer to the SRAM address at which vector elements i8/16/32/64 are read/written.

So in SV we need to *set* the MVL, otherwise how can the engine know the point where it has to stop reading/writing to the register SRAM?

However what is most likely to happen is, MVL will be set globally to e.g. 4 and be done with it.

SV semantics for AVL are also slightly different from RVV, not by much though. The engine is not permitted to choose arbitrary values: if AVL is requested to be set to 4, it must *be* set to MIN(MVL, 4). This can sometimes avoid the need for a loop entirely (short vectors).

Note also that in SV, neither AVL nor MVL may be set to zero. AVL=1 indicates that the engine is to interpret instructions in SCALAR mode.

> The vectorlen parameter in EVL intrinsics is for the AVL.

Ok so there is a bit of a problem, for both SV and RVV, in that both can end up with different AVL values from what is requested.

If the API expects that when AVL elements are to be processed, exactly that number of elements *will* have been processed, that is simply not the case, and that assumption will result in a catastrophic failure: elements not being processed.

To deal with that, if it is a hard requirement of the API that exactly the number of AVL ops are carried out as requested, an otherwise completely redundant assembly code for-loop will have to be generated.

Oh and then outside of that loop would be the IR level inner loop that was actually part of the user's program.

Basically what I am saying is that the semantics "request an AVL from the hardware and get an ACTUAL number of elements to be processed" really needs to become part of the API.

Now, fascinatingly, for SIMD-only style architectures, that could hypothetically be used to communicate to the JIT engine converting the IR to use progressively smaller SIMD widths, on architectures that have multiple widths. Also to indicate when corner-case cleanup is to be used. (SIMD already being a mess, this would all not be high priority / optimised)

OR...

the inner workings of AVL are entirely hidden and opaque to the IR. The IR sets the total explicit number of elements, and It Gets Done.

However I suspect that doing that will open a can of worms.

>>> I'm curious what SVE will do if there is an if/then/else in the middle
>>> of a vectorised loop with a shorter-than-maximum vector length. You
>>> can't just invert the mask when going from the then-part to the
>>> else-part because that would re-enable elements past the end of the
>>> vector. You'd need to invert the mask and then AND it with the mask
>>> containing the (bitwise representation of) the vector length.

Yep, that is a workable solution for fixed width (SIMD) architectures, it is a good pattern to use.

As I mentioned earlier (about the mistake of using gather/scatter as a means and method of implementing predication), it would be a mistake to try to "dumb down" this proposal to cater for fixed-length SIMD engines to the detriment of dynamic-length engines.

If you try that then all the advantages of dynamic-length ISAs are utterly destroyed, as the only way to implement the compliance with a dumbed-down fixed-length proposal is for variable-length ISAs to issue brain-dead FIXED length assembly code.

Whereas if the API can cope with variable length, the length that is returned for a SIMD engine may be one of the multiples of SIMD widths that that engine supports, and it can use scatter/gather as a substitute for (potential) lack of predication masks and so on.

If as an industry we want to break free of the seductively broken SIMD paradigm, then variable-length engines need to be given top priority. Really.

And again, I say that with profuse apologies to all engineers who have to deal with SIMD. I know it's so much easier to implement at the hardware level, it's just that SIMD has always made the compiler writer's job absolute hell.

L.

-- 
---
crowd-funded eco-conscious hardware: crowdsupply.com/eoma68
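
The invert-then-AND pattern from the quoted question can be illustrated with a small C sketch over a 16-lane bitmask. The 16-lane width and the helper names `length_mask` and `else_mask` are assumptions made for this example. Nothing here is SVE-specific code.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: mask with lanes 0..vl-1 enabled (16 lanes total). */
static uint16_t length_mask(unsigned vl)
{
    assert(vl <= 16);
    return (uint16_t)((1u << vl) - 1u);
}

/* Mask for the else-part: invert the then-mask, then AND with the
 * length mask so that lanes at or beyond vl stay disabled. */
static uint16_t else_mask(uint16_t then_mask, unsigned vl)
{
    return (uint16_t)(~(unsigned)then_mask & length_mask(vl));
}
```

With a vector length of 10 and a then-mask of 0x0155, a plain inversion would yield 0xFEAA and re-enable lanes 10 to 15, whereas `else_mask(0x0155, 10)` yields 0x02AA.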
On 2/5/19 12:49 PM, Luke Kenneth Casson Leighton wrote:
> Basically what I am saying is that the semantics "request an AVL from
> the hardware and get an ACTUAL number of elements to be processed"
> really needs to become part of the API.

Ok. We could add this behavior to the EVL contract with an intrinsic.

    %EffectiveVL = llvm.evl.setvl(<scalable vscale x float>, %RequestedAVL)

where vscale would be interpreted as VLMul on RISC-V.

> the inner workings of AVL are entirely hidden and opaque to the IR.
> The IR sets the total explicit number of elements, and It Gets Done.

That would still be ok, if the following invariant holds on RVV:

Given that

    %effectivevl = setvl <vty> %reqVL

Will the following invariant hold whenever %avl <= %effectivevl?

    %avl == setvl <vty>, %avl

In that case, we could require the VL parameter to be derived from the evl.setvl intrinsic without violating that part of EVL semantics (all elements in the range [0, VL) will be processed).

- Simon
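
Whether the invariant asked about above holds depends on the hardware's setvl behaviour. A brute-force C sketch can check it against a few hypothetical models. All three `setvl_*` functions are invented for illustration: `setvl_min` follows the min(MVL, requested) rule, `setvl_balanced` evens out the last two strips, and `setvl_chunked` models an imagined engine that grants only multiples of four lanes. None of them is claimed to match a real implementation.

```c
#include <assert.h>
#include <stddef.h>

typedef size_t (*setvl_fn)(size_t requested, size_t mvl);

/* Hypothetical model 1: grant min(requested, MVL). */
static size_t setvl_min(size_t requested, size_t mvl)
{
    return requested < mvl ? requested : mvl;
}

/* Hypothetical model 2: balance the last two strips. */
static size_t setvl_balanced(size_t requested, size_t mvl)
{
    if (requested <= mvl)
        return requested;
    if (requested < 2 * mvl)
        return (requested + 1) / 2;
    return mvl;
}

/* Hypothetical model 3: grant only multiples of 4 lanes once at least
 * 4 are requested (may grant fewer than asked even below MVL). */
static size_t setvl_chunked(size_t requested, size_t mvl)
{
    size_t c = requested < mvl ? requested : mvl;
    return c >= 4 ? c - c % 4 : c;
}

/* Returns 1 if setvl(avl) == avl for every avl <= setvl(requested). */
static int invariant_holds(setvl_fn setvl, size_t requested, size_t mvl)
{
    size_t effective = setvl(requested, mvl);
    for (size_t avl = 0; avl <= effective; ++avl) {
        if (setvl(avl, mvl) != avl)
            return 0;
    }
    return 1;
}
```

Under these assumptions the invariant holds for the first two models and fails for the third, which suggests the question reduces to whether the hardware may grant fewer elements than requested when the request already fits within MVL.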
On 2/5/19 12:06 PM, Bruce Hoult wrote:
> [...]
> Both will change (in bits, not in number of elements in this case)
> twice in each loop.
>
> [1] if on the 2nd to last iteration there are, say, 72 elements left,
> the vsetvli instruction might choose to return an AVL of 36 elements,
> leaving 36 for the last iteration, rather than doing 64 and then
> leaving only 8 for the last iteration. Or maybe 48 and 24, or 40 and
> 32 depending on what suits that particular hardware. Or maybe it will
> equalise the last three or four or more iterations. The main rule is
> the AVL must decrease monotonically.

Thank you for the detailed explanation! I wasn't aware of the current state of RVV in that regard.

This seems to imply that enforcing that MVL changes only at function level is now moot (as in lists.llvm.org/pipermail/llvm-dev/2018-April/122517.html).

-- 
Simon Moll