C Bergström via llvm-dev
2016-Nov-27 15:35 UTC
[llvm-dev] [RFC] Supporting ARM's SVE in LLVM
I'm sorry.. may I interrupt for a minute and try to grok things from a
bit different angle..

While the VL can vary.. in practice wouldn't the cost of vectorization
and width be tied more to the hardware implementation than anything
else? The cost of vectorizing thread 1 vs 2 isn't likely to change?
(Am I drunk and mistaken?)

If the above holds true then the length would be only variable
between different hardware implementations.. (at least this is how I
understand it)

This seems tightly coupled to hardware..


On Sun, Nov 27, 2016 at 9:59 PM, Paul Walker via llvm-dev
<llvm-dev at lists.llvm.org> wrote:
> Thanks Renato, my takeaway is that I am presenting the design out of
> order. So let's focus purely on the vector length (VL) and ignore
> everything else. For SVE the vector length is unknown and can vary
> across an as yet undetermined boundary (process, library....). Within
> a boundary we propose making VL a constant with all instructions that
> operate on this constant locked within its boundary.
>
> I know this stretches the meaning of constant and my reasoning
> (however unsound) is below. We expect changes to VL to be infrequent
> and not located where it would present an unnecessary barrier to
> optimisation. With this in mind the initial implementation of VL
> barriers would be an intrinsic that prevents any instruction movement
> across it.
>
> Question: Is this type of intrinsic something LLVM supports today?
>
> Why a constant? Well it doesn't change within the context it is being
> used. More crucially the LLVM implementation of constants gives us a
> property that's very important to SVE (perhaps this is where
> prototyping laziness has kicked in). Constants remain attached to the
> instructions that operate on them through until code generation. This
> allows the semantic meaning of these instructions to be maintained,
> something non-scalable vectors get for free with their "real"
> constants.
>
> As a specific example take the vector reversal that LoopVectorize does
> when iterating backward through memory. For non-scalable vectors this
> looks thusly:
>
>     shufflevector <4 x i32> %a, <4 x i32> undef, <i32 3, i32 2, i32 1, i32 0>
>
> Throughout the IR and into code generation the intention of this
> instruction is clear. Now turning to scalable vectors the same
> operation becomes:
>
>     shufflevector <n x 4 x i32> %a, <n x 4 x i32> undef, <n x 4 x i32> seriesvector ( sub (i32 VL, 1), i32 -1)
>
> Firstly I'll highlight the use of seriesvector is purely for brevity,
> let's ignore that debate for now. Our concern is that not treating VL
> as a Constant means sub and seriesvector are no longer constant and
> are likely to be hoisted away from the shufflevector. The knock-on
> effect being to force the code generator into generating generic
> vector permutes rather than utilise any specialised permute
> instructions the target provides.
>
> Does this make sense? I am not after agreement, just want to make sure
> we are on the same page regarding our aims before digging down into
> how VL actually looks and its interaction with the loop vectoriser’s
> chosen VF.
>
> Paul!!!
>
> p.s.
>
> I'll respond to the stepvector question later in a separate post to
> break down the different discussion points.
>
>
> On 26/11/2016, 17:07, "Renato Golin" <renato.golin at linaro.org> wrote:
>
> > On 26 November 2016 at 11:49, Paul Walker <Paul.Walker at arm.com> wrote:
> > > Related to this I want to push this and related conversations in a
> > > different direction. From the outset our approach to add SVE
> > > support to LLVM IR has been about solving the generic problem of
> > > vectorising for an unknown vector length and then extending this
> > > to support predication. With this in mind I would rather the
> > > problem and its solution be discussed at the IR's level of
> > > abstraction rather than getting into the guts of SVE.
> >
> > Hi Paul,
> >
> > How scalable vectors operate is intimately related to how you
> > represent them in IR. It took a long time for the vector types to be
> > mapped to all available semantics. We still had to use a bunch of
> > intrinsics for scatter / gather, it took years to get the strided
> > access settled.
> >
> > I understand that scalable vectors are orthogonal to all this, but as
> > a new concept, one that isn't available in any open source compiler I
> > know of, is one that will likely be very vague. Not publishing the
> > specs only makes it worse.
> >
> > I take the example of the ACLE and ARMv8.2 patches that ARM has been
> > pushing upstream. I have no idea what the new additions are, so I have
> > to take your word that they're correct. But later on, different
> > behaviour comes along for the same features with a comment "it didn't
> > work that way, let's try this". Sometimes, I don't even know what
> > failed, or why this new thing is better.
> >
> > When that behaviour is constricted to the ARM back-end, it's ok. It's
> > a burden that me and Tim will have to carry, and so far, it has been a
> > small burden. But exposing the guts of the vectorizers (which are
> > already getting to a point where they need large refactorings), which
> > will affect all targets, needs a bit more concrete information.
> >
> > The last thing we want is to keep changing how the vectorizer behaves
> > every six months without any concrete information as to why.
> >
> > I also understand that LLVM is great at prototyping, and that's an
> > important step for companies like ARM to make sure their features work
> > as reliably as they expect in the wild, but I think adding new IR
> > semantics and completely refactoring core LLVM passes without a clue
> > is a few steps too far.
> >
> > I'm not asking for a full spec. All I'm asking is for a description of
> > the intended basic functionality. Addressing modes, how to extract
> > information from unknown lanes, or if all reduction steps will be done
> > like `saddv`. Without that information, I cannot know what is the best
> > IR representation for scalable vectors or what will be the semantics
> > of shuffle / extract / insert operations.
> >
> >
> > > "complex constant" is the term used within the LangRef. Although
> > > its value can be different across certain interfaces this does not
> > > need to be modelled within the IR and thus for all intents and
> > > purposes we can safely consider it to be constant.
> >
> > From the LangRef:
> >
> > "Complex constants are a (potentially recursive) combination of simple
> > constants and smaller complex constants."
> >
> > There's nothing there saying it doesn't need to be modeled in IR.
> >
> >
> > > "vscale" is not trying to represent the result of such speculation.
> > > It's purely a constant runtime vector length multiplier. Such a
> > > value is required by LoopVectorize to update induction variables as
> > > described below plus simple interactions like extracting the last
> > > element of a scalable vector.
> >
> > Right, I'm beginning to see what you mean...
> >
> > The vectorizer needs that to be a constant at compile time to make
> > safety assurances.
> >
> > For instance: for (1..N) { a[i+3] = a[i] + i; }
> >
> > Has a max VF of 3. If the vectorizer is to act on that loop, it'll
> > have to change "vscale" to 3. If there are no loop dependencies, then
> > you leave it as "vscale" but vectorize anyway.
> >
> > Other assurances are done for run time constants, for instance, tail
> > loops when changing
> >
> >     for (i=0; i<N; i++) -> for (i=0; i<N; i+=VF)
> >
> > That VF is now a run-time "constant", and the vectorizer needs to see
> > it as such, otherwise it can't even test for validity.
> >
> > So, the vectorizer will need to be taught two things:
> >
> > 1. "vscale" is a run time constant, and for the purpose of validity,
> > can be shrunk to any value down to two. If the value is shrunk, the
> > new compile time constant replaces vscale.
> >
> > 2. The cost model will *have* to treat "vscale" as an actual compile
> > time constant. This could come from a target feature, overridden by a
> > command line flag but there has to be a default, which I'd assume is
> > 4, given that it's the lowest length.
> >
> >
> > >     %index.next = add nuw nsw i64 %index, mul (i64 vscale, i64 4)
> > >
> > > for a VF of "n*4" (remembering that vscale is the "n" in
> > > "<n x 4 x Ty>")
> >
> > I see what you mean.
> >
> > Quick question: Since you're saying "vscale" is an unknown constant,
> > why not just:
> >
> >     %index.next = add nuw nsw i64 %index, i64 vscale
> >
> > All scalable operations will be tied up by the predication vector
> > anyway, and you already know what the vector type size is anyway.
> >
> > The only worry is about providing redundant information that could go
> > stale and introduce bugs.
> >
> > I'm assuming the vectorizer will *have* to learn about the compulsory
> > predication and build those vectors, or the back-end will have to
> > handle them, and it can get ugly.
> >
> >
> > > >     %const_vec = <n x 4 x i32> @llvm.sve.constant_vector(i32 %start, i32 %step)
> > >
> > > This intrinsic matches the seriesvector instruction we originally
> > > proposed. However, on reflection we didn't like how it allowed
> > > multiple representations for the same constant.
> >
> > Can you expand how this allows multiple representations for the same
> > constant?
> >
> > This is a series, with a start and a step, and will only be identical
> > to another which has the same start and step.
> >
> > Just like C constants can "appear" different...
> >
> >     const int foo = 4;
> >     const int bar = foo;
> >     const int baz = 2 + 2;
> >
> >
> > > I know this doesn't preclude the use of an intrinsic, I just wanted
> > > to highlight that doing so doesn't automatically change the
> > > surrounding IR.
> >
> > I don't mind IR changes, I'm just trying to understand the need for it.
> >
> > Normally, what we did in the past for some things was to add
> > intrinsics and then, if it's clear a native IR construct would be
> > better, we change it.
> >
> > At least the intrinsic can be easily added without breaking
> > compatibility with anything, and since we're in prototyping phase
> > anyway, changing the IR would be the worst idea.
> >
> > cheers,
> > --renato
>
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
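
The reversal mask discussed in the message above can be modelled in plain C. This is only a sketch of the arithmetic, not LLVM code: `series_vector`, `shuffle_lanes` and `reverse_mask` are names invented here, and `vscale * 4` stands in for the run-time lane count of `<n x 4 x i32>`. With `vscale == 1` the series `(lanes - 1, -1)` reduces to the fixed-width mask `<3, 2, 1, 0>`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fill out[0..lanes) with start, start + step, start + 2*step, ...
 * This models what "seriesvector (start, step)" denotes in the thread. */
void series_vector(int32_t *out, size_t lanes, int32_t start, int32_t step)
{
    for (size_t i = 0; i < lanes; ++i)
        out[i] = start + (int32_t)i * step;
}

/* Single-source shuffle: dst[i] = src[mask[i]]. */
void shuffle_lanes(int32_t *dst, const int32_t *src,
                   const int32_t *mask, size_t lanes)
{
    for (size_t i = 0; i < lanes; ++i) {
        assert(mask[i] >= 0 && (size_t)mask[i] < lanes);
        dst[i] = src[mask[i]];
    }
}

/* Build the reversal mask for a vector of vscale * 4 lanes, where vscale
 * is only known at run time: seriesvector(lanes - 1, -1). */
void reverse_mask(int32_t *mask, size_t vscale)
{
    size_t lanes = vscale * 4;
    series_vector(mask, lanes, (int32_t)lanes - 1, -1);
}
```

The mask depends on the lane count only through its start value, which is why the proposal wants `sub (i32 VL, 1)` to stay attached to the shuffle as a constant operand rather than be hoisted away as an ordinary computed value.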
Renato Golin via llvm-dev
2016-Nov-27 15:40 UTC
[llvm-dev] [RFC] Supporting ARM's SVE in LLVM
On 27 November 2016 at 15:35, C Bergström <cbergstrom at pathscale.com> wrote:
> While the VL can vary.. in practice wouldn't the cost of vectorization
> and width be tied more to the hardware implementation than anything
> else? The cost of vectorizing thread 1 vs 2 isn't likely to change?
> (Am I drunk and mistaken?)

Mistaken. :)

The scale of the vector can change between two processes on the same
machine and it's up to the kernel (I guess) to make sure they're
correct.

In theory, it could even change in the same process, for instance, as
a result of PGO or if some loops have fewer loop-carried dependencies
than others.

The three important premises are:

1. The vectorizer still has the duty to restrict the vector length to
whatever makes it cope with the loop dependencies. SVE *has* to be
able to cope with that by restricting the number of lanes "per
access".

2. The cost analysis will have to assume the smallest possible vector
size and "hope" that anything larger will only mean profit. This seems
straightforward enough.

3. Hardware flags and target features must be able to override the
minimum size, maximum size, etc. and it's up to the users to make sure
that's meaningful in their hardware.

cheers,
--renato
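
Premise 1 can be illustrated with a small C model of the `a[i+3] = a[i] + i` loop mentioned earlier in the thread. This is a sketch under assumptions: `chunked_loop` imitates a vector step by loading every active lane before storing any of them, `hw_lanes` stands in for `vscale * 4`, and `max_safe_lanes` is the clamp the vectorizer would derive from the dependence distance. None of these names come from LLVM or SVE.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEP_DISTANCE 3   /* a[i+3] is read again three iterations later */
#define MAX_LANES 64     /* arbitrary upper bound for this model */

/* Scalar reference loop (0-based). a must hold n + DEP_DISTANCE elements. */
void scalar_loop(int32_t *a, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        a[i + DEP_DISTANCE] = a[i] + (int32_t)i;
}

/* Model of the vectorized loop: each step loads all active lanes, then
 * stores them. The step is min(hw_lanes, max_safe_lanes) and the last
 * step is shortened, as a predicate would do for the loop tail. */
void chunked_loop(int32_t *a, size_t n, size_t hw_lanes, size_t max_safe_lanes)
{
    assert(hw_lanes > 0 && hw_lanes <= MAX_LANES && max_safe_lanes > 0);
    size_t step = hw_lanes < max_safe_lanes ? hw_lanes : max_safe_lanes;
    int32_t tmp[MAX_LANES];

    for (size_t i = 0; i < n; i += step) {
        size_t active = (n - i < step) ? n - i : step;
        for (size_t l = 0; l < active; ++l)      /* "vector" load and add */
            tmp[l] = a[i + l] + (int32_t)(i + l);
        for (size_t l = 0; l < active; ++l)      /* "vector" store */
            a[i + l + DEP_DISTANCE] = tmp[l];
    }
}
```

With `max_safe_lanes` set to 3 the model matches the scalar loop whatever `hw_lanes` is. Letting it use 8 lanes makes a step read `a[i+3]` before the same step has written it, so the results diverge.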
C Bergström via llvm-dev
2016-Nov-27 15:58 UTC
[llvm-dev] [RFC] Supporting ARM's SVE in LLVM
On Sun, Nov 27, 2016 at 11:40 PM, Renato Golin <renato.golin at linaro.org> wrote:
> On 27 November 2016 at 15:35, C Bergström <cbergstrom at pathscale.com> wrote:
>> While the VL can vary.. in practice wouldn't the cost of vectorization
>> and width be tied more to the hardware implementation than anything
>> else? The cost of vectorizing thread 1 vs 2 isn't likely to change?
>> (Am I drunk and mistaken?)
>
> Mistaken. :)
>
> The scale of the vector can change between two processes on the same
> machine and it's up to the kernel (I guess) to make sure they're
> correct.
>
> In theory, it could even change in the same process, for instance, as
> a result of PGO or if some loops have fewer loop-carried dependencies
> than others.
>
> The three important premises are:
>
> 1. The vectorizer still has the duty to restrict the vector length to
> whatever makes it cope with the loop dependencies. SVE *has* to be
> able to cope with that by restricting the number of lanes "per
> access".
>
> 2. The cost analysis will have to assume the smallest possible vector
> size and "hope" that anything larger will only mean profit. This seems
> straightforward enough.
>
> 3. Hardware flags and target features must be able to override the
> minimum size, maximum size, etc. and it's up to the users to make sure
> that's meaningful in their hardware.

I'll bite my tongue on negative comments, but it seems that for
anything other than trivial loops this is going to put the burden
entirely on the user. Are you telling me the *kernel* is really going
to be able to make these decisions on the fly, correctly? Won't this
block loop transformations?
Bruce Hoult via llvm-dev
2016-Nov-28 13:09 UTC
[llvm-dev] [RFC] Supporting ARM's SVE in LLVM
"If the above holds true then the the length would be only variable between different hardware implementations.." This seems related to a problem that has independently hit several different projects around the world for a while now, and only recently have people understood what the problem is. These projects are all doing things that require cache invalidation, for example JIT compilation. They have hit a problem that the size of the cache block is changing unexpectedly underneath the program when the program is migrated from a big processor to a LITTLE. The program might start on a CPU with a 64 byte cache block and then suddenly find itself on a CPU with a 32 byte cache block, but it's still doing cache flushes with a 64 byte stride. So half the cache blocks don't get flushed. As far as I'm aware, there is no defined time at which this happens. Maybe it could be between one instruction and the next! We don't even know a good way to enumerate all cache block sizes present in the system at runtime (and always use the smallest one as the stride). So we're for the moment hard-coding a value which we hope will always be small enough, and taking the (minor) hit from trying to flush the same cache block multiple times. A 32 byte stride, say, on a machine with 128 byte cache blocks is still a lot better than using a stride of 1 or 4 bytes. If there is a defined time when these changes can happen e.g. at a system call then we'd really love to know about it! Not having seen any actual designs for SVE It seems possible to me that the vector width could also change on migration between core types. So perhaps the answer is the same. On Sun, Nov 27, 2016 at 6:35 PM, C Bergström via llvm-dev < llvm-dev at lists.llvm.org> wrote:> I'm sorry.. may I interrupt for a minute and try to grok things for a > bit different angle.. > > While the VL can vary.. in practice wouldn't the cost of vectorization > and width be tied more to the hardware implementation than anything > else? 
The cost of vectorizing thread 1 vs 2 isn't likely to change? > (Am I drunk and mistaken?) > > If the above holds true then the the length would be only variable > between different hardware implementations.. (at least this is how I > understand it) > > This seems tightly coupled to hardware.. > > > On Sun, Nov 27, 2016 at 9:59 PM, Paul Walker via llvm-dev > <llvm-dev at lists.llvm.org> wrote: > > Thanks Renato, my takeaway is that I am presenting the design out of > order. So let's focus purely on the vector length (VL) and ignore > everything else. For SVE the vector length is unknown and can vary across > an as yet undetermined boundary (process, library....). Within a boundary > we propose making VL a constant with all instructions that operate on this > constant locked within its boundary. > > > > I know this stretches the meaning of constant and my reasoning (however > unsound) is below. We expect changes to VL to be infrequent and not > located where it would present an unnecessary barrier to optimisation. > With this in mind the initial implementation of VL barriers would be an > intrinsic that prevents any instruction movement across it. > > > > Question: Is this type of intrinsic something LLVM supports today? > > > > Why a constant? Well it doesn't change within the context it is being > used. More crucially the LLVM implementation of constants gives us a > property that's very important to SVE (perhaps this is where prototyping > laziness has kicked in). Constants remain attached to the instructions > that operate on them through until code generation. This allows the > semantic meaning of these instruction to be maintained, something > non-scalable vectors get for free with their "real" constants. > > > > As a specific example take the vector reversal that LoopVectorize does > when iterating backward through memory. 
For non-scalable vectors this > looks thusly: > > > > shufflevector <4 x i32> %a, <4 x i32> undef, <i32 3, i32 2, i32 > 1, i32 0> > > > > Throughout the IR and into code generation the intention of this > instruction is clear. Now turning to scalable vectors the same operation > becomes: > > > > shufflevector <n x 4 x i32> %a, <n x 4 x i32> undef, <n x 4 x > i32> seriesvector ( sub (i32 VL, 1), i32 -1) > > > > Firstly I'll highlight the use of seriesvector is purely for brevity, > let's ignore that debate for now. Our concern is that not treating VL as a > Constant means sub and seriesvector are no longer constant and are likely > to be hoisted away from the shufflevector. The knock on effect being to > force the code generator into generating generic vector permutes rather > than utilise any specialised permute instructions the target provides. > > > > Does this make sense? I am not after agreement just want to make sure we > are on the same page regarding our aims before digging down into how VL > actually looks and its interaction with the loop vectoriser’s chosen VF. > > > > Paul!!! > > > > p.s. > > > > I'll respond to the stepvector question later in a separate post to > break down the different discussion points. > > > > > > On 26/11/2016, 17:07, "Renato Golin" <renato.golin at linaro.org> wrote: > > > > On 26 November 2016 at 11:49, Paul Walker <Paul.Walker at arm.com> > wrote: > > > Related to this I want to push this and related conversations in a > different direction. From the outset our approach to add SVE support to > LLVM IR has been about solving the generic problem of vectorising for an > unknown vector length and then extending this to support predication. With > this in mind I would rather the problem and its solution be discussed at > the IR's level of abstraction rather than getting into the guts of SVE. > > > > Hi Paul, > > > > How scalable vectors operate is intimately related to how you > > represent them in IR. 
It took a long time for the vector types to be > > mapped to all available semantics. We still had to use a bunch of > > intrinsics for scatter / gather, it took years to get the strided > > access settled. > > > > I understand that scalable vectors are orthogonal to all this, but as > > a new concept, one that isn't available in any open source compiler I > > know of, is one that will likely be very vague. Not publishing the > > specs only make it worse. > > > > I take the example of the ACLE and ARMv8.2 patches that ARM has been > > pushing upstream. I have no idea what the new additions are, so I > have > > to take your word that they're correct. But later on, different > > behaviour comes along for the same features with a comment "it didn't > > work that way, let's try this". Sometimes, I don't even know what > > failed, or why this new thing is better. > > > > When that behaviour is constricted to the ARM back-end, it's ok. It's > > a burden that me and Tim will have to carry, and so far, it has been > a > > small burden. But exposing the guts of the vectorizers (which are > > already getting to a point where the need large refactorings), which > > will affect all targets, need a bit more of concrete information. > > > > The last thing we want is to keep changing how the vectorizer behaves > > every six months without any concrete information as to why. > > > > I also understand that LLVM is great at prototyping, and that's an > > important step for companies like ARM to make sure their features > work > > as reliably as they expect in the wild, but I think adding new IR > > semantics and completely refactoring core LLVM passes without a clue > > is a few steps too far. > > > > I'm not asking for a full spec. All I'm asking is for a description > of > > the intended basic functionality. Addressing modes, how to extract > > information from unknown lanes, or if all reduction steps will be > done > > like `saddv`. 
Without that information, I cannot know what is the > best > > IR representation for scalable vectors or what will be the semantics > > of shufffle / extract / insert operations. > > > > > > > "complex constant" is the term used within the LangRef. Although > its value can be different across certain interfaces this does not need to > be modelled within the IR and thus for all intents and purposes we can > safely consider it to be constant. > > > > From the LangRef: > > > > "Complex constants are a (potentially recursive) combination of > simple > > constants and smaller complex constants." > > > > There's nothing there saying it doesn't need to be modeled in IR. > > > > > > > "vscale" is not trying to represent the result of such > speculation. It's purely a constant runtime vector length multiplier. Such > a value is required by LoopVectorize to update induction variables as > describe below plus simple interactions like extracting the last element of > a scalable vector. > > > > Right, I'm beginning to see what you mean... > > > > The vectorizer needs that to be a constant at compile time to make > > safety assurances. > > > > For instance: for (1..N) { a[i+3] = a[i] + i; } > > > > Has a max VF of 3. If the vectorizer is to act on that loop, it'll > > have to change "vscale" to 3. If there are no loop dependencies, then > > you leave as "vscale" but vectorizes anyway. > > > > Other assurances are done for run time constants, for instance, tail > > loops when changing > > > > for (i=0; i<N; i++) -> for (i=0; i<N; i+=VF) > > > > That VF is now a run-time "constant", and the vectorizer needs to see > > it as much, otherwise it can't even test for validity. > > > > So, the vectorizer will need to be taught two things: > > > > 1. "vscale" is a run time constant, and for the purpose of validity, > > can be shrunk to any value down to two. If the value is shrunk, the > > new compile time constant replaces vscale. > > > > 2. 
The cost model will *have* to treat "vscale" as an actual compile > > time constant. This could come from a target feature, overriden by a > > command line flag but there has to be a default, which I'd assume is > > 4, given that it's the lowest length. > > > > > > > > > %index.next = add nuw nsw i64 %index, mul (i64 vscale, i64 4) > > > > > > for a VF of "n*4" (remembering that vscale is the "n" in "<n x 4 x > Ty>") > > > > I see what you mean. > > > > Quick question: Since you're saying "vscale" is an unknown constant, > > why not just: > > > > %index.next = add nuw nsw i64 %index, i64 vscale > > > > All scalable operations will be tied up by the predication vector > > anyway, and you already know what the vector type size is anyway. > > > > The only worry is about providing redundant information that could go > > stale and introduce bugs. > > > > I'm assuming the vectorizer will *have* to learn about the compulsory > > predication and build those vectors, or the back-end will have to > > handle them, and it can get ugly. > > > > > > >> %const_vec = <n x 4 x i32> @llvm.sve.constant_vector(i32 %start, > i32 %step) > > > > > > This intrinsic matches the seriesvector instruction we original > proposed. However, on reflection we didn't like how it allowed multiple > representations for the same constant. > > > > Can you expand how this allows multiple representations for the same > constant? > > > > This is a series, with a start and a step, and will only be identical > > to another which has the same start and step. > > > > Just like C constants can "appear" different... > > > > const int foo = 4; > > const int bar = foo; > > const int baz = 2 + 2; > > > > > > > I know this doesn't preclude the use of an intrinsic, I just > wanted to highlight that doing so doesn't automatically change the > surrounding IR. > > > > I don't mind IR changes, I'm just trying to understand the need for > it. 
> > > > Normally, what we did in the past for some things was to add > > intrinsics and then, if it's clear a native IR construct would be > > better, we change it. > > > > At least the intrinsic can be easily added without breaking > > compatibility with anything, and since we're in prototyping phase > > anyway, changing the IR would be the worst idea. > > > > cheers, > > --renato > > > > > > _______________________________________________ > > LLVM Developers mailing list > > llvm-dev at lists.llvm.org > > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev > _______________________________________________ > LLVM Developers mailing list > llvm-dev at lists.llvm.org > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20161128/6201cb4b/attachment.html>
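
The stride arithmetic in the message above can be checked with a small simulation. This is a sketch only: `simulate_flush` is a made-up helper that marks which cache lines a flush loop would touch, it issues no real cache maintenance instructions, and the sizes are the ones quoted in the message.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_LINES 64   /* arbitrary bound for this model */

/* Simulate flushing the byte range [0, len) by issuing one flush every
 * `stride` bytes on a core whose real cache line is `line` bytes.
 * Returns the number of distinct lines flushed; *ops receives the
 * number of flush operations issued. */
size_t simulate_flush(size_t len, size_t stride, size_t line, size_t *ops)
{
    unsigned char flushed[MAX_LINES];
    assert(stride > 0 && line > 0);
    size_t total_lines = (len + line - 1) / line;
    assert(total_lines <= MAX_LINES);

    memset(flushed, 0, sizeof flushed);
    *ops = 0;
    for (size_t addr = 0; addr < len; addr += stride) {
        flushed[addr / line] = 1;   /* a flush hits the whole line */
        ++*ops;
    }

    size_t count = 0;
    for (size_t i = 0; i < total_lines; ++i)
        count += flushed[i];
    return count;
}
```

For a 256-byte range, a 64-byte stride on 32-byte lines reaches only 4 of the 8 lines, which is the "half the cache blocks don't get flushed" case. A 32-byte stride on 128-byte lines reaches both lines at the cost of 8 flush operations instead of 2.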
James Y Knight via llvm-dev
2016-Nov-28 16:42 UTC
[llvm-dev] [RFC] Supporting ARM's SVE in LLVM
(This is somewhat of a digression from the topic of SVE, but...)

On Mon, Nov 28, 2016 at 8:09 AM, Bruce Hoult via llvm-dev
<llvm-dev at lists.llvm.org> wrote:
> "If the above holds true then the length would be only variable
> between different hardware implementations.."
>
> This seems related to a problem that has independently hit several
> different projects around the world for a while now, and only recently
> have people understood what the problem is.
>
> These projects are all doing things that require cache invalidation,
> for example JIT compilation. They have hit a problem that the size of
> the cache block is changing unexpectedly underneath the program when
> the program is migrated from a big processor to a LITTLE. The program
> might start on a CPU with a 64 byte cache block and then suddenly find
> itself on a CPU with a 32 byte cache block, but it's still doing cache
> flushes with a 64 byte stride. So half the cache blocks don't get
> flushed.
>
> As far as I'm aware, there is no defined time at which this happens.
> Maybe it could be between one instruction and the next! We don't even
> know a good way to enumerate all cache block sizes present in the
> system at runtime (and always use the smallest one as the stride). So
> we're for the moment hard-coding a value which we hope will always be
> small enough, and taking the (minor) hit from trying to flush the same
> cache block multiple times. A 32 byte stride, say, on a machine with
> 128 byte cache blocks is still a lot better than using a stride of 1
> or 4 bytes.
>
> If there is a defined time when these changes can happen e.g. at a
> system call then we'd really love to know about it!
>
> Not having seen any actual designs for SVE It seems possible to me
> that the vector width could also change on migration between core
> types. So perhaps the answer is the same.

The cache-line-size issue I believe you're referring to was a hardware
erratum on a particular Samsung-designed core, not the way it is
intended to work. The reported cache-line size is intended to be the
smallest possible value across the system, but that particular CPU
(Exynos 8890) was erroneously reporting 128 for code running on the
"big" Exynos-M1 core, and 64 for code running on the "little" A53
core.

The ARM docs for the Cortex-A15 (
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0438d/BABHAEIF.html)
mention *exactly* this issue, and note that if you're mixing an A15
with a small core (such as an A7), the designer must set the IMINLN
signal on the A15 to 0, to indicate that the A15 should also report a
32-byte cache line instead of its native 64-byte cache line.

Nothing about that issue is mentioned in the docs for ARMv8 cores,
because, at least so far, all the ARM-designed 64-bit CPUs have
64-byte cache lines. Obviously the same care ought to be taken if you
change that property...but unfortunately it was forgotten in this
case.

In any case, that hardware defect has been worked around in Linux 4.9
(116c81f427ff6c5380850963e3fb8798cc821d2b), and so it will now return
a consistent cache-line size even if the CPU has that error.
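
As a rough illustration of "the smallest possible value across the system", the sketch below decodes the minimum-line fields of an ARM Cache Type Register value and takes the minimum over several cores. It relies on the architectural layout of `CTR_EL0` (`IminLine` in bits [3:0], `DminLine` in bits [19:16], each the log2 of the line size in 4-byte words). The function names are invented here and the register values are passed in as plain integers, so nothing in the block reads real hardware or reproduces the kernel's actual workaround.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Smallest instruction cache line in bytes: 4 bytes << IminLine (bits [3:0]). */
uint32_t imin_line_bytes(uint64_t ctr)
{
    return 4u << (ctr & 0xFu);
}

/* Smallest data cache line in bytes: 4 bytes << DminLine (bits [19:16]). */
uint32_t dmin_line_bytes(uint64_t ctr)
{
    return 4u << ((ctr >> 16) & 0xFu);
}

/* Hypothetical helper: the instruction cache line size that is safe to
 * use as a flush stride on every core, given one raw CTR value per core. */
uint32_t system_min_iline_bytes(const uint64_t *ctr_per_core, size_t cores)
{
    assert(cores > 0);
    uint32_t min = imin_line_bytes(ctr_per_core[0]);
    for (size_t i = 1; i < cores; ++i) {
        uint32_t bytes = imin_line_bytes(ctr_per_core[i]);
        if (bytes < min)
            min = bytes;
    }
    return min;
}
```

Feeding it one value that decodes to 128 bytes and one that decodes to 64 bytes, the mismatch described in the message, yields 64, the stride that remains correct after a migration between the two core types.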