Hi,

We would like to extend the vector operations in llvm a bit. We're hoping to get some feedback on the right way to go, or some starting points. I had previously had some discussion on this list about a subset of the changes we have in mind.

All of these changes are intended to make target-independent IR (i.e. IR without machine-specific intrinsics) generate better code or be easier to generate from a frontend with vector support (whether from manual or autovectorization).

If you have any insight into how best to get started with any of these changes, and whether they are feasible and sensible, please let me know. We're mostly interested in x86 as a target in the short term, but obviously want these to apply to other LLVM targets as well. We're prepared to implement these changes, but would like to hear any suggestions and objections you might have.

Below are the specific additions we have in mind.

==

1) Vector shl, lshr, ashr

I think these are no-brainers. We would like to extend the semantics of the shifting instructions to apply naturally to vectors as well. One issue is that these operations often only support a single shift amount for an entire vector. I assume it should be fairly straightforward to select on this pattern, and scalarize the general case as necessary.

2) Vector trunc, sext, zext, fptrunc and fpext

Again, I think these are hopefully straightforward. Please let me know if you expect any issues with vector operations that change element sizes from the RHS to the LHS, e.g. around legalization.

3) Vector intrinsics for floor, ceil, round, frac/modf

These are operations that are not trivially specified in terms of simpler operations. It would be nice to have these as overloaded, target-independent intrinsics, in the same way as llvm.cos etc. are supported now.

4) Vector select

We consider a vector select extremely important for a number of operations. This would be an extension of select to support an <N x i1> vector mask to select between elements of <N x T> vectors for some basic type T. Vector min, max, sign, etc. can be built on top of this operation.

5) Vector comparisons that return <N x i1>

This is maybe not a must-have, and perhaps more a question of preference. I understand the current vfcmp/vicmp semantics, returning a vector of iK (where K matches the bit width of the operands being compared) with the high bit set or not, are there for pragmatic reasons, and that these instructions exist to aid code that uses machine-specific intrinsics.

For code that does not use machine intrinsics, I believe it would be cleaner, simpler, and potentially more efficient to have a vector compare that returns <N x i1> instead. For example, in conjunction with the above-mentioned vector select, this would allow a max to be expressed simply as a sequence of compare and select.

Vector bit shifts would actually help with the amount of code generated for something like a vectorized max, but they make the patterns for recognizing these a lot longer.

I realize this is probably the most controversial change amongst these. I gather there is some concern about representing "variable width" i1s, but I would contend that that's the case even for i1s which are not vectors.

==

In addition to the above suggestions, I'd also like to hear what others think about handling vector operations whose sizes aren't powers of two, e.g. <3 x float> operations. I gather the status quo is that only POT sizes are expected to work (although we've found some bugs for things like <2 x float> that we're submitting). Ideally, operands like <3 x float> would usually be rounded up to a size the machine supports directly. We can try to do this in the frontend, but it would of course be ideal if these just worked. I'm curious whether anyone else out there has dealt with this already and has some suggestions.
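[Editor's note: to make the compare-and-select max idiom from points (4) and (5) concrete, here is a sketch using GCC/clang vector extensions rather than LLVM IR. The comparison plays the role of the proposed <N x i1>-returning compare, and the bitwise blend stands in for the proposed vector select; the `v4f`/`v4i` typedefs and the `vmax` name are local to this illustration.]

```c
typedef float v4f __attribute__((vector_size(16)));
typedef int   v4i __attribute__((vector_size(16)));

/* Elementwise max expressed as compare-then-select. The comparison
   yields an all-ones or all-zeros mask per element; the blend then
   copies bits from a where the mask is set and from b elsewhere. */
v4f vmax(v4f a, v4f b) {
    v4i mask = (v4i)(a > b);   /* -1 where a > b, 0 elsewhere */
    return (v4f)(((v4i)a & mask) | ((v4i)b & ~mask));
}
```

With the proposed IR, the same computation would be just a vector fcmp followed by a vector select, which a backend could match to a single maxps-style instruction.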
Please let me know what you think,

Stefanus

--
Stefanus Du Toit <stefanus.dutoit at rapidmind.com>
RapidMind Inc.
phone: +1 519 885 5455 x116 -- fax: +1 519 885 1463
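[Editor's note: the general vector-by-vector shift from point (1) can likewise be sketched with GCC/clang vector extensions; a target that only supports a uniform shift amount per vector would scalarize or split this during selection. The `v4i` typedef and `vshl` name are illustrative only.]

```c
typedef int v4i __attribute__((vector_size(16)));

/* Proposed vector shl with an independent shift amount per element.
   Vector extensions already give << the elementwise semantics the
   proposal asks for at the IR level. */
v4i vshl(v4i x, v4i amt) {
    return x << amt;
}
```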
On Mon, Jul 21, 2008 at 1:21 PM, Stefanus Du Toit <stefanus.dutoit at rapidmind.com> wrote:

> 1) Vector shl, lshr, ashr
>
> I think these are no-brainers. We would like to extend the semantics of the shifting instructions to naturally apply to vectors as well. One issue is that these operations often only support a single shift amount for an entire vector. I assume it should be fairly straightforward to select on this pattern, and scalarize the general case as necessary.

I have a rough draft of a patch for this that works reasonably well for simple cases... I don't think I really have the time to finish it properly, but I'll clean it up a bit and send it to you as a starting point.

> 2) Vector trunc, sext, zext, fptrunc and fpext
>
> Again, I think these are hopefully straightforward. Please let me know if you expect any issues with vector operations that change element sizes from the RHS to the LHS, e.g. around legalization.

These are tricky to generate efficient code for because they change the size of the vector and most likely introduce illegal vectors. It'll be a lot of work to get these working well for non-trivial cases.

> 5) Vector comparisons that return <N x i1>
>
> This is maybe not a must-have, and perhaps more a question of preference. I understand the current vfcmp/vicmp semantics, returning a vector of iK where K matches the bitwidth of the operands being compared with the high bit set or not, are there for pragmatic reasons, and that these functions exist to aid with code emitted that uses machine-specific intrinsics.
>
> For code that does not use machine intrinsics, I believe it would be cleaner, simpler, and potentially more efficient to have a vector compare that returns <N x i1> instead. For example, in conjunction with the above-mentioned vector select, this would allow a max to be expressed simply as a sequence of compare and select.
>
> I realize this is probably the most controversial change amongst these. I gather there is some concern about representing "variable width" i1s, but I would contend that that's the case even for i1s which are not vectors.

It makes sense; the difficulty is making CodeGen generate efficient code for these constructs. Legalization has to transform illegal types into legal types, and as far as I know, none of the current transformations varies the type transformation based on what is using the value. I don't know exactly how hard it would be, though... maybe someone who's more familiar with the legalization infrastructure could say more?

-Eli
On Jul 21, 2008, at 1:21 PM, Stefanus Du Toit wrote:

> Hi,
>
> We would like to extend the vector operations in llvm a bit. We're hoping to get some feedback on the right way to go, or some starting points. I had previously had some discussion on this list about a subset of the changes we have in mind.
>
> All of these changes are intended to make target-independent IR (i.e. IR without machine-specific intrinsics) generate better code or be easier to generate from a frontend with vector support (whether from manual or autovectorization).
>
> If you have any insight into how best to get started with any of these changes, and whether they are feasible and sensible, please let me know. We're mostly interested in x86 as a target in the short term, but obviously want these to apply to other LLVM targets as well. We're prepared to implement these changes, but would like to hear any suggestions and objections you might have.
>
> Below are the specific additions we have in mind.
>
> ==
> 1) Vector shl, lshr, ashr
>
> I think these are no-brainers. We would like to extend the semantics of the shifting instructions to naturally apply to vectors as well. One issue is that these operations often only support a single shift amount for an entire vector. I assume it should be fairly straightforward to select on this pattern, and scalarize the general case as necessary.

That seems reasonable.

> 2) Vector trunc, sext, zext, fptrunc and fpext
>
> Again, I think these are hopefully straightforward. Please let me know if you expect any issues with vector operations that change element sizes from the RHS to the LHS, e.g. around legalization.

Is the proposed semantics here that the number of elements stays the same, and the overall vector width changes?

> 3) Vector intrinsics for floor, ceil, round, frac/modf
>
> These are operations that are not trivially specified in terms of simpler operations. It would be nice to have these as overloaded, target-independent intrinsics, in the same way as llvm.cos etc. are supported now.

It seems like these could be handled through intrinsics in the LLVM IR, and could use general improvement in the selection dag.

> 4) Vector select
>
> We consider a vector select extremely important for a number of operations. This would be an extension of select to support an <N x i1> vector mask to select between elements of <N x T> vectors for some basic type T. Vector min, max, sign, etc. can be built on top of this operation.

How is this anything other than AND/ANDN/OR on any integer vector type? I don't see what adding this to the IR gets you for vectors, since "vector of i1" doesn't mean "vector of bits" necessarily.

> 5) Vector comparisons that return <N x i1>
>
> This is maybe not a must-have, and perhaps more a question of preference. I understand the current vfcmp/vicmp semantics, returning a vector of iK where K matches the bitwidth of the operands being compared with the high bit set or not, are there for pragmatic reasons, and that these functions exist to aid with code emitted that uses machine-specific intrinsics.

I totally disagree with this approach; a vector of i1 doesn't actually match what you want to do with the hardware, unless you had, say, 128 x i1 for SSE, and it's strange when you have to spill and reload it. The current VICMP and VFCMP instructions do not exist for use with machine intrinsics; they exist to allow code written using C-style comparison operators to generate efficient code on a wide range of both scalar and vector hardware.

> For code that does not use machine intrinsics, I believe it would be cleaner, simpler, and potentially more efficient to have a vector compare that returns <N x i1> instead. For example, in conjunction with the above-mentioned vector select, this would allow a max to be expressed simply as a sequence of compare and select.

Having gone down this path, I'd have to disagree with you.

> In addition to the above suggestions, I'd also like to hear what others think about handling vector operations that aren't powers of two in size, e.g. <3 x float> operations. I gather the status quo is that only POT sizes are expected to work (although we've found some bugs for things like <2 x float> that we're submitting). Ideally things like <3 x float> operands would usually be rounded up to the size supported by the machine directly. We can try to do this in the frontend, but it would of course be ideal if these just worked. I'm curious if anyone else out there has dealt with this already and has some suggestions.

Handling NPOT vectors in the code generator ideally would be great; I know some people are working on widening the operations to a wider legal vector type, and scalarizing is always a possibility as well. The main problem here is what to do with address calculations, and alignment.

Nate
On Jul 21, 2008, at 4:33 PM, Eli Friedman wrote:

>> 2) Vector trunc, sext, zext, fptrunc and fpext
>>
>> Again, I think these are hopefully straightforward. Please let me know if you expect any issues with vector operations that change element sizes from the RHS to the LHS, e.g. around legalization.
>
> These are tricky to generate efficient code for because they change the size of the vector and most likely introduce illegal vectors. It'll be a lot of work to get these working well for non-trivial cases.

Consider something like <4 x i32> -> <4 x double>. On a target with 128-bit vectors, this could be custom expanded/split into a convert-high and convert-low operation. This is real work to support, but I don't think this is too crazy.

-Chris
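[Editor's note: the hi/lo split Chris describes can be mimicked in plain C; the type names and the function name below are hypothetical, and a real lowering would use target shuffle and conversion instructions rather than scalar extracts.]

```c
typedef int    v4i32 __attribute__((vector_size(16)));
typedef double v2f64 __attribute__((vector_size(16)));

/* <4 x i32> -> <4 x double> on a 128-bit target: the result no longer
   fits in one register, so split the source and convert each half. */
void cvt_i32x4_to_f64x4(v4i32 src, v2f64 *lo, v2f64 *hi) {
    *lo = (v2f64){ (double)src[0], (double)src[1] };
    *hi = (v2f64){ (double)src[2], (double)src[3] };
}
```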
On Jul 21, 2008, at 4:46 PM, Nate Begeman wrote:

> I totally disagree with this approach; a vector of i1 doesn't actually match what you want to do with the hardware, unless you had, say, 128 x i1 for SSE, and it's strange when you have to spill and reload it. The current VICMP and VFCMP instructions do not exist for use with machine intrinsics; they exist to allow code written using C-style comparison operators to generate efficient code on a wide range of both scalar and vector hardware.

One interesting point here is that clang follows the same semantics that LLVM IR does for comparisons. This means you can use "V1 < V2" and the result is defined to be a vector of integers of the same width with only the sign bits defined.

-Chris
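[Editor's note: for reference, GCC and clang vector extensions make each lane of a vector comparison all-ones or all-zeros, so the sign bit Chris mentions is always set for true lanes. A small sketch, with illustrative typedefs:]

```c
typedef float v4f __attribute__((vector_size(16)));
typedef int   v4i __attribute__((vector_size(16)));

/* "a < b" on vectors yields an integer vector of the same width,
   with each element -1 (true) or 0 (false). */
v4i vless(v4f a, v4f b) {
    return (v4i)(a < b);
}
```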
On 21-Jul-08, at 7:33 PM, Eli Friedman wrote:

> On Mon, Jul 21, 2008 at 1:21 PM, Stefanus Du Toit <stefanus.dutoit at rapidmind.com> wrote:
>> 1) Vector shl, lshr, ashr
>
> I have a rough draft of a patch for this that works reasonably well for simple cases... I don't think I really have the time to finish it properly, but I'll clean it up a bit and send it to you as a starting point.

That's great. Please do send it over, it'd be great to have a starting point like this.

>> 2) Vector trunc, sext, zext, fptrunc and fpext
>
> These are tricky to generate efficient code for because they change the size of the vector and most likely introduce illegal vectors. It'll be a lot of work to get these working well for non-trivial cases.

Right, though it seems they can generally be split up as necessary (as Chris pointed out as well). I just don't know how well this fits the existing infrastructure. More pointers would be welcome here!

>> 5) Vector comparisons that return <N x i1>
>
> It makes sense; the difficulty is making CodeGen generate efficient code for these constructs. Legalization has to transform illegal types into legal types, and as far as I know, none of the current transformations varies the type transformation based on what is using the value. I don't know exactly how hard it would be, though... maybe someone who's more familiar with the legalization infrastructure could say more?

Right, the type that needs to be used at the birth point of the value will generally be determined by the operation being performed to define it. At uses, that representation will have to be converted. There are some operations which may not provide a unique representation at a definition. Phi statements are the most obvious ones: if a phi statement joins two boolean vectors of different representations, the optimal choice of the result may depend on where that result is used. Conversion code that may need to be inserted can also probably benefit from optimizations like CSE.

--
Stefanus Du Toit <stefanus.dutoit at rapidmind.com>
RapidMind Inc.
phone: +1 519 885 5455 x116 -- fax: +1 519 885 1463
Hi Nate,

On 21-Jul-08, at 7:46 PM, Nate Begeman wrote:

> On Jul 21, 2008, at 1:21 PM, Stefanus Du Toit wrote:
>> 1) Vector shl, lshr, ashr
>
> That seems reasonable.

Thanks.

>> 2) Vector trunc, sext, zext, fptrunc and fpext
>>
>> Again, I think these are hopefully straightforward. Please let me know if you expect any issues with vector operations that change element sizes from the RHS to the LHS, e.g. around legalization.
>
> Is the proposed semantics here that the number of elements stays the same, and the overall vector width changes?

Yes.

>> 3) Vector intrinsics for floor, ceil, round, frac/modf
>>
>> These are operations that are not trivially specified in terms of simpler operations. It would be nice to have these as overloaded, target-independent intrinsics, in the same way as llvm.cos etc. are supported now.
>
> It seems like these could be handled through intrinsics in the LLVM IR, and could use general improvement in the selection dag.

Right, that's what we were thinking too. Glad to hear that makes sense!

>> 4) Vector select
>>
>> We consider a vector select extremely important for a number of operations. This would be an extension of select to support an <N x i1> vector mask to select between elements of <N x T> vectors for some basic type T. Vector min, max, sign, etc. can be built on top of this operation.
>
> How is this anything other than AND/ANDN/OR on any integer vector type? I don't see what adding this to the IR gets you for vectors, since "vector of i1" doesn't mean "vector of bits" necessarily.

Note that I don't mean a "select bits" operation, but rather a "select components" operation; in other words, a straightforward vectorization of the existing "select" IR operation. You can implement the existing LLVM select instruction with bitwise operations, but it's still convenient to have. For one, as I mentioned, it provides an obvious idiomatic way to express operations like min and max.

Vector selection is a common operation amongst languages that support vectors directly, and vector code will often avoid branches by performing some form of predication instead. I'm really not that concerned about how it's expressed, but there needs to be a well-understood way to lower something that looks like "vector ?:", vector max, etc. in a frontend to something that will actually generate good code in the backend. If the idiom for vector float max is "vfcmp, ashr, bitcast, and, xor, and, or, bitcast" and that generates a single maxps from the x86 backend, great. If the idiom is "vfcmp, select", even better.

>> 5) Vector comparisons that return <N x i1>
>>
>> This is maybe not a must-have, and perhaps more a question of preference. I understand the current vfcmp/vicmp semantics, returning a vector of iK where K matches the bitwidth of the operands being compared with the high bit set or not, are there for pragmatic reasons, and that these functions exist to aid with code emitted that uses machine-specific intrinsics.
>
> I totally disagree with this approach; a vector of i1 doesn't actually match what you want to do with the hardware, unless you had, say, 128 x i1 for SSE, and it's strange when you have to spill and reload it.

I definitely am not thinking of <128 x i1>. The intent is really just to express "here is a vector of values where only one bit matters in each element" and have codegen map that to a representation appropriate to the machine being targeted. Of course these eventually need to be widened to an appropriately sized integer vector, and that vector may be a mask or a 0/1 value, or whatever.

The responsibility for doing so can be placed at pretty much any level of the stack, all the way up to making the user worry about it. For us, making the user worry about it isn't an option; we have first-class bools in our frontend, support vectors of these, and naturally define comparisons, selection, etc. to be consistent with this. So that leaves it to either the part generating LLVM IR, or the LLVM backend/mid-end. I'm of the tendency that this job is best suited to an SSA representation, and best done with a specific machine in mind. If this is completely infeasible, we'll have to worry about it during generation of LLVM IR, which is fine. I'd just rather see it done in LLVM where it can hopefully benefit others with the same issues.

To me this isn't much different whether we're talking about vectors or scalars; it's about the utility of an "i1" type generally, and about whose responsibility it is to map such a type to hardware.

> The current VICMP and VFCMP instructions do not exist for use with machine intrinsics; they exist to allow code written using C-style comparison operators to generate efficient code on a wide range of both scalar and vector hardware.

OK. My understanding of this was based on an email from Chris to llvm-dev. I had asked how these were used today, especially given the lack of vector shifts. Here's his response:

>> They can be used with target-specific intrinsics. For example, SSE provides a broad range of intrinsics to support instructions that LLVM IR can't express well. See llvm/include/llvm/IntrinsicsX86.td for more details.

If you have examples of how these are expected to be used today without machine intrinsics, that would really help, e.g. for expressing something like a vector max.

>> For code that does not use machine intrinsics, I believe it would be cleaner, simpler, and potentially more efficient to have a vector compare that returns <N x i1> instead. For example, in conjunction with the above-mentioned vector select, this would allow a max to be expressed simply as a sequence of compare and select.
>
> Having gone down this path, I'd have to disagree with you.

OK. Hopefully my comments above will make my reasoning clearer; if not, please let me know. You have a lot more experience with this in LLVM than me, so pardon my ignorance :). I'm not suggesting that these semantics are absolutely the way to go, but we do need some way to address these issues.

>> In addition to the above suggestions, I'd also like to hear what others think about handling vector operations that aren't powers of two in size, e.g. <3 x float> operations. I gather the status quo is that only POT sizes are expected to work (although we've found some bugs for things like <2 x float> that we're submitting). Ideally things like <3 x float> operands would usually be rounded up to the size supported by the machine directly. We can try to do this in the frontend, but it would of course be ideal if these just worked. I'm curious if anyone else out there has dealt with this already and has some suggestions.
>
> Handling NPOT vectors in the code generator ideally would be great; I know some people are working on widening the operations to a wider legal vector type, and scalarizing is always a possibility as well. The main problem here is what to do with address calculations, and alignment.

Right, we've dealt with this in the past by effectively scalarizing loads and stores (since simply extending a load might go beyond a page boundary, etc.), but extending all register operations. I think this addresses the address calculation and alignment issues?

Thanks for the feedback, I hope to hear more.

Stefanus

--
Stefanus Du Toit <stefanus.dutoit at rapidmind.com>
RapidMind Inc.
phone: +1 519 885 5455 x116 -- fax: +1 519 885 1463
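[Editor's note: the scheme Stefanus describes for NPOT vectors, scalar memory accesses combined with widened register operations, looks roughly like this in C. The typedef and function name are just for the sketch.]

```c
typedef float v4f __attribute__((vector_size(16)));

/* A <3 x float> add handled as described: scalar loads (a wide load
   of the fourth lane could cross a page boundary), a widened 4-wide
   register operation, then scalar stores. */
void add3(const float *a, const float *b, float *out) {
    v4f va = { a[0], a[1], a[2], 0.0f };
    v4f vb = { b[0], b[1], b[2], 0.0f };
    v4f r  = va + vb;   /* register op is full width; lane 3 is dead */
    out[0] = r[0]; out[1] = r[1]; out[2] = r[2];
}
```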
On Monday 21 July 2008 15:21, Stefanus Du Toit wrote:

> We would like to extend the vector operations in llvm a bit. We're hoping to get some feedback on the right way to go, or some starting points. I had previously had some discussion on this list about a subset of the changes we have in mind.

Woohoo! We've been interested in talking about this for some time.

> All of these changes are intended to make target-independent IR (i.e. IR without machine-specific intrinsics) generate better code or be easier to generate from a frontend with vector support (whether from manual or autovectorization).

Very excellent.

> ==
> 1) Vector shl, lshr, ashr
>
> I think these are no-brainers. We would like to extend the semantics of the shifting instructions to naturally apply to vectors as well. One issue is that these operations often only support a single shift amount for an entire vector. I assume it should be fairly

So you're assuming a shift of a vector by a scalar? What about the general vector-by-vector version?

> straightforward to select on this pattern, and scalarize the general case as necessary.

Yep.

> 2) Vector trunc, sext, zext, fptrunc and fpext
>
> Again, I think these are hopefully straightforward. Please let me know if you expect any issues with vector operations that change element sizes from the RHS to the LHS, e.g. around legalization.

Is the assumption that all elements are changed in the same way?

> 4) Vector select
>
> We consider a vector select extremely important for a number of operations. This would be an extension of select to support an <N x i1> vector mask to select between elements of <N x T> vectors for some basic type T. Vector min, max, sign, etc. can be built on top of this operation.

Yes, merge/blend is a very important operation. Also, it would be nice to think about generalizing this to apply masks to all vector operations, particularly loads and stores.

> 5) Vector comparisons that return <N x i1>
>
> This is maybe not a must-have, and perhaps more a question of preference. I understand the current vfcmp/vicmp semantics, returning a vector of iK where K matches the bitwidth of the operands being compared with the high bit set or not, are there for pragmatic reasons, and that these functions exist to aid with code emitted that uses machine-specific intrinsics.

As long as we have some way of representing the result of a vector compare, I'm not too worried. As you say, target-specific code can convert these to masks or whatever form the target architecture has. I don't have any experience trying to do such a conversion, though, so there may be gotchas we're not aware of.

> For code that does not use machine intrinsics, I believe it would be cleaner, simpler, and potentially more efficient to have a vector compare that returns <N x i1> instead. For example, in conjunction with the above-mentioned vector select, this would allow a max to be expressed simply as a sequence of compare and select.

True enough. It also leads naturally to generalized masked vector operations, which are _extremely_ handy.

> Vector bitshifts would actually help with the amount of code generated for something like a vectorized max, but makes the patterns for recognizing these a lot longer.

Right.

> I realize this is probably the most controversial change amongst these. I gather there is some concern about representing "variable width" i1s, but I would contend that that's the case even for i1s which are not vectors.

What do you mean by "variable width?"

> In addition to the above suggestions, I'd also like to hear what others think about handling vector operations that aren't powers of two in size, e.g. <3 x float> operations. I gather the status quo is

It's goodness.

> that only POT sizes are expected to work (although we've found some bugs for things like <2 x float> that we're submitting). Ideally things like <3 x float> operands would usually be rounded up to the size supported by the machine directly. We can try to do this in the

You might need mask support as well, especially if the operation can trap.

> frontend, but it would of course be ideal if these just worked. I'm curious if anyone else out there has dealt with this already and has some suggestions.

It's something we've thought about but not really delved into yet.

> Please let me know what you think,

Let's connect at the dev meeting along with others interested in this stuff and start thinking about how to proceed.

-Dave
On Monday 21 July 2008 18:33, Eli Friedman wrote:

>> 2) Vector trunc, sext, zext, fptrunc and fpext
>>
>> Again, I think these are hopefully straightforward. Please let me know if you expect any issues with vector operations that change element sizes from the RHS to the LHS, e.g. around legalization.
>
> These are tricky to generate efficient code for because they change the size of the vector and most likely introduce illegal vectors. It'll be a lot of work to get these working well for non-trivial cases.

It depends on the architecture. For something like SSE, I can see that this would be a real pain (the vector length changes). For a traditional Cray vector machine (as an example), it's trivial.

>> 5) Vector comparisons that return <N x i1>
>
> It makes sense; the difficulty is making CodeGen generate efficient code for these constructs. Legalization has to transform illegal types into legal types, and as far as I know, none of the current transformations varies the type transformation based on what is using the value. I don't know exactly how hard it would be, though... maybe someone who's more familiar with the legalization infrastructure could say more?

Can you elaborate with a specific example? I'm having a hard time imagining what the transformations need to do. When I think of a vector mask, the length of the mask is the same as the vector length of the operands. So a compare of two vectors with four elements results in a four-bit mask. If the input types change to a different vector length, then the mask type changes as well.

-Dave
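[Editor's note: the dense "four elements, four-bit mask" representation Dave describes can be sketched as follows, packing one bit per lane out of a wide comparison result. The typedef and function name are illustrative.]

```c
typedef int v4i __attribute__((vector_size(16)));

/* Pack a <4 x i1>-style comparison result into a 4-bit scalar mask,
   one bit per element, from the wide -1/0 lanes the compare produces. */
unsigned mask4(v4i a, v4i b) {
    v4i m = a < b;                 /* -1 where a < b, 0 elsewhere */
    unsigned r = 0;
    for (int i = 0; i < 4; i++)
        r |= (unsigned)(m[i] & 1) << i;
    return r;
}
```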
On Monday 21 July 2008 18:46, Nate Begeman wrote:

>> 4) Vector select
>>
>> We consider a vector select extremely important for a number of operations. This would be an extension of select to support an <N x i1> vector mask to select between elements of <N x T> vectors for some basic type T. Vector min, max, sign, etc. can be built on top of this operation.
>
> How is this anything other than AND/ANDN/OR on any integer vector type? I don't see what adding this to the IR gets you for vectors, since "vector of i1" doesn't mean "vector of bits" necessarily.

Yes, you can implement the operation with bit twiddling, but it's less convenient. Since we have a scalar select, it seems only natural to provide a vector version. Some vector hardware has direct support for this operation; AVX is just one example.

>> 5) Vector comparisons that return <N x i1>
>>
>> This is maybe not a must-have, and perhaps more a question of preference. I understand the current vfcmp/vicmp semantics, returning a vector of iK where K matches the bitwidth of the operands being compared with the high bit set or not, are there for pragmatic reasons, and that these functions exist to aid with code emitted that uses machine-specific intrinsics.
>
> I totally disagree with this approach; a vector of i1 doesn't actually match what you want to do with the hardware, unless you had, say, 128 x i1 for SSE, and it's strange when you have to spill and reload it.

I don't follow. "What you want to do with the hardware" depends on what hardware you have. Don't be restricted to thinking SSE-style only. And for spill-reload on SSE, why couldn't you do a bitcast to some SSE-supported type and just dump it to memory?

> The current VICMP and VFCMP instructions do not exist for use with machine intrinsics; they exist to allow code written using C-style comparison operators to generate efficient code on a wide range of both scalar and vector hardware.

Well, either you convert the current model to a bitvector for some hardware, or you convert a bitvector to the current model for other hardware. Perhaps this can all be done by isel without too much trouble; I honestly don't know.

>> For code that does not use machine intrinsics, I believe it would be cleaner, simpler, and potentially more efficient to have a vector compare that returns <N x i1> instead. For example, in conjunction with the above-mentioned vector select, this would allow a max to be expressed simply as a sequence of compare and select.
>
> Having gone down this path, I'd have to disagree with you.

Can you elaborate? What were your experiences? What architecture were you targeting? Real-world experience is valuable.

> Handling NPOT vectors in the code generator ideally would be great; I know some people are working on widening the operations to a wider legal vector type, and scalarizing is always a possibility as well. The main problem here is what to do with address calculations, and alignment.

With a vector select, one could potentially insert "safe" values into the unaffected vector elements on fixed-VL architectures like SSE. This would allow us to do NPOT operations on vector hardware that requires POT vector lengths without scalarizing.

-Dave
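[Editor's note: Dave's "safe values" trick for trapping operations can be sketched like this. A <3 x i32> division is widened to 4 lanes, with a divisor of 1 placed in the dead lane so the widened op cannot fault; the typedef and function name are illustrative.]

```c
typedef int v4i __attribute__((vector_size(16)));

/* NPOT <3 x i32> division on 4-wide hardware: fill the unused lane
   with a safe divisor so the widened divide cannot trap, regardless
   of what garbage sits in lane 3 of b. */
v4i div3(v4i a, v4i b) {
    v4i sa = { a[0], a[1], a[2], 0 };
    v4i sb = { b[0], b[1], b[2], 1 };   /* lane 3: safe value */
    return sa / sb;
}
```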
On 23-Jul-08, at 12:15 PM, David Greene wrote:
> On Monday 21 July 2008 15:21, Stefanus Du Toit wrote:
>> We would like to extend the vector operations in llvm a bit. We're
>> hoping to get some feedback on the right way to go, or some starting
>> points. I had previously had some discussion on this list about a
>> subset of the changes we have in mind.
>
> Woohoo! We've been interested in talking about this for some time.

Glad to hear it!

>> ==
>> 1) Vector shl, lshr, ashr
>>
>> I think these are no-brainers. We would like to extend the semantics
>> of the shifting instructions to naturally apply to vectors as well.
>> One issue is that these operations often only support a single shift
>> amount for an entire vector. I assume it should be fairly
>
> So you're assuming a shift of a vector by a scalar? What about the
> general vector-by-vector version?

No, I am proposing a general vector-by-vector version; I'm just pointing out that many architectures do not support this natively, but that this should be a simple matter of legalization.

>> 2) Vector trunc, sext, zext, fptrunc and fpext
>>
>> Again, I think these are hopefully straightforward. Please let me
>> know if you expect any issues with vector operations that change
>> element sizes from the RHS to the LHS, e.g. around legalization.
>
> Is the assumption that all elements are changed in the same way?

Since these only depend on the types of the source and destination operands, and vectors are homogeneous, yes. The intention is that for any of these conversion ops,

  %A = op <N x S> %B to <N x T>

is semantically equivalent to this pseudo-IR:

  %A = undef <N x T>
  for i = 0 .. N - 1:
    %t1 = extractelement <N x S> %B, i
    %t2 = op S %t1 to T
    %A = insertelement <N x T> %A, %t2, i

>> 4) Vector select
>>
>> We consider a vector select extremely important for a number of
>> operations.
>> This would be an extension of select to support an <N x i1> vector
>> mask to select between elements of <N x T> vectors for some basic
>> type T. Vector min, max, sign, etc. can be built on top of this
>> operation.
>
> Yes, merge/blend is a very important operation. Also, it would be
> nice to think about generalizing this to apply masks to all vector
> operations, particularly loads and stores.

Can you elaborate on what you mean by this? I've built vector IRs in the past that have writemasking effectively built into all IR ops, and I don't know if I would recommend this approach generally. But I'm not sure that's what you mean.

>> I realize this is probably the most controversial change amongst
>> these. I gather there is some concern about representing "variable
>> width" i1s, but I would contend that that's the case even for i1s
>> which are not vectors.
>
> What do you mean by "variable width?"

At some point the i1 vectors may need to be converted to masks. The sizes for these masks will often depend on how they're used. On ISAs like SSE and CellSPU, a comparison generates a mask corresponding to the size of the operands being compared, and a selection requires a mask corresponding to the size of the operands being selected. The "simple" way to deal with this is to insert appropriate conversion code at uses of i1s, and pick a representation for a given i1 based on its SSA birthpoint. It's a little more ambiguous when you start adding phis into the mix, e.g.:

a:
  %a1 = fcmp olt <2 x float> %f1, %f2    ; yields <2 x i1>
  br label %c
b:
  %a2 = fcmp olt <2 x double> %d1, %d2   ; yields <2 x i1>
  br label %c
c:
  %a3 = phi <2 x i1> [%a1, %a], [%a2, %b]
  select <2 x i1> %a3, %s1, %s2          ; where s1, s2 are <2 x i16>
  select <2 x i1> %a3, %c1, %c2          ; where c1, c2 are <2 x i8>

The representation for %a1 may be <2 x i32>, and for %a2 <2 x i64>, but for %a3 it's less clear. I don't think this is a huge problem, but it's something to be aware of.
Note the same issue can affect even scalar i1 values. In fact, Scott Michel posted to llvm-dev recently about dealing with this issue for setcc in the CellSPU backend.

>> that only POT sizes are expected to work (although we've found some
>> bugs for things like <2 x float> that we're submitting). Ideally
>> things like <3 x float> operands would usually be rounded up to the
>> size supported by the machine directly. We can try to do this in the
>
> You might need mask support as well, especially if the operation can
> trap.

Hmm. Yes, divisions by zero etc. are something we should think about.

>> Please let me know what you think,
>
> Let's connect at the dev meeting along with others interested in this
> stuff and start thinking about how to proceed.

Sounds good.

--
Stefanus Du Toit <stefanus.dutoit at rapidmind.com>
RapidMind Inc.
phone: +1 519 885 5455 x116 -- fax: +1 519 885 1463