Hi,

I have some questions as to the definition of various vector instructions. In particular, I believe there are some gaps and inconsistencies in the vector instructions, and I'm interested in hearing whether you agree that these should be improved or whether there are other ways to solve these problems.

==
1. Shufflevector only accepts vectors of the same type

Shufflevector seems overly restrictive right now in terms of its type requirements. In particular, it requires (this could be clearer in the language reference, but is supported by equivalent assertions in the source code) that the types of the sources and destinations match exactly. Obviously it makes sense for the element types to match, but the requirement that the number of elements in each source operand matches the number of elements in the destination appears overly restrictive.

I would propose to change the syntax from:

> <result> = shufflevector <n x <ty>> <v1>, <n x <ty>> <v2>, <n x i32> <mask> ; yields <n x <ty>>

to:

> <result> = shufflevector <a x <ty>> <v1>, <b x <ty>> <v2>, <d x i32> <mask> ; yields <d x <ty>>

with the requirement that the entries in the (still constant) mask are within the range [0, a + b - 1].

This allows things like taking a <2 x i32> and duplicating it into a <4 x i32>. This is very useful for frontends that provide general vector functionality. I think this is more consistent with the relaxed rules on numbers of vector elements.

2. vector select

SIMD-within-a-register ISAs generally provide some form of efficient vector select capability. It would be very convenient to have a select of the form:

> <result> = select <n x i1> <cond>, <n x <ty>> <val1>, <n x <ty>> <val2> ; yields <n x <ty>>

The semantics would be to set each element in the result to the corresponding element in val1 or val2, depending on whether the corresponding element in cond is set or not. This would make the vector types more consistent with the regular types. It's possible to do this with masking, but that requires building masks of the appropriate size in the IR each time one wants to do this kind of selection, which is cumbersome and will have to be matched by codegen to be turned into something like the Cell SPU's "selb" instruction.

3. vector trunc, sext, zext, fptrunc, fpext

These seem like obvious gaps in the IR. Conversions between integer and floating-point vectors are supported, but conversions from one integer vector type to another (and likewise for float) are not. Attempting to do this by hand is extremely cumbersome, and may hurt both codegen and optimization opportunities.

4. vector shl, lshr, ashr

Again, these seem to be simply missing (compared to the other bitwise ops). Not having vector versions of shift right, and not having vector trunc, makes things like converting bitmasks of various sizes very difficult. Modern SIMD architectures have good support for bitshifts of vector types. There are of course two ways to interpret a shift on a vector type: within elements and across elements. To us, shifts within elements (i.e. the equivalent of N scalar shifts) are more important than shifts (or rotates!) across elements, but the latter could be useful in some situations as well.
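For concreteness, here is a rough sketch of what items 1 through 4 might look like in IR. None of these forms is accepted by LLVM today; the spellings just follow the proposals above, and the value names are invented for illustration:

  ; 1. generalized shufflevector: duplicate a <2 x i32> into a <4 x i32>
  ;    (a = 2, b = 2, d = 4; all mask entries within [0, a + b - 1])
  %wide = shufflevector <2 x i32> %v, <2 x i32> undef, <4 x i32> <i32 0, i32 1, i32 0, i32 1>  ; yields <4 x i32>

  ; 2. vector select driven by a vector of i1
  %sel = select <4 x i1> %cond, <4 x i32> %a, <4 x i32> %b   ; yields <4 x i32>

  ; 3. element-wise conversion between integer vector types of the same length
  %ext = zext <4 x i16> %s to <4 x i32>                      ; yields <4 x i32>

  ; 4. within-element shift (the equivalent of four scalar shifts)
  %shr = lshr <4 x i32> %x, <i32 1, i32 1, i32 1, i32 1>     ; yields <4 x i32>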
5. vfcmp, vicmp return types

This topic came up recently on llvm-dev (thread "VFCmp failing when unordered or UnsafeFPMath on x86"). Having vector compares return a vector of integer elements with bit width equal to the element types being compared seems unnecessarily inconsistent with the icmp and fcmp instructions. Since only one bit is well-defined anyway, why not just return a vector of i1 elements? If after codegen this is expanded into some other width, that's fine -- the backend can do this, since only one bit of information is exposed at the IR level anyway.

Having these instructions return vectors of i1 would also be nicely consistent with a vector select operation. Vector bit shifts and trunc would make this easier, but it seems to me that the entire IR would be far more consistent if these operations simply returned i1 vectors.
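To make item 5 concrete, the contrast is roughly the following. The first line reflects my reading of the current vicmp definition; the last two lines use the proposed i1 return type together with the vector select from item 2 and are not valid IR today (value names invented):

  ; current: the result elements are as wide as the operand elements,
  ; with only one bit per element well-defined
  %mask = vicmp eq <4 x i32> %a, %b                      ; yields <4 x i32>

  ; proposed: return a vector of i1 and feed it straight into a vector select
  %m = vicmp eq <4 x i32> %a, %b                         ; would yield <4 x i1>
  %r = select <4 x i1> %m, <4 x i32> %x, <4 x i32> %y    ; would yield <4 x i32>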
==

I am interested in hearing feedback on whether these ideas make sense, or if there's some good reason for things being the way they are now that I'm not seeing. I would also be interested, especially in the short term, in suggestions as to how to work around the lack of these operations.

Thanks!

Stefanus

--
Stefanus Du Toit <stefanus.dutoit at rapidmind.com>
RapidMind Inc.
phone: +1 519 885 5455 x116 -- fax: +1 519 885 1463


On Jun 26, 2008, at 1:56 PM, Stefanus Du Toit wrote:

> Hi,
>
> I have some questions as to the definition of various vector instructions. In particular, I believe there are some gaps and inconsistencies in the vector instructions, and I'm interested in hearing whether you agree that these should be improved or whether there are other ways to solve these problems.
>
> ==
> 1. Shufflevector only accepts vectors of the same type
>
> Shufflevector seems overly restrictive right now in terms of its type requirements. In particular, it requires (this could be clearer in the language reference, but is supported by equivalent assertions in the source code) that the types of the sources and destinations match exactly. Obviously it makes sense for the element types to match, but the requirement that the number of elements in each source operand matches the number of elements in the destination appears overly restrictive.
>
> I would propose to change the syntax from:
>
>> <result> = shufflevector <n x <ty>> <v1>, <n x <ty>> <v2>, <n x i32> <mask> ; yields <n x <ty>>
>
> to:
>
>> <result> = shufflevector <a x <ty>> <v1>, <b x <ty>> <v2>, <d x i32> <mask> ; yields <d x <ty>>
>
> with the requirement that the entries in the (still constant) mask are within the range [0, a + b - 1].
>
> This allows things like taking a <2 x i32> and duplicating it into a <4 x i32>.
>
> This is very useful for frontends that provide general vector functionality. I think this is more consistent with the relaxed rules on numbers of vector elements.

The alternative is to have the frontend synthesize the needed operations with extracts, inserts, and possibly shuffles if needed. LLVM is actually fairly well prepared to optimize code like this. I recommend giving this a try, and reporting any problems you encounter.

> 2. vector select
> [...]
> 3. vector trunc, sext, zext, fptrunc, fpext
> [...]
> 4. vector shl, lshr, ashr
> [...]

We agree that these would be useful. There are intentions to add them to LLVM; others can say more.

> 5. vfcmp, vicmp return types
>
> This topic came up recently on llvm-dev (thread "VFCmp failing when unordered or UnsafeFPMath on x86"). Having vector compares return a vector of integer elements with bit width equal to the element types being compared seems unnecessarily inconsistent with the icmp and fcmp instructions. Since only one bit is well-defined anyway, why not just return a vector of i1 elements? If after codegen this is expanded into some other width, that's fine -- the backend can do this, since only one bit of information is exposed at the IR level anyway.
>
> Having these instructions return vectors of i1 would also be nicely consistent with a vector select operation. Vector bit shifts and trunc would make this easier, but it seems to me that the entire IR would be far more consistent if these operations simply returned i1 vectors.

It turns out that having them return vectors of i1 would be somewhat complicated. For example, a <4 x i1> on an SSE2 target could expand either to one <4 x i32> or to two <2 x i64>s, and the optimal thing would be to make the decision based on the comparison that produced them, but LLVM isn't yet equipped for that.
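To illustrate the coupling Dan describes, here is a hypothetical sketch (it assumes the proposed i1-returning compares, which do not exist today; value names invented):

  ; two producers of a <4 x i1> value on an SSE2 target:
  %m1 = vfcmp olt <4 x float> %a, %b     ; would yield <4 x i1>; a cmpps leaves one
                                         ; <4 x i32>-shaped mask in a single register
  %m2 = vfcmp olt <4 x double> %c, %d    ; would yield <4 x i1>; two cmppd ops leave
                                         ; two <2 x i64>-shaped masks in two registers
  ; a later consumer sees only a <4 x i1>, so picking the cheap in-register form
  ; requires knowing which comparison produced it.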
vicmp and vfcmp are very much aimed at solving practical problems on popular architectures without needing significant new infrastructure. They're relatively new, and as you say, they'll be more useful when combined with vector shifts and friends, and once we start teaching LLVM to recognize popular idioms with them.

One interesting point here is that if we ever do want to have vector comparisons that return vectors of i1, one way to do that would be to overload plain icmp and fcmp, which would be neatly consistent with the way plain mul and add are overloaded for vector types instead of having separate vadd and vmul instructions.

Dan
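As a concrete illustration of the extract/insert approach recommended above, here is one way to express the <2 x i32>-to-<4 x i32> duplication from item 1 in IR that is accepted today, leaving it to the optimizers and codegen to recognize it as a shuffle (function and value names invented):

define <4 x i32> @duplicate2to4(<2 x i32> %v) {
  ; pull out the two source elements
  %e0 = extractelement <2 x i32> %v, i32 0
  %e1 = extractelement <2 x i32> %v, i32 1
  ; build the widened vector, repeating the pair
  %t0 = insertelement <4 x i32> undef, i32 %e0, i32 0
  %t1 = insertelement <4 x i32> %t0, i32 %e1, i32 1
  %t2 = insertelement <4 x i32> %t1, i32 %e0, i32 2
  %t3 = insertelement <4 x i32> %t2, i32 %e1, i32 3
  ret <4 x i32> %t3
}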
Hi Dan,

Thanks for your comments. I've responded inline below.

On 26-Jun-08, at 6:49 PM, Dan Gohman wrote:

> On Jun 26, 2008, at 1:56 PM, Stefanus Du Toit wrote:
>
>> ==
>> 1. Shufflevector only accepts vectors of the same type
>>
>> I would propose to change the syntax from:
>>
>>> <result> = shufflevector <n x <ty>> <v1>, <n x <ty>> <v2>, <n x i32> <mask> ; yields <n x <ty>>
>>
>> to:
>>
>>> <result> = shufflevector <a x <ty>> <v1>, <b x <ty>> <v2>, <d x i32> <mask> ; yields <d x <ty>>
>>
>> with the requirement that the entries in the (still constant) mask are within the range [0, a + b - 1].
>
> The alternative is to have the frontend synthesize the needed operations with extracts, inserts, and possibly shuffles if needed. LLVM is actually fairly well prepared to optimize code like this. I recommend giving this a try, and reporting any problems you encounter.

That certainly appears to be the only option at the moment, and we'll have a look to see how that works out. However, note that a sufficiently generalized shufflevector would remove the need for insertelement and extractelement to exist at all.

>> 2. vector select
>> 3. vector trunc, sext, zext, fptrunc, fpext
>> 4. vector shl, lshr, ashr
> [...]
>
> We agree that these would be useful. There are intentions to add them to LLVM; others can say more.

OK. I'd love to hear more, especially if someone is planning to do this in the short term.

>> 5. vfcmp, vicmp return types
>>
>> This topic came up recently on llvm-dev (thread "VFCmp failing when unordered or UnsafeFPMath on x86"). Having vector compares return a vector of integer elements with bit width equal to the element types being compared seems unnecessarily inconsistent with the icmp and fcmp instructions. Since only one bit is well-defined anyway, why not just return a vector of i1 elements? If after codegen this is expanded into some other width, that's fine -- the backend can do this, since only one bit of information is exposed at the IR level anyway.
>>
>> Having these instructions return vectors of i1 would also be nicely consistent with a vector select operation. Vector bit shifts and trunc would make this easier, but it seems to me that the entire IR would be far more consistent if these operations simply returned i1 vectors.
>
> It turns out that having them return vectors of i1 would be somewhat complicated. For example, a <4 x i1> on an SSE2 target could expand either to one <4 x i32> or to two <2 x i64>s, and the optimal thing would be to make the decision based on the comparison that produced them, but LLVM isn't yet equipped for that.

Can you expand on this a bit? I'm guessing you're referring to specific mechanics in LLVM's code generation framework making this difficult to do?

From a theoretical point of view (keeping in mind current architectures) there can definitely be a need to change the representation at various points in the program. For example, a <2 x i1> generated from comparing some <2 x float> values could later be used to select between two <2 x double> values, and on most current architectures this would require a mask to be expanded.

There are three cases of instructions at which representations for i1 vectors must be chosen: instructions which generate vectors of i1 (e.g. vfcmp), instructions which consume them (e.g. a vector select), and instructions which do both (e.g. shufflevector, phi statements).

I would expect generating instructions to generally have only one fixed choice (since common ISAs generally create masks of the size of what's being compared), and consumers to generally dictate a representation, possibly necessitating the insertion of conversion code. For instructions that transform vectors of i1 in some way, there may be some additional opportunity to optimize away conversion code. I would also presume that standard optimizations such as a form of CSE or value numbering and code motion would help reduce the number of conversions.
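A hypothetical sketch of the <2 x float> / <2 x double> example above, again assuming the proposed i1-returning compare and vector select (not valid IR today; value names invented), with comments marking where the representation change would happen on most current architectures:

  %c = vfcmp olt <2 x float> %a, %b                          ; would yield <2 x i1>; the float
                                                             ; compare naturally produces a mask
                                                             ; with 32 bits per element
  %r = select <2 x i1> %c, <2 x double> %x, <2 x double> %y  ; the select needs 64 bits per
                                                             ; element, so codegen would have to
                                                             ; widen (sign-extend) the mask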
In fact, this issue of representing i1 vectors isn't even limited to vector types. Scalar code which stores i1 values in regular registers needs the same considerations (e.g. on the Cell SPU, where there is no condition register).

Perhaps all that was way off topic, and you're just referring to a simple mechanical issue preventing this right now. My point is just that the representation of booleans in registers is an important, architecture-specific issue to handle, and the frontend is (in my humble opinion) probably not the best place for this.

> vicmp and vfcmp are very much aimed at solving practical problems on popular architectures without needing significant new infrastructure. They're relatively new, and as you say, they'll be more useful when combined with vector shifts and friends, and once we start teaching LLVM to recognize popular idioms with them.

Can you give me examples of how they are used today at all? I'm having a really hard time figuring out a good use for them (one that doesn't involve effectively scalarizing them immediately) without these other instructions.

> One interesting point here is that if we ever do want to have vector comparisons that return vectors of i1, one way to do that would be to overload plain icmp and fcmp, which would be neatly consistent with the way plain mul and add are overloaded for vector types instead of having separate vadd and vmul instructions.

Agreed, this would be nice.

--
Stefanus Du Toit <stefanus.dutoit at rapidmind.com>
RapidMind Inc.
phone: +1 519 885 5455 x116 -- fax: +1 519 885 1463