On Mon, 4 Feb 2019 at 22:04, Simon Moll <moll at cs.uni-saarland.de> wrote:
> On 2/4/19 9:18 PM, Robin Kruppe wrote:
>> On Mon, 4 Feb 2019 at 18:15, David Greene via llvm-dev
>> <llvm-dev at lists.llvm.org> wrote:
>>> Simon Moll <moll at cs.uni-saarland.de> writes:
>>>
>>> > You are referring to the sub-vector sizes, if i am understanding
>>> > correctly. I'd assume that the mask sub-vector length always has to be
>>> > either 1 or the same as the data sub-vector length. For example, this
>>> > is ok:
>>> >
>>> > %result = call <scalable 3 x float> @llvm.evl.fsub.v4f32(<scalable 3 x
>>> > float> %x, <scalable 3 x float> %y, <scalable 1 x i1> %M, i32 %L)
>>>
>>> What does <scalable 1 x i1> applied to <scalable 3 x float> mean?  I
>>> would expect a requirement of <scalable 3 x i1>.  At least that's how I
>>> understood the SVE proposal [1].  The n's in <scalable n x type> have to
>>> match.
>>
>> I believe the idea is to allow each single mask bit to control multiple
>> consecutive lanes at once, effectively interpreting the vector being
>> operated on as "many short fixed-length vectors, concatenated" rather
>> than a single long vector of scalars. This is a different interpretation
>> of that type than usual, but it's not crazy; e.g., a similar
>> reinterpretation of vector types seems to be the favored approach for
>> adding matrix operations to LLVM IR. It somewhat obscures the point to
>> discuss this only for scalable vectors; there's no conceptual reason why
>> one couldn't do the same with fixed-size vectors.
>>
>> In fact, I would recommend against making almost any new feature or
>> intrinsic exclusive to scalable vectors, including this one: there
>> shouldn't be much extra code required to allow and support it, and not
>> doing so makes the IR less orthogonal. For example, if a <scalable 4 x
>> float> fadd with a <scalable 1 x i1> mask works, then a <4 x float> fadd
>> with a <1 x i1> mask, an <8 x float> fadd with a <2 x i1> mask, etc.,
>> should also be possible overloads of the same intrinsic.
>
> Yep. Doing the same for standard vector IR is on the radar:
> https://reviews.llvm.org/D57504#1380587.
>
>> So far, so good. A bit odd, when I think about it, but if hardware out
>> there has that capability, maybe this is a good way to encode it in IR
>> (other options might work too, though). The crux, however, is the
>> interaction with the dynamic vector length: is it in terms of the mask?
>> the longer data vector? If the latter, what happens if it isn't
>> divisible by the mask length? There are multiple options and it's not
>> clear to me which one is "the right one", both for architectures with
>> native support (hopefully the one brought up here won't be the only
>> one) and for internal consistency of the IR. If there were an
>> established architecture with this kind of feature where people have
>> gathered lots of practical experience with it, we could use that to
>> inform the decision (just as we have for ordinary predication and
>> dynamic vector length). But I'm not aware of any architecture that does
>> this other than the one Jacob and lkcl are working on, and as far as I
>> know their project is still in the early stages.
>
> The current understanding is that the dynamic vector length operates at
> the granularity of the mask: https://reviews.llvm.org/D57504#1381211

I do understand that this is what Jacob proposes based on the architecture
he works on. However, it is not yet clear to me whether that is the most
useful option overall, nor that it is the only option that will lead to
reasonable codegen for their architecture. But let's leave discussion of
the details on Phab. I just want to highlight one issue that is not
specific to Jacob's angle, as it relates to the interpretation of scalable
vectors more generally:

> In unscaled IR types, this means VL masks each scalar result; in scaled
> types, VL masks sub-vectors. E.g., for %L == 1 the following call
> produces a pair of floats as the result:
>
> <scalable 2 x float> evl.fsub(<scalable 2 x float> %x, <scalable 2 x float> %y, <scalable 2 x i1> %M, i32 %L)

As I wrote on Phab mere minutes before you sent this email, I do not think
this is the right interpretation for any architecture I know about (I do
not know anything about the things Jacob and Luke are working on), nor
from the POV of the scalable vector types proposal. A scalable vector is
not conventionally "a variable-length vector of fixed-size vectors"; it is
simply an ordinary "flat" vector whose length happens to be mostly unknown
at compile time. If some intrinsics want to interpret it differently, that
is fine, but that's a property of those specific intrinsics -- similar to
how proposed matrix intrinsics might interpret a 16-element vector as a
4x4 matrix.

> I agree that we should only consider the tied sub-vector case for this
> first version and keep discussing the unconstrained version. It is
> seductively easy to allow this but impossible to take it back.
>
> ---
>
> The story is different when we talk only(!) about memory accesses and
> having different vector sizes in the operands and the transferred type
> (result type for loads, value operand type for stores):
>
> E.g. on AVX, this call could turn into a 64-bit gather operation of
> pairs of floats:
>
> <16 x float> llvm.evl.gather.v16f32(<8 x float*> %Ptr, <8 x i1> mask %M, i32 vlen 8)

Is that IR you'd expect someone to generate (or a backend to consume) for
this operation? It seems like a rather unnatural or "magical" way to
represent the intent (load 64b each from 8 pointers), at least with the
way I'm thinking about it. I'd expect a gather of 8 x i64 and a bitcast.

> And there is a native 16 x 16 element load (VLD2D) on SX-Aurora, which
> may be represented as:
>
> <scalable 256 x double> llvm.evl.gather.nxv16f64(<scalable 16 x double*> %Ptr, <scalable 16 x i1> mask %M, i32 vlen 16)

In contrast to the above, I can't very well say one should write this as a
gather of i1024, but it also seems like a rather specialized instruction
(presumably used for blocked processing of matrices?), so I can't say that
this on its own motivates me to complicate a proposed core IR construct.

Cheers,
Robin
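[To make the grouped-mask interpretation discussed in this message concrete, here is a scalar reference sketch. This is my own illustration, not semantics from the proposal: it assumes each mask bit governs N/M consecutive data lanes, that %L is counted in mask units (one of the options under debate, not a settled choice), and that disabled lanes produce 0.0 (the real masked-out value is a separate open question).]

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical reference semantics for an evl.fsub whose mask is shorter
// than the data vector: each mask bit controls GROUP consecutive lanes,
// and the dynamic vector length L is counted in mask units.
template <std::size_t N, std::size_t M>
std::array<float, N> evl_fsub(const std::array<float, N>& x,
                              const std::array<float, N>& y,
                              const std::array<bool, M>& mask,
                              std::size_t L) {
    static_assert(N % M == 0, "data length must be divisible by mask length");
    constexpr std::size_t GROUP = N / M;
    std::array<float, N> result{};  // disabled lanes left at 0.0f here
    for (std::size_t i = 0; i < N; ++i) {
        std::size_t group = i / GROUP;  // which mask bit governs lane i
        if (group < L && mask[group])
            result[i] = x[i] - y[i];
    }
    return result;
}
```

[So a call mirroring `evl.fsub(<8 x float>, <8 x float>, <2 x i1>, L=1)` would compute only the first group of four lanes, regardless of the second mask bit.]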
On 2/4/19 10:40 PM, Robin Kruppe wrote:
> A scalable vector is not conventionally "a variable-length vector of
> fixed-size vectors"; it is simply an ordinary "flat" vector whose
> length happens to be mostly unknown at compile time.

On NEC SX-Aurora the vector length is always interpreted in 64-bit data
chunks. That is one example of a real architecture where the vscaled
interpretation of VL makes sense.

> In contrast to the above, I can't very well say one should write this
> as a gather of i1024, but it also seems like a rather specialized
> instruction (presumably used for blocked processing of matrices?), so
> I can't say that this on its own motivates me to complicate a proposed
> core IR construct.

It actually reduces complexity by shifting it from the address
computation into the instruction. This would cover all three cases:
VLD2D, <2 x float> gather on AVX and <W x float> loads for this early
RISC-V based architecture that Jacob and lkcl are working on. However,
this is not a top priority and we can leave it out of the first version.

- Simon

--
Simon Moll
Researcher / PhD Student
Compiler Design Lab (Prof. Hack)
Saarland University, Computer Science
Building E1.3, Room 4.31
Tel. +49 (0)681 302-57521 : moll at cs.uni-saarland.de
Fax. +49 (0)681 302-3065 : http://compilers.cs.uni-saarland.de/people/moll
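[The <2 x float>-gather-on-AVX case mentioned above hinges on an equivalence worth spelling out: gathering a pair of adjacent floats from one pointer moves exactly the same bytes as gathering a single i64 and bitcasting. A minimal scalar sketch, purely illustrative -- real AVX gathers take base-plus-index operands, not pointer vectors:]

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Loading a pair of floats from one pointer is the same memory transfer
// as loading a single i64 and bitcasting the bits back to 2 x float.
struct FloatPair { float lo, hi; };

FloatPair gather_pair_as_i64(const float* p) {
    uint64_t bits;
    std::memcpy(&bits, p, sizeof(bits));    // one 64-bit load...
    FloatPair out;
    std::memcpy(&out, &bits, sizeof(out));  // ...bitcast back to 2 x float
    return out;
}
```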
The architecture Luke and I are working on, assuming it goes the way I
think it will, will have instructions like:

    fmadd.s.vvss rd, rs1, rs2, rs3, len=N*VL, pred=rp

where N is from 1 to 4, which has the following pseudo-code:

    constexpr auto f32_per_reg = 2;
    union FReg
    {
        double f64[1];
        float f32[f32_per_reg];
        _Float16 f16[4];
    };
    union Reg
    {
        uint64_t i64[1];
        uint32_t i32[2];
        uint16_t i16[4];
        uint8_t i8[8];
    };
    // registers
    FReg fregs[128];
    Reg regs[128];
    uint64_t vl;
    // instruction fields
    int rd, rs1, rs2, rs3, rp, N;
    // main code
    for(uint64_t i = 0; i < vl * N; i++)
    {
        if(regs[rp].i64[0] & (1ULL << i / N))
        {
            auto rv = i / f32_per_reg;
            auto sv = i % f32_per_reg;
            auto rs = (i % N) / f32_per_reg;
            auto ss = (i % N) / f32_per_reg;
            // *+ is contracted into fma
            fregs[rd + rv].f32[sv] = fregs[rs1 + rv].f32[sv]
                                     * fregs[rs2 + rs].f32[ss]
                                     + fregs[rs3 + rs].f32[ss];
        }
    }

So it would be handy for the vector length on evl intrinsics to be in
units of the mask length so we don't have to pattern match a division in
the backend. We could have 2 variants of the vector length argument, one
in terms of the data vector and one in terms of the mask vector. We could
legalize the mask vector variant for those architectures that need it by
pulling the multiplication out and switching to the data vector variants.

Jacob
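[The two-variant legalization Jacob proposes above reduces to a single scaling step; a minimal sketch with hypothetical helper names, assuming the data length is a multiple of the mask length:]

```cpp
#include <cassert>
#include <cstdint>

// A vlen given in mask units converts to element units by one multiply
// by the group factor (data lanes per mask bit); on a target that counts
// elements this multiply is the only extra code the legalizer emits.
constexpr uint64_t mask_units_to_elements(uint64_t vlen_mask_units,
                                          uint64_t lanes_per_mask_bit) {
    return vlen_mask_units * lanes_per_mask_bit;
}

// The reverse direction, for a target that counts mask units; assumes
// the element count is divisible by the group factor.
constexpr uint64_t elements_to_mask_units(uint64_t vlen_elements,
                                          uint64_t lanes_per_mask_bit) {
    return vlen_elements / lanes_per_mask_bit;
}
```

[Since the group factor N is a compile-time constant (1 to 4 in Jacob's example), the multiply is at worst a shift by an immediate.]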
On Mon, Feb 4, 2019, 14:23 Jacob Lifshay <programmerjake at gmail.com> wrote:
> auto rs = (i % N) / f32_per_reg;
> auto ss = (i % N) / f32_per_reg;

should have been:

    auto ss = (i % N) % f32_per_reg;
On Mon, 4 Feb 2019 at 23:04, Simon Moll <moll at cs.uni-saarland.de> wrote:
> On NEC SX-Aurora the vector length is always interpreted in 64-bit data
> chunks. That is one example of a real architecture where the vscaled
> interpretation of VL makes sense.

Now this is a problem. Let's leave the details of why RISC-V V needs the
other interpretation to Phab, but we definitely have a conflict in what
these two architectures need. How do we reconcile them?

Picking one option and requiring a multiplication/division of the vlen
argument to get the other meaning is a nice canonical IR form, but it
seems a bit problematic for codegen, because that mul/div is a prime
candidate for being CSE'd across blocks (pure calculation, repeated
everywhere) and consequently being difficult to access for pattern
matching in the backend.

On the other hand, it's a less serious problem than was previously
discussed re: vlen vs. predication. The actual change in codegen is just
omitting one instruction, which one can easily do in an SSA-based MIR
pass if necessary (instead of during ISel). Moreover, the cost of a
missed folding opportunity is relatively minor, since it'll most likely
be just a shift by an immediate, and it'll usually be amortized over
basically the entire loop body in a lot of code.

Still, does anyone have a better idea?

Cheers,
Robin
Luke Kenneth Casson Leighton via llvm-dev
2019-Feb-05 00:27 UTC
[llvm-dev] [RFC] Vector Predication
On Mon, Feb 4, 2019 at 9:41 PM Robin Kruppe <robin.kruppe at gmail.com> wrote:
> On Mon, 4 Feb 2019 at 22:04, Simon Moll <moll at cs.uni-saarland.de> wrote:
>> And there is a native 16 x 16 element load (VLD2D) on SX-Aurora, which
>> may be represented as:
>>
>> <scalable 256 x double> llvm.evl.gather.nxv16f64(<scalable 16 x double*> %Ptr, <scalable 16 x i1> mask %M, i32 vlen 16)
>
> In contrast to the above, I can't very well say one should write this as
> a gather of i1024, but it also seems like a rather specialized
> instruction (presumably used for blocked processing of matrices?), so I
> can't say that this on its own motivates me to complicate a proposed
> core IR construct.

i concur: any architecture that has fixed (SIMD-style) processing widths
is basically hell for any compiler to deal with in a generic way.
architecturally-fixed element widths really do not fit into the
arbitrary-width vector processing paradigm, and i foresee efforts to try
resulting in design conflict and ultimately failure.

can i therefore recommend that variable-length ISAs be given top priority
in the design, so that there is the possibility for various
variable-length architectures to do near-direct, highly optimised
translations, and leave SIMD-style (fixed-width) architectures to
continue the long-standing, well-established practice of creating
"special-case / corner-case / cleanup" code?

(with apologies to proponents of SX-Aurora, MMX, SSE and other
fixed-width SIMD-style architectures...)

l.
Luke Kenneth Casson Leighton via llvm-dev
2019-Feb-05 00:54 UTC
[llvm-dev] [RFC] Vector Predication
with apologies for breaking the thread, i wasn't cc'd earlier in the conversation.

http://lists.llvm.org/pipermail/llvm-dev/2019-January/129806.html

david, you wrote:

> I'm solidly of the opinion that we already *have* IR support for
> explicit masking in the form of gather/scatter/etc... Until someone has
> taken the effort to make masking in this context *actually work well*,
> I'm unconvinced that we should greatly expand the usage in the IR.

the problem with gather/scatter is that it requires moving the data (MV or LD/ST).

MV - particularly with quite large data sets - puts pressure on a microarchitecture to increase the size of the register file (otherwise data has to be pushed to the stack).

LD/ST - as shown by Jeff Bush in his work on nyuzi - results in *significant* power consumption increases, due to having to push data through the L1/L2 cache (which is all CAMs).

in SV we deliberately drop the vectorisation onto the *standard* register file *precisely* to avoid the need to exchange data between a special vector register file and a scalar register file.

additionally, the microarchitecture being designed actually happens to effectively use gather/scatter techniques internally when a predicate mask is applied: element operations are pushed into a multi-issue instruction queue, and non-predicated elements are simply skipped. [thus we get 100% ALU utilisation even when there are back-to-back "if then else" inverted predicate masks: the non-inverted predicate issues one set of elements, and the inverted predicate matches up perfectly with that.]

basically i feel that this is the right paradigm. now, if a given ISA doesn't *have* predicate masks, then yes, absolutely, gather/scatter at the *instruction* level (as opposed to the micro-architectural level) is the correct way to *emulate* predication.
instructions may be issued that exclude the non-predicated elements, put them into a group (even a SIMD fixed-width group), and re-extract them on the other side of the group-operation into the required destination registers.

even the previously-mentioned SX-Aurora architecture (and other SIMD architectures) could use this trick to effectively "emulate" predication where the ISA doesn't have predicate masks. it can also be used to emulate variable-length vectors, by simply setting the top elements of a SIMD block to zero (or ignoring them entirely) and only copying out the lower-indexed elements with a scatter operation.

whilst that is not particularly efficient, that's not LLVM's problem: SIMD architectures were designed the way they are because it's seductively simpler at the hardware level.

however, to expect an architecture that *does* support proper predication to have to complexify the way it does predication, by shoe-horning it into gather/scatter... that's sub-optimal, and i'm drawing a mental blank as to how it could be done, let alone done effectively and efficiently.

that's not to say that gather/scatter should be removed entirely: that would be a mistake. there are circumstances where gather/scatter is far better suited than predicate masks.

bottom line: i feel that expecting predication to be implemented in terms of gather/scatter is the wrong way round. the IR should have explicit and proper support for predicate masks, and architectures that don't *have* predicate masks should *use* gather/scatter instructions to emulate it.

l.
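the trick described above - gather the active elements into a dense group, run the plain unpredicated op on the group, then scatter the results back to their original lanes - can be sketched like this (a hand-written python model of the idea, not any real intrinsic or ISA):

```python
def predicated_add_via_gather_scatter(dst, a, b, mask):
    # "gather": collect the indices of the active (predicated-on) lanes.
    active = [i for i, m in enumerate(mask) if m]
    # dense, unpredicated operation on the compressed group - this is
    # the part a fixed-width SIMD unit can execute directly.
    packed = [a[i] + b[i] for i in active]
    # "scatter": write results back to their original lanes only;
    # masked-off lanes keep the destination's old values.
    out = list(dst)
    for j, i in enumerate(active):
        out[i] = packed[j]
    return out

r = predicated_add_via_gather_scatter(
    dst=[0, 0, 0, 0], a=[1, 2, 3, 4], b=[10, 20, 30, 40],
    mask=[True, False, True, False])
```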
Jacob Lifshay <programmerjake at gmail.com> writes:

> So it would be handy for the vector length on evl intrinsics to be in
> units of the mask length so we don't have to pattern match a division
> in the backend. We could have 2 variants of the vector length
> argument, one in terms of the data vector and one in terms of the mask
> vector. we could legalize the mask vector variant for those
> architectures that need it by pulling the multiplication out and
> switching to the data vector variants.

Would it make sense to have two different intrinsics?

  # "Normal" form, L is in terms of the flat vector length.
  <scalable 2 x float> evl.fsub(<scalable 2 x float> %x,
                                <scalable 2 x float> %y,
                                <scalable 2 x i1> %M, i32 %L)

  # "Sub-vector" form, L is in terms of sub-vector elements.
  <scalable 1 x <2 x float>> evl.fsub(<scalable 1 x <2 x float>> %x,
                                      <scalable 1 x <2 x float>> %y,
                                      <scalable 1 x <2 x i1>> %M, i32 %L)

Overloading types to mean two very different things is confusing to me.

-David