Luke Kenneth Casson Leighton via llvm-dev
2019-Feb-01 07:52 UTC
[llvm-dev] [RFC] Vector Predication
---
crowd-funded eco-conscious hardware: https://www.crowdsupply.com/eoma68

On Thu, Jan 31, 2019 at 10:22 PM Jacob Lifshay <programmerjake at gmail.com> wrote:
>
> We're in-progress designing a RISC-V extension
> (http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-January/000433.html)
> that would have variable-length vectors of short vectors (1 to 4):
>     <VL x <4 x float>>
> where each predicate bit masks out a whole short vector. We're using this
> extension to vectorize graphics code where variables in the
> pre-vectorization code are short vectors.
> So, vectorizing code like:
> for(int i = 0; i < 1000; i++)
> {
>     vec4 color = colors[i];
>     vec3 normal = normals[i];
>     color.rgb *= fmax(0.0, dot(normal, light_dir));
>     colors[i] = color;
> }
>
> I'm planning on passing already vectorized code into LLVM and using LLVM as
> a backend for optimization and JIT code generation.
>
> Do you think the EVL proposal would support an ISA like this as it's
> currently written (by pattern matching on predicate expansion and
> vector-length multiplication)?

whilst it may be tempting to suggest that a solution is to multiply up
the bits in the predicate (into groups of 3 or 4), the problem with
that is that if there are operations that require vec3 or vec4 as
operands interspersed with predicated operations that do not, that
realistically implies a need for two separate predicate registers,
otherwise cycles are wasted swapping predicates OR it implies that the
architecture *allows* two separate predicate registers to be selected.

consequently, it would be much, much better to be able to have a
single bit of a predicate apply to the *entire* vec3 or vec4 type, on
each outer loop.

l.
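To make the two options above concrete, here is a minimal Python sketch. It is not from the thread, and every function name in it is hypothetical. It models "multiplying up" a per-vec4 predicate into per-element bits, and compares that with applying one predicate bit to a whole short vector. Both give the same result; the difference is that the first form needs a second, wider predicate to exist somewhere.

```python
from typing import List


def expand_predicate(group_pred: List[bool], width: int) -> List[bool]:
    """Multiply up each per-vector predicate bit into `width` per-element bits."""
    return [bit for bit in group_pred for _ in range(width)]


def scale_per_element(flat: List[float], factors: List[float],
                      elem_pred: List[bool], width: int) -> List[float]:
    """Per-element predication over a flattened array of short vectors.

    `factors` holds one scale factor per short vector (hypothetical layout).
    """
    out = list(flat)
    for i, active in enumerate(elem_pred):
        if active:
            out[i] = flat[i] * factors[i // width]
    return out


def scale_per_vector(vectors: List[List[float]], factors: List[float],
                     group_pred: List[bool]) -> List[List[float]]:
    """One predicate bit masks out an entire short vector."""
    return [
        [x * f for x in v] if active else list(v)
        for v, f, active in zip(vectors, factors, group_pred)
    ]
```

With two vec4 values and the predicate `[True, False]`, `expand_predicate` yields eight element bits, and both routines leave the second vector untouched.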
On Thu, Jan 31, 2019 at 11:53 PM Luke Kenneth Casson Leighton via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>
> whilst it may be tempting to suggest that a solution is to multiply up
> the bits in the predicate (into groups of 3 or 4), the problem with
> that is that if there are operations that require vec3 or vec4 as
> operands interspersed with predicated operations that do not, that
> realistically implies a need for two separate predicate registers,
> otherwise cycles are wasted swapping predicates OR it implies that the
> architecture *allows* two separate predicate registers to be selected.
>
> consequently, it would be much, much better to be able to have a
> single bit of a predicate apply to the *entire* vec3 or vec4 type, on
> each outer loop.

This situation can be handled easily in the standard RISC-V vector
extension. You'd do something like...

    vsetvli t0, a0, vsew128,vnreg8,vdiv4

... to configure the vector unit to provide eight vector register
variables with a standard element width of 128 bits (some
instructions will widen or narrow one step to/from 64 bits or 256
bits), and then to divide each 128-bit element into 4 parts.

Arithmetic/logical/shift will happen on 32-bit elements, but
predication and loads and stores (including strided or scatter/gather)
will operate on 128-bit elements.

[I just made up "vnreg8" as an alias for the standard "vlmul4" because
"vlmul4,vdiv4" might look confusing. Either way it means to put 0b10
into bits [1:0] of the vtype CSR, specifying that the 32 vector
registers should be ganged into 8 groups, each 4x longer than standard,
because (I'm assuming) we need more than four vector registers in this
loop, but no more than eight]
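The arithmetic behind that configuration can be checked with a small Python sketch. It is only an illustration under stated assumptions: the helper names are hypothetical, and the log2 encoding of the vlmul field is inferred from the single data point given above (vlmul4 encodes as 0b10), not taken from the specification.

```python
NUM_VECTOR_REGS = 32  # architectural vector registers, as stated above


def vlmul_field(lmul: int) -> int:
    """Encode a register-grouping factor as a 2-bit field.

    Assumption: the field holds log2(lmul), which matches the one case
    given in the message above (lmul 4 -> 0b10).
    """
    if lmul not in (1, 2, 4, 8):
        raise ValueError("lmul must be 1, 2, 4 or 8")
    return lmul.bit_length() - 1


def describe_config(sew_bits: int, lmul: int, div: int) -> dict:
    """Summarise what a (vsew, vlmul, vdiv) choice implies."""
    return {
        "register_groups": NUM_VECTOR_REGS // lmul,  # usable vector "variables"
        "vlmul_field": vlmul_field(lmul),            # value for vtype bits [1:0]
        "sub_element_bits": sew_bits // div,         # width the arithmetic operates on
    }
```

Calling `describe_config(128, 4, 4)` reproduces the numbers in the message: 8 register groups, a field value of 0b10, and 32-bit sub-elements.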
On Fri, Feb 1, 2019 at 1:19 AM Bruce Hoult <brucehoult at sifive.com> wrote:
>
> This situation can be handled easily in the standard RISC-V vector
> extension. You'd do something like...
>
>     vsetvli t0, a0, vsew128,vnreg8,vdiv4
>
> ... to configure the vector unit to provide eight vector register
> variables divided into a standard element width of 128 bits (some
> instructions will widen or narrow one step to/from 64 bits or 256
> bits), and then dividing each 128 bit element into 4 parts.
>
> Arithmetic/logical/shift will happen on 32 bit elements, but
> predication and loads and stores (including strided or scatter/gather)
> will operate on 128 bit elements.
>
> [I just made up "vnreg8" as an alias for the standard "vlmul4" because
> "vlmul4,vdiv4" might look confusing. Either way it means to put 0b10
> into bits [1:0] of the vtype CSR specifying that the 32 vector
> registers should be ganged into 8 groups each 4x longer than standard
> because (I'm assuming) we need more than four vector registers in this
> loop, but no more than eight]
>

Neat! I did not know that about the V extension. So this sounds as though
the V extension would like support for <VL x <4 x float>>-style vectors as
well.

We are currently thinking of defining the extension in terms of a 16-bit
prefix that changes standard 32-bit instructions into vectorized 48-bit
instructions. That allows most current or future standard and non-standard
extensions, such as the B extension, to be vectorized, rather than having to
wait for vector versions of them to be added to the V extension (one reason
we are not using the V extension instead).

Having a prefix rather than, or in addition to, a layout configuration
register allows intermixing vector operations on different group/element
sizes without having to change the vector configuration every few
instructions.
We would also be reserving a 32-bit opcode for compressed variants of the
most commonly used 48-bit prefixed instructions, similar in style to the C
extension.

Having a prefix also allows (assuming we don't run out of encoding space)
larger register fields; we're planning on 128 integer and 128 fp registers.

Btw, for anyone interested, feel free to join us over on libre-riscv-dev or
follow at https://www.crowdsupply.com/libre-risc-v/m-class (currently
somewhat out-of-date).

Jacob