Luke Kenneth Casson Leighton via llvm-dev
2018-Aug-06 06:12 UTC
[llvm-dev] vectorisation, risc-v
(please do cc me to preserve thread as i am subscribed digest)

Hi folks, i have a requirement to develop a libre licensed low power embedded 3D GPU and VPU and using RISCV as the basis (GPGPU style) seems eminently sensible, and anyone is invited to participate. A gsoc2017 student named Jake has already developed a Vulkan3D software renderer and shader, and (parallelised) llvm is a critical dependency to achieving the high efficiency needed. The difference when compared to gallium3d-llvmpipe is that Jake's software renderer also uses llvm for the shader, where g3dllvm does not.

I have reviewed the llvm RV RFC and it looks very sensible, informative, and well thought through. Keeping VL changes restricted to function call boundaries is a very good idea (presumably "fake" function calls can be considered, as a way to break up large functions safely), the intrinsic vector length, ie passing in the vector length effectively as an additional hidden function parameter, also very sensible.

I also liked that it was clear from the RFC that LLVM is divided into two parts, which I suspected but had not had it confirmed.

As an aside I have to say that I am extremely surprised to learn that it is only in the past year that vectorisation or more specifically variable length SIMD has hit mainstream in libre licensed toolchains, through ARM and RISCV.

So some background: I am the author of the SimpleV extension, which has been developed to provide a uniform *parallelism* API, *not* as a new Vector Microarchitecture (a common misconception). It has unintended side effects such as providing LD/ST multi with predication, which in turn can be used on function entry or context switch to save or load *up to* the entire register file with around three instructions. Another unintended side effect is code size reduction.

There is a total of ZERO new RISCV instructions, the entire design is based around CSRs that implicitly mark the STANDARD registers as "vectorised", also providing a redirection table that can arbitrarily redirect the 32 registers to 64 REAL registers (64 real FP and 64 real int), including empowering Compressed instructions to access the full 64 registers, even when the C instruction is restricted to x8-x15. Predication similarly is via CSR redirection/lookups.

SETVL is slightly different from RV as it requires an immediate length as an additional parameter. This is because the Maximum Vector Length is no longer hardcoded into silicon, it instead specifies exactly how *many* contiguous registers in the standard regfile need to be used, NOT how many are in a totally different regfile and NOT the width of the SIMD / Vector Lane(s).

So with that as background, I have some questions.

1. I note that the separation between LLVM front and backend looks like adding SV experimental support would be a simple matter of doing the backend assembly code translator, with little to no modifications to the front end needed, would that be about right? Particularly if LLVM-RV already adds a variable length concept.

2. With there being absolutely no new instructions whatsoever (standard existing AND FUTURE scalar ops are instead made implicitly parallel), and given the deliberate design similarities it seems to me that SV would be a good first experimental backend *ahead* of RVV, for which the 240+ opcodes have not yet been finalised. Would people concur?

3. If there are existing patches, where can they be found?

4. From Jeff Bush's Nyuzi work it has been noted that certain 3D operations are just far too expensive to do as SIMD or vectors. Multiple FP ARGB to 24/32 bit direct overlay with transparency into a tile is therefore for example a high priority candidate for adding a special opcode that must explicitly be called.
Is this relatively easy to do and is there documentation explaining how?

5. Although it is way way early to discuss optimisations I did have an idea that may benefit RVV, SV and ARM vectors, jumpstarting them to the sorts of speeds associated with SIMD. SV has the concept of being able to mark register sequences (aka vectors) as "packed SIMD y/n" including overriding a standard opcode's default width, and including predication but on the PACKED width NOT the element width. Thus it would seem logical to reflect this in the extension of basic data types as vectorlen x simdwidth x datatype as opposed to just vectorlen * datatype as the RFC currently stands. In doing so *all* of the vectorisation systems could simply vectorise (and leverage) the *existing* proven SIMD patterns that have taken years to establish.

To illustrate: if the loop length is divisible by two an instruction VL x 2 x 32bitint would be issued, the SIMD pattern for 2x32bitint could be deployed, including predication down to the 2x32bitint level if desired, and yet there would be no loop cleanup.

It is worth emphasising that this shall not be a private proprietary hard fork of llvm, it is an entirely libre effort including the GPGPU (I read Alex's lowRISC posts on such private forking practices, a hard fork would be just insane and hugely counterproductive), so in particular regard to (4) documentation, guidelines and recommendations likely to result in the upstreaming process going smoothly also greatly appreciated.

Many thanks,

L.

--
---
crowd-funded eco-conscious hardware: https://www.crowdsupply.com/eoma68
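
A minimal software model of idea (5) above, offered only as a sketch. The names MVL, SIMD_WIDTH, simd_add_2x32 and packed_vector_add are hypothetical and come from neither the RFC nor the SV spec. It shows a loop whose length is divisible by two being consumed in VL x 2 x 32bitint steps, reusing a 2x32bitint pattern for each packed group, with no scalar cleanup loop at the end.

```python
SIMD_WIDTH = 2  # assumed packed width: 2 x 32-bit ints per vector element
MVL = 4         # assumed maximum vector length, counted in packed groups


def simd_add_2x32(a, b):
    """Stand-in for an existing 2 x int32 SIMD add pattern (wraps at 32 bits)."""
    return [(x + y) & 0xFFFFFFFF for x, y in zip(a, b)]


def packed_vector_add(xs, ys):
    """Add two equal-length lists in VL x 2 x int32 steps, with no scalar tail.

    Returns the result and the list of VL values used on each pass.
    """
    if len(xs) != len(ys) or len(xs) % SIMD_WIDTH != 0:
        raise ValueError("lengths must match and be divisible by SIMD_WIDTH")
    out = []
    steps = []
    pos = 0
    groups_left = len(xs) // SIMD_WIDTH
    while groups_left > 0:
        vl = min(MVL, groups_left)  # SETVL-style choice of vector length
        steps.append(vl)
        for g in range(vl):
            lo = pos + g * SIMD_WIDTH
            hi = lo + SIMD_WIDTH
            out.extend(simd_add_2x32(xs[lo:hi], ys[lo:hi]))
        pos += vl * SIMD_WIDTH
        groups_left -= vl
    return out, steps


result, steps = packed_vector_add(list(range(10)), [1] * 10)
```

Ten elements form five packed groups, so the model takes one pass with VL=4 and one with VL=1; the variable vector length absorbs the remainder that a fixed-width SIMD loop would need separate cleanup code for.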
On 6 August 2018 at 07:12, Luke Kenneth Casson Leighton via llvm-dev <llvm-dev at lists.llvm.org> wrote:
> (please do cc me to preserve thread as i am subscribed digest)
>
> Hi folks, i have a requirement to develop a libre licensed low power
> embedded 3D GPU and VPU and using RISCV as the basis (GPGPU style) seems
> eminently sensible, and anyone is invited to participate. A gsoc2017
> student named Jake has already developed a Vulkan3D software renderer and
> shader, and (parallelised) llvm is a critical dependency to achieving the
> high efficiency needed. The difference when compared to gallium3d-llvmpipe
> is that Jake's software renderer also uses llvm for the shader, where
> g3dllvm does not.

Hi Luke and welcome to the LLVM community.

> I have reviewed the llvm RV RFC and it looks very sensible, informative, and
> well thought through. Keeping VL changes restricted to function call
> boundaries is a very good idea (presumably "fake" function calls can be
> considered, as a way to break up large functions safely), the intrinsic
> vector length, ie passing in the vector length effectively as an additional
> hidden function parameter, also very sensible.
>
> I also liked that it was clear from the RFC that LLVM is divided into two
> parts, which I suspected but had not had it confirmed.
>
> As an aside I have to say that I am extremely surprised to learn that it is
> only in the past year that vectorisation or more specifically variable
> length SIMD has hit mainstream in libre licensed toolchains, through ARM and
> RISCV.
>
> So some background : I am the author of the SimpleV extension, which has
> been developed to provide a uniform *parallelism* API, *not* as a new Vector
> Microarchitecture (a common misconception). It has unintended side effects
> such as providing LD/ST multi with predication, which in turn can be used on
> function entry or context switch to save or load *up to* the entire register
> file with around three instructions.
> Another unintended side effect is code size reduction.
>
> There is a total of ZERO new RISCV instructions, the entire design is based
> around CSRs that implicitly mark the STANDARD registers as "vectorised",
> also providing a redirection table that can arbitrarily redirect the 32
> registers to 64 REAL registers (64 real FP and 64 real int), including
> empowering Compressed instructions to access the full 64 registers, even
> when the C instruction is restricted to x8-x15. Predication similarly is
> via CSR redirection/lookups.

It's worth noting the fact that there are zero new RISC-V instruction encodings doesn't mean it's necessarily easier to support vs a proposal that introduces new instructions. LLVM would have to be taught how to handle this register bank switching / redirection scheme and how to minimise the number of switches required. This does have the potential to be somewhat intrusive. It reduces work for the MC layer (assembler/disassembler), but the code generator would still need to understand the semantics of these overloaded instructions.

> 1. I note that the separation between LLVM front and backend looks like
> adding SV experimental support would be a simple matter of doing the backend
> assembly code translator, with little to no modifications to the front end
> needed, would that be about right? Particularly if LLVM-RV already adds a
> variable length concept.

As with most compilers you can separate the frontend, middle-end and backend. Adding SV experimental support would definitely, as you say, require work in the backend (supporting lowering of IR to machine instructions) but potentially also middle-end modifications (IR->IR transformations) to enable the existing vectorisation passes.

> 2.
> With there being absolutely no new instructions whatsoever (standard
> existing AND FUTURE scalar ops are instead made implicitly parallel), and
> given the deliberate design similarities it seems to me that SV would be a
> good first experimental backend *ahead* of RVV, for which the 240+ opcodes
> have not yet been finalised. Would people concur?

I'm not convinced it would actually be an easier starting point and I anticipate quite a lot of work describing these new instruction semantics and teaching LLVM how to use them.

For clarity, is this something you're proposing to be done directly in upstream LLVM, or something you're asking for advice on in an (at least initially) downstream project?

> 3. If there are existing patches, where can they be found?

Robin Kruppe is the main person working on RVV support. I'm not sure if patches have been made available anywhere at this point.

> 4. From Jeff Bush's Nyuzi work It has been noted that certain 3D operations
> are just far too expensive to do as SIMD or vectors. Multiple FP ARGB to
> 24/32 bit direct overlay with transparency into a tile is therefore for
> example a high priority candidate for adding a special opcode that must
> explicitly be called. Is this relatively easy to do and is there
> documentation explaining how?

Adding a new instruction and making it available through inline assembly or intrinsics is pretty easy.
I did a mini-tutorial on this at the RISC-V Workshop in Barcelona and really should tidy up and publish the extended materials I started on this subject.

> It is worth emphasising that this shall not be a private proprietary hard
> fork of llvm, it is an entirely libre effort including the GPGPU (I read
> Alex's lowRISC posts on such private forking practices, a hard fork would be
> just insane and hugely counterproductive), so in particular regard to (4)
> documentation, guidelines and recommendations likely to result in the
> upstreaming process going smoothly also greatly appreciated.

One additional thought: I think RISC-V is somewhat unique in LLVM in that implementers are free to design and implement custom extensions without need for prior approval. Many such implementers may wish to see upstream LLVM support for their extensions. For any open source project, it's normal to consider factors such as the following when considering new contributions:
* Potential value to the project (will there be users?)
* Potential cost to the project (what is the maintenance burden? Is someone stepping up to maintain the addition?)
* Is it stable? i.e. are the design and external interfaces finalised? If not, is the level of instability compatible with the project's release process and sufficiently explained in docs etc.

Support for any standard RISC-V Foundation published extensions is easy to justify. Also for custom extensions with shipping hardware that is programmable by end-users. Cases such as experimental non-standard extensions that haven't (yet) shipped in hardware might require more examination on a case-by-case basis. [Just sharing my initial thoughts here rather than official LLVM policy.]

Best,

Alex

Luke Kenneth Casson Leighton via llvm-dev
2018-Aug-06 16:47 UTC
[llvm-dev] vectorisation, risc-v
[hi, please do cc me to maintain thread integrity, i am subscribed digest]

On Mon, Aug 6, 2018 at 2:32 PM, Alex Bradbury <asb at lowrisc.org> wrote:
> Hi Luke and welcome to the LLVM community.

... oh! hi alex! thanks :)

>> There is a total of ZERO new RISCV instructions, the entire design is based
>> around CSRs that implicitly mark the STANDARD registers as "vectorised",
>> [...]
>
> It's worth noting the fact that there are zero new RISC-V instruction
> encodings doesn't mean it's necessarily easier to support vs a
> proposal that introduces new instructions.

appreciate the insight.

> LLVM would have to be
> taught how to handle this register bank switching / redirection scheme
> and how to minimise the number of switches required. This does have
> the potential to be somewhat intrusive. It reduces work for the MC
> layer (assembler/disassembler), but the code generator would still
> need to understand the semantics of these overloaded instructions.

indeed. there's a major difference between SV and RV here, which stems from the use of the standard register file. SV's SETVL *has* to guarantee that when the #immediate is set to e.g. 16, that if srcreg is >= 16, VL *must* be set to 16. this is to ensure that if it is used for a single hit (i.e. with no looping), for example in a context-switch or LD/ST MULTI substitute, that the LD/ST or context-switch can be achieved in two and only two instructions [minus CSR setup]: SETVL and the LD/ST/other-op.

as a first pass these kinds of... interesting semantics would probably be a good idea to skip, and instead break the (now-extended) register file into two groups: x0-x31 (standard regfile) and x32-x63 which would be utilised by an SV-aware MC layer. the "standard" x0-x31 regs would be treated as "scalar", the top ones treated as "RV-like". a bit like SSE, in other words. what do you think?

btw my primary focus here is to do the research into what's the most practical path for empowering jake to create a Libre 3D GPGPU.

>> 1.
>> I note that the separation between LLVM front and backend looks like
>> adding SV experimental support would be a simple matter of doing the backend
>> assembly code translator, with little to no modifications to the front end
>> needed, would that be about right? Particularly if LLVM-RV already adds a
>> variable length concept.
>
> As with most compilers you can separate the frontend, middle-end and
> backend. Adding SV experimental support would definitely, as you say,
> require work in the backend (supporting lowering of IR to machine
> instructions) but potentially also middle-end modifications (IR->IR
> transformations) to enable the existing vectorisation passes.

can you elaborate on that at all, or point me in the direction of some docs? or is it something that's... generally understood?

if i can translate what you're saying to concepts that i understand from prior experience: the people who helped develop pyjs (a python-to-javascript language *translator*), we wrote a python-to-AST parser (actually... just took parts of lib2to3, wholesale!), then added in "AST morphers" which would walk the AST looking for patterns and actually *modify* the AST in-memory to a more javascript-like format, and *then* handed it over to the JS outputter. the biggest of these was, if i recall, the one that massively rewrote the AST to add proper support for python "yield".

if i understand correctly, the intermediary morphing (IR->IR) is i *think* the same thing, does that sound about right? that you have an IR, but that, for the target ISA which has certain concepts that are less (or more) efficient, the IR needs rewriting from the "general-purpose" original down to a more "architecturally-specific" IR.

>> 2.
>> With there being absolutely no new instructions whatsoever (standard
>> existing AND FUTURE scalar ops are instead made implicitly parallel), and
>> given the deliberate design similarities it seems to me that SV would be a
>> good first experimental backend *ahead* of RVV, for which the 240+ opcodes
>> have not yet been finalised. Would people concur?
>
> I'm not convinced it would actually be an easier starting point and I
> anticipate quite a lot of work describing these new instruction
> semantics and teaching LLVM how to use them.
>
> For clarity, is this something you're proposing to be done directly in
> upstream LLVM, or something you're asking for advice on in an (at
> least initially) downstream project?

i don't know yet: i do know that, ultimately, it'll need to be upstreamed. it's just not possible otherwise to have a goal with the word "libre" attached to it. if the M-Class SoC takes off, it would end up in hundreds of millions of ubiquitous devices (primarily and initially in india) - smartphones, tablets, netbooks and so on - and if there's even the sniff of a proprietary library or anything that's *not* fully upstreamed the site hosting it will be deluged with complaints from libre advocates (due to the proprietary libraries/applications), and, due to the sheer number of units, absolutely deluged with downloads.

from a development perspective, being able to coordinate disparate groups via upstream repositories would be... a lot easier, but not a showstopper if they weren't. but ultimately everything does have to go upstream.

>> 3. If there are existing patches, where can they be found?
>
> Robin Kruppe is the main person working on RVV support. I'm not sure
> if patches have been made available anywhere at this point.

ah! yes, sorry, hi robin, loved the RFC. do you have anything available that could be looked over?

>> 4. From Jeff Bush's Nyuzi work It has been noted that certain 3D operations
>> are just far too expensive to do as SIMD or vectors.
>> Multiple FP ARGB to
>> 24/32 bit direct overlay with transparency into a tile is therefore for
>> example a high priority candidate for adding a special opcode that must
>> explicitly be called. Is this relatively easy to do and is there
>> documentation explaining how?
>
> Adding a new instruction and making it available through inline
> assembly or intrinsics is pretty easy.

ok great, good to hear.

> I did a mini-tutorial on this
> at the RISC-V Workshop in Barcelona and really should tidy up and
> publish the extended materials I started on this subject.

no rush, here.

>> It is worth emphasising that this shall not be a private proprietary hard
>> fork of llvm, it is an entirely libre effort including the GPGPU (I read
>> Alex's lowRISC posts on such private forking practices, a hard fork would be
>> just insane and hugely counterproductive), so in particular regard to (4)
>> documentation, guidelines and recommendations likely to result in the
>> upstreaming process going smoothly also greatly appreciated.
>
> One additional thought: I think RISC-V is somewhat unique in LLVM in
> that implementers are free to design and implement custom extensions
> without need for prior approval. Many such implementers may wish to
> see upstream LLVM support for their extensions.

i don't know if you followed the isa-conflict-resolution discussion (which was itself ironically... full of conflict), i am... well, there's no easy way to say this, so i'll just say it straight: just as with gcc, if upstream LLVM accepts such extension support upstream (which implies that, publicly, that opcode is now permanently and irrevocably world-wide "taken over" and is *permanently* and implicitly *exclusively* reserved by that implementor) without them being able to switch it off, i.e. having something that's exactly or is orthogonal to the isa-mux proposal (which is exactly like the 386 "segment offset" concept... except for instructions), LLVM-RISC-V will get into an absolute world of pain.
the very first time that LLVM has to generate (support) two uses of the exact same binary instruction encoding with two completely different meanings, that's it: RISC-V will be treated exactly like Altivec / SSE for PowerPC, i.e. dead.

so please, please, guys, for goodness sake, please: when it comes to upstreaming non-standard custom extensions that "take over" opcode space in ways that *can't be switched off*, please for goodness sake put your foot down and say "no, sorry, you'll have to maintain this yourself as an unsupported hard fork".

think about it: you let even *one* team make a public declaration "we effectively own this custom opcode, now", that's it: nobody else, anywhere in the world, can ever publicly consider using it. and you know how few custom opcodes there are [in the 32-bit space]. two.

i can't.... i can't begin to express how absolutely critical this is, to the entire RISC-V ecosystem.

> For any open source
> project, it's normal to consider factors such as the following when
> considering new contributions:
> * Potential value to the project (will there be users?)
> * Potential cost to the project (what is the maintenance burden? Is
> someone stepping up to maintain the addition?)
> * Is it stable? i.e. are the design and external interfaces finalised?
> If not, is the level of instability compatible with the project's
> release process and sufficiently explained in docs etc.

appreciated. well, i can say that the people i've encountered are extremely competent: Jake for example is amazing. and i'm used to dealing with software libre projects. i'll make sure that the different groups/contributors minimise impact.

yes, SV is sufficiently straightforward that it's not going to significantly change. i say that, but i did reconsider the CSR element width meanings recently so that RV32G could transfer 64-bit ints to 64-bit floats with a single (meaning-overloaded) instruction...
whilst i can categorically say that there are not going to be major redesigns (i don't anticipate any being needed, the RV work is the base foundation and that's had *literally* years of thought put into it), small changes as issues are encountered are... kiiinda going to be inevitable, and can only realistically be worked out as and when an actual implementation gets underway.

> Support for any standard RISC-V Foundation published extensions is
> easy to justify.

indeed. for standard extensions, the RV Foundation acts as the atomic arbiter to ensure and guarantee world-wide exclusive unique meaning of any given opcode. that's their role, it's the purpose of the Certification Mark, and they'll do that job very very well.

> Also for custom extensions with shipping hardware
> that is programmable by end-users.

ngggh... *sigh* ok much as i'd like to keep this topic on-track, the following is very important to be aware of.

the custom opcode space (and the practice of overriding even standard opcodes) is where the RISC-V Foundation has unfortunately failed to comprehend the nature of the problem, and has effectively passed the burden of responsibility for atomically curating the public opcode space over to the FSF (in the case of gcc) and to the LLVM team (in the case of LLVM). this um may be news to you.

we know for a fact, from many many historic examples, with PowerPC Altivec/SSE being the most publicly well-known one, that dual or greater *public* binary ISA encoding conflicts (private ones are not a problem at all) quite literally kill the entire ISA as it completely violates the expected contract that any binary will have one and only one "meaning".

so if we want RISC-V to not be killed off stone-dead, it's absolutely critically important that this contract never be violated. and.. um... the responsibility for guaranteeing that that be the case... has been punted.... to *you*, the LLVM team. that may.. um... come as an alarming surprise.

examples which have been *successful* in other architectures - and not burdensome at all - are those which dynamically set a CSR which changes big-endian / little-endian meaning of LD/ST instructions. these have an *exact* orthogonal equivalent system to the mvendorid-marchid-isamux concept.

> Cases such as experimental
> non-standard extensions that haven't (yet) shipped in hardware might
> require more examination on a case-by-case basis. [Just sharing my
> initial thoughts here rather than official LLVM policy.]

understood. so, a couple of things:

firstly, the isa-mux proposal basically extends standard opcodes from 32-bit to say 33, 34 or however many extra (hidden) bits are needed to *literally* change the meaning of 32-bit binary encoding(s).

want an encoding completely switched off and to fall back to standard RVbase only? no problem: set muxid=0. done. transfer a binary to another architecture and you want to guarantee that you will get an exception trap that's world-wide unique and so can be software-emulated with publicly-available libre libraries? no problem: the trap will have access to the current "mux" setting from userspace so will know *exactly* which (publicly published) custom extension it should emulate.

so the mvendorid-marchid-muxid becomes a globally world-wide unique tuple for which it is absolutely flat-out impossible to have any kind of conflict of the Altivec / SSE PowerPC type that killed *public* gcc / LLVM PowerPC vector support stone dead. nobody knew if a given binary would work on their hardware, because they had no idea if the binary encoding for an opcode was for Altivec or for its competitor... so nobody bothered with either. following the isamux concept, that situation is impossible to encounter.

therefore, i will be (and have been) making it absolutely clear to people that the 3D and SV custom extensions *will* be run-time dynamically muxable (i.e. absolutely guaranteed to be world-wide unique).

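
a rough model of the mux idea described above, as a sketch only: the vendor ids, mnemonics and table layout here are all made up for illustration and are not taken from the isa-mux proposal text. the same 32-bit encoding (0x0000000B, which sits in the custom-0 opcode space) decodes differently depending on the (mvendorid, marchid, muxid) tuple, muxid=0 falls back to the standard base only, and the trap carries the tuple so an emulator could tell which published extension was meant.

```python
class IllegalInstruction(Exception):
    """Modelled trap; carries the mux tuple so a handler can pick an emulator."""

    def __init__(self, mux_key, encoding):
        super().__init__(f"illegal instruction {encoding:#010x} under {mux_key}")
        self.mux_key = mux_key
        self.encoding = encoding


# Standard base encodings mean the same thing under every mux setting.
STANDARD = {0x00000013: "nop"}  # addi x0, x0, 0

# Hypothetical custom tables: one encoding, two publicly distinct meanings.
CUSTOM = {
    (0x1, 0x1, 1): {0x0000000B: "vendorA.blend"},
    (0x2, 0x7, 1): {0x0000000B: "vendorB.dot"},
}


def decode(encoding, mvendorid, marchid, muxid):
    """Decode a 32-bit encoding under the current (vendor, arch, mux) tuple."""
    key = (mvendorid, marchid, muxid)
    if muxid != 0:
        table = CUSTOM.get(key, {})
        if encoding in table:
            return table[encoding]
    # muxid == 0, or nothing claimed under this tuple: standard base only.
    if encoding in STANDARD:
        return STANDARD[encoding]
    raise IllegalInstruction(key, encoding)
```

with muxid=0 the custom encoding traps rather than silently meaning something else, and the key on the exception is what a software emulation layer would look up.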
secondly, over the long term, there's going to need to be quite a bit of sustained coordination, world-wide, between completely different inter-dependent groups. i've yet to track down someone who can do the modifications to spike, for example, and they may be someone who doesn't know (and shouldn't have to know) about LLVM, or even the Libre 3D GPGPU project.

if however i can point them at an upstream branch / repo for llvm and say to them "run this test code", things get a heck of a lot easier (i.e. don't go rapidly out-of-date, follow and keep up-to-date with standard practices)... you get where i'm going with that, i'm sure.

ok, that's probably enough for now, apologies i am taking on a new project today, i may be a little delayed in responding. this is however very important so i will be putting some thought into replies.

l.
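
for reference, the SETVL guarantee described earlier in this message can be modelled in a few lines. this is a sketch under assumptions: the message only states that VL must equal the immediate when srcreg >= immediate, so taking VL = srcreg for smaller requests is assumed here, and the function names and the 64-entry register list are hypothetical.

```python
def setvl(requested, max_imm):
    """Model SETVL: VL is capped at the immediate.

    The message guarantees VL == max_imm when requested >= max_imm;
    returning `requested` for smaller values is an assumption.
    """
    if requested < 0 or max_imm < 1:
        raise ValueError("requested must be >= 0 and max_imm >= 1")
    return min(requested, max_imm)


def vector_store(regfile, first_reg, vl, memory, addr):
    """Model one vectorised store: write vl contiguous registers to memory."""
    for i in range(vl):
        memory[addr + i] = regfile[first_reg + i]


# "Single hit" use with no loop: one SETVL, one store, 16 registers saved.
regfile = list(range(100, 164))  # 64 modelled registers holding dummy values
memory = {}
vl = setvl(40, 16)               # request exceeds the immediate, so VL == 16
vector_store(regfile, 8, vl, memory, 0)
```

because VL is pinned to the immediate whenever the request is at least that large, the caller knows exactly 16 registers were stored and needs no loop around the two operations.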