thr3ads.net - llvm dev - [llvm-dev] vectorisation, risc-v [Aug 2018]

Luke Kenneth Casson Leighton via llvm-dev

2018-Aug-06 06:12 UTC

[llvm-dev] vectorisation, risc-v

(please do cc me to preserve thread as i am subscribed digest)

Hi folks, i have a requirement to develop a libre licensed low power
embedded 3D GPU and VPU and using RISCV as the basis (GPGPU style) seems
eminently sensible, and anyone is invited to participate.  A gsoc2017
student named Jake has already developed a Vulkan3D software renderer and
shader, and (parallelised) llvm is a critical dependency to achieving the
high efficiency needed. The difference when compared to gallium3d-llvmpipe
is that Jake's software renderer also uses llvm for the shader, where
g3dllvm does not.

I have reviewed the llvm RV RFC and it looks very sensible, informative,
and well thought through. Keeping VL changes restricted to function call
boundaries is a very good idea (presumably "fake" function calls can be
considered, as a way to break up large functions safely), the intrinsic
vector length, ie passing in the vector length effectively as an additional
hidden function parameter, also very sensible.

I also liked that it was clear from the RFC that LLVM is divided into two
parts, which I suspected but had not had it confirmed.

As an aside I have to say that I am extremely surprised to learn that it is
only in the past year that vectorisation or more specifically variable
length SIMD has hit mainstream in libre licensed toolchains, through ARM
and RISCV.

So some background : I am the author of the SimpleV extension, which has
been developed to provide a uniform *parallelism* API, *not* as a new
Vector Microarchitecture (a common misconception). It has unintended
sideeffects such as providing LD/ST multi with predication, which in turn
can be used on function entry or context switch to save or load *up to* the
entire register file with around three instructions. Another unintended
sideeffect is code size reduction.

There is a total of ZERO new RISCV instructions, the entire design is based
around CSRs that implicitly mark the STANDARD registers as "vectorised",
also providing a redirection table that can arbitrarily redirect the 32
registers to 64 REAL registers (64 real FP and 64 real int), including
empowering Compressed instructions to access the full 64 registers, even
when the C instruction is restricted to x8-x15.  Predication similarly is
via CSR redirection/lookups.
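
As a rough illustration of the redirection idea above, here is a minimal Python sketch of a table that maps the 32 ISA-visible register numbers onto a 64-entry real register file and marks some of them as "vectorised". The class names, field names and table layout are assumptions made up for this example; they are not the actual SimpleV CSR format.

```python
from dataclasses import dataclass

NUM_ISA_REGS = 32   # register numbers an instruction can encode
NUM_REAL_REGS = 64  # size of the real register file described above


@dataclass
class RegEntry:
    real_reg: int     # hypothetical field: index into the real register file
    vectorised: bool  # hypothetical field: does an op on this register span VL registers?


class RedirectTable:
    """Toy stand-in for the CSR-held redirection table (layout is assumed)."""

    def __init__(self):
        # Default state: identity mapping, every register scalar.
        self.entries = {i: RegEntry(i, False) for i in range(NUM_ISA_REGS)}

    def set(self, isa_reg, real_reg, vectorised):
        if not 0 <= isa_reg < NUM_ISA_REGS:
            raise ValueError("ISA register out of range")
        if not 0 <= real_reg < NUM_REAL_REGS:
            raise ValueError("real register out of range")
        self.entries[isa_reg] = RegEntry(real_reg, vectorised)

    def resolve(self, isa_reg, vl):
        """Return the real registers touched by one operand at vector length vl."""
        entry = self.entries[isa_reg]
        count = vl if entry.vectorised else 1
        if entry.real_reg + count > NUM_REAL_REGS:
            raise ValueError("vector runs past the end of the register file")
        # A vectorised operand covers `count` contiguous real registers.
        return list(range(entry.real_reg, entry.real_reg + count))


table = RedirectTable()
table.set(8, 32, vectorised=True)  # x8 (reachable from Compressed ops) -> real regs 32..
vector_operand = table.resolve(8, vl=4)
scalar_operand = table.resolve(9, vl=4)
```

In this sketch an instruction that can only encode x8-x15 still reaches real registers 32 and up, because the lookup happens after decode.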

SETVL is slightly different from RV as it requires an immediate length as
an additional parameter. This is because the Maximum Vector Length is no
longer hardcoded into silicon, it instead specifies exactly how *many*
contiguous registers in the standard regfile need to be used, NOT how many
are in a totally different regfile and NOT the width of the SIMD / Vector
Lane(s).

So with that as background, I have some questions.

1. I note that the separation between LLVM front and backend looks like
adding SV experimental support would be a simple matter of doing the
backend assembly code translator, with little to no modifications to the
front end needed, would that be about right? Particularly if LLVM-RV
already adds a variable length concept.

2. With there being absolutely no new instructions whatsoever (standard
existing AND FUTURE scalar ops are instead made implicitly parallel), and
given the deliberate design similarities it seems to me that SV would be a
good first experimental backend  *ahead* of RVV, for which the 240+ opcodes
have not yet been finalised. Would people concur?

3. If there are existing patches, where can they be found?

4. From Jeff Bush's Nyuzi work it has been noted that certain 3D operations
are just far too expensive to do as SIMD or vectors. Multiple FP ARGB to
24/32 bit direct overlay with transparency into a tile is therefore for
example a high priority candidate for adding a special opcode that must
explicitly be called. Is this relatively easy to do and is there
documentation explaining how?

5. Although it is way way early to discuss optimisations I did have an idea
that may benefit RVV, SV and ARM vectors, jumpstarting them to the sorts of
speeds associated with SIMD. SV has the concept of being able to mark
register sequences (aka vectors) as "packed SIMD y/n" including overriding
a standard opcode's default width, and including predication but on the
PACKED width NOT the element width.  Thus it would seem logical to reflect
this in the extension of basic data types as vectorlen x simdwidth x
datatype as opposed to just vectorlen * datatype as the RFC currently
stands. In doing so *all* of the vectorisation systems could simply
vectorise (and leverage) the *existing* proven SIMD patterns that have
taken years to establish. To illustrate: if the loop length is divisible by
two an instruction VL x 2 x 32bitint would be issued, the SIMD pattern for
2x32bitint could be deployed, including predication down to the 2x32bitint
level if desired, and yet there would be no loop cleanup.
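
To make the "VL x 2 x 32bitint" illustration concrete, here is a small Python sketch of a strip-mined loop over packed groups. It only models the control flow: `packed_add`, `max_vl` and `simd_width` are hypothetical names for this example, and the inner generator stands in for whatever 2-wide SIMD pattern a backend would actually emit.

```python
def packed_add(a, b, max_vl=4, simd_width=2):
    """Add two equal-length lists as vectors of packed groups.

    Each trip processes up to max_vl groups of simd_width elements.
    The final trip simply runs with a shorter vector length, so no
    separate scalar cleanup loop is needed.
    """
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    if len(a) % simd_width:
        raise ValueError("length must be divisible by the SIMD width")

    out = []
    groups_left = len(a) // simd_width
    pos = 0
    trips = 0
    while groups_left > 0:
        vl = min(groups_left, max_vl)  # SETVL-like step, counted in groups
        for g in range(vl):
            base = pos + g * simd_width
            # One packed operation covering simd_width lanes at once.
            out.extend(x + y for x, y in zip(a[base:base + simd_width],
                                             b[base:base + simd_width]))
        pos += vl * simd_width
        groups_left -= vl
        trips += 1
    return out, trips


result, trips = packed_add(list(range(10)), [10] * 10)
```

With ten elements there are five groups of two, so the loop above takes one trip at VL=4 and one at VL=1.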

It is worth emphasising that this shall not be a private proprietary hard
fork of llvm, it is an entirely libre effort including the GPGPU (I read
Alex's lowRISC posts on such private forking practices, a hard fork would
be just insane and hugely counterproductive), so in particular regard to
(4) documentation, guidelines and recommendations likely to result in the
upstreaming process going smoothly also greatly appreciated.

Many thanks,

L.


-- 
---
crowd-funded eco-conscious hardware: https://www.crowdsupply.com/eoma68

Alex Bradbury via llvm-dev

2018-Aug-06 13:32 UTC

[llvm-dev] vectorisation, risc-v

On 6 August 2018 at 07:12, Luke Kenneth Casson Leighton via llvm-dev
<llvm-dev at lists.llvm.org> wrote:
> (please do cc me to preserve thread as i am subscribed digest)
>
> Hi folks, i have a requirement to develop a libre licensed low power
> embedded 3D GPU and VPU and using RISCV as the basis (GPGPU style) seems
> eminently sensible, and anyone is invited to participate.  A gsoc2017
> student named Jake has already developed a Vulkan3D software renderer and
> shader, and (parallelised) llvm is a critical dependency to achieving the
> high efficiency needed. The difference when compared to gallium3d-llvmpipe
> is that Jake's software renderer also uses llvm for the shader, where
> g3dllvm does not.
Hi Luke and welcome to the LLVM community.
> I have reviewed the llvm RV RFC and it looks very sensible, informative, and
> well thought through. Keeping VL changes restricted to function call
> boundaries is a very good idea (presumably "fake" function calls can be
> considered, as a way to break up large functions safely), the intrinsic
> vector length, ie passing in the vector length effectively as an additional
> hidden function parameter, also very sensible.
>
> I also liked that it was clear from the RFC that LLVM is divided into two
> parts, which I suspected but had not had it confirmed.
>
> As an aside I have to say that I am extremely surprised to learn that it is
> only in the past year that vectorisation or more specifically variable
> length SIMD has hit mainstream in libre licensed toolchains, through ARM and
> RISCV.
>
> So some background : I am the author of the SimpleV extension, which has
> been developed to provide a uniform *parallelism* API, *not* as a new Vector
> Microarchitecture (a common misconception). It has unintended sideeffects
> such as providing LD/ST multi with predication, which in turn can be used on
> function entry or context switch to save or load *up to* the entire register
> file with around three instructions. Another unintended sideeffect is code
> size reduction.
>
> There is a total of ZERO new RISCV instructions, the entire design is based
> around CSRs that implicitly mark the STANDARD registers as "vectorised",
> also providing a redirection table that can arbitrarily redirect the 32
> registers to 64 REAL registers (64 real FP and 64 real int), including
> empowering Compressed instructions to access the full 64 registers, even
> when the C instruction is restricted to x8-x15.  Predication similarly is
> via CSR redirection/lookups.
It's worth noting the fact that there are zero new RISC-V instruction
encodings doesn't mean it's necessarily easier to support vs a
proposal that introduces new instructions. LLVM would have to be
taught how to handle this register bank switching / redirection scheme
and how to minimise the number of switches required. This does have
the potential to be somewhat intrusive. It reduces work for the MC
layer (assembler/disassembler), but the code generator would still
need to understand the semantics of these overloaded instructions.
> 1. I note that the separation between LLVM front and backend looks like
> adding SV experimental support would be a simple matter of doing the backend
> assembly code translator, with little to no modifications to the front end
> needed, would that be about right? Particularly if LLVM-RV already adds a
> variable length concept.
As with most compilers you can separate the frontend, middle-end and
backend. Adding SV experimental support would definitely, as you say,
require work in the backend (supporting lowering of IR to machine
instructions) but potentially also middle-end modifications (IR->IR
transformations) to enable the existing vectorisation passes.
> 2. With there being absolutely no new instructions whatsoever (standard
> existing AND FUTURE scalar ops are instead made implicitly parallel), and
> given the deliberate design similarities it seems to me that SV would be a
> good first experimental backend  *ahead* of RVV, for which the 240+ opcodes
> have not yet been finalised. Would people concur?
I'm not convinced it would actually be an easier starting point and I
anticipate quite a lot of work describing these new instruction
semantics and teaching LLVM how to use them.

For clarity, is this something you're proposing to be done directly in
upstream LLVM, or something you're asking for advice on in an (at
least initially) downstream project?
> 3. If there are existing patches, where can they be found?
Robin Kruppe is the main person working on RVV support. I'm not sure
if patches have been made available anywhere at this point.
> 4. From Jeff Bush's Nyuzi work it has been noted that certain 3D operations
> are just far too expensive to do as SIMD or vectors. Multiple FP ARGB to
> 24/32 bit direct overlay with transparency into a tile is therefore for
> example a high priority candidate for adding a special opcode that must
> explicitly be called. Is this relatively easy to do and is there
> documentation explaining how?
Adding a new instruction and making it available through inline
assembly or intrinsics is pretty easy. I did a mini-tutorial on this
at the RISC-V Workshop in Barcelona and really should tidy up and
publish the extended materials I started on this subject.
> It is worth emphasising that this shall not be a private proprietary hard
> fork of llvm, it is an entirely libre effort including the GPGPU (I read
> Alex's lowRISC posts on such private forking practices, a hard fork would be
> just insane and hugely counterproductive), so in particular regard to (4)
> documentation, guidelines and recommendations likely to result in the
> upstreaming process going smoothly also greatly appreciated.
One additional thought: I think RISC-V is somewhat unique in LLVM in
that implementers are free to design and implement custom extensions
without need for prior approval. Many such implementers may wish to
see upstream LLVM support for their extensions. For any open source
project, it's normal to consider factors such as the following when
considering new contributions:
* Potential value to the project (will there be users?)
* Potential cost to the project (what is the maintenance burden? Is
someone stepping up to maintain the addition?)
* Is it stable? i.e. is the design and external interfaces finalised?
If not, is the level of instability compatible with the project's
release process and sufficiently explained in docs etc.

Support for any standard RISC-V Foundation published extensions is
easy to justify. Also for custom extensions with shipping hardware
that is programmable by end-users. Cases such as experimental
non-standard extensions that haven't (yet) shipped in hardware might
require more examination on a case-by-case basis. [Just sharing my
initial thoughts here rather than official LLVM policy.]

Best,

Alex

Luke Kenneth Casson Leighton via llvm-dev

2018-Aug-06 16:47 UTC

[llvm-dev] vectorisation, risc-v

[hi, please do cc me to maintain thread integrity, i am subscribed digest]

On Mon, Aug 6, 2018 at 2:32 PM, Alex Bradbury <asb at lowrisc.org> wrote:
> Hi Luke and welcome to the LLVM community.
 ... oh!  hi alex!  thanks :)
>> There is a total of ZERO new RISCV instructions, the entire design is based
>> around CSRs that implicitly mark the STANDARD registers as "vectorised",
>> [...]
>
> It's worth noting the fact that there are zero new RISC-V instruction
> encodings doesn't mean it's necessarily easier to support vs a
> proposal that introduces new instructions.
 appreciate the insight.
> LLVM would have to be
> taught how to handle this register bank switching / redirection scheme
> and how to minimise the number of switches required. This does have
> the potential to be somewhat intrusive. It reduces work for the MC
> layer (assembler/disassembler), but the code generator would still
> need to understand the semantics of these overloaded instructions.
 indeed.  there's a major difference between SV and RV here, which
stems from the use of the standard register file.  SV's SETVL *has* to
guarantee that when the #immediate is set to e.g. 16, that if srcreg
is >= 16, VL *must* be set to 16.  this to ensure that if it is used
for a single hit (i.e. with no looping), for example in a
context-switch or LD/ST MULTI substitute, that the LD/ST or
context-switch can be achieved in two and only two instructions [minus
CSR setup]: SETVL and the LD/ST/other-op.
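
A minimal Python sketch of the SETVL rule just described, for illustration only. The function name and argument names are hypothetical. The email only states the case where the source register is at or above the immediate; returning the requested value when it is smaller is an assumption here, borrowed from the usual "clamp to maximum" behaviour.

```python
def setvl(requested: int, mvl_imm: int) -> int:
    """Model of SETVL with an immediate maximum vector length.

    requested: value held in the source register (elements wanted).
    mvl_imm:   the immediate, i.e. how many contiguous registers may be used.
    """
    if mvl_imm < 1:
        raise ValueError("immediate must be at least 1")
    if requested < 0:
        raise ValueError("requested length cannot be negative")
    # Stated in the email: requested >= immediate must give VL == immediate.
    # Assumed here: a smaller request is granted as-is.
    return min(requested, mvl_imm)


# Single-hit use (no loop): ask for "everything", get exactly the immediate.
vl_for_save = setvl(requested=64, mvl_imm=16)
```

Because the result is guaranteed to equal the immediate whenever the request is large enough, a following LD/ST can be relied on to cover exactly that many registers without a loop.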

 as a first pass these kinds of... interesting semantics would
probably be a good idea to skip, and instead break the (now-extended)
register file into two groups: x0-x31 (standard regfile) and x32-x63
which would be utilised by an SV-aware MC layer.  the "standard"
x0-x31 regs would be treated as "scalar", the top ones treated as
"RV-like".  a bit like SSE, in other words.  what do you think?

 btw my primary focus here is to do the research into what's the most
practical path for empowering jake to create a Libre 3D GPGPU.
>> 1. I note that the separation between LLVM front and backend looks like
>> adding SV experimental support would be a simple matter of doing the backend
>> assembly code translator, with little to no modifications to the front end
>> needed, would that be about right? Particularly if LLVM-RV already adds a
>> variable length concept.
>
> As with most compilers you can separate the frontend, middle-end and
> backend. Adding SV experimental support would definitely, as you say,
> require work in the backend (supporting lowering of IR to machine
> instructions) but potentially also middle-end modifications (IR->IR
> transformations) to enable the existing vectorisation passes.
 can you elaborate on that at all, or point me in the direction of
some docs?  or is it something that's... generally understood?  if i
can translate what you're saying to concepts that i understand from
prior experience: the people who helped develop pyjs (a
python-to-javascript language *translator*), we wrote a python-to-AST
parser (actually... just took parts of lib2to3, wholesale!), then
added in "AST morphers" which would walk the AST looking for patterns
and actually *modify* the AST in-memory to a more javascript-like
format, and *then* handed it over to the JS outputter.

the biggest of these was, if i recall, the one that massively rewrote
the AST to add proper support for python "yield".

if i understand correctly, the intermediary morphing (IR->IR) is i
*think* the same thing, does that sound about right?  that you have an
IR, but that, for the target ISA which has certain concepts that are
less (or more) efficient, the IR needs rewriting from the
"general-purpose" original down to a more "architecturally-specific"
IR.
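
The "AST morpher" idea can be sketched with Python's standard `ast` module. This is only an analogy: LLVM's IR->IR passes are written in C++ against LLVM's own pass infrastructure, and the rewrite chosen here (turning `x * 2` into `x + x`) is an arbitrary example, not something taken from pyjs or LLVM.

```python
import ast


class MulByTwoToAdd(ast.NodeTransformer):
    """Walk the tree and rewrite `<name> * 2` into `<name> + <name>`."""

    def visit_BinOp(self, node):
        self.generic_visit(node)  # rewrite children first
        if (isinstance(node.op, ast.Mult)
                and isinstance(node.left, ast.Name)
                and isinstance(node.right, ast.Constant)
                and type(node.right.value) is int
                and node.right.value == 2):
            replacement = ast.BinOp(
                left=node.left,
                op=ast.Add(),
                right=ast.Name(id=node.left.id, ctx=ast.Load()),
            )
            return ast.copy_location(replacement, node)
        return node


def run_pass(source):
    """Parse an expression, apply the rewrite in memory, return the new tree."""
    tree = ast.parse(source, mode="eval")
    tree = MulByTwoToAdd().visit(tree)
    return ast.fix_missing_locations(tree)


tree = run_pass("x * 2 + 1")
value = eval(compile(tree, "<rewritten>", "eval"), {"x": 21})
```

The tree goes in and comes out in the same representation, with the same meaning but a different shape, which is the sense in which such a pass is "IR->IR".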

>> 2. With there being absolutely no new instructions whatsoever (standard
>> existing AND FUTURE scalar ops are instead made implicitly parallel), and
>> given the deliberate design similarities it seems to me that SV would be a
>> good first experimental backend  *ahead* of RVV, for which the 240+ opcodes
>> have not yet been finalised. Would people concur?
>
> I'm not convinced it would actually be an easier starting point and I
> anticipate quite a lot of work describing these new instruction
> semantics and teaching LLVM how to use them.
>
> For clarity, is this something you're proposing to be done directly in
> upstream LLVM, or something you're asking for advice on in an (at
> least initially) downstream project?
 i don't know yet: i do know that, ultimately, it'll need to be
upstreamed.  it's just not possible otherwise to have a goal with the
word "libre" attached to it.  if the M-Class SoC takes off, it would
end up in hundreds of millions of ubiquitous devices (primarily and
initially in india) - smartphones, tablets, netbooks and so on - and
if there's even the sniff of a proprietary library or anything that's
*not* fully upstreamed the site hosting it will be deluged with
complaints from libre advocates (due to the proprietary
libraries/applications), and, due to the sheer number of units,
absolutely deluged with downloads.

 from a development perspective, being able to coordinate disparate
groups via upstream repositories would be... a lot easier, but not a
showstopper if they weren't.  but ultimately everything does have to
go upstream.

>> 3. If there are existing patches, where can they be found?
>
> Robin Kruppe is the main person working on RVV support. I'm not sure
> if patches have been made available anywhere at this point.
 ah!  yes, sorry, hi robin, loved the RFC.  do you have anything
available that could be looked over?
>> 4. From Jeff Bush's Nyuzi work it has been noted that certain 3D operations
>> are just far too expensive to do as SIMD or vectors. Multiple FP ARGB to
>> 24/32 bit direct overlay with transparency into a tile is therefore for
>> example a high priority candidate for adding a special opcode that must
>> explicitly be called. Is this relatively easy to do and is there
>> documentation explaining how?
>
> Adding a new instruction and making it available through inline
> assembly or intrinsics is pretty easy.
 ok great, good to hear.
> I did a mini-tutorial on this
> at the RISC-V Workshop in Barcelona and really should tidy up and
> publish the extended materials I started on this subject.
 no rush, here.
>> It is worth emphasising that this shall not be a private proprietary hard
>> fork of llvm, it is an entirely libre effort including the GPGPU (I read
>> Alex's lowRISC posts on such private forking practices, a hard fork would be
>> just insane and hugely counterproductive), so in particular regard to (4)
>> documentation, guidelines and recommendations likely to result in the
>> upstreaming process going smoothly also greatly appreciated.
>
> One additional thought: I think RISC-V is somewhat unique in LLVM in
> that implementers are free to design and implement custom extensions
> without need for prior approval. Many such implementers may wish to
> see upstream LLVM support for their extensions.
 i don't know if you followed the isa-conflict-resolution discussion
(which was itself ironically... full of conflict), i am... well,
there's no easy way to say this, so i'll just say it straight: just as
with gcc, if upstream LLVM accepts such extension support upstream
(which implies that, publicly, that opcode is now permanently and
irrevocably world-wide "taken over" and is *permanently* and
implicitly *exclusively* reserved by that implementor) without them
being able to switch it off, i.e. having something that's exactly or
is orthogonal to the isa-mux proposal (which is exactly like the 386
"segment offset" concept... except for instructions), LLVM-RISC-V will
get into an absolute world of pain.

the very first time that LLVM has to generate (support) two uses of
the exact same binary instruction encoding with two completely
different meanings, that's it: RISC-V will be treated exactly like
Altivec / SSE for PowerPC, i.e. dead.

so please, please, guys, for goodness sake, please: when it comes to
upstreaming non-standard custom extensions that "take over" opcode
space in ways that *can't be switched off*, please for goodness sake
put your foot down and say "no, sorry, you'll have to maintain this
yourself as an unsupported hard fork".

think about it: you let even *one* team make a public declaration "we
effectively own this custom opcode, now", that's it: nobody else,
anywhere in the world, can ever publicly consider using it.  and you
know how few custom opcodes there are [in the 32-bit space].  two.

i can't.... i can't begin to express how absolutely critical this is,
to the entire RISC-V ecosystem.
> For any open source
> project, it's normal to consider factors such as the following when
> considering new contributions:
> * Potential value to the project (will there be users?)
> * Potential cost to the project (what is the maintenance burden? Is
> someone stepping up to maintain the addition?)
> * Is it stable? i.e. is the design and external interfaces finalised?
> If not, is the level of instability compatible with the project's
> release process and sufficiently explained in docs etc.
 appreciated.  well, i can say that the people i've encountered are
extremely competent: Jake for example is amazing.  and i'm used to
dealing with software libre projects.  i'll make sure that the
different groups/contributors minimise impact.

 yes, SV is sufficiently straightforward that it's not going to
significantly change.  i say that, but i did recently reconsider the CSR
element width meanings so that RV32G could transfer 64-bit
ints to 64-bit floats with a single (meaning-overloaded) instruction...

 whilst i can categorically say that there's not going to be major
redesigns (i don't anticipate any being needed, the RV work is the
base foundation and that's had *literally* years of thought put into
it), small changes as issues are encountered are... kiiinda going to
be inevitable, and can realistically only be worked out as and
when an actual implementation gets underway.

> Support for any standard RISC-V Foundation published extensions is
> easy to justify.
 indeed.  for standard extensions, the RV Foundation acts as the
atomic arbiter to ensure and guarantee world-wide exclusive unique
meaning of any given opcode.  that's their role, it's the purpose of
the Certification Mark, and they'll do that job very very well.
> Also for custom extensions with shipping hardware
> that is programmable by end-users.
 ngggh... *sigh* ok much as i'd like to keep this topic on-track, the
following is very important to be aware of.  the custom opcode space
(and the practice of overriding even standard opcodes) is where the
RISC-V Foundation has unfortunately failed to comprehend the nature of
the problem, and has effectively passed the burden of responsibility
for atomically curating the public opcode space over to the FSF (in
the case of gcc) and to the LLVM team (in the case of LLVM).  this um
may be news to you.

 we know for a fact, from many many historic examples, with PowerPC
Altivec/SSE being the most publicly well-known one, that dual or
greater *public* binary ISA encoding conflicts (private ones are not a
problem at all) quite literally kill the entire ISA as it completely
violates the expected contract that any binary will have one and only
one "meaning".

 so if we want RISC-V to not be killed off stone-dead, it's absolutely
critically important that this contract never be violated.

 and.. um... the responsibility for guaranteeing that that be the
case... has been punted.... to *you*, the LLVM team.  that may.. um...
come as an alarming surprise.

 examples which have been *successful* in other architectures - and
not burdensome at all - are those which dynamically set a CSR which
changes big-endian / little-endian meaning of LD/ST instructions.
these have an *exact* orthogonal equivalent system to the
mvendorid-marchid-isamux concept.

> Cases such as experimental
> non-standard extensions that haven't (yet) shipped in hardware might
> require more examination on a case-by-case basis. [Just sharing my
> initial thoughts here rather than official LLVM policy.]
 understood.

 so, a couple of things:

 firstly, the isa-mux proposal basically extends standard opcodes from
32-bit to say 33, 34 or however many extra (hidden) bits are needed to
*literally* change the meaning of 32-bit binary encoding(s). want an
encoding completely switched off and to fall back to standard RVbase
only?  no problem: set muxid=0.  done.   transfer a binary to another
architecture and you want to guarantee that you will get an exception
trap that's world-wide unique and so can be software-emulated with
publicly-available libre libraries?  no problem: the trap will have
access to the current "mux" setting from userspace so will know
*exactly* which (publicly published) custom extension it should
emulate.
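
A toy Python model of the mux idea described above, under stated assumptions: the table contents and names are invented for illustration, the 7-bit major opcode stands in for a full 32-bit encoding, and only `muxid` is used as the key (the proposal itself talks about the full mvendorid-marchid-muxid tuple). Letting standard encodings keep their meaning under a non-zero muxid is also an assumption of this sketch.

```python
STANDARD_MUX = 0  # muxid 0: standard meanings only, as described above


class IllegalInstruction(Exception):
    """Trap that carries the mux setting, so an emulator knows what to emulate."""

    def __init__(self, muxid, opcode):
        super().__init__(f"no meaning for opcode {opcode:#04x} under muxid {muxid}")
        self.muxid = muxid
        self.opcode = opcode


# 0x33 is the RISC-V OP major opcode; 0x0B is the custom-0 major opcode.
STANDARD_OPS = {0x33: "OP (standard integer register-register)"}

# Hypothetical custom extensions that reuse the same custom-0 encoding.
CUSTOM_OPS = {
    (1, 0x0B): "hypothetical extension A",
    (2, 0x0B): "hypothetical extension B",
}


def decode(muxid, opcode):
    """Resolve an encoding to exactly one meaning for the current mux setting."""
    if muxid != STANDARD_MUX and (muxid, opcode) in CUSTOM_OPS:
        return CUSTOM_OPS[(muxid, opcode)]
    if opcode in STANDARD_OPS:
        return STANDARD_OPS[opcode]
    raise IllegalInstruction(muxid, opcode)


meaning_a = decode(1, 0x0B)
meaning_b = decode(2, 0x0B)
```

The same bits decode differently under muxid 1 and muxid 2, and with muxid 0 the custom encoding traps instead of silently taking on either meaning.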

 so the mvendorid-marchid-muxid becomes a globally world-wide unique
tuple for which it is absolutely flat-out impossible to have any kind
of conflict of the Altivec / SSE PowerPC type that killed *public* gcc
/ LLVM PowerPC vector support stone dead.  nobody knew if a given
binary would work on their hardware, because they had no idea if the
binary encoding for an opcode was for Altivec or for its competitor...
so nobody bothered with either.

 following the isamux concept, that situation is impossible to
encounter.  therefore, i will be (and have been) making it absolutely
clear to people that the 3D and SV custom extensions *will* be
run-time dynamically muxable (i.e. absolutely guaranteed to be
world-wide unique).

 secondly, over the long term, there's going to need to be quite a bit
of sustained coordination, world-wide, between completely different
inter-dependent groups.  i've yet to track down someone who can do the
modifications to spike, for example, and they may be someone who
doesn't know (and shouldn't have to know) about LLVM, or even the
Libre 3D GPGPU project.  if however i can point them at an upstream
branch / repo for llvm and say to them "run this test code", things
get a heck of a lot easier (i.e. don't go rapidly out-of-date, follow
and keep up-to-date with standard practices)... you get where i'm
going with that, i'm sure.

ok, that's probably enough for now, apologies i am taking on a new
project today, i may be a little delayed in responding.  this is
however very important so i will be putting some thought into replies.

l.
