thr3ads.net - llvm dev - [llvm-dev] Handling register allocation on Propeller 2 [Feb 2021]

Moony via llvm-dev
2021-Feb-23 20:51 UTC
[llvm-dev] Handling register allocation on Propeller 2

Thank you for the advice. As for the last one, the P2 is neither superscalar
nor OoO, so using the flags as a register for bitwise operations when preparing
to branch or conditionally execute some instruction(s) with them is efficient.

Thanks for the pointers to other backends. I’m likely going to have to take tips
from AMDGPU in several places, as it and the P2 both have very large flat
regfiles, and the P2 has GPU-like instruction skipping functionality available
as a code size optimization utility (unlike AMDGPU, it doesn’t have every core
on one instruction stream, so it’s purely a speed/space tool).

As you mentioned, some ISA details would probably help people out here, so I’ll
try to summarize:
The P2 is a 32-bit in-order-execution microcontroller architecture, with no
caches (SRAM main memory) and 1-16 cores. It does not have atomics, but does
have HW locks. It is a RISC/CISC hybrid (read: I dunno how to classify it, but
instructions are fixed width).
All instructions have conditional execution via a 4-bit predicate, which is used
as a lookup table (LUT) indexed by the C and Z flags.
It is a load/store architecture: no instructions take a memory operand directly
besides the load/store type instructions.
It has no FPU, and uses a CORDIC for integer multiply/divide/sqrt/etc.
Further details can be found at https://parallax.com/propeller-2/ if needed, but
I’m absolutely not expecting anyone to skim through that.
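
To make the predicate description concrete, here is a minimal Python sketch of a
4-bit condition field used as a lookup table indexed by the C and Z flags. The
bit ordering (index = C*2 + Z) and the constant names are assumptions made for
illustration; check the P2 documentation before relying on either.

```python
def predicate_passes(cond, c, z):
    """Return True if the 4-bit condition field lets the instruction run.

    Assumption: the table bit is selected by index (C << 1) | Z.
    """
    if not 0 <= cond <= 0b1111:
        raise ValueError("cond must be a 4-bit value")
    index = ((c & 1) << 1) | (z & 1)
    return bool((cond >> index) & 1)


# Hypothetical names for a few table values under the assumed bit order.
IF_C = 0b1100        # bits 2 and 3 set: passes whenever C == 1
IF_Z = 0b1010        # bits 1 and 3 set: passes whenever Z == 1
IF_C_AND_Z = 0b1000  # only bit 3 set: passes when C == 1 and Z == 1
IF_C_OR_Z = 0b1110   # every bit except index 0 (C == 0, Z == 0)
ALWAYS = 0b1111      # every bit set: passes for any flag state
```

Because any of the 16 table values is legal, every Boolean function of C and Z
can be used as a predicate without a separate compare instruction.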
> On Feb 23, 2021, at 2:06 PM, Jason Eckhardt <jeckhardt at nvidia.com> wrote:
> 
> Just a few very quick pointers which may or may not be of help.
> 
> > First issue: Allocating all of them is a bad idea. Space needs to be
> > left for interrupt handlers, core-local global data, etc. Ideally the
> > compiler would only use, say, 384 of them or less. Even more ideally,
> > the amount a particular function uses would be configurable to permit
> > situations where the developer needs more of the regfile to themself,
> > but I have no idea how to approach that.
> 
>   A purely static way of doing this is to simply define your register
>   classes accordingly (say 384 in an allocatable class, the rest not).
>   A dynamic way is to use MyTargetRegisterInfo::getReservedRegs. For a
>   straightforward example, see the RISCV backend which provides a user
>   option to reserve registers. For a more complicated scheme, see the
>   AMDGPU backend which trades occupancy vs registers.
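
As a rough illustration of the "dynamic" approach, the Python sketch below
models a reserved-register set derived from a configurable budget. It is not
LLVM code: the function names are hypothetical, and it assumes the compiler
takes the lowest-numbered registers and leaves everything above the budget
alone. The 512/496/384 figures come from the thread.

```python
TOTAL_REGS = 512         # total registers, per the thread
ALLOCATABLE_LIMIT = 496  # upper bound on allocatable registers, per the thread


def reserved_regs(compiler_budget=384):
    """Return the set of register indices the allocator must not use.

    Assumption: the compiler owns registers [0, compiler_budget) and the
    rest is left for interrupt handlers, core-local data, and so on.
    """
    if not 0 <= compiler_budget <= ALLOCATABLE_LIMIT:
        raise ValueError("budget must be between 0 and 496")
    return set(range(compiler_budget, TOTAL_REGS))


def allocatable_regs(compiler_budget=384):
    """Return the register indices left over for allocation, in order."""
    reserved = reserved_regs(compiler_budget)
    return [r for r in range(TOTAL_REGS) if r not in reserved]
```

In a real backend the budget would come from a target option or function
attribute, and the reserved set would be returned as a bit vector from
getReservedRegs, as the RISCV example mentioned above does.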
> 
> > Second issue: When dealing with larger objects in the regfile, it is
> > strongly advisable to keep them contiguous. It’s possible to
> > bulk-save/bulk-load any group of contiguous registers in two
> > instructions, at a rate of one saved per cycle. What’d be the best way
> > to utilize this? This also impacts, say, loading small arrays and
> > structs into the regfile.
> 
>   The question is a bit vague or too general; you might ask more detailed
>   questions in a separate thread. That said, as far as mechanical issues
>   such as just representing such "load multiple" instructions of the ISA,
>   see the ARM or SystemZ backends for examples (the former also performs
>   some memcpys with ld/st-multiple).
>   If this is about a more general question of how to "best" assign
>   aggregates/objects to the register file, that can have many dimensions.
>   One could analyze all the objects as a whole and choose some for
>   inclusion and others not through some optimization criteria; there is a
>   large body of research on this problem (and it isn't LLVM specific).
>   Concretely, you might take a look at the AMDGPUPromoteAlloca pass as
>   well as the StackColoring pass.
> 
> > Fourth issue: The P2 has two flag regs, C and Z. All instructions that
> > write them have the ability to control, individually, if the flag is
> > written. Alongside this, Boolean operations and moves with the flags,
> > and between the flags and any bit of a register, are all single
> > instruction (and cheap-as-a-move) operations. What’d be a good way to
> > take advantage of this?
> 
>   Without more information about the ISA, it is hard to say much. The
>   feature you describe is similar to the "recording" PowerPC instructions
>   where appending (or not) a "." to such instructions records (or not)
>   certain status bits. Generally, setting flags like these can serialize
>   instructions during scheduling for processors where that is important,
>   so it is often best not to set them unless needed (e.g., for branching).
>   Whether that matters in your case is unknown.
>   
> From: llvm-dev <llvm-dev-bounces at lists.llvm.org> on behalf of Moony via llvm-dev <llvm-dev at lists.llvm.org>
> Sent: Tuesday, February 23, 2021 10:52 AM
> To: llvm-dev at lists.llvm.org <llvm-dev at lists.llvm.org>
> Subject: [llvm-dev] Handling register allocation on Propeller 2
>  
> This is a complex situation, so I’m opting to ask the people in this list
> for assistance, as I’m still new to LLVM’s codebase.
> 
> The Parallax Propeller 2 (henceforth P2) has 496 allocatable 32-bit
> registers, 512 total.
> There are also two special registers, PTRA and PTRB, that have special
> semantics when used with memory reads/writes to permit
> incrementing/decrementing them in place and adding an index value to them.
> PTRA will likely be the stack register, but PTRB will likely be free for
> allocation.
> 
> First issue: Allocating all of them is a bad idea. Space needs to be left
> for interrupt handlers, core-local global data, etc. Ideally the compiler
> would only use, say, 384 of them or less. Even more ideally, the amount a
> particular function uses would be configurable to permit situations where
> the developer needs more of the regfile to themself, but I have no idea how
> to approach that.
> 
> Second issue: When dealing with larger objects in the regfile, it is
> strongly advisable to keep them contiguous. It’s possible to
> bulk-save/bulk-load any group of contiguous registers in two instructions,
> at a rate of one saved per cycle. What’d be the best way to utilize this?
> This also impacts, say, loading small arrays and structs into the regfile.
> 
> Third issue: The regfile can be indexed indirectly cheaply, and
> instructions exist to load individual aligned nibbles, bytes, and words
> from a reg into another reg (even indirectly, so array access works).
> Memory reads/writes are slow individually (9-26 cycles and 3-20 cycles
> respectively), so ideally this’ll be taken advantage of somehow. This would
> permit rapidly loading a small array or similar into the regfile and
> indexing it from there, which, if the array is used multiple times, would
> almost always be faster if it’s small, as the bulk read would take roughly
> 9-25 + read_amount cycles.
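
The trade-off in the third issue can be checked with some quick arithmetic. The
Python sketch below uses the cycle ranges quoted above (9-26 cycles per
individual read, roughly 9-25 + read_amount cycles for a bulk read). The cost
of an indexed register-file access is not given in the thread, so the default
of 2 cycles is purely an assumption, as are the function names.

```python
def memory_cycles(accesses, read_cycles):
    """Cycles spent reading each element from memory individually."""
    return accesses * read_cycles


def regfile_cycles(array_len, accesses, bulk_overhead, reg_access_cycles=2):
    """Cycles spent bulk-loading the array once, then indexing the regfile.

    reg_access_cycles is an assumed figure, not taken from the thread.
    """
    return bulk_overhead + array_len + accesses * reg_access_cycles


def break_even_accesses(array_len, read_cycles, bulk_overhead,
                        reg_access_cycles=2):
    """Smallest access count at which the regfile copy is strictly cheaper.

    Returns None if a memory read is no slower than a regfile access.
    """
    saving = read_cycles - reg_access_cycles
    if saving <= 0:
        return None
    setup = bulk_overhead + array_len
    # Need accesses * saving > setup, so take floor(setup / saving) + 1.
    return setup // saving + 1
```

Under these assumptions, even the least favourable pairing (fastest individual
reads at 9 cycles, slowest bulk overhead at 25 cycles) lets a 16-element array
pay for itself after a handful of accesses:
`break_even_accesses(16, read_cycles=9, bulk_overhead=25)` evaluates to 6.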
> 
> As far as I can tell, this isn’t an easy thing to take quick advantage of,
> as the regfile can be treated as a bank of fast core-local memory (i.e. a
> zero page), and LLVM doesn’t seem immediately happy with this idea.
> 
> Fourth issue: The P2 has two flag regs, C and Z. All instructions that
> write them have the ability to control, individually, if the flag is
> written. Alongside this, Boolean operations and moves with the flags, and
> between the flags and any bit of a register, are all single instruction
> (and cheap-as-a-move) operations. What’d be a good way to take advantage
> of this?
> 
> The P2 as of now has no standard C calling convention (nor any calling
> convention suitable for that), so I’m also stuck trying to define a calling
> convention for this architecture. Any help with that would be appreciated
> as well, because I’m not familiar with the requirements nor general advice.
> 
> Sorry if this is a bit much to ask, any help and/or advice is appreciated.
> —Braden N.