Moony via llvm-dev
2021-Feb-23 20:51 UTC
[llvm-dev] Handling register allocation on Propeller 2
Thank you for the advice. As for the last one, the P2 is neither superscalar nor OoO, so using the flags as a register for bitwise operations when preparing to branch or conditionally execute some instruction(s) with them is efficient.

Thanks for the pointers to other backends. I’m likely going to have to take tips from AMDGPU in several places, as it and the P2 both have very large flat regfiles, and the P2 has GPU-like instruction skipping functionality available as a code size optimization utility (unlike AMDGPU, it doesn’t have every core on one instruction stream, so it’s purely a speed/space tool).

As you mentioned, some ISA details would probably help people out here, so I’ll try and summarize:

The P2 is a 32-bit in-order-execution microcontroller architecture, with no caches (SRAM main memory) and 1-16 cores. It does not have atomics, but does have HW locks. It is a RISC/CISC hybrid (read: I dunno how to classify it, but instructions are fixed width). All instructions have conditional execution via a 4-bit predicate (which is used as a LUT, with the C and Z flags as the index). It is a load/store architecture and does not have any instructions that read memory directly as an argument besides load/store type instructions. It has no FPU, and uses a CORDIC for integer multiply/divide/sqrt/etc.

Further details can be found at https://parallax.com/propeller-2/ if needed, but I’m absolutely not expecting anyone to skim through that.

> On Feb 23, 2021, at 2:06 PM, Jason Eckhardt <jeckhardt at nvidia.com> wrote:
>
> Just a few very quick pointers which may or may not be of help.
>
> > First issue: Allocating all of them is a bad idea. Space needs to be left for interrupt handlers, core-local global data, etc. Ideally the compiler would only use, say, 384 of them or less. Even more ideally, the amount a particular function uses would be configurable to permit situations where the developer needs more of the regfile to themself, but I have no idea how to approach that.
>
> A purely static way of doing this is to simply define your register classes accordingly (say 384 in an allocatable class, the rest not).
> A dynamic way is to use MyTargetRegisterInfo::getReservedRegs. For a straightforward example, see the RISCV backend, which provides a user option to reserve registers. For a more complicated scheme, see the AMDGPU backend, which trades occupancy vs registers.
>
> > Second issue: When dealing with larger objects in the regfile, it is strongly advisable to keep them contiguous. It’s possible to bulk-save/bulk-load any group of contiguous registers in two instructions, at a rate of one saved per cycle. What’d be the best way to utilize this? As this also impacts, say, loading small arrays and structs into the regfile.
>
> The question is a bit vague or too general; you might ask more detailed questions in a separate thread. That said, as far as mechanical issues such as just representing such "load multiple" instructions of the ISA, see the ARM or SystemZ backends for examples (the former also performs some memcpys with ld/st-multiple).
> If this is about a more general question of how to "best" assign aggregates/objects to the register file, that can have many dimensions. One could analyze all the objects as a whole and choose some for inclusion and others not through some optimization criteria; there is a large body of research on this problem (and it isn't LLVM specific). Concretely, you might take a look at the AMDGPUPromoteAlloca pass as well as the StackColoring pass.
>
> > Fourth issue: The P2 has two flag regs, C and Z. All instructions that write them have the ability to control, individually, if the flag is written. Alongside this, Boolean operations and moves with the flags, and between the flags and any bit of a register, are all single instruction (and cheap-as-a-move) operations. What’d be a good way to take advantage of this?
>
> Without more information about the ISA, it is hard to say much.
> The feature you describe is similar to the "recording" PowerPC instructions, where appending (or not) a "." to such instructions records (or not) certain status bits. Generally, setting flags like these can serialize instructions during scheduling for processors where that is important, so it is often best not to set them unless needed (e.g., for branching). Whether that matters in your case is unknown.
>
> From: llvm-dev <llvm-dev-bounces at lists.llvm.org> on behalf of Moony via llvm-dev <llvm-dev at lists.llvm.org>
> Sent: Tuesday, February 23, 2021 10:52 AM
> To: llvm-dev at lists.llvm.org <llvm-dev at lists.llvm.org>
> Subject: [llvm-dev] Handling register allocation on Propeller 2
>
> This is a complex situation, so I’m opting to ask the people in this list for assistance, as I’m still new to LLVM’s codebase.
>
> The Parallax Propeller 2 (henceforth P2) has 496 allocatable 32-bit registers, 512 total.
> There are also two special registers, PTRA and PTRB, that have special semantics when used with memory reads/writes to permit incrementing/decrementing them in place and adding an index value to them. PTRA will likely be the stack register, but PTRB will likely be free for allocation.
>
> First issue: Allocating all of them is a bad idea. Space needs to be left for interrupt handlers, core-local global data, etc. Ideally the compiler would only use, say, 384 of them or less. Even more ideally, the amount a particular function uses would be configurable to permit situations where the developer needs more of the regfile to themself, but I have no idea how to approach that.
>
> Second issue: When dealing with larger objects in the regfile, it is strongly advisable to keep them contiguous. It’s possible to bulk-save/bulk-load any group of contiguous registers in two instructions, at a rate of one saved per cycle. What’d be the best way to utilize this?
> As this also impacts, say, loading small arrays and structs into the regfile.
>
> Third issue: The regfile can be indexed indirectly for cheap, and instructions exist to load individual aligned nibbles, bytes, and words from a reg into another reg (even indirectly, so array access works). Memory reads/writes are slow individually (9-26 cycles and 3-20 cycles respectively), so ideally this’ll be taken advantage of somehow. This would permit rapidly loading a small array or similar into the regfile and indexing it from there, which, if the array is used multiple times, would almost always be faster if it’s small, as the bulk read would take roughly 9-25 + read_amount cycles.
>
> As far as I can tell, this isn’t an easy thing to take quick advantage of, as the regfile can be treated as a bank of fast core-local memory (i.e. a zero page), and LLVM doesn’t seem immediately happy with this idea.
>
> Fourth issue: The P2 has two flag regs, C and Z. All instructions that write them have the ability to control, individually, if the flag is written. Alongside this, Boolean operations and moves with the flags, and between the flags and any bit of a register, are all single instruction (and cheap-as-a-move) operations. What’d be a good way to take advantage of this?
>
> The P2 as of now has no standard C calling convention (nor any calling convention suitable for that), so I’m also stuck trying to define a calling convention for this architecture. Any help with that would be appreciated as well, because I’m not familiar with the requirements nor general advice.
>
> Sorry if this is a bit much to ask; any help and/or advice is appreciated.
> -- Braden N.
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
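
To make the third issue's cycle estimate concrete, here is a minimal break-even sketch. It uses only the figures quoted in the message (9-26 cycles per individual read, roughly 9-25 + read_amount cycles for a bulk read). The function names and the `reg_access_cycles` default of 2 are hypothetical placeholders, not P2 specifications, and the model covers read-only data (no write-back cost).

```python
def memory_cost(accesses, read_cycles):
    """Cycles to read each element from main memory, one access at a time."""
    return accesses * read_cycles


def regfile_cost(words, accesses, bulk_setup_cycles, reg_access_cycles):
    """Cycles to bulk-load `words` registers once, then index them `accesses` times."""
    return bulk_setup_cycles + words + accesses * reg_access_cycles


def worth_promoting(words, accesses, read_cycles=9, bulk_setup_cycles=25,
                    reg_access_cycles=2):
    """Conservative check: best-case memory read (9) vs worst-case bulk setup (25).

    reg_access_cycles=2 is an assumed placeholder for one indirect
    register access; substitute the real figure for the target.
    """
    promoted = regfile_cost(words, accesses, bulk_setup_cycles, reg_access_cycles)
    direct = memory_cost(accesses, read_cycles)
    return promoted < direct


if __name__ == "__main__":
    # An 8-word array read once: 25 + 8 + 2 = 35 cycles vs 9, not worth it.
    print(worth_promoting(words=8, accesses=1))   # False
    # The same array read 8 times: 25 + 8 + 16 = 49 cycles vs 72, worth it.
    print(worth_promoting(words=8, accesses=8))   # True
```

Under these assumptions even the pessimistic case breaks even after a handful of accesses, which matches the message's claim that promotion pays off when a small array is used multiple times.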
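
The reply at the top of this message describes conditional execution as a 4-bit predicate used as a LUT indexed by the C and Z flags. A minimal sketch of that idea follows. The bit ordering (C as the high index bit, Z as the low one) and both function names are assumptions made for illustration, so check the P2 documentation for the actual encoding.

```python
def predicate_allows(pred, c, z):
    """Return True if the 4-bit predicate `pred` permits execution for flags c, z."""
    if not 0 <= pred <= 0b1111:
        raise ValueError("predicate must fit in 4 bits")
    index = (int(c) << 1) | int(z)  # assumed ordering: C is the high bit
    return bool((pred >> index) & 1)


def predicate_from(fn):
    """Build the 4-bit table for any Boolean function fn(c, z)."""
    pred = 0
    for c in (False, True):
        for z in (False, True):
            if fn(c, z):
                pred |= 1 << ((int(c) << 1) | int(z))
    return pred


if __name__ == "__main__":
    if_c = predicate_from(lambda c, z: c)            # execute when C is set
    if_c_ne_z = predicate_from(lambda c, z: c != z)  # execute when C differs from Z
    print(format(if_c, "04b"), format(if_c_ne_z, "04b"))
    print(predicate_allows(if_c, c=True, z=False))
```

Because the table has one bit per (C, Z) combination, any Boolean function of the two flags fits in a single predicate. This is why a compiler could fold flag logic into the predicate instead of emitting separate instructions to combine the flags.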