Fāng-ruì Sòng via llvm-dev
2021-Nov-20 08:26 UTC
[llvm-dev] [RFC] Asynchronous unwind tables attribute
On Wed, Nov 17, 2021 at 3:19 AM Momchil Velikov via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>
> On one hand, we have the `uwtable` attribute in LLVM IR, which tells
> whether to emit CFI directives. On the other hand, we have the `clang
> -cc1` command-line option `-funwind-tables=1|2` and the codegen
> option `VALUE_CODEGENOPT(UnwindTables, 2, 0) ///< Unwind tables (1) or
> asynchronous unwind tables (2)`.
> Thus we lose along the way the information whether we want just some
> unwind tables or asynchronous unwind tables.

Thanks for starting this topic. I am very interested in it and would
like to see CFI improved. I have looked into
-funwind-tables/-fasynchronous-unwind-tables and made some relatively
simple changes (default to -fasynchronous-unwind-tables for aarch64/ppc,
fix -f(no-)unwind-tables/-f(no-)asynchronous-unwind-tables, make
-fno-asynchronous-unwind-tables work with instrumentation, add
`-funwind-tables=1|2`), but I haven't done anything at the IR level.
It's good to see someone pick up the heavy lifting so that I don't need
to do it :) That said, if you need a reviewer or help with some work
items, feel free to offload some to me.

> Asynchronous unwind tables take more space in the runtime image, I'd
> estimate something like 80-90% more, as the difference is adding
> roughly the same number of CFI directives as for prologues, only a bit
> simpler (e.g. `.cfi_offset reg, off` vs. `.cfi_restore reg`). Or even
> more, if you consider tail duplication of epilogue blocks.
> Asynchronous unwind tables could also restrict code generation to
> having only a finite number of frame pointer adjustments (an example
> of *not* having a finite number of `SP` adjustments is on AArch64 when
> untagging the stack (MTE) in some cases the compiler can modify `SP`
> in a loop).

The restriction on MTE is new to me as I don't know much about MTE yet.

>
> Having the CFI precise up to an instruction generally also means one
> cannot bundle together CFI instructions once the prologue is done,
> they need to be interspersed with ordinary instructions, which means
> extra `DW_CFA_advance_loc` commands, further increasing the unwind
> tables size.
>
> That is to say, async unwind tables impose a non-negligible overhead,
> yet for the most common use cases (like C++ exceptions), they are not
> even needed.
>
> We could, for example, extend the `uwtable` attribute with an optional
> value, e.g.
> - `uwtable` (default to 2)
> - `uwtable(1)`, sync unwind tables
> - `uwtable(2)`, async unwind tables
> - `uwtable(3)`, async unwind tables, but tracking only a subset of
> registers (e.g. CFA and return address)
>
> Or add a new attribute `async_uwtable`.
>
> Other suggestions? Comments?

I have thought about extending uwtable as well. In spirit the idea
looks great to me.
The mode removing most callee-saved registers is useful.
For example, I think linux-perf just uses pc/sp/fp (which is how its ORC
unwinder is designed).

My slight concern with uwtable(3) is that the amount of unwind
information is not monotonic.
Since sync->async and the number of registers are two dimensions,
perhaps we should use two function attributes?

>
> ~chill

BTW, are you working on improving the general CFI problems for aarch64?
I tried to understand the implementation limitation in September (in
https://reviews.llvm.org/D109253) but then stopped.
If you have patches, I'll be happy to study them :)

I know there are quite a few problems, such as:

(a) .cfi_* directives in the prologue are less precise

% cat a.c
void foo() {
  asm("" ::: "x23", "x24", "x25");
}
% clang --target=aarch64-linux-gnu a.c -S -o -
...
foo:                                    // @foo
        .cfi_startproc
// %bb.0:                               // %entry
        str     x25, [sp, #-32]!        // 8-byte Folded Spill
        stp     x24, x23, [sp, #16]     // 16-byte Folded Spill
        .cfi_def_cfa_offset 32  ////// should be immediately after the pre-increment str
        .cfi_offset w23, -8
        .cfi_offset w24, -16
        .cfi_offset w25, -32
        //APP
        //NO_APP

(b) .cfi_* directives (for MachineInstr::FrameDestroy) in the epilogue
are generally missing
(c) A basic block following an exit block may have wrong CFI
information (this can be fixed with .cfi_restore)

Most problems apply to all non-x86 targets.

---

Since we are discussing asynchronous unwind tables, may I ask two
slightly off-topic things?

(1) What's your opinion on ld --no-ld-generated-unwind-info?
Mine is https://maskray.me/blog/2020-11-15-explain-gnu-linker-options#no-ld-generated-unwind-info

(2) How should future stack unwinding strategy evolve?
A hardware-assisted approach, like leveraging the shadow call stack?
Making FP more efficient so that user code can leverage
-fno-omit-frame-pointer -mno-omit-leaf-frame-pointer and drop the
inefficient (in both size and run-time performance) .eh_frame?

Last year I wrote a post https://maskray.me/blog/2020-11-08-stack-unwinding
as I was learning stack unwinding. I am going to amend it to include my
recent thoughts.
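
To illustrate problem (a) above, here is a toy model (not LLVM code) of why the placement of `.cfi_def_cfa_offset` matters for asynchronous unwinding. It walks a stream of instructions and CFI directives and reports every instruction that begins executing while the described CFA offset differs from the real one. The function name `stale_cfa_points`, the tuple layout, and the assumption that the CFA is `SP + 0` on entry are illustrative choices for this sketch only.

```python
def stale_cfa_points(stream):
    """Return the instructions that start executing while the CFA offset
    described by the CFI directives disagrees with the real SP-to-CFA
    distance. Each stream item is (kind, text, value):
      ("insn", text, bytes_subtracted_from_sp)
      ("cfi",  text, new_cfa_offset)
    """
    actual = 0      # bytes SP has really moved down since function entry
    described = 0   # CFA offset currently stated by the unwind info
    stale = []
    for kind, text, value in stream:
        if kind == "cfi":
            described = value
        else:
            if described != actual:
                stale.append(text)
            actual += value
    return stale


# Order as in the compiler output quoted above: the directive comes
# after both stores.
emitted = [
    ("insn", "str x25, [sp, #-32]!", 32),
    ("insn", "stp x24, x23, [sp, #16]", 0),
    ("cfi", ".cfi_def_cfa_offset 32", 32),
]

# Order suggested in the annotation: directly after the pre-increment str.
precise = [
    ("insn", "str x25, [sp, #-32]!", 32),
    ("cfi", ".cfi_def_cfa_offset 32", 32),
    ("insn", "stp x24, x23, [sp, #16]", 0),
]

print(stale_cfa_points(emitted))
print(stale_cfa_points(precise))
```

In this model an unwinder interrupted at the `stp` in the emitted order would compute the wrong CFA, which is harmless for synchronous (call-site) unwinding but not for asynchronous unwinding.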
Momchil Velikov via llvm-dev
2021-Nov-20 15:56 UTC
[llvm-dev] [RFC] Asynchronous unwind tables attribute
On Sat, 20 Nov 2021 at 08:26, Fāng-ruì Sòng <maskray at google.com> wrote:
> > Asynchronous unwind tables could also restrict code generation to
> > having only a finite number of frame pointer adjustments (an example
> > of *not* having a finite number of `SP` adjustments is on AArch64 when
> > untagging the stack (MTE) in some cases the compiler can modify `SP`
> > in a loop).
>
> The restriction on MTE is new to me as I don't know much about MTE yet.

It has nothing to do with MTE per se, I just noticed it in an MTE test
(`llvm/test/CodeGen/AArch64/settag.ll:stg_alloca17()`). I've got a patch
for this, that just uses an extra scratch register (that's in the
epilogue before popping CSRs and we have plenty of registers) and a
constant number (usually one) of SP adjustments by a constant.

> > We could, for example, extend the `uwtable` attribute with an optional
> > value, e.g.
> > - `uwtable` (default to 2)
> > - `uwtable(1)`, sync unwind tables
> > - `uwtable(2)`, async unwind tables
> > - `uwtable(3)`, async unwind tables, but tracking only a subset of
> > registers (e.g. CFA and return address)
> >
> > Or add a new attribute `async_uwtable`.
> >
> > Other suggestions? Comments?
>
> I have thought about extending uwtable as well. In spirit the idea
> looks great to me.
> The mode removing most callee-saved registers is useful.
> For example, I think linux-perf just uses pc/sp/fp (as how its ORC
> unwinder is designed).
>
> My slight concern with uwtable(3) is that the amount of unwind
> information is not monotonic.
> Since sync->async and the number of registers are two dimensions,
> perhaps we should use two function attributes?

I reckon this matters when combining (for whatever reasons) multiple
`uwtable` attributes? Indeed, in my first version, I dropped the
encoding 3 and then I was able to synthesize the attribute for an
outlined function by simply taking the max of the attribute in the
outlined-from functions - it was just simpler.
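
A minimal sketch of the "take the max" merge described above, assuming a monotonic encoding (0 = none, 1 = sync, 2 = async) with encoding 3 dropped. The names `UWTableKind` and `merge_uwtable` are hypothetical and used only for this illustration; this is not the code from the patch.

```python
from enum import IntEnum


class UWTableKind(IntEnum):
    """Hypothetical monotonic encoding: a larger value means more unwind info."""
    NONE = 0
    SYNC = 1
    ASYNC = 2


def merge_uwtable(kinds):
    """Pick the attribute for an outlined function from the kinds of the
    functions it was outlined from. With a monotonic encoding, the max
    satisfies every contributor."""
    return max(kinds, default=UWTableKind.NONE)


outlined = merge_uwtable([UWTableKind.SYNC, UWTableKind.ASYNC, UWTableKind.SYNC])
print(outlined.name)
```

With the originally proposed encoding (2 = full async, 3 = async with a register subset), the same `max` would pick 3 when merging 2 and 3 and so drop register information that one of the functions asked for, which is the non-monotonicity concern raised in the quoted text.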
How about instead we exchange the meaning of 2 and 3 so we get
- 1, sync unwind tables
- 2, "minimal" async unwind tables
- 3, full async unwind tables

Then on the principle that we should always emit CFI information that
the `uwtable` requested (as it may be an ABI mandate), possibly
optimised, depending on the `nounwind` attribute, we would get:

          | nounwind 0           | nounwind 1
----------+----------------------+--------------
uwtable 0 | sync, full           | no CFI
----------+----------------------+--------------
uwtable 1 | sync, full           | sync, full
----------+----------------------+--------------
uwtable 2 | async, full prologue,|
          | minimal epilogue     | async, min
----------+----------------------+--------------
uwtable 3 | async, full          | async, full

as a starting point, and then backends may choose any of the entries in
the following rows of the same column, as a QOI decision.

All that said, I'm not even entirely convinced we need it as a separate
`uwtable` option. The decision to skip some of the CFI instructions can
be made during final object encoding. It probably has to be made during
the final encoding, e.g. there is no point in including epilogue CFI
instructions in `.eh_frame`, and an ORC generator would naturally ignore
most CFI instructions anyway.

> BTW, are you working on improving the general CFI problems for aarch64?

Yeah, I'm implementing support for `-fasynchronous-unwind-tables`.
A slightly outdated series of patches starts from
https://reviews.llvm.org/D112330

The full list I have right now is:

* [AArch64] Async unwind - Fix MTE codegen emitting frame adjustments in a loop
  - this fixes the issue described above
* [AArch64] Async unwind - Adjust unwind info in AArch64LoadStoreOptimizer
  - this fixes some case(s) where load/store optimiser moves an SP
    inc/dec after the matching CFI instruction
* [CodeGen] Async unwind - add a pass to fix CFI information
  - this is a pass that inserts `.cfi_remember_state`/`.cfi_restore_state`,
    ideally should work for all targets and replace `CFIInstrInserter`
* [AArch64] Async unwind - function epilogues
* [AArch64] Async unwind - function prologues
  - these are the core functionality
* [AArch64] Async unwind - Refactor generation of shadow call stack prologue/epilogue
* [AArch64] Async unwind - Always place the first LDP at the end when ReverseCSRRestoreSeq is true
* [AArch64] Async unwind - helper functions to decide on CFI emission
  - the three above: preparation/refactoring/simplification,
    `emitEpilogue` especially is a big mess
* [AArch64] Async unwind - do not schedule frame setup/destroy
* Extend the `uwtable` attribute with unwind table kind

(I have been meaning to update it for a few days now, but something
else always pops up ...)

> Since we are discussing asynchronous unwind tables, may I ask two
> slightly off-topic things?
>
> (1) What's your opinion on ld --no-ld-generated-unwind-info?

I would say, from a design point of view, an unwinder of any kind
should not analyse and interpret machine instructions, as it's, in the
general case, fragile. That's been my experience from developing and
maintaining an unwinder that analysed prologues/epilogues over a period
of 10+ years: each new compiler version required adjustments.

Then, PLT entries are likely to be a special case as they are both tiny
and extremely unlikely to change between different compilers or
different compiler versions.
In a sense, one can treat them as having implicit identical unwind
table entries (of any unwind table kind) associated with their address
range, therefore explicit entries in the regular unwind tables are
superfluous.

> (2) How should future stack unwinding strategy evolve?

Well, that's a good question ... :D

~chill

--
Compiler scrub, Arm
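
For reference, the `uwtable`/`nounwind` table proposed earlier in this message can be read as a lookup plus the "any following row of the same column" rule. The sketch below only restates that table; the names `TABLE`, `baseline`, and `backend_choices` are hypothetical, and the encoding assumed is the swapped one (2 = "minimal" async, 3 = full async).

```python
# Keys are (uwtable value, nounwind flag); values are the table entries.
TABLE = {
    (0, False): "sync, full",
    (0, True): "no CFI",
    (1, False): "sync, full",
    (1, True): "sync, full",
    (2, False): "async, full prologue, minimal epilogue",
    (2, True): "async, min",
    (3, False): "async, full",
    (3, True): "async, full",
}


def baseline(uwtable, nounwind):
    """The starting-point entry for a function with these attributes."""
    return TABLE[(uwtable, nounwind)]


def backend_choices(uwtable, nounwind):
    """Entries a backend may pick as a QOI decision: the baseline entry or
    any entry in a following row of the same column (duplicates dropped)."""
    choices = []
    for row in range(uwtable, 4):
        entry = TABLE[(row, nounwind)]
        if entry not in choices:
            choices.append(entry)
    return choices


print(baseline(0, True))
print(backend_choices(1, True))
```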