thr3ads.net - llvm dev - [llvm-dev] [RFC] Asynchronous unwind tables attribute [Nov 2021]

If this information is useful, please help other people find it:
Share via:

Momchil Velikov via llvm-dev

2021-Nov-20 15:56 UTC

[llvm-dev] [RFC] Asynchronous unwind tables attribute

On Sat, 20 Nov 2021 at 08:26, Fāng-ruì Sòng <maskray at google.com>
wrote:> > Asynchronous unwind tables could also restrict code generation to
> > having only a finite number of frame pointer adjustments (an example
> > of *not* having a finite number of `SP` adjustments is on AArch64 when
> > untagging the stack (MTE) in some cases the compiler can modify `SP`
> > in a loop).
>
> The restriction on MTE is new to me as I don't know much about MTE yet.
It has nothing to do with MTE per se, I just noticed it in an MTE test
(`llvm/test/CodeGen/AArch64/settag.ll:stg_alloca17()`).
I've got a patch for this, that just uses an extra scratch register
(that's
in
the epilogue before popping CSRs and we have plenty of registers) and a
constant
number (usually one) of SP adjustments by a constant.
> > We could, for example, extend the `uwtable` attribute with an optional
> > value, e.g.
> >   -  `uwtable` (default to 2)
> >   -  `uwtable(1)`, sync unwind tables
> >   -  `uwtable(2)`, async unwind tables
> >   -  `uwtable(3)`, async unwind tables, but tracking only a subset of
> > registers (e.g. CFA and return address)
> >
> > Or add a new attribute `async_uwtable`.
> >
> > Other suggestions? Comments?
>
> I have thought about extending uwtable as well. In spirit the idea
> looks great to me.
> The mode removing most callee-saved registers is useful.
> For example, I think linux-perf just uses pc/sp/fp (as how its ORC
> unwinder is designed).
>
> My slight concern with uwtable(3) is that the amount of unwind
> information is not monotonic.
> Since sync->async and the number of registers are two dimensions,
> perhaps we should use two function attributes?
I reckon this matters when combining (for whatever reasons) multiple
`uwtable` attributes?
Indeed, in my first version, I dropped the encoding 3 and then I was able
to synthesize
the attribute for an outlined function by simply taking the max of the
attribute
in the outlined-from functions - it was just simpler.

How about instead we exchange the meaning of 2 and 3 so we get
  - 1, sync unwind tables
  - 2, "minimal" async unwind tables
  - 3, full async unwind tables
Then on the principle that we should always emit CFI information that the
`uwtable` requested
(as it may be an ABI mandate), possibly optimised, depending on the
`nounwind` attribute, we would get:

          | nounwind 0           |  nounwind 1
----------+----------------------+--------------
uwtable 0 | sync, full           |  no CFI
----------+----------------------+--------------
uwtable 1 | sync, full           |  sync, full
----------+----------------------+--------------
uwtable 2 | async, full prologue,|
          | mininal epilogue     |  async, min
----------+----------------------+--------------
uwtable 3 | async, full          |  async, full

as a starting point, and then backends may choose any of the entries
in the following rows of the same column, as a QOI decision.

All that said, I'm not even entirely convinced we need it as a separate
`uwtable` option.
The decision to skip some of the CFI instructions can be made during final
object encoding.
It probably has to be made during the final encoding, e.g. no point
including epilogue CFI instructions
in `.eh_frame`, or an ORC generator would naturally ignore most CFI
instructions anyway.
> BTW, are you working on improving the general CFI problems for aarch64?Yeah, I'm implementing support for `-fasynchronous-unwind-tables`. A
slightly outdated
series of patches start from https://reviews.llvm.org/D112330

The full list I have right now is:
* [AArch64] Async unwind - Fix MTE codegen emitting frame adjustments in a
loop
  - this fixes the issue described above
* [AArch64] Async unwind - Adjust unwind info in AArch64LoadStoreOptimizer
  - this fixes some case(s) where load/store optimiser moves an SP inc/dec
after the matching CFI instruction
* [CodeGen] Async unwind - add a pass to fix CFI information
  - this is a pass that inserts `.cfi_remember_state`/`.cfi_restore_state`,
ideally should work
    for all targets and replace `CFIInstrInserter`
* [AArch64] Async unwind - function epilogues
* [AArch64] Async unwind - function prologues
  - these are the core functionality
* [AArch64] Async unwind - Refactor generation of shadow call stack
prologue/epilogue
* [AArch64] Async unwind - Always place the first LDP at the end when
ReverseCSRRestoreSeq is true
* [AArch64] Async unwind - helper functions to decide on CFI emission
  - the three above: preparation/refactoring/simplification, `emitEpilogue`
especially is a big mess
* [AArch64] Async unwind - do not schedule frame setup/destroy
* Extend the `uwtable` attribute with unwind table kind
(I was meaning to update it for a few days now, only always something else
pops up ...)
> Since we are discussing asynchronous unwind tables, may I ask two
> slightly off-topic things?
>
> (1) What's your opinion on ld --no-ld-generated-unwind-info?
I would say, from a design point of view, an unwinder of any kind should
not analyse and interpret machine
instructions as it's, in the general case, fragile - that's been my
experience from developing and maintaining
an unwinder that analysed prologues/epilogues, over a period of 10+ years,
each new compiler version required adjustments.

Then, PLT entries are likely to be a special case as they are both tiny and
extremely unlikely to change between
different compilers or different compiler versions. In a sense, one can
treat them as having implicit identical unwind
table entries (of any unwind table kind) associated with their address
range, therefore explicit entries in the regular
unwind tables are superfluous.
> (2) How should future stack unwinding strategy evolve?Well, that's a good question ... :D

~chill

-- 
Compiler scrub, Arm
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20211120/516c427d/attachment.html>

Fāng-ruì Sòng via llvm-dev

2021-Nov-20 19:12 UTC

head link

[llvm-dev] [RFC] Asynchronous unwind tables attribute

On 2021-11-20, Momchil Velikov wrote:>On Sat, 20 Nov 2021 at 08:26, Fāng-ruì Sòng <maskray at google.com>
wrote:
>> > Asynchronous unwind tables could also restrict code generation to
>> > having only a finite number of frame pointer adjustments (an
example
>> > of *not* having a finite number of `SP` adjustments is on AArch64
when
>> > untagging the stack (MTE) in some cases the compiler can modify
`SP`
>> > in a loop).
>>
>> The restriction on MTE is new to me as I don't know much about MTE
yet.
>
>It has nothing to do with MTE per se, I just noticed it in an MTE test
>(`llvm/test/CodeGen/AArch64/settag.ll:stg_alloca17()`).
>I've got a patch for this, that just uses an extra scratch register
(that's
>in
>the epilogue before popping CSRs and we have plenty of registers) and a
>constant
>number (usually one) of SP adjustments by a constant.
>
>> > We could, for example, extend the `uwtable` attribute with an
optional
>> > value, e.g.
>> >   -  `uwtable` (default to 2)
>> >   -  `uwtable(1)`, sync unwind tables
>> >   -  `uwtable(2)`, async unwind tables
>> >   -  `uwtable(3)`, async unwind tables, but tracking only a subset
of
>> > registers (e.g. CFA and return address)
>> >
>> > Or add a new attribute `async_uwtable`.
>> >
>> > Other suggestions? Comments?
>>
>> I have thought about extending uwtable as well. In spirit the idea
>> looks great to me.
>> The mode removing most callee-saved registers is useful.
>> For example, I think linux-perf just uses pc/sp/fp (as how its ORC
>> unwinder is designed).
>>
>> My slight concern with uwtable(3) is that the amount of unwind
>> information is not monotonic.
>> Since sync->async and the number of registers are two dimensions,
>> perhaps we should use two function attributes?
>
>I reckon this matters when combining (for whatever reasons) multiple
>`uwtable` attributes?
>Indeed, in my first version, I dropped the encoding 3 and then I was able
>to synthesize
>the attribute for an outlined function by simply taking the max of the
>attribute
>in the outlined-from functions - it was just simpler.
>
>How about instead we exchange the meaning of 2 and 3 so we get
>  - 1, sync unwind tables
>  - 2, "minimal" async unwind tables
>  - 3, full async unwind tables
>Then on the principle that we should always emit CFI information that the
>`uwtable` requested
>(as it may be an ABI mandate), possibly optimised, depending on the
>`nounwind` attribute, we would get:
>
>          | nounwind 0           |  nounwind 1
>----------+----------------------+--------------
>uwtable 0 | sync, full           |  no CFI
>----------+----------------------+--------------
>uwtable 1 | sync, full           |  sync, full
>----------+----------------------+--------------
>uwtable 2 | async, full prologue,|
>          | mininal epilogue     |  async, min
>----------+----------------------+--------------
>uwtable 3 | async, full          |  async, full
>
>as a starting point, and then backends may choose any of the entries
>in the following rows of the same column, as a QOI decision.
>
>All that said, I'm not even entirely convinced we need it as a separate
>`uwtable` option.
>The decision to skip some of the CFI instructions can be made during final
>object encoding.
>It probably has to be made during the final encoding, e.g. no point
>including epilogue CFI instructions
>in `.eh_frame`, or an ORC generator would naturally ignore most CFI
>instructions anyway.
I have trouble understanding "min" in the uwtable 2 row.
What does it mean?

As IR attributes, I'd hope the behavior is strictly monotonic in every
dimensions. I.e. If uwtable A provides some information, I'd expect that
uwtable B provides at least the same amount of information if A < B.

We may end up with too many dimensions.
For such rare ones (e.g. the number of registers), I think it is
entirely fine to omit it from IR attributes and make it an
llvm::TargetOptions::* option, like -ffunction-sections. We just lose
fine-grained LTO merge behavior which is probably not needed at all.
>> BTW, are you working on improving the general CFI problems for aarch64?
>Yeah, I'm implementing support for `-fasynchronous-unwind-tables`. A
>slightly outdated
>series of patches start from https://reviews.llvm.org/D112330
>
>The full list I have right now is:
>* [AArch64] Async unwind - Fix MTE codegen emitting frame adjustments in a
>loop
>  - this fixes the issue described above
>* [AArch64] Async unwind - Adjust unwind info in AArch64LoadStoreOptimizer
>  - this fixes some case(s) where load/store optimiser moves an SP inc/dec
>after the matching CFI instruction
>* [CodeGen] Async unwind - add a pass to fix CFI information
>  - this is a pass that inserts `.cfi_remember_state`/`.cfi_restore_state`,
>ideally should work
>    for all targets and replace `CFIInstrInserter`
>* [AArch64] Async unwind - function epilogues
>* [AArch64] Async unwind - function prologues
>  - these are the core functionality
>* [AArch64] Async unwind - Refactor generation of shadow call stack
>prologue/epilogue
>* [AArch64] Async unwind - Always place the first LDP at the end when
>ReverseCSRRestoreSeq is true
>* [AArch64] Async unwind - helper functions to decide on CFI emission
>  - the three above: preparation/refactoring/simplification, `emitEpilogue`
>especially is a big mess
>* [AArch64] Async unwind - do not schedule frame setup/destroy
>* Extend the `uwtable` attribute with unwind table kind
>(I was meaning to update it for a few days now, only always something else
>pops up ...)
Thanks:)
>> Since we are discussing asynchronous unwind tables, may I ask two
>> slightly off-topic things?
>>
>> (1) What's your opinion on ld --no-ld-generated-unwind-info?
>
>I would say, from a design point of view, an unwinder of any kind should
>not analyse and interpret machine
>instructions as it's, in the general case, fragile - that's been my
>experience from developing and maintaining
>an unwinder that analysed prologues/epilogues, over a period of 10+ years,
>each new compiler version required adjustments.
>
>Then, PLT entries are likely to be a special case as they are both tiny and
>extremely unlikely to change between
>different compilers or different compiler versions. In a sense, one can
>treat them as having implicit identical unwind
>table entries (of any unwind table kind) associated with their address
>range, therefore explicit entries in the regular
>unwind tables are superfluous.
Thanks for the comments.
>> (2) How should future stack unwinding strategy evolve?
>Well, that's a good question ... :D
* compact unwind scheme.
* hardware assisted. Piggybacking on security hardening features like shadow
call stack. However, this unlikely provides more information about callee-saved
registers.
* mainly FP-based. People don't use FP due to performance loss. If
`-fno-omit-frame-pointer -mno-omit-leaf-frame-pointer` doesn't hurt
performance that much, it may be better than some information in `.eh_frame`. We
can use unwind information to fill the gap, e.g. for shrink wrapping.

If we go with the FP route, the expectation from .eh_frame may be lower.
We just need it to fill the gap.
We can probably indicate the intention with a codegen only llvm::TargetOptions
option as well.
>~chill
>
>-- 
>Compiler scrub, Arm

llvm dev - Nov 2021 - [RFC] Asynchronous unwind tables attribute

[llvm-dev] [RFC] Asynchronous unwind tables attribute

[llvm-dev] [RFC] Asynchronous unwind tables attribute