thr3ads.net - llvm dev - [llvm-dev] [RFC] Replacing inalloca with llvm.call.setup and preallocated [Jan 2020]

Reid Kleckner via llvm-dev

2020-Jan-28 00:57 UTC

[llvm-dev] [RFC] Replacing inalloca with llvm.call.setup and preallocated

On Mon, Jan 27, 2020 at 4:31 PM Eli Friedman <efriedma at quicinc.com>
wrote:
> I assume by “drop support”, you mean reject it in the bitcode reader/IR
> parser?  We can’t reasonably support a complex feature like inalloca if
> nobody is testing it. If we can’t reasonably upgrade it, and we don’t think
> there are any users other than clang targeting 32-bit Windows, probably
> dropping support is best.
>
That's a good point. There are already enough lightly tested features in
LLVM. There's no reason to leave another one lying around like a trap for
the first unsuspecting user to try it.

> More detailed comments on the proposal:
>
> “llvm.call.setup must have exactly one corresponding call site”: Normal IR
> rules would allow cloning the call site (in jump threading), or erasing the
> call site (if there’s a noreturn call in an argument).  What’s the benefit
> of enforcing this rule, as opposed to just saying all the call sites must
> have the same signature?
>
I think we could cope with unreachable code elimination deleting a paired
call site (zero or one), but code duplication creating a second call site
could be problematic. The call setup doesn't describe the prototype of the
main call site, so if there were multiple call sites, the backend would
have to pick one call site arbitrarily or compare the call sites when
setting up the call. If there are zero call sites, the backend can create
static allocas of the appropriate type to satisfy the allocations. Of
course, an IR pass (instcombine?) should do this transform first if it sees
it. Maybe we could have CGP take care of it, too.

> The proposal doesn’t address what happens if llvm.call.setup is called
> while there’s another llvm.call.setup still active.  Is it legal to call
> llvm.call.setup in a loop?  Or should nested llvm.call.setup calls have the
> parent callsetup token as an operand?
>
Nested setup is OK, but the verifier rule that there must be a paired call
site should make it impossible to do in a loop. I guess we should have some
rule to reject the following:
%cs1 = llvm.call.setup()
%cs2 = llvm.call.setup()
call void @cs1() [ "callsetup"(token %cs1) ]
call void @cs2() [ "callsetup"(token %cs2) ]

> Is there some way we can allow optimizations if we can’t modify the
> callee, but we can prove nothing captures the address of the preallocated
> region before the call?  I guess under the current proposal we could
> transform preallocated->byval, but that isn’t very exciting.
>
I suppose we could say that the combo of byval+preallocated just means
`byval`, and teach transforms that that's OK.

> How does this interact with other dynamic stack allocations?  Should we
> switch VLAs to use a similar mechanism?  (The problems with dynamic alloca
> in general aren’t as terrible, but it might still benefit: for example,
> it’s much easier to transform a dynamic allocation into a static
> allocation.)
>
VLAs could use something like this, but they are generally of unknown size
while call sites have a known fixed size. I think that makes them pretty
different.

> “If an exception is thrown and caught within the call setup region, the
> newly established SP must be saved into the EH record when a call is
> setup.”  What makes this case special vs. what we currently implement?  Is
> this currently broken?  Or is it related to supporting frame pointer
> elimination?
>
I think of it as a special case because you can't write this in standard
C++. Today, I think we leak stack memory in this case. There's no
correctness issue because we copy SP into its own virtual register at the
point of the alloca, and arguments are addressed relative to the vreg. What
I had in mind for the new system is that we make some kind of fixed stack
object that uses pre-computed SP offsets, assuming there are no dynamic
allocas in the function. This would be a problem for a program that does:

setup call 1
store call 1 arg 0
try {
  setup call 2
  throw exception
  call 2
} catch (...) {}
; call 2's frame is still on the stack
store call 1 arg 1 ; SP offset would be incorrect
call 1
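
The offset problem in the sequence above can be modeled numerically. The following C++ is a minimal sketch, not LLVM code: the `StackModel` type, the byte sizes, and the `restoreSpInCatch` switch are all made up for illustration. It assumes a downward-growing stack and argument slots addressed as a precomputed offset from the current SP.

```cpp
#include <cstddef>

// Hypothetical model of the stack pointer; not an LLVM data structure.
struct StackModel {
    std::ptrdiff_t sp = 0;  // stack grows downward, so deeper means smaller
    void setupCall(std::ptrdiff_t bytes) { sp -= bytes; }
};

struct StoreAddresses {
    std::ptrdiff_t intended;  // where call 1's arg 1 really lives
    std::ptrdiff_t actual;    // where an SP-relative store would land
};

// Replays the sequence above with illustrative sizes (8 and 16 bytes).
StoreAddresses storeAfterCaughtThrow(bool restoreSpInCatch) {
    const std::ptrdiff_t call1Bytes = 8;
    const std::ptrdiff_t call2Bytes = 16;
    const std::ptrdiff_t arg1Offset = 4;  // precomputed: arg 1 sits at SP+4

    StackModel stack;
    stack.setupCall(call1Bytes);              // setup call 1
    const std::ptrdiff_t call1Sp = stack.sp;  // SP established for call 1

    stack.setupCall(call2Bytes);              // setup call 2, inside the try
    // The throw skips call 2, so nothing pops call 2's argument area.
    if (restoreSpInCatch) {
        stack.sp = call1Sp;  // assumed fix: reload the SP saved at setup time
    }

    return {call1Sp + arg1Offset, stack.sp + arg1Offset};
}
```

With `restoreSpInCatch` false, the two addresses differ by the size of call 2's argument area; with it true, they coincide.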

Eli Friedman via llvm-dev

2020-Jan-28 02:47 UTC

[llvm-dev] [RFC] Replacing inalloca with llvm.call.setup and preallocated

Reply inline. (Sorry about the formatting; I can't figure out how to avoid
destroying it in Outlook.)

From: Reid Kleckner <rnk at google.com>
Sent: Monday, January 27, 2020 4:58 PM
To: Eli Friedman <efriedma at quicinc.com>
Cc: llvm-dev <llvm-dev at lists.llvm.org>
Subject: [EXT] Re: [llvm-dev] [RFC] Replacing inalloca with llvm.call.setup and
preallocated

>> “llvm.call.setup must have exactly one corresponding call site”: Normal
>> IR rules would allow cloning the call site (in jump threading), or erasing
>> the call site (if there’s a noreturn call in an argument).  What’s the
>> benefit of enforcing this rule, as opposed to just saying all the call
>> sites must have the same signature?

> I think we could cope with unreachable code elimination deleting a paired
> call site (zero or one), but code duplication creating a second call site
> could be problematic. The call setup doesn't describe the prototype of the
> main call site, so if there were multiple call sites, the backend would
> have to pick one call site arbitrarily or compare the call sites when
> setting up the call. If there are zero call sites, the backend can create
> static allocas of the appropriate type to satisfy the allocations. Of
> course, an IR pass (instcombine?) should do this transform first if it sees
> it. Maybe we could have CGP take care of it, too.

It doesn’t seem like multiple call sites should be a problem if they’re
sufficiently similar?  If the argument layout for each callsite is the same, it
doesn’t matter which callsite the backend chooses to compute the layout.

> Nested setup is OK, but the verifier rule that there must be a paired call
> site should make it impossible to do in a loop. I guess we should have some
> rule to reject the following:
> %cs1 = llvm.call.setup()
> %cs2 = llvm.call.setup()
> call void @cs1() [ "callsetup"(token %cs1) ]
> call void @cs2() [ "callsetup"(token %cs2) ]

I think in general, there can be arbitrary control flow between a token and its
uses, as long as the definition dominates the use.  So you could call
llvm.call.setup repeatedly in a loop, then call some function using the
callsetup token in a different loop, unless some rule specific to callsetup
forbids it.

It would be nice to make the rules strong enough to ensure we can statically
compute the size of the stack frame at any point (assuming no dynamic allocas). 
Code generated by clang would be statically well-nested, I think; not sure how
hard it would be to ensure optimizations maintain that invariant.

Connecting nested llvm.call.setups using tokens might make it easier for passes
to reason about the nesting, since the region nest would be explicitly encoded.
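
One way to picture the explicit nesting: if each call setup records its enclosing setup, the argument-area depth at any point is just the sum along the parent chain. A minimal C++ sketch follows; the `Setup` record, its `parent` index, and the byte sizes are hypothetical, and it assumes no dynamic allocas.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical record for one call setup region; not an LLVM type.
struct Setup {
    int parent;         // index of the enclosing setup, or -1 at top level
    std::size_t bytes;  // fixed size of this call's argument area
};

// Bytes of argument area live while `innermost` is the current region.
// Pass -1 for a point outside every call setup region.
std::size_t depthAt(const std::vector<Setup>& setups, int innermost) {
    std::size_t depth = 0;
    for (int i = innermost; i != -1;
         i = setups[static_cast<std::size_t>(i)].parent) {
        depth += setups[static_cast<std::size_t>(i)].bytes;
    }
    return depth;
}
```

Because every size is a constant and the chain is explicit, the depth is known without tracing control flow.
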

>> How does this interact with other dynamic stack allocations?  Should we
>> switch VLAs to use a similar mechanism?  (The problems with dynamic alloca
>> in general aren’t as terrible, but it might still benefit: for example,
>> it’s much easier to transform a dynamic allocation into a static
>> allocation.)

> VLAs could use something like this, but they are generally of unknown size
> while call sites have a known fixed size. I think that makes them pretty
> different.

I don’t think we need to implement it at the same time, but the systems would
interact, so it might be worth planning out.

-Eli

Reid Kleckner via llvm-dev

2020-Mar-28 18:59 UTC

[llvm-dev] [RFC] Replacing inalloca with llvm.call.setup and preallocated

Sorry for the delay. Arthur Eubanks has started working on the design here:
https://reviews.llvm.org/D74651
I felt I should follow up here about that.

On Mon, Jan 27, 2020 at 6:47 PM Eli Friedman <efriedma at quicinc.com>
wrote:
> It doesn’t seem like multiple call sites should be a problem if they’re
> sufficiently similar?  If the argument layout for each callsite is the
> same, it doesn’t matter which callsite the backend chooses to compute the
> layout.
>
It's feasible, but it seems like asking for bugs. What should the backend
do when it detects two preallocated calls with different layouts? Let's
imagine LTO is promoting one indirect call that uses call.setup into
several direct calls. The IR would look like this:

%cs = call.setup
switch i32 %callee ...
callee1:
  call void @callee1(i32 %x, %struct.Foo* preallocated(%struct.Foo) %foo)
  br label %rejoin
callee2:
  call void @callee2(i32 %x, %struct.Foo* preallocated(%struct.Foo) %foo)
  br label %rejoin
rejoin:
  ...

A logical next step would be to run DAE. Suppose one callee does not use
i32 %x above. Now the prototypes disagree, and we can't lower the call. We
could teach DAE that all calls using callsetup tokens have to have the same
prototype, but a simple verifier rule earlier (one call per call setup)
seems easier to enforce.
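
To make the DAE hazard concrete, here is a small C++ sketch that computes stack offsets for the two prototypes before and after the unused `i32` is dropped from one callee. The `Arg` type, the 4-byte slot rounding, and the 12-byte size used for `%struct.Foo` in the usage note are assumptions for illustration, not the actual x86 lowering code.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical argument description; sizes are in bytes.
struct Arg {
    std::size_t size;
};

// Assigns each argument a stack offset, rounding sizes up to 4-byte slots
// (an assumed 32-bit convention). The last element is the total area size.
std::vector<std::size_t> stackLayout(const std::vector<Arg>& args) {
    std::vector<std::size_t> offsets;
    std::size_t next = 0;
    for (const Arg& a : args) {
        offsets.push_back(next);
        next += (a.size + 3) / 4 * 4;
    }
    offsets.push_back(next);
    return offsets;
}

// Two call sites can share one call setup only if their layouts agree.
bool sameLayout(const std::vector<Arg>& a, const std::vector<Arg>& b) {
    return stackLayout(a) == stackLayout(b);
}
```

With a 4-byte `i32` and a 12-byte struct, both callees start out with offsets 0 and 4. Once DAE removes the `i32` from one of them, its struct moves to offset 0 and `sameLayout` returns false.
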

> > Nested setup is OK, but the verifier rule that there must be a paired
> > call site should make it impossible to do in a loop. I guess we should
> > have some rule to reject the following:
> >
> > %cs1 = llvm.call.setup()
> > %cs2 = llvm.call.setup()
> > call void @cs1() [ "callsetup"(token %cs1) ]
> > call void @cs2() [ "callsetup"(token %cs2) ]
>
> I think in general, there can be arbitrary control flow between a token
> and its uses, as long as the definition dominates the use.  So you could
> call llvm.call.setup repeatedly in a loop, then call some function using
> the callsetup token in a different loop, unless some rule specific to
> callsetup forbids it.
>
I agree that the IR would validate according to the current rules, but I
would interpret that IR as meaning: set up N call sites in a loop, then
destroy just the last call site N times in a loop. For example, if you
replaced the call site token in this example with stacksave+alloca and
teardown with stackrestore, the semantics would be: allocate lots of stack,
then reset to the last saved stack pointer repeatedly.

I think we can write some non-verifier rules that prohibit this kind of
code pattern. Optimizations should already obey these rules since they
don't reorder things that modify inaccessible state. Things like:
- It is UB to stacksave ; call.setup ; stackrestore ; call
- UB to call.setup 1 ; call.setup 2 ; call 1 ; call 2
- UB to call.setup ; alloca ; call
etc
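
Rules like these can be read as a stack discipline on stack-adjusting operations. The sketch below is a toy C++ checker over a straight-line event list, not a proposed verifier: the `Event` model is hypothetical, it pairs each stackrestore with a single stacksave (real IR does not require that), and it ignores control flow.

```cpp
#include <vector>

// Hypothetical event model; none of these names are LLVM APIs.
enum class Kind { Setup, Call, Alloca, StackSave, StackRestore };

struct Event {
    Kind kind;
    int id;  // pairs Setup with Call and StackSave with StackRestore
};

// Returns true if the sequence keeps stack adjustments in LIFO order.
// Regions still open at the end are not rejected here.
bool isWellNested(const std::vector<Event>& events) {
    std::vector<Event> open;  // pending call setups and stacksaves
    for (const Event& e : events) {
        switch (e.kind) {
        case Kind::Setup:
        case Kind::StackSave:
            open.push_back(e);
            break;
        case Kind::Call:
            // A call must close the innermost open call setup.
            if (open.empty() || open.back().kind != Kind::Setup ||
                open.back().id != e.id) {
                return false;
            }
            open.pop_back();
            break;
        case Kind::StackRestore:
            // A restore must not discard a call setup that is still open.
            if (open.empty() || open.back().kind != Kind::StackSave ||
                open.back().id != e.id) {
                return false;
            }
            open.pop_back();
            break;
        case Kind::Alloca:
            // An alloca directly inside a call setup would sit between the
            // argument area and the call.
            if (!open.empty() && open.back().kind == Kind::Setup) {
                return false;
            }
            break;
        }
    }
    return true;
}
```

Each of the three patterns listed above is rejected by this model, while properly nested setups (setup 1, setup 2, call 2, call 1) are accepted.
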

> It would be nice to make the rules strong enough to ensure we can
> statically compute the size of the stack frame at any point (assuming no
> dynamic allocas).  Code generated by clang would be statically well-nested,
> I think; not sure how hard it would be to ensure optimizations maintain
> that invariant.
>
I agree, that is definitely a goal of the redesign.

> Connecting nested llvm.call.setups using tokens might make it easier for
> passes to reason about the nesting, since the region nest would be
> explicitly encoded.
>
I agree, that could be useful, it would replicate what we did for exception
handling.

> > VLAs could use something like this, but they are generally of unknown
> > size while call sites have a known fixed size. I think that makes them
> > pretty different.
>
> I don’t think we need to implement it at the same time, but the systems
> would interact, so it might be worth planning out.
>
I do recall that MSVC does some pretty squirrelly stuff when you insert
`alloca` calls into a call sequence. In this example, arguments are
evaluated in the order 2, 3, 1:

#include <stddef.h>  // size_t

struct Foo {
  Foo();
  Foo(const Foo &o);
  ~Foo();
  int x;
};
void receive_ptr(Foo, void *, Foo);
extern "C" void *_alloca(size_t n);
extern int gv;
// The _alloca call is argument 2, placed between two Foo temporaries.
void f(int n) { receive_ptr(Foo(), _alloca((++gv, n)), Foo()); }

The standard says order of argument evaluation is unspecified, so this
ordering is valid. However, I wouldn't want to implement the same behavior
in clang. We should probably implement some kind of warning, though.
Compiled with clang, this code has undefined behavior.

Is there really any other option for us here, other than to document that
using VLAs, alloca, and call.setup in the wrong order will result in UB? We
can precisely nail down what the "right order" is, but we basically can't
make arbitrary stack adjustment orderings work.