Reid Kleckner via llvm-dev
2020-Mar-28 18:59 UTC
[llvm-dev] [RFC] Replacing inalloca with llvm.call.setup and preallocated
Sorry for the delay. Arthur Eubanks has started working on the design here:
https://reviews.llvm.org/D74651, so I felt I should follow up here about it.

On Mon, Jan 27, 2020 at 6:47 PM Eli Friedman <efriedma at quicinc.com> wrote:

> It doesn't seem like multiple call sites should be a problem if they're
> sufficiently similar? If the argument layout for each callsite is the
> same, it doesn't matter which callsite the backend chooses to compute the
> layout.

It's feasible, but it seems like asking for bugs. What should the backend
do when it detects two preallocated calls with different layouts?

Let's imagine LTO is promoting one indirect call that uses call.setup into
several direct calls. The IR would look like this:

  %cs = call.setup
  switch i32 %callee ...
  callee1:
    call void @callee1(i32 %x, %struct.Foo* preallocated(%struct.Foo) %foo)
    br label %rejoin
  callee2:
    call void @callee2(i32 %x, %struct.Foo* preallocated(%struct.Foo) %foo)
    br label %rejoin
  rejoin:
    ...

A logical next step would be to run dead argument elimination (DAE).
Suppose one callee does not use i32 %x above. Now the prototypes disagree,
and we can't lower the call. We could teach DAE that all calls using
callsetup tokens have to have the same prototype, but a simple verifier
rule earlier (one call per call setup) seems easier to enforce.
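For concreteness, a hypothetical shape of that IR after DAE drops the
unused i32 from @callee1 only (illustrative only, in the same placeholder
notation as above) might be:

  %cs = call.setup
  switch i32 %callee ...
  callee1:
    ; DAE removed the dead i32 argument from this prototype
    call void @callee1(%struct.Foo* preallocated(%struct.Foo) %foo)
    br label %rejoin
  callee2:
    ; this callee still uses %x, so its prototype is unchanged
    call void @callee2(i32 %x, %struct.Foo* preallocated(%struct.Foo) %foo)
    br label %rejoin
  rejoin:
    ...

At this point the two calls sharing %cs no longer agree on a prototype or
argument-area layout, which is the ambiguity the one-call-per-setup
verifier rule avoids.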
> > Nested setup is OK, but the verifier rule that there must be a paired
> > call site should make it impossible to do in a loop. I guess we should
> > have some rule to reject the following:
> >
> >   %cs1 = llvm.call.setup()
> >   %cs2 = llvm.call.setup()
> >   call void @cs1() [ "callsetup"(token %cs1) ]
> >   call void @cs2() [ "callsetup"(token %cs2) ]
>
> I think in general, there can be arbitrary control flow between a token
> and its uses, as long as the definition dominates the use. So you could
> call llvm.call.setup repeatedly in a loop, then call some function using
> the callsetup token in a different loop, unless some rule specific to
> callsetup forbids it.

I agree that the IR would validate according to the current rules, but I
would interpret that IR as meaning: set up N call sites in a loop, then
destroy just the last call site N times in a loop. For example, if you
replaced the call site setup in this example with stacksave+alloca and the
teardown with stackrestore, the semantics would be: allocate lots of
stack, then reset to the last saved stack pointer repeatedly.

I think we can write some non-verifier rules that prohibit this kind of
code pattern. Optimizations should already obey these rules, since they
don't reorder things that modify inaccessible state. Things like:
- It is UB to stacksave ; call.setup ; stackrestore ; call
- It is UB to call.setup 1 ; call.setup 2 ; call 1 ; call 2
- It is UB to call.setup ; alloca ; call
etc.

> It would be nice to make the rules strong enough to ensure we can
> statically compute the size of the stack frame at any point (assuming no
> dynamic allocas). Code generated by clang would be statically
> well-nested, I think; not sure how hard it would be to ensure
> optimizations maintain that invariant.

I agree, that is definitely a goal of the redesign.

> Connecting nested llvm.call.setups using tokens might make it easier for
> passes to reason about the nesting, since the region nest would be
> explicitly encoded.

I agree, that could be useful; it would replicate what we did for
exception handling.

> > VLAs could use something like this, but they are generally of unknown
> > size while call sites have a known fixed size. I think that makes them
> > pretty different.
>
> I don't think we need to implement it at the same time, but the systems
> would interact, so it might be worth planning out.

I do recall that MSVC does some pretty squirrelly stuff when you insert
`alloca` calls into a call sequence. In this example, arguments are
evaluated in the order 2, 3, 1:

  struct Foo {
    Foo();
    Foo(const Foo &o);
    ~Foo();
    int x;
  };
  void receive_ptr(Foo, void *, Foo);
  extern "C" void *_alloca(size_t n);
  extern int gv;
  void f(int n) {
    receive_ptr(Foo(), _alloca((++gv, n)), Foo());
  }

The standard says the order of argument evaluation is unspecified, so this
ordering is valid. However, I wouldn't want to implement the same behavior
in clang. We should probably implement some kind of warning, though:
compiled with clang, this code has undefined behavior.

Is there really any other option for us here, other than to document that
using VLAs, alloca, and call.setup in the wrong order results in UB? We
can precisely nail down what the "right order" is, but we basically can't
make arbitrary stack adjustment orderings work.
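As one concrete reading of the "right order" (a hypothetical sketch using
the placeholder llvm.call.setup notation from this thread, not final
syntax):

  ; OK: the dynamic allocation happens before the call site is set up
  %mem = alloca i8, i32 %n
  %cs  = llvm.call.setup()
  call void @f(i8* %mem) [ "callsetup"(token %cs) ]

  ; UB under the rules sketched above: the alloca lands between the setup
  ; and its paired call, so the argument area is no longer on top of the
  ; stack when the call is lowered
  %cs2  = llvm.call.setup()
  %mem2 = alloca i8, i32 %n
  call void @g(i8* %mem2) [ "callsetup"(token %cs2) ]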
Eli Friedman via llvm-dev
2020-Mar-28 21:20 UTC
[llvm-dev] [RFC] Replacing inalloca with llvm.call.setup and preallocated
Reply inline.

From: Reid Kleckner <rnk at google.com>
Sent: Saturday, March 28, 2020 11:59 AM
To: Eli Friedman <efriedma at quicinc.com>; Arthur Eubanks <aeubanks at google.com>
Cc: llvm-dev <llvm-dev at lists.llvm.org>
Subject: [EXT] Re: [llvm-dev] [RFC] Replacing inalloca with llvm.call.setup and preallocated

> Sorry for the delay. Arthur Eubanks has started working on the design
> here: https://reviews.llvm.org/D74651, so I felt I should follow up here
> about it.
>
> On Mon, Jan 27, 2020 at 6:47 PM Eli Friedman <efriedma at quicinc.com> wrote:
>
> > It doesn't seem like multiple call sites should be a problem if they're
> > sufficiently similar? If the argument layout for each callsite is the
> > same, it doesn't matter which callsite the backend chooses to compute
> > the layout.
>
> It's feasible, but it seems like asking for bugs. What should the backend
> do when it detects two preallocated calls with different layouts?
>
> Let's imagine LTO is promoting one indirect call that uses call.setup
> into several direct calls. The IR would look like this:
>
>   %cs = call.setup
>   switch i32 %callee ...
>   callee1:
>     call void @callee1(i32 %x, %struct.Foo* preallocated(%struct.Foo) %foo)
>     br label %rejoin
>   callee2:
>     call void @callee2(i32 %x, %struct.Foo* preallocated(%struct.Foo) %foo)
>     br label %rejoin
>   rejoin:
>     ...
>
> A logical next step would be to run DAE. Suppose one callee does not use
> i32 %x above. Now the prototypes disagree, and we can't lower the call.
> We could teach DAE that all calls using callsetup tokens have to have the
> same prototype, but a simple verifier rule earlier (one call per call
> setup) seems easier to enforce.

This would specifically be for cases where we try to rewrite the
signature? I would assume we should forbid rewriting the signature of a
call with an operand bundle. And once some optimization drops the bundle
and the preallocated marking, to allow such rewriting, the signature
doesn't need to match anymore.

Ultimately, I think it's worth some effort here to try to avoid blocking
optimizations like jump threading. That said, if you want to make the
calls "noduplicate", it isn't that terrible of an alternative.

> > Nested setup is OK, but the verifier rule that there must be a paired
> > call site should make it impossible to do in a loop. I guess we should
> > have some rule to reject the following:
> >
> >   %cs1 = llvm.call.setup()
> >   %cs2 = llvm.call.setup()
> >   call void @cs1() [ "callsetup"(token %cs1) ]
> >   call void @cs2() [ "callsetup"(token %cs2) ]
> >
> > I think in general, there can be arbitrary control flow between a token
> > and its uses, as long as the definition dominates the use. So you could
> > call llvm.call.setup repeatedly in a loop, then call some function
> > using the callsetup token in a different loop, unless some rule
> > specific to callsetup forbids it.
>
> I agree that the IR would validate according to the current rules, but I
> would interpret that IR as meaning: set up N call sites in a loop, then
> destroy just the last call site N times in a loop. For example, if you
> replaced the call site setup in this example with stacksave+alloca and
> the teardown with stackrestore, the semantics would be: allocate lots of
> stack, then reset to the last saved stack pointer repeatedly.
>
> I think we can write some non-verifier rules that prohibit this kind of
> code pattern. Optimizations should already obey these rules, since they
> don't reorder things that modify inaccessible state. Things like:
> - It is UB to stacksave ; call.setup ; stackrestore ; call
> - It is UB to call.setup 1 ; call.setup 2 ; call 1 ; call 2
> - It is UB to call.setup ; alloca ; call
> etc.

I'm fine with having UB in certain cases, as long as the rules are clear.

> > It would be nice to make the rules strong enough to ensure we can
> > statically compute the size of the stack frame at any point (assuming
> > no dynamic allocas). Code generated by clang would be statically
> > well-nested, I think; not sure how hard it would be to ensure
> > optimizations maintain that invariant.
>
> I agree, that is definitely a goal of the redesign.

Good. It might be a good idea to try to write out an algorithm for this to
ensure this works out. In particular, I'm concerned about cases where two
predecessors of a basic block appear to have a different stack size (an
if-then-else, or a loop backedge). We need to make sure such cases are
either invalid, or UB on entry to the block.

I spent a little time thinking, and I'm not sure what rules we need to
make this work out. For example, should we forbid tail-merging multiple
calls to abort()? If we should, how would we write a rule which restricts
that?
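To make the concern concrete, here is a hypothetical CFG (in the RFC's
placeholder notation; the block and function names are made up) where a
shared abort block would be reached with two different stack adjustment
levels:

  entry:
    br i1 %c, label %a, label %b

  a:
    %cs = llvm.call.setup()          ; stack adjusted for the argument area
    br i1 %fail, label %noret, label %do.call

  do.call:
    call void @f() [ "callsetup"(token %cs) ]
    br label %exit

  b:
    br i1 %fail2, label %noret, label %exit

  noret:                             ; tail-merged block: one predecessor has
    call void @abort()               ; an open call.setup, the other does not,
    unreachable                      ; so the SP level on entry is not unique

  exit:
    ret void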
> > Connecting nested llvm.call.setups using tokens might make it easier
> > for passes to reason about the nesting, since the region nest would be
> > explicitly encoded.
>
> I agree, that could be useful; it would replicate what we did for
> exception handling.

> > VLAs could use something like this, but they are generally of unknown
> > size while call sites have a known fixed size. I think that makes them
> > pretty different.
> >
> > I don't think we need to implement it at the same time, but the
> > systems would interact, so it might be worth planning out.
>
> I do recall that MSVC does some pretty squirrelly stuff when you insert
> `alloca` calls into a call sequence. In this example, arguments are
> evaluated in the order 2, 3, 1:

I'm not really concerned with funny usage of calls to alloca() in call
arguments, or anything like that. I'm happy to pick whatever rule is
easiest for us. I'm more concerned with ensuring nothing blows up if we
inline a call to a function that contains a VLA, or something like that.

-Eli
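As a concrete instance of the interaction in question (hypothetical; the
helper and names are made up, and whether this inlined form is legal, UB,
or something the inliner must avoid is exactly what would need to be
specified): inlining a VLA-using helper while an outer call site is being
set up would produce roughly:

  %cs = llvm.call.setup()
  ; inlined body of a helper that contained a VLA:
  %save = call i8* @llvm.stacksave()
  %vla  = alloca i32, i32 %n
  ; ... inlined code uses %vla ...
  call void @llvm.stackrestore(i8* %save)
  ; the pending call associated with %cs:
  call void @f() [ "callsetup"(token %cs) ]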
Reid Kleckner via llvm-dev
2020-Apr-16 20:05 UTC
[llvm-dev] [RFC] Replacing inalloca with llvm.call.setup and preallocated
On Sat, Mar 28, 2020 at 2:20 PM Eli Friedman <efriedma at quicinc.com> wrote:

> This would specifically be for cases where we try to rewrite the
> signature? I would assume we should forbid rewriting the signature of a
> call with an operand bundle. And once some optimization drops the bundle
> and the preallocated marking, to allow such rewriting, the signature
> doesn't need to match anymore.

Yes, I really would like to enable DAE and other signature-rewriting IPO
transforms. Maybe today DAE doesn't run on calls with bundles, but this
feature is designed to allow the non-preallocated arguments to be removed
or expanded into multiple arguments without disturbing the preallocated
argument numbering.

> Ultimately, I think it's worth some effort here to try to avoid blocking
> optimizations like jump threading. That said, if you want to make the
> calls "noduplicate", it isn't that terrible of an alternative.

Does "noduplicate" block inlining, though? Maybe we really want
"convergent"? In any case, I'm OK with powering down jump threading to
pick up IPO, which is incompatible with today's inalloca.

> Good. It might be a good idea to try to write out an algorithm for this
> to ensure this works out. In particular, I'm concerned about cases where
> two predecessors of a basic block appear to have a different stack size
> (an if-then-else, or a loop backedge). We need to make sure such cases
> are either invalid, or UB on entry to the block.
>
> I spent a little time thinking, and I'm not sure what rules we need to
> make this work out. For example, should we forbid tail-merging multiple
> calls to abort()? If we should, how would we write a rule which restricts
> that?

This is actually a big open problem, and it came up again in the SEH
discussion. It seems to be my fate to struggle against the LLVM IR design
decision to not have scopes.

Without introducing new IR constructs, we could define a list of
instructions that set the current region, and write an algorithm for
assigning regions to each block; the region would be implicit in the IR.
These are the things that could create regions:
- call.preallocated.setup
- catchpad
- cleanuppad
- lifetime.start? unclear

Passes would be required to ensure that each BB belongs to exactly one
region. Each region belongs to its parent, and ending a region returns to
the parent of the ending region. I don't think this idea is ready to be
added to LangRef, but it is a good future direction, perhaps with new
supporting IR constructs.

I think for now we have to live with the possibility that the analysis
which assigns SP adjustment levels to MBBs may fail to find a unique SP
level, in which case we must use a frame pointer. OTOH, we can easily
establish the invariant at the MIR level: we should always be able to
assign each MBB a unique most recently active call site and an SP
adjustment level. We can easily teach BranchFolding to preserve this
invariant; we already do it for funclets.
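As a sketch of what "each block belongs to exactly one region" could mean
for well-nested call setups (hypothetical pseudo-IR in this thread's
placeholder notation; the block names are made up):

  entry:                     ; no active call-setup region
    %cs.outer = llvm.call.setup()
    br label %build.arg      ; %build.arg belongs to the %cs.outer region

  build.arg:
    %cs.inner = llvm.call.setup()
    br label %inner.call     ; %inner.call belongs to the nested %cs.inner region

  inner.call:
    call void @inner() [ "callsetup"(token %cs.inner) ]
    br label %outer.call     ; inner region ends; back in the %cs.outer region

  outer.call:
    call void @outer() [ "callsetup"(token %cs.outer) ]
    ret void                 ; outer region ends; back to no region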
> I'm not really concerned with funny usage of calls to alloca() in call
> arguments, or anything like that. I'm happy to pick whatever rule is
> easiest for us. I'm more concerned with ensuring nothing blows up if we
> inline a call to a function that contains a VLA, or something like that.

Sounds good. Inlining dynamic allocas and VLAs should already just work:
the inliner places stacksave/stackrestore calls around the original call
site if dynamic allocas were present in the inlined code.
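For reference, a rough before/after of that inliner behavior (a minimal
sketch; the function names are made up, and the pointer types follow the
typed-pointer IR of this era):

  ; before inlining: @helper contains a dynamic alloca (e.g. from a VLA)
  define void @helper(i32 %n) {
    %vla = alloca i32, i32 %n
    call void @use(i32* %vla)
    ret void
  }

  define void @caller(i32 %n) {
    call void @helper(i32 %n)
    ret void
  }

  ; after inlining, roughly: the inliner brackets the inlined body with
  ; stacksave/stackrestore so the dynamic allocation is released on exit
  define void @caller(i32 %n) {
    %sp = call i8* @llvm.stacksave()
    %vla = alloca i32, i32 %n
    call void @use(i32* %vla)
    call void @llvm.stackrestore(i8* %sp)
    ret void
  }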