thr3ads.net - llvm dev - [llvm-dev] RFC: Adding argument allocas [Dec 2016]

If this information is useful, please help other people find it:
Share via:

Reid Kleckner via llvm-dev

2016-Dec-09 01:05 UTC

[llvm-dev] RFC: Adding argument allocas

Clang is currently missing some low-hanging performance and code size
opportunities when receiving function parameters. For most scalar
parameters, Clang typically emits an alloca and stores the LLVM SSA
argument value into the alloca to initialize it. With optimizations, this
initialization is often removed, but it stays behind in debug builds and
when the user takes the address of a parameter (
https://llvm.org/bugs/show_bug.cgi?id=26328). In many cases, the memory
allocation and store duplicate work already done by the caller.

Case 1: The parameter is already being passed in memory. In this case, we
waste memory and do an extra load and store to copy it into our own alloca.
This is very common on 32-bit x86.

Case 2: On Win64, the caller pre-allocates shadow stack slots for all
register parameters. This allows the callee to “home” register parameters
to ease debugging and the implementation of variadic functions. In this
case, the store is not dead, but we fail to use the pre-allocated memory.

Case 3: The parameter is a register parameter in a normal Sys-V ABI. In
this case, nothing is wasted. Both the memory and store are needed.

I think we can improve our code for cases 1 and 2 by making it possible to
create allocas from parameters, and we can lower this down to a regular
stack object with a store in case 3. The syntax for this would look like:
define void @f(i32 %x) {
  %px = alloca i32, argument i32 %x, align 4

The semantics are the same as the following:
define void @f(i32 %x) {
  %px = alloca i32, align 4
  store i32 %x, i32* %px, align 4

It is invalid to make one of these argument allocas dynamic in any way,
either by using inalloca, using a dynamic element count, or being outside
the entry block.

If the semantics are the same, it begs the question, why don’t we pattern
match the alloca and store to elide dead stores and reuse existing argument
stack slots? My main preference for adding a new way to do this is that it
gives us simpler and smaller code in debug builds, where we presumably
don’t want to do this kind of pattern recognition. My experience with our
-O0 codegen is that we do a lot of useless copies to initialize parameters,
and this leads to larger output size, which slows things down. Having a
more easily recognizable idiom for getting the storage behind parameters if
it exists feels like a win.

Changing the semantics of alloca affects a lot of things, but I think it
makes sense to extend alloca here rather than a new intrinsic or
instruction that can create local variable stack memory that passes would
have to reason about. Here’s a list of things we’d need to change off the
top of my head:
1. Inliner: When hoisting static allocas to the entry block, the inliner
will now strip the argument operand off of allocas and insert the
equivalent store at the inlined call site.
2. Mem2reg: Loading from an argument alloca needs to produce the argument
value instead of undef.
3. GVN: We need to apply the same logic to GVN and similar store to load
forwarding transforms.
4. Instcombine: This transform has some simple store forwarding logic that
would need to be updated.

I’m sure there are more, but the changes seem worth it to get at these low
hanging opportunities.

One other questionable side benefit of doing this is that it would make it
possible to implement va_start by taking the address of a parameter and
doing pointer arithmetic. While that code is fairly invalid, it’s exactly
what the MSVC STL headers do for 32-bit x86. If we make this work in Clang,
we can remove our stdarg.h and vadefs.h wrapper headers. Users often pass
flags that cause clang to skip these wrapper headers, and then they file
bugs complaining that va lists don't work.

At the end of the day, this feels like a straightforward engineering
improvement to LLVM, and it feels worth doing to me. Does anyone feel
otherwise or have any suggestions?
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20161208/d5357451/attachment-0001.html>

Eric Christopher via llvm-dev

2016-Dec-09 01:11 UTC

head link

[llvm-dev] RFC: Adding argument allocas

Hi Reid,

This seems pretty reasonable to me.

For the debug info people:

Reid and I also chatted about dbg.declare vs dbg.value as a consequence of
this. His work here is going to take us closer to being able to get rid of
dbg.declare, but not actually be part of it (and isn't planned - it
doesn't
make sense as part of this).

-eric

On Thu, Dec 8, 2016 at 5:06 PM Reid Kleckner via llvm-dev <
llvm-dev at lists.llvm.org> wrote:
> Clang is currently missing some low-hanging performance and code size
> opportunities when receiving function parameters. For most scalar
> parameters, Clang typically emits an alloca and stores the LLVM SSA
> argument value into the alloca to initialize it. With optimizations, this
> initialization is often removed, but it stays behind in debug builds and
> when the user takes the address of a parameter (
> https://llvm.org/bugs/show_bug.cgi?id=26328). In many cases, the memory
> allocation and store duplicate work already done by the caller.
>
> Case 1: The parameter is already being passed in memory. In this case, we
> waste memory and do an extra load and store to copy it into our own alloca.
> This is very common on 32-bit x86.
>
> Case 2: On Win64, the caller pre-allocates shadow stack slots for all
> register parameters. This allows the callee to “home” register parameters
> to ease debugging and the implementation of variadic functions. In this
> case, the store is not dead, but we fail to use the pre-allocated memory.
>
> Case 3: The parameter is a register parameter in a normal Sys-V ABI. In
> this case, nothing is wasted. Both the memory and store are needed.
>
> I think we can improve our code for cases 1 and 2 by making it possible to
> create allocas from parameters, and we can lower this down to a regular
> stack object with a store in case 3. The syntax for this would look like:
> define void @f(i32 %x) {
>   %px = alloca i32, argument i32 %x, align 4
>
> The semantics are the same as the following:
> define void @f(i32 %x) {
>   %px = alloca i32, align 4
>   store i32 %x, i32* %px, align 4
>
> It is invalid to make one of these argument allocas dynamic in any way,
> either by using inalloca, using a dynamic element count, or being outside
> the entry block.
>
> If the semantics are the same, it begs the question, why don’t we pattern
> match the alloca and store to elide dead stores and reuse existing argument
> stack slots? My main preference for adding a new way to do this is that it
> gives us simpler and smaller code in debug builds, where we presumably
> don’t want to do this kind of pattern recognition. My experience with our
> -O0 codegen is that we do a lot of useless copies to initialize parameters,
> and this leads to larger output size, which slows things down. Having a
> more easily recognizable idiom for getting the storage behind parameters if
> it exists feels like a win.
>
> Changing the semantics of alloca affects a lot of things, but I think it
> makes sense to extend alloca here rather than a new intrinsic or
> instruction that can create local variable stack memory that passes would
> have to reason about. Here’s a list of things we’d need to change off the
> top of my head:
> 1. Inliner: When hoisting static allocas to the entry block, the inliner
> will now strip the argument operand off of allocas and insert the
> equivalent store at the inlined call site.
> 2. Mem2reg: Loading from an argument alloca needs to produce the argument
> value instead of undef.
> 3. GVN: We need to apply the same logic to GVN and similar store to load
> forwarding transforms.
> 4. Instcombine: This transform has some simple store forwarding logic that
> would need to be updated.
>
> I’m sure there are more, but the changes seem worth it to get at these low
> hanging opportunities.
>
> One other questionable side benefit of doing this is that it would make it
> possible to implement va_start by taking the address of a parameter and
> doing pointer arithmetic. While that code is fairly invalid, it’s exactly
> what the MSVC STL headers do for 32-bit x86. If we make this work in Clang,
> we can remove our stdarg.h and vadefs.h wrapper headers. Users often pass
> flags that cause clang to skip these wrapper headers, and then they file
> bugs complaining that va lists don't work.
>
> At the end of the day, this feels like a straightforward engineering
> improvement to LLVM, and it feels worth doing to me. Does anyone feel
> otherwise or have any suggestions?
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20161209/48442fd7/attachment.html>

Mehdi Amini via llvm-dev

2016-Dec-09 01:37 UTC

head link

[llvm-dev] RFC: Adding argument allocas

> On Dec 8, 2016, at 5:05 PM, Reid Kleckner via llvm-dev <llvm-dev at
lists.llvm.org> wrote:
> 
> Clang is currently missing some low-hanging performance and code size
opportunities when receiving function parameters. For most scalar parameters,
Clang typically emits an alloca and stores the LLVM SSA argument value into the
alloca to initialize it. With optimizations, this initialization is often
removed, but it stays behind in debug builds and when the user takes the address
of a parameter (https://llvm.org/bugs/show_bug.cgi?id=26328
<https://llvm.org/bugs/show_bug.cgi?id=26328>). In many cases, the memory
allocation and store duplicate work already done by the caller.
> 
> Case 1: The parameter is already being passed in memory. In this case, we
waste memory and do an extra load and store to copy it into our own alloca. This
is very common on 32-bit x86.
> 
> Case 2: On Win64, the caller pre-allocates shadow stack slots for all
register parameters. This allows the callee to “home” register parameters to
ease debugging and the implementation of variadic functions. In this case, the
store is not dead, but we fail to use the pre-allocated memory.
> 
> Case 3: The parameter is a register parameter in a normal Sys-V ABI. In
this case, nothing is wasted. Both the memory and store are needed.
> 
> I think we can improve our code for cases 1 and 2 by making it possible to
create allocas from parameters, and we can lower this down to a regular stack
object with a store in case 3. The syntax for this would look like:
> define void @f(i32 %x) {
>   %px = alloca i32, argument i32 %x, align 4
> 
> The semantics are the same as the following:
> define void @f(i32 %x) {
>   %px = alloca i32, align 4
>   store i32 %x, i32* %px, align 4
> 
> It is invalid to make one of these argument allocas dynamic in any way,
either by using inalloca, using a dynamic element count, or being outside the
entry block.
> 
> If the semantics are the same, it begs the question, why don’t we pattern
match the alloca and store to elide dead stores and reuse existing argument
stack slots? My main preference for adding a new way to do this is that it gives
us simpler and smaller code in debug builds, where we presumably don’t want to
do this kind of pattern recognition.
So IIUC basically the *only* reason for this IR change is that we don’t want to
pattern match in debug build?
I don't understand right now why we wouldn’t want to do this?

Thanks,

— 
Mehdi 


> My experience with our -O0 codegen is that we do a lot of useless copies to
initialize parameters, and this leads to larger output size, which slows things
down. Having a more easily recognizable idiom for getting the storage behind
parameters if it exists feels like a win.
> 
> Changing the semantics of alloca affects a lot of things, but I think it
makes sense to extend alloca here rather than a new intrinsic or instruction
that can create local variable stack memory that passes would have to reason
about. Here’s a list of things we’d need to change off the top of my head:
> 1. Inliner: When hoisting static allocas to the entry block, the inliner
will now strip the argument operand off of allocas and insert the equivalent
store at the inlined call site.
> 2. Mem2reg: Loading from an argument alloca needs to produce the argument
value instead of undef.
> 3. GVN: We need to apply the same logic to GVN and similar store to load
forwarding transforms.
> 4. Instcombine: This transform has some simple store forwarding logic that
would need to be updated.
> 
> I’m sure there are more, but the changes seem worth it to get at these low
hanging opportunities.
> 
> One other questionable side benefit of doing this is that it would make it
possible to implement va_start by taking the address of a parameter and doing
pointer arithmetic. While that code is fairly invalid, it’s exactly what the
MSVC STL headers do for 32-bit x86. If we make this work in Clang, we can remove
our stdarg.h and vadefs.h wrapper headers. Users often pass flags that cause
clang to skip these wrapper headers, and then they file bugs complaining that va
lists don't work.
> 
> At the end of the day, this feels like a straightforward engineering
improvement to LLVM, and it feels worth doing to me. Does anyone feel otherwise
or have any suggestions?
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20161208/cd47a2fb/attachment.html>

Friedman, Eli via llvm-dev

2016-Dec-09 01:47 UTC

head link

[llvm-dev] RFC: Adding argument allocas

On 12/8/2016 5:05 PM, Reid Kleckner via llvm-dev wrote:> If the semantics are the same, it begs the question, why don’t we 
> pattern match the alloca and store to elide dead stores and reuse 
> existing argument stack slots? My main preference for adding a new way 
> to do this is that it gives us simpler and smaller code in debug 
> builds, where we presumably don’t want to do this kind of pattern 
> recognition. My experience with our -O0 codegen is that we do a lot of 
> useless copies to initialize parameters, and this leads to larger 
> output size, which slows things down. Having a more easily 
> recognizable idiom for getting the storage behind parameters if it 
> exists feels like a win.
Why don't we want to do this kind of pattern recognition at -O0?  We can 
do a fast scan which just looks at the entry block if you're concerned 
about compile-time.
> One other questionable side benefit of doing this is that it would 
> make it possible to implement va_start by taking the address of a 
> parameter and doing pointer arithmetic. While that code is fairly 
> invalid, it’s exactly what the MSVC STL headers do for 32-bit x86. If 
> we make this work in Clang, we can remove our stdarg.h and vadefs.h 
> wrapper headers. Users often pass flags that cause clang to skip these 
> wrapper headers, and then they file bugs complaining that va lists 
> don't work.
This ignores the other reason why this doesn't work at the moment: 
reading off the end of an alloca returns undef, whether or not the 
actual pointer value points at the data you want.  I mean, it's not 
impossible to make it work, but it involves a bunch of unrelated changes 
which I don't think we want to pursue.

-Eli

-- 
Employee of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux
Foundation Collaborative Project

Reid Kleckner via llvm-dev

2016-Dec-09 16:45 UTC

head link

[llvm-dev] RFC: Adding argument allocas

On Thu, Dec 8, 2016 at 5:37 PM, Mehdi Amini <mehdi.amini at apple.com>
wrote:
> So IIUC basically the *only* reason for this IR change is that we don’t
> want to pattern match in debug build?
> I don't understand right now why we wouldn’t want to do this?
>
If we need to pattern match such a basic construct, it suggests to me that
we have the wrong representation, and we should instead make our
representation more accurately model reality. To me, it feels like this
representation allows several good things to just "fall out" without
any
additional work, and that suggests it's a good representation.

On Thu, Dec 8, 2016 at 5:47 PM, Friedman, Eli <efriedma at codeaurora.org>
wrote:>
> One other questionable side benefit of doing this is that it would make it
>> possible to implement va_start by taking the address of a parameter and
>> doing pointer arithmetic. While that code is fairly invalid, it’s
exactly
>> what the MSVC STL headers do for 32-bit x86. If we make this work in
Clang,
>> we can remove our stdarg.h and vadefs.h wrapper headers. Users often
pass
>> flags that cause clang to skip these wrapper headers, and then they
file
>> bugs complaining that va lists don't work.
>>
>
> This ignores the other reason why this doesn't work at the moment:
reading
> off the end of an alloca returns undef, whether or not the actual pointer
> value points at the data you want.  I mean, it's not impossible to make
it
> work, but it involves a bunch of unrelated changes which I don't think
we
> want to pursue.

I guess I'm more willing to pursue those changes at some vague unspecified
point in the future. It doesn't seem that crazy to add a special case to
allow OOB GEPs of off the last argument alloca of a variadic function.
Other OOB GEPs can continue to be UB as they are today. We have other
loopholes for these kinds of OOB GEP optimizations around zero-sized
globals so that people can use __start_section/__stop_section symbols to
implement .init_array style registration systems. This doesn't seem that
different.
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20161209/2dd39d81/attachment.html>

Hans Wennborg via llvm-dev

2016-Dec-09 17:08 UTC

head link

[llvm-dev] RFC: Adding argument allocas

On Thu, Dec 8, 2016 at 5:05 PM, Reid Kleckner via llvm-dev
<llvm-dev at lists.llvm.org> wrote:> If the semantics are the same, it begs the question, why don’t we pattern
> match the alloca and store to elide dead stores and reuse existing argument
> stack slots?
One could reverse that argument: if we're going to emit the
alloca+store, as we currently do, only to pattern match it away, the
question is why emit the alloca+store in the first place? :-)

Also, pattern-matching would need to happen in the back-end since we
can't represent this in IR, right? Which means all the backends:
SelectionDAG, FastISel (and GlobalISel going forward).

Hal Finkel via llvm-dev

2016-Dec-10 16:00 UTC

head link

[llvm-dev] RFC: Adding argument allocas

----- Original Message -----> From: "Eli via llvm-dev Friedman" <llvm-dev at
lists.llvm.org>
> To: "Reid Kleckner" <rnk at google.com>,
"llvm-dev" <llvm-dev at lists.llvm.org>
> Sent: Thursday, December 8, 2016 7:47:28 PM
> Subject: Re: [llvm-dev] RFC: Adding argument allocas
> 
> On 12/8/2016 5:05 PM, Reid Kleckner via llvm-dev wrote:
> > If the semantics are the same, it begs the question, why don’t we
> > pattern match the alloca and store to elide dead stores and reuse
> > existing argument stack slots? My main preference for adding a new
> > way
> > to do this is that it gives us simpler and smaller code in debug
> > builds, where we presumably don’t want to do this kind of pattern
> > recognition. My experience with our -O0 codegen is that we do a lot
> > of
> > useless copies to initialize parameters, and this leads to larger
> > output size, which slows things down. Having a more easily
> > recognizable idiom for getting the storage behind parameters if it
> > exists feels like a win.
> 
> Why don't we want to do this kind of pattern recognition at -O0?  We
> can
> do a fast scan which just looks at the entry block if you're
> concerned
> about compile-time.
> 
> > One other questionable side benefit of doing this is that it would
> > make it possible to implement va_start by taking the address of a
> > parameter and doing pointer arithmetic. While that code is fairly
> > invalid, it’s exactly what the MSVC STL headers do for 32-bit x86.
> > If
> > we make this work in Clang, we can remove our stdarg.h and vadefs.h
> > wrapper headers. Users often pass flags that cause clang to skip
> > these
> > wrapper headers, and then they file bugs complaining that va lists
> > don't work.
> 
> This ignores the other reason why this doesn't work at the moment:
> reading off the end of an alloca returns undef, whether or not the
> actual pointer value points at the data you want.  I mean, it's not
> impossible to make it work, but it involves a bunch of unrelated
> changes
> which I don't think we want to pursue.
I agree. Making reading past the end of an alloca defined behavior seems
undesirable.

 -Hal
> 
> -Eli
> 
> --
> Employee of Qualcomm Innovation Center, Inc.
> Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a
> Linux Foundation Collaborative Project
> 
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
> 
-- 
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory

Hal Finkel via llvm-dev

2016-Dec-10 16:20 UTC

head link

[llvm-dev] RFC: Adding argument allocas

----- Original Message -----
> From: "Reid Kleckner via llvm-dev" <llvm-dev at
lists.llvm.org>
> To: "llvm-dev" <llvm-dev at lists.llvm.org>
> Sent: Thursday, December 8, 2016 7:05:44 PM
> Subject: [llvm-dev] RFC: Adding argument allocas
> Clang is currently missing some low-hanging performance and code size
> opportunities when receiving function parameters. For most scalar
> parameters, Clang typically emits an alloca and stores the LLVM SSA
> argument value into the alloca to initialize it. With optimizations,
> this initialization is often removed, but it stays behind in debug
> builds and when the user takes the address of a parameter (
> https://llvm.org/bugs/show_bug.cgi?id=26328 ). In many cases, the
> memory allocation and store duplicate work already done by the
> caller.
> Case 1: The parameter is already being passed in memory. In this
> case, we waste memory and do an extra load and store to copy it into
> our own alloca. This is very common on 32-bit x86.
> Case 2: On Win64, the caller pre-allocates shadow stack slots for all
> register parameters. This allows the callee to “home” register
> parameters to ease debugging and the implementation of variadic
> functions. In this case, the store is not dead, but we fail to use
> the pre-allocated memory.
> Case 3: The parameter is a register parameter in a normal Sys-V ABI.
> In this case, nothing is wasted. Both the memory and store are
> needed.
> I think we can improve our code for cases 1 and 2 by making it
> possible to create allocas from parameters, and we can lower this
> down to a regular stack object with a store in case 3. The syntax
> for this would look like:
> define void @f(i32 %x) {
> %px = alloca i32, argument i32 %x, align 4
> The semantics are the same as the following:
> define void @f(i32 %x) {
> %px = alloca i32, align 4
> store i32 %x, i32* %px, align 4
> It is invalid to make one of these argument allocas dynamic in any
> way, either by using inalloca, using a dynamic element count, or
> being outside the entry block.
Having an alloca with an initializer seems like a reasonable enhancement.
Please, however, without all of these special restrictions: any value should be
accepted on any alloca. We can match the special argument cases in the backend
and otherwise lower to the equivalent of an alloca+store.

-Hal 
> If the semantics are the same, it begs the question, why don’t we
> pattern match the alloca and store to elide dead stores and reuse
> existing argument stack slots? My main preference for adding a new
> way to do this is that it gives us simpler and smaller code in debug
> builds, where we presumably don’t want to do this kind of pattern
> recognition. My experience with our -O0 codegen is that we do a lot
> of useless copies to initialize parameters, and this leads to larger
> output size, which slows things down. Having a more easily
> recognizable idiom for getting the storage behind parameters if it
> exists feels like a win.
> Changing the semantics of alloca affects a lot of things, but I think
> it makes sense to extend alloca here rather than a new intrinsic or
> instruction that can create local variable stack memory that passes
> would have to reason about. Here’s a list of things we’d need to
> change off the top of my head:
> 1. Inliner: When hoisting static allocas to the entry block, the
> inliner will now strip the argument operand off of allocas and
> insert the equivalent store at the inlined call site.
> 2. Mem2reg: Loading from an argument alloca needs to produce the
> argument value instead of undef.
> 3. GVN: We need to apply the same logic to GVN and similar store to
> load forwarding transforms.
> 4. Instcombine: This transform has some simple store forwarding logic
> that would need to be updated.
> I’m sure there are more, but the changes seem worth it to get at
> these low hanging opportunities.
> One other questionable side benefit of doing this is that it would
> make it possible to implement va_start by taking the address of a
> parameter and doing pointer arithmetic. While that code is fairly
> invalid, it’s exactly what the MSVC STL headers do for 32-bit x86.
> If we make this work in Clang, we can remove our stdarg.h and
> vadefs.h wrapper headers. Users often pass flags that cause clang to
> skip these wrapper headers, and then they file bugs complaining that
> va lists don't work.
> At the end of the day, this feels like a straightforward engineering
> improvement to LLVM, and it feels worth doing to me. Does anyone
> feel otherwise or have any suggestions?
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
-- 

Hal Finkel 
Lead, Compiler Technology and Programming Languages 
Leadership Computing Facility 
Argonne National Laboratory 
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20161210/e56670a8/attachment.html>

Reid Kleckner via llvm-dev

2016-Dec-12 23:03 UTC

head link

[llvm-dev] RFC: Adding argument allocas

On Sat, Dec 10, 2016 at 8:20 AM, Hal Finkel <hfinkel at anl.gov> wrote:
> Having an alloca with an initializer seems like a reasonable enhancement.
> Please, however, without all of these special restrictions: any value
> should be accepted on any alloca. We can match the special argument cases
> in the backend and otherwise lower to the equivalent of an alloca+store.
>
We can definitely generalize this to alloca initializers, but there's very
little reason for frontends to initialize allocas that aren't arguments.
Typically allocas for locals must live in the entry block, while
initialization might be buried in some conditionally executed code. A
frontend would have to know that the initializer doesn't depend on local
computation, and that hoisting the store to the entry block doesn't
pessimize the code. With arguments, the value is always available, and the
notional store will be optimized away or is needed in a debug build.

So, even though I think arbitrary alloca initializers are a reasonable
extension to the LLVM IR model, I don't see enough of a use case to add the
general feature. I'd rather have a feature that does one thing well rather
than trying to be overly general (e.g. llvm.lifetime.start).
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20161212/f2bf20bc/attachment.html>

Reid Kleckner via llvm-dev

2016-Dec-28 00:15 UTC

head link

[llvm-dev] RFC: Adding argument allocas

To recap, it seems like there are two people so far opposed to this
proposal, and tentative support from a number of others. It's not clear to
me if I should go ahead with this. I'll try to bug some more people to get
more input here.

In the meantime, I think I'll implement the simple pattern matching, since
that can be done incrementally, and can be simplified afterwards by the IR
change, or we can decide to extend it to run at -O0 or handle more complex
cases.

On Thu, Dec 8, 2016 at 5:05 PM, Reid Kleckner <rnk at google.com> wrote:
> Clang is currently missing some low-hanging performance and code size
> opportunities when receiving function parameters. For most scalar
> parameters, Clang typically emits an alloca and stores the LLVM SSA
> argument value into the alloca to initialize it. With optimizations, this
> initialization is often removed, but it stays behind in debug builds and
> when the user takes the address of a parameter (
> https://llvm.org/bugs/show_bug.cgi?id=26328). In many cases, the memory
> allocation and store duplicate work already done by the caller.
>
> Case 1: The parameter is already being passed in memory. In this case, we
> waste memory and do an extra load and store to copy it into our own alloca.
> This is very common on 32-bit x86.
>
> Case 2: On Win64, the caller pre-allocates shadow stack slots for all
> register parameters. This allows the callee to “home” register parameters
> to ease debugging and the implementation of variadic functions. In this
> case, the store is not dead, but we fail to use the pre-allocated memory.
>
> Case 3: The parameter is a register parameter in a normal Sys-V ABI. In
> this case, nothing is wasted. Both the memory and store are needed.
>
> I think we can improve our code for cases 1 and 2 by making it possible to
> create allocas from parameters, and we can lower this down to a regular
> stack object with a store in case 3. The syntax for this would look like:
> define void @f(i32 %x) {
>   %px = alloca i32, argument i32 %x, align 4
>
> The semantics are the same as the following:
> define void @f(i32 %x) {
>   %px = alloca i32, align 4
>   store i32 %x, i32* %px, align 4
>
> It is invalid to make one of these argument allocas dynamic in any way,
> either by using inalloca, using a dynamic element count, or being outside
> the entry block.
>
> If the semantics are the same, it begs the question, why don’t we pattern
> match the alloca and store to elide dead stores and reuse existing argument
> stack slots? My main preference for adding a new way to do this is that it
> gives us simpler and smaller code in debug builds, where we presumably
> don’t want to do this kind of pattern recognition. My experience with our
> -O0 codegen is that we do a lot of useless copies to initialize parameters,
> and this leads to larger output size, which slows things down. Having a
> more easily recognizable idiom for getting the storage behind parameters if
> it exists feels like a win.
>
> Changing the semantics of alloca affects a lot of things, but I think it
> makes sense to extend alloca here rather than a new intrinsic or
> instruction that can create local variable stack memory that passes would
> have to reason about. Here’s a list of things we’d need to change off the
> top of my head:
> 1. Inliner: When hoisting static allocas to the entry block, the inliner
> will now strip the argument operand off of allocas and insert the
> equivalent store at the inlined call site.
> 2. Mem2reg: Loading from an argument alloca needs to produce the argument
> value instead of undef.
> 3. GVN: We need to apply the same logic to GVN and similar store to load
> forwarding transforms.
> 4. Instcombine: This transform has some simple store forwarding logic that
> would need to be updated.
>
> I’m sure there are more, but the changes seem worth it to get at these low
> hanging opportunities.
>
> One other questionable side benefit of doing this is that it would make it
> possible to implement va_start by taking the address of a parameter and
> doing pointer arithmetic. While that code is fairly invalid, it’s exactly
> what the MSVC STL headers do for 32-bit x86. If we make this work in Clang,
> we can remove our stdarg.h and vadefs.h wrapper headers. Users often pass
> flags that cause clang to skip these wrapper headers, and then they file
> bugs complaining that va lists don't work.
>
> At the end of the day, this feels like a straightforward engineering
> improvement to LLVM, and it feels worth doing to me. Does anyone feel
> otherwise or have any suggestions?
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20161227/bd485de2/attachment.html>

Chandler Carruth via llvm-dev

2016-Dec-28 00:57 UTC

head link

[llvm-dev] RFC: Adding argument allocas

I largely agree with Eli about this...

I dislike the exact form of the current proposal because I feel like this
shouldn't be a special form of alloca. I really dislike tightly coupled but
"independent" instructions that have special rules applied to them.

But that said, there are other ways of accomplishing the same core thing
that directly change the *argument* rather than having special
allocas.>From inside the function body these look similar to byval arguments today.I think a proposal that essentially integrates an argument passing protocol
like byval more effectively into the rest of LLVM might make sense....

But I come back to Eli's point that it doesn't feel like this carries
its
weight. My feeling is that pattern matching plus passing arguments in a
more natural way (aggregates) so that pattern matching is reasonably
straight forward is likely to get the overwhelming majority of the cases
you care about. If that is the case, it doesn't seem worth adding a new IR
feature (or adding complexity to the current byval feature).

One clear way to demonstrate this feature really is needed is to explore
pattern matching and report back serious problems along that path.

Another clear way (IMO) to demonstrate this feature is worth pursuing is to
show how it will subsume and simplify the model of byval we have today.

Hope this helps,
-Chandler

On Tue, Dec 27, 2016 at 4:15 PM Reid Kleckner via llvm-dev <
llvm-dev at lists.llvm.org> wrote:
> To recap, it seems like there are two people so far opposed to this
> proposal, and tentative support from a number of others. It's not clear
to
> me if I should go ahead with this. I'll try to bug some more people to
get
> more input here.
>
> In the meantime, I think I'll implement the simple pattern matching,
since
> that can be done incrementally, and can be simplified afterwards by the IR
> change, or we can decide to extend it to run at -O0 or handle more complex
> cases.
>
> On Thu, Dec 8, 2016 at 5:05 PM, Reid Kleckner <rnk at google.com>
wrote:
>
> Clang is currently missing some low-hanging performance and code size
> opportunities when receiving function parameters. For most scalar
> parameters, Clang typically emits an alloca and stores the LLVM SSA
> argument value into the alloca to initialize it. With optimizations, this
> initialization is often removed, but it stays behind in debug builds and
> when the user takes the address of a parameter (
> https://llvm.org/bugs/show_bug.cgi?id=26328). In many cases, the memory
> allocation and store duplicate work already done by the caller.
>
> Case 1: The parameter is already being passed in memory. In this case, we
> waste memory and do an extra load and store to copy it into our own alloca.
> This is very common on 32-bit x86.
>
> Case 2: On Win64, the caller pre-allocates shadow stack slots for all
> register parameters. This allows the callee to “home” register parameters
> to ease debugging and the implementation of variadic functions. In this
> case, the store is not dead, but we fail to use the pre-allocated memory.
>
> Case 3: The parameter is a register parameter in a normal Sys-V ABI. In
> this case, nothing is wasted. Both the memory and store are needed.
>
> I think we can improve our code for cases 1 and 2 by making it possible to
> create allocas from parameters, and we can lower this down to a regular
> stack object with a store in case 3. The syntax for this would look like:
> define void @f(i32 %x) {
>   %px = alloca i32, argument i32 %x, align 4
>
> The semantics are the same as the following:
> define void @f(i32 %x) {
>   %px = alloca i32, align 4
>   store i32 %x, i32* %px, align 4
>
> It is invalid to make one of these argument allocas dynamic in any way,
> either by using inalloca, using a dynamic element count, or being outside
> the entry block.
>
> If the semantics are the same, it begs the question, why don’t we pattern
> match the alloca and store to elide dead stores and reuse existing argument
> stack slots? My main preference for adding a new way to do this is that it
> gives us simpler and smaller code in debug builds, where we presumably
> don’t want to do this kind of pattern recognition. My experience with our
> -O0 codegen is that we do a lot of useless copies to initialize parameters,
> and this leads to larger output size, which slows things down. Having a
> more easily recognizable idiom for getting the storage behind parameters if
> it exists feels like a win.
>
> Changing the semantics of alloca affects a lot of things, but I think it
> makes sense to extend alloca here rather than a new intrinsic or
> instruction that can create local variable stack memory that passes would
> have to reason about. Here’s a list of things we’d need to change off the
> top of my head:
> 1. Inliner: When hoisting static allocas to the entry block, the inliner
> will now strip the argument operand off of allocas and insert the
> equivalent store at the inlined call site.
> 2. Mem2reg: Loading from an argument alloca needs to produce the argument
> value instead of undef.
> 3. GVN: We need to apply the same logic to GVN and similar store to load
> forwarding transforms.
> 4. Instcombine: This transform has some simple store forwarding logic that
> would need to be updated.
>
> I’m sure there are more, but the changes seem worth it to get at these low
> hanging opportunities.
>
> One other questionable side benefit of doing this is that it would make it
> possible to implement va_start by taking the address of a parameter and
> doing pointer arithmetic. While that code is fairly invalid, it’s exactly
> what the MSVC STL headers do for 32-bit x86. If we make this work in Clang,
> we can remove our stdarg.h and vadefs.h wrapper headers. Users often pass
> flags that cause clang to skip these wrapper headers, and then they file
> bugs complaining that va lists don't work.
>
> At the end of the day, this feels like a straightforward engineering
> improvement to LLVM, and it feels worth doing to me. Does anyone feel
> otherwise or have any suggestions?
>
>
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20161228/0f355281/attachment.html>

Seemingly Similar Threads

Search for more seemingly similar threads

llvm dev - Dec 2016 - RFC: Adding argument allocas

[llvm-dev] RFC: Adding argument allocas

[llvm-dev] RFC: Adding argument allocas

[llvm-dev] RFC: Adding argument allocas

[llvm-dev] RFC: Adding argument allocas

[llvm-dev] RFC: Adding argument allocas

[llvm-dev] RFC: Adding argument allocas

[llvm-dev] RFC: Adding argument allocas

[llvm-dev] RFC: Adding argument allocas

[llvm-dev] RFC: Adding argument allocas

[llvm-dev] RFC: Adding argument allocas

[llvm-dev] RFC: Adding argument allocas

Seemingly Similar Threads