Clang is currently missing some low-hanging performance and code size opportunities when receiving function parameters. For most scalar parameters, Clang typically emits an alloca and stores the LLVM SSA argument value into the alloca to initialize it. With optimizations, this initialization is often removed, but it stays behind in debug builds and when the user takes the address of a parameter ( https://llvm.org/bugs/show_bug.cgi?id=26328). In many cases, the memory allocation and store duplicate work already done by the caller. Case 1: The parameter is already being passed in memory. In this case, we waste memory and do an extra load and store to copy it into our own alloca. This is very common on 32-bit x86. Case 2: On Win64, the caller pre-allocates shadow stack slots for all register parameters. This allows the callee to “home” register parameters to ease debugging and the implementation of variadic functions. In this case, the store is not dead, but we fail to use the pre-allocated memory. Case 3: The parameter is a register parameter in a normal Sys-V ABI. In this case, nothing is wasted. Both the memory and store are needed. I think we can improve our code for cases 1 and 2 by making it possible to create allocas from parameters, and we can lower this down to a regular stack object with a store in case 3. The syntax for this would look like: define void @f(i32 %x) { %px = alloca i32, argument i32 %x, align 4 The semantics are the same as the following: define void @f(i32 %x) { %px = alloca i32, align 4 store i32 %x, i32* %px, align 4 It is invalid to make one of these argument allocas dynamic in any way, either by using inalloca, using a dynamic element count, or being outside the entry block. If the semantics are the same, it begs the question, why don’t we pattern match the alloca and store to elide dead stores and reuse existing argument stack slots? My main preference for adding a new way to do this is that it gives us simpler and smaller code in debug builds, where we presumably don’t want to do this kind of pattern recognition. My experience with our -O0 codegen is that we do a lot of useless copies to initialize parameters, and this leads to larger output size, which slows things down. Having a more easily recognizable idiom for getting the storage behind parameters if it exists feels like a win. Changing the semantics of alloca affects a lot of things, but I think it makes sense to extend alloca here rather than a new intrinsic or instruction that can create local variable stack memory that passes would have to reason about. Here’s a list of things we’d need to change off the top of my head: 1. Inliner: When hoisting static allocas to the entry block, the inliner will now strip the argument operand off of allocas and insert the equivalent store at the inlined call site. 2. Mem2reg: Loading from an argument alloca needs to produce the argument value instead of undef. 3. GVN: We need to apply the same logic to GVN and similar store to load forwarding transforms. 4. Instcombine: This transform has some simple store forwarding logic that would need to be updated. I’m sure there are more, but the changes seem worth it to get at these low hanging opportunities. One other questionable side benefit of doing this is that it would make it possible to implement va_start by taking the address of a parameter and doing pointer arithmetic. While that code is fairly invalid, it’s exactly what the MSVC STL headers do for 32-bit x86. If we make this work in Clang, we can remove our stdarg.h and vadefs.h wrapper headers. Users often pass flags that cause clang to skip these wrapper headers, and then they file bugs complaining that va lists don't work. At the end of the day, this feels like a straightforward engineering improvement to LLVM, and it feels worth doing to me. Does anyone feel otherwise or have any suggestions? -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20161208/d5357451/attachment-0001.html>
Eric Christopher via llvm-dev
2016-Dec-09 01:11 UTC
[llvm-dev] RFC: Adding argument allocas
Hi Reid, This seems pretty reasonable to me. For the debug info people: Reid and I also chatted about dbg.declare vs dbg.value as a consequence of this. His work here is going to take us closer to being able to get rid of dbg.declare, but not actually be part of it (and isn't planned - it doesn't make sense as part of this). -eric On Thu, Dec 8, 2016 at 5:06 PM Reid Kleckner via llvm-dev < llvm-dev at lists.llvm.org> wrote:> Clang is currently missing some low-hanging performance and code size > opportunities when receiving function parameters. For most scalar > parameters, Clang typically emits an alloca and stores the LLVM SSA > argument value into the alloca to initialize it. With optimizations, this > initialization is often removed, but it stays behind in debug builds and > when the user takes the address of a parameter ( > https://llvm.org/bugs/show_bug.cgi?id=26328). In many cases, the memory > allocation and store duplicate work already done by the caller. > > Case 1: The parameter is already being passed in memory. In this case, we > waste memory and do an extra load and store to copy it into our own alloca. > This is very common on 32-bit x86. > > Case 2: On Win64, the caller pre-allocates shadow stack slots for all > register parameters. This allows the callee to “home” register parameters > to ease debugging and the implementation of variadic functions. In this > case, the store is not dead, but we fail to use the pre-allocated memory. > > Case 3: The parameter is a register parameter in a normal Sys-V ABI. In > this case, nothing is wasted. Both the memory and store are needed. > > I think we can improve our code for cases 1 and 2 by making it possible to > create allocas from parameters, and we can lower this down to a regular > stack object with a store in case 3. The syntax for this would look like: > define void @f(i32 %x) { > %px = alloca i32, argument i32 %x, align 4 > > The semantics are the same as the following: > define void @f(i32 %x) { > %px = alloca i32, align 4 > store i32 %x, i32* %px, align 4 > > It is invalid to make one of these argument allocas dynamic in any way, > either by using inalloca, using a dynamic element count, or being outside > the entry block. > > If the semantics are the same, it begs the question, why don’t we pattern > match the alloca and store to elide dead stores and reuse existing argument > stack slots? My main preference for adding a new way to do this is that it > gives us simpler and smaller code in debug builds, where we presumably > don’t want to do this kind of pattern recognition. My experience with our > -O0 codegen is that we do a lot of useless copies to initialize parameters, > and this leads to larger output size, which slows things down. Having a > more easily recognizable idiom for getting the storage behind parameters if > it exists feels like a win. > > Changing the semantics of alloca affects a lot of things, but I think it > makes sense to extend alloca here rather than a new intrinsic or > instruction that can create local variable stack memory that passes would > have to reason about. Here’s a list of things we’d need to change off the > top of my head: > 1. Inliner: When hoisting static allocas to the entry block, the inliner > will now strip the argument operand off of allocas and insert the > equivalent store at the inlined call site. > 2. Mem2reg: Loading from an argument alloca needs to produce the argument > value instead of undef. > 3. GVN: We need to apply the same logic to GVN and similar store to load > forwarding transforms. > 4. Instcombine: This transform has some simple store forwarding logic that > would need to be updated. > > I’m sure there are more, but the changes seem worth it to get at these low > hanging opportunities. > > One other questionable side benefit of doing this is that it would make it > possible to implement va_start by taking the address of a parameter and > doing pointer arithmetic. While that code is fairly invalid, it’s exactly > what the MSVC STL headers do for 32-bit x86. If we make this work in Clang, > we can remove our stdarg.h and vadefs.h wrapper headers. Users often pass > flags that cause clang to skip these wrapper headers, and then they file > bugs complaining that va lists don't work. > > At the end of the day, this feels like a straightforward engineering > improvement to LLVM, and it feels worth doing to me. Does anyone feel > otherwise or have any suggestions? > _______________________________________________ > LLVM Developers mailing list > llvm-dev at lists.llvm.org > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20161209/48442fd7/attachment.html>
> On Dec 8, 2016, at 5:05 PM, Reid Kleckner via llvm-dev <llvm-dev at lists.llvm.org> wrote: > > Clang is currently missing some low-hanging performance and code size opportunities when receiving function parameters. For most scalar parameters, Clang typically emits an alloca and stores the LLVM SSA argument value into the alloca to initialize it. With optimizations, this initialization is often removed, but it stays behind in debug builds and when the user takes the address of a parameter (https://llvm.org/bugs/show_bug.cgi?id=26328 <https://llvm.org/bugs/show_bug.cgi?id=26328>). In many cases, the memory allocation and store duplicate work already done by the caller. > > Case 1: The parameter is already being passed in memory. In this case, we waste memory and do an extra load and store to copy it into our own alloca. This is very common on 32-bit x86. > > Case 2: On Win64, the caller pre-allocates shadow stack slots for all register parameters. This allows the callee to “home” register parameters to ease debugging and the implementation of variadic functions. In this case, the store is not dead, but we fail to use the pre-allocated memory. > > Case 3: The parameter is a register parameter in a normal Sys-V ABI. In this case, nothing is wasted. Both the memory and store are needed. > > I think we can improve our code for cases 1 and 2 by making it possible to create allocas from parameters, and we can lower this down to a regular stack object with a store in case 3. The syntax for this would look like: > define void @f(i32 %x) { > %px = alloca i32, argument i32 %x, align 4 > > The semantics are the same as the following: > define void @f(i32 %x) { > %px = alloca i32, align 4 > store i32 %x, i32* %px, align 4 > > It is invalid to make one of these argument allocas dynamic in any way, either by using inalloca, using a dynamic element count, or being outside the entry block. > > If the semantics are the same, it begs the question, why don’t we pattern match the alloca and store to elide dead stores and reuse existing argument stack slots? My main preference for adding a new way to do this is that it gives us simpler and smaller code in debug builds, where we presumably don’t want to do this kind of pattern recognition.So IIUC basically the *only* reason for this IR change is that we don’t want to pattern match in debug build? I don't understand right now why we wouldn’t want to do this? Thanks, — Mehdi> My experience with our -O0 codegen is that we do a lot of useless copies to initialize parameters, and this leads to larger output size, which slows things down. Having a more easily recognizable idiom for getting the storage behind parameters if it exists feels like a win. > > Changing the semantics of alloca affects a lot of things, but I think it makes sense to extend alloca here rather than a new intrinsic or instruction that can create local variable stack memory that passes would have to reason about. Here’s a list of things we’d need to change off the top of my head: > 1. Inliner: When hoisting static allocas to the entry block, the inliner will now strip the argument operand off of allocas and insert the equivalent store at the inlined call site. > 2. Mem2reg: Loading from an argument alloca needs to produce the argument value instead of undef. > 3. GVN: We need to apply the same logic to GVN and similar store to load forwarding transforms. > 4. Instcombine: This transform has some simple store forwarding logic that would need to be updated. > > I’m sure there are more, but the changes seem worth it to get at these low hanging opportunities. > > One other questionable side benefit of doing this is that it would make it possible to implement va_start by taking the address of a parameter and doing pointer arithmetic. While that code is fairly invalid, it’s exactly what the MSVC STL headers do for 32-bit x86. If we make this work in Clang, we can remove our stdarg.h and vadefs.h wrapper headers. Users often pass flags that cause clang to skip these wrapper headers, and then they file bugs complaining that va lists don't work. > > At the end of the day, this feels like a straightforward engineering improvement to LLVM, and it feels worth doing to me. Does anyone feel otherwise or have any suggestions? > _______________________________________________ > LLVM Developers mailing list > llvm-dev at lists.llvm.org > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20161208/cd47a2fb/attachment.html>
On 12/8/2016 5:05 PM, Reid Kleckner via llvm-dev wrote:> If the semantics are the same, it begs the question, why don’t we > pattern match the alloca and store to elide dead stores and reuse > existing argument stack slots? My main preference for adding a new way > to do this is that it gives us simpler and smaller code in debug > builds, where we presumably don’t want to do this kind of pattern > recognition. My experience with our -O0 codegen is that we do a lot of > useless copies to initialize parameters, and this leads to larger > output size, which slows things down. Having a more easily > recognizable idiom for getting the storage behind parameters if it > exists feels like a win.Why don't we want to do this kind of pattern recognition at -O0? We can do a fast scan which just looks at the entry block if you're concerned about compile-time.> One other questionable side benefit of doing this is that it would > make it possible to implement va_start by taking the address of a > parameter and doing pointer arithmetic. While that code is fairly > invalid, it’s exactly what the MSVC STL headers do for 32-bit x86. If > we make this work in Clang, we can remove our stdarg.h and vadefs.h > wrapper headers. Users often pass flags that cause clang to skip these > wrapper headers, and then they file bugs complaining that va lists > don't work.This ignores the other reason why this doesn't work at the moment: reading off the end of an alloca returns undef, whether or not the actual pointer value points at the data you want. I mean, it's not impossible to make it work, but it involves a bunch of unrelated changes which I don't think we want to pursue. -Eli -- Employee of Qualcomm Innovation Center, Inc. Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project
On Thu, Dec 8, 2016 at 5:37 PM, Mehdi Amini <mehdi.amini at apple.com> wrote:> So IIUC basically the *only* reason for this IR change is that we don’t > want to pattern match in debug build? > I don't understand right now why we wouldn’t want to do this? >If we need to pattern match such a basic construct, it suggests to me that we have the wrong representation, and we should instead make our representation more accurately model reality. To me, it feels like this representation allows several good things to just "fall out" without any additional work, and that suggests it's a good representation. On Thu, Dec 8, 2016 at 5:47 PM, Friedman, Eli <efriedma at codeaurora.org> wrote:> > One other questionable side benefit of doing this is that it would make it >> possible to implement va_start by taking the address of a parameter and >> doing pointer arithmetic. While that code is fairly invalid, it’s exactly >> what the MSVC STL headers do for 32-bit x86. If we make this work in Clang, >> we can remove our stdarg.h and vadefs.h wrapper headers. Users often pass >> flags that cause clang to skip these wrapper headers, and then they file >> bugs complaining that va lists don't work. >> > > This ignores the other reason why this doesn't work at the moment: reading > off the end of an alloca returns undef, whether or not the actual pointer > value points at the data you want. I mean, it's not impossible to make it > work, but it involves a bunch of unrelated changes which I don't think we > want to pursue.I guess I'm more willing to pursue those changes at some vague unspecified point in the future. It doesn't seem that crazy to add a special case to allow OOB GEPs of off the last argument alloca of a variadic function. Other OOB GEPs can continue to be UB as they are today. We have other loopholes for these kinds of OOB GEP optimizations around zero-sized globals so that people can use __start_section/__stop_section symbols to implement .init_array style registration systems. This doesn't seem that different. -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20161209/2dd39d81/attachment.html>
On Thu, Dec 8, 2016 at 5:05 PM, Reid Kleckner via llvm-dev <llvm-dev at lists.llvm.org> wrote:> If the semantics are the same, it begs the question, why don’t we pattern > match the alloca and store to elide dead stores and reuse existing argument > stack slots?One could reverse that argument: if we're going to emit the alloca+store, as we currently do, only to pattern match it away, the question is why emit the alloca+store in the first place? :-) Also, pattern-matching would need to happen in the back-end since we can't represent this in IR, right? Which means all the backends: SelectionDAG, FastISel (and GlobalISel going forward).
----- Original Message -----> From: "Eli via llvm-dev Friedman" <llvm-dev at lists.llvm.org> > To: "Reid Kleckner" <rnk at google.com>, "llvm-dev" <llvm-dev at lists.llvm.org> > Sent: Thursday, December 8, 2016 7:47:28 PM > Subject: Re: [llvm-dev] RFC: Adding argument allocas > > On 12/8/2016 5:05 PM, Reid Kleckner via llvm-dev wrote: > > If the semantics are the same, it begs the question, why don’t we > > pattern match the alloca and store to elide dead stores and reuse > > existing argument stack slots? My main preference for adding a new > > way > > to do this is that it gives us simpler and smaller code in debug > > builds, where we presumably don’t want to do this kind of pattern > > recognition. My experience with our -O0 codegen is that we do a lot > > of > > useless copies to initialize parameters, and this leads to larger > > output size, which slows things down. Having a more easily > > recognizable idiom for getting the storage behind parameters if it > > exists feels like a win. > > Why don't we want to do this kind of pattern recognition at -O0? We > can > do a fast scan which just looks at the entry block if you're > concerned > about compile-time. > > > One other questionable side benefit of doing this is that it would > > make it possible to implement va_start by taking the address of a > > parameter and doing pointer arithmetic. While that code is fairly > > invalid, it’s exactly what the MSVC STL headers do for 32-bit x86. > > If > > we make this work in Clang, we can remove our stdarg.h and vadefs.h > > wrapper headers. Users often pass flags that cause clang to skip > > these > > wrapper headers, and then they file bugs complaining that va lists > > don't work. > > This ignores the other reason why this doesn't work at the moment: > reading off the end of an alloca returns undef, whether or not the > actual pointer value points at the data you want. I mean, it's not > impossible to make it work, but it involves a bunch of unrelated > changes > which I don't think we want to pursue.I agree. Making reading past the end of an alloca defined behavior seems undesirable. -Hal> > -Eli > > -- > Employee of Qualcomm Innovation Center, Inc. > Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a > Linux Foundation Collaborative Project > > _______________________________________________ > LLVM Developers mailing list > llvm-dev at lists.llvm.org > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev >-- Hal Finkel Lead, Compiler Technology and Programming Languages Leadership Computing Facility Argonne National Laboratory
----- Original Message -----> From: "Reid Kleckner via llvm-dev" <llvm-dev at lists.llvm.org> > To: "llvm-dev" <llvm-dev at lists.llvm.org> > Sent: Thursday, December 8, 2016 7:05:44 PM > Subject: [llvm-dev] RFC: Adding argument allocas> Clang is currently missing some low-hanging performance and code size > opportunities when receiving function parameters. For most scalar > parameters, Clang typically emits an alloca and stores the LLVM SSA > argument value into the alloca to initialize it. With optimizations, > this initialization is often removed, but it stays behind in debug > builds and when the user takes the address of a parameter ( > https://llvm.org/bugs/show_bug.cgi?id=26328 ). In many cases, the > memory allocation and store duplicate work already done by the > caller.> Case 1: The parameter is already being passed in memory. In this > case, we waste memory and do an extra load and store to copy it into > our own alloca. This is very common on 32-bit x86.> Case 2: On Win64, the caller pre-allocates shadow stack slots for all > register parameters. This allows the callee to “home” register > parameters to ease debugging and the implementation of variadic > functions. In this case, the store is not dead, but we fail to use > the pre-allocated memory.> Case 3: The parameter is a register parameter in a normal Sys-V ABI. > In this case, nothing is wasted. Both the memory and store are > needed.> I think we can improve our code for cases 1 and 2 by making it > possible to create allocas from parameters, and we can lower this > down to a regular stack object with a store in case 3. The syntax > for this would look like: > define void @f(i32 %x) { > %px = alloca i32, argument i32 %x, align 4> The semantics are the same as the following: > define void @f(i32 %x) {> %px = alloca i32, align 4 > store i32 %x, i32* %px, align 4> It is invalid to make one of these argument allocas dynamic in any > way, either by using inalloca, using a dynamic element count, or > being outside the entry block.Having an alloca with an initializer seems like a reasonable enhancement. Please, however, without all of these special restrictions: any value should be accepted on any alloca. We can match the special argument cases in the backend and otherwise lower to the equivalent of an alloca+store. -Hal> If the semantics are the same, it begs the question, why don’t we > pattern match the alloca and store to elide dead stores and reuse > existing argument stack slots? My main preference for adding a new > way to do this is that it gives us simpler and smaller code in debug > builds, where we presumably don’t want to do this kind of pattern > recognition. My experience with our -O0 codegen is that we do a lot > of useless copies to initialize parameters, and this leads to larger > output size, which slows things down. Having a more easily > recognizable idiom for getting the storage behind parameters if it > exists feels like a win.> Changing the semantics of alloca affects a lot of things, but I think > it makes sense to extend alloca here rather than a new intrinsic or > instruction that can create local variable stack memory that passes > would have to reason about. Here’s a list of things we’d need to > change off the top of my head: > 1. Inliner: When hoisting static allocas to the entry block, the > inliner will now strip the argument operand off of allocas and > insert the equivalent store at the inlined call site. > 2. Mem2reg: Loading from an argument alloca needs to produce the > argument value instead of undef. > 3. GVN: We need to apply the same logic to GVN and similar store to > load forwarding transforms. > 4. Instcombine: This transform has some simple store forwarding logic > that would need to be updated.> I’m sure there are more, but the changes seem worth it to get at > these low hanging opportunities.> One other questionable side benefit of doing this is that it would > make it possible to implement va_start by taking the address of a > parameter and doing pointer arithmetic. While that code is fairly > invalid, it’s exactly what the MSVC STL headers do for 32-bit x86. > If we make this work in Clang, we can remove our stdarg.h and > vadefs.h wrapper headers. Users often pass flags that cause clang to > skip these wrapper headers, and then they file bugs complaining that > va lists don't work.> At the end of the day, this feels like a straightforward engineering > improvement to LLVM, and it feels worth doing to me. Does anyone > feel otherwise or have any suggestions? > _______________________________________________ > LLVM Developers mailing list > llvm-dev at lists.llvm.org > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev-- Hal Finkel Lead, Compiler Technology and Programming Languages Leadership Computing Facility Argonne National Laboratory -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20161210/e56670a8/attachment.html>
On Sat, Dec 10, 2016 at 8:20 AM, Hal Finkel <hfinkel at anl.gov> wrote:> Having an alloca with an initializer seems like a reasonable enhancement. > Please, however, without all of these special restrictions: any value > should be accepted on any alloca. We can match the special argument cases > in the backend and otherwise lower to the equivalent of an alloca+store. >We can definitely generalize this to alloca initializers, but there's very little reason for frontends to initialize allocas that aren't arguments. Typically allocas for locals must live in the entry block, while initialization might be buried in some conditionally executed code. A frontend would have to know that the initializer doesn't depend on local computation, and that hoisting the store to the entry block doesn't pessimize the code. With arguments, the value is always available, and the notional store will be optimized away or is needed in a debug build. So, even though I think arbitrary alloca initializers are a reasonable extension to the LLVM IR model, I don't see enough of a use case to add the general feature. I'd rather have a feature that does one thing well rather than trying to be overly general (e.g. llvm.lifetime.start). -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20161212/f2bf20bc/attachment.html>
To recap, it seems like there are two people so far opposed to this proposal, and tentative support from a number of others. It's not clear to me if I should go ahead with this. I'll try to bug some more people to get more input here. In the meantime, I think I'll implement the simple pattern matching, since that can be done incrementally, and can be simplified afterwards by the IR change, or we can decide to extend it to run at -O0 or handle more complex cases. On Thu, Dec 8, 2016 at 5:05 PM, Reid Kleckner <rnk at google.com> wrote:> Clang is currently missing some low-hanging performance and code size > opportunities when receiving function parameters. For most scalar > parameters, Clang typically emits an alloca and stores the LLVM SSA > argument value into the alloca to initialize it. With optimizations, this > initialization is often removed, but it stays behind in debug builds and > when the user takes the address of a parameter ( > https://llvm.org/bugs/show_bug.cgi?id=26328). In many cases, the memory > allocation and store duplicate work already done by the caller. > > Case 1: The parameter is already being passed in memory. In this case, we > waste memory and do an extra load and store to copy it into our own alloca. > This is very common on 32-bit x86. > > Case 2: On Win64, the caller pre-allocates shadow stack slots for all > register parameters. This allows the callee to “home” register parameters > to ease debugging and the implementation of variadic functions. In this > case, the store is not dead, but we fail to use the pre-allocated memory. > > Case 3: The parameter is a register parameter in a normal Sys-V ABI. In > this case, nothing is wasted. Both the memory and store are needed. > > I think we can improve our code for cases 1 and 2 by making it possible to > create allocas from parameters, and we can lower this down to a regular > stack object with a store in case 3. The syntax for this would look like: > define void @f(i32 %x) { > %px = alloca i32, argument i32 %x, align 4 > > The semantics are the same as the following: > define void @f(i32 %x) { > %px = alloca i32, align 4 > store i32 %x, i32* %px, align 4 > > It is invalid to make one of these argument allocas dynamic in any way, > either by using inalloca, using a dynamic element count, or being outside > the entry block. > > If the semantics are the same, it begs the question, why don’t we pattern > match the alloca and store to elide dead stores and reuse existing argument > stack slots? My main preference for adding a new way to do this is that it > gives us simpler and smaller code in debug builds, where we presumably > don’t want to do this kind of pattern recognition. My experience with our > -O0 codegen is that we do a lot of useless copies to initialize parameters, > and this leads to larger output size, which slows things down. Having a > more easily recognizable idiom for getting the storage behind parameters if > it exists feels like a win. > > Changing the semantics of alloca affects a lot of things, but I think it > makes sense to extend alloca here rather than a new intrinsic or > instruction that can create local variable stack memory that passes would > have to reason about. Here’s a list of things we’d need to change off the > top of my head: > 1. Inliner: When hoisting static allocas to the entry block, the inliner > will now strip the argument operand off of allocas and insert the > equivalent store at the inlined call site. > 2. Mem2reg: Loading from an argument alloca needs to produce the argument > value instead of undef. > 3. GVN: We need to apply the same logic to GVN and similar store to load > forwarding transforms. > 4. Instcombine: This transform has some simple store forwarding logic that > would need to be updated. > > I’m sure there are more, but the changes seem worth it to get at these low > hanging opportunities. > > One other questionable side benefit of doing this is that it would make it > possible to implement va_start by taking the address of a parameter and > doing pointer arithmetic. While that code is fairly invalid, it’s exactly > what the MSVC STL headers do for 32-bit x86. If we make this work in Clang, > we can remove our stdarg.h and vadefs.h wrapper headers. Users often pass > flags that cause clang to skip these wrapper headers, and then they file > bugs complaining that va lists don't work. > > At the end of the day, this feels like a straightforward engineering > improvement to LLVM, and it feels worth doing to me. Does anyone feel > otherwise or have any suggestions? >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20161227/bd485de2/attachment.html>
Chandler Carruth via llvm-dev
2016-Dec-28 00:57 UTC
[llvm-dev] RFC: Adding argument allocas
I largely agree with Eli about this... I dislike the exact form of the current proposal because I feel like this shouldn't be a special form of alloca. I really dislike tightly coupled but "independent" instructions that have special rules applied to them. But that said, there are other ways of accomplishing the same core thing that directly change the *argument* rather than having special allocas.>From inside the function body these look similar to byval arguments today.I think a proposal that essentially integrates an argument passing protocol like byval more effectively into the rest of LLVM might make sense.... But I come back to Eli's point that it doesn't feel like this carries its weight. My feeling is that pattern matching plus passing arguments in a more natural way (aggregates) so that pattern matching is reasonably straight forward is likely to get the overwhelming majority of the cases you care about. If that is the case, it doesn't seem worth adding a new IR feature (or adding complexity to the current byval feature). One clear way to demonstrate this feature really is needed is to explore pattern matching and report back serious problems along that path. Another clear way (IMO) to demonstrate this feature is worth pursuing is to show how it will subsume and simplify the model of byval we have today. Hope this helps, -Chandler On Tue, Dec 27, 2016 at 4:15 PM Reid Kleckner via llvm-dev < llvm-dev at lists.llvm.org> wrote:> To recap, it seems like there are two people so far opposed to this > proposal, and tentative support from a number of others. It's not clear to > me if I should go ahead with this. I'll try to bug some more people to get > more input here. > > In the meantime, I think I'll implement the simple pattern matching, since > that can be done incrementally, and can be simplified afterwards by the IR > change, or we can decide to extend it to run at -O0 or handle more complex > cases. > > On Thu, Dec 8, 2016 at 5:05 PM, Reid Kleckner <rnk at google.com> wrote: > > Clang is currently missing some low-hanging performance and code size > opportunities when receiving function parameters. For most scalar > parameters, Clang typically emits an alloca and stores the LLVM SSA > argument value into the alloca to initialize it. With optimizations, this > initialization is often removed, but it stays behind in debug builds and > when the user takes the address of a parameter ( > https://llvm.org/bugs/show_bug.cgi?id=26328). In many cases, the memory > allocation and store duplicate work already done by the caller. > > Case 1: The parameter is already being passed in memory. In this case, we > waste memory and do an extra load and store to copy it into our own alloca. > This is very common on 32-bit x86. > > Case 2: On Win64, the caller pre-allocates shadow stack slots for all > register parameters. This allows the callee to “home” register parameters > to ease debugging and the implementation of variadic functions. In this > case, the store is not dead, but we fail to use the pre-allocated memory. > > Case 3: The parameter is a register parameter in a normal Sys-V ABI. In > this case, nothing is wasted. Both the memory and store are needed. > > I think we can improve our code for cases 1 and 2 by making it possible to > create allocas from parameters, and we can lower this down to a regular > stack object with a store in case 3. The syntax for this would look like: > define void @f(i32 %x) { > %px = alloca i32, argument i32 %x, align 4 > > The semantics are the same as the following: > define void @f(i32 %x) { > %px = alloca i32, align 4 > store i32 %x, i32* %px, align 4 > > It is invalid to make one of these argument allocas dynamic in any way, > either by using inalloca, using a dynamic element count, or being outside > the entry block. > > If the semantics are the same, it begs the question, why don’t we pattern > match the alloca and store to elide dead stores and reuse existing argument > stack slots? My main preference for adding a new way to do this is that it > gives us simpler and smaller code in debug builds, where we presumably > don’t want to do this kind of pattern recognition. My experience with our > -O0 codegen is that we do a lot of useless copies to initialize parameters, > and this leads to larger output size, which slows things down. Having a > more easily recognizable idiom for getting the storage behind parameters if > it exists feels like a win. > > Changing the semantics of alloca affects a lot of things, but I think it > makes sense to extend alloca here rather than a new intrinsic or > instruction that can create local variable stack memory that passes would > have to reason about. Here’s a list of things we’d need to change off the > top of my head: > 1. Inliner: When hoisting static allocas to the entry block, the inliner > will now strip the argument operand off of allocas and insert the > equivalent store at the inlined call site. > 2. Mem2reg: Loading from an argument alloca needs to produce the argument > value instead of undef. > 3. GVN: We need to apply the same logic to GVN and similar store to load > forwarding transforms. > 4. Instcombine: This transform has some simple store forwarding logic that > would need to be updated. > > I’m sure there are more, but the changes seem worth it to get at these low > hanging opportunities. > > One other questionable side benefit of doing this is that it would make it > possible to implement va_start by taking the address of a parameter and > doing pointer arithmetic. While that code is fairly invalid, it’s exactly > what the MSVC STL headers do for 32-bit x86. If we make this work in Clang, > we can remove our stdarg.h and vadefs.h wrapper headers. Users often pass > flags that cause clang to skip these wrapper headers, and then they file > bugs complaining that va lists don't work. > > At the end of the day, this feels like a straightforward engineering > improvement to LLVM, and it feels worth doing to me. Does anyone feel > otherwise or have any suggestions? > > > _______________________________________________ > LLVM Developers mailing list > llvm-dev at lists.llvm.org > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20161228/0f355281/attachment.html>