Louis Gerbarg
2014-May-22 22:42 UTC
[LLVMdev] RFC: Indexing of structs vs arrays in getelementpointer
Recently I posted a patch to migrate certain GEPs between basic blocks in cases where doing so would improve the ability of instcombine to merge into more complicated addressing mode (r209049 and r209065). After some build to failures it was rolled back. I now have a patch that no longer causes the regressions I was seeing, but it also no longer can optimize the case I was trying to optimize. As several other people have shown me optimization cases where the patch does work for them I am going to commit it again once I write some new test cases people who have access to the relevant (failing) configs can verify that it works for them, but I wanted to discuss the the case that I was trying to optimize, and how transforming the types of some some values would allow for the optimization to occur. It is worth noting up front that I could certainly do this optimization in a peephole, but I think by looking at some broader changes it might be possible to tease out other optimization opportunities. The problem: On x86 when you compile: unsigned testu(llvm::DenseMap<unsigned, unsigned> &dm, unsigned key) { return dm.lookup(key); } is compiled down it contains: leaq (%r8,%rdi,8), %rax movl 4(%rax), %eax when what you really want is: movl 4(%r8,%rdi,8), %eax Exactly how bad that is depends on the exact micro architecture as some machines have different cache port configs and AGU capabilities, but it isn’t good. The reason why we get into this situation is that The IR that results GEP->PHI->GEP->LOAD chain. Currently the first GEP and the GEP->LOAD are being matched separately into the leaq and movl respectively. I wrote some code to move GEPs across PHIs so that the whole GEP->GEP->LOAD could be matched, and it first glance it worked and generated the optimized movl: %struct2 = type { i32, i32 } bb1: %tmp10 = getelementptr inbounds %struct2* %tmp1, i64 %tmp9 br label %bb3 bb2: %tmp20 = getelementptr inbounds %struct2* %tmp1, i64 %tmp19 br label %bb3 bb3: %phi = phi %struct2* [ %tmp10, %bb1 ], [ %tmp20, %bb2 ] %tmp24 = getelementptr inbounds %struct2* %phi, i64 0, i32 1 %tmp25 = load i32* %tmp24, align 4 is rewritten as: %struct2 = type { i32, i32 } bb1: br label %bb3 bb2: br label %bb3 bb3: %phi = phi %struct2* [ %tmp9, %bb1 ], [ %tmp19, %bb2 ] %tno21 = getelementptr inbounds %struct2* %tmp1, i64 % phi %tmp24 = getelementptr inbounds %struct2* %phi, i64 0, i32 1 %tmp25 = load i32* %tmp24, align 4 Which eventually gets InstCombined into a single getelementptr and matched to the single movl. The problem that the above transform is technically illegal because “When indexing into a (optionally packed) structure, only i32 integer constants are allowed (when using a vector of indices they must all be the same i32 integer constant).” rule <http://llvm.org/docs/LangRef.html#getelementptr-instruction>. In practice it worked in this case, I can only presume that despite the language ref claiming that GEPs into structs require constant indexes that in practice variable indexes work so long as the elements all have a common size/alignment. Having said that it blew up in other cases, and when I added a check to prevent merging indexes that were structs it caused the above case to no longer be optimizable. What to do about it: Well, there are a couple of pragmatic solutions that I could apply. I could handle it as a peephole on x86, or do some bit casts in the IR. Before I go down that path I wanted to ask a more interesting question: Are we missing other optimization opportunities because we are using structs for homogenous data that could be expressed in terms of arrays? If we were to write an early pass that effectively rewrote structs into arrays would we see any improvements (besides the one I mentioned in this email). There are a couple of different ways of doing it: {i32, i32} has the same layout as [2 x i32] or {[2 x i32]}. My gut feeling is that [2 x i32] is the most optimizable, but that {[2 x i32]} probably is almost and good and potentially eliminates certain edge cases. How about heterogenous structs with homogenous runs: Does rewriting {i32, i32, i16} as {[2 x i32], i16} allow for better optimizations? Or maybe it makes sense to loosen the language restrictions on GEP indexes? Louis
Chandler Carruth
2014-May-22 22:51 UTC
[LLVMdev] RFC: Indexing of structs vs arrays in getelementpointer
On Thu, May 22, 2014 at 4:42 PM, Louis Gerbarg <lgg at apple.com> wrote:> The problem that the above transform is technically illegal because “When > indexing into a (optionally packed) structure, only i32 integer constants > are allowed (when using a vector of indices they must all be the same i32 > integer constant).” rule < > http://llvm.org/docs/LangRef.html#getelementptr-instruction>. >Wait, I don't follow. You don't violate this rule. The first index in GEP is *not* into the structure, it is along the implicit array of objects that any pointer refers to. Thus the i64 operand to the first GEP selects a different struct object in an array of them, and the first i64 operand to the second GEP re-selects the same struct object but then uses an i32 index into it. Perhaps you need a better example to show the illegal transform? That would help me understand the rest of your problem. -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20140522/fb4d074e/attachment.html>
Louis Gerbarg
2014-May-22 23:54 UTC
[LLVMdev] RFC: Indexing of structs vs arrays in getelementpointer
On May 22, 2014, at 3:51 PM, Chandler Carruth <chandlerc at google.com> wrote:> > On Thu, May 22, 2014 at 4:42 PM, Louis Gerbarg <lgg at apple.com> wrote: > The problem that the above transform is technically illegal because “When indexing into a (optionally packed) structure, only i32 integer constants are allowed (when using a vector of indices they must all be the same i32 integer constant).” rule <http://llvm.org/docs/LangRef.html#getelementptr-instruction>. > > Wait, I don't follow. You don't violate this rule. The first index in GEP is *not* into the structure, it is along the implicit array of objects that any pointer refers to. Thus the i64 operand to the first GEP selects a different struct object in an array of them, and the first i64 operand to the second GEP re-selects the same struct object but then uses an i32 index into it. > > Perhaps you need a better example to show the illegal transform? That would help me understand the rest of your problem.Hmm… you are correct, it turns out you are correct, I only have to worry if it is constant after the second arg of the GEP. That should allow my case to continue to work. I am currently building a stage2 now to see if works cleanly. Having said that, there still exist cases where you are indexing into a homogenous struct like so. Consider the following two functions, which are the same except one is written using structs and the other is written using arrays: %struct1 = type { i32, i32 } %struct2 = type { %struct1, %struct1 } ; Function Attrs: ssp uwtable define i32 @test1(%struct2* %dm, i1 %tmp4, i64 %tmp9, i64 %tmp19) { bb: br i1 %tmp4, label %bb1, label %bb2 bb1: ; preds = %bb5 %tmp10 = getelementptr inbounds %struct2* %dm, i64 %tmp9, i32 0 br label %bb3 bb2: ; preds = %.lr.ph.i.i %tmp20 = getelementptr inbounds %struct2* %dm, i64 %tmp9, i32 1 br label %bb3 bb3: ; preds = %bb2, %bb1 %phi = phi %struct1* [ %tmp10, %bb1 ], [ %tmp20, %bb2 ] %tmp24 = getelementptr inbounds %struct1* %phi, i64 0, i32 1 %tmp25 = load i32* %tmp24, align 4 ret i32 %tmp25 } %array1 = type [2 x i32] %array2 = type [2 x %array1] ; Function Attrs: ssp uwtable define i32 @test2(%array2* %dm, i1 %tmp4, i64 %tmp9, i64 %tmp19) { bb: br i1 %tmp4, label %bb1, label %bb2 bb1: ; preds = %bb5 %tmp10 = getelementptr inbounds %array2* %dm, i64 %tmp9, i32 0 br label %bb3 bb2: ; preds = %.lr.ph.i.i %tmp20 = getelementptr inbounds %array2* %dm, i64 %tmp19, i32 1 br label %bb3 bb3: ; preds = %bb2, %bb1 %phi = phi %array1* [ %tmp10, %bb1 ], [ %tmp20, %bb2 ] %tmp24 = getelementptr inbounds %array1* %phi, i64 0, i32 1 %tmp25 = load i32* %tmp24, align 4 ret i32 %tmp25 } The @test1 function cannot have the optimization applied because the %tmp10 and %tmp20 GEPs vary by a field index into the struct, whereas @test2 can be optimized to: define i32 @test2([2 x [2 x i32]]* %dm, i1 %tmp4, i64 %tmp9, i64 %tmp19) { bb: br i1 %tmp4, label %bb1, label %bb2 bb1: ; preds = %bb br label %bb3 bb2: ; preds = %bb br label %bb3 bb3: ; preds = %bb2, %bb1 %0 = phi i32 [ 0, %bb1 ], [ 1, %bb2 ] %tmp24 = getelementptr inbounds [2 x [2 x i32]]* %dm, i64 %tmp9, i32 %0, i32 1 %tmp25 = load i32* %tmp24, align 4 ret i32 %tmp25 } Louis -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20140522/435eb1e2/attachment.html>
Chandler Carruth
2014-May-23 04:46 UTC
[LLVMdev] RFC: Indexing of structs vs arrays in getelementpointer
Hal rightly pointed out (after some denseness on my part) that I had completely misread these two paragraphs. Sorry about that. Got distracted by understanding your example, and missed that you had already done the same analysis I had and even come to the exact same conclusion. On Thu, May 22, 2014 at 4:42 PM, Louis Gerbarg <lgg at apple.com> wrote:> Well, there are a couple of pragmatic solutions that I could apply. I > could handle it as a peephole on x86, or do some bit casts in the IR. > Before I go down that path I wanted to ask a more interesting question: Are > we missing other optimization opportunities because we are using structs > for homogenous data that could be expressed in terms of arrays? If we were > to write an early pass that effectively rewrote structs into arrays would > we see any improvements (besides the one I mentioned in this email). There > are a couple of different ways of doing it: > > {i32, i32} has the same layout as [2 x i32] or {[2 x i32]}. My gut feeling > is that [2 x i32] is the most optimizable, but that {[2 x i32]} probably is > almost and good and potentially eliminates certain edge cases. How about > heterogenous structs with homogenous runs: Does rewriting {i32, i32, i16} > as {[2 x i32], i16} allow for better optimizations? >I think that *if* this makes sense, it makes sense to phrase the transform as folding a run of the same type in a struct into an array of that type. That is, we should never remove a layer of {}s from a type because that might have alignment implications or other implications I've not thought of. But within a {}, I think it makes a *lot* of sense to have a single, canonical way to represent a sequence of N types, and [N x <type>] seems like the obvious representation, much more so than repeating the type N times. I think at least an early pass makes a lot of sense here. I wonder whether it makes sense to do something even more radical such as having the struct type for '{i32, i32, i32}' *be* the same type as '{[N x i32]}' for all N !1. I think it would be good to do structural canonicalization of this form quite early on the structural LLVM types, but I haven't even begun to think about how much code that breaks. I would suggest implementing your GEP optimization conservatively and correctly, committing it, and then resuming this thread with some numbers regarding what performance improvements can be had by doing this kind of structural canonicalization? -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20140522/c1c3f712/attachment.html>
Louis Gerbarg
2014-May-23 22:00 UTC
[LLVMdev] RFC: Indexing of structs vs arrays in getelementpointer
On May 22, 2014, at 9:46 PM, Chandler Carruth <chandlerc at google.com> wrote:> Hal rightly pointed out (after some denseness on my part) that I had completely misread these two paragraphs. Sorry about that. Got distracted by understanding your example, and missed that you had already done the same analysis I had and even come to the exact same conclusion.No worries. If anything the fact that you effectively came to the same conclusion in this way is confirmation that the idea is worth exploring.> On Thu, May 22, 2014 at 4:42 PM, Louis Gerbarg <lgg at apple.com> wrote: > Well, there are a couple of pragmatic solutions that I could apply. I could handle it as a peephole on x86, or do some bit casts in the IR. Before I go down that path I wanted to ask a more interesting question: Are we missing other optimization opportunities because we are using structs for homogenous data that could be expressed in terms of arrays? If we were to write an early pass that effectively rewrote structs into arrays would we see any improvements (besides the one I mentioned in this email). There are a couple of different ways of doing it: > > {i32, i32} has the same layout as [2 x i32] or {[2 x i32]}. My gut feeling is that [2 x i32] is the most optimizable, but that {[2 x i32]} probably is almost and good and potentially eliminates certain edge cases. How about heterogenous structs with homogenous runs: Does rewriting {i32, i32, i16} as {[2 x i32], i16} allow for better optimizations? > > I think that *if* this makes sense, it makes sense to phrase the transform as folding a run of the same type in a struct into an array of that type. That is, we should never remove a layer of {}s from a type because that might have alignment implications or other implications I've not thought of. But within a {}, I think it makes a *lot* of sense to have a single, canonical way to represent a sequence of N types, and [N x <type>] seems like the obvious representation, much more so than repeating the type N times.That seems reasonable to me.> I think at least an early pass makes a lot of sense here. I wonder whether it makes sense to do something even more radical such as having the struct type for '{i32, i32, i32}' *be* the same type as '{[N x i32]}' for all N != 1. I think it would be good to do structural canonicalization of this form quite early on the structural LLVM types, but I haven't even begun to think about how much code that breaks.Yeah. If we decided to progress down that route I think it probably make sense to start with homogenous structs as the GEP rewrites and bookkeeping are easier, and then move on from there if we see benefits.> I would suggest implementing your GEP optimization conservatively and correctly, committing it, and then resuming this thread with some numbers regarding what performance improvements can be had by doing this kind of structural canonicalization?That sounds good. I had hoped to get that done today, but between various other things coming up it might be sometime this weekend/early next week. Louis -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20140523/937e6aca/attachment.html>