Louis Gerbarg
2014-May-22 22:42 UTC
[LLVMdev] RFC: Indexing of structs vs arrays in getelementpointer
Recently I posted a patch to migrate certain GEPs between basic blocks in cases
where doing so would improve the ability of instcombine to merge into more
complicated addressing mode (r209049 and r209065). After some build to failures
it was rolled back. I now have a patch that no longer causes the regressions I
was seeing, but it also no longer can optimize the case I was trying to
optimize. As several other people have shown me optimization cases where the
patch does work for them I am going to commit it again once I write some new
test cases people who have access to the relevant (failing) configs can verify
that it works for them, but I wanted to discuss the the case that I was trying
to optimize, and how transforming the types of some some values would allow for
the optimization to occur.
It is worth noting up front that I could certainly do this optimization in a
peephole, but I think by looking at some broader changes it might be possible to
tease out other optimization opportunities.
The problem:
On x86 when you compile:
unsigned testu(llvm::DenseMap<unsigned, unsigned> &dm, unsigned key)
{
return dm.lookup(key);
}
is compiled down it contains:
leaq (%r8,%rdi,8), %rax
movl 4(%rax), %eax
when what you really want is:
movl 4(%r8,%rdi,8), %eax
Exactly how bad that is depends on the exact micro architecture as some machines
have different cache port configs and AGU capabilities, but it isn’t good. The
reason why we get into this situation is that The IR that results
GEP->PHI->GEP->LOAD chain. Currently the first GEP and the GEP->LOAD
are being matched separately into the leaq and movl respectively. I wrote some
code to move GEPs across PHIs so that the whole GEP->GEP->LOAD could be
matched, and it first glance it worked and generated the optimized movl:
%struct2 = type { i32, i32 }
bb1:
%tmp10 = getelementptr inbounds %struct2* %tmp1, i64 %tmp9
br label %bb3
bb2:
%tmp20 = getelementptr inbounds %struct2* %tmp1, i64 %tmp19
br label %bb3
bb3:
%phi = phi %struct2* [ %tmp10, %bb1 ], [ %tmp20, %bb2 ]
%tmp24 = getelementptr inbounds %struct2* %phi, i64 0, i32 1
%tmp25 = load i32* %tmp24, align 4
is rewritten as:
%struct2 = type { i32, i32 }
bb1:
br label %bb3
bb2:
br label %bb3
bb3:
%phi = phi %struct2* [ %tmp9, %bb1 ], [ %tmp19, %bb2 ]
%tno21 = getelementptr inbounds %struct2* %tmp1, i64 % phi
%tmp24 = getelementptr inbounds %struct2* %phi, i64 0, i32 1
%tmp25 = load i32* %tmp24, align 4
Which eventually gets InstCombined into a single getelementptr and matched to
the single movl.
The problem that the above transform is technically illegal because “When
indexing into a (optionally packed) structure, only i32 integer constants are
allowed (when using a vector of indices they must all be the same i32 integer
constant).” rule
<http://llvm.org/docs/LangRef.html#getelementptr-instruction>. In practice
it worked in this case, I can only presume that despite the language ref
claiming that GEPs into structs require constant indexes that in practice
variable indexes work so long as the elements all have a common size/alignment.
Having said that it blew up in other cases, and when I added a check to prevent
merging indexes that were structs it caused the above case to no longer be
optimizable.
What to do about it:
Well, there are a couple of pragmatic solutions that I could apply. I could
handle it as a peephole on x86, or do some bit casts in the IR. Before I go down
that path I wanted to ask a more interesting question: Are we missing other
optimization opportunities because we are using structs for homogenous data that
could be expressed in terms of arrays? If we were to write an early pass that
effectively rewrote structs into arrays would we see any improvements (besides
the one I mentioned in this email). There are a couple of different ways of
doing it:
{i32, i32} has the same layout as [2 x i32] or {[2 x i32]}. My gut feeling is
that [2 x i32] is the most optimizable, but that {[2 x i32]} probably is almost
and good and potentially eliminates certain edge cases. How about heterogenous
structs with homogenous runs: Does rewriting {i32, i32, i16} as {[2 x i32], i16}
allow for better optimizations? Or maybe it makes sense to loosen the language
restrictions on GEP indexes?
Louis
Chandler Carruth
2014-May-22 22:51 UTC
[LLVMdev] RFC: Indexing of structs vs arrays in getelementpointer
On Thu, May 22, 2014 at 4:42 PM, Louis Gerbarg <lgg at apple.com> wrote:> The problem that the above transform is technically illegal because “When > indexing into a (optionally packed) structure, only i32 integer constants > are allowed (when using a vector of indices they must all be the same i32 > integer constant).” rule < > http://llvm.org/docs/LangRef.html#getelementptr-instruction>. >Wait, I don't follow. You don't violate this rule. The first index in GEP is *not* into the structure, it is along the implicit array of objects that any pointer refers to. Thus the i64 operand to the first GEP selects a different struct object in an array of them, and the first i64 operand to the second GEP re-selects the same struct object but then uses an i32 index into it. Perhaps you need a better example to show the illegal transform? That would help me understand the rest of your problem. -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20140522/fb4d074e/attachment.html>
Louis Gerbarg
2014-May-22 23:54 UTC
[LLVMdev] RFC: Indexing of structs vs arrays in getelementpointer
On May 22, 2014, at 3:51 PM, Chandler Carruth <chandlerc at google.com> wrote:> > On Thu, May 22, 2014 at 4:42 PM, Louis Gerbarg <lgg at apple.com> wrote: > The problem that the above transform is technically illegal because “When indexing into a (optionally packed) structure, only i32 integer constants are allowed (when using a vector of indices they must all be the same i32 integer constant).” rule <http://llvm.org/docs/LangRef.html#getelementptr-instruction>. > > Wait, I don't follow. You don't violate this rule. The first index in GEP is *not* into the structure, it is along the implicit array of objects that any pointer refers to. Thus the i64 operand to the first GEP selects a different struct object in an array of them, and the first i64 operand to the second GEP re-selects the same struct object but then uses an i32 index into it. > > Perhaps you need a better example to show the illegal transform? That would help me understand the rest of your problem.Hmm… you are correct, it turns out you are correct, I only have to worry if it is constant after the second arg of the GEP. That should allow my case to continue to work. I am currently building a stage2 now to see if works cleanly. Having said that, there still exist cases where you are indexing into a homogenous struct like so. Consider the following two functions, which are the same except one is written using structs and the other is written using arrays: %struct1 = type { i32, i32 } %struct2 = type { %struct1, %struct1 } ; Function Attrs: ssp uwtable define i32 @test1(%struct2* %dm, i1 %tmp4, i64 %tmp9, i64 %tmp19) { bb: br i1 %tmp4, label %bb1, label %bb2 bb1: ; preds = %bb5 %tmp10 = getelementptr inbounds %struct2* %dm, i64 %tmp9, i32 0 br label %bb3 bb2: ; preds = %.lr.ph.i.i %tmp20 = getelementptr inbounds %struct2* %dm, i64 %tmp9, i32 1 br label %bb3 bb3: ; preds = %bb2, %bb1 %phi = phi %struct1* [ %tmp10, %bb1 ], [ %tmp20, %bb2 ] %tmp24 = getelementptr inbounds %struct1* %phi, i64 0, i32 1 %tmp25 = load i32* %tmp24, align 4 ret i32 %tmp25 } %array1 = type [2 x i32] %array2 = type [2 x %array1] ; Function Attrs: ssp uwtable define i32 @test2(%array2* %dm, i1 %tmp4, i64 %tmp9, i64 %tmp19) { bb: br i1 %tmp4, label %bb1, label %bb2 bb1: ; preds = %bb5 %tmp10 = getelementptr inbounds %array2* %dm, i64 %tmp9, i32 0 br label %bb3 bb2: ; preds = %.lr.ph.i.i %tmp20 = getelementptr inbounds %array2* %dm, i64 %tmp19, i32 1 br label %bb3 bb3: ; preds = %bb2, %bb1 %phi = phi %array1* [ %tmp10, %bb1 ], [ %tmp20, %bb2 ] %tmp24 = getelementptr inbounds %array1* %phi, i64 0, i32 1 %tmp25 = load i32* %tmp24, align 4 ret i32 %tmp25 } The @test1 function cannot have the optimization applied because the %tmp10 and %tmp20 GEPs vary by a field index into the struct, whereas @test2 can be optimized to: define i32 @test2([2 x [2 x i32]]* %dm, i1 %tmp4, i64 %tmp9, i64 %tmp19) { bb: br i1 %tmp4, label %bb1, label %bb2 bb1: ; preds = %bb br label %bb3 bb2: ; preds = %bb br label %bb3 bb3: ; preds = %bb2, %bb1 %0 = phi i32 [ 0, %bb1 ], [ 1, %bb2 ] %tmp24 = getelementptr inbounds [2 x [2 x i32]]* %dm, i64 %tmp9, i32 %0, i32 1 %tmp25 = load i32* %tmp24, align 4 ret i32 %tmp25 } Louis -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20140522/435eb1e2/attachment.html>
Chandler Carruth
2014-May-23 04:46 UTC
[LLVMdev] RFC: Indexing of structs vs arrays in getelementpointer
Hal rightly pointed out (after some denseness on my part) that I had completely misread these two paragraphs. Sorry about that. Got distracted by understanding your example, and missed that you had already done the same analysis I had and even come to the exact same conclusion. On Thu, May 22, 2014 at 4:42 PM, Louis Gerbarg <lgg at apple.com> wrote:> Well, there are a couple of pragmatic solutions that I could apply. I > could handle it as a peephole on x86, or do some bit casts in the IR. > Before I go down that path I wanted to ask a more interesting question: Are > we missing other optimization opportunities because we are using structs > for homogenous data that could be expressed in terms of arrays? If we were > to write an early pass that effectively rewrote structs into arrays would > we see any improvements (besides the one I mentioned in this email). There > are a couple of different ways of doing it: > > {i32, i32} has the same layout as [2 x i32] or {[2 x i32]}. My gut feeling > is that [2 x i32] is the most optimizable, but that {[2 x i32]} probably is > almost and good and potentially eliminates certain edge cases. How about > heterogenous structs with homogenous runs: Does rewriting {i32, i32, i16} > as {[2 x i32], i16} allow for better optimizations? >I think that *if* this makes sense, it makes sense to phrase the transform as folding a run of the same type in a struct into an array of that type. That is, we should never remove a layer of {}s from a type because that might have alignment implications or other implications I've not thought of. But within a {}, I think it makes a *lot* of sense to have a single, canonical way to represent a sequence of N types, and [N x <type>] seems like the obvious representation, much more so than repeating the type N times. I think at least an early pass makes a lot of sense here. I wonder whether it makes sense to do something even more radical such as having the struct type for '{i32, i32, i32}' *be* the same type as '{[N x i32]}' for all N !1. I think it would be good to do structural canonicalization of this form quite early on the structural LLVM types, but I haven't even begun to think about how much code that breaks. I would suggest implementing your GEP optimization conservatively and correctly, committing it, and then resuming this thread with some numbers regarding what performance improvements can be had by doing this kind of structural canonicalization? -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20140522/c1c3f712/attachment.html>
Louis Gerbarg
2014-May-23 22:00 UTC
[LLVMdev] RFC: Indexing of structs vs arrays in getelementpointer
On May 22, 2014, at 9:46 PM, Chandler Carruth <chandlerc at google.com> wrote:> Hal rightly pointed out (after some denseness on my part) that I had completely misread these two paragraphs. Sorry about that. Got distracted by understanding your example, and missed that you had already done the same analysis I had and even come to the exact same conclusion.No worries. If anything the fact that you effectively came to the same conclusion in this way is confirmation that the idea is worth exploring.> On Thu, May 22, 2014 at 4:42 PM, Louis Gerbarg <lgg at apple.com> wrote: > Well, there are a couple of pragmatic solutions that I could apply. I could handle it as a peephole on x86, or do some bit casts in the IR. Before I go down that path I wanted to ask a more interesting question: Are we missing other optimization opportunities because we are using structs for homogenous data that could be expressed in terms of arrays? If we were to write an early pass that effectively rewrote structs into arrays would we see any improvements (besides the one I mentioned in this email). There are a couple of different ways of doing it: > > {i32, i32} has the same layout as [2 x i32] or {[2 x i32]}. My gut feeling is that [2 x i32] is the most optimizable, but that {[2 x i32]} probably is almost and good and potentially eliminates certain edge cases. How about heterogenous structs with homogenous runs: Does rewriting {i32, i32, i16} as {[2 x i32], i16} allow for better optimizations? > > I think that *if* this makes sense, it makes sense to phrase the transform as folding a run of the same type in a struct into an array of that type. That is, we should never remove a layer of {}s from a type because that might have alignment implications or other implications I've not thought of. But within a {}, I think it makes a *lot* of sense to have a single, canonical way to represent a sequence of N types, and [N x <type>] seems like the obvious representation, much more so than repeating the type N times.That seems reasonable to me.> I think at least an early pass makes a lot of sense here. I wonder whether it makes sense to do something even more radical such as having the struct type for '{i32, i32, i32}' *be* the same type as '{[N x i32]}' for all N != 1. I think it would be good to do structural canonicalization of this form quite early on the structural LLVM types, but I haven't even begun to think about how much code that breaks.Yeah. If we decided to progress down that route I think it probably make sense to start with homogenous structs as the GEP rewrites and bookkeeping are easier, and then move on from there if we see benefits.> I would suggest implementing your GEP optimization conservatively and correctly, committing it, and then resuming this thread with some numbers regarding what performance improvements can be had by doing this kind of structural canonicalization?That sounds good. I had hoped to get that done today, but between various other things coming up it might be sometime this weekend/early next week. Louis -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20140523/937e6aca/attachment.html>
Apparently Analagous Threads
- [LLVMdev] RFC: Indexing of structs vs arrays in getelementpointer
- [LLVMdev] RFC: Indexing of structs vs arrays in getelementpointer
- [LLVMdev] ARM backend problem ?
- [LLVMdev] ARM backend problem ?
- [LLVMdev] Hoisting elements of array argument into registers