thr3ads.net - llvm dev - [LLVMdev] RFC: Indexing of structs vs arrays in getelementpointer [May 2014]

If this information is useful, please help other people find it:
Share via:

Louis Gerbarg

2014-May-22 22:42 UTC

[LLVMdev] RFC: Indexing of structs vs arrays in getelementpointer

Recently I posted a patch to migrate certain GEPs between basic blocks in cases
where doing so would improve the ability of instcombine to merge into more
complicated addressing mode (r209049 and r209065). After some build to failures
it was rolled back. I now have a patch that no longer causes the regressions I
was seeing, but it also no longer can optimize the case I was trying to
optimize. As several other people have shown me optimization cases where the
patch does work for them I am going to commit it again once I write some new
test cases people who have access to the relevant (failing) configs can verify
that it works for them, but I wanted to discuss the the case that I was trying
to optimize, and how transforming the types of some some values would allow for
the optimization to occur.

It is worth noting up front that I could certainly do this optimization in a
peephole, but I think by looking at some broader changes it might be possible to
tease out other optimization opportunities.

The problem:

On x86 when you compile:

  unsigned testu(llvm::DenseMap<unsigned, unsigned> &dm, unsigned key)
{
    return dm.lookup(key);
  }

is compiled down it contains:

  leaq (%r8,%rdi,8), %rax
  movl 4(%rax), %eax

when what you really want is:

  movl	4(%r8,%rdi,8), %eax

Exactly how bad that is depends on the exact micro architecture as some machines
have different cache port configs and AGU capabilities, but it isn’t good. The
reason why we get into this situation is that The IR that results
GEP->PHI->GEP->LOAD chain. Currently the first GEP and the GEP->LOAD
are being matched separately into the leaq and movl respectively. I wrote some
code to move GEPs across PHIs so that the whole GEP->GEP->LOAD could be
matched, and it first glance it worked and generated the optimized movl:

  %struct2 = type { i32, i32 }
  bb1:
    %tmp10 = getelementptr inbounds %struct2* %tmp1, i64 %tmp9
    br label %bb3
                                                                                
  bb2:
    %tmp20 = getelementptr inbounds %struct2* %tmp1, i64 %tmp19
    br label %bb3
                                                                                
  bb3:
    %phi = phi %struct2* [ %tmp10, %bb1 ], [ %tmp20, %bb2 ]
    %tmp24 = getelementptr inbounds %struct2* %phi, i64 0, i32 1
    %tmp25 = load i32* %tmp24, align 4

is rewritten as:

  %struct2 = type { i32, i32 }
  bb1:
    br label %bb3
                                                                                
  bb2:
    br label %bb3
                                                                                
  bb3:
    %phi = phi %struct2* [ %tmp9, %bb1 ], [ %tmp19, %bb2 ]     
    %tno21 = getelementptr inbounds %struct2* %tmp1, i64 % phi
    %tmp24 = getelementptr inbounds %struct2* %phi, i64 0, i32 1
    %tmp25 = load i32* %tmp24, align 4 

Which eventually gets InstCombined into a single getelementptr and matched to
the single movl.

The problem that the above transform is technically illegal because “When
indexing into a (optionally packed) structure, only i32 integer constants are
allowed (when using a vector of indices they must all be the same i32 integer
constant).” rule
<http://llvm.org/docs/LangRef.html#getelementptr-instruction>. In practice
it worked in this case, I can only presume that despite the language ref
claiming that GEPs into structs require constant indexes that in practice
variable indexes work so long as the elements all have a common size/alignment.
Having said that it blew up in other cases, and when I added a check to prevent
merging indexes that were structs it caused the above case to no longer be
optimizable.

What to do about it:

Well, there are a couple of pragmatic solutions that I could apply. I could
handle it as a peephole on x86, or do some bit casts in the IR. Before I go down
that path I wanted to ask a more interesting question: Are we missing other
optimization opportunities because we are using structs for homogenous data that
could be expressed in terms of arrays? If we were to write an early pass that
effectively rewrote structs into arrays would we see any improvements (besides
the one I mentioned in this email). There are a couple of different ways of
doing it:

{i32, i32} has the same layout as [2 x i32] or {[2 x i32]}. My gut feeling is
that [2 x i32] is the most optimizable, but that {[2 x i32]} probably is almost
and good and potentially eliminates certain edge cases. How about heterogenous
structs with homogenous runs: Does rewriting {i32, i32, i16} as {[2 x i32], i16}
allow for better optimizations? Or maybe it makes sense to loosen the language
restrictions on GEP indexes?

Louis

Chandler Carruth

2014-May-22 22:51 UTC

head link

[LLVMdev] RFC: Indexing of structs vs arrays in getelementpointer

On Thu, May 22, 2014 at 4:42 PM, Louis Gerbarg <lgg at apple.com> wrote:
> The problem that the above transform is technically illegal because “When
> indexing into a (optionally packed) structure, only i32 integer constants
> are allowed (when using a vector of indices they must all be the same i32
> integer constant).” rule <
> http://llvm.org/docs/LangRef.html#getelementptr-instruction>.
>
Wait, I don't follow. You don't violate this rule. The first index in
GEP
is *not* into the structure, it is along the implicit array of objects that
any pointer refers to. Thus the i64 operand to the first GEP selects a
different struct object in an array of them, and the first i64 operand to
the second GEP re-selects the same struct object but then uses an i32 index
into it.

Perhaps you need a better example to show the illegal transform? That would
help me understand the rest of your problem.
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20140522/fb4d074e/attachment.html>

Louis Gerbarg

2014-May-22 23:54 UTC

head link

[LLVMdev] RFC: Indexing of structs vs arrays in getelementpointer

On May 22, 2014, at 3:51 PM, Chandler Carruth <chandlerc at google.com>
wrote:
> 
> On Thu, May 22, 2014 at 4:42 PM, Louis Gerbarg <lgg at apple.com>
wrote:
> The problem that the above transform is technically illegal because “When
indexing into a (optionally packed) structure, only i32 integer constants are
allowed (when using a vector of indices they must all be the same i32 integer
constant).” rule
<http://llvm.org/docs/LangRef.html#getelementptr-instruction>.
> 
> Wait, I don't follow. You don't violate this rule. The first index
in GEP is *not* into the structure, it is along the implicit array of objects
that any pointer refers to. Thus the i64 operand to the first GEP selects a
different struct object in an array of them, and the first i64 operand to the
second GEP re-selects the same struct object but then uses an i32 index into it.
> 
> Perhaps you need a better example to show the illegal transform? That would
help me understand the rest of your problem.

Hmm… you are correct, it turns out you are correct, I only have to worry if it
is constant after the second arg of the GEP. That should allow my case to
continue to work. I am currently building a stage2 now to see if works cleanly.

Having said that, there still exist cases where you are indexing into a
homogenous struct like so. Consider the following two functions, which are the
same except one is written using structs and the other is written using arrays:

%struct1 = type { i32, i32 }                                                    
%struct2 = type { %struct1, %struct1 }                                          
; Function Attrs: ssp uwtable                                                   
define i32 @test1(%struct2* %dm, i1 %tmp4, i64 %tmp9, i64 %tmp19) {             
bb:                                                                             
  br i1 %tmp4, label %bb1, label %bb2                                           
bb1:                                              ; preds = %bb5                
  %tmp10 = getelementptr inbounds %struct2* %dm, i64 %tmp9, i32 0               
  br label %bb3                                                                 
                                                                                
bb2:                                             ; preds = %.lr.ph.i.i          
  %tmp20 = getelementptr inbounds %struct2* %dm, i64 %tmp9, i32 1              
  br label %bb3                                                                 
                                                                                
bb3:                                      ; preds = %bb2, %bb1                  
  %phi = phi %struct1* [ %tmp10, %bb1 ], [ %tmp20, %bb2 ]                       
  %tmp24 = getelementptr inbounds %struct1* %phi, i64 0, i32 1                  
  %tmp25 = load i32* %tmp24, align 4                                            
  ret i32 %tmp25                                                                
}                                                                               
                                                                                
                                                                                
%array1 = type [2 x i32]                                                        
%array2 = type [2 x %array1]                                                    
                                                                                
; Function Attrs: ssp uwtable                                                   
define i32 @test2(%array2* %dm, i1 %tmp4, i64 %tmp9, i64 %tmp19) {              
bb:                                                                             
  br i1 %tmp4, label %bb1, label %bb2                                           
bb1:                                              ; preds = %bb5                
  %tmp10 = getelementptr inbounds %array2* %dm, i64 %tmp9, i32 0                
  br label %bb3                                                                 
                                                                                
bb2:                                             ; preds = %.lr.ph.i.i          
  %tmp20 = getelementptr inbounds %array2* %dm, i64 %tmp19, i32 1               
  br label %bb3                                                                 
                                                                                
bb3:                                      ; preds = %bb2, %bb1                  
  %phi = phi %array1* [ %tmp10, %bb1 ], [ %tmp20, %bb2 ]                        
  %tmp24 = getelementptr inbounds %array1* %phi, i64 0, i32 1                   
  %tmp25 = load i32* %tmp24, align 4                                            
  ret i32 %tmp25                                                                
}

The @test1 function cannot have the optimization applied because the %tmp10 and
%tmp20 GEPs vary by a field index into the struct, whereas @test2 can be
optimized to:

define i32 @test2([2 x [2 x i32]]* %dm, i1 %tmp4, i64 %tmp9, i64 %tmp19) {
bb:
  br i1 %tmp4, label %bb1, label %bb2

bb1:                                              ; preds = %bb
  br label %bb3

bb2:                                              ; preds = %bb
  br label %bb3

bb3:                                              ; preds = %bb2, %bb1
  %0 = phi i32 [ 0, %bb1 ], [ 1, %bb2 ]
  %tmp24 = getelementptr inbounds [2 x [2 x i32]]* %dm, i64 %tmp9, i32 %0, i32 1
  %tmp25 = load i32* %tmp24, align 4
  ret i32 %tmp25
}

Louis


-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20140522/435eb1e2/attachment.html>

Chandler Carruth

2014-May-23 04:46 UTC

head link

[LLVMdev] RFC: Indexing of structs vs arrays in getelementpointer

Hal rightly pointed out (after some denseness on my part) that I had
completely misread these two paragraphs. Sorry about that. Got distracted
by understanding your example, and missed that you had already done the
same analysis I had and even come to the exact same conclusion.

On Thu, May 22, 2014 at 4:42 PM, Louis Gerbarg <lgg at apple.com> wrote:
> Well, there are a couple of pragmatic solutions that I could apply. I
> could handle it as a peephole on x86, or do some bit casts in the IR.
> Before I go down that path I wanted to ask a more interesting question: Are
> we missing other optimization opportunities because we are using structs
> for homogenous data that could be expressed in terms of arrays? If we were
> to write an early pass that effectively rewrote structs into arrays would
> we see any improvements (besides the one I mentioned in this email). There
> are a couple of different ways of doing it:
>
> {i32, i32} has the same layout as [2 x i32] or {[2 x i32]}. My gut feeling
> is that [2 x i32] is the most optimizable, but that {[2 x i32]} probably is
> almost and good and potentially eliminates certain edge cases. How about
> heterogenous structs with homogenous runs: Does rewriting {i32, i32, i16}
> as {[2 x i32], i16} allow for better optimizations?
>
I think that *if* this makes sense, it makes sense to phrase the transform
as folding a run of the same type in a struct into an array of that type.
That is, we should never remove a layer of {}s from a type because that
might have alignment implications or other implications I've not thought
of. But within a {}, I think it makes a *lot* of sense to have a single,
canonical way to represent a sequence of N types, and [N x <type>] seems
like the obvious representation, much more so than repeating the type N
times.

I think at least an early pass makes a lot of sense here. I wonder whether
it makes sense to do something even more radical such as having the struct
type for '{i32, i32, i32}' *be* the same type as '{[N x i32]}'
for all N !1. I think it would be good to do structural canonicalization of this
form
quite early on the structural LLVM types, but I haven't even begun to think
about how much code that breaks.

I would suggest implementing your GEP optimization conservatively and
correctly, committing it, and then resuming this thread with some numbers
regarding what performance improvements can be had by doing this kind of
structural canonicalization?
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20140522/c1c3f712/attachment.html>

Louis Gerbarg

2014-May-23 22:00 UTC

head link

[LLVMdev] RFC: Indexing of structs vs arrays in getelementpointer

On May 22, 2014, at 9:46 PM, Chandler Carruth <chandlerc at google.com>
wrote:
> Hal rightly pointed out (after some denseness on my part) that I had
completely misread these two paragraphs. Sorry about that. Got distracted by
understanding your example, and missed that you had already done the same
analysis I had and even come to the exact same conclusion.
No worries. If anything the fact that you effectively came to the same
conclusion in this way is confirmation that the idea is worth exploring.
> On Thu, May 22, 2014 at 4:42 PM, Louis Gerbarg <lgg at apple.com>
wrote:
> Well, there are a couple of pragmatic solutions that I could apply. I could
handle it as a peephole on x86, or do some bit casts in the IR. Before I go down
that path I wanted to ask a more interesting question: Are we missing other
optimization opportunities because we are using structs for homogenous data that
could be expressed in terms of arrays? If we were to write an early pass that
effectively rewrote structs into arrays would we see any improvements (besides
the one I mentioned in this email). There are a couple of different ways of
doing it:
> 
> {i32, i32} has the same layout as [2 x i32] or {[2 x i32]}. My gut feeling
is that [2 x i32] is the most optimizable, but that {[2 x i32]} probably is
almost and good and potentially eliminates certain edge cases. How about
heterogenous structs with homogenous runs: Does rewriting {i32, i32, i16} as {[2
x i32], i16} allow for better optimizations?
> 
> I think that *if* this makes sense, it makes sense to phrase the transform
as folding a run of the same type in a struct into an array of that type. That
is, we should never remove a layer of {}s from a type because that might have
alignment implications or other implications I've not thought of. But within
a {}, I think it makes a *lot* of sense to have a single, canonical way to
represent a sequence of N types, and [N x <type>] seems like the obvious
representation, much more so than repeating the type N times.
That seems reasonable to me. 
> I think at least an early pass makes a lot of sense here. I wonder whether
it makes sense to do something even more radical such as having the struct type
for '{i32, i32, i32}' *be* the same type as '{[N x i32]}' for
all N != 1. I think it would be good to do structural canonicalization of this
form quite early on the structural LLVM types, but I haven't even begun to
think about how much code that breaks.
Yeah. If we decided to progress down that route I think it probably make sense
to start with homogenous structs as the GEP rewrites and bookkeeping are easier,
and then move on from there if we see benefits.
> I would suggest implementing your GEP optimization conservatively and
correctly, committing it, and then resuming this thread with some numbers
regarding what performance improvements can be had by doing this kind of
structural canonicalization?
That sounds good. I had hoped to get that done today, but between various other
things coming up it might be sometime this weekend/early next week.

Louis
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20140523/937e6aca/attachment.html>

Apparently Analagous Threads

Search for more possibly parallel threads

llvm dev - May 2014 - [LLVMdev] RFC: Indexing of structs vs arrays in getelementpointer

[LLVMdev] RFC: Indexing of structs vs arrays in getelementpointer

[LLVMdev] RFC: Indexing of structs vs arrays in getelementpointer

[LLVMdev] RFC: Indexing of structs vs arrays in getelementpointer

[LLVMdev] RFC: Indexing of structs vs arrays in getelementpointer

[LLVMdev] RFC: Indexing of structs vs arrays in getelementpointer

Apparently Analagous Threads