Hello. I'm new to LLVM and I've got some problems with the LLVM JIT. I have set the ExecutionEngine to CodeGenOpt::Aggressive and PassManagerBuilder.OptLevel to 3 (mem2reg and GVN are included). However, the machine code generated by the JIT is not as good as that generated by clang or llc. Here is an example:

--------------------------------------------------------------------
source fragment                       clang or llc

struct { uint64_t a[10]; } *p;
p->a[2] = p->a[1];               ==>  mov 0x8(%rax),%rdx
p->a[3] = p->a[1];                    mov %rdx,0x10(%rax)
p->a[4] = p->a[2];                    mov %rdx,0x18(%rax)
p->a[5] = p->a[4];                    mov %rdx,0x20(%rax)
                                      mov %rdx,0x28(%rax)
--------------------------------------------------------------------
JIT (map p to GlobalVariable)          JIT (map p to constant GlobalVariable)

 1* movabsq $0x18c6b88, %rax           1* movabsq $0x18c6b88, %rax
 2* movq (%rax), %rcx      // p        2* movq (%rax), %rax
 3* movq 0x8(%rcx), %rdx   // a[1]     3* movq 0x8(%rax), %rcx
 4* movq %rdx, 0x10(%rcx)  // a[2]     4* movq %rcx, 0x10(%rax)
 5  movq (%rax), %rcx             ==>  5
 6  movq 0x8(%rcx), %rdx               6  movq 0x8(%rax), %rcx
 7* movq %rdx, 0x18(%rcx)              7* movq %rcx, 0x18(%rax)
 8  movq (%rax), %rcx                  8
 9  movq 0x10(%rcx), %rdx              9  movq 0x10(%rax), %rcx
10* movq %rdx, 0x20(%rcx)             10* movq %rcx, 0x20(%rax)
11  movq (%rax), %rax                 11
12  movq 0x20(%rax), %rcx             12
13* movq %rcx, 0x28(%rax)             13* movq %rcx, 0x28(%rax)
----------------------------------------------------------------------

A GlobalValue was declared and mapped to the variable p. Some LLVM IR instructions were created following those generated by LLVM from the source, i.e., load p, load a[1] based on p, load p again, store a[2] based on p, etc. The machine code turned out to be only slightly optimized, as shown on the left.

Things got better after the GlobalVariable for p was marked constant: the redundant loads of p (lines 5, 8 and 11) were removed, and so was line 12 because of line 10. However, I could not improve it any further, although the optimal machine code needs only the instructions marked with '*'.
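For reference, the IR I create looks roughly like this (a sketch from memory, not the exact IR; value names are illustrative):

```llvm
%struct.S = type { [10 x i64] }
@p = global %struct.S* null

define void @foo() {
entry:
  ; load p, then a[1], then reload p and store into a[2]
  %0 = load %struct.S** @p
  %a1 = getelementptr inbounds %struct.S* %0, i64 0, i32 0, i64 1
  %v1 = load i64* %a1
  %1 = load %struct.S** @p          ; p is reloaded before each store
  %a2 = getelementptr inbounds %struct.S* %1, i64 0, i32 0, i64 2
  store i64 %v1, i64* %a2
  ; ... and similarly for a[3] = a[1], a[4] = a[2], a[5] = a[4]
  ret void
}
```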
It seems that the store instructions block the optimizations across them. For example, lines 3 & 6 (or 4 & 9) are similar to lines 10 & 12, but they are not optimized away, even though the store between them (line 4 or 7) clearly has no aliasing problem. My question is: how can I make the LLVM JIT optimize this code? Did I miss anything, or do I need to write some kind of optimization pass? I would be grateful for any help you can provide.
I would not expect a JIT to produce code as good as a static compiler's. A JIT is supposed to run relatively fast, whereas a static compiler may take much longer.

Micah

From: llvmdev-bounces at cs.uiuc.edu On Behalf Of 王振江
Sent: Tuesday, August 20, 2013 12:23 AM
To: llvmdev at cs.uiuc.edu
Subject: [LLVMdev] Memory optimizations for LLVM JIT
I actually would expect the LLVM JIT engine to generate the same code if you have everything prepped the same way and use the same code generation options. If you use MCJIT, in fact, it uses exactly the same code generation mechanism. Have you tried using MCJIT?

Also, have you compared the IR being passed into the JIT with the IR used in the clang and llc cases? If the IR is the same, I would expect the generated code to be the same, unless there is some additional optimization that isn't being turned on in the JIT case.

-Andy
On 20 Aug 2013, at 08:23, 王振江 <cryst216 at gmail.com> wrote:

> A GlobalValue was declared and mapped to the variable p. Some LLVM IR
> instructions were created according to those generated by LLVM from source.
> I.e., load p, load a[1] based on p, load p again, store a[2] based on p, etc.
> The machine code turned out to be slightly optimized, as shown on the left.

I suspect this is due to possible aliasing. If p somehow pointed to itself, then the store to p->a[x] might change the value of p, so p must be reloaded each time. Clang emits TBAA metadata nodes (http://llvm.org/docs/LangRef.html#tbaa-metadata) that let the optimizers know the load of p can't alias the stores through p, since they have different high-level types. Without the TBAA metadata, the optimizers must be conservative.

> Things were getting better after the GlobalVariable of p was set as a
> constant. Redundant loads of p (lines 5, 8 and 11) were removed, and so was
> line 12 because of line 10.

This makes sense - if p is constant, no store can possibly change the value of p, so it doesn't need to be reloaded.

> However, I could not make it better any more, although the optimal machine
> code needs only those marked with '*'.

This is strange; I'm not sure what is going on here - assuming you are running the same passes, I'd expect no difference.
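To illustrate, TBAA-annotated IR from clang looks something like this (a hand-written sketch; the exact metadata layout varies between LLVM versions):

```llvm
; The load of the pointer p and the accesses to the i64 array elements
; carry different TBAA tags, so the optimizer knows they cannot alias
; and need not reload p after each store.
%0 = load %struct.S** @p, !tbaa !1           ; tag: "any pointer"
%a1 = getelementptr inbounds %struct.S* %0, i64 0, i32 0, i64 1
%v = load i64* %a1, !tbaa !2                 ; tag: "long long"
%a2 = getelementptr inbounds %struct.S* %0, i64 0, i32 0, i64 2
store i64 %v, i64* %a2, !tbaa !2             ; cannot clobber the !tbaa !1 load

!0 = metadata !{metadata !"Simple C/C++ TBAA"}
!1 = metadata !{metadata !"any pointer", metadata !0}
!2 = metadata !{metadata !"long long", metadata !0}
```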
Thank you very much for your explanations and suggestions. I'm sorry that I provided some wrong information last time: llc is (probably?) not able to optimize such code either.

I tried some more things following the suggestions. Here are the results (using the same core code shown in the last email):

1. compile to object file (clang -O3 -c test.c): good code quality
2. compile to bitcode file (clang -O3 -c test.c -emit-llvm): good
3. compile to bitcode file (clang -O0 -c test.c -emit-llvm): bad, similar IR to what I wrote manually
4. opt the test.bc from step 3 (opt -O3 test.bc): bad
5. compile to assembly from the test.bc of step 3 (llc -O3 test.bc): bad
6. IR-creation source from the test.bc of step 3 (llc -O3 -march=cpp test.bc): bad, similar IR to what I wrote manually
7. JIT or MCJIT the source from step 6 (modify and call jit/mcjit): bad

In short, once the source is converted to bad bitcode (or the equivalent IR-creation calls), I cannot optimize it back to -O3 quality. What can be the reason? Did the bitcode file lose some high-level information, so that certain optimizations are limited? If so, is it possible to reconstruct some naive metadata to enable such optimizations? (Just for this piece of code, as it is the most important scenario in my project.) Any help would be appreciated.

The source of test.c:
---------------------------------------------------------------------------------------
#include <stdio.h>
#include <stdlib.h>

struct S { long long a[10]; } *p;

void foo () {
    p->a[2] = p->a[1];
    p->a[3] = p->a[1];
    p->a[4] = p->a[2];
    p->a[5] = p->a[4];
}

int main() {
    p = (struct S*) malloc(sizeof(struct S));
    p->a[1] = rand();
    foo();
    printf("%lld\n", p->a[5]);
    return 0;
}
---------------------------------------------------------------------------------------