thr3ads.net - llvm dev - [LLVMdev] help decompiling x86 ASM to LLVM IR [Mar 2013]

James Courtier-Dutton

2013-Mar-12 16:20 UTC

[LLVMdev] help decompiling x86 ASM to LLVM IR

Hi,

I am looking to decompile x86 ASM to LLVM IR.
The original C is this:
int test61 ( unsigned value ) {
        int ret;
        if (value < 1)
                ret = 0x40;
        else
                ret = 0x61;
        return ret;
}

It compiles with GCC -O2 to (rather cleverly removing any branches):
0000000000000000 <test61>:
   0:   83 ff 01                cmp    $0x1,%edi
   3:   19 c0                   sbb    %eax,%eax
   5:   83 e0 df                and    $0xffffffdf,%eax
   8:   83 c0 61                add    $0x61,%eax
   b:   c3                      retq

How would I represent the SBB instruction in LLVM IR?
Would I have to first convert the ASM to something like:
   0000000000000000 <test61>:
   0:                   cmp    $0x1,%edi        Block A
   1:                   jb     4:               Block A
   2:                   mov    $0x61,%eax       Block B
   3:                   jmp    5:               Block B
   4:                   mov    $0x40,%eax       Block C
   5:                   retq                    Block D  (Due to join point)

...before I could convert it to LLVM IR ?
I.e. Re-write it in such a way as to not need the SBB instruction.

The aim is to be able to recompile it, possibly for a different target:
to go from binary -> LLVM IR -> binary in cases where the C source code
is not available or has been lost.

For example, given a binary for 32-bit x86, retarget it to ARM or x86-64.
The LLVM IR should be target agnostic, and would permit the retargeting
task without having to build an AST and structure it as a C or C++
source program.

Any comments?

James

Óscar Fuentes

2013-Mar-12 16:39 UTC

[LLVMdev] help decompiling x86 ASM to LLVM IR

James Courtier-Dutton <james.dutton at gmail.com> writes:
> I am looking to decompile x86 ASM to LLVM IR.
> The original C is this:
> int test61 ( unsigned value ) {
>         int ret;
>         if (value < 1)
>                 ret = 0x40;
>         else
>                 ret = 0x61;
>         return ret;
> }
>
> It compiles with GCC -O2 to (rather cleverly removing any branches):
> 0000000000000000 <test61>:
>    0:   83 ff 01                cmp    $0x1,%edi
>    3:   19 c0                   sbb    %eax,%eax
>    5:   83 e0 df                and    $0xffffffdf,%eax
>    8:   83 c0 61                add    $0x61,%eax
>    b:   c3                      retq
>
> How would I represent the SBB instruction in LLVM IR?
> Would I have to first convert the ASM to something like:
>    0000000000000000 <test61>:
>    0:                   cmp    $0x1,%edi        Block A
>    1:                   jb     4:               Block A
>    2:                   mov    $0x61,%eax       Block B
>    3:                   jmp    5:               Block B
>    4:                   mov    $0x40,%eax       Block C
>    5:                   retq                    Block D  (Due to join point)
>
> ...before I could convert it to LLVM IR ?
> I.e. Re-write it in such a way as to not need the SBB instruction.
>
> The aim is to be able to then recompile it to maybe a different target.
> The aim is to go from binary -> LLVM IR -> binary for cases where the
> C source code it not available or lost.
>
> I.e. binary available for x86 32 bit.  Re-target it to ARM or x86-64bit.
> The LLVM IR should be target agnostic, but would permit the
> re-targetting task without having to build AST and structure as a C or
> C++ source code program.
>
> Any comments?
This is not possible, except for specific cases.

Consider this code:

long foo(long *p) {
  ++p;
  return *p;
}

The X86 machine code would do something like

add %eax, 4

for `++p', but for x86_64 it would be

add %rax, 8

But you can't know that without looking at the original C code.

And that's the most simple case.

The gist is that the assembly code does not contain enough semantic
information.

Joshua Cranmer 🐧

2013-Mar-12 16:45 UTC

[LLVMdev] help decompiling x86 ASM to LLVM IR

On 3/12/2013 11:20 AM, James Courtier-Dutton wrote:
> It compiles with GCC -O2 to (rather cleverly removing any branches):
> 0000000000000000 <test61>:
>     0:   83 ff 01                cmp    $0x1,%edi
>     3:   19 c0                   sbb    %eax,%eax
>     5:   83 e0 df                and    $0xffffffdf,%eax
>     8:   83 c0 61                add    $0x61,%eax
>     b:   c3                      retq
>
> How would I represent the SBB instruction in LLVM IR?
If you're decompiling an assembly language into IR, it is best to treat
the EFLAGS register as just another register that is manipulated as a
side effect of instructions, and to let a dead-code elimination pass
remove the flag computations that are never used. A rough equivalent in
LLVM IR for this case would be
%cf = icmp ult i32 %edi, 1        ; cmp $0x1,%edi: CF = unsigned borrow
%eax2 = sub i32 %eax, %eax        ; sbb %eax,%eax, step 1: eax - eax
%cf32 = zext i1 %cf to i32
%eax3 = sub i32 %eax2, %cf32      ; sbb step 2: subtract the carry
%eax4 = and i32 %eax3, -33        ; and $0xffffffdf,%eax
%eax5 = add i32 %eax4, 97         ; add $0x61,%eax
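As a cross-check of the flags-as-a-register idea, here is a minimal C sketch (not from the thread) that models each of the four instructions with the carry flag held in an ordinary variable. Only test61 comes from the original post; test61_lifted and check_test61_lifted are names made up for illustration, and the sample inputs are arbitrary.

```c
#include <assert.h>
#include <stdint.h>

/* Source-level function from the first message, written as one expression. */
int test61(unsigned value) {
    return (value < 1) ? 0x40 : 0x61;
}

/* Instruction-by-instruction model of the GCC output. The carry flag is
 * just another variable (hypothetical naming, for illustration only). */
int test61_lifted(uint32_t edi) {
    uint32_t cf  = (edi < 1u);   /* cmp $0x1,%edi : CF = unsigned borrow   */
    uint32_t eax = 0u - cf;      /* sbb %eax,%eax : eax - eax - CF         */
    eax &= 0xffffffdfu;          /* and $0xffffffdf,%eax                   */
    eax += 0x61u;                /* add $0x61,%eax (wraps modulo 2^32)     */
    return (int)eax;
}

/* Spot-check that the lifted form agrees with the source on a few inputs. */
void check_test61_lifted(void) {
    const uint32_t samples[] = {0u, 1u, 2u, 0x7fffffffu, 0xffffffffu};
    for (unsigned i = 0; i < sizeof samples / sizeof samples[0]; ++i)
        assert(test61_lifted(samples[i]) == test61(samples[i]));
}
```

Note that the initial value of %eax never matters here, because eax - eax is always zero; an optimizer can therefore drop that dependency once the flag is an explicit value.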
> The aim is to be able to then recompile it to maybe a different target.
> The aim is to go from binary -> LLVM IR -> binary for cases where the
> C source code is not available or lost.
I know qemu can use LLVM IR as an intermediate form for optimizing
emulation; you might want to look into their source code. Or actually
just outright use qemu.
>
> I.e. binary available for x86 32 bit.  Re-target it to ARM or x86-64bit.
> The LLVM IR should be target agnostic, but would permit the
> re-targetting task without having to build AST and structure as a C or
> C++ source code program.
Retargeting binaries for different hardware sounds like a losing
proposition to me, especially if you're trying to retarget x86 binary
code to x86-64: problems here include code acting as if sizeof(void*) ==
4 instead of the correct value of 8. The only safe way to do this is to
effectively emulate the original target machine... which is more or less
what qemu does.
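To make the sizeof(void*) problem concrete, here is a small C sketch (an illustration, not something from the thread). It models a pointer field as a 4-byte integer for the 32-bit layout and as an 8-byte integer for the 64-bit layout; the struct and function names are hypothetical. Compiled 32-bit code would have the first offset baked into its load and store instructions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* A linked-list node as 32-bit code sees it: the pointer takes 4 bytes. */
struct node32 { uint32_t next; int32_t value; };

/* The same node as 64-bit code sees it: the pointer takes 8 bytes. */
struct node64 { uint64_t next; int32_t value; };

/* Byte offset that machine code would hard-code when accessing "value". */
size_t value_offset_32(void) { return offsetof(struct node32, value); }
size_t value_offset_64(void) { return offsetof(struct node64, value); }
```

A translator that keeps the 32-bit offset while switching to 64-bit pointers would read the upper half of the pointer instead of the value field.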

-- 
Joshua Cranmer
Thunderbird and DXR developer
Source code archæologist

Duncan Sands

2013-Mar-12 16:53 UTC

[LLVMdev] help decompiling x86 ASM to LLVM IR

Hi James,

On 12/03/13 17:20, James Courtier-Dutton wrote:
> Hi,
>
> I am looking to decompile x86 ASM to LLVM IR.
> The original C is this:
> int test61 ( unsigned value ) {
>          int ret;
>          if (value < 1)
>                  ret = 0x40;
>          else
>                  ret = 0x61;
>          return ret;
> }
>
> It compiles with GCC -O2 to (rather cleverly removing any branches):
> 0000000000000000 <test61>:
>     0:   83 ff 01                cmp    $0x1,%edi
>     3:   19 c0                   sbb    %eax,%eax
>     5:   83 e0 df                and    $0xffffffdf,%eax
>     8:   83 c0 61                add    $0x61,%eax
>     b:   c3                      retq
>
> How would I represent the SBB instruction in LLVM IR?
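you could use an llvm.usub.with.overflow.i32 intrinsic to get the
subtraction together with its carry (borrow) out, and then explicitly
extend the incoming carry flag to i32 and subtract it off too.  The
unsigned variant is the one that matches the x86 carry flag; the signed
llvm.ssub.with.overflow.i32 reports signed overflow instead.  Both are
documented side by side, see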

http://llvm.org/docs/LangRef.html#llvm-ssub-with-overflow-intrinsics
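For readers more comfortable in C, the same construction can be sketched with the GCC/Clang builtin __builtin_usub_overflow, which plays the role of the unsigned subtract-with-overflow intrinsic. This is a sketch under assumptions: sbb32 is a hypothetical helper name, and it assumes unsigned int is 32 bits wide.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical helper modelling x86 SBB: computes dst - src - carry_in and
 * stores the resulting carry (borrow) flag in *carry_out.
 * Assumes unsigned int is 32 bits wide. */
unsigned int sbb32(unsigned int dst, unsigned int src,
                   bool carry_in, bool *carry_out) {
    unsigned int partial, result;
    /* First subtraction: borrow out of dst - src. */
    bool b1 = __builtin_usub_overflow(dst, src, &partial);
    /* Second subtraction: take off the zero-extended incoming carry. */
    bool b2 = __builtin_usub_overflow(partial, (unsigned int)carry_in, &result);
    /* At most one of the two steps can borrow, so OR combines them. */
    *carry_out = b1 || b2;
    return result;
}
```

With this helper, "sbb %eax,%eax" after "cmp $0x1,%edi" is sbb32(eax, eax, edi < 1, &cf), which yields 0xffffffff when the carry is set and 0 otherwise.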

Ciao, Duncan.
> Would I have to first convert the ASM to something like:
>     0000000000000000 <test61>:
>     0:                   cmp    $0x1,%edi        Block A
>     1:                   jb     4:               Block A
>     2:                   mov    $0x61,%eax       Block B
>     3:                   jmp    5:               Block B
>     4:                   mov    $0x40,%eax       Block C
>     5:                   retq                    Block D  (Due to join point)
>
> ...before I could convert it to LLVM IR ?
> I.e. Re-write it in such a way as to not need the SBB instruction.
>
> The aim is to be able to then recompile it to maybe a different target.
> The aim is to go from binary -> LLVM IR -> binary for cases where the
> C source code it not available or lost.
>
> I.e. binary available for x86 32 bit.  Re-target it to ARM or x86-64bit.
> The LLVM IR should be target agnostic, but would permit the
> re-targetting task without having to build AST and structure as a C or
> C++ source code program.
>
> Any comments?
>
> James
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>

James Courtier-Dutton

2013-Mar-12 16:55 UTC

[LLVMdev] help decompiling x86 ASM to LLVM IR

On 12 March 2013 16:39, Óscar Fuentes <ofv at wanadoo.es> wrote:
>
> This is not possible, except for specific cases.
>
> Consider this code:
>
> long foo(long *p) {
>   ++p;
>   return *p;
> }
>
> The X86 machine code would do something like
>
> add %eax, 4
>
> for `++p', but for x86_64 it would be
>
> add %rax, 8
>
> But you can't know that without looking at the original C code.
>
> And that's the most simple case.
>
> The gist is that the assembly code does not contain enough semantic
> information.
I already know how to handle the case you describe.
I am not converting ASM to LLVM IR without doing quite a lot of analysis first.
1) I can already tell whether a register holds a pointer or an
integer based on how it is used: does it get dereferenced or not? So
I would know that "p" is a pointer.
2) From the binary, I would know whether it was built for 32-bit or 64-bit.
3) I could then use (1) and (2) to know whether "add %rax, 8" is
"p = p + 1" (64-bit long) or "p = p + 2" (32-bit long).

So, I think your "It is not possible" is a bit too black and white.
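Step (3) amounts to dividing the byte delta by the inferred element size, and multiplying by the target's element size when retargeting. A minimal C sketch of that arithmetic follows; the function names are hypothetical, and it assumes the element size has already been recovered by the analysis in (1) and (2).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical helper: how many elements does a pointer move, given the
 * byte delta found in the machine code and the inferred element size?
 * Returns false when the delta is not a whole number of elements, in which
 * case the "pointer to elem_size-byte objects" guess is probably wrong. */
bool elements_advanced(long byte_delta, long elem_size, long *count) {
    if (elem_size <= 0 || byte_delta % elem_size != 0)
        return false;
    *count = byte_delta / elem_size;
    return true;
}

/* Re-express the same pointer step for a target whose element size differs. */
bool retarget_byte_delta(long byte_delta, long src_elem_size,
                         long dst_elem_size, long *new_delta) {
    long count;
    if (!elements_advanced(byte_delta, src_elem_size, &count))
        return false;
    *new_delta = count * dst_elem_size;
    return true;
}
```

For the example in this thread, a 4-byte step over 4-byte longs is one element, so it becomes an 8-byte step on a target where long is 8 bytes.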

John Criswell

2013-Mar-12 17:01 UTC

[LLVMdev] help decompiling x86 ASM to LLVM IR

On 3/12/13 11:39 AM, Óscar Fuentes wrote:
> James Courtier-Dutton <james.dutton at gmail.com> writes:
>
>> I am looking to decompile x86 ASM to LLVM IR.
>> The original C is this:
>> int test61 ( unsigned value ) {
>>          int ret;
>>          if (value < 1)
>>                  ret = 0x40;
>>          else
>>                  ret = 0x61;
>>          return ret;
>> }
>>
>> It compiles with GCC -O2 to (rather cleverly removing any branches):
>> 0000000000000000 <test61>:
>>     0:   83 ff 01                cmp    $0x1,%edi
>>     3:   19 c0                   sbb    %eax,%eax
>>     5:   83 e0 df                and    $0xffffffdf,%eax
>>     8:   83 c0 61                add    $0x61,%eax
>>     b:   c3                      retq
>>
>> How would I represent the SBB instruction in LLVM IR?
>> Would I have to first convert the ASM to something like:
>>     0000000000000000 <test61>:
>>     0:                   cmp    $0x1,%edi        Block A
>>     1:                   jb     4:               Block A
>>     2:                   mov    $0x61,%eax       Block B
>>     3:                   jmp    5:               Block B
>>     4:                   mov    $0x40,%eax       Block C
>>     5:                   retq                    Block D  (Due to join point)
>>
>> ...before I could convert it to LLVM IR ?
>> I.e. Re-write it in such a way as to not need the SBB instruction.
>>
>> The aim is to be able to then recompile it to maybe a different target.
>> The aim is to go from binary -> LLVM IR -> binary for cases where the
>> C source code is not available or lost.
>>
>> I.e. binary available for x86 32 bit.  Re-target it to ARM or x86-64bit.
>> The LLVM IR should be target agnostic, but would permit the
>> re-targetting task without having to build AST and structure as a C or
>> C++ source code program.
>>
>> Any comments?
> This is not possible, except for specific cases.
>
> Consider this code:
>
> long foo(long *p) {
>    ++p;
>    return *p;
> }
>
> The X86 machine code would do something like
>
> add %eax, 4
>
> for `++p', but for x86_64 it would be
>
> add %rax, 8
>
> But you can't know that without looking at the original C code.
This is a bad example.  A compiler compiling LP64 code would generate 
the above code on x86_64 for the given C code.  An ILP32 compiler for 
x86_64 would generate something more akin to the 32-bit x86_32 code 
given above.  It should be possible to statically convert such a simple 
program from one instruction set to another (provided that they're not 
funky instruction sets with 11 bit words).

That said, converting machine code from one machine to another is, I 
believe, an undecidable problem for arbitrary code.  Certainly 
self-modifying code can be a problem.  There's no type-information, 
either, so optimizations that may rely on it can't be done. Anything 
that uses memory-mapped I/O or I/O ports is going to cause a real 
challenge, and system calls won't work the same way on a different 
architecture.  There are probably other gotchas of which I am not
aware.  In short, it's an exercise fraught with danger, and there will 
always be a program that breaks your translator.

Most systems that do binary translation do it dynamically (i.e., they 
grab a set of instructions, translate them to the new instruction set, 
and then cache the translation for reuse as the program runs).  They are 
essentially machine code interpreters enhanced with Just-In-Time 
compilation for speed.
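The translate-then-cache loop described above can be sketched in a few lines of C. This is a toy under heavy assumptions, not how any real translator is built: the "guest ISA" is a string where 'i' means increment and anything else means double, the "host code" is a plain C function pointer, and all names are invented for illustration.

```c
#include <assert.h>
#include <stddef.h>

enum { GUEST_SIZE = 4 };          /* toy guest programs have at most 4 ops */
typedef int (*host_fn)(int);      /* a "translated block" acting on an accumulator */

int host_inc(int acc) { return acc + 1; }
int host_dbl(int acc) { return acc * 2; }

typedef struct {
    host_fn cache[GUEST_SIZE];    /* translation cache, indexed by guest pc */
    int translations;             /* how often the translator actually ran  */
} translator;

/* Toy translator: map one guest opcode to a host routine. */
host_fn translate(translator *t, const char *guest, size_t pc) {
    t->translations++;
    return guest[pc] == 'i' ? host_inc : host_dbl;
}

/* Run the guest code, translating each pc on first use and reusing the
 * cached translation on every later visit. */
int run(translator *t, const char *guest, size_t len, int acc) {
    for (size_t pc = 0; pc < len && pc < GUEST_SIZE; ++pc) {
        if (t->cache[pc] == NULL)
            t->cache[pc] = translate(t, guest, pc);
        acc = t->cache[pc](acc);
    }
    return acc;
}
```

Running the same guest program twice with one translator invokes translate only on the first pass; the second pass is served entirely from the cache, which is where the speed-up over a plain interpreter comes from.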

-- John T.

James Courtier-Dutton

2013-Mar-12 18:35 UTC

[LLVMdev] help decompiling x86 ASM to LLVM IR

On 12 March 2013 16:45, Joshua Cranmer 🐧 <Pidgeot18 at gmail.com> wrote:
>
> I know qemu can use LLVM IR as an intermediate form for optimizing
> emulation; you might want to look into their source code. Or actually just
> outright use qemu.
>
I did not know that. Thank you. I will take a look.

Kind Regards

James
