Hi,
I am looking to decompile x86 ASM to LLVM IR.
The original C is this:
int test61 ( unsigned value ) {
        int ret;
        if (value < 1)
                ret = 0x40;
        else
                ret = 0x61;
        return ret;
}
It compiles with GCC -O2 to (rather cleverly removing any branches):
0000000000000000 <test61>:
   0:   83 ff 01                cmp    $0x1,%edi
   3:   19 c0                   sbb    %eax,%eax
   5:   83 e0 df                and    $0xffffffdf,%eax
   8:   83 c0 61                add    $0x61,%eax
   b:   c3                      retq
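Read instruction by instruction, the sequence above computes the same value as the C source. A minimal C sketch of that reading follows; the names `test61_branchless` and `check_test61` are invented here for illustration, and the sketch assumes the usual interpretation of `sbb %eax,%eax` as "eax = 0 - CF".

```c
#include <assert.h>
#include <stdint.h>

/* Original function, as posted. */
int test61(unsigned value) {
    int ret;
    if (value < 1)
        ret = 0x40;
    else
        ret = 0x61;
    return ret;
}

/* Hypothetical C model of the GCC -O2 output (name invented here). */
int test61_branchless(unsigned value) {
    uint32_t cf  = value < 1;   /* cmp $0x1,%edi: CF = (edi < 1), unsigned  */
    uint32_t eax = 0u - cf;     /* sbb %eax,%eax: eax - eax - CF            */
    eax &= 0xffffffdfu;         /* and $0xffffffdf,%eax                     */
    eax += 0x61u;               /* add $0x61,%eax: wraps to 0x40 if CF == 1 */
    return (int)eax;
}

/* Spot-check that both versions agree on a few inputs. */
void check_test61(void) {
    unsigned samples[] = {0u, 1u, 2u, 0x7fffffffu, 0xffffffffu};
    for (unsigned i = 0; i < sizeof samples / sizeof samples[0]; i++)
        assert(test61(samples[i]) == test61_branchless(samples[i]));
}
```

When CF is set, the mask is 0xffffffff, the AND leaves 0xffffffdf, and adding 0x61 wraps around to 0x40; when CF is clear, the result is simply 0x61.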
How would I represent the SBB instruction in LLVM IR?
Would I have to first convert the ASM to something like:
   0000000000000000 <test61>:
   0:                   cmp    $0x1,%edi        Block A
   1:                   jb     4:               Block A
   2:                   mov    $0x61,%eax       Block B
   3:                   jmp    5:               Block B
   4:                   mov    $0x40,%eax       Block C
   5:                   retq                    Block D  (Due to join point)
...before I could convert it to LLVM IR?
I.e. rewrite it in such a way as to not need the SBB instruction.
The aim is to be able to then recompile it to maybe a different target.
The aim is to go from binary -> LLVM IR -> binary for cases where the
C source code is not available or lost.
I.e. a binary is available for 32-bit x86, and I want to retarget it
to ARM or 64-bit x86.
The LLVM IR should be target agnostic, and would permit the
retargeting task without having to build an AST and structure it as
a C or C++ source code program.
Any comments?
James
James Courtier-Dutton <james.dutton at gmail.com> writes:
> I am looking to decompile x86 ASM to LLVM IR.
> [...]
> The aim is to go from binary -> LLVM IR -> binary for cases where the
> C source code it not available or lost.
> [...]
> Any comments?

This is not possible, except for specific cases.

Consider this code:

long foo(long *p) {
  ++p;
  return *p;
}

The X86 machine code would do something like

add %eax, 4

for `++p', but for x86_64 it would be

add %rax, 8

But you can't know that without looking at the original C code.

And that's the most simple case.

The gist is that the assembly code does not contain enough semantic
information.
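The stride in question can be observed from C itself. The following is a small sketch, with the helper name `long_stride_bytes` invented for illustration; the value it returns depends on the data model of whichever compiler builds it, which is the information the machine code has already folded into a constant.

```c
#include <stddef.h>

/* The function from the message above. */
long foo(long *p) {
    ++p;
    return *p;
}

/* Hypothetical helper: number of bytes that `++p` adds to a `long *`. */
size_t long_stride_bytes(void) {
    long pair[2] = {0, 0};
    long *p = &pair[0];
    long *q = p + 1;                        /* same step as ++p */
    return (size_t)((char *)q - (char *)p); /* distance in bytes */
}
```

On an ILP32 x86 build this typically returns 4, and on an LP64 x86-64 build it typically returns 8.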
On 3/12/2013 11:20 AM, James Courtier-Dutton wrote:
> It compiles with GCC -O2 to (rather cleverly removing any branches):
> 0000000000000000 <test61>:
>    0:   83 ff 01                cmp    $0x1,%edi
>    3:   19 c0                   sbb    %eax,%eax
>    5:   83 e0 df                and    $0xffffffdf,%eax
>    8:   83 c0 61                add    $0x61,%eax
>    b:   c3                      retq
>
> How would I represent the SBB instruction in LLVM IR?

If you're decompiling an assembly language into IR, it is best to treat
the EFLAGS register as just another register that is manipulated as a
side effect of instructions, and to let a dead-code elimination pass
remove the flag updates that nothing reads. A rough equivalent in LLVM IR
for this case, with %edi and %eax standing for the incoming register
values, would be

  %cf   = icmp ult i32 %edi, 1        ; cmp $0x1,%edi: CF = (edi < 1), unsigned
  %eax2 = sub i32 %eax, %eax
  %cf32 = zext i1 %cf to i32
  %eax3 = sub i32 %eax2, %cf32        ; sbb %eax,%eax
  %eax4 = and i32 %eax3, -33          ; 0xffffffdf
  %eax5 = add i32 %eax4, 97           ; 0x61

> The aim is to be able to then recompile it to maybe a different target.
> The aim is to go from binary -> LLVM IR -> binary for cases where the
> C source code it not available or lost.

I know qemu can use LLVM IR as an intermediate form for optimizing
emulation; you might want to look into their source code. Or actually just
outright use qemu.

> I.e. binary available for x86 32 bit.  Re-target it to ARM or x86-64bit.
> The LLVM IR should be target agnostic, but would permit the
> re-targetting task without having to build AST and structure as a C or
> C++ source code program.

Retargeting binaries for different hardware sounds like a losing
proposition to me, especially if you're trying to retarget x86 binary code
to x86-64: problems here include code acting as if sizeof(void*) = 4
instead of the correct value of 8. The only safe way to do this is to
effectively emulate the original target machine... which is more or less
what qemu does.

--
Joshua Cranmer
Thunderbird and DXR developer
Source code archæologist
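The "flags are just another register" idea can be sketched in C as an explicit machine state that every instruction reads and writes. Everything here (`struct x86_state`, `op_cmp`, `op_sbb`, `run_test61`) is a hypothetical illustration rather than code from any existing lifter, and only three flags are modelled.

```c
#include <stdint.h>

/* Hypothetical machine state: two registers plus the flags modelled here. */
struct x86_state {
    uint32_t eax, edi;
    uint32_t cf;   /* carry flag, 0 or 1 */
    uint32_t zf;   /* zero flag, 0 or 1  */
    uint32_t sf;   /* sign flag, 0 or 1  */
};

/* cmp $imm,%reg: computes reg - imm and keeps only the flags. */
static void op_cmp(struct x86_state *s, uint32_t reg, uint32_t imm) {
    uint32_t diff = reg - imm;
    s->cf = reg < imm;      /* unsigned borrow */
    s->zf = diff == 0;
    s->sf = diff >> 31;
}

/* sbb %src,%dst: dst - src - CF. The flag updates of sbb itself are
   left out of this sketch because nothing later in the sequence reads them. */
static uint32_t op_sbb(const struct x86_state *s, uint32_t dst, uint32_t src) {
    return dst - src - s->cf;
}

/* The four instructions of test61. eax_in is whatever happened to be in
   %eax on entry; it cancels out in the sbb. */
uint32_t run_test61(uint32_t edi, uint32_t eax_in) {
    struct x86_state s = {0};
    s.edi = edi;
    s.eax = eax_in;
    op_cmp(&s, s.edi, 1);               /* cmp $0x1,%edi        */
    s.eax = op_sbb(&s, s.eax, s.eax);   /* sbb %eax,%eax        */
    s.eax &= 0xffffffdfu;               /* and $0xffffffdf,%eax */
    s.eax += 0x61u;                     /* add $0x61,%eax       */
    return s.eax;
}
```

In this sketch `zf` and `sf` are written by `op_cmp` but never read, so an optimizer is free to delete them; only `cf` survives into the result.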
Hi James,

On 12/03/13 17:20, James Courtier-Dutton wrote:
> How would I represent the SBB instruction in LLVM IR?

you could use an llvm.usub.with.overflow.i32 intrinsic (the unsigned
variant, since the x86 carry flag records an unsigned borrow) to get the
sub-with-carry, and then explicitly extend the carry flag to i32 and
subtract it off too. See the overflow intrinsic sections around
http://llvm.org/docs/LangRef.html#llvm-ssub-with-overflow-intrinsics

Ciao, Duncan.
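The carry flag after a subtraction is an unsigned borrow, so a full SBB with carry-in and carry-out can be built from two unsigned subtract-with-overflow steps. The C below is only a sketch of that shape: `struct sub_result`, `usub_overflow` and `sbb32` are hypothetical stand-ins for a result/overflow pair, not LLVM APIs.

```c
#include <stdint.h>

/* Hypothetical stand-in for a {result, overflow bit} pair. */
struct sub_result {
    uint32_t value;      /* wrapped 32-bit difference           */
    uint32_t overflow;   /* 1 if the subtraction had to borrow  */
};

/* Unsigned subtract that also reports the borrow. */
static struct sub_result usub_overflow(uint32_t a, uint32_t b) {
    struct sub_result r;
    r.value = a - b;
    r.overflow = a < b;
    return r;
}

/* sbb: dst - src - carry_in, with the carry-out in .overflow. */
struct sub_result sbb32(uint32_t dst, uint32_t src, uint32_t carry_in) {
    struct sub_result first  = usub_overflow(dst, src);
    struct sub_result second = usub_overflow(first.value, carry_in);
    /* At most one of the two steps can borrow, so OR-ing is enough. */
    second.overflow |= first.overflow;
    return second;
}
```

For the `sbb %eax,%eax` in test61 both operands are equal, so the first step never borrows and the result is 0 or 0xffffffff depending only on the incoming carry.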
On 12 March 2013 16:39, Óscar Fuentes <ofv at wanadoo.es> wrote:
> This is not possible, except for specific cases.
> [...]
> The gist is that the assembly code does not contain enough semantic
> information.

I already know how to handle the case you describe.
I am not converting ASM to LLVM IR without doing quite a lot of analysis first.
1) I can already tell whether a register refers to a pointer or an
integer based on how it is used. Does it get dereferenced or not?
So, I would know that "p" is a pointer.
2) From the binary, I would know whether it was built for 32-bit or 64-bit.
3) I could then use (1) and (2) to know whether "add %rax, 8" is
"p = p + 1" (64-bit long) or "p = p + 2" (32-bit long).

So, I think your "It is not possible" is a bit too black and white.
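Step 3 amounts to dividing the byte increment by the element size recovered in steps 1 and 2, then rescaling for the new target. A rough sketch of that arithmetic follows; both function names are made up for illustration, and the element sizes are assumed to come from an earlier analysis that is not shown.

```c
/* Hypothetical: how many elements a pointer moved, given the byte delta
   seen in the binary and the element size inferred for the source target.
   Returns 1 on success, 0 if the delta is not a whole number of elements. */
int elements_advanced(long byte_delta, long src_elem_size, long *count) {
    if (src_elem_size <= 0 || byte_delta % src_elem_size != 0)
        return 0;
    *count = byte_delta / src_elem_size;
    return 1;
}

/* Hypothetical: byte delta to emit for a target with a different element size. */
int retarget_delta(long byte_delta, long src_elem_size,
                   long dst_elem_size, long *new_delta) {
    long count;
    if (!elements_advanced(byte_delta, src_elem_size, &count))
        return 0;
    *new_delta = count * dst_elem_size;
    return 1;
}
```

With these helpers, a delta of 8 reads as one element when `long` is 8 bytes and as two elements when it is 4 bytes, matching the two readings of "add %rax, 8" in point 3.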
On 3/12/13 11:39 AM, Óscar Fuentes wrote:
> Consider this code:
>
> long foo(long *p) {
>   ++p;
>   return *p;
> }
>
> The X86 machine code would do something like
>
> add %eax, 4
>
> for `++p', but for x86_64 it would be
>
> add %rax, 8
>
> But you can't know that without looking at the original C code.

This is a bad example. A compiler compiling LP64 code would generate the
above code on x86_64 for the given C code. An ILP32 compiler for x86_64
would generate something more akin to the 32-bit x86_32 code given above.

It should be possible to statically convert such a simple program from one
instruction set to another (provided that they're not funky instruction
sets with 11 bit words).

That said, converting machine code from one machine to another is, I
believe, an undecidable problem for arbitrary code. Certainly
self-modifying code can be a problem. There's no type information, either,
so optimizations that may rely on it can't be done. Anything that uses
memory-mapped I/O or I/O ports is going to cause a real challenge, and
system calls won't work the same way on a different architecture. There
are probably other gotchas of which I am not aware.

In short, it's an exercise fraught with danger, and there will always be a
program that breaks your translator. Most systems that do binary
translation do it dynamically (i.e., they grab a set of instructions,
translate them to the new instruction set, and then cache the translation
for reuse as the program runs). They are essentially machine code
interpreters enhanced with Just-In-Time compilation for speed.

-- John T.
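The translate-then-cache loop described above can be sketched as a small direct-mapped cache keyed by guest address. This is a toy under stated assumptions: the struct and function names are invented, and `translate_block` only records the address where a real translator would decode guest instructions and emit host code.

```c
#include <stdint.h>

#define CACHE_SLOTS 64

/* Hypothetical translated block; a real one would hold host code. */
struct translated_block {
    uint32_t guest_pc;   /* guest address this block was translated from */
    int valid;
};

struct translation_cache {
    struct translated_block slots[CACHE_SLOTS];
    unsigned translations;   /* how many times the slow translator ran */
};

/* Stand-in for the real translator. */
static void translate_block(struct translation_cache *c,
                            struct translated_block *b, uint32_t guest_pc) {
    b->guest_pc = guest_pc;
    b->valid = 1;
    c->translations++;
}

/* Return the cached translation for guest_pc, translating on a miss. */
struct translated_block *lookup_or_translate(struct translation_cache *c,
                                             uint32_t guest_pc) {
    struct translated_block *b = &c->slots[guest_pc % CACHE_SLOTS];
    if (!b->valid || b->guest_pc != guest_pc)
        translate_block(c, b, guest_pc);   /* miss: overwrite the slot */
    return b;
}
```

A dispatcher would call `lookup_or_translate` with the current guest program counter each time control leaves a block; repeated visits to the same address then skip the translation step. Self-modifying code would additionally need the affected slots to be invalidated, which this sketch does not attempt.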
On 12 March 2013 16:45, Joshua Cranmer 🐧 <Pidgeot18 at gmail.com> wrote:
> I know qemu can use LLVM IR as an intermediate form for optimizing
> emulation; you might want to look into their source code. Or actually just
> outright use qemu.

I did not know that. Thank you. I will take a look.

Kind Regards

James