thr3ads.net - llvm dev - [LLVMdev] help decompiling x86 ASM to LLVM IR [Mar 2013]

James Courtier-Dutton

2013-Mar-12 16:55 UTC

[LLVMdev] help decompiling x86 ASM to LLVM IR

On 12 March 2013 16:39, Óscar Fuentes <ofv at wanadoo.es> wrote:
>
> This is not possible, except for specific cases.
>
> Consider this code:
>
> long foo(long *p) {
>   ++p;
>   return *p;
> }
>
> The X86 machine code would do something like
>
> add %eax, 4
>
> for `++p', but for x86_64 it would be
>
> add %rax, 8
>
> But you can't know that without looking at the original C code.
>
> And that's the most simple case.
>
> The gist is that the assembly code does not contain enough semantic
> information.

I already know how to handle the case you describe.
I am not converting ASM to LLVM IR without doing quite a lot of analysis first.
1) I can already tell if a register is referring to a pointer or an
integer based on how it is used. Does it get dereferenced or not? So,
I would know that "p" is a pointer.
2) From the binary, I would know if it was for 32-bit or 64-bit.
3) I could then use (1) and (2) to know if "add %rax, 8" is
"p = p + 1" (64-bit long) or "p = p + 2" (32-bit long).

So, I think your "It is not possible" is a bit too black and white.
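
As an illustration of step (3) above, here is a minimal sketch in C of the arithmetic involved. The helper name `elements_advanced` and its signature are hypothetical and do not come from any existing decompiler; the sketch assumes the register is already known to hold a pointer and that a candidate element size has been chosen by some earlier analysis.

```c
#include <assert.h>

/* Hypothetical helper: add_bytes is the constant added to a register that
 * is assumed to hold a pointer, elem_size is the assumed size in bytes of
 * the pointed-to type. On success, stores the number of elements the
 * pointer moved by in *count and returns 1. Returns 0 when the constant
 * is not a whole number of elements, which suggests the assumed type is
 * wrong. */
static int elements_advanced(unsigned add_bytes, unsigned elem_size,
                             unsigned *count)
{
    assert(elem_size != 0);
    if (add_bytes % elem_size != 0)
        return 0;
    *count = add_bytes / elem_size;
    return 1;
}
```

With an added constant of 8, an assumed 8-byte element gives a step of one element and an assumed 4-byte element gives a step of two, which is exactly the ambiguity under discussion: the arithmetic alone does not say which element size is the right one.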

Joshua Cranmer 🐧

2013-Mar-12 17:10 UTC

On 3/12/2013 11:55 AM, James Courtier-Dutton wrote:
> I already know how to handle the case you describe.
> I am not converting ASM to LLVM IR without doing quite a lot of analysis
> first.
> 1) I can already tell if a register is referring to a pointer or an
> integer based on how it is used. Does it get dereferenced or not? So,
> I would know that "p" is a pointer.

What if the variable is being loaded out of a memory location, and the
current use increments it by four but never dereferences it, while some
other location dereferences it?

What if (in x86-64 code) the variable clears the low three bits of the 
pointer to use it as scratchpad space for a few tracking bits? In 32-bit 
code, that's unsafe, since you can only guarantee two unused bits.
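
For readers unfamiliar with the trick, the following is a rough sketch in C of what such pointer tagging might look like at source level. The names `TAG_MASK`, `tag_pointer`, `untag_pointer` and `pointer_tag` are made up for illustration, and the sketch assumes the pointed-to object is 8-byte aligned so that the low three bits of its address are zero. It also relies on pointer-to-integer round trips through `uintptr_t`, which are implementation-defined but behave as expected on mainstream x86 compilers.

```c
#include <assert.h>
#include <stdint.h>

/* Low three bits of the address; assumes 8-byte alignment (hypothetical). */
#define TAG_MASK ((uintptr_t)0x7)

/* Store a small tag in the unused low bits of an aligned pointer. */
static void *tag_pointer(void *p, unsigned tag)
{
    uintptr_t bits = (uintptr_t)p;
    assert((bits & TAG_MASK) == 0); /* the pointer must really be aligned */
    assert(tag <= TAG_MASK);        /* the tag must fit in three bits */
    return (void *)(bits | tag);
}

/* Recover the original pointer by clearing the tag bits. */
static void *untag_pointer(void *p)
{
    return (void *)((uintptr_t)p & ~TAG_MASK);
}

/* Extract the tag stored in the low bits. */
static unsigned pointer_tag(void *p)
{
    return (unsigned)((uintptr_t)p & TAG_MASK);
}
```

In the machine code this shows up as plain AND/OR instructions on a register, so an analysis that classifies a value as "pointer" or "integer" purely from how it is used sees integer-style bit manipulation on something that is later dereferenced.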

What if you have a pointer variable in the middle of the struct, so you 
need to shift the data offset of a pointer-relative address to get the 
correct variable?

What if you have the equivalent assembly code for this C code:
union {
   struct {
     int *a;
     int b;
   };
   struct {
     int c;
     int d;
   };
} x;

...
switch () {
  case A: return &x.b;
  case B: return &x.d;
}

After optimization, cases A and B reduce to the same assembly in 32-bit 
code but not in 64-bit code.
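
To make the offsets concrete, here is a small sketch in C that reports where `b` and `d` sit inside the union. The type name `union example` and the helper functions are invented for this illustration, and the code assumes a C11 compiler, since it relies on anonymous structs inside the union.

```c
#include <stddef.h>

/* Same layout as the union in the message above, given a name so that
 * offsetof can be applied to it (the name is hypothetical). */
union example {
    struct {
        int *a;
        int b;
    };
    struct {
        int c;
        int d;
    };
};

/* Byte offset that case A adds to the base address of the union. */
static size_t offset_case_a(void) { return offsetof(union example, b); }

/* Byte offset that case B adds to the base address of the union. */
static size_t offset_case_b(void) { return offsetof(union example, d); }

/* Nonzero when both cases compute the same address on this target. */
static int cases_coincide(void)
{
    return offset_case_a() == offset_case_b();
}
```

On a typical 32-bit x86 target, where `int *` and `int` are both 4 bytes, both offsets are 4 and the two cases can be merged by the optimizer. On a typical x86-64 target the pointer is 8 bytes, so `b` moves to offset 8 while `d` stays at offset 4.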

How would you propose to detect and fix these cases?

> 2) From the binary, I would know if it was for 32-bit or 64-bit.
> 3) I could then use (1) and (2) to know if "add %rax, 8" is
> "p = p + 1" (64-bit long) or "p = p + 2" (32-bit long).
>
> So, I think your "It is not possible" is a bit too black and white.

No, it's AI-hard, as evidenced by the fact that porting programs from 32-bit to
64-bit at the source-code level is nontrivial for large projects with
lots of developers. And you have even less information at the assembly level.

-- 
Joshua Cranmer
Thunderbird and DXR developer
Source code archæologist

Óscar Fuentes

2013-Mar-12 17:17 UTC

James Courtier-Dutton <james.dutton at gmail.com> writes:
> I already know how to handle the case you describe.
> I am not converting ASM to LLVM IR without doing quite a lot of analysis
> first.
> 1) I can already tell if a register is referring to a pointer or an
> integer based on how it is used. Does it get dereferenced or not? So,
> I would know that "p" is a pointer.
> 2) From the binary, I would know if it was for 32-bit or 64-bit.
> 3) I could then use (1) and (2) to know if "add %rax, 8" is
> "p = p + 1" (64-bit long) or "p = p + 2" (32-bit long).
>
> So, I think your "It is not possible" is a bit too black and white.

No amount of automated analysis makes it possible to "translate"
arbitrary binary code from one architecture to another.

Your above stated rules would fail for my example. This code:

int foo(int *p) {
   ++p;
   return *p;
}

compiled for x86 (Linux or Windows) would generate the very same binary
code as

long foo(long *p) {
   ++p;
   return *p;
}

but those functions generate different code on x86_64-linux, where `int'
is 32 bits and `long' is 64 bits. In the general case, it is infeasible to
decide from x86 code whether `p' is a pointer to `int' or to `long'.
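
The point can be checked with a short sketch in C. The function names `foo_int`, `foo_long` and `strides_match` are chosen here only to keep the two variants apart in one file; they are not part of the original example.

```c
/* The `int' variant of the example: ++p advances by sizeof(int) bytes. */
static int foo_int(int *p)
{
    ++p;
    return *p;
}

/* The `long' variant: ++p advances by sizeof(long) bytes. */
static long foo_long(long *p)
{
    ++p;
    return *p;
}

/* Nonzero when both variants step the pointer by the same number of bytes
 * on the current target, in which case the two functions can compile to
 * identical machine code. */
static int strides_match(void)
{
    return sizeof(int) == sizeof(long);
}
```

On targets where `int` and `long` have the same size, as on 32-bit x86, `strides_match()` is nonzero and nothing in the generated code records which of the two source types was used. On x86_64 Linux it is zero and the two functions differ.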

There are lots and lots of examples of that kind. Other kinds of problems
include translating ABI-related code, reflecting external data structures...

Óscar Fuentes

2013-Mar-12 17:27 UTC

Joshua Cranmer 🐧 <Pidgeot18 at gmail.com> writes:
>> So, I think your "It is not possible" is a bit too black and white.
>
> No, it's AI-hard, as evidenced by the fact that porting programs from
> 32-bit to 64-bit at the source-code level is nontrivial for large
> projects with lots of developers. And you have even less information at
> the assembly level.

Now, *that's* a very good argument.

James Courtier-Dutton

2013-Mar-12 18:17 UTC

On 12 March 2013 17:10, Joshua Cranmer 🐧 <Pidgeot18 at gmail.com> wrote:
> On 3/12/2013 11:55 AM, James Courtier-Dutton wrote:
>> 2) From the binary, I would know if it was for 32-bit or 64-bit.
>> 3) I could then use (1) and (2) to know if "add %rax, 8" is
>> "p = p + 1" (64-bit long) or "p = p + 2" (32-bit long).
>>
>> So, I think your "It is not possible" is a bit too black and white.
>
> No, it's AI-hard, as evidenced by the fact that porting programs from
> 32-bit to 64-bit at the source-code level is nontrivial for large
> projects with lots of developers. And you have even less information at
> the assembly level.

So, let us take the source-code level case.
You can write a source-code level program that will compile unchanged
to produce either a 32-bit application or a 64-bit application.
Proof of this is almost any Linux-based distro, which is
available with both 32-bit and 64-bit applications.
So, ask a different question:
Instead of porting a 32-bit program to 64-bit, why not port the 32-bit program
to a program that will work equally well whether it is compiled for a 32-bit
target or a 64-bit target?

A first step might be to look at every use of "int" and "long"
and replace them with int32_t and int64_t, i.e. replace target-specific
types with target-agnostic types.
So, if the binary is 32-bit, int will be 32 bits wide: change the source code
to say "int32_t" instead of "int".
If the binary is 32-bit and long is 32 bits wide on that target, change
the source code to say "int32_t" as well.
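
A minimal sketch in C of what that rewrite might produce for the `foo` example from earlier in the thread. The function names `next_i32` and `next_i64` are invented here, and the sketch assumes a platform that provides the exact-width types from `<stdint.h>`.

```c
#include <stdint.h>

/* Recovered from a 32-bit binary where the element was 32 bits wide:
 * ++p advances by 4 bytes on every target that has int32_t. */
static int32_t next_i32(const int32_t *p)
{
    ++p;
    return *p;
}

/* The same shape for a 64-bit element: ++p advances by 8 bytes on every
 * target that has int64_t. */
static int64_t next_i64(const int64_t *p)
{
    ++p;
    return *p;
}
```

Because the element width is now spelled out in the type, the pointer step no longer depends on whether the code is compiled for a 32-bit or a 64-bit target. Whether the original author meant "exactly 32 bits" or "whatever long is on this platform" is still something the rewrite cannot recover.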

I know that there will be special cases that are difficult to handle.
I don't expect 100%. I am looking to write a tool that can do, say, 80%
of the work.
I believe that I could recognise blocks that we know will work, and
highlight the "unsure" sections of the code for closer inspection.
I am hoping to be able to distinguish target-agnostic code from
target-specific code and automate the target-agnostic parts.

My current decompiler does statistical analysis in order to identify types.
E.g. this register at this instruction is most likely an int32_t, but
might be a uint32_t, and is definitely not a uint64_t.
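
The thread does not describe how that analysis works, so the following is only a guess at the general shape of such a scheme, written in C. Every name in it (`enum candidate`, `struct type_votes`, `vote`, `rule_out`, `most_likely`) is hypothetical, and the weights are arbitrary. The idea is that each observed use of a register adds evidence for some candidate types and can exclude others outright.

```c
/* Candidate types for one register at one instruction (hypothetical set). */
enum candidate { CAND_INT32, CAND_UINT32, CAND_UINT64, CAND_COUNT };

/* Accumulated evidence: a score per candidate, plus a flag for candidates
 * that some observation has made impossible. */
struct type_votes {
    int score[CAND_COUNT];
    int ruled_out[CAND_COUNT];
};

/* Record evidence in favour of a candidate, e.g. a signed comparison. */
static void vote(struct type_votes *v, enum candidate c, int weight)
{
    v->score[c] += weight;
}

/* Record that a candidate is impossible, e.g. only a 32-bit register
 * is ever written. */
static void rule_out(struct type_votes *v, enum candidate c)
{
    v->ruled_out[c] = 1;
}

/* Return the highest-scoring candidate that has not been ruled out,
 * or CAND_COUNT if every candidate has been excluded. */
static enum candidate most_likely(const struct type_votes *v)
{
    enum candidate best = CAND_COUNT;
    for (int i = 0; i < CAND_COUNT; ++i) {
        if (v->ruled_out[i])
            continue;
        if (best == CAND_COUNT || v->score[i] > v->score[best])
            best = (enum candidate)i;
    }
    return best;
}
```

A scheme like this also gives a natural way to flag "unsure" code: when the top two remaining candidates have close scores, the location can be marked for manual inspection instead of being translated automatically.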

So, it is not black and white. I want it to work, say, 80% of the time,
and at least highlight where the remaining 20% is, so that manual work
can be done on it.
