On 12 March 2013 16:39, Óscar Fuentes <ofv at wanadoo.es> wrote:
>
> This is not possible, except for specific cases.
>
> Consider this code:
>
> long foo(long *p) {
>   ++p;
>   return *p;
> }
>
> The X86 machine code would do something like
>
> add %eax, 4
>
> for `++p', but for x86_64 it would be
>
> add %rax, 8
>
> But you can't know that without looking at the original C code.
>
> And that's the most simple case.
>
> The gist is that the assembly code does not contain enough semantic
> information.

I already know how to handle the case you describe.
I am not converting ASM to LLVM IR without doing quite a lot of analysis first.
1) I can already tell if a register is referring to a pointer or an
integer based on how it is used. Does it get dereferenced or not? So,
I would know that "p" is a pointer.
2) From the binary, I would know if it was for 32bit or 64bit.
3) I could then use (1) and (2) to know if "add %rax, 8" is "p = p +
1" (64bit long), or "p = p + 2" (32bit long).

So, I think your "It is not possible" is a bit too black and white.
On 3/12/2013 11:55 AM, James Courtier-Dutton wrote:
> I already know how to handle the case you describe.
> I am not converting ASM to LLVM IR without doing quite a lot of analysis first.
> 1) I can already tell if a register is referring to a pointer or an
> integer based on how it is used. Does it get dereferenced or not? So,
> I would know that "p" is a pointer.

What if the variable is being loaded out of a memory location, and the
current use increments it by four but never dereferences it, while some
other location dereferences it?

What if (in x86-64 code) the code clears the low three bits of the
pointer to use them as scratchpad space for a few tracking bits? In
32-bit code, that's unsafe, since you can only guarantee two unused bits.

What if you have a pointer variable in the middle of a struct, so you
need to shift the data offset of a pointer-relative address to get the
correct variable?

What if you have the equivalent assembly code for this C code:

union {
  struct { int *a; int b; };
  struct { int c; int d; };
} x;

...
switch () {
case A:
  return &x.b;
case B:
  return &x.d;
}

After optimization, cases A and B reduce to the same assembly in 32-bit
code but not in 64-bit code. How would you propose to detect and fix
these cases?

> 2) From the binary, I would know if it was for 32bit or 64bit.
> 3) I could then use (1) and (2) to know if "add %rax, 8" is "p = p +
> 1" (64bit long), or "p = p + 2" (32bit long).
>
> So, I think your "It is not possible" is a bit too black and white.

No, it's AI-hard, as evidenced by the fact that porting programs from
32-bit to 64-bit at the source-code level is nontrivial for large
projects with lots of developers. And you only have less information at
the assembly level.

--
Joshua Cranmer
Thunderbird and DXR developer
Source code archæologist
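The union case above can be made concrete with a small sketch. It assumes a C11 compiler (for anonymous structs) and a mainstream ABI where `int` is 4 bytes. The names `union overlay`, `offset_of_b`, `offset_of_d`, and `cases_collapse` are made up for illustration and do not come from the thread. The sketch measures where `b` and `d` land inside the union, which is the constant an optimizer would fold into the address computation.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption of this sketch, not a claim from the thread. */
static_assert(sizeof(int) == 4, "sketch assumes a 4-byte int");

/* Hypothetical stand-in for the union in the message above. */
union overlay {
    struct { int *a; int b; };  /* b sits after a pointer */
    struct { int c; int d; };   /* d sits after an int */
};

/* Byte offset of b from the start of the union. */
size_t offset_of_b(void) {
    union overlay x;
    return (size_t)((char *)&x.b - (char *)&x);
}

/* Byte offset of d from the start of the union. */
size_t offset_of_d(void) {
    union overlay x;
    return (size_t)((char *)&x.d - (char *)&x);
}

/* Nonzero when &x.b and &x.d are the same address, i.e. when the two
   switch cases could be merged into identical machine code. */
int cases_collapse(void) {
    return offset_of_b() == offset_of_d();
}
```

On a typical 32-bit x86 build both offsets should come out as 4, so `cases_collapse()` is nonzero. On a typical x86_64 build `b` moves to offset 8 while `d` stays at 4, and a translator working from the merged 32-bit code has no record of which case meant which field.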
James Courtier-Dutton <james.dutton at gmail.com> writes:

> I already know how to handle the case you describe.
> I am not converting ASM to LLVM IR without doing quite a lot of analysis first.
> 1) I can already tell if a register is referring to a pointer or an
> integer based on how it is used. Does it get dereferenced or not? So,
> I would know that "p" is a pointer.
> 2) From the binary, I would know if it was for 32bit or 64bit.
> 3) I could then use (1) and (2) to know if "add %rax, 8" is "p = p +
> 1" (64bit long), or "p = p + 2" (32bit long).
>
> So, I think your "It is not possible" is a bit too black and white.

No amount of automated analysis makes it possible to "translate"
arbitrary binary code from one architecture to another. The rules you
state above would fail for my example. This code:

int foo(int *p) {
  ++p;
  return *p;
}

compiled for x86 (Linux or Windows) would generate the very same binary
code as

long foo(long *p) {
  ++p;
  return *p;
}

but those functions generate different code on x86_64-linux, where `int'
is 32 bits and `long' is 64 bits.

In the general case, it is infeasible to decide whether `p' is a pointer
to `int' or to `long' on x86.

There are lots and lots of examples of that kind. Other kinds of
problems include translating ABI-related code, reflecting external data
structures...
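To see why the two functions are indistinguishable on 32-bit x86, it helps to look at the byte stride that `++p` implies. The sketch below is only an illustration: `foo_int` and `foo_long` are renamed copies of the two functions above, the `stride_of_*`, `strides_match`, and `check_reads_second_element` helpers are hypothetical, and the results depend on the data model of the compiler that builds it (ILP32 versus LP64).

```c
#include <assert.h>
#include <stddef.h>

/* The two functions from the message, renamed so both can coexist. */
int foo_int(int *p) {
    ++p;
    return *p;
}

long foo_long(long *p) {
    ++p;
    return *p;
}

/* Number of bytes that `++p` adds to an `int *`. */
size_t stride_of_int(void) {
    int buf[2] = {0, 0};
    int *p = buf;
    ++p;
    return (size_t)((char *)p - (char *)buf);
}

/* Number of bytes that `++p` adds to a `long *`. */
size_t stride_of_long(void) {
    long buf[2] = {0, 0};
    long *p = buf;
    ++p;
    return (size_t)((char *)p - (char *)buf);
}

/* Nonzero when the constant in the `add` is the same for both types,
   which is when the element type cannot be read back from the code. */
int strides_match(void) {
    return stride_of_int() == stride_of_long();
}

/* Both functions return the second element of their array. */
void check_reads_second_element(void) {
    int ai[2] = {10, 20};
    long al[2] = {10, 20};
    assert(foo_int(ai) == 20);
    assert(foo_long(al) == 20);
}
```

Where `int` and `long` have the same size, `strides_match()` is nonzero and nothing in the increment distinguishes the two source functions. Where `long` is wider, the strides differ, and that difference is the information a 32-bit binary no longer carries.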
Joshua Cranmer 🐧 <Pidgeot18 at gmail.com> writes:

>> So, I think your "It is not possible" is a bit too black and white.
>
> No, it's AI-hard, as evidenced by the fact that porting programs from
> 32-bit to 64-bit at the source-code level is nontrivial for large
> projects with lots of developers. And you only have less information
> at the assembly level.

Now, *that's* a very good argument.
On 12 March 2013 17:10, Joshua Cranmer 🐧 <Pidgeot18 at gmail.com> wrote:
> On 3/12/2013 11:55 AM, James Courtier-Dutton wrote:
>>
>> 2) From the binary, I would know if it was for 32bit or 64bit.
>> 3) I could then use (1) and (2) to know if "add %rax, 8" is "p = p +
>> 1" (64bit long), or "p = p + 2" (32bit long).
>>
>> So, I think your "It is not possible" is a bit too black and white.
>
>
> No, it's AI-hard, as evidenced by the fact that porting programs from
> 32-bit to 64-bit at the source-code level is nontrivial for large
> projects with lots of developers. And you only have less information
> at the assembly level.
>

So, let's take the source-code level case.
You can write a program that will compile unchanged to produce either a
32-bit or a 64-bit application.
Proof of this is almost any Linux-based distro, which is available with
both 32-bit and 64-bit applications.

So, ask a different question:
Instead of porting a 32-bit program to 64-bit, port the 32-bit program
to a program that will work equally well whether it is compiled for a
32-bit or a 64-bit target.

A first step might be to look at every use of "int" and "long" and
replace them with int32_t or int64_t, i.e. replace target-specific
types with target-agnostic types.
So, if the binary is 32bit, int will be 32bit: change the source code to
say "int32_t" instead of "int".
If the binary is 32bit and long is 32bit on that target, change the
source code to say "int32_t" there as well.

I know that there will be special cases that are difficult to handle.
I don't expect 100%. I am looking to write a tool that can do, say, 80%
of the work.
I believe that I could recognise blocks that we know will work, and
highlight the "unsure" sections of the code for closer inspection.
I am hoping to be able to separate target-agnostic code from
target-specific code, and automate the target-agnostic parts.

My current decompiler does statistical analysis in order to identify
types.
E.g. this register at this instruction is most likely an int32_t but
might be a uint32_t, and is definitely not a uint64_t.

So, it is not black and white. I want it to work, say, 80% of the time,
but at least highlight where the remaining 20% is, so that manual work
can be done on it.
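The message does not say how the decompiler's statistical analysis works, so the following is only a guess at the general shape of such a scheme, not a description of the actual tool. Each observed use of a register votes for or against candidate types, and a small margin between the two best candidates marks the register as "unsure" for manual review. The evidence kinds, the weights, and the names `infer_type` and `struct guess` are all invented for illustration.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Candidate types for one register at one instruction (hypothetical). */
enum cand { CAND_INT32, CAND_UINT32, CAND_UINT64, CAND_COUNT };

/* Kinds of usage evidence a decompiler might collect (hypothetical). */
enum evidence {
    EV_SIGNED_COMPARE,
    EV_UNSIGNED_COMPARE,
    EV_32BIT_OPERAND,
    EV_COUNT
};

/* Made-up weights: how strongly each kind of evidence supports or
   contradicts each candidate type. */
static const int weight[EV_COUNT][CAND_COUNT] = {
    /* EV_SIGNED_COMPARE   */ { +2, -2, -2 },
    /* EV_UNSIGNED_COMPARE */ { -2, +2, +2 },
    /* EV_32BIT_OPERAND    */ { +1, +1, -4 },
};

struct guess {
    enum cand best;  /* highest-scoring candidate */
    int margin;      /* lead over the runner-up; small means "unsure" */
};

/* Sum the votes for every candidate and report the winner together
   with its lead over the second-best candidate. */
struct guess infer_type(const enum evidence *ev, size_t n) {
    int score[CAND_COUNT] = {0};
    for (size_t i = 0; i < n; ++i) {
        assert((unsigned)ev[i] < (unsigned)EV_COUNT);
        for (int c = 0; c < CAND_COUNT; ++c)
            score[c] += weight[ev[i]][c];
    }

    struct guess g = { CAND_INT32, 0 };
    for (int c = 1; c < CAND_COUNT; ++c)
        if (score[c] > score[g.best])
            g.best = (enum cand)c;

    int second = INT_MIN;
    for (int c = 0; c < CAND_COUNT; ++c)
        if (c != (int)g.best && score[c] > second)
            second = score[c];

    g.margin = score[g.best] - second;
    return g;
}
```

With two 32-bit operand uses and one signed comparison as evidence, these weights pick `CAND_INT32` with a margin of 4. With no evidence at all the margin is 0, which a tool could report as a section needing closer inspection.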