For every level of translation (in the sense of "human readable -> machine code" translation, not someone translating a literary work from one language to another, although some subtle details are often lost there too), a little bit of the semantic meaning is lost. This means that you can almost never completely reconstruct the code in its original form from the machine-code, or the C code from the LLVM IR, or the C++ code from the output of something like cfront (the original C++ -> C translator), or the original Pascal code from the output of a Pascal-to-C compiler, etc.

It is, at least sometimes, possible to reconstruct from the binary file something that can then be "compiled" again [in quotes as it's a loose term in this discussion], but it often lacks some of the original subtlety. And there are certainly cases where the original code is very hard to derive from the machine-code. I played with a "symbolic disassembler" many years back, and on "well-behaved code" it would reconstruct assembly code that could be recompiled, but it struggled with, for example, switch statements that had become PC-relative jump tables: once you modified the code, it couldn't figure out where the jumps went. That is just one example.

I'm pretty sure it's possible, at least for a human, to write code that is nearly impossible to translate back to a higher-level language. Modern compilers may not use the same types of obfuscation, but they certainly produce code that is complex, hard to follow, and does not use the obvious instructions for a given purpose.

--
Mats

On 17 July 2015 at 17:11, Shuai Wang <wangshuai901 at gmail.com> wrote:
> This is not an easy task, and I believe there is *NO* (open-source) tool
> that can fully solve this problem (statically). Correct me if I am wrong.
>
> It would be more helpful if you could provide details about what you want
> to do: static or dynamic? A stripped binary, or a binary with symbolic
> information? Which compiler do you work with?
>
> Check out the papers below if you are interested.
>
> http://dl.acm.org/citation.cfm?id=2465380
> http://dl.acm.org/citation.cfm?id=2462165
>
> Shuai
>
> On Fri, Jul 17, 2015 at 3:09 AM, 慕冬亮 <mudongliangabcd at gmail.com> wrote:
>
>> I want to transform an ELF binary to LLVM IR, and do some instrumentation
>> based on LLVM.
>> Is there any tool which can do the transformation?
>> Thanks in advance.
>>
>> - mudongliang
>>
>> _______________________________________________
>> LLVM Developers mailing list
>> LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu
>> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
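The jump-table difficulty mentioned in the message above can be sketched in C. This is only a toy model, not how any particular disassembler or compiler works: the "image" is a hypothetical flat byte buffer, the table entries are offsets relative to the start of the table (a simplification of a PC-relative jump table), and all names (`table_entry`, `case_target`, `insert_bytes`, `table_goes_stale`) are invented for illustration.

```c
#include <stdint.h>
#include <string.h>

enum { TABLE_OFF = 0, USED = 24, IMAGE_SIZE = 32 };

/* Read the idx-th 32-bit table entry: the distance from the start of
   the table to the body of case idx. */
static int32_t table_entry(const unsigned char *image, int idx)
{
    int32_t e;
    memcpy(&e, image + TABLE_OFF + sizeof e * (size_t)idx, sizeof e);
    return e;
}

/* Where a jump through the table lands for case idx. */
static size_t case_target(const unsigned char *image, int idx)
{
    return (size_t)(TABLE_OFF + table_entry(image, idx));
}

/* Naive instrumentation: open an n-byte gap at pos and fill it with a
   placeholder byte, shifting the rest of the image up. The table is
   left untouched, as a tool would leave it if it mistook it for data. */
static void insert_bytes(unsigned char *image, size_t used,
                         size_t pos, size_t n)
{
    memmove(image + pos + n, image + pos, used - pos);
    memset(image + pos, 0x90, n);
}

/* Returns 1 if case 0 resolves correctly before the insertion and to
   the wrong byte afterwards. */
int table_goes_stale(void)
{
    unsigned char image[IMAGE_SIZE] = {0};
    const int32_t entries[2] = { 12, 20 };  /* case bodies at 12 and 20 */
    int ok_before, ok_after;

    memcpy(image + TABLE_OFF, entries, sizeof entries);
    image[12] = 0xA0;  /* marker: first byte of case 0 */
    image[20] = 0xB0;  /* marker: first byte of case 1 */

    ok_before = image[case_target(image, 0)] == 0xA0;

    insert_bytes(image, USED, 10, 4);  /* "instrument" at offset 10 */
    /* Case 0 now lives at offset 16, but the table still says 12. */
    ok_after = image[case_target(image, 0)] == 0xA0;

    return ok_before && !ok_after;
}
```

In this model the fix is easy because we know where the table is and what its entries mean. The hard part for a static tool is recognizing, in a stripped binary, that those bytes are a table of offsets at all.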
Hello Mats,

I am sorry, but I didn't fully get your point. Things have been moving forward, and recent research has (marginally) solved some of the obstacles raised before.

I have been working on related reverse-engineering topics for a while, and according to my review there is no open-source tool that can fully solve this challenge, even for binaries generated from well-written C programs by a widely used compiler (32-bit gcc, with no optimization). We can discuss this further by email if you would like.

You might want to check the papers I listed in my previous email, which discuss several issues in translating binaries into LLVM IR, as well as some recent research papers on disassembly itself.

Sincerely,
Shuai

On Fri, Jul 17, 2015 at 12:45 PM, mats petersson <mats at planetcatfish.com> wrote:
> And there are certainly cases where the original code is very hard to
> derive from the machine-code.
Shuai:

I think we are agreeing - I was just saying that it's very difficult, but in a different way from how you were saying it.

A large part of the difficulty is that there is "more information" in the higher-level description of code than there is in the lower-level one, and the compiler/translator "removes" (some of) that information when compiling.

A very simple example is:

    struct ab {
        int a;
        float b;
    };

    struct ab a;
    foo(a);

which will look (in most compilers) the same as

    int a;
    float b;
    foo(a, b);

Debug information and symbols can of course help here, but if the code doesn't have them, there isn't any way to tell `foo(int, float)` from `foo(struct ab)` as a signature.

So whilst it MAY be possible to recreate code at a higher level that is functionally equivalent, a lot of the "helping" features of the high-level language will go missing because the information was "removed" by the compiler.

--
Mats

On 17 July 2015 at 18:31, Shuai Wang <wangshuai901 at gmail.com> wrote:
> I am sorry but I didn't fully get your point.
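To make the struct example above concrete, here is a minimal C sketch. It assumes a target where `int` and `float` have the same size and alignment, so `struct ab` has no padding, and a calling convention that copies by-value arguments into a contiguous argument area (as stack-based 32-bit x86 conventions do). On register-based ABIs the two calls can use different registers, so this is an illustration of the idea, not a universal rule. The helper name `layouts_match` is hypothetical.

```c
#include <stddef.h>
#include <string.h>

struct ab {
    int a;
    float b;
};

/* Returns 1 when the bytes of a struct ab are identical to an int
   followed directly by a float, i.e. when the memory image alone
   carries no trace of the struct. Returns 0 if this target pads the
   struct, in which case the layouts do differ. */
int layouts_match(int a, float b)
{
    struct ab s;
    unsigned char flat[sizeof(int) + sizeof(float)];

    if (sizeof s != sizeof flat || offsetof(struct ab, b) != sizeof(int))
        return 0;

    s.a = a;
    s.b = b;

    /* Lay out the two scalars back to back, as two separate
       arguments would sit in a contiguous argument area. */
    memcpy(flat, &a, sizeof a);
    memcpy(flat + sizeof a, &b, sizeof b);

    return memcmp(&s, flat, sizeof flat) == 0;
}
```

Where this returns 1, a tool looking only at the bytes handed to `foo` sees the same thing for `foo(struct ab)` and `foo(int, float)`. Recovering the struct then depends on debug information or on heuristics.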