Hi, I am writing a decompiler. I was wondering if some of LLVM could be used for a decompiler. There are several stages in the decompiler process. 1) Take binary and create a higher level representation of it. Like RTL. 2) The output is then broken into blocks or nodes, each block ends in a CALL, JMP, RET, or 2-way or multiway conditional JMP. 3) The blocks or nodes are then analyzed for structure in order to extract loop information and if...then...else information. 4) Once structure is obtained, data types can be analyzed. 5) Lastly, source code is output in C or C++ or whatever is needed. I was wondering if LLVM could help with any of these steps. I am looking at doing step (3) better. Can LLVM help in that area? Kind Regards James
> 3) The blocks or nodes are then analyzed for structure in order to > extract loop information and if...then...else information. > 4) Once structure is obtained, data types can be analyzed. > 5) Lastly, source code is output in C or C++ or whatever is needed. > > I was wondering if LLVM could help with any of these steps. > I am looking at doing step (3) better. Can LLVM help in that area?You can check out lib/Analysis/* . Regards, chenwj -- Wei-Ren Chen (陳韋任) Computer Systems Lab, Institute of Information Science, Academia Sinica, Taiwan (R.O.C.) Tel:886-2-2788-3799 #1667 Homepage: http://people.cs.nctu.edu.tw/~chenwj
Hi James, In short, decompilation is currently not available. However, there was a discussion on the mailinglist a while ago where someone proposed a Google Summer of Code project to make it available. You might find the resulting discussion interesting. http://lists.cs.uiuc.edu/pipermail/llvmdev/2012-April/048617.html Kind regards, Roel On 07/05/12 12:47, James Courtier-Dutton wrote:> Hi, > > I am writing a decompiler. I was wondering if some of LLVM could be > used for a decompiler. > There are several stages in the decompiler process. > 1) Take binary and create a higher level representation of it. Like RTL. > 2) The output is then broken into blocks or nodes, each block ends in > a CALL, JMP, RET, or 2-way or multiway conditional JMP. > 3) The blocks or nodes are then analyzed for structure in order to > extract loop information and if...then...else information. > 4) Once structure is obtained, data types can be analyzed. > 5) Lastly, source code is output in C or C++ or whatever is needed. > > I was wondering if LLVM could help with any of these steps. > I am looking at doing step (3) better. Can LLVM help in that area? > > Kind Regards > > James > _______________________________________________ > LLVM Developers mailing list > LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu > http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
On 5/7/2012 5:47 AM, James Courtier-Dutton wrote:> Hi, > > I am writing a decompiler. I was wondering if some of LLVM could be > used for a decompiler. > There are several stages in the decompiler process. > 1) Take binary and create a higher level representation of it. Like RTL. > 2) The output is then broken into blocks or nodes, each block ends in > a CALL, JMP, RET, or 2-way or multiway conditional JMP. > 3) The blocks or nodes are then analyzed for structure in order to > extract loop information and if...then...else information. > 4) Once structure is obtained, data types can be analyzed. > 5) Lastly, source code is output in C or C++ or whatever is needed. > > I was wondering if LLVM could help with any of these steps. > I am looking at doing step (3) better. Can LLVM help in that area?The problem of extracting structured control flow graphs is, to my knowledge, relatively solved. I have seen some work by a few groups on decompiling LLVM IR to C; one of the results may be found here: <https://bitbucket.org/gnarf/axtor/>. The other issue is that it is not that feasible to translate machine code to LLVM IR right now since the LLVM IR is at a very high level, so the task of translation would be akin to reimplementing much of, say, IDA. -- Joshua Cranmer News submodule owner DXR coauthor
On 5/7/12 5:47 AM, James Courtier-Dutton wrote:> Hi, > > I am writing a decompiler. I was wondering if some of LLVM could be > used for a decompiler. > There are several stages in the decompiler process. > 1) Take binary and create a higher level representation of it. Like RTL. > 2) The output is then broken into blocks or nodes, each block ends in > a CALL, JMP, RET, or 2-way or multiway conditional JMP.I'm not sure that there's anything that will help you with this step for LLVM. The closest I can think of is Qemu, and I think that uses dynamic binary translation (i.e., you have to run the binary program).> 3) The blocks or nodes are then analyzed for structure in order to > extract loop information and if...then...else information.Given that you've completed steps one and two (i.e., you've converted the binary instructions to LLVM IR and then discovered basic blocks), then yes, LLVM's current analysis passes should help you with this third step. LLVM has passes that normalize loops, identify loops in local control-flow graphs, identify dominators/post-dominators, etc.> 4) Once structure is obtained, data types can be analyzed.The only thing for LLVM which could help here is a type-inference/points-to analysis called DSA. However, since you're reversing everything from binary code, I doubt DSA's type-inference will work well, so I don't think it will find many (if any) high-level types like structs or arrays of structs. You might be able to build a more sophisticated analysis yourself, but you'll pretty much be on your own.> 5) Lastly, source code is output in C or C++ or whatever is needed.LLVM might have facilities for converting LLVM IR to C or C++ code (the C backend was recently removed; there might be a C++ backend, but I'm not sure). However, they are primarily designed for systems for which LLVM does not provide a native code generator, so the C/C++ code they output isn't very readable. -- John T.> > I was wondering if LLVM could help with any of these steps. > I am looking at doing step (3) better. Can LLVM help in that area? > > Kind Regards > > James > _______________________________________________ > LLVM Developers mailing list > LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu > http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
> > I am writing a decompiler. I was wondering if some of LLVM could be > > used for a decompiler. > > There are several stages in the decompiler process. > > 1) Take binary and create a higher level representation of it. Like RTL. > > 2) The output is then broken into blocks or nodes, each block ends in > > a CALL, JMP, RET, or 2-way or multiway conditional JMP. > > I'm not sure that there's anything that will help you with this step for > LLVM. The closest I can think of is Qemu, and I think that uses dynamic > binary translation (i.e., you have to run the binary program).You can tranlate TCG IR into LLVM IR instead of writing your own tranlation function, if you want to leverage QEMU for this part. Regards, chenwj -- Wei-Ren Chen (陳韋任) Computer Systems Lab, Institute of Information Science, Academia Sinica, Taiwan (R.O.C.) Tel:886-2-2788-3799 #1667 Homepage: http://people.cs.nctu.edu.tw/~chenwj
On 7 May 2012 16:31, John Criswell <criswell at illinois.edu> wrote:> On 5/7/12 5:47 AM, James Courtier-Dutton wrote: >> >> Hi, >> >> I am writing a decompiler. I was wondering if some of LLVM could be >> used for a decompiler. >> There are several stages in the decompiler process. >> 1) Take binary and create a higher level representation of it. Like RTL. >> 2) The output is then broken into blocks or nodes, each block ends in >> a CALL, JMP, RET, or 2-way or multiway conditional JMP. > > > I'm not sure that there's anything that will help you with this step for > LLVM. The closest I can think of is Qemu, and I think that uses dynamic > binary translation (i.e., you have to run the binary program). >No problem. I have already coded (1) and (2) https://github.com/jcdutton/libbeauty It uses a tiny VM to help it do 1 and 2, so will eventually be able to also handle self modifying code. It is similar to qemu in that respect, except that it is not so concerned with real time execution of the code that qemu provides.> >> 3) The blocks or nodes are then analyzed for structure in order to >> extract loop information and if...then...else information. > > > Given that you've completed steps one and two (i.e., you've converted the > binary instructions to LLVM IR and then discovered basic blocks), then yes, > LLVM's current analysis passes should help you with this third step. LLVM > has passes that normalize loops, identify loops in local control-flow > graphs, identify dominators/post-dominators, etc.Great, which bit of the LLVM source code does this bit (3)?> > >> 4) Once structure is obtained, data types can be analyzed. > > > The only thing for LLVM which could help here is a type-inference/points-to > analysis called DSA. However, since you're reversing everything from binary > code, I doubt DSA's type-inference will work well, so I don't think it will > find many (if any) high-level types like structs or arrays of structs. > > You might be able to build a more sophisticated analysis yourself, but > you'll pretty much be on your own.I agree, my need to discover data types is not a function that a compiler needs to do. The Source code has already told the compiler about the data types. I will work on this on my own, once I have (3) done.> > >> 5) Lastly, source code is output in C or C++ or whatever is needed. > > > LLVM might have facilities for converting LLVM IR to C or C++ code (the C > backend was recently removed; there might be a C++ backend, but I'm not > sure). However, they are primarily designed for systems for which LLVM does > not provide a native code generator, so the C/C++ code they output isn't > very readable.Is the reason that the C/C++ code is not very readable, because there is not high enough metadata in the LLVM IR to do it? I.e. Lack of structure. Or was it because it was only designed to be input to another compiler, so pretty structure like for loops etc, was not necessary? Why was the C backend removed? James