thr3ads.net - llvm dev - [LLVMdev] Using LLVM for decompiling. [May 2012]

If this information is useful, please help other people find it:
Share via:

James Courtier-Dutton

2012-May-07 16:45 UTC

[LLVMdev] Using LLVM for decompiling.

On 7 May 2012 16:31, John Criswell <criswell at illinois.edu>
wrote:> On 5/7/12 5:47 AM, James Courtier-Dutton wrote:
>>
>> Hi,
>>
>> I am writing a decompiler. I was wondering if some of LLVM could be
>> used for a decompiler.
>> There are several stages in the decompiler process.
>> 1) Take binary and create a higher level representation of it. Like
RTL.
>> 2) The output is then broken into blocks or nodes, each block ends in
>> a CALL, JMP, RET, or 2-way or multiway conditional JMP.
>
>
> I'm not sure that there's anything that will help you with this
step for
> LLVM.  The closest I can think of is Qemu, and I think that uses dynamic
> binary translation (i.e., you have to run the binary program).
>No problem. I have already coded (1) and (2)
https://github.com/jcdutton/libbeauty
It uses a tiny VM to help it do 1 and 2, so will eventually be able to
also handle self modifying code.
It is similar to qemu in that respect, except that it is not so
concerned with real time execution of the code that qemu
provides.>
>> 3) The blocks or nodes are then analyzed for structure in order to
>> extract loop information and if...then...else information.
>
>
> Given that you've completed steps one and two (i.e., you've
converted the
> binary instructions to LLVM IR and then discovered basic blocks), then yes,
> LLVM's current analysis passes should help you with this third step.
 LLVM
> has passes that normalize loops, identify loops in local control-flow
> graphs, identify dominators/post-dominators, etc.Great, which bit of the LLVM source code does this bit (3)?
>
>
>> 4) Once structure is obtained, data types can be analyzed.
>
>
> The only thing for LLVM which could help here is a type-inference/points-to
> analysis called DSA.  However, since you're reversing everything from
binary
> code, I doubt DSA's type-inference will work well, so I don't think
it will
> find many (if any) high-level types like structs or arrays of structs.
>
> You might be able to build a more sophisticated analysis yourself, but
> you'll pretty much be on your own.I agree, my need to discover data types is not a function that a
compiler needs to do.
The Source code has already told the compiler about the data types.
I will work on this on my own, once I have (3) done.
>
>
>> 5) Lastly, source code is output in C or C++ or whatever is needed.
>
>
> LLVM might have facilities for converting LLVM IR to C or C++ code (the C
> backend was recently removed; there might be a C++ backend, but I'm not
> sure).  However, they are primarily designed for systems for which LLVM
does
> not provide a native code generator, so the C/C++ code they output
isn't
> very readable.Is the reason that the C/C++ code is not very readable, because there
is not high enough metadata in the LLVM IR to do it?
I.e. Lack of structure.
Or was it because it was only designed to be input to another
compiler, so pretty structure like for loops etc, was not necessary?
Why was the C backend removed?

James

Eli Friedman

2012-May-07 17:05 UTC

head link

[LLVMdev] Using LLVM for decompiling.

On Mon, May 7, 2012 at 9:45 AM, James Courtier-Dutton
<james.dutton at gmail.com> wrote:> On 7 May 2012 16:31, John Criswell <criswell at illinois.edu> wrote:
>> On 5/7/12 5:47 AM, James Courtier-Dutton wrote:
>>>
>>> Hi,
>>>
>>> I am writing a decompiler. I was wondering if some of LLVM could be
>>> used for a decompiler.
>>> There are several stages in the decompiler process.
>>> 1) Take binary and create a higher level representation of it. Like
RTL.
>>> 2) The output is then broken into blocks or nodes, each block ends
in
>>> a CALL, JMP, RET, or 2-way or multiway conditional JMP.
>>
>>
>> I'm not sure that there's anything that will help you with this
step for
>> LLVM.  The closest I can think of is Qemu, and I think that uses
dynamic
>> binary translation (i.e., you have to run the binary program).
>>
> No problem. I have already coded (1) and (2)
> https://github.com/jcdutton/libbeauty
> It uses a tiny VM to help it do 1 and 2, so will eventually be able to
> also handle self modifying code.
> It is similar to qemu in that respect, except that it is not so
> concerned with real time execution of the code that qemu provides.
>>
>>> 3) The blocks or nodes are then analyzed for structure in order to
>>> extract loop information and if...then...else information.
>>
>>
>> Given that you've completed steps one and two (i.e., you've
converted the
>> binary instructions to LLVM IR and then discovered basic blocks), then
yes,
>> LLVM's current analysis passes should help you with this third
step.  LLVM
>> has passes that normalize loops, identify loops in local control-flow
>> graphs, identify dominators/post-dominators, etc.
> Great, which bit of the LLVM source code does this bit (3)?
>
>>
>>
>>> 4) Once structure is obtained, data types can be analyzed.
>>
>>
>> The only thing for LLVM which could help here is a
type-inference/points-to
>> analysis called DSA.  However, since you're reversing everything
from binary
>> code, I doubt DSA's type-inference will work well, so I don't
think it will
>> find many (if any) high-level types like structs or arrays of structs.
>>
>> You might be able to build a more sophisticated analysis yourself, but
>> you'll pretty much be on your own.
> I agree, my need to discover data types is not a function that a
> compiler needs to do.
> The Source code has already told the compiler about the data types.
> I will work on this on my own, once I have (3) done.
>
>>
>>
>>> 5) Lastly, source code is output in C or C++ or whatever is needed.
>>
>>
>> LLVM might have facilities for converting LLVM IR to C or C++ code (the
C
>> backend was recently removed; there might be a C++ backend, but I'm
not
>> sure).  However, they are primarily designed for systems for which LLVM
does
>> not provide a native code generator, so the C/C++ code they output
isn't
>> very readable.
> Is the reason that the C/C++ code is not very readable, because there
> is not high enough metadata in the LLVM IR to do it?
> I.e. Lack of structure.
> Or was it because it was only designed to be input to another
> compiler, so pretty structure like for loops etc, was not necessary?
The old C backend was purely designed to be input to another compiler;
you could easily write a version with much prettier output.
> Why was the C backend removed?
Basically, it was unmaintained, the output was broken, and nobody was
willing to spend the time to fix it.

-Eli

Joshua Cranmer

2012-May-07 17:08 UTC

head link

[LLVMdev] Using LLVM for decompiling.

On 5/7/2012 11:45 AM, James Courtier-Dutton wrote:> On 7 May 2012 16:31, John Criswell<criswell at illinois.edu>  wrote:
>> Given that you've completed steps one and two (i.e., you've
converted the
>> binary instructions to LLVM IR and then discovered basic blocks), then
yes,
>> LLVM's current analysis passes should help you with this third
step.  LLVM
>> has passes that normalize loops, identify loops in local control-flow
>> graphs, identify dominators/post-dominators, etc.
> Great, which bit of the LLVM source code does this bit (3)?
Several of the passes in Analysis or Transforms. The code doesn't really 
work if it's not being output in IR, which your current library doesn't 
appear to be spitting out.>> LLVM might have facilities for converting LLVM IR to C or C++ code (the
C
>> backend was recently removed; there might be a C++ backend, but I'm
not
>> sure).  However, they are primarily designed for systems for which LLVM
does
>> not provide a native code generator, so the C/C++ code they output
isn't
>> very readable.
> Is the reason that the C/C++ code is not very readable, because there
> is not high enough metadata in the LLVM IR to do it?
> I.e. Lack of structure.
> Or was it because it was only designed to be input to another
> compiler, so pretty structure like for loops etc, was not necessary?
> Why was the C backend removed?
>The original reason for the C-backend was to be able to use LLVM for 
optimizations on architectures that it didn't have code generators for. 
However, several changes had been subsequently made to the IR that 
didn't update the C-backend code generator, so it got to the point where 
it tended to work only on small, trivial code samples.

It was never intended to be a decompiler, so the output was never 
particularly readable, and it didn't bother trying to structure if 
statements (IIRC, it was pretty much entirely goto-based). I earlier 
linked to a project which purports to be able to output more readable 
code as a backend (to try to do source-to-source translations for OpenCL 
kernels), but I haven't had the time to investigate it in great detail, 
so I don't know how well it holds up to the code.

Decompiling control flow is fairly easy to do (google "control flow 
structuring" or similar terms and you'll hit open a few troves of
papers
which give fairly clear details on how to do it), so it's rather the 
problem that stripped, optimized executables don't give you reliable 
ways to find functions (or parameters, for that matter) and the complete 
abolition of typing and variable information that makes decompiling 
extremely difficult.

-- 
Joshua Cranmer
News submodule owner
DXR coauthor

John Criswell

2012-May-07 17:13 UTC

head link

[LLVMdev] Using LLVM for decompiling.

On 5/7/12 11:45 AM, James Courtier-Dutton wrote:> On 7 May 2012 16:31, John Criswell<criswell at illinois.edu>  wrote:
>> On 5/7/12 5:47 AM, James Courtier-Dutton wrote:
>>> Hi,
>>>
>>> I am writing a decompiler. I was wondering if some of LLVM could be
>>> used for a decompiler.
>>> There are several stages in the decompiler process.
>>> 1) Take binary and create a higher level representation of it. Like
RTL.
>>> 2) The output is then broken into blocks or nodes, each block ends
in
>>> a CALL, JMP, RET, or 2-way or multiway conditional JMP.
>>
>> I'm not sure that there's anything that will help you with this
step for
>> LLVM.  The closest I can think of is Qemu, and I think that uses
dynamic
>> binary translation (i.e., you have to run the binary program).
>>
> No problem. I have already coded (1) and (2)
> https://github.com/jcdutton/libbeauty
> It uses a tiny VM to help it do 1 and 2, so will eventually be able to
> also handle self modifying code.
> It is similar to qemu in that respect, except that it is not so
> concerned with real time execution of the code that qemu provides.
>>> 3) The blocks or nodes are then analyzed for structure in order to
>>> extract loop information and if...then...else information.
>>
>> Given that you've completed steps one and two (i.e., you've
converted the
>> binary instructions to LLVM IR and then discovered basic blocks), then
yes,
>> LLVM's current analysis passes should help you with this third
step.  LLVM
>> has passes that normalize loops, identify loops in local control-flow
>> graphs, identify dominators/post-dominators, etc.
> Great, which bit of the LLVM source code does this bit (3)?
There are various passes: use http://llvm.org/docs/Passes.html to see 
which analyses exist in LLVM and then use the doxygen docs 
(http://llvm.org/doxygen/hierarchy.html) to get exact information on how 
to use them.

Examples of the above passes include llvm::LoopInfo, 
llvm::DominatorTree, and llvm::PostDominatorTree.

>
>>
>>> 4) Once structure is obtained, data types can be analyzed.
>>
>> The only thing for LLVM which could help here is a
type-inference/points-to
>> analysis called DSA.  However, since you're reversing everything
from binary
>> code, I doubt DSA's type-inference will work well, so I don't
think it will
>> find many (if any) high-level types like structs or arrays of structs.
>>
>> You might be able to build a more sophisticated analysis yourself, but
>> you'll pretty much be on your own.
> I agree, my need to discover data types is not a function that a
> compiler needs to do.
Actually, tools like the SAFECode compiler might benefit from more 
aggressive type-reconstruction.  What we've seen is that certain LLVM 
transforms (I think the loop transforms and instcombine) "remove" type
information in the IR.  For example, a transform will change a 
getelementptr instruction that indexes into an array of structures into 
a getelementptr instruction that treats the same memory object as an 
array of bytes.  Both instructions do the same thing, but the former has 
information that make type-inference much easier.

There are two ways to fix the issue: either modify the transforms to 
maintain the type information when possible, or work harder to 
reconstruct it when it gets lost.  The former is better, I think, but 
the latter may still be useful.

In short, your efforts may benefit more than just you.

Just out of curiosity, what approach do you plan to take for this problem?
> The Source code has already told the compiler about the data types.
> I will work on this on my own, once I have (3) done.
>
>>
>>> 5) Lastly, source code is output in C or C++ or whatever is needed.
>>
>> LLVM might have facilities for converting LLVM IR to C or C++ code (the
C
>> backend was recently removed; there might be a C++ backend, but I'm
not
>> sure).  However, they are primarily designed for systems for which LLVM
does
>> not provide a native code generator, so the C/C++ code they output
isn't
>> very readable.
> Is the reason that the C/C++ code is not very readable, because there
> is not high enough metadata in the LLVM IR to do it?
> I.e. Lack of structure.
> Or was it because it was only designed to be input to another
> compiler, so pretty structure like for loops etc, was not necessary?
The latter; it was only designed to provide output to be consumed by 
another C compiler.
> Why was the C backend removed?
No one used it, and it was bit rotting.  If someone wants to bring it 
back to life, patches are welcome.

-- John T.
>
> James

Gordon Keiser

2012-May-07 17:35 UTC

head link

[LLVMdev] Using LLVM for decompiling.

> -----Original Message-----
> On Behalf Of James Courtier-Dutton
> To: John Criswell
> 
> On 7 May 2012 16:31, John Criswell <criswell at illinois.edu> wrote:
> > On 5/7/12 5:47 AM, James Courtier-Dutton wrote:
> >>
> >> Hi,
> >>
> >> I am writing a decompiler. I was wondering if some of LLVM could
be
> >> used for a decompiler.
> >> There are several stages in the decompiler process.
> >> 1) Take binary and create a higher level representation of it.
Like RTL.
> >> 2) The output is then broken into blocks or nodes, each block ends
in
> >> a CALL, JMP, RET, or 2-way or multiway conditional JMP.
> >
> >
> > I'm not sure that there's anything that will help you with
this step
> > for LLVM.  The closest I can think of is Qemu, and I think that uses
> > dynamic binary translation (i.e., you have to run the binary program).
> >
> No problem. I have already coded (1) and (2)
> https://github.com/jcdutton/libbeauty
> It uses a tiny VM to help it do 1 and 2, so will eventually be able to also
handle
> self modifying code.
> It is similar to qemu in that respect, except that it is not so concerned
with real
> time execution of the code that qemu provides.
> >
> >> 3) The blocks or nodes are then analyzed for structure in order to
> >> extract loop information and if...then...else information.
> >
> >
> > Given that you've completed steps one and two (i.e., you've
converted
> > the binary instructions to LLVM IR and then discovered basic blocks),
> > then yes, LLVM's current analysis passes should help you with this
> > third step.  LLVM has passes that normalize loops, identify loops in
> > local control-flow graphs, identify dominators/post-dominators, etc.
> Great, which bit of the LLVM source code does this bit (3)?
> 
> >
> >
> >> 4) Once structure is obtained, data types can be analyzed.
> >
> >
> > The only thing for LLVM which could help here is a
> > type-inference/points-to analysis called DSA.  However, since
you're
> > reversing everything from binary code, I doubt DSA's
type-inference
> > will work well, so I don't think it will find many (if any)
high-level types like
> structs or arrays of structs.
> >
> > You might be able to build a more sophisticated analysis yourself, but
> > you'll pretty much be on your own.
> I agree, my need to discover data types is not a function that a compiler
needs
> to do.
> The Source code has already told the compiler about the data types.
> I will work on this on my own, once I have (3) done.
> 
> >
> >
> >> 5) Lastly, source code is output in C or C++ or whatever is
needed.
> >
> >
> > LLVM might have facilities for converting LLVM IR to C or C++ code
> > (the C backend was recently removed; there might be a C++ backend, but
> > I'm not sure).  However, they are primarily designed for systems
for
> > which LLVM does not provide a native code generator, so the C/C++ code
> > they output isn't very readable.
> Is the reason that the C/C++ code is not very readable, because there is
not
> high enough metadata in the LLVM IR to do it?
> I.e. Lack of structure.
> Or was it because it was only designed to be input to another compiler, so
> pretty structure like for loops etc, was not necessary?
> Why was the C backend removed?
> 
> James
> 

Maybe OT, but you might take a look at the emscripten project (
https://github.com/kripken/emscripten/wiki )  , which is a Javascript backend
for LLVM.   It's actively maintained and might give you some better ideas of
how to do something like this once you can get back to bitcode level from
assembly.

Also take a look at  Disarm ( http://code.google.com/p/disarm/  ) which is an
ARM machine code to LLVM-IR disassembler.   There may be quite a bit of relevant
work done there.

Gordon Keiser
Software Development Engineer
Arxan Technologies
w:+1.765.889.4756   m:+1.765.237.4833  f:+1.415.247.0910
gkeiser at arxan.com  www.arxan.com 
Protecting the App Economy™

James Courtier-Dutton

2012-May-07 18:12 UTC

head link

[LLVMdev] Using LLVM for decompiling.

On 7 May 2012 18:08, Joshua Cranmer <pidgeot18 at gmail.com>
wrote:> On 5/7/2012 11:45 AM, James Courtier-Dutton wrote:
>> On 7 May 2012 16:31, John Criswell<criswell at illinois.edu>
 wrote:
>>> Given that you've completed steps one and two (i.e., you've
converted the
>>> binary instructions to LLVM IR and then discovered basic blocks),
then yes,
>>> LLVM's current analysis passes should help you with this third
step.  LLVM
>>> has passes that normalize loops, identify loops in local
control-flow
>>> graphs, identify dominators/post-dominators, etc.
>> Great, which bit of the LLVM source code does this bit (3)?
>
> Several of the passes in Analysis or Transforms. The code doesn't
really
> work if it's not being output in IR, which your current library
doesn't
> appear to be spitting out.
Yes, my code does not yet produce IR. I am exploring whether doing so,
would it help in the decompiler, or be more like square peg in round
hole.
>
> Decompiling control flow is fairly easy to do (google "control flow
> structuring" or similar terms and you'll hit open a few troves of
papers
> which give fairly clear details on how to do it), so it's rather the
> problem that stripped, optimized executables don't give you reliable
> ways to find functions (or parameters, for that matter) and the complete
> abolition of typing and variable information that makes decompiling
> extremely difficult.
>
That is right, I need to get the control flow structure working before
I can look further into data type analysis.
My current interest is decompiling stripped .o files.>From the point of view of data type analysis, so far, I have functionparameters listed, without typing. I have local variables listed.
I have determination of pointer or not.
I am going to approach it from a statistical point of view. I.e. If
the code implies certain data types, act it them. If it is still
ambiguous, ask a human to decide.
I am hoping to determine things like size of arrays by analyzing the
control flow for what range of values the index into the array can
take. So, a lot of previous work into static analysis will help me
there.
I am approaching the problem from the point of view of starting with a
binary that you have to decompile manually, with my decompile tool
helping to remove a lot of the painstaking work that can be done using
a computer algorithm of some type.

Kind Regards

James

Evan Cheng

2012-May-08 01:16 UTC

head link

[LLVMdev] Using LLVM for decompiling.

There is no code in LLVM to translate object code to LLVM ir. However, there is
work to do object to "structured" MC. Benjamin K. can you tell you all
about it.

Evan

On May 7, 2012, at 10:35 AM, Gordon Keiser wrote:
>> -----Original Message-----
>> On Behalf Of James Courtier-Dutton
>> To: John Criswell
>> 
>> On 7 May 2012 16:31, John Criswell <criswell at illinois.edu>
wrote:
>>> On 5/7/12 5:47 AM, James Courtier-Dutton wrote:
>>>> 
>>>> Hi,
>>>> 
>>>> I am writing a decompiler. I was wondering if some of LLVM
could be
>>>> used for a decompiler.
>>>> There are several stages in the decompiler process.
>>>> 1) Take binary and create a higher level representation of it.
Like RTL.
>>>> 2) The output is then broken into blocks or nodes, each block
ends in
>>>> a CALL, JMP, RET, or 2-way or multiway conditional JMP.
>>> 
>>> 
>>> I'm not sure that there's anything that will help you with
this step
>>> for LLVM.  The closest I can think of is Qemu, and I think that
uses
>>> dynamic binary translation (i.e., you have to run the binary
program).
>>> 
>> No problem. I have already coded (1) and (2)
>> https://github.com/jcdutton/libbeauty
>> It uses a tiny VM to help it do 1 and 2, so will eventually be able to
also handle
>> self modifying code.
>> It is similar to qemu in that respect, except that it is not so
concerned with real
>> time execution of the code that qemu provides.
>>> 
>>>> 3) The blocks or nodes are then analyzed for structure in order
to
>>>> extract loop information and if...then...else information.
>>> 
>>> 
>>> Given that you've completed steps one and two (i.e., you've
converted
>>> the binary instructions to LLVM IR and then discovered basic
blocks),
>>> then yes, LLVM's current analysis passes should help you with
this
>>> third step.  LLVM has passes that normalize loops, identify loops
in
>>> local control-flow graphs, identify dominators/post-dominators,
etc.
>> Great, which bit of the LLVM source code does this bit (3)?
>> 
>>> 
>>> 
>>>> 4) Once structure is obtained, data types can be analyzed.
>>> 
>>> 
>>> The only thing for LLVM which could help here is a
>>> type-inference/points-to analysis called DSA.  However, since
you're
>>> reversing everything from binary code, I doubt DSA's
type-inference
>>> will work well, so I don't think it will find many (if any)
high-level types like
>> structs or arrays of structs.
>>> 
>>> You might be able to build a more sophisticated analysis yourself,
but
>>> you'll pretty much be on your own.
>> I agree, my need to discover data types is not a function that a
compiler needs
>> to do.
>> The Source code has already told the compiler about the data types.
>> I will work on this on my own, once I have (3) done.
>> 
>>> 
>>> 
>>>> 5) Lastly, source code is output in C or C++ or whatever is
needed.
>>> 
>>> 
>>> LLVM might have facilities for converting LLVM IR to C or C++ code
>>> (the C backend was recently removed; there might be a C++ backend,
but
>>> I'm not sure).  However, they are primarily designed for
systems for
>>> which LLVM does not provide a native code generator, so the C/C++
code
>>> they output isn't very readable.
>> Is the reason that the C/C++ code is not very readable, because there
is not
>> high enough metadata in the LLVM IR to do it?
>> I.e. Lack of structure.
>> Or was it because it was only designed to be input to another compiler,
so
>> pretty structure like for loops etc, was not necessary?
>> Why was the C backend removed?
>> 
>> James
>> 
> 
> 
> Maybe OT, but you might take a look at the emscripten project (
https://github.com/kripken/emscripten/wiki )  , which is a Javascript backend
for LLVM.   It's actively maintained and might give you some better ideas of
how to do something like this once you can get back to bitcode level from
assembly.
> 
> Also take a look at  Disarm ( http://code.google.com/p/disarm/  ) which is
an ARM machine code to LLVM-IR disassembler.   There may be quite a bit of
relevant work done there.
> 
> Gordon Keiser
> Software Development Engineer
> Arxan Technologies
> w:+1.765.889.4756   m:+1.765.237.4833  f:+1.415.247.0910
> gkeiser at arxan.com  www.arxan.com 
> Protecting the App Economy™  
> 
> 
> 
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev

Possibly Parallel Threads

Search for more seemingly similar threads

llvm dev - May 2012 - [LLVMdev] Using LLVM for decompiling.

[LLVMdev] Using LLVM for decompiling.

[LLVMdev] Using LLVM for decompiling.

[LLVMdev] Using LLVM for decompiling.

[LLVMdev] Using LLVM for decompiling.

[LLVMdev] Using LLVM for decompiling.

[LLVMdev] Using LLVM for decompiling.

[LLVMdev] Using LLVM for decompiling.

Possibly Parallel Threads