thr3ads.net - llvm dev - [LLVMdev] GSoC Proposal: Table-Driven Decompilation [Apr 2012]

If this information is useful, please help other people find it:
Share via:

Charles Davis

2012-Apr-04 05:08 UTC

[LLVMdev] GSoC Proposal: Table-Driven Decompilation

Hi,

Here's one of my proposals for GSoC 2012. What do you think?

Chip

Project Title: Table-Driven Decompilation

Abstract:
Over the years, the LLVM family has grown to include nearly every type of build
tool in existence. One of the few missing is a decompiler. LLVM's TableGen
tool could potentially accelerate development of such a tool; most backends
already have the information needed to implement it. This project proposes
implementing support for decompilation in LLVM using information gleaned from
target description files. Such a decompiler could be used for analysis,
optimization, and recompilation of machine code.

Proposal:
Since its humble beginnings in 2001, LLVM has grown from a simple compiler
toolkit to an entire family of build tools. Currently, it includes an assembler,
a disassembler, a JIT, a C compiler, a debugger, an archiver, various tools for
analyzing object files, and even a linker. In fact, just about the only tool
missing from this set (aside from various language compilers) is a decompiler--a
tool to turn machine code back into LLVM IR. This project proposes adding such a
tool.

Some of the information needed to produce such a tool is already present--in the
form of target description files, some of which contain patterns used to
transform LLVM IR--or, more accurately, a selection DAG--into machine code.
Since a decompiler is largely a compiler working in reverse, it should be
conceivable to use these patterns to transform machine code back into a
selection DAG--and transform that, in turn, back into raw LLVM IR. To actually
read machine code, the decompiler will use the MC disassembler to produce MCInst
objects, which can be transformed back into CodeGen's MachineInstr
representation, so it can be fed through the selection DAG in reverse.

Some DAG->machine code transformations aren't controlled by TableGen
patterns, but by custom transformations implemented as C++ code. For those
transformations, custom C++ code performing the reverse transformation will be
necessary.

Proposed schedule:
Week 1-4: Work on MCInst->MachineInstr transformation.
Week 5: Work on TableGen backend to generate MachineInstr->SDAG tables.
Week 6-8: Work on support for custom MachineInstr->SDAG transformation.
Week 9-12: Work on SDAG->LLVM IR transformation.

Contact Information:
email: cdavis (at) mymail (dot) mines (dot) edu
phone: (seven-one-nine) 963-4781
IRC: cdavis5x at oftc, cdavis5x at freenode

Interest:
Since Apple gutted PowerPC support from Mac OS 10.7, I have been searching for a
way to run old PowerPC apps. While it would be easy to make my own recompiler
manually using LLVM's JIT, I believe it would be far more useful to create a
general framework for transforming machine code back to LLVM IR.

Use to LLVM:
Decomposing machine code back to LLVM IR could have potentially many uses within
the LLVM community. Such a component would make building a recompiler--a tool
that transforms machine code from one form to another--easier. In addition, it
would allow clients to use the existing LLVM optimization and analysis passes on
machine code, without the need to implement special tools to operate directly on
MCInst objects.

Experience:
I have been working with LLVM for over two and a half years. I implemented
support for the force_align_arg_pointer attribute, as well as the beginnings of
Win64 exception-handling support. In GSoC 2010, I added support for multiple C++
ABIs, and the beginnings of support for the Microsoft Visual C++ ABI. In
addition, I have also contributed to the Wine project--some of which was making
it easier to compile with Clang :).

Tobias Grosser

2012-Apr-04 06:50 UTC

head link

[LLVMdev] GSoC Proposal: Table-Driven Decompilation

On 04/04/2012 07:08 AM, Charles Davis wrote:> Hi,
>
> Here's one of my proposals for GSoC 2012. What do you think?
>
> Chip
>
> Project Title: Table-Driven Decompilation
>
> Abstract:
> Over the years, the LLVM family has grown to include nearly every type of
build tool in existence. One of the few missing is a decompiler. LLVM's
TableGen tool could potentially accelerate development of such a tool; most
backends already have the information needed to implement it. This project
proposes implementing support for decompilation in LLVM using information
gleaned from target description files. Such a decompiler could be used for
analysis, optimization, and recompilation of machine code.
Hi Chip,

I have little experience in this area, but here some feedback:

The proposal looks nice and decompilation and later binary to binary 
translation sounds very interesting. However, to me it seems this is a 
very difficult topic and it would be good if the proposal would show you 
understand the difficulties and you have ideas how to solve them. A 
topic that I heard is difficult is e.g. how to keep track of the state 
of registers and CPU flags.

The libcpu project solves this by not even trying to reverse the 
individual LLVM-IR to machine code transformations, but to directly emit 
LLVM-IR that directly models each instruction as function calls that 
perform the original calculation and that, at the same time, model the 
CPU state. You may want to investigate what they do exactly,
Akso it would be interesting to explain how your approach comparer to 
the libcpu approach? Do you think yours has benefits? What are its 
drawbacks? Did you consider to improve libcpu instead of starting your 
own tool?

Cheers
Tobi

Joshua Cranmer

2012-Apr-04 14:54 UTC

head link

[LLVMdev] GSoC Proposal: Table-Driven Decompilation

On 4/4/2012 12:08 AM, Charles Davis wrote:> Proposal:
> Since its humble beginnings in 2001, LLVM has grown from a simple compiler
toolkit to an entire family of build tools. Currently, it includes an assembler,
a disassembler, a JIT, a C compiler, a debugger, an archiver, various tools for
analyzing object files, and even a linker. In fact, just about the only tool
missing from this set (aside from various language compilers) is a decompiler--a
tool to turn machine code back into LLVM IR. This project proposes adding such a
tool.
>
> Some of the information needed to produce such a tool is already
present--in the form of target description files, some of which contain patterns
used to transform LLVM IR--or, more accurately, a selection DAG--into machine
code. Since a decompiler is largely a compiler working in reverse, it should be
conceivable to use these patterns to transform machine code back into a
selection DAG--and transform that, in turn, back into raw LLVM IR. To actually
read machine code, the decompiler will use the MC disassembler to produce MCInst
objects, which can be transformed back into CodeGen's MachineInstr
representation, so it can be fed through the selection DAG in reverse.
>
> Some DAG->machine code transformations aren't controlled by TableGen
patterns, but by custom transformations implemented as C++ code. For those
transformations, custom C++ code performing the reverse transformation will be
necessary.
This strikes me as an extremely ambitious project, and I wonder how much 
you could actually get done in a summer.

Take the MCInst->MachineInstr conversion. In order to properly do this 
phase, you have to reconstruct functions, basic blocks, control flow... 
as well as identifying global variables and their initializers (should 
they have any). I am not an expert in the state of the art, but some 
experiments with IDA Free indicate that identifying the latter correctly 
(consider arrays where you have references to particular elements of the 
array: you have references to the middle of the array, which would 
generally construct a new symbol offset... arrays of structs are even 
worse!).

Another issue you would have is getting the implicit register usages 
correct... if you leave these out, you would get a MachineFunction that 
you can't do any code generation optimizations on safely. This isn't 
necessarily a problem, but it's something which would need warning 
labels in big, flashing neon signs. In any case, tracking the implicit 
use of registers is critical to being able to do register
"deallocation"
and recovering SSA values for the machine code. It's not clear to me 
that you could let this fall out during the SelectionDAG process.

Finally, it's not clear to me that using a selection DAG would derive 
you much benefit instead of just doing a straight MachineInstr->LLVM IR 
transformation. However, this is not my area of expertise, so I won't 
comment any further in this regard.

-- 
Joshua Cranmer
News submodule owner
DXR coauthor

Charles Davis

2012-Apr-05 13:42 UTC

head link

[LLVMdev] GSoC Proposal: Table-Driven Decompilation

On Apr 4, 2012, at 12:50 AM, Tobias Grosser wrote:
> On 04/04/2012 07:08 AM, Charles Davis wrote:
>> Hi,
>> 
>> Here's one of my proposals for GSoC 2012. What do you think?
>> 
>> Chip
>> 
>> Project Title: Table-Driven Decompilation
>> 
>> Abstract:
>> Over the years, the LLVM family has grown to include nearly every type
of build tool in existence. One of the few missing is a decompiler. LLVM's
TableGen tool could potentially accelerate development of such a tool; most
backends already have the information needed to implement it. This project
proposes implementing support for decompilation in LLVM using information
gleaned from target description files. Such a decompiler could be used for
analysis, optimization, and recompilation of machine code.
> 
> Hi Chip,
> 
> I have little experience in this area, but here some feedback:
> 
> The proposal looks nice and decompilation and later binary to binary
translation sounds very interesting. However, to me it seems this is a very
difficult topic and it would be good if the proposal would show you understand
the difficulties and you have ideas how to solve them. A topic that I heard is
difficult is e.g. how to keep track of the state of registers and CPU flags.
> 
> The libcpu project solves this by not even trying to reverse the individual
LLVM-IR to machine code transformations, but to directly emit LLVM-IR that
directly models each instruction as function calls that perform the original
calculation and that, at the same time, model the CPU state. You may want to
investigate what they do exactly,
> Akso it would be interesting to explain how your approach comparer to the
libcpu approach? Do you think yours has benefits? What are its drawbacks?My plan was to do "register deallocation," whereupon code which uses a
limited number of physical registers would be converted back to SSA form. (I
guess I should have elaborated on that in my proposal, huh? :) I think my
approach would result in faster code, since it wouldn't basically be trying
to run an emulator on each instruction. A major drawback is that it seems to be
very difficult to reverse the IR->SDAG and SDAG->MachineInstr
transformations. Perhaps it would be better and faster to use something like
FastISel for MachineInstr->LLVM IR transformations?> 
> Did you consider to improve libcpu instead of starting your own tool?No, I did not. I'll have to look into that.

Thanks.

Chip

Charles Davis

2012-Apr-05 15:18 UTC

head link

[LLVMdev] GSoC Proposal: Table-Driven Decompilation

On Apr 4, 2012, at 8:54 AM, Joshua Cranmer wrote:
> On 4/4/2012 12:08 AM, Charles Davis wrote:
>> Proposal:
>> Since its humble beginnings in 2001, LLVM has grown from a simple
compiler toolkit to an entire family of build tools. Currently, it includes an
assembler, a disassembler, a JIT, a C compiler, a debugger, an archiver, various
tools for analyzing object files, and even a linker. In fact, just about the
only tool missing from this set (aside from various language compilers) is a
decompiler--a tool to turn machine code back into LLVM IR. This project proposes
adding such a tool.
>> 
>> Some of the information needed to produce such a tool is already
present--in the form of target description files, some of which contain patterns
used to transform LLVM IR--or, more accurately, a selection DAG--into machine
code. Since a decompiler is largely a compiler working in reverse, it should be
conceivable to use these patterns to transform machine code back into a
selection DAG--and transform that, in turn, back into raw LLVM IR. To actually
read machine code, the decompiler will use the MC disassembler to produce MCInst
objects, which can be transformed back into CodeGen's MachineInstr
representation, so it can be fed through the selection DAG in reverse.
>> 
>> Some DAG->machine code transformations aren't controlled by
TableGen patterns, but by custom transformations implemented as C++ code. For
those transformations, custom C++ code performing the reverse transformation
will be necessary.
> 
> This strikes me as an extremely ambitious project, and I wonder how much 
> you could actually get done in a summer.
> 
> Take the MCInst->MachineInstr conversion. In order to properly do this 
> phase, you have to reconstruct functions, basic blocks, control flow... 
> as well as identifying global variables and their initializers (should 
> they have any). I am not an expert in the state of the art, but some 
> experiments with IDA Free indicate that identifying the latter correctly 
> (consider arrays where you have references to particular elements of the 
> array: you have references to the middle of the array, which would 
> generally construct a new symbol offset... arrays of structs are even 
> worse!).Maybe I should devote my SoC project to just this, then? Perhaps, make a simple
MCInst->MachineInstr transformer that doesn't handle anything complex
like global variables or arrays or anything like that?> 
> Another issue you would have is getting the implicit register usages 
> correct... if you leave these out, you would get a MachineFunction that 
> you can't do any code generation optimizations on safely. This
isn't
> necessarily a problem, but it's something which would need warning 
> labels in big, flashing neon signs. In any case, tracking the implicit 
> use of registers is critical to being able to do register
"deallocation"
> and recovering SSA values for the machine code. It's not clear to me 
> that you could let this fall out during the SelectionDAG process.
True.> 
> Finally, it's not clear to me that using a selection DAG would derive 
> you much benefit instead of just doing a straight MachineInstr->LLVM IR 
> transformation. However, this is not my area of expertise, so I won't 
> comment any further in this regard.Perhaps after I finish MCInst->MachineInstr conversion, I should do a
MachineInstr->LLVM IR transformation--or maybe I should just do
MCInst->LLVM IR directly! I wanted to use the SelectionDAG because I wanted
to take advantage of the patterns stored in the target description files, but
now that I think about it, I'm not sure I'll get much out of it, either…

Thanks for your input.

Chip

Seemingly Similar Threads

Search for more reasonably related threads

llvm dev - Apr 2012 - [LLVMdev] GSoC Proposal: Table-Driven Decompilation

[LLVMdev] GSoC Proposal: Table-Driven Decompilation

[LLVMdev] GSoC Proposal: Table-Driven Decompilation

[LLVMdev] GSoC Proposal: Table-Driven Decompilation

[LLVMdev] GSoC Proposal: Table-Driven Decompilation

[LLVMdev] GSoC Proposal: Table-Driven Decompilation

Seemingly Similar Threads