On Dec 9, 2007, at 5:22 AM, Emmanuel Bastien wrote:
> Hi,
> Apart from the Calysto project
> (http://www.cs.ubc.ca/~babic/index_calysto.htm), is there any other
> static code analysis tool based on the LLVM framework?
> Calysto may be great but it seems that the source is not available
> (yet?).
> I was quite excited by Oink/Elsa few years ago but the project is
> almost dead even if the C++ parser is far from being complete.
> It seems to me that everything is ready in LLVM to build industrial-
> strength static analysis tools. Clang is of course a big step
> towards real-time parsing and IDE integration but the quality of
> llvm-gcc should be enough for many practical applications.
> I am interested in automated code review, coverage and metrics, as
> done by commercial products like Parasoft C++test. What I am not
> sure yet is whether the LLVM IR is rich enough for the job or if I
> should wait for the dedicated C++ AST of clang.
>
> Best regards,
> Emmanuel Bastien
> Amadeus IT Group SA
Hi Emmanuel,
We are currently building a static analysis framework as part of
clang. The goal is to provide a framework for a variety of tools that
could benefit from source-code level analysis, with a particular focus
on bug-finding (and possibly verification) tools. This work is
currently in the early stages, but we expect it to progress rapidly
over the next 6 months. For now this work targets the languages that
clang currently supports, fully or partially (C and Objective-C), but
the framework could be extended to analyze C++ as the frontend gains
support for that language.
We already have a library in clang for performing flow-sensitive,
intra-procedural dataflow analyses, and we plan to eventually provide
a framework for inter-procedural, path-sensitive analysis over entire
code bases. If you are interested in following the
progress of this work, I encourage you to subscribe to the cfe-dev
mailing list. You are also more than welcome to get involved in the
actual development of this framework by submitting patches or
providing feedback.
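As a rough illustration of what "flow-sensitive, intra-procedural"
means here, the sketch below runs a forward "definitely initialized"
analysis to a fixed point over a tiny control-flow graph for a single
function. It is not the clang library's API: the Block type, the
findMaybeUninitialized function, and the rule that a block reads its
variables before it writes them are all assumptions made for the
example.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <utility>
#include <vector>

using VarSet = std::set<std::string>;

// Hypothetical, simplified CFG node. Within a block, all reads (uses)
// are assumed to happen before all writes (defs).
struct Block {
    VarSet uses;                     // variables read in this block
    VarSet defs;                     // variables assigned in this block
    std::vector<std::size_t> preds;  // indices of predecessor blocks
};

// Forward "definitely initialized" dataflow analysis over one function.
// Block 0 is the entry block and is assumed to have no predecessors.
// Returns (block index, variable) pairs for reads that may observe an
// uninitialized value on at least one path.
std::vector<std::pair<std::size_t, std::string>>
findMaybeUninitialized(const std::vector<Block>& cfg, const VarSet& allVars) {
    assert(!cfg.empty() && cfg[0].preds.empty());

    // Optimistic start: everything is initialized, except at the entry.
    std::vector<VarSet> in(cfg.size(), allVars), out(cfg.size(), allVars);
    in[0].clear();
    out[0] = cfg[0].defs;

    bool changed = true;
    while (changed) {  // iterate until the sets stop shrinking
        changed = false;
        for (std::size_t b = 1; b < cfg.size(); ++b) {
            // Meet: keep a variable only if every predecessor provides it.
            VarSet newIn = allVars;
            for (std::size_t p : cfg[b].preds) {
                VarSet kept;
                for (const auto& v : newIn)
                    if (out[p].count(v)) kept.insert(v);
                newIn = kept;
            }
            // Transfer: the block's own assignments become initialized.
            VarSet newOut = newIn;
            newOut.insert(cfg[b].defs.begin(), cfg[b].defs.end());
            if (newIn != in[b] || newOut != out[b]) {
                in[b] = newIn;
                out[b] = newOut;
                changed = true;
            }
        }
    }

    std::vector<std::pair<std::size_t, std::string>> warnings;
    for (std::size_t b = 0; b < cfg.size(); ++b)
        for (const auto& v : cfg[b].uses)
            if (!in[b].count(v)) warnings.emplace_back(b, v);
    return warnings;
}
```

For a function shaped like "int x; if (c) x = 1; use(x);", modelled as
an entry block, a block that assigns x, and a join block that reads x
with both of the others as predecessors, the sketch reports the read
of x in the join block, because one incoming path never assigns it.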
Aside from our plans, it is probably worth taking a moment to
explain why we are implementing a source-level analysis framework at
all, especially when LLVM already provides an IR for analysis and
transformation. The motivation for performing static analysis at the
source level comes down to tradeoffs. The LLVM IR has some truly
beautiful properties: it is in SSA form, and it is low-level enough to
be essentially a typed assembly language. The IR can capture much of
the type information of the
original program while still providing a lowered program
representation that simplifies many kinds of analyses and program
optimizations. This lowering, however, is also a double-edged sword.
Much of the original (high-level) type information of the program is
discarded in the LLVM IR, a loss that becomes extremely important when
we start talking about analyzing object-oriented languages or any
language that has a rich type system. Such information can be
extremely useful when improving the precision of an analysis, or
simply for providing diagnosable output for a user concerning possible
bugs found by the tool. Moreover, a source-level analysis framework
captures a wide variety of other sources of information from the
program, such as macros, templates, scope, loop constructs, accurate
information regarding variable and function names, etc. All of these
things are lost or obscured in the lowering to LLVM IR. It is also in
many cases much easier to provide diagnosable output to the user about
potential bugs when full source-level information is available (which
is more than just the line and column information that may be present
in a .o file's debugging information or an LLVM bitcode file). Of
course, analyzing the original source can be much messier; a language
like C contains far more esoteric edge cases to reason about than the
LLVM IR.
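To make the point about macros concrete, here is a small sketch of a
bug that is easy to explain at the source level and hard to explain
once the macro has been expanded. The MAX_OF macro and the nextValue
helper are hypothetical names invented for this example.

```cpp
#include <cassert>

// Hypothetical macro: each argument may be evaluated more than once.
#define MAX_OF(a, b) ((a) > (b) ? (a) : (b))

int calls = 0;  // counts how many times nextValue() actually runs

// Helper with a side effect, so double evaluation is observable.
int nextValue() { return ++calls; }

// The source contains ONE textual call to nextValue(). After
// preprocessing there are two call sites, and both execute whenever
// the first result is greater than 0.
int callsPerSourceCall() {
    calls = 0;
    int larger = MAX_OF(nextValue(), 0);
    assert(larger >= 0);
    return calls;
}
```

A tool that only sees the expanded code finds two ordinary calls and
has no way to tell the user that they came from a single macro
argument. A source-level tool can point at the macro use itself.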
Most state-of-the-art (commercial) bug-finding tools based on static
analysis operate on an IR that is close to the source-level. Analysis
tools that operate on Java code can often get away with analyzing
only the bytecode, since the bytecode contains enough
information to recreate much of the original Java program (the type
system of Java is captured explicitly in the bytecode). Nevertheless,
this isn't always enough information. Things get especially difficult
in a language like C++, where macros, template instantiation, and
operator overloading can significantly obfuscate the mapping between a
lowered IR such as that used by LLVM and the original source code.
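A minimal sketch of the template and operator-overloading problem,
assuming a hypothetical Money type and sum3 template: the same source
expression, a + b + c, lowers to plain integer arithmetic in one
instantiation and to a chain of function calls in another.

```cpp
#include <cassert>

// Hypothetical value type, used only for illustration.
struct Money {
    long cents;
};

int overloadedAdds = 0;  // counts calls to the user-defined operator+

// For Money, "a + b" in the source is really a function call.
Money operator+(Money a, Money b) {
    ++overloadedAdds;
    return Money{a.cents + b.cents};
}

// One template in the source; each instantiation is lowered separately.
template <typename T>
T sum3(T a, T b, T c) {
    return a + b + c;
}

// Exercises both instantiations and reports how many operator+ calls ran.
int countOverloadedAdds() {
    overloadedAdds = 0;
    int plain = sum3(1, 2, 3);                     // built-in integer adds
    Money m = sum3(Money{1}, Money{2}, Money{3});  // two operator+ calls
    assert(plain == 6 && m.cents == 6);
    return overloadedAdds;
}
```

In the lowered form there are two unrelated functions, one per
instantiation, and mapping a finding in either of them back to the
single line "return a + b + c;" takes information that the source-level
representation still has.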
There are many other tradeoffs between doing source-level and LLVM IR-
level analysis. Which one you use at the end of the day depends on
your application and your precise goals. Of course many bug-finding
analyses could actually be done (well) at the LLVM IR level, while
others could be done far more successfully at the source-level.
Finally, an analysis framework that allows us to reason statically at
the source-level about the properties of C/Objective-C/C++/whatever
programs simply adds another tool to the LLVM toolbox. When
building a bug-finding tool, one can potentially use both the LLVM IR
and the source-level analysis framework that will be built into clang,
although we believe that in order to build a successful (static) bug-
finding tool a good source-level analysis framework is a prerequisite
piece.
Ted