thr3ads.net - llvm dev - [LLVMdev] Google Summer of Code proposal: Adding memory safety checks to the LLVM bitcodes [Apr 2012]

If this information is useful, please help other people find it:
Share via:

Raphael Ernani Rodrigues

2012-Mar-30 18:08 UTC

[LLVMdev] Google Summer of Code proposal: Adding memory safety checks to the LLVM bitcodes

Dear LLVMers,

   My name is Raphael Ernani, and I am doing my MsC at the Federal
University of Minas Gerais, Brazil. I have been using LLVM for a
while, and I would like to participate in this year's Summer of Code.
One particular idea, in your "open projects" page caught my eye, and I
decided to write a proposal about it. The line that I liked in the
page was "Create an LLVM pass that adds memory safety checks to code,
like Valgrind does for binaries, or like mudflap does for gcc compiled
code.", and my proposal is written below:


===============================================Adding memory safety checks to
the LLVM bitcodes
===============================================

Objective
---------

The objective of this project is to create an LLVM pass that adds
memory safety checks to code, like Valgrind does for binaries, or like
mudflap does for gcc compiled code.


How it will be done
-------------------

We will achieve our goal by creating a LLVM pass that does the following:
1 - Propagate the array size to the sites in the program where the
array is accessed.
2 - Insert code in the program to make a bound-check and throw an
exception if the program tries to access a position out of the array
bounds.
3 - Eliminate redundant bound-checks.


Background
----------

Invalid memory access is a major source of program failures. If a
program statement dereferences a pointer that points to an invalid
memory cell, the program is either aborted by the operating system or,
often worse, the program continues to run with an undefined behavior.
To avoid the latter, one can perform checks before every memory access
at run time. For some programming languages (e.g., Java) this is done
automatically by the compiler/run-time environment. For the language
C, neither the compiler nor the run-time environment enforces
memory-safety policies [1].

An example of access violation is given by the program below. When
this program runs, probably no visible error happens. However, the
program violates the semantics of well defined C code: it is accessing
the 11-th cell of an array that was declared with only 10 cells. Thus,
the behavior of this program is undefined.

#include <stdlib.h>
#include <stdio.h>
int main(int argc, char** argv){
       int i;
       int* A;
       A = (int*)calloc(10,sizeof(int));
       for(i=0;i<=11;i++)
               A[i] = i;
       printf("%d\n",A[11]);
       return 0;
}

This type of behavior can be the source of some very hard-to-find
bugs. If no explicit error occurs, other variables can have their
values changed without any reasonable explanation. Finding and fixing
this type of bug can be really hard, especially if the program runs
with more than one active thread. There exists tools that have been
built specially to track access violations like the one in the program
above. Valgrind is probably the best known example. These tools are
dynamic: they are used during the execution of the program, either
emulating it, or instrumenting it to catch errors at runtime. Tools
like Valgrind and mudflap are very useful; however, LLVM does not have
any similar function yet. Our goal is to help filling this gap.


How we plan to improve the code
-------------------------------

Before every read or write in a position of an array, our pass will
add instructions to check if the index of the array is outside the
bounds.
The secured program, without optimizations, is shown below:

#include <stdlib.h>
#include <stdio.h>
int main(int argc, char** argv){
       int i, tmp0;
       int* A;
       A = (int*)calloc(10,sizeof(int));
       for(i=0;i<=11;i++){
               if (i<0 or i>10) raise("Array out of bounds.");
               A[i] = i;
       }
       tmp0 = 11;
       if (tmp0<0 or tmp0>10) raise("Array out of bounds.");
       printf("%d\n",A[11]);
       return 0;
}

In order to accomplish this object, the new LLVM pass will have to
perform a number of actions:
1) Collect the allocation sizes of each array. This can be done at
runtime, simply reading the arguments passed to the allocation
function.
2) Store the size of the array together with the allocated area. This
can also be done dynamically, increasing in 32 bits the size of the
allocated area that is given to the array base address.
3) Prefix any array access code with a bounds check to ensure that the
used index is inside the correct limits.

After this, to improve the performance of the "safe programs", I can
remove redundant checks, by not adding new checks where already exist
a safety check. I can use two different abstract domains to accomplish
this goal, a simpler, and another more complex one:

1) The domain of "less-than" names. This is the domain used in the
ABCD array bounds check algorithm [2]. For each variable v is gives
the set of variables that are provably less than v. It is relatively
simple - in terms of implementation - to compute this domain; however,
this computation is costly: it is cubic on the number of variables in
the program. Nevertheless, I can implement the domain, to see if it
pays off its costs. There was an implementation of ABCD in LLVM 2.7,
and I can reuse a substantial part of that code.

2) The domain of intervals. For each variable v, this domain gives an
interval [l, u] with the minimum and maximum values that v can assume
during the program execution. I can use the implementation of range
analysis that we have developed in the Federal University of Minas
Gerais to compute this domain. The implementation is quite mature now,
and I was one of the developers that worked on it.

Linked programs
----------------

In order to be effective, the instrumentation pass should be a whole
program analysis. However, even in this case, only the code that has
been compiled with the pass will be secure. It is possible that
library calls still lead to undefined behavior, like in the program
below:

// We do not have access to this code, because it is an external library:
void foo(int* A, int i){
       A[i] = i;
}

// We have secured this code with our instrumentation:
#include <stdlib.h>
#include <stdio.h>
#include "A.h"
int main(int argc, char** argv){
       int i;
       int* A;
       A = (int*)calloc(10,sizeof(int));
       for(i=0;i<=11;i++){
               foo(A,i);
       }
       printf("%d\n",A[11]);
       return 0;
}

Timeline
--------

This is a three months project, and we would like to schedule the time
in the following way (schedule given in weeks):

[1-4]: Change the allocation scheme used by LLVM to store the array
size next to the area that is allocated to the array.

[5-6] Write the pass to insert access checks before array accesses.

[7-10] Improve the performance of safe programs.
- Avoid inserting bound-checks where already exists a memory check.
- Avoid inserting bound-checks where the "less-than" domain indicates
that the memory access is safe.

[11-12] Final tests and report
- Correctly compile the whole LLVM test suite.
- Generate performance numbers.
- Write the final report.


My background
-------------

I am currently a first year master student at the Federal University
of Minas Gerais, under orientation of professor Fernando Pereira. I
have graduated in this school on the year of 2011. I've worked with
software development for five years, in different projects. I built a
Data Warehouse viewer and Business Intelligence graphics and gauges
using Delphi; Commercial programs in MS Visual FoxPro; Improvements in
a logistics planning system, in Excel VBA. I have five years of
experience with SQL. Now, I'm working with LLVM, writing passes to
support scientifical researchers.


Why I can do well in this project
---------------------------------

I believe that I am a good candidate for this Summer of Code for four
reasons. Firstly, I am a master student of computer science in the
Federal
University of Minas Gerais (UFMG), which is one of the best schools of
computer science in Brazil. To support this claim, notice that UFMG
has the maximum grade in the ranking of CAPES (Coordenação de
Aperfeiçoamento de Pessoal de Nivel Superior - the Evaluation Body
under the Ministry of Education). I am working in the compiler's lab,
under the advice of Fernando Pereira, who was a Summer of Coder
himself in 2006. Second, I have good experience with programming, with
almost six years of experience in many languages (Java, C, C++, C#,
VB, VBA, Object Pascal, SQL, etc...). Third, I'm already working with
LLVM passes. I already know some LLVM tools and I already have the
basic knowledge to start this project. I have written a pass to
instrument LLVM bitcodes, in order to find the minimum and maximum
values that each variable assumes at runtime. This code has been used
to check the precision of our range analysis. It is publicly available
at (
http://code.google.com/p/range-analysis/source/browse/#svn%2Ftrunk%2Fsrc%2FRAInstrumentation
).
Finally, in the lab where I work there are six other people
who work everyday with LLVM; thus, in addition to getting help in the
forum, I can easily talk about my project to my colleagues.

References
----------

[1] Dirk Beyer, Thomas A. Henzinger, Ranjit Jhala and Rupak Majumdar.
Checking Memory Safety with Blast. FASE 2005, LNCS 3442, pages 2-18,
Springer-Verlag, 2005

[2] ABCD: eliminating array bounds checks on demand. Rastislav Bodik
and Rajiv Gupta and Vivek Sarkar. PLDI 2000, ACM, pages 321-333
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20120330/f1bc10b9/attachment.html>

John Criswell

2012-Mar-30 18:49 UTC

head link

[LLVMdev] Google Summer of Code proposal: Adding memory safety checks to the LLVM bitcodes

On 3/30/12 1:08 PM, Raphael Ernani Rodrigues wrote:> Dear LLVMers,
>
>    My name is Raphael Ernani, and I am doing my MsC at the Federal
> University of Minas Gerais, Brazil. I have been using LLVM for a
> while, and I would like to participate in this year's Summer of Code.
> One particular idea, in your "open projects" page caught my eye,
and I
> decided to write a proposal about it. The line that I liked in the
> page was "Create an LLVM pass that adds memory safety checks to code,
> like Valgrind does for binaries, or like mudflap does for gcc compiled
> code.", and my proposal is written below:
Hi Raphael!

I haven't read your proposal in detail, but there are actually three 
projects that implement memory safety by transforming LLVM bitcode:

1) ASAN is integrated into LLVM.  It is meant as a low-overhead 
debugging tool in the spirit of Valgrind.  It's primary technique is to 
place guard memory in between memory objects and check loads and stores 
to see if they return a poisoned bit-patter from a guard.  It isn't 
sound (certain dangling pointer errors and array bound violations can 
get past it), but it's easy to implement.

2) SAFECode (my project: http://safecode.cs.illinois.edu) aims to 
provide sound memory safety checks and is a sub-project of LLVM.  It 
inserts function pointer checks, precise structure and array bounds 
checks, load/store checks, and checks on C library function calls into 
code.  It is also designed to work with automatic pool allocation to 
provide sound points-to analysis and partial type-safety, but these 
features are currently disabled because they are not sufficiently 
robust.  While initially designed to protect applications during 
production, SAFECode has a generic pass to add debug information to its 
run-time checks, essentially make it a valgrind replacement like ASAN.

3) SoftBound and its CETS extension 
(http://www.cis.upenn.edu/acg/softbound) have been integrated into the 
SAFECode compiler and can be enabled with options in SAFECode's clang.  
It provides array bounds checking and, with CETS, optional dangling 
pointer detection.

While I don't think we need another memory safety checker for LLVM, I do 
think that there's *a lot* of interesting projects in making each of 
these tools better.  Some ideas that come to mind:

o Creating common instrumentation passes for ASAN, SAFECode, and 
SoftBound to permit code reuse.  All three essentially place checks on 
loads, stores, and atomic operations.  If they use the same run-time 
check names, then the same optimizations can improve the speed of all three.

o Creating or improving optimizations for memory safety checks.  Static 
array bounds checking alone provides numerous options, such as 
implementing ABCD, using value-range analysis (mentioned in another GSoC 
post), investigating SMT solver-based techniques, etc, etc.  Other 
optimizations include hoisting checks out of loops, polishing and 
re-enabling the SAFECode type-safety optimization, and removing 
redundant load/store checks.

o Polishing and enabling the Baggy Bounds Checking (BBC) run-time.  This 
run-time can speed up the run-time checks and should be faster than the 
splay-tree approach that SAFECode currently uses.

I'll go ahead and create a list of open projects on the SAFECode web 
page (http://sva.cs.illinois.edu), and I'll remove the "Valgrind"
tool
project off the open projects page as there are now several memory 
safety tools built using LLVM.  If you want my opinion, I think the 
static array bounds checker or the monotonic loop optimization make 
nice, self-contained projects.

Finally, you might be interested in the Memory Safety Menagerie 
(http://sva.cs.illinois.edu/menagerie/).  This web page contains a whole 
list of papers on the subject of memory safety transforms.

-- John T.
>
>
> ===============================================> Adding memory safety
checks to the LLVM bitcodes
> ===============================================>
>
> Objective
> ---------
>
> The objective of this project is to create an LLVM pass that adds
> memory safety checks to code, like Valgrind does for binaries, or like
> mudflap does for gcc compiled code.
>
>
> How it will be done
> -------------------
>
> We will achieve our goal by creating a LLVM pass that does the following:
> 1 - Propagate the array size to the sites in the program where the
> array is accessed.
> 2 - Insert code in the program to make a bound-check and throw an
> exception if the program tries to access a position out of the array
> bounds.
> 3 - Eliminate redundant bound-checks.
>
>
> Background
> ----------
>
> Invalid memory access is a major source of program failures. If a
> program statement dereferences a pointer that points to an invalid
> memory cell, the program is either aborted by the operating system or,
> often worse, the program continues to run with an undefined behavior.
> To avoid the latter, one can perform checks before every memory access
> at run time. For some programming languages (e.g., Java) this is done
> automatically by the compiler/run-time environment. For the language
> C, neither the compiler nor the run-time environment enforces
> memory-safety policies [1].
>
> An example of access violation is given by the program below. When
> this program runs, probably no visible error happens. However, the
> program violates the semantics of well defined C code: it is accessing
> the 11-th cell of an array that was declared with only 10 cells. Thus,
> the behavior of this program is undefined.
>
> #include <stdlib.h>
> #include <stdio.h>
> int main(int argc, char** argv){
>        int i;
>        int* A;
>        A = (int*)calloc(10,sizeof(int));
>        for(i=0;i<=11;i++)
>                A[i] = i;
>        printf("%d\n",A[11]);
>        return 0;
> }
>
> This type of behavior can be the source of some very hard-to-find
> bugs. If no explicit error occurs, other variables can have their
> values changed without any reasonable explanation. Finding and fixing
> this type of bug can be really hard, especially if the program runs
> with more than one active thread. There exists tools that have been
> built specially to track access violations like the one in the program
> above. Valgrind is probably the best known example. These tools are
> dynamic: they are used during the execution of the program, either
> emulating it, or instrumenting it to catch errors at runtime. Tools
> like Valgrind and mudflap are very useful; however, LLVM does not have
> any similar function yet. Our goal is to help filling this gap.
>
>
> How we plan to improve the code
> -------------------------------
>
> Before every read or write in a position of an array, our pass will
> add instructions to check if the index of the array is outside the
> bounds.
> The secured program, without optimizations, is shown below:
>
> #include <stdlib.h>
> #include <stdio.h>
> int main(int argc, char** argv){
>        int i, tmp0;
>        int* A;
>        A = (int*)calloc(10,sizeof(int));
>        for(i=0;i<=11;i++){
>                if (i<0 or i>10) raise("Array out of
bounds.");
>                A[i] = i;
>        }
>        tmp0 = 11;
>        if (tmp0<0 or tmp0>10) raise("Array out of
bounds.");
>        printf("%d\n",A[11]);
>        return 0;
> }
>
> In order to accomplish this object, the new LLVM pass will have to
> perform a number of actions:
> 1) Collect the allocation sizes of each array. This can be done at
> runtime, simply reading the arguments passed to the allocation
> function.
> 2) Store the size of the array together with the allocated area. This
> can also be done dynamically, increasing in 32 bits the size of the
> allocated area that is given to the array base address.
> 3) Prefix any array access code with a bounds check to ensure that the
> used index is inside the correct limits.
>
> After this, to improve the performance of the "safe programs", I
can
> remove redundant checks, by not adding new checks where already exist
> a safety check. I can use two different abstract domains to accomplish
> this goal, a simpler, and another more complex one:
>
> 1) The domain of "less-than" names. This is the domain used in
the
> ABCD array bounds check algorithm [2]. For each variable v is gives
> the set of variables that are provably less than v. It is relatively
> simple - in terms of implementation - to compute this domain; however,
> this computation is costly: it is cubic on the number of variables in
> the program. Nevertheless, I can implement the domain, to see if it
> pays off its costs. There was an implementation of ABCD in LLVM 2.7,
> and I can reuse a substantial part of that code.
>
> 2) The domain of intervals. For each variable v, this domain gives an
> interval [l, u] with the minimum and maximum values that v can assume
> during the program execution. I can use the implementation of range
> analysis that we have developed in the Federal University of Minas
> Gerais to compute this domain. The implementation is quite mature now,
> and I was one of the developers that worked on it.
>
> Linked programs
> ----------------
>
> In order to be effective, the instrumentation pass should be a whole
> program analysis. However, even in this case, only the code that has
> been compiled with the pass will be secure. It is possible that
> library calls still lead to undefined behavior, like in the program
> below:
>
> // We do not have access to this code, because it is an external library:
> void foo(int* A, int i){
>        A[i] = i;
> }
>
> // We have secured this code with our instrumentation:
> #include <stdlib.h>
> #include <stdio.h>
> #include "A.h"
> int main(int argc, char** argv){
>        int i;
>        int* A;
>        A = (int*)calloc(10,sizeof(int));
>        for(i=0;i<=11;i++){
>                foo(A,i);
>        }
>        printf("%d\n",A[11]);
>        return 0;
> }
>
> Timeline
> --------
>
> This is a three months project, and we would like to schedule the time
> in the following way (schedule given in weeks):
>
> [1-4]: Change the allocation scheme used by LLVM to store the array
> size next to the area that is allocated to the array.
>
> [5-6] Write the pass to insert access checks before array accesses.
>
> [7-10] Improve the performance of safe programs.
> - Avoid inserting bound-checks where already exists a memory check.
> - Avoid inserting bound-checks where the "less-than" domain
indicates
> that the memory access is safe.
>
> [11-12] Final tests and report
> - Correctly compile the whole LLVM test suite.
> - Generate performance numbers.
> - Write the final report.
>
>
> My background
> -------------
>
> I am currently a first year master student at the Federal University
> of Minas Gerais, under orientation of professor Fernando Pereira. I
> have graduated in this school on the year of 2011. I've worked with
> software development for five years, in different projects. I built a
> Data Warehouse viewer and Business Intelligence graphics and gauges
> using Delphi; Commercial programs in MS Visual FoxPro; Improvements in
> a logistics planning system, in Excel VBA. I have five years of
> experience with SQL. Now, I'm working with LLVM, writing passes to
> support scientifical researchers.
>
>
> Why I can do well in this project
> ---------------------------------
>
> I believe that I am a good candidate for this Summer of Code for four
> reasons. Firstly, I am a master student of computer science in the
> Federal
> University of Minas Gerais (UFMG), which is one of the best schools of
> computer science in Brazil. To support this claim, notice that UFMG
> has the maximum grade in the ranking of CAPES (Coordenação de
> Aperfeiçoamento de Pessoal de Nivel Superior - the Evaluation Body
> under the Ministry of Education). I am working in the compiler's lab,
> under the advice of Fernando Pereira, who was a Summer of Coder
> himself in 2006. Second, I have good experience with programming, with
> almost six years of experience in many languages (Java, C, C++, C#,
> VB, VBA, Object Pascal, SQL, etc...). Third, I'm already working with
> LLVM passes. I already know some LLVM tools and I already have the
> basic knowledge to start this project. I have written a pass to
> instrument LLVM bitcodes, in order to find the minimum and maximum
> values that each variable assumes at runtime. This code has been used
> to check the precision of our range analysis. It is publicly available
> at 
>
(http://code.google.com/p/range-analysis/source/browse/#svn%2Ftrunk%2Fsrc%2FRAInstrumentation).
>
> Finally, in the lab where I work there are six other people
> who work everyday with LLVM; thus, in addition to getting help in the
> forum, I can easily talk about my project to my colleagues.
>
> References
> ----------
>
> [1] Dirk Beyer, Thomas A. Henzinger, Ranjit Jhala and Rupak Majumdar.
> Checking Memory Safety with Blast. FASE 2005, LNCS 3442, pages 2-18,
> Springer-Verlag, 2005
>
> [2] ABCD: eliminating array bounds checks on demand. Rastislav Bodik
> and Rajiv Gupta and Vivek Sarkar. PLDI 2000, ACM, pages 321-333
>
>
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20120330/cc95419e/attachment.html>

John Criswell

2012-Mar-30 19:13 UTC

head link

[LLVMdev] Google Summer of Code proposal: Adding memory safety checks to the LLVM bitcodes

Dear All,

Okay, the SAFECode open projects page is done: 
http://sva.cs.illinois.edu/projects.html.

-- John T.

Raphael Ernani Rodrigues

2012-Apr-02 18:40 UTC

head link

[LLVMdev] Google Summer of Code proposal: Adding memory safety checks to the LLVM bitcodes

Dear John,

    in the last e-mail we'd exchanged I forgot to CC the llvmdev list,
so I am copying the e-mail back to the list.
> One thing I should point out is that SAFECode and SoftBound (ideally)
should be using a static array bounds checking algorithm that is sound with
respect to integer overflow.  For example, x < x+5 is always true if x is an
integer, but it's not always true if x is a 32-bit two's complement
number; additional constraints on x are needed to prove that x < x+5 is true.
>
> Does your analysis take two's complement overflow/underflow into
account?
No. The less-than lattice only reports if a variable is less than
another, assuming that your lattice is {-inf, ... -2, -1, 0, 1, 2, ...
+inf}. But I know of another lattice that takes overflows into
consideration. It is called the Pentagon domain, and it was published
by Logozzo and Fahndrich in the paper [1]:

[1] Francesco Logozzo, Manuel Fähndrich: Pentagons: A weakly
relational abstract domain for the efficient validation of array
accesses. Sci. Comput. Program. 75(9): 796-807 (2010)

Pentagons combine the less-than lattice with the interval lattice to
have an analysis that is sound in terms of wrapping arithmetics. We
have an implementation of range analysis here at UFMG. It also uses
the lattice {-inf, ... -1, 0, 1, ... +inf}, but it is safe to use
limits that are bound by values different than infinity, e.g., [10,
100], for instance. On the other hand, [10, +inf] could wrap around,
thus compromising the lower bound, and it is not safe to use it.
>
> Just to make sure I understand: you're using abstract interpretation to
map SSA values to elements in some lattice.  In your standard value range
analysis, you're assigning concrete values ranges (e.g., (4,10)) to each SSA
value.  In the "less-than" domain, you're mapping each LLVM SSA
value to a set of SSA values which are greater than or equal to the SSA value
(e.g., x -> [a, y, z] means that x <=a and x <= y and x <=z).  Is
this correct?
Yes!
>
> Interesting.  One problem with on-demand analysis (at least the way
it's done in LLVM for things like alias analysis) is that it gets tricky for
inter-procedural analyses.
>
> With the less-than domain, I suppose what you could do is have the
"user" of the analysis first specify which variable relationships they
want to query.  With that information, your 'less-than' analysis runs as
usual with the exception that it ignores variables not queried by the
"user", thus making the sets of variables smaller.  The run-time is
still cubic, but the input size is smaller.
Yes: that is the idea of doing it on demand. And the analysis must be
context-sensitive too. Venet and Brat had discussed it before in the
paper [2]. One way to get a bit of context sensitiveness is with
function inlining, that LLVM already does for us.

[2] Precise and Efficient Static Array Bound Checking for Large
Embedded C Programs, Arnaud Venet and Guillaume Brat.

Thus, to summary my GSoC project, I would implement an array bounds
check elimination engine, after the paper [1], using the Pentagon
domain, which combines the less-than lattice with the interval
lattice. The analysis would be context sensitive. I think to do all of
it in three months is too much, but I could implement the pentagon
domain, and then use function inlining to get some sort of context
sensitiveness. I would pursue this project further, as my Master
dissertation. Then I would try to implement true context
sensitiveness.

What do you think? Could I give it a try and rewrite my initial proposal?

By the way, I've been following your discussion regarding Victor's
proposal. We worked together in the implementation of the range
analysis engine.

regards,

Raphael.

Kostya Serebryany

2012-Apr-03 05:08 UTC

head link

[LLVMdev] Google Summer of Code proposal: Adding memory safety checks to the LLVM bitcodes

On Fri, Mar 30, 2012 at 11:49 AM, John Criswell <criswell at
illinois.edu>wrote:
>  On 3/30/12 1:08 PM, Raphael Ernani Rodrigues wrote:
>
> Dear LLVMers,
>
>    My name is Raphael Ernani, and I am doing my MsC at the Federal
> University of Minas Gerais, Brazil. I have been using LLVM for a
> while, and I would like to participate in this year's Summer of Code.
> One particular idea, in your "open projects" page caught my eye,
and I
> decided to write a proposal about it. The line that I liked in the
> page was "Create an LLVM pass that adds memory safety checks to code,
> like Valgrind does for binaries, or like mudflap does for gcc compiled
> code.", and my proposal is written below:
>
>
> Hi Raphael!
>
> I haven't read your proposal in detail, but there are actually three
> projects that implement memory safety by transforming LLVM bitcode:
>
> 1) ASAN is integrated into LLVM.  It is meant as a low-overhead debugging
> tool in the spirit of Valgrind.  It's primary technique is to place
guard
> memory in between memory objects and check loads and stores to see if they
> return a poisoned bit-patter from a guard.  It isn't sound (certain
> dangling pointer errors and array bound violations can get past it), but
> it's easy to implement.
>
> 2) SAFECode (my project: http://safecode.cs.illinois.edu) aims to provide
> sound memory safety checks and is a sub-project of LLVM.  It inserts
> function pointer checks, precise structure and array bounds checks,
> load/store checks, and checks on C library function calls into code.  It is
> also designed to work with automatic pool allocation to provide sound
> points-to analysis and partial type-safety, but these features are
> currently disabled because they are not sufficiently robust.  While
> initially designed to protect applications during production, SAFECode has
> a generic pass to add debug information to its run-time checks, essentially
> make it a valgrind replacement like ASAN.
>
> 3) SoftBound and its CETS extension (
> http://www.cis.upenn.edu/acg/softbound) have been integrated into the
> SAFECode compiler and can be enabled with options in SAFECode's clang. 
It
> provides array bounds checking and, with CETS, optional dangling pointer
> detection.
>
> While I don't think we need another memory safety checker for LLVM, I
do
> think that there's *a lot* of interesting projects in making each of
these
> tools better.  Some ideas that come to mind:
>
> o Creating common instrumentation passes for ASAN, SAFECode, and SoftBound
> to permit code reuse.  All three essentially place checks on loads, stores,
> and atomic operations.  If they use the same run-time check names, then the
> same optimizations can improve the speed of all three.
>
> o Creating or improving optimizations for memory safety checks.  Static
> array bounds checking alone provides numerous options, such as implementing
> ABCD, using value-range analysis (mentioned in another GSoC post),
> investigating SMT solver-based techniques, etc, etc.  Other optimizations
> include hoisting checks out of loops, polishing and re-enabling the
> SAFECode type-safety optimization, and removing redundant load/store
checks.
>
"removing redundant load/store checks" could be quite useful for both
AddressSanitizer (http://clang.llvm.org/docs/AddressSanitizer.html) and
ThreadSanitizer (
http://code.google.com/p/data-race-test/wiki/ThreadSanitizer2).
Some basic examples here:
http://code.google.com/p/address-sanitizer/wiki/CompileTimeOptimizations
I am looking into having some simple and human verifiable analysis with
lots of unit tests,
rather than something heavyweight and exhaustive.

--kcc


>
> o Polishing and enabling the Baggy Bounds Checking (BBC) run-time.  This
> run-time can speed up the run-time checks and should be faster than the
> splay-tree approach that SAFECode currently uses.
>
> I'll go ahead and create a list of open projects on the SAFECode web
page (
> http://sva.cs.illinois.edu), and I'll remove the "Valgrind"
tool project
> off the open projects page as there are now several memory safety tools
> built using LLVM.  If you want my opinion, I think the static array bounds
> checker or the monotonic loop optimization make nice, self-contained
> projects.
>
> Finally, you might be interested in the Memory Safety Menagerie (
> http://sva.cs.illinois.edu/menagerie/).  This web page contains a whole
> list of papers on the subject of memory safety transforms.
>
> -- John T.
>
>
>
>  ===============================================> Adding memory safety
checks to the LLVM bitcodes
> ===============================================>
>
> Objective
> ---------
>
> The objective of this project is to create an LLVM pass that adds
> memory safety checks to code, like Valgrind does for binaries, or like
> mudflap does for gcc compiled code.
>
>
> How it will be done
> -------------------
>
> We will achieve our goal by creating a LLVM pass that does the following:
> 1 - Propagate the array size to the sites in the program where the
> array is accessed.
> 2 - Insert code in the program to make a bound-check and throw an
> exception if the program tries to access a position out of the array
> bounds.
> 3 - Eliminate redundant bound-checks.
>
>
> Background
> ----------
>
> Invalid memory access is a major source of program failures. If a
> program statement dereferences a pointer that points to an invalid
> memory cell, the program is either aborted by the operating system or,
> often worse, the program continues to run with an undefined behavior.
> To avoid the latter, one can perform checks before every memory access
> at run time. For some programming languages (e.g., Java) this is done
> automatically by the compiler/run-time environment. For the language
> C, neither the compiler nor the run-time environment enforces
> memory-safety policies [1].
>
> An example of access violation is given by the program below. When
> this program runs, probably no visible error happens. However, the
> program violates the semantics of well defined C code: it is accessing
> the 11-th cell of an array that was declared with only 10 cells. Thus,
> the behavior of this program is undefined.
>
> #include <stdlib.h>
> #include <stdio.h>
> int main(int argc, char** argv){
>        int i;
>        int* A;
>        A = (int*)calloc(10,sizeof(int));
>        for(i=0;i<=11;i++)
>                A[i] = i;
>        printf("%d\n",A[11]);
>        return 0;
> }
>
> This type of behavior can be the source of some very hard-to-find
> bugs. If no explicit error occurs, other variables can have their
> values changed without any reasonable explanation. Finding and fixing
> this type of bug can be really hard, especially if the program runs
> with more than one active thread. There exists tools that have been
> built specially to track access violations like the one in the program
> above. Valgrind is probably the best known example. These tools are
> dynamic: they are used during the execution of the program, either
> emulating it, or instrumenting it to catch errors at runtime. Tools
> like Valgrind and mudflap are very useful; however, LLVM does not have
> any similar function yet. Our goal is to help filling this gap.
>
>
> How we plan to improve the code
> -------------------------------
>
> Before every read or write in a position of an array, our pass will
> add instructions to check if the index of the array is outside the
> bounds.
> The secured program, without optimizations, is shown below:
>
> #include <stdlib.h>
> #include <stdio.h>
> int main(int argc, char** argv){
>        int i, tmp0;
>        int* A;
>        A = (int*)calloc(10,sizeof(int));
>        for(i=0;i<=11;i++){
>                if (i<0 or i>10) raise("Array out of
bounds.");
>                A[i] = i;
>        }
>        tmp0 = 11;
>        if (tmp0<0 or tmp0>10) raise("Array out of
bounds.");
>        printf("%d\n",A[11]);
>        return 0;
> }
>
> In order to accomplish this object, the new LLVM pass will have to
> perform a number of actions:
> 1) Collect the allocation sizes of each array. This can be done at
> runtime, simply reading the arguments passed to the allocation
> function.
> 2) Store the size of the array together with the allocated area. This
> can also be done dynamically, increasing in 32 bits the size of the
> allocated area that is given to the array base address.
> 3) Prefix any array access code with a bounds check to ensure that the
> used index is inside the correct limits.
>
> After this, to improve the performance of the "safe programs", I
can
> remove redundant checks, by not adding new checks where already exist
> a safety check. I can use two different abstract domains to accomplish
> this goal, a simpler, and another more complex one:
>
> 1) The domain of "less-than" names. This is the domain used in
the
> ABCD array bounds check algorithm [2]. For each variable v is gives
> the set of variables that are provably less than v. It is relatively
> simple - in terms of implementation - to compute this domain; however,
> this computation is costly: it is cubic on the number of variables in
> the program. Nevertheless, I can implement the domain, to see if it
> pays off its costs. There was an implementation of ABCD in LLVM 2.7,
> and I can reuse a substantial part of that code.
>
> 2) The domain of intervals. For each variable v, this domain gives an
> interval [l, u] with the minimum and maximum values that v can assume
> during the program execution. I can use the implementation of range
> analysis that we have developed in the Federal University of Minas
> Gerais to compute this domain. The implementation is quite mature now,
> and I was one of the developers that worked on it.
>
> Linked programs
> ----------------
>
> In order to be effective, the instrumentation pass should be a whole
> program analysis. However, even in this case, only the code that has
> been compiled with the pass will be secure. It is possible that
> library calls still lead to undefined behavior, like in the program
> below:
>
> // We do not have access to this code, because it is an external library:
> void foo(int* A, int i){
>        A[i] = i;
> }
>
> // We have secured this code with our instrumentation:
> #include <stdlib.h>
> #include <stdio.h>
> #include "A.h"
> int main(int argc, char** argv){
>        int i;
>        int* A;
>        A = (int*)calloc(10,sizeof(int));
>        for(i=0;i<=11;i++){
>                foo(A,i);
>        }
>        printf("%d\n",A[11]);
>        return 0;
> }
>
> Timeline
> --------
>
> This is a three months project, and we would like to schedule the time
> in the following way (schedule given in weeks):
>
> [1-4]: Change the allocation scheme used by LLVM to store the array
> size next to the area that is allocated to the array.
>
> [5-6] Write the pass to insert access checks before array accesses.
>
> [7-10] Improve the performance of safe programs.
> - Avoid inserting bound-checks where already exists a memory check.
> - Avoid inserting bound-checks where the "less-than" domain
indicates
> that the memory access is safe.
>
> [11-12] Final tests and report
> - Correctly compile the whole LLVM test suite.
> - Generate performance numbers.
> - Write the final report.
>
>
> My background
> -------------
>
> I am currently a first year master student at the Federal University
> of Minas Gerais, under orientation of professor Fernando Pereira. I
> have graduated in this school on the year of 2011. I've worked with
> software development for five years, in different projects. I built a
> Data Warehouse viewer and Business Intelligence graphics and gauges
> using Delphi; Commercial programs in MS Visual FoxPro; Improvements in
> a logistics planning system, in Excel VBA. I have five years of
> experience with SQL. Now, I'm working with LLVM, writing passes to
> support scientifical researchers.
>
>
> Why I can do well in this project
> ---------------------------------
>
> I believe that I am a good candidate for this Summer of Code for four
> reasons. Firstly, I am a master student of computer science in the
> Federal
> University of Minas Gerais (UFMG), which is one of the best schools of
> computer science in Brazil. To support this claim, notice that UFMG
> has the maximum grade in the ranking of CAPES (Coordenação de
> Aperfeiçoamento de Pessoal de Nivel Superior - the Evaluation Body
> under the Ministry of Education). I am working in the compiler's lab,
> under the advice of Fernando Pereira, who was a Summer of Coder
> himself in 2006. Second, I have good experience with programming, with
> almost six years of experience in many languages (Java, C, C++, C#,
> VB, VBA, Object Pascal, SQL, etc...). Third, I'm already working with
> LLVM passes. I already know some LLVM tools and I already have the
> basic knowledge to start this project. I have written a pass to
> instrument LLVM bitcodes, in order to find the minimum and maximum
> values that each variable assumes at runtime. This code has been used
> to check the precision of our range analysis. It is publicly available
> at (
>
http://code.google.com/p/range-analysis/source/browse/#svn%2Ftrunk%2Fsrc%2FRAInstrumentation
> ).
> Finally, in the lab where I work there are six other people
> who work everyday with LLVM; thus, in addition to getting help in the
> forum, I can easily talk about my project to my colleagues.
>
> References
> ----------
>
> [1] Dirk Beyer, Thomas A. Henzinger, Ranjit Jhala and Rupak Majumdar.
> Checking Memory Safety with Blast. FASE 2005, LNCS 3442, pages 2-18,
> Springer-Verlag, 2005
>
> [2] ABCD: eliminating array bounds checks on demand. Rastislav Bodik
> and Rajiv Gupta and Vivek Sarkar. PLDI 2000, ACM, pages 321-333
>
>
> _______________________________________________
> LLVM Developers mailing listLLVMdev at cs.uiuc.edu        
http://llvm.cs.uiuc.eduhttp://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>
>
>
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20120402/43a75f82/attachment.html>

Possibly Parallel Threads

Search for more reasonably related threads

llvm dev - Apr 2012 - [LLVMdev] Google Summer of Code proposal: Adding memory safety checks to the LLVM bitcodes

[LLVMdev] Google Summer of Code proposal: Adding memory safety checks to the LLVM bitcodes

[LLVMdev] Google Summer of Code proposal: Adding memory safety checks to the LLVM bitcodes

[LLVMdev] Google Summer of Code proposal: Adding memory safety checks to the LLVM bitcodes

[LLVMdev] Google Summer of Code proposal: Adding memory safety checks to the LLVM bitcodes

[LLVMdev] Google Summer of Code proposal: Adding memory safety checks to the LLVM bitcodes

Possibly Parallel Threads