llvm dev - Apr 2012 - [LLVMdev] Google Summer of Code proposal: Adding memory safety checks to the LLVM bitcodes

On Fri, Mar 30, 2012 at 11:49 AM, John Criswell <criswell at
illinois.edu>wrote:
>  On 3/30/12 1:08 PM, Raphael Ernani Rodrigues wrote:
>
> Dear LLVMers,
>
>    My name is Raphael Ernani, and I am doing my MsC at the Federal
> University of Minas Gerais, Brazil. I have been using LLVM for a
> while, and I would like to participate in this year's Summer of Code.
> One particular idea, in your "open projects" page caught my eye,
and I
> decided to write a proposal about it. The line that I liked in the
> page was "Create an LLVM pass that adds memory safety checks to code,
> like Valgrind does for binaries, or like mudflap does for gcc compiled
> code.", and my proposal is written below:
>
>
> Hi Raphael!
>
> I haven't read your proposal in detail, but there are actually three
> projects that implement memory safety by transforming LLVM bitcode:
>
> 1) ASAN is integrated into LLVM.  It is meant as a low-overhead debugging
> tool in the spirit of Valgrind.  It's primary technique is to place
guard
> memory in between memory objects and check loads and stores to see if they
> return a poisoned bit-patter from a guard.  It isn't sound (certain
> dangling pointer errors and array bound violations can get past it), but
> it's easy to implement.
>
> 2) SAFECode (my project: safecode.cs.illinois.edu) aims to provide
> sound memory safety checks and is a sub-project of LLVM.  It inserts
> function pointer checks, precise structure and array bounds checks,
> load/store checks, and checks on C library function calls into code.  It is
> also designed to work with automatic pool allocation to provide sound
> points-to analysis and partial type-safety, but these features are
> currently disabled because they are not sufficiently robust.  While
> initially designed to protect applications during production, SAFECode has
> a generic pass to add debug information to its run-time checks, essentially
> make it a valgrind replacement like ASAN.
>
> 3) SoftBound and its CETS extension (
> cis.upenn.edu/acg/softbound) have been integrated into the
> SAFECode compiler and can be enabled with options in SAFECode's clang. 
It
> provides array bounds checking and, with CETS, optional dangling pointer
> detection.
>
> While I don't think we need another memory safety checker for LLVM, I
do
> think that there's *a lot* of interesting projects in making each of
these
> tools better.  Some ideas that come to mind:
>
> o Creating common instrumentation passes for ASAN, SAFECode, and SoftBound
> to permit code reuse.  All three essentially place checks on loads, stores,
> and atomic operations.  If they use the same run-time check names, then the
> same optimizations can improve the speed of all three.
>
> o Creating or improving optimizations for memory safety checks.  Static
> array bounds checking alone provides numerous options, such as implementing
> ABCD, using value-range analysis (mentioned in another GSoC post),
> investigating SMT solver-based techniques, etc, etc.  Other optimizations
> include hoisting checks out of loops, polishing and re-enabling the
> SAFECode type-safety optimization, and removing redundant load/store
checks.
>
"removing redundant load/store checks" could be quite useful for both
AddressSanitizer (clang.llvm.org/docs/AddressSanitizer.html) and
ThreadSanitizer (
code.google.com/p/data-race-test/wiki/ThreadSanitizer2).
Some basic examples here:
code.google.com/p/address-sanitizer/wiki/CompileTimeOptimizations
I am looking into having some simple and human verifiable analysis with
lots of unit tests,
rather than something heavyweight and exhaustive.

--kcc


>
> o Polishing and enabling the Baggy Bounds Checking (BBC) run-time.  This
> run-time can speed up the run-time checks and should be faster than the
> splay-tree approach that SAFECode currently uses.
>
> I'll go ahead and create a list of open projects on the SAFECode web
page (
> sva.cs.illinois.edu), and I'll remove the "Valgrind"
tool project
> off the open projects page as there are now several memory safety tools
> built using LLVM.  If you want my opinion, I think the static array bounds
> checker or the monotonic loop optimization make nice, self-contained
> projects.
>
> Finally, you might be interested in the Memory Safety Menagerie (
> sva.cs.illinois.edu/menagerie).  This web page contains a whole
> list of papers on the subject of memory safety transforms.
>
> -- John T.
>
>
>
>  ===============================================> Adding memory safety
checks to the LLVM bitcodes
> ===============================================>
>
> Objective
> ---------
>
> The objective of this project is to create an LLVM pass that adds
> memory safety checks to code, like Valgrind does for binaries, or like
> mudflap does for gcc compiled code.
>
>
> How it will be done
> -------------------
>
> We will achieve our goal by creating a LLVM pass that does the following:
> 1 - Propagate the array size to the sites in the program where the
> array is accessed.
> 2 - Insert code in the program to make a bound-check and throw an
> exception if the program tries to access a position out of the array
> bounds.
> 3 - Eliminate redundant bound-checks.
>
>
> Background
> ----------
>
> Invalid memory access is a major source of program failures. If a
> program statement dereferences a pointer that points to an invalid
> memory cell, the program is either aborted by the operating system or,
> often worse, the program continues to run with an undefined behavior.
> To avoid the latter, one can perform checks before every memory access
> at run time. For some programming languages (e.g., Java) this is done
> automatically by the compiler/run-time environment. For the language
> C, neither the compiler nor the run-time environment enforces
> memory-safety policies [1].
>
> An example of access violation is given by the program below. When
> this program runs, probably no visible error happens. However, the
> program violates the semantics of well defined C code: it is accessing
> the 11-th cell of an array that was declared with only 10 cells. Thus,
> the behavior of this program is undefined.
>
> #include <stdlib.h>
> #include <stdio.h>
> int main(int argc, char** argv){
>        int i;
>        int* A;
>        A = (int*)calloc(10,sizeof(int));
>        for(i=0;i<=11;i++)
>                A[i] = i;
>        printf("%d\n",A[11]);
>        return 0;
> }
>
> This type of behavior can be the source of some very hard-to-find
> bugs. If no explicit error occurs, other variables can have their
> values changed without any reasonable explanation. Finding and fixing
> this type of bug can be really hard, especially if the program runs
> with more than one active thread. There exists tools that have been
> built specially to track access violations like the one in the program
> above. Valgrind is probably the best known example. These tools are
> dynamic: they are used during the execution of the program, either
> emulating it, or instrumenting it to catch errors at runtime. Tools
> like Valgrind and mudflap are very useful; however, LLVM does not have
> any similar function yet. Our goal is to help filling this gap.
>
>
> How we plan to improve the code
> -------------------------------
>
> Before every read or write in a position of an array, our pass will
> add instructions to check if the index of the array is outside the
> bounds.
> The secured program, without optimizations, is shown below:
>
> #include <stdlib.h>
> #include <stdio.h>
> int main(int argc, char** argv){
>        int i, tmp0;
>        int* A;
>        A = (int*)calloc(10,sizeof(int));
>        for(i=0;i<=11;i++){
>                if (i<0 or i>10) raise("Array out of
bounds.");
>                A[i] = i;
>        }
>        tmp0 = 11;
>        if (tmp0<0 or tmp0>10) raise("Array out of
bounds.");
>        printf("%d\n",A[11]);
>        return 0;
> }
>
> In order to accomplish this object, the new LLVM pass will have to
> perform a number of actions:
> 1) Collect the allocation sizes of each array. This can be done at
> runtime, simply reading the arguments passed to the allocation
> function.
> 2) Store the size of the array together with the allocated area. This
> can also be done dynamically, increasing in 32 bits the size of the
> allocated area that is given to the array base address.
> 3) Prefix any array access code with a bounds check to ensure that the
> used index is inside the correct limits.
>
> After this, to improve the performance of the "safe programs", I
can
> remove redundant checks, by not adding new checks where already exist
> a safety check. I can use two different abstract domains to accomplish
> this goal, a simpler, and another more complex one:
>
> 1) The domain of "less-than" names. This is the domain used in
the
> ABCD array bounds check algorithm [2]. For each variable v is gives
> the set of variables that are provably less than v. It is relatively
> simple - in terms of implementation - to compute this domain; however,
> this computation is costly: it is cubic on the number of variables in
> the program. Nevertheless, I can implement the domain, to see if it
> pays off its costs. There was an implementation of ABCD in LLVM 2.7,
> and I can reuse a substantial part of that code.
>
> 2) The domain of intervals. For each variable v, this domain gives an
> interval [l, u] with the minimum and maximum values that v can assume
> during the program execution. I can use the implementation of range
> analysis that we have developed in the Federal University of Minas
> Gerais to compute this domain. The implementation is quite mature now,
> and I was one of the developers that worked on it.
>
> Linked programs
> ----------------
>
> In order to be effective, the instrumentation pass should be a whole
> program analysis. However, even in this case, only the code that has
> been compiled with the pass will be secure. It is possible that
> library calls still lead to undefined behavior, like in the program
> below:
>
> // We do not have access to this code, because it is an external library:
> void foo(int* A, int i){
>        A[i] = i;
> }
>
> // We have secured this code with our instrumentation:
> #include <stdlib.h>
> #include <stdio.h>
> #include "A.h"
> int main(int argc, char** argv){
>        int i;
>        int* A;
>        A = (int*)calloc(10,sizeof(int));
>        for(i=0;i<=11;i++){
>                foo(A,i);
>        }
>        printf("%d\n",A[11]);
>        return 0;
> }
>
> Timeline
> --------
>
> This is a three months project, and we would like to schedule the time
> in the following way (schedule given in weeks):
>
> [1-4]: Change the allocation scheme used by LLVM to store the array
> size next to the area that is allocated to the array.
>
> [5-6] Write the pass to insert access checks before array accesses.
>
> [7-10] Improve the performance of safe programs.
> - Avoid inserting bound-checks where already exists a memory check.
> - Avoid inserting bound-checks where the "less-than" domain
indicates
> that the memory access is safe.
>
> [11-12] Final tests and report
> - Correctly compile the whole LLVM test suite.
> - Generate performance numbers.
> - Write the final report.
>
>
> My background
> -------------
>
> I am currently a first year master student at the Federal University
> of Minas Gerais, under orientation of professor Fernando Pereira. I
> have graduated in this school on the year of 2011. I've worked with
> software development for five years, in different projects. I built a
> Data Warehouse viewer and Business Intelligence graphics and gauges
> using Delphi; Commercial programs in MS Visual FoxPro; Improvements in
> a logistics planning system, in Excel VBA. I have five years of
> experience with SQL. Now, I'm working with LLVM, writing passes to
> support scientifical researchers.
>
>
> Why I can do well in this project
> ---------------------------------
>
> I believe that I am a good candidate for this Summer of Code for four
> reasons. Firstly, I am a master student of computer science in the
> Federal
> University of Minas Gerais (UFMG), which is one of the best schools of
> computer science in Brazil. To support this claim, notice that UFMG
> has the maximum grade in the ranking of CAPES (Coordenação de
> Aperfeiçoamento de Pessoal de Nivel Superior - the Evaluation Body
> under the Ministry of Education). I am working in the compiler's lab,
> under the advice of Fernando Pereira, who was a Summer of Coder
> himself in 2006. Second, I have good experience with programming, with
> almost six years of experience in many languages (Java, C, C++, C#,
> VB, VBA, Object Pascal, SQL, etc...). Third, I'm already working with
> LLVM passes. I already know some LLVM tools and I already have the
> basic knowledge to start this project. I have written a pass to
> instrument LLVM bitcodes, in order to find the minimum and maximum
> values that each variable assumes at runtime. This code has been used
> to check the precision of our range analysis. It is publicly available
> at (
>
code.google.com/p/range-analysis/source/browse/#svn/trunk/src/RAInstrumentation
> ).
> Finally, in the lab where I work there are six other people
> who work everyday with LLVM; thus, in addition to getting help in the
> forum, I can easily talk about my project to my colleagues.
>
> References
> ----------
>
> [1] Dirk Beyer, Thomas A. Henzinger, Ranjit Jhala and Rupak Majumdar.
> Checking Memory Safety with Blast. FASE 2005, LNCS 3442, pages 2-18,
> Springer-Verlag, 2005
>
> [2] ABCD: eliminating array bounds checks on demand. Rastislav Bodik
> and Rajiv Gupta and Vivek Sarkar. PLDI 2000, ACM, pages 321-333
>
>
> _______________________________________________
> LLVM Developers mailing listLLVMdev at cs.uiuc.edu        
llvm.cs.uiuc.eduhttp:lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>
>
>
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu         llvm.cs.uiuc.edu
> lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<lists.llvm.org/pipermail/llvm-dev/attachments/20120402/43a75f82/attachment.html>

llvm dev - Apr 2012 - [LLVMdev] Google Summer of Code proposal: Adding memory safety checks to the LLVM bitcodes

[LLVMdev] Google Summer of Code proposal: Adding memory safety checks to the LLVM bitcodes

[LLVMdev] Google Summer of Code proposal: Adding memory safety checks to the LLVM bitcodes

[LLVMdev] Google Summer of Code proposal: Adding memory safety checks to the LLVM bitcodes

[LLVMdev] Google Summer of Code proposal: Adding memory safety checks to the LLVM bitcodes

Apparently Analagous Threads