Kostya Serebryany
2012-Apr-03 05:08 UTC
[LLVMdev] Google Summer of Code proposal: Adding memory safety checks to the LLVM bitcodes
On Fri, Mar 30, 2012 at 11:49 AM, John Criswell <criswell at illinois.edu>wrote:> On 3/30/12 1:08 PM, Raphael Ernani Rodrigues wrote: > > Dear LLVMers, > > My name is Raphael Ernani, and I am doing my MsC at the Federal > University of Minas Gerais, Brazil. I have been using LLVM for a > while, and I would like to participate in this year's Summer of Code. > One particular idea, in your "open projects" page caught my eye, and I > decided to write a proposal about it. The line that I liked in the > page was "Create an LLVM pass that adds memory safety checks to code, > like Valgrind does for binaries, or like mudflap does for gcc compiled > code.", and my proposal is written below: > > > Hi Raphael! > > I haven't read your proposal in detail, but there are actually three > projects that implement memory safety by transforming LLVM bitcode: > > 1) ASAN is integrated into LLVM. It is meant as a low-overhead debugging > tool in the spirit of Valgrind. It's primary technique is to place guard > memory in between memory objects and check loads and stores to see if they > return a poisoned bit-patter from a guard. It isn't sound (certain > dangling pointer errors and array bound violations can get past it), but > it's easy to implement. > > 2) SAFECode (my project: http://safecode.cs.illinois.edu) aims to provide > sound memory safety checks and is a sub-project of LLVM. It inserts > function pointer checks, precise structure and array bounds checks, > load/store checks, and checks on C library function calls into code. It is > also designed to work with automatic pool allocation to provide sound > points-to analysis and partial type-safety, but these features are > currently disabled because they are not sufficiently robust. While > initially designed to protect applications during production, SAFECode has > a generic pass to add debug information to its run-time checks, essentially > make it a valgrind replacement like ASAN. > > 3) SoftBound and its CETS extension ( > http://www.cis.upenn.edu/acg/softbound) have been integrated into the > SAFECode compiler and can be enabled with options in SAFECode's clang. It > provides array bounds checking and, with CETS, optional dangling pointer > detection. > > While I don't think we need another memory safety checker for LLVM, I do > think that there's *a lot* of interesting projects in making each of these > tools better. Some ideas that come to mind: > > o Creating common instrumentation passes for ASAN, SAFECode, and SoftBound > to permit code reuse. All three essentially place checks on loads, stores, > and atomic operations. If they use the same run-time check names, then the > same optimizations can improve the speed of all three. > > o Creating or improving optimizations for memory safety checks. Static > array bounds checking alone provides numerous options, such as implementing > ABCD, using value-range analysis (mentioned in another GSoC post), > investigating SMT solver-based techniques, etc, etc. Other optimizations > include hoisting checks out of loops, polishing and re-enabling the > SAFECode type-safety optimization, and removing redundant load/store checks. >"removing redundant load/store checks" could be quite useful for both AddressSanitizer (http://clang.llvm.org/docs/AddressSanitizer.html) and ThreadSanitizer ( http://code.google.com/p/data-race-test/wiki/ThreadSanitizer2). Some basic examples here: http://code.google.com/p/address-sanitizer/wiki/CompileTimeOptimizations I am looking into having some simple and human verifiable analysis with lots of unit tests, rather than something heavyweight and exhaustive. --kcc> > o Polishing and enabling the Baggy Bounds Checking (BBC) run-time. This > run-time can speed up the run-time checks and should be faster than the > splay-tree approach that SAFECode currently uses. > > I'll go ahead and create a list of open projects on the SAFECode web page ( > http://sva.cs.illinois.edu), and I'll remove the "Valgrind" tool project > off the open projects page as there are now several memory safety tools > built using LLVM. If you want my opinion, I think the static array bounds > checker or the monotonic loop optimization make nice, self-contained > projects. > > Finally, you might be interested in the Memory Safety Menagerie ( > http://sva.cs.illinois.edu/menagerie/). This web page contains a whole > list of papers on the subject of memory safety transforms. > > -- John T. > > > > ===============================================> Adding memory safety checks to the LLVM bitcodes > ===============================================> > > Objective > --------- > > The objective of this project is to create an LLVM pass that adds > memory safety checks to code, like Valgrind does for binaries, or like > mudflap does for gcc compiled code. > > > How it will be done > ------------------- > > We will achieve our goal by creating a LLVM pass that does the following: > 1 - Propagate the array size to the sites in the program where the > array is accessed. > 2 - Insert code in the program to make a bound-check and throw an > exception if the program tries to access a position out of the array > bounds. > 3 - Eliminate redundant bound-checks. > > > Background > ---------- > > Invalid memory access is a major source of program failures. If a > program statement dereferences a pointer that points to an invalid > memory cell, the program is either aborted by the operating system or, > often worse, the program continues to run with an undefined behavior. > To avoid the latter, one can perform checks before every memory access > at run time. For some programming languages (e.g., Java) this is done > automatically by the compiler/run-time environment. For the language > C, neither the compiler nor the run-time environment enforces > memory-safety policies [1]. > > An example of access violation is given by the program below. When > this program runs, probably no visible error happens. However, the > program violates the semantics of well defined C code: it is accessing > the 11-th cell of an array that was declared with only 10 cells. Thus, > the behavior of this program is undefined. > > #include <stdlib.h> > #include <stdio.h> > int main(int argc, char** argv){ > int i; > int* A; > A = (int*)calloc(10,sizeof(int)); > for(i=0;i<=11;i++) > A[i] = i; > printf("%d\n",A[11]); > return 0; > } > > This type of behavior can be the source of some very hard-to-find > bugs. If no explicit error occurs, other variables can have their > values changed without any reasonable explanation. Finding and fixing > this type of bug can be really hard, especially if the program runs > with more than one active thread. There exists tools that have been > built specially to track access violations like the one in the program > above. Valgrind is probably the best known example. These tools are > dynamic: they are used during the execution of the program, either > emulating it, or instrumenting it to catch errors at runtime. Tools > like Valgrind and mudflap are very useful; however, LLVM does not have > any similar function yet. Our goal is to help filling this gap. > > > How we plan to improve the code > ------------------------------- > > Before every read or write in a position of an array, our pass will > add instructions to check if the index of the array is outside the > bounds. > The secured program, without optimizations, is shown below: > > #include <stdlib.h> > #include <stdio.h> > int main(int argc, char** argv){ > int i, tmp0; > int* A; > A = (int*)calloc(10,sizeof(int)); > for(i=0;i<=11;i++){ > if (i<0 or i>10) raise("Array out of bounds."); > A[i] = i; > } > tmp0 = 11; > if (tmp0<0 or tmp0>10) raise("Array out of bounds."); > printf("%d\n",A[11]); > return 0; > } > > In order to accomplish this object, the new LLVM pass will have to > perform a number of actions: > 1) Collect the allocation sizes of each array. This can be done at > runtime, simply reading the arguments passed to the allocation > function. > 2) Store the size of the array together with the allocated area. This > can also be done dynamically, increasing in 32 bits the size of the > allocated area that is given to the array base address. > 3) Prefix any array access code with a bounds check to ensure that the > used index is inside the correct limits. > > After this, to improve the performance of the "safe programs", I can > remove redundant checks, by not adding new checks where already exist > a safety check. I can use two different abstract domains to accomplish > this goal, a simpler, and another more complex one: > > 1) The domain of "less-than" names. This is the domain used in the > ABCD array bounds check algorithm [2]. For each variable v is gives > the set of variables that are provably less than v. It is relatively > simple - in terms of implementation - to compute this domain; however, > this computation is costly: it is cubic on the number of variables in > the program. Nevertheless, I can implement the domain, to see if it > pays off its costs. There was an implementation of ABCD in LLVM 2.7, > and I can reuse a substantial part of that code. > > 2) The domain of intervals. For each variable v, this domain gives an > interval [l, u] with the minimum and maximum values that v can assume > during the program execution. I can use the implementation of range > analysis that we have developed in the Federal University of Minas > Gerais to compute this domain. The implementation is quite mature now, > and I was one of the developers that worked on it. > > Linked programs > ---------------- > > In order to be effective, the instrumentation pass should be a whole > program analysis. However, even in this case, only the code that has > been compiled with the pass will be secure. It is possible that > library calls still lead to undefined behavior, like in the program > below: > > // We do not have access to this code, because it is an external library: > void foo(int* A, int i){ > A[i] = i; > } > > // We have secured this code with our instrumentation: > #include <stdlib.h> > #include <stdio.h> > #include "A.h" > int main(int argc, char** argv){ > int i; > int* A; > A = (int*)calloc(10,sizeof(int)); > for(i=0;i<=11;i++){ > foo(A,i); > } > printf("%d\n",A[11]); > return 0; > } > > Timeline > -------- > > This is a three months project, and we would like to schedule the time > in the following way (schedule given in weeks): > > [1-4]: Change the allocation scheme used by LLVM to store the array > size next to the area that is allocated to the array. > > [5-6] Write the pass to insert access checks before array accesses. > > [7-10] Improve the performance of safe programs. > - Avoid inserting bound-checks where already exists a memory check. > - Avoid inserting bound-checks where the "less-than" domain indicates > that the memory access is safe. > > [11-12] Final tests and report > - Correctly compile the whole LLVM test suite. > - Generate performance numbers. > - Write the final report. > > > My background > ------------- > > I am currently a first year master student at the Federal University > of Minas Gerais, under orientation of professor Fernando Pereira. I > have graduated in this school on the year of 2011. I've worked with > software development for five years, in different projects. I built a > Data Warehouse viewer and Business Intelligence graphics and gauges > using Delphi; Commercial programs in MS Visual FoxPro; Improvements in > a logistics planning system, in Excel VBA. I have five years of > experience with SQL. Now, I'm working with LLVM, writing passes to > support scientifical researchers. > > > Why I can do well in this project > --------------------------------- > > I believe that I am a good candidate for this Summer of Code for four > reasons. Firstly, I am a master student of computer science in the > Federal > University of Minas Gerais (UFMG), which is one of the best schools of > computer science in Brazil. To support this claim, notice that UFMG > has the maximum grade in the ranking of CAPES (Coordenação de > Aperfeiçoamento de Pessoal de Nivel Superior - the Evaluation Body > under the Ministry of Education). I am working in the compiler's lab, > under the advice of Fernando Pereira, who was a Summer of Coder > himself in 2006. Second, I have good experience with programming, with > almost six years of experience in many languages (Java, C, C++, C#, > VB, VBA, Object Pascal, SQL, etc...). Third, I'm already working with > LLVM passes. I already know some LLVM tools and I already have the > basic knowledge to start this project. I have written a pass to > instrument LLVM bitcodes, in order to find the minimum and maximum > values that each variable assumes at runtime. This code has been used > to check the precision of our range analysis. It is publicly available > at ( > http://code.google.com/p/range-analysis/source/browse/#svn%2Ftrunk%2Fsrc%2FRAInstrumentation > ). > Finally, in the lab where I work there are six other people > who work everyday with LLVM; thus, in addition to getting help in the > forum, I can easily talk about my project to my colleagues. > > References > ---------- > > [1] Dirk Beyer, Thomas A. Henzinger, Ranjit Jhala and Rupak Majumdar. > Checking Memory Safety with Blast. FASE 2005, LNCS 3442, pages 2-18, > Springer-Verlag, 2005 > > [2] ABCD: eliminating array bounds checks on demand. Rastislav Bodik > and Rajiv Gupta and Vivek Sarkar. PLDI 2000, ACM, pages 321-333 > > > _______________________________________________ > LLVM Developers mailing listLLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.eduhttp://lists.cs.uiuc.edu/mailman/listinfo/llvmdev > > > > _______________________________________________ > LLVM Developers mailing list > LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu > http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev > >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20120402/43a75f82/attachment.html>
Raphael Ernani Rodrigues
2012-Apr-03 14:54 UTC
[LLVMdev] Google Summer of Code proposal: Adding memory safety checks to the LLVM bitcodes
Dear LLVMers, I wrote a new proposal, to improve the static array bounds checking in SAFEcode, as follows: Improving static array bounds checking in SAFEcode ================================================= Objective --------- the main objective of this project is to improve the static array bounds checking engine used in SAFECode. It was written after the open project at http://safecode.cs.illinois.edu/projects.html, which reads as follows: "Improve static array bounds checking: SAFECode used to have an inter-procedural static array bounds checking pass, but it suffered serious scalability issues and was eventually disabled. Improving the old code or writing a new pass would be extremely beneficial." I plan to build a new array bounds checking algorithm from scratch. How it will be done ------------------- I will build an array bounds checking algorithm based on the ABCD algorithm that was described by Bodik in the paper [3]. The key insight in the ABCD was to use a lattice that some researchers, e.g., [2], call the "less-than" lattice. Instead of determining precise ranges for the variables using for instance, the interval lattice, Bodik's less-than lattice finds, for each variable v, the set of other variables that are less-than it. It is a requirement that the algorithm must be relient against the wrapping semantics of 2-complement's arithmetic. The original algorithm proposed by Bodik is not. However, it is possible to limit Bodik's algorithm in such a way that only overflow resilient equalities are taken into consideration. In other words, if the source program contains a comparison such as: if (x < y) { // p1 } else { // p2 } then we are guaranteed that inside p1 we have that x < y. I propose an analysis that, contrary to the original ABCD algorithm is inter-procedural. To obtain context sensitiveness, which is very important, I will initially use function inlining, which is available in LLVM. If time allows it, I will implement true context sensitiveness using an "on-demand" approach on the call graph of functions. Context sensitiveness is remarkably important in this type of domain because it is necessary to match array sizes and identifiers across function calls in order to have a precise analysis, as described in the paper [1]. I do not plan to couple my analysis with the interval domain, but, if the LLVM community judge that it should be done, than I can implement my algorithm after the pentagon domain, proposed by Logozzo and Fahndrich in 2010 [2]. The pentagon domain is a different approach to get an analysis that is overflow resilient. It combines the interval domain {-inf, ..., -1, 0, 1, ... +inf} with the "less-than" domain. If some variable is bound to an interval like [l, u], where l -inf and u < +inf, then we know that operations that add up to MAX_INT - u to it will not produce overflow. Same for the lower limit. I am reticent to use the pentagon domain because Logozzo et al report some relatively long runtimes (in the order of minutes) and so this approach might not attack the scalability issue that the SAFECode maintainers would like to handle. Background ---------- Invalid memory access is a major source of program failures. If a program statement dereferences a pointer that points to an invalid memory cell, the program is either aborted by the operating system or, often worse, the program continues to run with an undefined behavior. To avoid the latter, one can perform checks before every memory access at run time. For some programming languages (e.g., Java) this is done automatically by the compiler/run-time environment. For the language C, neither the compiler nor the run-time environment enforces memory-safety policies [1]. An example of access violation is given by the program below. When this program runs, probably no visible error happens. However, the program violates the semantics of well defined C code: it is accessing the 11-th cell of an array that was declared with only 10 cells. Thus, the behavior of this program is undefined. #include <stdlib.h> #include <stdio.h> int main(int argc, char** argv){ int i; int* A; A = (int*)calloc(10,sizeof(int)); for(i=0;i<=11;i++) A[i] = i; printf("%d\n",A[11]); return 0; } This type of behavior can be the source of some very hard-to-find bugs. If no explicit error occurs, other variables can have their values changed without any reasonable explanation. Finding and fixing this type of bug can be really hard, especially if the program runs with more than one active thread. There exists tools that have been built specially to track access violations like the one in the program above. Valgrind is probably the best known example. These tools are dynamic: they are used during the execution of the program, either emulating it, or instrumenting it to catch errors at runtime. We would like to implement the detection of array bounds violation statically. Which results the analysis will generate ------------------------------------------- This analysis will classify each array access site into one of two categories: (i) always safe (ii) possibly unsafe The locations classified as always safe will never be able to cause an array access violation, even if we take the wrapping semantics of integer arithmetics of C/C++. Additionally, I would like to implement an algorithm that inserts safety checks in some possibly unsafe locations. The less-than domain keeps, for each array, the variables that are equal to its allocated size. Thus, if this information is available in the sites that the array is used, we can insert a safety check there. For instance, consider the program below: #include <stdlib.h> #include <string.h> #include <stdio.h> char* alloca2(int size) { return (char*) malloc(size * sizeof(char)); } int main() { char* c = alloca2(10); char* x; strcpy(c, "hello!"); x = c; while (*x != '\0') { putc(*x, stdout); x++; } } The inter-procedural analysis should be able to relate 10 with the size of array c. Then it could insert the following test in the code: int main() { char* c = alloca2(10); char* x; strcpy(c, "hello!"); x = c; while (*x != '\0') { if (x > c + 10) { abort(); } putc(*x, stdout); x++; } } It is not possible to secure statically every array access site, but we should be able to secure at least a good percentage of them. Timeline -------- This is a three months project, and we would like to schedule the time in the following way (schedule given in weeks): [1-4]: Implement the less-than relational domain. [5-7] Implement static array bounds checking. The analysis should point that a violation might happen in a given array access site. [8-10] Implement a diagnosis tool, that inserts access checks whenever we have the information of the limits of an array available. [11-12] Final tests and report - Correctly compile the whole LLVM test suite. - Generate performance numbers. - Write the final report. My background ------------- I am currently a first year master student at the Federal University of Minas Gerais, under orientation of professor Fernando Pereira. I have graduated in this school on the year of 2011. I've worked with software development for five years, in different projects. I built a Data Warehouse viewer and Business Intelligence graphics and gauges using Delphi; Commercial programs in MS Visual FoxPro; Improvements in a logistics planning system, in Excel VBA. I have five years of experience with SQL. Now, I'm working with LLVM, writing passes to support scientifical researchers. Why I can do well in this project --------------------------------- I believe that I am a good candidate for this Summer of Code for four reasons. Firstly, I am a master student of computer science in the Federal University of Minas Gerais (UFMG), which is one of the best schools of computer science in Brazil. To support this claim, notice that UFMG has the maximum grade in the ranking of CAPES (Coordenação de Aperfeiçoamento de Pessoal de Nivel Superior - the Evaluation Body under the Ministry of Education). I am working in the compiler's lab, under the advice of Fernando Pereira, who was a Summer of Coder himself in 2006. Second, I have good experience with programming, with almost six years of experience in many languages (Java, C, C++, C#, VB, VBA, Object Pascal, SQL, etc...). Third, I'm already working with LLVM passes. I already know some LLVM tools and I already have the basic knowledge to start this project. I have written a pass to instrument LLVM bitcodes, in order to find the minimum and maximum values that each variable assumes at runtime. This code has been used to check the precision of our range analysis. It is publicly available at (http://code.google.com/p/range-analysis/source/browse/#svn%2Ftrunk%2Fsrc%2FRAInstrumentation). Finally, in the lab where I work there are six other people who work everyday with LLVM; thus, in addition to getting help in the forum, I can easily talk about my project to my colleagues. References ---------- [1] Dirk Beyer, Thomas A. Henzinger, Ranjit Jhala and Rupak Majumdar. Checking Memory Safety with Blast. FASE 2005, LNCS 3442, pages 2-18, Springer-Verlag, 2005 [2] Francesco Logozzo, Manuel Fähndrich: Pentagons: A weakly relational abstract domain for the efficient validation of array accesses. Sci. Comput. Program. 75(9): 796-807 (2010) [3] Rastislav Bodík, Rajiv Gupta, Vivek Sarkar: ABCD: eliminating array bounds checks on demand. PLDI 2000: 321-333
John Criswell
2012-Apr-03 15:20 UTC
[LLVMdev] Google Summer of Code proposal: Adding memory safety checks to the LLVM bitcodes
On 4/3/12 9:54 AM, Raphael Ernani Rodrigues wrote:> Dear LLVMers, > > I wrote a new proposal, to improve the static array bounds checking in > SAFEcode, as follows:I haven't had time to read over this carefully; hopefully I can do that tomorrow. However, one quick comment is that SAFECode already has a pass that instruments GEPs (the array indexing instruction in LLVM) with run-time checks. You can see the code at the URL below (or you can check out SAFECode and examine the source code): http://llvm.org/viewvc/llvm-project/safecode/trunk/lib/InsertPoolChecks/GEPChecks.cpp?revision=136382&view=markup SAFECode is moving to a design in which the Clang front-end inserts run-time checks and then later passes (including passes within libLTO) optimize those checks away. It makes the code easier to maintain and understand, and it makes experiments easier. Since your analysis will need to be run within libLTO *after* SAFECode has inserted its run-time checks, I recommend that you write a pass that *removes* run-time checks using your analysis. I think you should read the (very early draft of the) SAFECode Software Architecture Manual located within the SAFECode source tree (you'll need LaTeX to compile it to a PDF file). It describes the source code layout and provides an overview of how the analysis and instrumentation passes are integrated into Clang/libLTO. -- John T.> > > > Improving static array bounds checking in SAFEcode > =================================================> > > Objective > --------- > > the main objective of this project is to improve the static array > bounds checking engine used in SAFECode. It was written after the open > project at http://safecode.cs.illinois.edu/projects.html, which reads > as follows: > > "Improve static array bounds checking: SAFECode used to have an > inter-procedural static array bounds checking pass, but it suffered > serious scalability issues and was eventually disabled. Improving the > old code or writing a new pass would be extremely beneficial." > > I plan to build a new array bounds checking algorithm from scratch. > > How it will be done > ------------------- > > I will build an array bounds checking algorithm based on the ABCD > algorithm that was described by Bodik in the paper [3]. The key > insight in the ABCD was to use a lattice that some researchers, e.g., > [2], call the "less-than" lattice. Instead of determining precise > ranges for the variables using for instance, the interval lattice, > Bodik's less-than lattice finds, for each variable v, the set of other > variables that are less-than it. > It is a requirement that the algorithm must be relient against the > wrapping semantics of 2-complement's arithmetic. The original > algorithm proposed by Bodik is not. However, it is possible to limit > Bodik's algorithm in such a way that only overflow resilient > equalities are taken into consideration. In other words, if the source > program contains a comparison such as: > > if (x< y) { > // p1 > } else { > // p2 > } > > then we are guaranteed that inside p1 we have that x< y. > > I propose an analysis that, contrary to the original ABCD > algorithm is inter-procedural. To obtain context sensitiveness, which > is very important, I will initially use function inlining, which is > available in LLVM. If time allows it, I will implement true context > sensitiveness using an "on-demand" approach on the call graph of > functions. Context sensitiveness is remarkably important in this type > of domain because it is necessary to match array sizes and identifiers > across function calls in order to have a precise analysis, as > described in the paper [1]. > I do not plan to couple my analysis with the interval domain, but, > if the LLVM community judge that it should be done, than I can > implement my algorithm after the pentagon domain, proposed by Logozzo > and Fahndrich in 2010 [2]. The pentagon domain is a different approach > to get an analysis that is overflow resilient. It combines the > interval domain {-inf, ..., -1, 0, 1, ... +inf} with the "less-than" > domain. If some variable is bound to an interval like [l, u], where l > -inf and u< +inf, then we know that operations that add up to > MAX_INT - u to it will not produce overflow. Same for the lower limit. > I am reticent to use the pentagon domain because Logozzo et al report > some relatively long runtimes (in the order of minutes) and so this > approach might not attack the scalability issue that the SAFECode > maintainers would like to handle. > > > Background > ---------- > > Invalid memory access is a major source of program failures. If a > program statement dereferences a pointer that points to an invalid > memory cell, the program is either aborted by the operating system or, > often worse, the program continues to run with an undefined behavior. > To avoid the latter, one can perform checks before every memory access > at run time. For some programming languages (e.g., Java) this is done > automatically by the compiler/run-time environment. For the language > C, neither the compiler nor the run-time environment enforces > memory-safety policies [1]. > > An example of access violation is given by the program below. When > this program runs, probably no visible error happens. However, the > program violates the semantics of well defined C code: it is accessing > the 11-th cell of an array that was declared with only 10 cells. Thus, > the behavior of this program is undefined. > > #include<stdlib.h> > #include<stdio.h> > int main(int argc, char** argv){ > int i; > int* A; > A = (int*)calloc(10,sizeof(int)); > for(i=0;i<=11;i++) > A[i] = i; > printf("%d\n",A[11]); > return 0; > } > > This type of behavior can be the source of some very hard-to-find > bugs. If no explicit error occurs, other variables can have their > values changed without any reasonable explanation. Finding and fixing > this type of bug can be really hard, especially if the program runs > with more than one active thread. There exists tools that have been > built specially to track access violations like the one in the program > above. Valgrind is probably the best known example. These tools are > dynamic: they are used during the execution of the program, either > emulating it, or instrumenting it to catch errors at runtime. We would > like to implement the detection of array bounds violation statically. > > > Which results the analysis will generate > ------------------------------------------- > > This analysis will classify each array access site into one of two categories: > (i) always safe > (ii) possibly unsafe > > The locations classified as always safe will never be able to cause an > array access violation, even if we take the wrapping semantics of > integer arithmetics of C/C++. > > Additionally, I would like to implement an algorithm that inserts > safety checks in some possibly unsafe locations. The less-than domain > keeps, for each array, the variables that are equal to its allocated > size. Thus, if this information is available in the sites that the > array is used, we can insert a safety check there. For instance, > consider the program below: > > #include<stdlib.h> > #include<string.h> > #include<stdio.h> > > char* alloca2(int size) { > return (char*) malloc(size * sizeof(char)); > } > > int main() { > char* c = alloca2(10); > char* x; > strcpy(c, "hello!"); > x = c; > while (*x != '\0') { > putc(*x, stdout); > x++; > } > } > > The inter-procedural analysis should be able to relate 10 with the > size of array c. Then it could insert the following test in the code: > > int main() { > char* c = alloca2(10); > char* x; > strcpy(c, "hello!"); > x = c; > while (*x != '\0') { > if (x> c + 10) { > abort(); > } > putc(*x, stdout); > x++; > } > } > > It is not possible to secure statically every array access site, but > we should be able to secure at least a good percentage of them. > > > Timeline > -------- > > This is a three months project, and we would like to schedule the time > in the following way (schedule given in weeks): > > [1-4]: Implement the less-than relational domain. > [5-7] Implement static array bounds checking. The analysis should > point that a violation might happen in a given array access site. > [8-10] Implement a diagnosis tool, that inserts access checks whenever > we have the information of the limits of an array available. > [11-12] Final tests and report > - Correctly compile the whole LLVM test suite. > - Generate performance numbers. > - Write the final report. > > > My background > ------------- > > I am currently a first year master student at the Federal University > of Minas Gerais, under orientation of professor Fernando Pereira. I > have graduated in this school on the year of 2011. I've worked with > software development for five years, in different projects. I built a > Data Warehouse viewer and Business Intelligence graphics and gauges > using Delphi; Commercial programs in MS Visual FoxPro; Improvements in > a logistics planning system, in Excel VBA. I have five years of > experience with SQL. Now, I'm working with LLVM, writing passes to > support scientifical researchers. > > > Why I can do well in this project > --------------------------------- > > I believe that I am a good candidate for this Summer of Code for four > reasons. Firstly, I am a master student of computer science in the > Federal > University of Minas Gerais (UFMG), which is one of the best schools of > computer science in Brazil. To support this claim, notice that UFMG > has the maximum grade in the ranking of CAPES (Coordenação de > Aperfeiçoamento de Pessoal de Nivel Superior - the Evaluation Body > under the Ministry of Education). I am working in the compiler's lab, > under the advice of Fernando Pereira, who was a Summer of Coder > himself in 2006. Second, I have good experience with programming, with > almost six years of experience in many languages (Java, C, C++, C#, > VB, VBA, Object Pascal, SQL, etc...). Third, I'm already working with > LLVM passes. I already know some LLVM tools and I already have the > basic knowledge to start this project. I have written a pass to > instrument LLVM bitcodes, in order to find the minimum and maximum > values that each variable assumes at runtime. This code has been used > to check the precision of our range analysis. It is publicly available > at (http://code.google.com/p/range-analysis/source/browse/#svn%2Ftrunk%2Fsrc%2FRAInstrumentation). > Finally, in the lab where I work there are six other people > who work everyday with LLVM; thus, in addition to getting help in the > forum, I can easily talk about my project to my colleagues. > > > References > ---------- > > [1] Dirk Beyer, Thomas A. Henzinger, Ranjit Jhala and Rupak Majumdar. > Checking Memory Safety with Blast. FASE 2005, LNCS 3442, pages 2-18, > Springer-Verlag, 2005 > > [2] Francesco Logozzo, Manuel Fähndrich: Pentagons: A weakly > relational abstract domain for the efficient validation of array > accesses. Sci. Comput. Program. 75(9): 796-807 (2010) > > [3] Rastislav Bodík, Rajiv Gupta, Vivek Sarkar: ABCD: eliminating > array bounds checks on demand. PLDI 2000: 321-333
John Criswell
2012-Apr-05 00:21 UTC
[LLVMdev] Google Summer of Code proposal: Adding memory safety checks to the LLVM bitcodes
On 4/3/12 9:54 AM, Raphael Ernani Rodrigues wrote:> Dear LLVMers, > > I wrote a new proposal, to improve the static array bounds checking in > SAFEcode, as follows: > > > > Improving static array bounds checking in SAFEcode > =================================================> > > Objective > --------- > > the main objective of this project is to improve the static array > bounds checking engine used in SAFECode. It was written after the open > project athttp://safecode.cs.illinois.edu/projects.html, which reads > as follows: > > "Improve static array bounds checking: SAFECode used to have an > inter-procedural static array bounds checking pass, but it suffered > serious scalability issues and was eventually disabled. Improving the > old code or writing a new pass would be extremely beneficial." > > I plan to build a new array bounds checking algorithm from scratch.I think what you want to say here is that you're going to build a new static array bounds checking pass for SAFECode/LLVM but that the code may also be useful for LLVM sub-projects like VMKit and DragonEgg's Ada support. As far as organization goes, think you should provide the background section first and then describe how you will solve the problem.> How it will be done > ------------------- > > I will build an array bounds checking algorithm based on the ABCD > algorithm that was described by Bodik in the paper [3]. The key > insight in the ABCD was to use a lattice that some researchers, e.g., > [2], call the "less-than" lattice. Instead of determining precise > ranges for the variables using for instance, the interval lattice, > Bodik's less-than lattice finds, for each variable v, the set of other > variables that are less-than it. > It is a requirement that the algorithm must be relient against theReplace reliant with resilient.> wrapping semantics of 2-complement's arithmetic. The original > algorithm proposed by Bodik is not. However, it is possible to limit > Bodik's algorithm in such a way that only overflow resilient > equalities are taken into consideration. In other words, if the source > program contains a comparison such as: > > if (x< y) { > // p1 > } else { > // p2 > } > > then we are guaranteed that inside p1 we have that x< y.This example is not clear, does not demonstrate integer overflow, and is not related to memory safety. I would use an example like this: p = malloc ((x * sizeof (struct foo)) + 1); p[x] = ... It is not necessarily the case that 0 <= x <= ((x * sizeof (struct foo)) + 1). Therefore, this array access, which looks safe at first glance, may not be safe.> I propose an analysis that, contrary to the original ABCDs/contrary to/unlike> algorithm is inter-procedural. To obtain context sensitiveness, whichs/sensitiveness/sensitivity> is very important, I will initially use function inlining, which is > available in LLVM. If time allows it, I will implement true contextIf time permits> sensitiveness using an "on-demand" approach on the call graph of > functions. Context sensitiveness is remarkably important in this type > of domain because it is necessary to match array sizes and identifiers > across function calls in order to have a precise analysis, as > described in the paper [1]. > I do not plan to couple my analysis with the interval domain, but, > if the LLVM community judge that it should be done, than I can > implement my algorithm after the pentagon domain, proposed by Logozzo > and Fahndrich in 2010 [2]. The pentagon domain is a different approach > to get an analysis that is overflow resilient. It combines the > interval domain {-inf, ..., -1, 0, 1, ... +inf} with the "less-than" > domain. If some variable is bound to an interval like [l, u], where l > -inf and u< +inf, then we know that operations that add up to > MAX_INT - u to it will not produce overflow. Same for the lower limit. > I am reticent to use the pentagon domain because Logozzo et al report > some relatively long runtimes (in the order of minutes) and so this > approach might not attack the scalability issue that the SAFECode > maintainers would like to handle.I think you should remove the last paragraph; it doesn't seem relevant. Chances are good that you know more about these static analyses than the judges do. There's no point in letting them select which algorithm you implement.> > Background > ---------- > > Invalid memory access is a major source of program failures. If a > program statement dereferences a pointer that points to an invalid > memory cell, the program is either aborted by the operating system or, > often worse, the program continues to run with an undefined behavior.You can make an even stronger argument: out of bounds pointers are used to subvert system security.> To avoid the latter, one can perform checks before every memory access > at run time. For some programming languages (e.g., Java) this is done > automatically by the compiler/run-time environment. For the language > C, neither the compiler nor the run-time environment enforces > memory-safety policies [1]. > > An example of access violation is given by the program below. When > this program runs, probably no visible error happens. However, the > program violates the semantics of well defined C code: it is accessing > the 11-th cell of an array that was declared with only 10 cells. Thus, > the behavior of this program is undefined. > > #include<stdlib.h> > #include<stdio.h> > int main(int argc, char** argv){ > int i; > int* A; > A = (int*)calloc(10,sizeof(int)); > for(i=0;i<=11;i++) > A[i] = i; > printf("%d\n",A[11]); > return 0; > } > > This type of behavior can be the source of some very hard-to-find > bugs. If no explicit error occurs, other variables can have their > values changed without any reasonable explanation. Finding and fixing > this type of bug can be really hard, especially if the program runs > with more than one active thread. There exists tools that have been > built specially to track access violations like the one in the program > above. Valgrind is probably the best known example. These tools are > dynamic: they are used during the execution of the program, either > emulating it, or instrumenting it to catch errors at runtime. We would > like to implement the detection of array bounds violation statically.First, Valgrind does not do bounds checking in its default mode; it merely finds invalid loads and stores. Its ptrcheck tool does bounds checking but is slow and can sometimes be unreliable. Second, you should actually mention that tools for dynamic array bounds checking do exist (primarily SAFECode and SoftBound). You should emphasize that you're trying to optimize the speed at which SAFECode/SoftBound instrumented programs can run by optimizing away run-time array bounds checks. Third, I don't think you need a C code example to demonstrate your point. Everyone knows what a buffer overflow is. I think you want to argue that buffer overflows cause security vulnerabilities and hard-to-fix bugs, there are tools that catch these errors at run-time, and you're going to try to optimize the run-time checks by using static array bounds checking. Fourth, I think you should point out that your work can easily be modified to work with other tools such as VMKit, SoftBound, and the Dragonegg Ada front-end even if it targets SAFECode first.> > Which results the analysis will generate > ------------------------------------------- > > This analysis will classify each array access site into one of two categories: > (i) always safe > (ii) possibly unsafe > > The locations classified as always safe will never be able to cause an > array access violation, even if we take the wrapping semantics of > integer arithmetics of C/C++. > > Additionally, I would like to implement an algorithm that inserts > safety checks in some possibly unsafe locations. The less-than domain > keeps, for each array, the variables that are equal to its allocated > size. Thus, if this information is available in the sites that the > array is used, we can insert a safety check there. For instance, > consider the program below:As I mentioned before, SAFECode and SoftBound already insert the checks. Your analysis would be used to remove those that are unneeded.> > #include<stdlib.h> > #include<string.h> > #include<stdio.h> > > char* alloca2(int size) { > return (char*) malloc(size * sizeof(char)); > } > > int main() { > char* c = alloca2(10); > char* x; > strcpy(c, "hello!"); > x = c; > while (*x != '\0') { > putc(*x, stdout); > x++; > } > } > > The inter-procedural analysis should be able to relate 10 with the > size of array c. Then it could insert the following test in the code: > > int main() { > char* c = alloca2(10); > char* x; > strcpy(c, "hello!"); > x = c; > while (*x != '\0') { > if (x> c + 10) { > abort(); > } > putc(*x, stdout); > x++; > } > } > > It is not possible to secure statically every array access site, but > we should be able to secure at least a good percentage of them.Remove the code example and the last sentence.> > > Timeline > -------- > > This is a three months project, and we would like to schedule the time > in the following way (schedule given in weeks): > > [1-4]: Implement the less-than relational domain.Will you be reusing infrastructure from your research group? If so, make sure to say that. Also specify the open-source license under which that infrastructure is released. Finally, be sure to explain how that infrastructure is going to help you build the less-than analysis more quickly.> [5-7] Implement static array bounds checking. The analysis should > point that a violation might happen in a given array access site. > [8-10] Implement a diagnosis tool, that inserts access checks whenever > we have the information of the limits of an array available.You could do that, but then you're not helping make SAFECode faster. Your static array bounds checking pass should be used to remove run-time checks on GEPs that your analysis shows are always safe.> [11-12] Final tests and report > - Correctly compile the whole LLVM test suite. > - Generate performance numbers. > - Write the final report. > > > My background > ------------- > > I am currently a first year master student at the Federal University > of Minas Gerais, under orientation of professor Fernando Pereira. I > have graduated in this school on the year of 2011.I assume you have graduated with a Bachelor's. If so, you should state that explicitly. Also, it's "in the year" and not "on the year."> I've worked with > software development for five years, in different projects. I built a > Data Warehouse viewer and Business Intelligence graphics and gauges > using Delphi; Commercial programs in MS Visual FoxPro; Improvements in > a logistics planning system, in Excel VBA. I have five years of > experience with SQL. Now, I'm working with LLVM, writing passes to > support scientifical researchers.s/scientifical researchers/scientific research There are two problems with the above paragraph: 1) Your work in industry doesn't really argue for your strengths. Can you tell us how large these programs are? Are they still used today? Providing a few key details can make this stronger. 2) You don't highlight your C++ and LLVM programming skills, and nothing in your experience suggests that you have a strong compiler background. Can you explain more about what work you've done on the range analysis with LLVM? Do you have any other C++ programming experience?> > Why I can do well in this project > --------------------------------- > > I believe that I am a good candidate for this Summer of Code for four > reasons. Firstly, I am a master student of computer science in the > Federal > University of Minas Gerais (UFMG), which is one of the best schools of > computer science in Brazil. To support this claim, notice that UFMG > has the maximum grade in the ranking of CAPES (Coordenação de > Aperfeiçoamento de Pessoal de Nivel Superior - the Evaluation Body > under the Ministry of Education). I am working in the compiler's lab,"working in the compiler lab"> under the advice of Fernando Pereira, who was a Summer of Coder"under the supervision"> himself in 2006."who was a Google Summer of Code participant in 2006."> Second, I have good experience with programming, with > almost six years of experience in many languages (Java, C, C++, C#, > VB, VBA, Object Pascal, SQL, etc...). Third, I'm already working with > LLVM passes. I already know some LLVM tools and I already have the > basic knowledge to start this project. I have written a pass to > instrument LLVM bitcodes, in order to find the minimum and maximum > values that each variable assumes at runtime.Did this pass find the dynamic minimum and maximum values when the program was executed?> This code has been used > to check the precision of our range analysis. It is publicly available > at (http://code.google.com/p/range-analysis/source/browse/#svn%2Ftrunk%2Fsrc%2FRAInstrumentation). > Finally, in the lab where I work there are six other people > who work everyday with LLVM; thus, in addition to getting help in the > forum, I can easily talk about my project to my colleagues.I would drop the last sentence. Having smart colleagues is great, but it's a weak argument when trying to convince someone to hire you.> > References > ---------- > > [1] Dirk Beyer, Thomas A. Henzinger, Ranjit Jhala and Rupak Majumdar. > Checking Memory Safety with Blast. FASE 2005, LNCS 3442, pages 2-18, > Springer-Verlag, 2005 > > [2] Francesco Logozzo, Manuel Fähndrich: Pentagons: A weakly > relational abstract domain for the efficient validation of array > accesses. Sci. Comput. Program. 75(9): 796-807 (2010) > > [3] Rastislav Bodík, Rajiv Gupta, Vivek Sarkar: ABCD: eliminating > array bounds checks on demand. PLDI 2000: 321-333You didn't provide any references for SAFECode. See http://sva.cs.illinois.edu/pubs.html. -- John T.
Possibly Parallel Threads
- [LLVMdev] Google Summer of Code proposal: Adding memory safety checks to the LLVM bitcodes
- [LLVMdev] Google Summer of Code proposal: Adding memory safety checks to the LLVM bitcodes
- [LLVMdev] Google Summer of Code proposal: Adding memory safety checks to the LLVM bitcodes
- [LLVMdev] Google Summer of Code proposal: Adding memory safety checks to the LLVM bitcodes
- [LLVMdev] Google Summer of Code proposal: Adding memory safety checks to the LLVM bitcodes