Raphael Ernani Rodrigues
2012-Mar-30 18:08 UTC
[LLVMdev] Google Summer of Code proposal: Adding memory safety checks to the LLVM bitcodes
Dear LLVMers, My name is Raphael Ernani, and I am doing my MsC at the Federal University of Minas Gerais, Brazil. I have been using LLVM for a while, and I would like to participate in this year's Summer of Code. One particular idea, in your "open projects" page caught my eye, and I decided to write a proposal about it. The line that I liked in the page was "Create an LLVM pass that adds memory safety checks to code, like Valgrind does for binaries, or like mudflap does for gcc compiled code.", and my proposal is written below: ===============================================Adding memory safety checks to the LLVM bitcodes =============================================== Objective --------- The objective of this project is to create an LLVM pass that adds memory safety checks to code, like Valgrind does for binaries, or like mudflap does for gcc compiled code. How it will be done ------------------- We will achieve our goal by creating a LLVM pass that does the following: 1 - Propagate the array size to the sites in the program where the array is accessed. 2 - Insert code in the program to make a bound-check and throw an exception if the program tries to access a position out of the array bounds. 3 - Eliminate redundant bound-checks. Background ---------- Invalid memory access is a major source of program failures. If a program statement dereferences a pointer that points to an invalid memory cell, the program is either aborted by the operating system or, often worse, the program continues to run with an undefined behavior. To avoid the latter, one can perform checks before every memory access at run time. For some programming languages (e.g., Java) this is done automatically by the compiler/run-time environment. For the language C, neither the compiler nor the run-time environment enforces memory-safety policies [1]. An example of access violation is given by the program below. When this program runs, probably no visible error happens. However, the program violates the semantics of well defined C code: it is accessing the 11-th cell of an array that was declared with only 10 cells. Thus, the behavior of this program is undefined. #include <stdlib.h> #include <stdio.h> int main(int argc, char** argv){ int i; int* A; A = (int*)calloc(10,sizeof(int)); for(i=0;i<=11;i++) A[i] = i; printf("%d\n",A[11]); return 0; } This type of behavior can be the source of some very hard-to-find bugs. If no explicit error occurs, other variables can have their values changed without any reasonable explanation. Finding and fixing this type of bug can be really hard, especially if the program runs with more than one active thread. There exists tools that have been built specially to track access violations like the one in the program above. Valgrind is probably the best known example. These tools are dynamic: they are used during the execution of the program, either emulating it, or instrumenting it to catch errors at runtime. Tools like Valgrind and mudflap are very useful; however, LLVM does not have any similar function yet. Our goal is to help filling this gap. How we plan to improve the code ------------------------------- Before every read or write in a position of an array, our pass will add instructions to check if the index of the array is outside the bounds. The secured program, without optimizations, is shown below: #include <stdlib.h> #include <stdio.h> int main(int argc, char** argv){ int i, tmp0; int* A; A = (int*)calloc(10,sizeof(int)); for(i=0;i<=11;i++){ if (i<0 or i>10) raise("Array out of bounds."); A[i] = i; } tmp0 = 11; if (tmp0<0 or tmp0>10) raise("Array out of bounds."); printf("%d\n",A[11]); return 0; } In order to accomplish this object, the new LLVM pass will have to perform a number of actions: 1) Collect the allocation sizes of each array. This can be done at runtime, simply reading the arguments passed to the allocation function. 2) Store the size of the array together with the allocated area. This can also be done dynamically, increasing in 32 bits the size of the allocated area that is given to the array base address. 3) Prefix any array access code with a bounds check to ensure that the used index is inside the correct limits. After this, to improve the performance of the "safe programs", I can remove redundant checks, by not adding new checks where already exist a safety check. I can use two different abstract domains to accomplish this goal, a simpler, and another more complex one: 1) The domain of "less-than" names. This is the domain used in the ABCD array bounds check algorithm [2]. For each variable v is gives the set of variables that are provably less than v. It is relatively simple - in terms of implementation - to compute this domain; however, this computation is costly: it is cubic on the number of variables in the program. Nevertheless, I can implement the domain, to see if it pays off its costs. There was an implementation of ABCD in LLVM 2.7, and I can reuse a substantial part of that code. 2) The domain of intervals. For each variable v, this domain gives an interval [l, u] with the minimum and maximum values that v can assume during the program execution. I can use the implementation of range analysis that we have developed in the Federal University of Minas Gerais to compute this domain. The implementation is quite mature now, and I was one of the developers that worked on it. Linked programs ---------------- In order to be effective, the instrumentation pass should be a whole program analysis. However, even in this case, only the code that has been compiled with the pass will be secure. It is possible that library calls still lead to undefined behavior, like in the program below: // We do not have access to this code, because it is an external library: void foo(int* A, int i){ A[i] = i; } // We have secured this code with our instrumentation: #include <stdlib.h> #include <stdio.h> #include "A.h" int main(int argc, char** argv){ int i; int* A; A = (int*)calloc(10,sizeof(int)); for(i=0;i<=11;i++){ foo(A,i); } printf("%d\n",A[11]); return 0; } Timeline -------- This is a three months project, and we would like to schedule the time in the following way (schedule given in weeks): [1-4]: Change the allocation scheme used by LLVM to store the array size next to the area that is allocated to the array. [5-6] Write the pass to insert access checks before array accesses. [7-10] Improve the performance of safe programs. - Avoid inserting bound-checks where already exists a memory check. - Avoid inserting bound-checks where the "less-than" domain indicates that the memory access is safe. [11-12] Final tests and report - Correctly compile the whole LLVM test suite. - Generate performance numbers. - Write the final report. My background ------------- I am currently a first year master student at the Federal University of Minas Gerais, under orientation of professor Fernando Pereira. I have graduated in this school on the year of 2011. I've worked with software development for five years, in different projects. I built a Data Warehouse viewer and Business Intelligence graphics and gauges using Delphi; Commercial programs in MS Visual FoxPro; Improvements in a logistics planning system, in Excel VBA. I have five years of experience with SQL. Now, I'm working with LLVM, writing passes to support scientifical researchers. Why I can do well in this project --------------------------------- I believe that I am a good candidate for this Summer of Code for four reasons. Firstly, I am a master student of computer science in the Federal University of Minas Gerais (UFMG), which is one of the best schools of computer science in Brazil. To support this claim, notice that UFMG has the maximum grade in the ranking of CAPES (Coordenação de Aperfeiçoamento de Pessoal de Nivel Superior - the Evaluation Body under the Ministry of Education). I am working in the compiler's lab, under the advice of Fernando Pereira, who was a Summer of Coder himself in 2006. Second, I have good experience with programming, with almost six years of experience in many languages (Java, C, C++, C#, VB, VBA, Object Pascal, SQL, etc...). Third, I'm already working with LLVM passes. I already know some LLVM tools and I already have the basic knowledge to start this project. I have written a pass to instrument LLVM bitcodes, in order to find the minimum and maximum values that each variable assumes at runtime. This code has been used to check the precision of our range analysis. It is publicly available at ( http://code.google.com/p/range-analysis/source/browse/#svn%2Ftrunk%2Fsrc%2FRAInstrumentation ). Finally, in the lab where I work there are six other people who work everyday with LLVM; thus, in addition to getting help in the forum, I can easily talk about my project to my colleagues. References ---------- [1] Dirk Beyer, Thomas A. Henzinger, Ranjit Jhala and Rupak Majumdar. Checking Memory Safety with Blast. FASE 2005, LNCS 3442, pages 2-18, Springer-Verlag, 2005 [2] ABCD: eliminating array bounds checks on demand. Rastislav Bodik and Rajiv Gupta and Vivek Sarkar. PLDI 2000, ACM, pages 321-333 -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20120330/f1bc10b9/attachment.html>
John Criswell
2012-Mar-30 18:49 UTC
[LLVMdev] Google Summer of Code proposal: Adding memory safety checks to the LLVM bitcodes
On 3/30/12 1:08 PM, Raphael Ernani Rodrigues wrote:> Dear LLVMers, > > My name is Raphael Ernani, and I am doing my MsC at the Federal > University of Minas Gerais, Brazil. I have been using LLVM for a > while, and I would like to participate in this year's Summer of Code. > One particular idea, in your "open projects" page caught my eye, and I > decided to write a proposal about it. The line that I liked in the > page was "Create an LLVM pass that adds memory safety checks to code, > like Valgrind does for binaries, or like mudflap does for gcc compiled > code.", and my proposal is written below:Hi Raphael! I haven't read your proposal in detail, but there are actually three projects that implement memory safety by transforming LLVM bitcode: 1) ASAN is integrated into LLVM. It is meant as a low-overhead debugging tool in the spirit of Valgrind. It's primary technique is to place guard memory in between memory objects and check loads and stores to see if they return a poisoned bit-patter from a guard. It isn't sound (certain dangling pointer errors and array bound violations can get past it), but it's easy to implement. 2) SAFECode (my project: http://safecode.cs.illinois.edu) aims to provide sound memory safety checks and is a sub-project of LLVM. It inserts function pointer checks, precise structure and array bounds checks, load/store checks, and checks on C library function calls into code. It is also designed to work with automatic pool allocation to provide sound points-to analysis and partial type-safety, but these features are currently disabled because they are not sufficiently robust. While initially designed to protect applications during production, SAFECode has a generic pass to add debug information to its run-time checks, essentially make it a valgrind replacement like ASAN. 3) SoftBound and its CETS extension (http://www.cis.upenn.edu/acg/softbound) have been integrated into the SAFECode compiler and can be enabled with options in SAFECode's clang. It provides array bounds checking and, with CETS, optional dangling pointer detection. While I don't think we need another memory safety checker for LLVM, I do think that there's *a lot* of interesting projects in making each of these tools better. Some ideas that come to mind: o Creating common instrumentation passes for ASAN, SAFECode, and SoftBound to permit code reuse. All three essentially place checks on loads, stores, and atomic operations. If they use the same run-time check names, then the same optimizations can improve the speed of all three. o Creating or improving optimizations for memory safety checks. Static array bounds checking alone provides numerous options, such as implementing ABCD, using value-range analysis (mentioned in another GSoC post), investigating SMT solver-based techniques, etc, etc. Other optimizations include hoisting checks out of loops, polishing and re-enabling the SAFECode type-safety optimization, and removing redundant load/store checks. o Polishing and enabling the Baggy Bounds Checking (BBC) run-time. This run-time can speed up the run-time checks and should be faster than the splay-tree approach that SAFECode currently uses. I'll go ahead and create a list of open projects on the SAFECode web page (http://sva.cs.illinois.edu), and I'll remove the "Valgrind" tool project off the open projects page as there are now several memory safety tools built using LLVM. If you want my opinion, I think the static array bounds checker or the monotonic loop optimization make nice, self-contained projects. Finally, you might be interested in the Memory Safety Menagerie (http://sva.cs.illinois.edu/menagerie/). This web page contains a whole list of papers on the subject of memory safety transforms. -- John T.> > > ===============================================> Adding memory safety checks to the LLVM bitcodes > ===============================================> > > Objective > --------- > > The objective of this project is to create an LLVM pass that adds > memory safety checks to code, like Valgrind does for binaries, or like > mudflap does for gcc compiled code. > > > How it will be done > ------------------- > > We will achieve our goal by creating a LLVM pass that does the following: > 1 - Propagate the array size to the sites in the program where the > array is accessed. > 2 - Insert code in the program to make a bound-check and throw an > exception if the program tries to access a position out of the array > bounds. > 3 - Eliminate redundant bound-checks. > > > Background > ---------- > > Invalid memory access is a major source of program failures. If a > program statement dereferences a pointer that points to an invalid > memory cell, the program is either aborted by the operating system or, > often worse, the program continues to run with an undefined behavior. > To avoid the latter, one can perform checks before every memory access > at run time. For some programming languages (e.g., Java) this is done > automatically by the compiler/run-time environment. For the language > C, neither the compiler nor the run-time environment enforces > memory-safety policies [1]. > > An example of access violation is given by the program below. When > this program runs, probably no visible error happens. However, the > program violates the semantics of well defined C code: it is accessing > the 11-th cell of an array that was declared with only 10 cells. Thus, > the behavior of this program is undefined. > > #include <stdlib.h> > #include <stdio.h> > int main(int argc, char** argv){ > int i; > int* A; > A = (int*)calloc(10,sizeof(int)); > for(i=0;i<=11;i++) > A[i] = i; > printf("%d\n",A[11]); > return 0; > } > > This type of behavior can be the source of some very hard-to-find > bugs. If no explicit error occurs, other variables can have their > values changed without any reasonable explanation. Finding and fixing > this type of bug can be really hard, especially if the program runs > with more than one active thread. There exists tools that have been > built specially to track access violations like the one in the program > above. Valgrind is probably the best known example. These tools are > dynamic: they are used during the execution of the program, either > emulating it, or instrumenting it to catch errors at runtime. Tools > like Valgrind and mudflap are very useful; however, LLVM does not have > any similar function yet. Our goal is to help filling this gap. > > > How we plan to improve the code > ------------------------------- > > Before every read or write in a position of an array, our pass will > add instructions to check if the index of the array is outside the > bounds. > The secured program, without optimizations, is shown below: > > #include <stdlib.h> > #include <stdio.h> > int main(int argc, char** argv){ > int i, tmp0; > int* A; > A = (int*)calloc(10,sizeof(int)); > for(i=0;i<=11;i++){ > if (i<0 or i>10) raise("Array out of bounds."); > A[i] = i; > } > tmp0 = 11; > if (tmp0<0 or tmp0>10) raise("Array out of bounds."); > printf("%d\n",A[11]); > return 0; > } > > In order to accomplish this object, the new LLVM pass will have to > perform a number of actions: > 1) Collect the allocation sizes of each array. This can be done at > runtime, simply reading the arguments passed to the allocation > function. > 2) Store the size of the array together with the allocated area. This > can also be done dynamically, increasing in 32 bits the size of the > allocated area that is given to the array base address. > 3) Prefix any array access code with a bounds check to ensure that the > used index is inside the correct limits. > > After this, to improve the performance of the "safe programs", I can > remove redundant checks, by not adding new checks where already exist > a safety check. I can use two different abstract domains to accomplish > this goal, a simpler, and another more complex one: > > 1) The domain of "less-than" names. This is the domain used in the > ABCD array bounds check algorithm [2]. For each variable v is gives > the set of variables that are provably less than v. It is relatively > simple - in terms of implementation - to compute this domain; however, > this computation is costly: it is cubic on the number of variables in > the program. Nevertheless, I can implement the domain, to see if it > pays off its costs. There was an implementation of ABCD in LLVM 2.7, > and I can reuse a substantial part of that code. > > 2) The domain of intervals. For each variable v, this domain gives an > interval [l, u] with the minimum and maximum values that v can assume > during the program execution. I can use the implementation of range > analysis that we have developed in the Federal University of Minas > Gerais to compute this domain. The implementation is quite mature now, > and I was one of the developers that worked on it. > > Linked programs > ---------------- > > In order to be effective, the instrumentation pass should be a whole > program analysis. However, even in this case, only the code that has > been compiled with the pass will be secure. It is possible that > library calls still lead to undefined behavior, like in the program > below: > > // We do not have access to this code, because it is an external library: > void foo(int* A, int i){ > A[i] = i; > } > > // We have secured this code with our instrumentation: > #include <stdlib.h> > #include <stdio.h> > #include "A.h" > int main(int argc, char** argv){ > int i; > int* A; > A = (int*)calloc(10,sizeof(int)); > for(i=0;i<=11;i++){ > foo(A,i); > } > printf("%d\n",A[11]); > return 0; > } > > Timeline > -------- > > This is a three months project, and we would like to schedule the time > in the following way (schedule given in weeks): > > [1-4]: Change the allocation scheme used by LLVM to store the array > size next to the area that is allocated to the array. > > [5-6] Write the pass to insert access checks before array accesses. > > [7-10] Improve the performance of safe programs. > - Avoid inserting bound-checks where already exists a memory check. > - Avoid inserting bound-checks where the "less-than" domain indicates > that the memory access is safe. > > [11-12] Final tests and report > - Correctly compile the whole LLVM test suite. > - Generate performance numbers. > - Write the final report. > > > My background > ------------- > > I am currently a first year master student at the Federal University > of Minas Gerais, under orientation of professor Fernando Pereira. I > have graduated in this school on the year of 2011. I've worked with > software development for five years, in different projects. I built a > Data Warehouse viewer and Business Intelligence graphics and gauges > using Delphi; Commercial programs in MS Visual FoxPro; Improvements in > a logistics planning system, in Excel VBA. I have five years of > experience with SQL. Now, I'm working with LLVM, writing passes to > support scientifical researchers. > > > Why I can do well in this project > --------------------------------- > > I believe that I am a good candidate for this Summer of Code for four > reasons. Firstly, I am a master student of computer science in the > Federal > University of Minas Gerais (UFMG), which is one of the best schools of > computer science in Brazil. To support this claim, notice that UFMG > has the maximum grade in the ranking of CAPES (Coordenação de > Aperfeiçoamento de Pessoal de Nivel Superior - the Evaluation Body > under the Ministry of Education). I am working in the compiler's lab, > under the advice of Fernando Pereira, who was a Summer of Coder > himself in 2006. Second, I have good experience with programming, with > almost six years of experience in many languages (Java, C, C++, C#, > VB, VBA, Object Pascal, SQL, etc...). Third, I'm already working with > LLVM passes. I already know some LLVM tools and I already have the > basic knowledge to start this project. I have written a pass to > instrument LLVM bitcodes, in order to find the minimum and maximum > values that each variable assumes at runtime. This code has been used > to check the precision of our range analysis. It is publicly available > at > (http://code.google.com/p/range-analysis/source/browse/#svn%2Ftrunk%2Fsrc%2FRAInstrumentation). > > Finally, in the lab where I work there are six other people > who work everyday with LLVM; thus, in addition to getting help in the > forum, I can easily talk about my project to my colleagues. > > References > ---------- > > [1] Dirk Beyer, Thomas A. Henzinger, Ranjit Jhala and Rupak Majumdar. > Checking Memory Safety with Blast. FASE 2005, LNCS 3442, pages 2-18, > Springer-Verlag, 2005 > > [2] ABCD: eliminating array bounds checks on demand. Rastislav Bodik > and Rajiv Gupta and Vivek Sarkar. PLDI 2000, ACM, pages 321-333 > > > _______________________________________________ > LLVM Developers mailing list > LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu > http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20120330/cc95419e/attachment.html>
John Criswell
2012-Mar-30 19:13 UTC
[LLVMdev] Google Summer of Code proposal: Adding memory safety checks to the LLVM bitcodes
Dear All, Okay, the SAFECode open projects page is done: http://sva.cs.illinois.edu/projects.html. -- John T.
Raphael Ernani Rodrigues
2012-Apr-02 18:40 UTC
[LLVMdev] Google Summer of Code proposal: Adding memory safety checks to the LLVM bitcodes
Dear John, in the last e-mail we'd exchanged I forgot to CC the llvmdev list, so I am copying the e-mail back to the list.> One thing I should point out is that SAFECode and SoftBound (ideally) should be using a static array bounds checking algorithm that is sound with respect to integer overflow. For example, x < x+5 is always true if x is an integer, but it's not always true if x is a 32-bit two's complement number; additional constraints on x are needed to prove that x < x+5 is true. > > Does your analysis take two's complement overflow/underflow into account?No. The less-than lattice only reports if a variable is less than another, assuming that your lattice is {-inf, ... -2, -1, 0, 1, 2, ... +inf}. But I know of another lattice that takes overflows into consideration. It is called the Pentagon domain, and it was published by Logozzo and Fahndrich in the paper [1]: [1] Francesco Logozzo, Manuel Fähndrich: Pentagons: A weakly relational abstract domain for the efficient validation of array accesses. Sci. Comput. Program. 75(9): 796-807 (2010) Pentagons combine the less-than lattice with the interval lattice to have an analysis that is sound in terms of wrapping arithmetics. We have an implementation of range analysis here at UFMG. It also uses the lattice {-inf, ... -1, 0, 1, ... +inf}, but it is safe to use limits that are bound by values different than infinity, e.g., [10, 100], for instance. On the other hand, [10, +inf] could wrap around, thus compromising the lower bound, and it is not safe to use it.> > Just to make sure I understand: you're using abstract interpretation to map SSA values to elements in some lattice. In your standard value range analysis, you're assigning concrete values ranges (e.g., (4,10)) to each SSA value. In the "less-than" domain, you're mapping each LLVM SSA value to a set of SSA values which are greater than or equal to the SSA value (e.g., x -> [a, y, z] means that x <=a and x <= y and x <=z). Is this correct?Yes!> > Interesting. One problem with on-demand analysis (at least the way it's done in LLVM for things like alias analysis) is that it gets tricky for inter-procedural analyses. > > With the less-than domain, I suppose what you could do is have the "user" of the analysis first specify which variable relationships they want to query. With that information, your 'less-than' analysis runs as usual with the exception that it ignores variables not queried by the "user", thus making the sets of variables smaller. The run-time is still cubic, but the input size is smaller.Yes: that is the idea of doing it on demand. And the analysis must be context-sensitive too. Venet and Brat had discussed it before in the paper [2]. One way to get a bit of context sensitiveness is with function inlining, that LLVM already does for us. [2] Precise and Efficient Static Array Bound Checking for Large Embedded C Programs, Arnaud Venet and Guillaume Brat. Thus, to summary my GSoC project, I would implement an array bounds check elimination engine, after the paper [1], using the Pentagon domain, which combines the less-than lattice with the interval lattice. The analysis would be context sensitive. I think to do all of it in three months is too much, but I could implement the pentagon domain, and then use function inlining to get some sort of context sensitiveness. I would pursue this project further, as my Master dissertation. Then I would try to implement true context sensitiveness. What do you think? Could I give it a try and rewrite my initial proposal? By the way, I've been following your discussion regarding Victor's proposal. We worked together in the implementation of the range analysis engine. regards, Raphael.
Kostya Serebryany
2012-Apr-03 05:08 UTC
[LLVMdev] Google Summer of Code proposal: Adding memory safety checks to the LLVM bitcodes
On Fri, Mar 30, 2012 at 11:49 AM, John Criswell <criswell at illinois.edu>wrote:> On 3/30/12 1:08 PM, Raphael Ernani Rodrigues wrote: > > Dear LLVMers, > > My name is Raphael Ernani, and I am doing my MsC at the Federal > University of Minas Gerais, Brazil. I have been using LLVM for a > while, and I would like to participate in this year's Summer of Code. > One particular idea, in your "open projects" page caught my eye, and I > decided to write a proposal about it. The line that I liked in the > page was "Create an LLVM pass that adds memory safety checks to code, > like Valgrind does for binaries, or like mudflap does for gcc compiled > code.", and my proposal is written below: > > > Hi Raphael! > > I haven't read your proposal in detail, but there are actually three > projects that implement memory safety by transforming LLVM bitcode: > > 1) ASAN is integrated into LLVM. It is meant as a low-overhead debugging > tool in the spirit of Valgrind. It's primary technique is to place guard > memory in between memory objects and check loads and stores to see if they > return a poisoned bit-patter from a guard. It isn't sound (certain > dangling pointer errors and array bound violations can get past it), but > it's easy to implement. > > 2) SAFECode (my project: http://safecode.cs.illinois.edu) aims to provide > sound memory safety checks and is a sub-project of LLVM. It inserts > function pointer checks, precise structure and array bounds checks, > load/store checks, and checks on C library function calls into code. It is > also designed to work with automatic pool allocation to provide sound > points-to analysis and partial type-safety, but these features are > currently disabled because they are not sufficiently robust. While > initially designed to protect applications during production, SAFECode has > a generic pass to add debug information to its run-time checks, essentially > make it a valgrind replacement like ASAN. > > 3) SoftBound and its CETS extension ( > http://www.cis.upenn.edu/acg/softbound) have been integrated into the > SAFECode compiler and can be enabled with options in SAFECode's clang. It > provides array bounds checking and, with CETS, optional dangling pointer > detection. > > While I don't think we need another memory safety checker for LLVM, I do > think that there's *a lot* of interesting projects in making each of these > tools better. Some ideas that come to mind: > > o Creating common instrumentation passes for ASAN, SAFECode, and SoftBound > to permit code reuse. All three essentially place checks on loads, stores, > and atomic operations. If they use the same run-time check names, then the > same optimizations can improve the speed of all three. > > o Creating or improving optimizations for memory safety checks. Static > array bounds checking alone provides numerous options, such as implementing > ABCD, using value-range analysis (mentioned in another GSoC post), > investigating SMT solver-based techniques, etc, etc. Other optimizations > include hoisting checks out of loops, polishing and re-enabling the > SAFECode type-safety optimization, and removing redundant load/store checks. >"removing redundant load/store checks" could be quite useful for both AddressSanitizer (http://clang.llvm.org/docs/AddressSanitizer.html) and ThreadSanitizer ( http://code.google.com/p/data-race-test/wiki/ThreadSanitizer2). Some basic examples here: http://code.google.com/p/address-sanitizer/wiki/CompileTimeOptimizations I am looking into having some simple and human verifiable analysis with lots of unit tests, rather than something heavyweight and exhaustive. --kcc> > o Polishing and enabling the Baggy Bounds Checking (BBC) run-time. This > run-time can speed up the run-time checks and should be faster than the > splay-tree approach that SAFECode currently uses. > > I'll go ahead and create a list of open projects on the SAFECode web page ( > http://sva.cs.illinois.edu), and I'll remove the "Valgrind" tool project > off the open projects page as there are now several memory safety tools > built using LLVM. If you want my opinion, I think the static array bounds > checker or the monotonic loop optimization make nice, self-contained > projects. > > Finally, you might be interested in the Memory Safety Menagerie ( > http://sva.cs.illinois.edu/menagerie/). This web page contains a whole > list of papers on the subject of memory safety transforms. > > -- John T. > > > > ===============================================> Adding memory safety checks to the LLVM bitcodes > ===============================================> > > Objective > --------- > > The objective of this project is to create an LLVM pass that adds > memory safety checks to code, like Valgrind does for binaries, or like > mudflap does for gcc compiled code. > > > How it will be done > ------------------- > > We will achieve our goal by creating a LLVM pass that does the following: > 1 - Propagate the array size to the sites in the program where the > array is accessed. > 2 - Insert code in the program to make a bound-check and throw an > exception if the program tries to access a position out of the array > bounds. > 3 - Eliminate redundant bound-checks. > > > Background > ---------- > > Invalid memory access is a major source of program failures. If a > program statement dereferences a pointer that points to an invalid > memory cell, the program is either aborted by the operating system or, > often worse, the program continues to run with an undefined behavior. > To avoid the latter, one can perform checks before every memory access > at run time. For some programming languages (e.g., Java) this is done > automatically by the compiler/run-time environment. For the language > C, neither the compiler nor the run-time environment enforces > memory-safety policies [1]. > > An example of access violation is given by the program below. When > this program runs, probably no visible error happens. However, the > program violates the semantics of well defined C code: it is accessing > the 11-th cell of an array that was declared with only 10 cells. Thus, > the behavior of this program is undefined. > > #include <stdlib.h> > #include <stdio.h> > int main(int argc, char** argv){ > int i; > int* A; > A = (int*)calloc(10,sizeof(int)); > for(i=0;i<=11;i++) > A[i] = i; > printf("%d\n",A[11]); > return 0; > } > > This type of behavior can be the source of some very hard-to-find > bugs. If no explicit error occurs, other variables can have their > values changed without any reasonable explanation. Finding and fixing > this type of bug can be really hard, especially if the program runs > with more than one active thread. There exists tools that have been > built specially to track access violations like the one in the program > above. Valgrind is probably the best known example. These tools are > dynamic: they are used during the execution of the program, either > emulating it, or instrumenting it to catch errors at runtime. Tools > like Valgrind and mudflap are very useful; however, LLVM does not have > any similar function yet. Our goal is to help filling this gap. > > > How we plan to improve the code > ------------------------------- > > Before every read or write in a position of an array, our pass will > add instructions to check if the index of the array is outside the > bounds. > The secured program, without optimizations, is shown below: > > #include <stdlib.h> > #include <stdio.h> > int main(int argc, char** argv){ > int i, tmp0; > int* A; > A = (int*)calloc(10,sizeof(int)); > for(i=0;i<=11;i++){ > if (i<0 or i>10) raise("Array out of bounds."); > A[i] = i; > } > tmp0 = 11; > if (tmp0<0 or tmp0>10) raise("Array out of bounds."); > printf("%d\n",A[11]); > return 0; > } > > In order to accomplish this object, the new LLVM pass will have to > perform a number of actions: > 1) Collect the allocation sizes of each array. This can be done at > runtime, simply reading the arguments passed to the allocation > function. > 2) Store the size of the array together with the allocated area. This > can also be done dynamically, increasing in 32 bits the size of the > allocated area that is given to the array base address. > 3) Prefix any array access code with a bounds check to ensure that the > used index is inside the correct limits. > > After this, to improve the performance of the "safe programs", I can > remove redundant checks, by not adding new checks where already exist > a safety check. I can use two different abstract domains to accomplish > this goal, a simpler, and another more complex one: > > 1) The domain of "less-than" names. This is the domain used in the > ABCD array bounds check algorithm [2]. For each variable v is gives > the set of variables that are provably less than v. It is relatively > simple - in terms of implementation - to compute this domain; however, > this computation is costly: it is cubic on the number of variables in > the program. Nevertheless, I can implement the domain, to see if it > pays off its costs. There was an implementation of ABCD in LLVM 2.7, > and I can reuse a substantial part of that code. > > 2) The domain of intervals. For each variable v, this domain gives an > interval [l, u] with the minimum and maximum values that v can assume > during the program execution. I can use the implementation of range > analysis that we have developed in the Federal University of Minas > Gerais to compute this domain. The implementation is quite mature now, > and I was one of the developers that worked on it. > > Linked programs > ---------------- > > In order to be effective, the instrumentation pass should be a whole > program analysis. However, even in this case, only the code that has > been compiled with the pass will be secure. It is possible that > library calls still lead to undefined behavior, like in the program > below: > > // We do not have access to this code, because it is an external library: > void foo(int* A, int i){ > A[i] = i; > } > > // We have secured this code with our instrumentation: > #include <stdlib.h> > #include <stdio.h> > #include "A.h" > int main(int argc, char** argv){ > int i; > int* A; > A = (int*)calloc(10,sizeof(int)); > for(i=0;i<=11;i++){ > foo(A,i); > } > printf("%d\n",A[11]); > return 0; > } > > Timeline > -------- > > This is a three months project, and we would like to schedule the time > in the following way (schedule given in weeks): > > [1-4]: Change the allocation scheme used by LLVM to store the array > size next to the area that is allocated to the array. > > [5-6] Write the pass to insert access checks before array accesses. > > [7-10] Improve the performance of safe programs. > - Avoid inserting bound-checks where already exists a memory check. > - Avoid inserting bound-checks where the "less-than" domain indicates > that the memory access is safe. > > [11-12] Final tests and report > - Correctly compile the whole LLVM test suite. > - Generate performance numbers. > - Write the final report. > > > My background > ------------- > > I am currently a first year master student at the Federal University > of Minas Gerais, under orientation of professor Fernando Pereira. I > have graduated in this school on the year of 2011. I've worked with > software development for five years, in different projects. I built a > Data Warehouse viewer and Business Intelligence graphics and gauges > using Delphi; Commercial programs in MS Visual FoxPro; Improvements in > a logistics planning system, in Excel VBA. I have five years of > experience with SQL. Now, I'm working with LLVM, writing passes to > support scientifical researchers. > > > Why I can do well in this project > --------------------------------- > > I believe that I am a good candidate for this Summer of Code for four > reasons. Firstly, I am a master student of computer science in the > Federal > University of Minas Gerais (UFMG), which is one of the best schools of > computer science in Brazil. To support this claim, notice that UFMG > has the maximum grade in the ranking of CAPES (Coordenação de > Aperfeiçoamento de Pessoal de Nivel Superior - the Evaluation Body > under the Ministry of Education). I am working in the compiler's lab, > under the advice of Fernando Pereira, who was a Summer of Coder > himself in 2006. Second, I have good experience with programming, with > almost six years of experience in many languages (Java, C, C++, C#, > VB, VBA, Object Pascal, SQL, etc...). Third, I'm already working with > LLVM passes. I already know some LLVM tools and I already have the > basic knowledge to start this project. I have written a pass to > instrument LLVM bitcodes, in order to find the minimum and maximum > values that each variable assumes at runtime. This code has been used > to check the precision of our range analysis. It is publicly available > at ( > http://code.google.com/p/range-analysis/source/browse/#svn%2Ftrunk%2Fsrc%2FRAInstrumentation > ). > Finally, in the lab where I work there are six other people > who work everyday with LLVM; thus, in addition to getting help in the > forum, I can easily talk about my project to my colleagues. > > References > ---------- > > [1] Dirk Beyer, Thomas A. Henzinger, Ranjit Jhala and Rupak Majumdar. > Checking Memory Safety with Blast. FASE 2005, LNCS 3442, pages 2-18, > Springer-Verlag, 2005 > > [2] ABCD: eliminating array bounds checks on demand. Rastislav Bodik > and Rajiv Gupta and Vivek Sarkar. PLDI 2000, ACM, pages 321-333 > > > _______________________________________________ > LLVM Developers mailing listLLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.eduhttp://lists.cs.uiuc.edu/mailman/listinfo/llvmdev > > > > _______________________________________________ > LLVM Developers mailing list > LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu > http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev > >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20120402/43a75f82/attachment.html>
Reasonably Related Threads
- [LLVMdev] Google Summer of Code proposal: Adding memory safety checks to the LLVM bitcodes
- [LLVMdev] Google Summer of Code proposal: Adding memory safety checks to the LLVM bitcodes
- [LLVMdev] Google Summer of Code proposal: Adding memory safety checks to the LLVM bitcodes
- [LLVMdev] Google Summer of Code proposal: Adding memory safety checks to the LLVM bitcodes
- [LLVMdev] SAFECode testsuite query