Adve, Vikram Sadanand via llvm-dev
2015-Oct-09 16:45 UTC
[llvm-dev] llvm-dev Digest, Vol 136, Issue 22
(Note to self: learn to scan the full digest for later messages in a thread before replying to an earlier message.) Ed, Your reply to John answered some of my questions, but not all, and raised a new one:> Maybe I should have been a bit clearer; we're really interested in full > memory and type safety. We want to harden the system against memory > corruption vulnerabilities. Process isolation isn't an issue, as we are > in an embedded context where we don't have processes.Do you also care about use-after-free and free-after-free errors? Handling free-after-free is not too hard, but general use-after-free can be expensive, unless you’re willing to spend a fair amount of extra memory. If dynamic allocation isn’t an issue for your (embedded) applications, that makes this whole problem go away, of course. This brings me to the new question: exactly what unsafe features do you need to be concerned about? Our first paper in the SAFECode project focused on identifying a subset of C for which type and memory safety could be enforced without *any* run-time checks or GC. We found that array bounds checks were the biggest problem; ignoring these, we can ensure the safety of pointer and dynamic memory usage in all the embedded benchmarks we tried without any run-time checks. The paper appeared at LCTES: http://llvm.org/pubs/2003-05-05-LCTES03-CodeSafety.html Like the later work, this did not eliminate use-after-free but used APA to ensure that any use-after-free errors are “harmless” in that they did not violate all the other type- and memory-safety guarantees. Let me know if you have any questions about this. --Vikram // Vikram S. Adve // Professor, Department of Computer Science // University of Illinois at Urbana-Champaign // vadve at illinois.edu // http://llvm.org <http://llvm.org/> // // "A ‘No’ uttered from the deepest conviction is better than // a ‘Yes’ merely uttered to please, or worse, to avoid trouble." // --Mahatma Gandhi // On 10/9/15, 1:05 AM, "llvm-dev on behalf of via llvm-dev" <llvm-dev-bounces at lists.llvm.org on behalf of llvm-dev at lists.llvm.org> wrote:>Date: Thu, 08 Oct 2015 23:50:33 +0100 >From: Ed Robbins via llvm-dev <llvm-dev at lists.llvm.org> >To: John Criswell <jtcriswel at gmail.com>, llvm-dev at lists.llvm.org >Subject: Re: [llvm-dev] Pool allocator + safecode >Message-ID: > <1444344633.2719095.405296865.10B20F64 at webmail.messagingengine.com> >Content-Type: text/plain > >Thanks again for the really nice response John. > >On Thu, Oct 8, 2015, at 10:14 PM, John Criswell wrote: >>The original llvm-gcc (in which we modified gcc to produce LLVM bitcode) >>no longer exists. When GCC was extended to use plugins, someone created >>the DragonEgg plugin. I am not sure if that still exists. >>Is there a reason you need a GCC-based front-end? Clang is intended to >>be a drop-in replacement for GCC and supports many of the GCC extensions. > >The only reason was to make it easier to drop into the toolchain. I >don't anticipate that being a huge amount of work to switch to clang >anyway, but I saw that SafeCode was originally using llvm-gcc so I >thought that might make life easy. > >> Integrating into Clang simply makes it easier to run >>the SAFECode transformation passes after the LLVM passes have been run. > >So basically it means you don't have to run safecode manually? > >>DSA is a points-to and call graph analysis. As such, it is used for >>many things in addition to APA and SAFECode. For example, the SMACK >>verifier uses DSA. > >Got it. > >>The APA code has been bit rotting for awhile. While you can compile and >>run it with LLVM 3.2, it won't work as well as it did for the paper, and >>it probably needs some work to make it robust. Please remember that APA >>has always been a research prototype. > >I really want to bring all this into trunk. I've spent this afternoon >chugging through minor API incompatibilities trying to get APA to build. >Not there yet! > >>> We really want the all singing all dancing safecode framework with APA >>> as detailed in the 2005 TECS SafeCode paper (Memory Safety Without >>> Garbage Collection for Embedded Applications). We are trying to build a >>> C based embedded system that is type safe at the lowest possible run >>> time cost. So I am also going to modify the uninitialized pointer MMU >>> based stuff to work with the ARM Cortex M3 MPU. I don't think there are >>> any shortcuts here (I'd be happy to be proved wrong though) - we need >>> APA. >>I see. >>If you just need to isolate processes on your system so that they don't >>overwrite each other, it looks like the ARM Cortex M3 MPU can do that. >>Is there a reason why that won't work? > >Maybe I should have been a bit clearer; we're really interested in full >memory and type safety. We want to harden the system against memory >corruption vulnerabilities. Process isolation isn't an issue, as we are >in an embedded context where we don't have processes. I was really >talking about setting uninitialised pointers to a value that is >configured in the MPU to be inaccessible, and using a handler to abort >in the case where an attempt is made to read/write to any of these >pointers. > >>However, if you want to isolate processes using inferred type safety, >>you will need to address three significant challenges: >>1) Due to changes in LLVM, DSA has a difficult time inferring types. In >>a nutshell, some LLVM optimizations turned typed-getelementptr (GEP) >>instructions into GEPs that use byte-level indexing. DSA was written >>with the assumption that GEPs carried high-level type information. >>I think this is fixable by refactoring DSA's type inference code to act >>more like Value Set Analysis: it would use a map between a 4-tuple and a >>type. The 4-tuple (a, b, c, d) describes a formula ax + b in which a >>and b denote offset and stride and c and d place a limit on the lower >>and upper bounds of x. c and d can be +- infinity, allowing the 4-tuple >>to denote a type that occurs in an unbounded array. > >This sounds like a good idea. I'm willing to give this a go, time >permitting. > >>As an FYI, there are multiple research groups interested in accurate >>points-to analysis results (mine included). I'll be holding a BoF at >>the LLVM Developer's meeting this year to discuss who needs what and if >>there's a way to develop it. > >That doesn't surprise me. If someone would like to do this it would be >very handy from my point of view too! > >>2) Making APA robust. In my personal opinion, the original APA may be >>more complicated than necessary for your application. APA currently >>creates context sensitive pools, which means that it needs to pass pool >>handles around. This requires transforming function signatures which >>makes inter-operating with external code a real pain. It also makes the >>transform complicated. >>A simpler design would be to create one pool for each type that DSA >>infers. The points-to analysis would not be sound (unless it unified >>all points-to sets that have the same type), but it would allow all >>pools (or nearly all pools) to be global variables, greatly simplifying >>the transformation. It would also continue to prevent the application >>from accessing data outside its allocated memory. > >We aren't going to need to inter-operate with external code all that >much (if at all), so perhaps this will not be an issue. DSA and the >context sensitive pools of APA (despite the ugly transform) were >originally what attracted us to SafeCode; we would like to remain sound. > >>3) The version of SAFECode in that paper used the Omega constraint >>solver to prove that array accesses are safe. That code has long >>bit-rotted away, and its implementation was not the most efficient (it >>exec()'ed the Omega solver for every query). A better approach today >>would be to integrate the constraint solver into the compiler proper. >>Additionally, you now have other tools available, such as CVC4, Z3, and >>SMACK/Boogie, for building and solving the constraints. > >We had already noticed that the limit to linear array access was way >behind state of the art (we have a lot of experience with constraint >solvers). Using an SMT solver would make sense, as you can use the >theory that suits the particular problem - linear isn't good enough? >You've got IDL, octagons etc. > >To start off with we'd like to get something running, even if it has >limitations. Unless you think it's a really bad idea I will for the >moment try to get the old APA working to some extent again. It isn't >really an issue if I have to run APA and safecode as separate parts of >the build process for the moment, or if parts of it are clunky, from our >point of view it would just be good just to get some toy examples >working relatively quickly. > >>As an FYI, later versions of SAFECode use run-time checks for >>type-unsafe pools and array accesses so that it can support the full >>generality of C (see Dhurjati's PLDI 2007 paper and my SVA >>publications). You could probably do something simple in which >>type-safe data goes into pools and type-unsafe data goes into a large >>area for which loads and stores are subjected to simple SFI-like >>instrumentation (e.g., Google Native Client). There may be solutions in >>the middle of the spectrum between fast-but-difficult-to-implement and >>slow-but-easy-to-implement. > >We want speed speed speed! If we only support a (preferably large) >subset of C it doesn't matter - this is better than having additional >run time checks. Is there an option to warn whenever a run-time check is >inserted? I'll check out some of the other papers. I read the TECS paper >and gave a few of the others a skim, but looks like I missed some stuff. > >Cheers, >Ed
Ed Robbins via llvm-dev
2015-Oct-12 15:51 UTC
[llvm-dev] llvm-dev Digest, Vol 136, Issue 22
On Fri, Oct 9, 2015, at 05:45 PM, Adve, Vikram Sadanand wrote:> > Maybe I should have been a bit clearer; we're really interested in full > > memory and type safety. We want to harden the system against memory > > corruption vulnerabilities. Process isolation isn't an issue, as we are > > in an embedded context where we don't have processes. > > > Do you also care about use-after-free and free-after-free errors? > Handling free-after-free is not too hard, but general use-after-free can > be expensive, unless you’re willing to spend a fair amount of extra > memory. If dynamic allocation isn’t an issue for your (embedded) > applications, that makes this whole problem go away, of course.Yes, we do care about these issues. We are happy with the trick used in the TECS paper - that use after frees are type safe in that they will either point to the original object or an object of the same type.> > This brings me to the new question: exactly what unsafe features do you > need to be concerned about? Our first paper in the SAFECode project > focused on identifying a subset of C for which type and memory safety > could be enforced without *any* run-time checks or GC. We found that > array bounds checks were the biggest problem; ignoring these, we can > ensure the safety of pointer and dynamic memory usage in all the embedded > benchmarks we tried without any run-time checks. The paper appeared at > LCTES: > http://llvm.org/pubs/2003-05-05-LCTES03-CodeSafety.html > Like the later work, this did not eliminate use-after-free but used APA > to ensure that any use-after-free errors are “harmless” in that they did > not violate all the other type- and memory-safety guarantees. > > Let me know if you have any questions about this. >We're not sure what features to be concerned about until we actually try to run it on some of the intended code base. So probably, everything. But we are happy to spend some effort in refactoring programs if it means we can get type/memory safety without run time checks. Most of the C semantics that need to be dropped appear to be things that are quite bad practice anyway (except perhaps complex array indexing). I've been reading through this and a few other publications in order to try and get a better handle on the code. Having not done anything with llvm before and not knowing DSA/APA/SafeCode intimately I am a bit out of my depth so the background reading is great. Also, your code is really well commented :) The main question I have at the moment is what is really broken with APA in its current state? Or is it just that it's suspected to be broken? I've managed to build and run it using opt, and it appears to work just as well as it did for llvm-3.2. I was expecting to have to modify something to work around the untyped GEPs that John mentioned, but in fact I haven't had any such issues. I run using: path/to/opt -S -load path/to/LLVMDataStructure.so -load path/to/poolalloc.so -paheur-AllHeapNodes -poolalloc myprog.bc > myprog.pa.ll and get the same (modulo bytecode syntax changes) as with llvm 3.2. I have also tried some of the other heuristics and have not found any differences. But I don't think I tried with any programs that need more than one pool yet (so haven't really exercised it), and I haven't tried running sc over the output yet either. It would be useful if I could get a pointer to some test programs that really exercise PA. This is next on my todo list (going to check those referenced in your publications first). Thanks for your help Vikram. Best, Ed
Ed Robbins via llvm-dev
2016-Jan-25 13:13 UTC
[llvm-dev] llvm-dev Digest, Vol 136, Issue 22
Hi John, Vikram and all, I've been working on this a while now and thought I'd send a mail to report some status. I've also got various patches I'd like to submit for review, but I'm still getting my commits in order so I will follow up with those later. The context is that I'm working on getting the contiki embedded (Internet of Things) OS working with some notion of type and memory safety, on the Texas Instruments CC2538. This is a 32MHz Cortex M3 with 32K RAM, 512K flash, a zigbee/802.15.4 wireless interface, and an MPU. I've revived poolalloc, which was not actually too much work. Currently I'm using it to create pools for heap allocations only (not to create pool handles for safecode). There were just a couple of incompatibilities between it and LLVM 3.7. I also had to make a couple of changes to support ARM, and also make changes in the runtime (FL2 allocator) where I replaced the mutex implementation and disabled pointer compression. The pointer compression routines want an MMU, and the cc2538 only has an MPU. Safecode has been more work. There were some bugs and a bunch more changes due to incompatibilities with LLVM 3.7 (in fact the code built fine, but was crashing or not working when you tried to run the passes). I've also changed it to build shared objects so that I can run everything through opt, and that all works fine. I'm using the relevant passes to insert checks, and the optimisations that remove them, and it looks at the moment as though ~60% of checks are removed by static passes, which is not bad. I'm not using any of the baggy bounds checking, dangling pointer checks, or softbound, as they either need an MMU or wont fit into memory. I tried using the control flow integrity checking but it causes the object files to really blow up in size and they wont fit into flash anymore afterwards (they use too much RAM as well). I basically rewrote most of the runtime, creating an optional new runtime called "EmbeddedRuntime". This was required because although safecode exports all its runtime functionality to C, it uses a lot of C++ STL bits such as set,map etc, which would require linking against libstdc++, and we can't get this into memory. So the EmbeddedRuntime uses a very simple binary tree (based on tsearch) to store allocated ranges, and other parts of the runtime have been rewritten to remove STL stuff. Building the poolalloc and safecode runtimes is a bit messy. I've had to create my own makefiles and build outside of the main build system because I need to target arm (with different flags) and my main build targets x86. If anyone has ideas about a cleaner solution I'd love to hear from them. A note about memory safety; since poolalloc is not yet creating pool handles, many of the checks are not completely memory safe. Those objects will be registered into and checked against the global tree instead, so often although we can guarantee they remain in an allocated range, we can't be sure they remain in the correct range. However, I hope to rectify this at some point so that we can get complete memory safety. I'd also like to be sure we can trust libc, but am just trusting the wrappers for now. Best, Ed -- Ed Robbins ed at dismantleengineering.co.uk On Mon, Oct 12, 2015, at 03:51 PM, Ed Robbins wrote:> On Fri, Oct 9, 2015, at 05:45 PM, Adve, Vikram Sadanand wrote: > > > Maybe I should have been a bit clearer; we're really interested in full > > > memory and type safety. We want to harden the system against memory > > > corruption vulnerabilities. Process isolation isn't an issue, as we are > > > in an embedded context where we don't have processes. > > > > > > Do you also care about use-after-free and free-after-free errors? > > Handling free-after-free is not too hard, but general use-after-free can > > be expensive, unless you’re willing to spend a fair amount of extra > > memory. If dynamic allocation isn’t an issue for your (embedded) > > applications, that makes this whole problem go away, of course. > > Yes, we do care about these issues. We are happy with the trick used in > the TECS paper - that use after frees are type safe in that they will > either point to the original object or an object of the same type. > > > > > This brings me to the new question: exactly what unsafe features do you > > need to be concerned about? Our first paper in the SAFECode project > > focused on identifying a subset of C for which type and memory safety > > could be enforced without *any* run-time checks or GC. We found that > > array bounds checks were the biggest problem; ignoring these, we can > > ensure the safety of pointer and dynamic memory usage in all the embedded > > benchmarks we tried without any run-time checks. The paper appeared at > > LCTES: > > http://llvm.org/pubs/2003-05-05-LCTES03-CodeSafety.html > > Like the later work, this did not eliminate use-after-free but used APA > > to ensure that any use-after-free errors are “harmless” in that they did > > not violate all the other type- and memory-safety guarantees. > > > > Let me know if you have any questions about this. > > > > We're not sure what features to be concerned about until we actually try > to run it on some of the intended code base. So probably, everything. > But we are happy to spend some effort in refactoring programs if it > means we can get type/memory safety without run time checks. Most of the > C semantics that need to be dropped appear to be things that are quite > bad practice anyway (except perhaps complex array indexing). > > I've been reading through this and a few other publications in order to > try and get a better handle on the code. Having not done anything with > llvm before and not knowing DSA/APA/SafeCode intimately I am a bit out > of my depth so the background reading is great. Also, your code is > really well commented :) > > The main question I have at the moment is what is really broken with APA > in its current state? Or is it just that it's suspected to be broken? > I've managed to build and run it using opt, and it appears to work just > as well as it did for llvm-3.2. I was expecting to have to modify > something to work around the untyped GEPs that John mentioned, but in > fact I haven't had any such issues. I run using: > > path/to/opt -S -load path/to/LLVMDataStructure.so -load > path/to/poolalloc.so -paheur-AllHeapNodes -poolalloc myprog.bc > > myprog.pa.ll > > and get the same (modulo bytecode syntax changes) as with llvm 3.2. I > have also tried some of the other heuristics and have not found any > differences. But I don't think I tried with any programs that need more > than one pool yet (so haven't really exercised it), and I haven't tried > running sc over the output yet either. > > It would be useful if I could get a pointer to some test programs that > really exercise PA. This is next on my todo list (going to check those > referenced in your publications first). > > Thanks for your help Vikram. Best, > Ed