Hello,

I'd like to write a program that performs static analysis of code at the LLVM assembly/bitcode level, and to do so I plan on extensively referencing the language reference. As I hope to eventually use this tool as part of a security analysis of untrusted code, I need to be rather strict in my interpretation of the document. As such, I have some questions about how the implementers interpret the document (each question assumes we're considering a single fixed release version):

1. Is http://www.llvm.org/releases/<version>/docs/LangRef.html the most authoritative reference for a given version aside from the source code itself?

2. Are target-specific behaviors documented for each supported target?

3. Does undefined behavior semantically invalidate the entire program or is its unpredictable effect limited in scope somehow?

4. Are any behaviors undefined by virtue of not being specified in the reference, or are all scenarios that lead to undefined behavior explicitly identified as such?

5. Are there any language features with non-performance related semantic import (e.g. annotations, instructions, intrinsic functions, types, etc.) that are not specified by the reference but are nevertheless implemented in the build system?

6. Are all deviations from the reference, no matter how minor, considered bugs (either in the implementation or the spec)? If not, what deviations are considered acceptable? If so, is it expected that all such discovered and possibly corrected deviations will have associated bug reports, or might some be corrected in the development repository without documentation of the issue outside of a commit message? In other words, if I'm working with, say, llvm 2.9 and want to find all deviations known to upstream, can I just browse bug reports or will I have to go through commit logs as well?

These are the questions I have for now, but I may have more as I go along. Is this the appropriate place to ask this kind of thing?

Thanks,
Shea Levy

On Wed, Oct 19, 2011 at 8:20 PM, Shea Levy <shea at shealevy.com> wrote:
> Hello,
>
> I'd like to write a program that performs static analysis of code at the
> LLVM assembly/bitcode level, and to do so I plan on extensively
> referencing the language reference. As I hope to eventually use this
> tool as part of a security analysis of untrusted code, I need to be
> rather strict in my interpretation of the document. As such, I have some
> questions about how the implementers interpret the document (each
> question assumes we're considering a single fixed release version):
>
> 1. Is http://www.llvm.org/releases/<version>/docs/LangRef.html the most
> authoritative reference for a given version aside from the source code
> itself?

Yes.

> 2. Are target-specific behaviors documented for each supported target?

When anything has target-specific behavior, that fact should be documented. Beyond that, if you have a question about what some construct is supposed to do, please ask.

> 3. Does undefined behavior semantically invalidate the entire program or
> is its unpredictable effect limited in scope somehow?

There is no limit to the scope of undefined behavior.

> 4. Are any behaviors undefined by virtue of not being specified in the
> reference, or are all scenarios that lead to undefined behavior
> explicitly identified as such?

We really want to explicitly identify them all in the reference; if you have a question about some specific case, please ask.

> 5. Are there any language features with non-performance related semantic
> import (e.g. annotations, instructions, intrinsic functions, types, etc.)
> that are not specified by the reference but are nevertheless implemented
> in the build system?

You should be able to analyze the semantics of IR accurately based purely on information encoded into the IR. Every instruction, type, attribute etc. should be documented in LangRef. Platform-specific intrinsics are not documented, but can generally be treated like a call to an external function.

> 6. Are all deviations from the reference, no matter how minor,
> considered bugs (either in the implementation or the spec)? If not, what
> deviations are considered acceptable?

If the reference doesn't describe the implementation accurately, we consider it a bug. Granted, some bugs are relatively low-priority.

> If so, is it expected that all
> such discovered and possibly corrected deviations will have associated
> bug reports, or might some be corrected in the development repository
> without documentation of the issue outside of a commit message? In other
> words, if I'm working with, say, llvm 2.9 and want to find all
> deviations known to upstream, can I just browse bug reports or will I
> have to go through commit logs as well?

LLVM Bugzilla doesn't contain an entry for every bug; to find every fix, you'll have to go through commit logs. Not sure what you're trying to do here, though.

> These are the questions I have for now, but I may have more as I go
> along. Is this the appropriate place to ask this kind of thing?

Yes.

-Eli
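
For readers building a similar analyzer, the advice above about platform-specific intrinsics might be sketched roughly as follows. This is only an illustration, not anything from LLVM itself: the function name classify_callee, the three result labels, and the short prefix list are all hypothetical, and the prefix list is deliberately incomplete.

```python
# Illustrative, incomplete list of target-specific intrinsic name prefixes.
# (Assumption: a real tool would need a prefix for every supported target.)
TARGET_INTRINSIC_PREFIXES = ("llvm.x86.", "llvm.arm.", "llvm.ppc.")


def classify_callee(name, defined_functions):
    """Decide how a hypothetical analyzer should model a call to `name`.

    `name` is the callee without the leading '@'; `defined_functions` is
    the set of function names that have a body in the module under analysis.
    """
    if name.startswith(TARGET_INTRINSIC_PREFIXES):
        # Target-specific intrinsic: no LangRef semantics to rely on, so
        # model it conservatively, like a call to an unknown external function.
        return "opaque-external"
    if name.startswith("llvm."):
        # Target-independent intrinsic: semantics should come from LangRef.
        return "langref-intrinsic"
    if name in defined_functions:
        # The body is available in the module and can be analyzed directly.
        return "analyze-body"
    # Declared but not defined here: an ordinary external call.
    return "opaque-external"
```

Under these assumptions, a call to llvm.x86.mmx.emms and a call to an undefined puts end up in the same conservative bucket, which is the point of the suggestion.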

On 10/19/11 11:58 PM, Eli Friedman wrote:
> On Wed, Oct 19, 2011 at 8:20 PM, Shea Levy<shea at shealevy.com> wrote:
>> 2. Are target-specific behaviors documented for each supported target?
> When anything has target-specific behavior, that fact should be
> documented. Beyond that, if you have a question about what some
> construct is supposed to do, please ask.

What I meant was: for a given target-specific behavior, is there anywhere I can look to see what the behavior specifically is for, say, i686-pc-linux, like you are supposed to be able to for implementation-defined behaviors in C?

>> 5. Are there any language features with non-performance related semantic
>> import (e.g. annotations, instructions, intrinsic functions, types, etc.)
>> that are not specified by the reference but are nevertheless implemented
>> in the build system?
> You should be able to analyze the semantics of IR accurately based
> purely on information encoded into the IR. Every instruction, type,
> attribute etc. should be documented in LangRef. Platform-specific
> intrinsics are not documented, but can generally be treated like a
> call to an external function.

Platform-specific intrinsics are not documented anywhere, or just not in the language reference?

>> If so, is it expected that all
>> such discovered and possibly corrected deviations will have associated
>> bug reports, or might some be corrected in the development repository
>> without documentation of the issue outside of a commit message? In other
>> words, if I'm working with, say, llvm 2.9 and want to find all
>> deviations known to upstream, can I just browse bug reports or will I
>> have to go through commit logs as well?
> LLVM Bugzilla doesn't contain an entry for every bug; to find every
> fix, you'll have to go through commit logs. Not sure what you're
> trying to do here, though.

Some more detail on my project: I'm mostly doing this so I can get introduced to the field of static analysis, learn what its big problems are and what's just impossible with it, etc. To that end, however, I've decided to try to implement a set of checks that might actually be useful, to me at least. In particular, I want to see how many of the run-time checks made in hardware when a CPU is in user mode and memory is segmented can be proven to be unnecessary at compile time.

The (probably impossible) end goals of this project would be a) that every program which passes its checks would be as safe to run in kernel mode with full memory access as it would be in user mode and b) that a not-insignificant subset of well-written programs passes its checks. If I ever reach the point that I'm actually using this thing to run untrusted code in kernel mode, I'll want to know about as many deviations from the spec as possible to know if they might affect the reasoning my program uses.

Thanks for the help,
Shea Levy

On Oct 19, 2011, at 8:58 PM, Eli Friedman wrote:
> On Wed, Oct 19, 2011 at 8:20 PM, Shea Levy <shea at shealevy.com> wrote:
>
>> 4. Are any behaviors undefined by virtue of not being specified in the
>> reference, or are all scenarios that lead to undefined behavior
>> explicitly identified as such?
>
> We really want to explicitly identify them all in the reference; if
> you have a question about some specific case, please ask.

However, there is a ton of stuff that's not explicitly identified today.

For example, consider a call to a function address bitcasted to a type incompatible with the type of the function. Most of us around here intuitively know this gets undefined behavior because we know how to think like a C compiler. But LangRef doesn't discuss this. It doesn't even have a concept of "compatible" types with which to discuss it. What should the rules be?

If we look through LLVM's source code, we find that the inliner has code for smoothing over caller/callee mismatches. However, we can't translate this logic into LangRef because it does things that are impossible to do for non-inlined calls in most backends. If we dig through every backend, we could come up with a minimal set of functionality that could be broadly supported. However, this set would be too minimal for clang, for example, which regularly bitcasts objc_msgSend in ways that it knows will work, but only for non-obvious reasons.

You could spend weeks researching all the nuances of just this problem. In practice, LLVM just doesn't worry about it.

Problems like this tend to be edge cases that don't cause trouble for most people most of the time. However, you can find them all over the place if you go looking.

Dan
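
To make the bitcast example above concrete, here is a rough sketch of how a checker might flag such calls in textual IR. It assumes the typed-pointer assembly syntax of the LLVM 2.9 era, where a call through a cast function address looks like "call i32 bitcast (void (i8*)* @f to i32 (i32)*)(i32 1)". The function name find_mismatched_bitcast_calls is hypothetical, the scan is line-based rather than a real IR parser, and comparing the two type strings for equality is only a stand-in for the notion of "compatible" types that, as noted above, LangRef does not define.

```python
def find_mismatched_bitcast_calls(ir_text):
    """Flag calls whose callee is a bitcast of a global to a different type.

    Returns a list of (line_number, callee, original_type, cast_type) tuples.
    Assumption: typed-pointer textual IR, one instruction per line.
    """
    findings = []
    marker = "bitcast ("
    for lineno, line in enumerate(ir_text.splitlines(), start=1):
        call_pos = line.find("call ")
        if call_pos == -1:
            continue
        cast_pos = line.find(marker, call_pos)
        if cast_pos == -1:
            continue
        # Scan to the parenthesis that closes "bitcast (", tracking nesting
        # because function types contain parentheses of their own.
        start = cast_pos + len(marker)
        depth = 1
        i = start
        while i < len(line) and depth > 0:
            if line[i] == "(":
                depth += 1
            elif line[i] == ")":
                depth -= 1
            i += 1
        if depth != 0:
            continue
        # Only a bitcast in callee position is followed directly by the
        # argument list; a bitcast used as an argument is not.
        if line[i:i + 1] != "(":
            continue
        inner = line[start:i - 1]  # e.g. "void (i8*)* @f to i32 (i32)*"
        source, sep, cast_type = inner.rpartition(" to ")
        if not sep:
            continue
        original_type, at, callee = source.rpartition(" @")
        if not at:
            continue
        if original_type.strip() != cast_type.strip():
            findings.append(
                (lineno, callee.strip(), original_type.strip(), cast_type.strip())
            )
    return findings
```

A tool like the one discussed in this thread would probably have to treat every hit as potentially undefined behavior, since the reference gives no rule for deciding which mismatches are harmless.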