In this email, I argue that LLVM IR is a poor system for building a Platform, by which I mean any system where LLVM IR would be a format in which programs are stored or transmitted for subsequent use on multiple underlying architectures.

LLVM IR initially seems like it would work well here. I myself was once attracted to this idea. I was even motivated to put a bunch of my own personal time into making some of LLVM's optimization passes more robust in the absence of TargetData a while ago, even with no specific project in mind. There are several things still missing, but one could easily imagine that this is just a matter of people writing some more code.

However, there are several ways in which LLVM IR differs from actual platforms, both high-level VMs like Java or .NET and actual low-level ISAs like x86 or ARM.

First, the boundaries of what capabilities LLVM provides are nebulous. LLVM IR contains:

* Explicitly Target-specific features. These aren't secret; x86_fp80's reason for being is pretty clear.

* Target-specific ABI code. In order to interoperate with native C ABIs, LLVM requires front-ends to emit target-specific IR. Pretty much everyone around here has run into this.

* Implicitly Target-specific features. The most obvious examples of these are all the different Linkage kinds. These are all basically just gateways to features in real linkers, and real linkers vary quite a lot. LLVM has its own IR-level Linker, but it doesn't do all the stuff that native linkers do.

* Target-specific limitations in seemingly portable features. How big can the alignment be on an alloca? Or a GlobalVariable? What's the widest supported integer type? LLVM's various backends all have different answers to questions like these.

Even ignoring the fact that the quality of the backends in the LLVM source tree varies widely, the question of "What can LLVM IR do?" has numerous backend-specific facets. This can be problematic for producers as well as consumers.
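To make the ABI point concrete, here is roughly what a front-end must emit for one small C function on two different targets. This is a sketch from memory, shown in 2011-era IR syntax; the exact coercions vary by clang version and OS, and the two definitions are alternative per-target lowerings, not one module:

```llvm
; C source:
;   struct S { float x, y; };
;   struct S f(struct S a) { return a; }

; x86-64 System V: clang coerces the struct into a vector type.
define <2 x float> @f(<2 x float> %a) nounwind {
entry:
  ret <2 x float> %a
}

; i386: the same struct comes in "byval" on the stack and is returned
; through a hidden sret pointer -- a completely different signature.
%struct.S = type { float, float }
define void @f.i386(%struct.S* noalias sret %agg.result, %struct.S* byval %a) nounwind {
entry:
  %val = load %struct.S* %a
  store %struct.S %val, %struct.S* %agg.result
  ret void
}
```

A front-end that wants portable IR cannot simply say "@f takes an S and returns an S"; it has to know which of these shapes the target expects.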
Second, and more fundamentally, LLVM IR is a vague language. It has:

* Undefined Behavior. LLVM is, at its heart, a C compiler, and Undefined Behavior is one of its cornerstones.

  High-level VMs typically raise predictable exceptions when they encounter program errors. Physical machines typically document their behavior very extensively. LLVM is fundamentally different from both: it presents a bunch of rules to follow and then offers no description of what happens if you break them.

  LLVM's optimizers are built on the assumption that the rules are never broken, so when rules do get broken, the code just goes off the rails and runs into whatever happens to be in the way. Sometimes it crashes loudly. Sometimes it silently corrupts data and keeps running.

  There are some tools that can help locate violations of the rules. Valgrind is a very useful tool. But they can't find everything. There are even some kinds of undefined behavior for which I've never heard anyone even propose a method of detection.

* Intentional vagueness. There is a strong preference for defining LLVM IR semantics intuitively rather than formally. This is quite practical; formalizing a language is a lot of work, it reduces future flexibility, and it tends to draw attention to troublesome edge cases which could otherwise be largely ignored.

  I've done work to try to formalize parts of LLVM IR, and the results have been largely fruitless. I got bogged down in edge cases that no one is interested in fixing.

* Inconsistent floating-point arithmetic. Some backends don't fully implement IEEE-754 arithmetic rules even without -ffast-math and friends, in order to get better performance.

  If you're familiar with "write once, debug everywhere" in Java, consider the situation in LLVM IR, which is fundamentally opposed to even trying to provide that level of consistency.
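A minimal illustration of the "rules are never broken" assumption, in 2011-era IR syntax (a sketch; the fold shown is one the optimizer is free to make, not required to):

```llvm
; "nsw" (no signed wrap) asserts that this add cannot overflow.
define i1 @gt(i32 %x) {
entry:
  %y = add nsw i32 %x, 1      ; undefined behavior if %x == INT_MAX
  %c = icmp sgt i32 %y, %x    ; the optimizer may fold this to "true",
  ret i1 %c                   ; reasoning from the nsw guarantee
}
```

A high-level VM would define wraparound or throw an exception; a physical ISA documents the wrapped result; LLVM IR simply says nothing about what a rule-breaking program does.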
And if you allow the optimizer to do subtarget-specific optimizations, you increase the chances that some bit of undefined behavior or vagueness will be exposed.

Third, LLVM is a low-level system that doesn't represent high-level abstractions natively. It forces them to be chopped up into lots of small low-level instructions.

* It makes LLVM's Interpreter really slow. The amount of work performed by each instruction is relatively small, so the interpreter has to execute a relatively large number of instructions to do simple tasks, such as virtual method calls. Languages built for interpretation do more with fewer instructions, and have lower per-instruction overhead.

* Similarly, it makes really fast JITing hard. LLVM is fast compared to some other static C compilers, but it's not fast compared to real JIT compilers. Compiling one LLVM IR instruction at a time can be relatively simple, ignoring the weird stuff, but this approach generates comically bad code. Fixing this requires recognizing patterns in groups of instructions, and then emitting code for the patterns. This works, but it's more involved.

* Lowering high-level language features into low-level code locks in implementation details. This is less severe in native code, because a compiled blob is limited to a single hardware platform as well. But a platform which advertises architecture independence while still having all the ABI lock-in of high-level-language implementation details presents a much more frightening backwards-compatibility specter.

* Apple has some LLVM IR transformations for Objective-C; however, the transformations have to reverse-engineer the high-level semantics out of the lowered code, which is awkward. Further, they're reasoning about high-level semantics in a way that isn't guaranteed to be safe by LLVM IR rules alone. It works for the kinds of code clang generates for Objective-C, but it wouldn't necessarily be correct if run on code produced by other front-ends.
LLVM IR isn't capable of representing the necessary semantics for this unless we start embedding Objective-C into it.

In conclusion, consider the task of writing an independent implementation of an LLVM IR Platform. The set of capabilities it provides depends on who you talk to. Semantic details are left to chance. There are rarely used features which require a bunch of complicated infrastructure to implement. And if you want light-weight execution, you'll probably need to translate it into something else better suited for it first. This all doesn't sound very appealing.

LLVM isn't actually a virtual machine. It's widely acknowledged that the name "LLVM" is a historical artifact which doesn't reliably connote what LLVM actually grew to be. LLVM IR is a compiler IR.

Dan
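To put the Interpreter/JIT point above in concrete terms: a single virtual method call, one bytecode in a JVM-style machine, lowers to a chain of IR instructions that an interpreter must dispatch one at a time. This is illustrative 2011-era syntax; the vtable layout and names are a made-up example, not any particular front-end's:

```llvm
%class.Shape = type opaque

define i32 @call_area(%class.Shape* %obj) {
entry:
  ; load the vtable pointer from the object header
  %vtpp = bitcast %class.Shape* %obj to i32 (%class.Shape*)***
  %vtable = load i32 (%class.Shape*)*** %vtpp
  ; index to the method's slot and load the function pointer
  %slot = getelementptr i32 (%class.Shape*)** %vtable, i64 2
  %fn = load i32 (%class.Shape*)** %slot
  ; finally, the indirect call itself
  %r = call i32 %fn(%class.Shape* %obj)
  ret i32 %r
}
```

Five instructions where an interpreted bytecode would have one, and a fast JIT has to re-discover the pattern to emit decent code.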
Thank you for writing this. First, I'd like to say that I am in 100% agreement with your points. I've been tempted many times to write something similar, although what you've written is articulated much better than what I would have said. When I try to explain to people what LLVM is, I say "It's essentially the back-end of a compiler" - a job it does extremely well. I don't say "It's a virtual machine", because that is a job it doesn't do very well at all.

I'd like to add a couple of additional items to your list. First, LLVM IR isn't stable, and it isn't backwards compatible. Bitcode is not useful as an archival format, because a bitcode file cannot be loaded if it's even a few months out of sync with the code that loads it. Loading a bitcode file that is years old is hopeless. Also, bitcode is large compared to Java or CLR bytecodes. This isn't such a big deal, but for people who want to ship code over the network it could be an issue.

I've been thinking that it would be a worthwhile project to develop a high-level IR that avoids many of the issues that you raise. Similar in concept to Java bytecode, but without Java's limitations - for example, it would support pass-by-value types. (CLR has this, but it also has limitations.) Of course, this IR would of necessity be less flexible than LLVM IR, but you could always dip into LLVM IR where needed, just as C programs dip into assembly on occasion.

This hypothetical IR language would include a type system rich enough to express all of the DWARF semantics - so that instead of having two parallel representations of every type (one for LLVM's code generators and one for DWARF), you could generate both the LLVM types and the DWARF DIs from a common representation. This would be a huge savings in both complexity and the size of bitcode files.
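The duplication Talin describes is easy to see in a single IR file: the same struct is spelled out once for codegen and again, field by field, for the debug-info emitter. This is schematic; the real metadata nodes are tagged numeric records, so their exact fields are elided here:

```llvm
; Representation #1: the IR type the code generators use.
%struct.Point = type { i32, i32 }

; Representation #2: debug metadata describing the same struct for DWARF.
!0 = metadata !{...}   ; DW_TAG_structure_type, name "Point", size 64
!1 = metadata !{...}   ; DW_TAG_member, name "x", type i32, offset 0
!2 = metadata !{...}   ; DW_TAG_member, name "y", type i32, offset 32
```

A richer type system could, in principle, be the single source from which both representations are generated.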
On Tue, Oct 4, 2011 at 11:53 AM, Dan Gohman <gohman at apple.com> wrote:
> [...]

--
-- Talin

_______________________________________________
LLVM Developers mailing list
LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
Interestingly, I wrote a bytecode language exactly like this for my master's thesis, built atop LLVM. I abandoned the project after graduating, but it had its promising moments.

________________________________________
From: llvmdev-bounces at cs.uiuc.edu [llvmdev-bounces at cs.uiuc.edu] On Behalf Of Talin [viridia at gmail.com]
Sent: 04 October 2011 21:23
To: Dan Gohman
Cc: llvmdev at cs.uiuc.edu Mailing List
Subject: Re: [LLVMdev] LLVM IR is a compiler IR

> [...]
Hi Talin,

I too agree 100% with Dan's words, and this could be a good pointer for Jin-Gu Kang to continue his pursuit of a better target-independent bitcode.

Also, add your backwards-compatibility issue to debug metadata in IR, in which fields appear or disappear without notice.

But I think you hit a sweet spot here...

On 4 October 2011 21:23, Talin <viridia at gmail.com> wrote:
> This hypothetical IR language would include a type system that was rich
> enough to express all of the DWARF semantics - so that instead of having two
> parallel representations of every type (one for LLVM's code generators and
> one for DWARF), you could instead generate both the LLVM types and the DWARF
> DI's from a common representation. This would have a huge savings in both
> complexity and the size of bitcode files.

This is a really interesting idea. If you could describe your type system in terms of DWARF, you would have both: a rich type system AND free DWARF. However, writing a back-end that would understand such a rich type system AND language ABIs is out of the question.

We were discussing JIT and trying to come to a solution where the JIT wouldn't be as heavy as it has to be now, to no avail. Unless there is a language of a higher level (like Java bytecode), the JIT will always suffer.

If you join Dan's well-said points with yours, Jin-Gu's, and the necessity of a decent JIT, it's almost reason enough to split the IR into higher and lower versions (as proposed last year to deal with complex type systems and ABIs). Even some optimisations (maybe even Polly) could benefit from this higher-level representation, and all current optimisations could still run on the current, low-level IR.

My tuppence.

cheers,
--renato
On Oct 4, 2011, at 11:53 AM, Dan Gohman wrote:
> In this email, I argue that LLVM IR is a poor system for building a
> Platform, by which I mean any system where LLVM IR would be a
> format in which programs are stored or transmitted for subsequent
> use on multiple underlying architectures.

Hi Dan,

I agree with almost all of the points you make, but not your conclusion. Many of the issues that you point out as problems are actually "features" that a VM like Java doesn't provide. For example, Java doesn't have uninitialized variables on the stack, and LLVM does. LLVM is capable of expressing the zero initialization of variables that is implicit in Java; it just leaves the choice to the frontend.

Many of the other issues that you raise are true, but irrelevant when compared to other VMs. For example, LLVM allows a frontend to produce code that is ABI-compatible with native C ABIs. It does this by requiring the frontend to know a lot about the native C ABI. Java doesn't permit this at all, so LLVM having "this feature" seems like a feature over and above what high-level VMs provide. Similarly, the "conditionally" supported features like large and obscurely sized integers simply don't exist in these VMs.

The one key feature that LLVM doesn't have that Java does, and which cannot be added to LLVM "through a small matter of implementation", is verifiable safety. Java-style bytecode verification is not something LLVM IR permits; you can't really do it in LLVM (without resorting to techniques like SFI).

With all that said, I do think that we have a real issue here. The real issue is that we have people struggling to do things that are "hard" and see LLVM as the problem. For example:

1. The Native Client folks trying to use LLVM IR as a portable representation that abstracts arbitrary C calling conventions. This doesn't work because the frontend has to know the C calling conventions of the target.

2. The OpenCL folks trying to turn LLVM into a portable abstraction language by introducing endianness abstractions. This is hard because C is inherently a non-portable language, and this is only scratching the surface of the issues. To really fix this, OpenCL would have to be subset substantially, like the EFI C dialect.

> LLVM isn't actually a virtual machine. It's widely acknowledged that the
> name "LLVM" is a historical artifact which doesn't reliably connote what
> LLVM actually grew to be. LLVM IR is a compiler IR.

It sounds like you're picking a very specific definition of what a VM is. LLVM certainly isn't a high-level virtual machine like Java, but that's exactly the feature that makes it a practical target for C-family languages. It isn't LLVM's fault that people want LLVM to magically solve all of C's portability problems.

-Chris
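Chris's uninitialized-variables point can be shown directly in IR; whether the store below exists is purely a front-end decision (2011-era syntax; the function names are illustrative):

```llvm
define i32 @local() {
entry:
  %x = alloca i32
  ; A Java-like front-end would emit:  store i32 0, i32* %x
  ; A C front-end emits nothing, and the load below yields undef --
  ; not zero, and not a trap.
  %v = load i32* %x
  ret i32 %v
}
```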
On 5 October 2011 00:19, Chris Lattner <clattner at apple.com> wrote:
> 1. The native client folks trying to use LLVM IR as a portable representation that abstracts arbitrary C calling conventions. This doesn't work because the frontend has to know the C calling conventions of the target.
(...)
> 2. The OpenCL folks trying to turn LLVM into a portable abstraction language by introducing endianness abstractions. This is hard because C is inherently a non-portable language, and this is only scratching the surface of the issues. To really fix this, OpenCL would have to be subset substantially, like the EFI C dialect.
(...)
> It sounds like you're picking a very specific definition of what a VM is. LLVM certainly isn't a high level virtual machine like Java, but that's exactly the feature that makes it a practical target for C-family languages. It isn't LLVM's fault that people want LLVM to magically solve all of C's portability problems.

Chris,

This is a very simplistic point of view, and TBH, I'm a bit shocked.

Having a "nicer codebase" and a "friendlier community" are two strong points for LLVM against GCC, but they're too weak to migrate people from GCC to LLVM. JIT, "the Native Client folks", and "the OpenCL folks" are showing how powerful LLVM could be, if it were a bit more accommodating. Without those troublesome folks, LLVM is just another compiler, like GCC, and being completely blunt, it's no better. The infrastructure for adding new passes is better, but the number and quality of passes is not. It's way easier to create new back-ends, but in the number and quality of existing ones, again, no better. The "good code" is suffering from a diverse community, a large codebase, and companies' interests, which is not a good forecast for code quality. It's not just the IR that has a lot of kludge: the back-ends, front-ends, DWARF emitter, exception handling, etc., although some are nicer than GCC's, are neither complete nor accurate.

If you want to bet on a "fun community" to drive LLVM, I don't think you'll get too far. And if you want to discard the OpenCL, JIT, and NativeClient-style communities, well, there won't be much of a community left to be any fun... If you want to win the code-quality battle while working for a big company, good luck. Part of the GCC community's grunts are towards companies trying to push selfish code in, and well, their reasons are not all without merit.

I didn't see people looking for a magic wand in these discussions so far...

--
cheers,
--renato
http://systemcall.org/
On Tue, Oct 4, 2011 at 4:19 PM, Chris Lattner <clattner at apple.com> wrote:

> On Oct 4, 2011, at 11:53 AM, Dan Gohman wrote:
>> In this email, I argue that LLVM IR is a poor system for building a Platform, by which I mean any system where LLVM IR would be a format in which programs are stored or transmitted for subsequent use on multiple underlying architectures.
>
> Hi Dan,
>
> I agree with almost all of the points you make, but not your conclusion. Many of the issues that you point out as problems are actually "features" that a VM like Java doesn't provide. For example, Java doesn't have uninitialized variables on the stack, and LLVM does. LLVM is capable of expressing the implicit zero initialization of variables that is implicit in Java; it just leaves the choice to the frontend.
>
> Many of the other issues that you raise are true, but irrelevant when compared to other VMs. For example, LLVM allows a frontend to produce code that is ABI compatible with native C ABIs. It does this by requiring the frontend to know a lot about the native C ABI. Java doesn't permit this at all, so LLVM having "this feature" seems like a feature over and above what high-level VMs provide. Similarly, the "conditionally" supported features like large and obscurely sized integers simply don't exist in these VMs.
>
> The one key feature that LLVM doesn't have that Java does, and which cannot be added to LLVM "through a small matter of implementation", is verifiable safety. Java bytecode verification is not something that LLVM IR permits; you can't really do it in LLVM (without resorting to techniques like SFI).
>
> With all that said, I do think that we have a real issue here. The real issue is that we have people struggling to do things that are "hard" and see LLVM as the problem. For example:
>
> 1. The native client folks trying to use LLVM IR as a portable representation that abstracts arbitrary C calling conventions.
> This doesn't work because the frontend has to know the C calling conventions of the target.
>
> 2. The OpenCL folks trying to turn LLVM into a portable abstraction language by introducing endianness abstractions. This is hard because C is inherently a non-portable language, and this is only scratching the surface of the issues. To really fix this, OpenCL would have to be subset substantially, like the EFI C dialect.
>
>> LLVM isn't actually a virtual machine. It's widely acknowledged that the name "LLVM" is a historical artifact which doesn't reliably connote what LLVM actually grew to be. LLVM IR is a compiler IR.
>
> It sounds like you're picking a very specific definition of what a VM is. LLVM certainly isn't a high level virtual machine like Java, but that's exactly the feature that makes it a practical target for C-family languages. It isn't LLVM's fault that people want LLVM to magically solve all of C's portability problems.

I understand that the official goals of the LLVM project are carefully limited. A large number of LLVM users are perfectly happy to live within the envelope of what LLVM provides. At the same time, there are also a fair number of users who are aiming for things that appear to be just outside that envelope. These "near miss" users are looking at Java, at CLR, and constantly asking themselves "did I make the right decision betting on LLVM rather than these other platforms?" Unfortunately, there are frustratingly few choices available in this space, and LLVM happens to be "nearest" conceptually to what these users want to accomplish. But bridging the gap between where they want to go and where LLVM is headed is often quite a challenge, one measured in multiple man-years of effort.
-Chris

> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev

-- Talin
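Chris's point that zero initialization is a frontend choice rather than an IR property can be sketched directly in LLVM IR. This is a hand-written illustration using circa-2011 typed-pointer syntax; neither function is taken from any real frontend's output:

```llvm
define i32 @c_style() {
entry:
  ; A C-like frontend leaves the slot uninitialized: the load below
  ; yields an undefined value.
  %x = alloca i32
  %v = load i32* %x
  ret i32 %v
}

define i32 @java_style() {
entry:
  ; A Java-like frontend emits the implicit zero initialization itself,
  ; so the load is well defined and always yields 0.
  %x = alloca i32
  store i32 0, i32* %x
  %v = load i32* %x
  ret i32 %v
}
```

The IR gives both options equal standing; safety guarantees like Java's have to be manufactured by the producer, not assumed from the format.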
Dan Gohman <gohman at apple.com> writes:

Great post, Dan. Some comments follow.

[snip]
> * Target-specific ABI code. In order to interoperate with native C ABIs, LLVM requires front-ends to emit target-specific IR. Pretty much everyone around here has run into this.

There are places where compatibility with the native C ABI is taken too far. For instance, some time ago I noted that what the user sets through Module::setDataLayout is simply ignored. LLVM uses the data layout required by the native C ABI, which is hardcoded into LLVM's source code. So I asked: pass the value set by Module::setDataLayout to the layers that are interested in it, as any user would expect. The response I got was, in essence, "As you are not working on C/C++, I couldn't care less about your language's requirements." So I have a two-line patch in my local LLVM copy, which has the effect of making the IR code generated by my compiler portable across Linux/x86 and Windows/x86 (although that was not the reason I wanted the change).

So it is true that LLVM IR has portability limitations, but not all of them are intrinsic to the nature of LLVM IR.

[snip]
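Óscar's portability complaint is easiest to see in the module-level datalayout string itself. The string below is an approximate example for x86-32 Linux; the exact field values are quoted from memory and should be treated as illustrative, not authoritative:

```llvm
; Little-endian, 32-bit pointers, i64 ABI-aligned to 32 bits but
; preferred-aligned to 64, and x86 long double (f80) ABI-aligned to
; 32 bits. On Darwin the f80 entry would read f80:128:128 instead -
; same architecture, different "native C ABI" layout.
target datalayout = "e-p:32:32:32-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:32:64-f32:32:32-f64:32:64-f80:32:32-n8:16:32"
```

If, as described above, the code generator hardcodes these values rather than reading them from the module, then the string in the IR is documentation rather than configuration, and a small patch to honor it is what changes the portability story.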
Hi Talin,

> I'd like to add a couple of additional items to your list - first, LLVM IR isn't stable, and it isn't backwards compatible. Bitcode is not useful as an archival format, because a bitcode file cannot be loaded if it's even a few months out of sync with the code that loads it. Loading a bitcode file that is years old is hopeless.

That sounds like a bug, assuming the bitcode was produced by released versions of LLVM (bitcode produced with some intermediate development version of LLVM may or may not be loadable in the final release). Maybe you should open some bug reports?

Ciao, Duncan.
Hi Óscar,

> There are places where compatibility with the native C ABI is taken too far. For instance, some time ago I noted that what the user sets through Module::setDataLayout is simply ignored.

It's not ignored; it's used by the IR-level optimizers. That way these optimizers can know stuff about the target without having to be linked to a target backend.

> LLVM uses the data layout required by the native C ABI, which is hardcoded into LLVM's source code. So I asked: pass the value set by Module::setDataLayout to the layers that are interested in it, as any user would expect.

There are two classes of information in datalayout: things which correspond to stuff hard-wired into the target processor (for example, that x86 is little endian), and stuff which is not hard-wired in (for example, the alignment of x86 long double, which is 4 or 8 bytes on x86-32 depending on whether you are on linux, darwin or windows). Hoping to have code generators override the hard-wired stuff if they see something different in the data layout is just too much to ask for - e.g. the x86 code generators are never going to produce big endian code just because you set big-endianness in the datalayout. Even the second class of "soft" parameters is not completely flexible: for example, most processors enforce a minimum alignment for types, and trying to reduce it by giving types a lesser alignment in the datalayout just isn't going to work. So given that the ways in which codegen could adapt to various datalayout settings are quite limited and constrained by the target, does it really make sense to try to parametrize the code generators by the datalayout at all? In any case, it might be good if the code generators produced a warning when they see that the datalayout string doesn't correspond to what codegen thinks it should be (I thought someone added that already?).

Ciao, Duncan.
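Duncan's two classes of datalayout information can be annotated onto a deliberately abbreviated, illustrative string; the hard/soft split below is a rough reading of his argument, not an official classification:

```llvm
; "Hard" fields, fixed by the silicon - a code generator will never
; honor a change here (x86 will not become big-endian):
;     e              little endian
;     p:32:32:32     32-bit pointers
; "Soft" fields, fixed by OS/ABI convention rather than the silicon:
;     f80:32:32      x86 long double aligned to 4 bytes (linux/x86-32;
;                    darwin would use f80:128:128)
target datalayout = "e-p:32:32:32-f80:32:32"
```

Even the soft fields are only adjustable within limits the target imposes, which is the point about minimum alignments above.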
Hi Dan,

I read five distinct requests in your well-written remarks, which may appeal to different people:

1. How can we make LLVM more portable? As Chris later pointed out, it's hard to achieve this goal on the input side while preserving C semantics, since even C source code doesn't really have that property. On the platform front, recent discussions about "non-standard" architectures highlighted that most of the LLVM effort really revolves around x86 and ARM, and platforms that deviate from these reference points tend to be afterthoughts.

2. How can we make LLVM more stable over time? As a regular user of LLVM, I initially found the frequent changes in LLVM painful. On the other hand, that effort is not a high price if it keeps the code base fluid. It wouldn't hurt to take an approach like OpenGL, where new stuff is tested through a shared "extensions" mechanism and deprecation of old interfaces spans years. "It no longer works" is a message we see a little too often on LLVM-dev.

3. How can we clarify the specification of LLVM? In the good old Unix tradition, the source code is the documentation, and the "documentation" explains the bugs and gives simplistic examples. But standard-level specification is really hard and tends to spend an inordinate amount of time on corner cases ordinary folks don't care about. To wit: the C++ and C++ ABI standardization efforts. LLVM has the luxury of being able to just assert in the corner cases and deal with them on demand.

4. How can we address minority needs in LLVM? Being a minority here, I can only second that. I'd say that LLVM has to keep its priorities right. As someone else pointed out, one reason to pick LLVM is because it gives me interoperability with C. I'm not willing to give that up, and that means I have to learn a little bit of the C non-portable way of doing things. That being said, minorities are also the guys keeping you on your toes.

5.
How can we avoid selfish kludges and self-imposed limitations in the LLVM source code base? Probably the most immediately actionable point. IMO, things tend to go in the right direction, at least in my experience. But it's always easy to lapse.

Overall, I see these not so much as technical or architectural issues. Rather, I'd say that LLVM is very "market driven", i.e. the largest communities (C and x86) tend to grab all the attention. Still, it has reached a level of maturity where even smaller teams like ours can benefit from the crumbs.

That being said, can we build a portable LLVM IR on top of the existing stuff without giving up C compatibility? I'm not sure. I would settle for a few sub-goals that may be more easily achieved, e.g. defining a subset of the IR that is exactly as portable as C, or ensuring that object layout settings default to the target but can effectively be overridden in a meaningful way (think: C++ ABI inheritance rules, HP28/HP48 internal object layout, ...).

My two bytes,
Christophe

On 4 oct. 2011, at 20:53, Dan Gohman wrote:

> In this email, I argue that LLVM IR is a poor system for building a Platform, by which I mean any system where LLVM IR would be a format in which programs are stored or transmitted for subsequent use on multiple underlying architectures.
>
> LLVM IR initially seems like it would work well here. I myself was once attracted to this idea. I was even motivated to put a bunch of my own personal time into making some of LLVM's optimization passes more robust in the absence of TargetData a while ago, even with no specific project in mind. There are several things still missing, but one could easily imagine that this is just a matter of people writing some more code.
>
> However, there are several ways in which LLVM IR differs from actual platforms, both high-level VMs like Java or .NET and actual low-level ISAs like x86 or ARM.
>
> First, the boundaries of what capabilities LLVM provides are nebulous.
> LLVM IR contains:
>
> * Explicitly Target-specific features. These aren't secret; x86_fp80's reason for being is pretty clear.
>
> * Target-specific ABI code. In order to interoperate with native C ABIs, LLVM requires front-ends to emit target-specific IR. Pretty much everyone around here has run into this.
>
> * Implicitly Target-specific features. The most obvious examples of these are all the different Linkage kinds. These are all basically just gateways to features in real linkers, and real linkers vary quite a lot. LLVM has its own IR-level Linker, but it doesn't do all the stuff that native linkers do.
>
> * Target-specific limitations in seemingly portable features. How big can the alignment be on an alloca? Or a GlobalVariable? What's the widest supported integer type? LLVM's various backends all have different answers to questions like these.
>
> Even ignoring the fact that the quality of the backends in the LLVM source tree varies widely, the question of "What can LLVM IR do?" has numerous backend-specific facets. This can be problematic for producers as well as consumers.
>
> Second, and more fundamentally, LLVM IR is a fundamentally vague language. It has:
>
> * Undefined Behavior. LLVM is, at its heart, a C compiler, and Undefined Behavior is one of its cornerstones.
>
> High-level VMs typically raise predictable exceptions when they encounter program errors. Physical machines typically document their behavior very extensively. LLVM is fundamentally different from both: it presents a bunch of rules to follow and then offers no description of what happens if you break them.
>
> LLVM's optimizers are built on the assumption that the rules are never broken, so when rules do get broken, the code just goes off the rails and runs into whatever happens to be in the way. Sometimes it crashes loudly. Sometimes it silently corrupts data and keeps running.
>
> There are some tools that can help locate violations of the rules. Valgrind is a very useful tool. But they can't find everything. There are even some kinds of undefined behavior that I've never heard anyone even propose a method of detection for.
>
> * Intentional vagueness. There is a strong preference for defining LLVM IR semantics intuitively rather than formally. This is quite practical; formalizing a language is a lot of work, it reduces future flexibility, and it tends to draw attention to troublesome edge cases which could otherwise be largely ignored.
>
> I've done work to try to formalize parts of LLVM IR, and the results have been largely fruitless. I got bogged down in edge cases that no one is interested in fixing.
>
> * Floating-point arithmetic is not always consistent. Some backends don't fully implement IEEE-754 arithmetic rules even without -ffast-math and friends, to get better performance.
>
> If you're familiar with "write once, debug everywhere" in Java, consider the situation in LLVM IR, which is fundamentally opposed to even trying to provide that level of consistency. And if you allow the optimizer to do subtarget-specific optimizations, you increase the chances that some bit of undefined behavior or vagueness will be exposed.
>
> Third, LLVM is a low level system that doesn't represent high-level abstractions natively. It forces them to be chopped up into lots of small low-level instructions.
>
> * It makes LLVM's Interpreter really slow. The amount of work performed by each instruction is relatively small, so the interpreter has to execute a relatively large number of instructions to do simple tasks, such as virtual method calls. Languages built for interpretation do more with fewer instructions, and have lower per-instruction overhead.
>
> * Similarly, it makes really-fast JITing hard.
> LLVM is fast compared to some other static C compilers, but it's not fast compared to real JIT compilers. Compiling one LLVM IR level instruction at a time can be relatively simple, ignoring the weird stuff, but this approach generates comically bad code. Fixing this requires recognizing patterns in groups of instructions, and then emitting code for the patterns. This works, but it's more involved.
>
> * Lowering high-level language features into low-level code locks in implementation details. This is less severe in native code, because a compiled blob is limited to a single hardware platform as well. But a platform which advertises architecture independence but still has all the ABI lock-in of HLL implementation details presents a much more frightening backwards-compatibility specter.
>
> * Apple has some LLVM IR transformations for Objective-C; however, the transformations have to reverse-engineer the high-level semantics out of the lowered code, which is awkward. Further, they're reasoning about high-level semantics in a way that isn't guaranteed to be safe by LLVM IR rules alone. It works for the kinds of code clang generates for Objective-C, but it wouldn't necessarily be correct if run on code produced by other front-ends. LLVM IR isn't capable of representing the necessary semantics for this unless we start embedding Objective-C into it.
>
> In conclusion, consider the task of writing an independent implementation of an LLVM IR Platform. The set of capabilities it provides depends on who you talk to. Semantic details are left to chance. There are features which require a bunch of complicated infrastructure to implement but which are rarely used. And if you want light-weight execution, you'll probably need to translate it into something else better suited for it first. This all doesn't sound very appealing.
>
> LLVM isn't actually a virtual machine.
> It's widely acknowledged that the name "LLVM" is a historical artifact which doesn't reliably connote what LLVM actually grew to be. LLVM IR is a compiler IR.
>
> Dan
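Dan's "rules with no description of what happens when you break them" point can be made concrete with a tiny hand-written LLVM IR example (an illustration, not taken from his mail):

```llvm
; The nsw ("no signed wrap") flag is a promise by the producer that the
; addition never overflows. The language reference says the result is
; undefined if it does - and says nothing about what the *program* then
; does; optimizers are free to assume the overflow never happens.
define i32 @gt_after_inc(i32 %x) {
entry:
  %sum = add nsw i32 %x, 1         ; undefined result when %x == INT32_MAX
  %cmp = icmp sgt i32 %sum, %x     ; an optimizer may fold this to true
  %res = zext i1 %cmp to i32
  ret i32 %res
}
```

A high-level VM would define the overflow case (wrap, or throw); a hardware ISA would document it; LLVM IR deliberately does neither, which is exactly what makes it a compiler IR rather than a platform.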
Dan Gohman <gohman at apple.com> writes:

> In this email, I argue that LLVM IR is a poor system for building a Platform, by which I mean any system where LLVM IR would be a format in which programs are stored or transmitted for subsequent use on multiple underlying architectures.

I agree with all of this. But... so what? :) It's a compiler IR. That's not a bad thing. Do you want to propose some changes to make LLVM IR more target independent? I'm sure it would be welcome, but these are not easy problems to solve, as you know. :)

-Dave