Chris Lattner wrote:
> On Sun, 16 Nov 2003, Reid Spencer wrote:
>
>> On Sun, 2003-11-16 at 11:17, Chris Lattner wrote:
>>
>>> No, it's all or nothing. Once linked, they cannot be separated (easily).
>>> However, especially when using the JIT, there is little overhead for
>>> running a gigantic program that only has 1% of the functions in it ever
>>> executed...
>>
>> Perhaps in the general case, but what if it's running on an embedded
>> system and the "gigantic program" causes an out-of-memory condition?
>
> The JIT doesn't even load unreferenced functions from the disk, so this
> shouldn't be the case... (thanks to Misha for implementing this :)
>
> Also, the globaldce pass deletes functions which can never be called by
> the program, so large hunks of libraries get summarily removed from the
> program after static linking.
>
>>> There are multiple different ways to approach these questions depending
>>> on what we want to do and what the priorities are. There are several
>>> good solutions, but for now, everything needs to be statically linked.
>>> I expect this to change over the next month or so.
>>
>> When you have time, I'd like to hear what you're planning in this area,
>> as it will directly affect how I build my compiler and VM.
>
> What do you need, and what would you like? At this point there are
> several solutions that make sense, but they have to be balanced against
> practical issues. For example, say we do IPO across the package, and then
> one of the members gets updated. How do we know to invalidate the
> results?
>
> As I think I have mentioned before, one long-term way of implementing
> this is to attach analysis results to bytecode files as well as the code.
> Thus, you could compile libc, say, with LLVM to a "shared object"
> bytecode file. While doing this, the optimizer could notice that "strlen"
> has no side effects, for example, and attach that information to the
> bytecode file.
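[Editor's note] Chris's strlen example can be made concrete. Below is a sketch in C (function names are illustrative, not from the thread) of the transformation LICM could perform once analysis has recorded that strlen is side-effect-free and its argument is not modified in the loop:

```c
#include <assert.h>
#include <string.h>

/* Before: strlen(s) sits in the loop condition. Without knowing that
 * strlen has no side effects, the compiler must re-evaluate it on
 * every iteration. */
int count_spaces_naive(const char *s) {
    int n = 0;
    for (size_t i = 0; i < strlen(s); i++)   /* strlen called each trip */
        if (s[i] == ' ')
            n++;
    return n;
}

/* After: once the analysis annotation says strlen is side-effect-free
 * and s is loop-invariant, LICM can hoist the call out of the loop. */
int count_spaces_hoisted(const char *s) {
    int n = 0;
    size_t len = strlen(s);                  /* hoisted: computed once */
    for (size_t i = 0; i < len; i++)
        if (s[i] == ' ')
            n++;
    return n;
}
```

The point of attaching the analysis result to the bytecode is that the second form can be produced even when strlen's body was never linked into the program.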
While on the subject of annotating bytecode with analysis info, could I
entice someone to also think about carrying other types of source-level
annotations through into bytecode? This is particularly useful for
situations where one wants to use the LLVM infrastructure for its
whole-program optimization capabilities, but wouldn't want to give up on
the ability to debug the final product binary. At the moment, my
understanding is that source-code annotations like file names, line
numbers, etc. aren't carried through. When one gets around to linking the
whole program, you end up with a single .s file of native machine code
(which by now is a giant collection of bits picked up from a multitude of
source files) with no ability to do symbolic debugging on the resulting
binary...

> When linking a program that uses libc, the linker wouldn't pull in any
> function bodies from "shared objects", but would read the analysis
> results and attach them to the function prototypes in the program. This
> would allow the LICM optimizer to hoist strlen calls out of loops when
> it makes sense, for example.
>
> Of course there are situations when it is better to actually link the
> function bodies into the program too. In the strlen example, it might be
> the case that the program will go faster if strlen is inlined into a
> particular call site.
>
> I'm inclined to start simple and work our way up to these cases, but if
> you have certain usage patterns in mind, I would love to hear them, and
> we can hash out what will really get implemented...
>
> -Chris
> While on the subject of annotating bytecode with analysis info, could I
> entice someone to also think about carrying other types of source-level
> annotations through into bytecode? This is particularly useful for
> situations where one wants to use the LLVM infrastructure for its
> whole-program optimization capabilities, but wouldn't want to give up
> on the ability to debug the final product binary. At the moment, my
> understanding is that source-code annotations like file names, line
> numbers, etc. aren't carried through. When one gets around to linking
> the whole program, you end up with a single .s file of native machine
> code (which by now is a giant collection of bits picked up from a
> multitude of source files) with no ability to do symbolic debugging on
> the resulting binary...

Yes, this is very true. This is on my medium-term todo list. LLVM will
definitely support this; it's just that we want to do it right, and we
are focusing on other issues at the moment (like performance).

At the moment, the best way to debug LLVM-compiled code is to use the C
backend, compile with -g, and suffer through the experience. :( Luckily,
when writing LLVM optimizations and such, bugpoint makes things much,
much nicer. :)

-Chris

--
http://llvm.cs.uiuc.edu/
http://www.nondot.org/~sabre/Projects/
On Sun, 2003-11-16 at 13:01, Vipin Gokhale wrote:
> While on the subject of annotating bytecode with analysis info, could I
> entice someone to also think about carrying other types of source-level
> annotations through into bytecode? This is particularly useful for
> situations where one wants to use the LLVM infrastructure for its
> whole-program optimization capabilities, but wouldn't want to give up
> on the ability to debug the final product binary. At the moment, my
> understanding is that source-code annotations like file names, line
> numbers, etc. aren't carried through. When one gets around to linking
> the whole program, you end up with a single .s file of native machine
> code (which by now is a giant collection of bits picked up from a
> multitude of source files) with no ability to do symbolic debugging on
> the resulting binary...

I wholeheartedly second that motion. My purposes are a little different,
however. The language for which I'm compiling (XPL) is fairly high-level.
For example, data structures such as hash tables and red-black trees are
simply referenced as "maps" which map one type to another. What exact
data structure is used underneath is up to the compiler and runtime
optimizer, even allowing transformation of the underlying type at
runtime.

For example, a map that initially contains 3 elements would probably just
be a vector of pairs, because it's pretty straightforward to linearly
scan a small table and it is space-efficient. But, as the map grows in
size, it might transform itself into a sorted vector so binary search can
be used, then into a hash table to reduce the overhead of searching
further, and then again later on into a full red-black tree. Of course,
all of this depends on whether insertions and deletions are more frequent
than lookups, etc.

The point here is that XPL needs to keep track of what a given variable
represents at the source level.
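[Editor's note] Reid's representation-switching map can be sketched roughly as follows. This is a hypothetical illustration, not XPL's actual implementation: all names and the promotion threshold are invented, and only the first transition (linear vector of pairs to sorted vector with binary search) is shown.

```c
#include <assert.h>
#include <stdlib.h>

#define PROMOTE_AT 8  /* illustrative threshold for switching forms */

typedef struct { int key, value; } Pair;

typedef struct {
    Pair *pairs;
    size_t len, cap;
    int sorted;       /* 0 = small linear form, 1 = promoted sorted form */
} AdaptiveMap;

static int pair_cmp(const void *a, const void *b) {
    /* keys are small in this sketch, so subtraction cannot overflow */
    return ((const Pair *)a)->key - ((const Pair *)b)->key;
}

void map_insert(AdaptiveMap *m, int key, int value) {
    if (m->len == m->cap) {
        m->cap = m->cap ? m->cap * 2 : 4;
        m->pairs = realloc(m->pairs, m->cap * sizeof(Pair));
    }
    m->pairs[m->len].key = key;
    m->pairs[m->len].value = value;
    m->len++;
    if (!m->sorted && m->len > PROMOTE_AT) {
        /* Promote: switch representation so lookups can binary-search. */
        qsort(m->pairs, m->len, sizeof(Pair), pair_cmp);
        m->sorted = 1;
    } else if (m->sorted) {
        /* Kept naive for brevity: re-sort after each insert to
         * preserve the sorted-form invariant. */
        qsort(m->pairs, m->len, sizeof(Pair), pair_cmp);
    }
}

int *map_find(AdaptiveMap *m, int key) {
    if (!m->sorted) {                  /* small form: linear scan is cheap */
        for (size_t i = 0; i < m->len; i++)
            if (m->pairs[i].key == key)
                return &m->pairs[i].value;
        return NULL;
    }
    Pair probe = { key, 0 };           /* promoted form: binary search */
    Pair *p = bsearch(&probe, m->pairs, m->len, sizeof(Pair), pair_cmp);
    return p ? &p->value : NULL;
}
```

The design question Reid raises is that after such a switch, the LLVM-level type no longer says "map"; only retained source-level metadata can.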
If the compiler sees a map that is initially small, it might represent it
in LLVM assembly as a vector of pairs. Later on, it gets optimized into
being a hash table. In order to do that and keep track of things, I need
to know that the vector of pairs is >intended< to be a map, not simply a
vector of pairs.

Another reason to do this is to speed up compilation time. XPL works
similarly to Java in that you define a module and "import" other modules
into it. I do not want to recompile a module each time it is imported.
I'd rather just save the static portion of the syntax tree (i.e. the
declarations) somewhere and load it en masse when it's referenced in
another compilation. Currently, I have a partially implemented solution
for this based on my persistent memory module (like an object database
for C++ that allows you to save graphs of objects onto disk via virtual
memory management tricks). When a module is referenced in an import
statement, its disk segment is located and mapped into memory in one
shot... no parsing, no linking together, just instantly available. For
large software projects with 1000s of modules, this is a HUGE compilation
time win.

Since finding LLVM, I'm wondering if it wouldn't be better to store all
the AST information in the bytecode file, so that I don't have
compilation information in one place and the code for it in another. To
do this, I'd need support from LLVM to put "compile time information"
into a bytecode or assembly file. This information would never be used at
runtime and never "optimized out". It just sits in the bytecode file
taking up space until some compiler (or other tool) asks for it.

I've given some thought to this and here's how I think it should go:

 1. Compile time information is placed in a separate section of the
    bytecode file (presumably at the end to reduce runtime I/O).

 2. Nothing in the compile time information is used at runtime. It is
    neither the subject of optimization nor execution.

 3.
    Compile time information sections are completely optional. A given
    language compiler need not utilize them, and they have no bearing on
    correct execution of the program.

 4. Compile time information is loaded only explicitly (presumably by a
    compiler based on LLVM), but possibly also by an optimization pass
    that would like to understand the higher-order semantics better
    (this would require the pass to be language-specific, presumably).

 5. Compile time information is defined as a set of global variables,
    just the same as for the runtime definitions. The full use of LLVM
    Types (especially derived types like structures and pointers) can
    be used to define the global variables.

 6. There are never any naming conflicts between compile time
    information variables in different modules. Each compile time
    global variable is, effectively, scoped in its module. This allows
    compiler writers to use the same name for various pieces of data in
    every module emitted without clashing.

 7. The exact same facilities for dealing with module-scoped types and
    variables are used to deal with the compile time information. When
    asked for it, the VMCore would produce a SymbolTable that
    references all the global types and variables in the compile time
    information.

 8. The LLVM assembler and bytecode reader will assure the syntactic
    integrity of the compile time information as they would for any
    other bytecode: they check types, pointer references, etc. and emit
    warnings (errors?) if the compiler information is not syntactically
    valid.

 9. LLVM makes no assertions about the semantics or content of the
    compile time information. It can be anything the compiler writer
    wishes to express to retain compilation information. Correctness of
    the information content (beyond syntactics) is left to the compiler
    writer. Exceptions to this rule may be warranted where there is
    general applicability to multiple source languages. Debug (file &
    line number) info would seem to be a natural exception.

 10.
    Compile time information sections are marked with a name that
    relates to the high-level compiler that produced them. This avoids
    confusion when one language attempts to read the compile time
    information of another language.

This is somewhat like an open-ended, generalized ELF section for keeping
track of compiler and/or debug information. Because it's based on
existing capabilities of LLVM, I don't think it would be particularly
difficult to implement either.

Reid.
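[Editor's note] Points 1 and 10 above — optional sections, tagged with a producer name, that an unfamiliar reader can step over — could be encoded along these lines. This is a hedged sketch with an invented record layout (it is not LLVM's actual bytecode encoding), and it omits bounds checking for brevity:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Record layout (illustrative):
 *   [1-byte name length][name bytes][4-byte payload length][payload]
 * A reader that does not recognize a section's name skips it using the
 * payload length, exactly as Reid's optional-section rule requires. */

size_t write_section(uint8_t *buf, const char *name,
                     const uint8_t *payload, uint32_t plen) {
    size_t off = 0;
    uint8_t nlen = (uint8_t)strlen(name);
    buf[off++] = nlen;
    memcpy(buf + off, name, nlen);     off += nlen;
    memcpy(buf + off, &plen, 4);       off += 4;
    memcpy(buf + off, payload, plen);  off += plen;
    return off;  /* bytes written; no overflow checks in this sketch */
}

/* Find a section by name; return its payload length, or -1 if absent.
 * Unknown sections are skipped, never parsed. */
long find_section(const uint8_t *buf, size_t len, const char *want,
                  const uint8_t **payload_out) {
    size_t off = 0;
    while (off < len) {
        uint8_t nlen = buf[off++];
        const char *name = (const char *)(buf + off);
        off += nlen;
        uint32_t plen;
        memcpy(&plen, buf + off, 4);   /* memcpy avoids unaligned reads */
        off += 4;
        if (nlen == strlen(want) && memcmp(name, want, nlen) == 0) {
            *payload_out = buf + off;
            return (long)plen;
        }
        off += plen;  /* section not understood: step over it */
    }
    return -1;
}
```

A compiler could name its section after itself (e.g. "xpl.ast"), satisfying point 10, while any other consumer simply skips it.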
> The point here is that XPL needs to keep track of what a given variable
> represents at the source level. If the compiler sees a map that is
> initially small it might represent it in LLVM assembly as a vector of
> pairs. Later on, it gets optimized into being a hash table. In order to
> do that and keep track of things, I need to know that the vector of
> pairs is >intended< to be a map, not simply a vector of pairs.

Absolutely. No matter what source language you're interested in, you want
to know about _source_ variables/types/etc, not about LLVM variables,
types, etc.

> Another reason to do this is to speed up compilation time. XPL works
> similarly to Java in that you define a module and "import" other modules
> into it. I do not want to recompile a module each time it is imported.

Makes sense. On the LLVM side of the fence, we are planning on making the
JIT cache native translations, so you only need to pay the translation
cost the first time a function is executed. This also plays into the
'offline compilation' idea as well.

> Since finding LLVM, I'm wondering if it wouldn't be better to store all
> the AST information in the bytecode file so that I don't have
> compilation information in one place and the code for it in another.
> To do this, I'd need support from LLVM to put "compile time information"
> into a bytecode or assembly file. This information would never be used
> at runtime and never "optimized out". It just sits in the bytecode file
> taking up space until some compiler (or other tool) asks for it.

Makes sense. The LLVM bytecode file is packetized specifically to support
these kinds of applications. The bytecode reader can skip over sections
it doesn't understand. The unimplemented part is figuring out a format to
put this into the .ll file (probably just a hex dump or something), and
having the compiler preserve it through optimization.

> 5.
> Compile time information is defined as a set of global variables
> just the same as for the runtime definitions. The full use of
> LLVM Types (especially derived types like structures and
> pointers) can be used to define the global variables.

If you just want to do this _today_, you already can. We have an
"appending" linkage type which can make this very simple. Basically,
global arrays with appending linkage automatically merge together when
bytecode files are linked (just like 'sections' are merged in a
traditional linker). If you want to implement your extra information
using globals, that is no problem; they will just always be loaded and
processed.

> 6. There are never any naming conflicts between compile time
>    information variables in different modules. Each compile time
>    global variable is, effectively, scoped in its module. This
>    allows compiler writers to use the same name for various pieces
>    of data in every module emitted without clashing.

If you use the appending linkage mechanism, you _want_ them to have the
same name. :)

> 7. The exact same facility for dealing with module scoped types and
>    variables are used to deal with the compile time information.
>    When asked for it, the VMCore would produce a SymbolTable that
>    references all the global types and variables in the compile
>    time information.

If you use globals directly, you can just use the standard stuff.

> 8. LLVM assembler and bytecode reader will assure the syntactic
>    integrity of the compile time information as it would for any
>    other bytecode. It checks types, pointer references, etc. and
>    emits warnings (errors?) if the compiler information is not
>    syntactically valid.

How does it do this if it doesn't understand it? I thought it would just
pass it through unmodified?

> 9. LLVM makes no assertions about the semantics or content of the
>    compile time information. It can be anything the compiler writer
>    wishes to express to retain compilation information.
> Correctness of the information content (beyond syntactics) is left
> to the compiler writer. Exceptions to this rule may be warranted where

This seems to contradict #8.

> there is general applicability to multiple source languages.
> Debug (file & line number) info would seem to be a natural
> exception.

Note that debug information doesn't work with this model. In particular,
when the LLVM optimizer transmogrifies the code, it has to update the
debug information to remain accurate. This requires understanding (at
some level) the debug format.

> 10. Compile time information sections are marked with a name that
>     relates to the high-level compiler that produced them. This
>     avoids confusion when one language attempts to read the compile
>     time information of another language.
>
> This is somewhat like an open-ended, generalized ELF section for
> keeping track of compiler and/or debug information. Because it's based
> on existing capabilities of LLVM, I don't think it would be
> particularly difficult to implement either.

There are two ways to implement this, as described above:

 1. Use global arrays of bytes or something. If you want to, your arrays
    can even have pointers to global variables and functions in them.
 2. Use an untyped blob of data, attached to the .bc file.

#2 is better from the efficiency standpoint (it doesn't need to be loaded
if not used), but #1 is already fully implemented (it is used to
implement global ctors/dtors)...

-Chris

--
http://llvm.cs.uiuc.edu/
http://www.nondot.org/~sabre/Projects/
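[Editor's note] Chris's option 1 can be illustrated in spirit with plain C. In LLVM IR the per-module arrays would carry appending linkage and be concatenated by the linker (as happens for global ctors/dtors); since appending linkage has no C equivalent, the sketch below simulates the post-link result with a single hand-merged array. All names are invented for illustration:

```c
#include <assert.h>
#include <string.h>

/* One metadata entry: a source-level name and the LLVM-level
 * representation currently standing in for it. */
typedef struct {
    const char *source_name;
    const char *llvm_repr;
} VarNote;

/* Stand-in for the merged result of linking two modules, each of which
 * emitted its own appending-linkage array under the same global name. */
static const VarNote merged_notes[] = {
    { "symbols", "vector-of-pairs" },  /* from "module A" */
    { "imports", "hash-table"      },  /* from "module B" */
};

/* A compiler or tool walks the merged array explicitly; the entries are
 * ordinary globals, so (as Chris notes) they are always loaded. */
const char *lookup_repr(const char *name) {
    size_t n = sizeof(merged_notes) / sizeof(merged_notes[0]);
    for (size_t i = 0; i < n; i++)
        if (strcmp(merged_notes[i].source_name, name) == 0)
            return merged_notes[i].llvm_repr;
    return NULL;
}
```

This also shows the trade-off Chris describes: option 1 works today with no new infrastructure, but the data is always resident, unlike a skippable blob.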