> The point here is that XPL needs to keep track of what a given variable
> represents at the source level. If the compiler sees a map that is
> initially small it might represent it in LLVM assembly as a vector of
> pairs. Later on, it gets optimized into being a hash table. In order to
> do that and keep track of things, I need to know that the vector of
> pairs is >intended< to be a map, not simply a vector of pairs.

Absolutely. No matter what source language you're interested in, you want
to know about _source_ variables/types/etc, not about LLVM variables,
types, etc.

> Another reason to do this is to speed up compilation time. XPL works
> similarly to Java in that you define a module and "import" other modules
> into it. I do not want to recompile a module each time it is imported.

Makes sense. On the LLVM side of the fence, we are planning on making the
JIT cache native translations, so you only need to pay the translation
cost the first time a function is executed. This also plays into the
'offline compilation' idea.

> Since finding LLVM, I'm wondering if it wouldn't be better to store all
> the AST information in the bytecode file so that I don't have
> compilation information in one place and the code for it in another.
> To do this, I'd need support from LLVM to put "compile time information"
> into a bytecode or assembly file. This information would never be used
> at runtime and never "optimized out". It just sits in the bytecode file
> taking up space until some compiler (or other tool) asks for it.

Makes sense. The LLVM bytecode file is packetized to specifically
support these kinds of applications. The bytecode reader can skip over
sections it doesn't understand. The unimplemented part is figuring out a
format to put this into the .ll file (probably just a hex dump or
something), and having the compiler preserve it through optimization.

> 5. Compile time information is defined as a set of global variables
>    just the same as for the runtime definitions. The full use of
>    LLVM Types (especially derived types like structures and
>    pointers) can be used to define the global variables.

If you just want to do this _today_ you already can. We have an
"appending" linkage type which can make this very simple. Basically
global arrays with appending linkage automatically merge together when
bytecode files are linked (just like sections are merged in a traditional
linker). If you want to implement your extra information using globals,
that is no problem, they will just always be loaded and processed.

> 6. There are never any naming conflicts between compile time
>    information variables in different modules. Each compile time
>    global variable is, effectively, scoped in its module. This
>    allows compiler writers to use the same name for various pieces
>    of data in every module emitted without clashing.

If you use the appending linkage mechanism, you _want_ them to have the
same name. :)

> 7. The exact same facility for dealing with module scoped types and
>    variables is used to deal with the compile time information.
>    When asked for it, the VMCore would produce a SymbolTable that
>    references all the global types and variables in the compile
>    time information.

If you use globals directly, you can just use the standard stuff.

> 8. LLVM assembler and bytecode reader will assure the syntactic
>    integrity of the compile time information as it would for any
>    other bytecode. It checks types, pointer references, etc. and
>    emits warnings (errors?) if the compiler information is not
>    syntactically valid.

How does it do this if it doesn't understand it? I thought it would just
pass it through unmodified?

> 9. LLVM makes no assertions about the semantics or content of the
>    compile time information. It can be anything the compiler writer
>    wishes to express to retain compilation information. Correctness
>    of the information content (beyond syntactics) is left to the
>    compiler writer. Exceptions to this rule may be warranted where

This seems to contradict #8.

>    there is general applicability to multiple source languages.
>    Debug (file & line number) info would seem to be a natural
>    exception.

Note that debug information doesn't work with this model. In particular,
when the LLVM optimizer transmogrifies the code, it has to update the
debug information to remain accurate. This requires understanding (at
some level) the debug format.

> 10. Compile time information sections are marked with a name that
>     relates to the high-level compiler that produced them. This
>     avoids confusion when one language attempts to read the compile
>     time information of another language.
>
> This is somewhat like an open ended, generalized ELF section for keeping
> track of compiler and/or debug information. Because it's based on
> existing capabilities of LLVM, I don't think it would be particularly
> difficult to implement either.

There are two ways to implement this, as described above:

1. Use global arrays of bytes or something. If you want to, your arrays
   can even have pointers to global variables and functions in them.
2. Use an untyped blob of data, attached to the .bc file.

#2 is better from the efficiency standpoint (it doesn't need to be loaded
if not used), but #1 is already fully implemented (it is used to implement
global ctor/dtors)...

-Chris

--
http://llvm.cs.uiuc.edu/
http://www.nondot.org/~sabre/Projects/
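
The linking behaviour described for option #1 can be modelled outside LLVM. What follows is only a minimal sketch under stated assumptions: the `Global` record, the `Module` map and the `linkModules` function are hypothetical stand-ins invented for illustration, not LLVM classes or APIs. It shows same-named globals with appending linkage being concatenated in link order, while any other same-named pair is rejected as a duplicate definition.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <stdexcept>
#include <string>
#include <vector>

// Toy model only: these types are hypothetical and are NOT LLVM classes.
enum class Linkage { External, Appending };

struct Global {
    Linkage linkage;
    std::vector<std::uint8_t> bytes;  // constant initializer, as raw bytes
};

// A module is modelled as a map from global name to global.
using Module = std::map<std::string, Global>;

// Link modules in order. Same-named appending globals are concatenated;
// any other name clash is reported as a duplicate definition.
Module linkModules(const std::vector<Module>& inputs) {
    Module out;
    for (const Module& m : inputs) {
        for (const auto& [name, g] : m) {
            assert(!name.empty() && "globals in this model must be named");
            auto it = out.find(name);
            if (it == out.end()) {
                out.emplace(name, g);
            } else if (it->second.linkage == Linkage::Appending &&
                       g.linkage == Linkage::Appending) {
                it->second.bytes.insert(it->second.bytes.end(),
                                        g.bytes.begin(), g.bytes.end());
            } else {
                throw std::runtime_error("duplicate symbol definition: " + name);
            }
        }
    }
    return out;
}
```

In this model two modules that each define an appending `Types` array link into a single `Types` array holding both sets of bytes, which is why the same name is wanted in every module.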
On Sun, 2003-11-16 at 17:13, Chris Lattner wrote:
> > The point here is that XPL needs to keep track of what a given variable
> > represents at the source level. If the compiler sees a map that is
> > initially small it might represent it in LLVM assembly as a vector of
> > pairs. Later on, it gets optimized into being a hash table. In order to
> > do that and keep track of things, I need to know that the vector of
> > pairs is >intended< to be a map, not simply a vector of pairs.
>
> Absolutely. No matter what source language you're interested in, you want
> to know about _source_ variables/types/etc, not about LLVM variables,
> types, etc.

Right.

> > Another reason to do this is to speed up compilation time. XPL works
> > similarly to Java in that you define a module and "import" other modules
> > into it. I do not want to recompile a module each time it is imported.
>
> Makes sense. On the LLVM side of the fence, we are planning on making the
> JIT cache native translations, so you only need to pay the translation
> cost the first time a function is executed. This also plays into the
> 'offline compilation' idea.

I had assumed as much but I think I'm talking about something different.
When I said "I do not want to recompile a module each time it is
imported", I meant recompile in order to get the _source_ language
descriptions only. I wouldn't recompile to get the byte codes to be
executed because (presumably) those are already available as you noted.
For example, if module A imports module B, I want to be able to just
instantaneously load from B the definitions of types, constants, global
variables and functions, as specified in the _source_ language without
going back to the _source_ and recompiling it to regenerate the
information. If we were in the C/C++ world, this would be more akin to
header file pre-compilation.
I want to load the _source_ AST for a given compiler very quickly, without
revisiting the source code itself.

> > Since finding LLVM, I'm wondering if it wouldn't be better to store all
> > the AST information in the bytecode file so that I don't have
> > compilation information in one place and the code for it in another.
> > To do this, I'd need support from LLVM to put "compile time information"
> > into a bytecode or assembly file. This information would never be used
> > at runtime and never "optimized out". It just sits in the bytecode file
> > taking up space until some compiler (or other tool) asks for it.
>
> Makes sense. The LLVM bytecode file is packetized to specifically
> support these kinds of applications. The bytecode reader can skip over
> sections it doesn't understand. The unimplemented part is figuring out a
> format to put this into the .ll file (probably just a hex dump or
> something), and having the compiler preserve it through optimization.

Sort of. What I'm thinking of is a section that it normally skips over
(or, even better, never reaches because it's at the end). However, the
contents of that section would be interpretable by LLVM if someone asked
for it. That is, the contents of the section contain constant type and
variable definitions that are _not_ part of the executable program but
are the _source_ description for the program. Those source descriptions
are specified using regular LLVM Type and variable definitions but they
don't factor into the program at all. When a bytecode file is loaded,
anything defined in such a section is just skipped over. When a compiler
or debugger asks for that section explicitly (the only way it gets
accessed), LLVM would interpret the bytecodes and give back an instance
of SymbolTable that only references Value and Type objects. These are
the types and values that the compiler writer emitted to describe the
_source_ and their semantics are up to the source compiler writer.

>
> > 5. Compile time information is defined as a set of global variables
> >    just the same as for the runtime definitions. The full use of
> >    LLVM Types (especially derived types like structures and
> >    pointers) can be used to define the global variables.
>
> If you just want to do this _today_ you already can. We have an
> "appending" linkage type which can make this very simple. Basically
> global arrays with appending linkage automatically merge together when
> bytecode files are linked (just like sections are merged in a traditional
> linker). If you want to implement your extra information using globals,
> that is no problem, they will just always be loaded and processed.

No. These _source_ descriptions are not to be loaded and processed ever
except by explicit instruction from a compiler or debugger. For normal
program execution they are always ignored. Furthermore, they must NOT be
merged unless you just mean concatenated into one big "source
description" segment. I don't see much utility in that myself. If by
merged you mean that commonly named global symbols are reduced to a
single copy (like linkonce), then this defeats the point. What if a
compiler wanted to emit a variable named "ModuleOptions" in each
translation unit that describes the _source_ compiler options used to
compile the module? If those all get merged away, you lose the ability
to distinguish different "ModuleOptions" for different modules. This is
the reason for point #6.

>
> > 6. There are never any naming conflicts between compile time
> >    information variables in different modules. Each compile time
> >    global variable is, effectively, scoped in its module. This
> >    allows compiler writers to use the same name for various pieces
> >    of data in every module emitted without clashing.
>
> If you use the appending linkage mechanism, you _want_ them to have the
> same name. :)

No, you don't, for the reason described above.
Is there a way to retain the unique identity of each of the variables
when using appending linkage?

>
> > 7. The exact same facility for dealing with module scoped types and
> >    variables is used to deal with the compile time information.
> >    When asked for it, the VMCore would produce a SymbolTable that
> >    references all the global types and variables in the compile
> >    time information.
>
> If you use globals directly, you can just use the standard stuff.

Perhaps. I'm unsure of the details, but you'd need to somehow mark these
globals as "not part of the program, never execute, ignore on load,
fetch only if requested".

>
> > 8. LLVM assembler and bytecode reader will assure the syntactic
> >    integrity of the compile time information as it would for any
> >    other bytecode. It checks types, pointer references, etc. and
> >    emits warnings (errors?) if the compiler information is not
> >    syntactically valid.
>
> How does it do this if it doesn't understand it? I thought it would just
> pass it through unmodified?

Read my statement carefully. I said "syntactic integrity" not semantics.
LLVM would ensure that, within the compile time information (i.e. source
description) there are (a) no references to undefined types, (b) no
pointers to undefined symbols, (c) etc. These are all syntactic
constructs that can be checked by LLVM without ever really understanding
what the information in the compile time information actually _means_.
That interpretation is left to the compiler writer. This just gives the
compiler writer some assurance that the content of the compile time
information at least makes some structural sense. Furthermore, this
information, even though it may represent a very complex data structure,
is treated as a big constant. There can be no variable parts (despite me
referencing this as "global variables" previously). There might, however,
be relocatable parts such as a reference to an actual function or global
variable.

>
> > 9. LLVM makes no assertions about the semantics or content of the
> >    compile time information. It can be anything the compiler writer
> >    wishes to express to retain compilation information. Correctness
> >    of the information content (beyond syntactics) is left to the
> >    compiler writer. Exceptions to this rule may be warranted where
>
> This seems to contradict #8.

Not really. You don't want LLVM to specify to _source_ language compiler
writers what is and isn't valid semantically. In fact, you'd have a
really hard time doing so. You'd end up with (conceptually) something
like the GCC "tree" mess, trying to be all things to everyone. Why
bother? Leave that to the compiler writer. You only want LLVM to check
syntax/structure/referential integrity, etc.

>
> >    there is general applicability to multiple source languages.
> >    Debug (file & line number) info would seem to be a natural
> >    exception.
>
> Note that debug information doesn't work with this model. In particular,
> when the LLVM optimizer transmogrifies the code, it has to update the
> debug information to remain accurate. This requires understanding (at
> some level) the debug format.

You're right. Debug information needs to be more closely aligned with
the actual code in order for it to survive transformation. In fact, this
raises some suspicions about the viability of my approach in general. If
the source description information contains references to a function
that gets eliminated because it's never called, what happens? Same thing
for types and variables at both global and function scope.

>>> I'm off to do some serious thinking about this proposal :( <<<

>
> > 10. Compile time information sections are marked with a name that
> >     relates to the high-level compiler that produced them. This
> >     avoids confusion when one language attempts to read the compile
> >     time information of another language.
> >
> > This is somewhat like an open ended, generalized ELF section for keeping
> > track of compiler and/or debug information. Because it's based on
> > existing capabilities of LLVM, I don't think it would be particularly
> > difficult to implement either.
>
> There are two ways to implement this, as described above:
> 1. Use global arrays of bytes or something. If you want to, your arrays
>    can even have pointers to global variables and functions in them.
> 2. Use an untyped blob of data, attached to the .bc file.
>
> #2 is better from the efficiency standpoint (it doesn't need to be loaded
> if not used), but #1 is already fully implemented (it is used to implement
> global ctor/dtors)...

I don't think #1 works because of the naming clash issue and because it
implies that these global arrays become part of the program. I
explicitly want to forbid that because (at least in the case of XPL), I
can imagine situations where the source description information is more
voluminous than the actual program by an order of magnitude (it's that
way with debug "symbol" information today).

What I want to do is emit the same named global variable (your "arrays
of bytes or something") in each module to capture information about that
module. For example, I want to emit a global array of structures that
describes the types defined in the module. I want to call that global
array "Types". If I do that in every module, what happens? I get a link
time "duplicate symbol definition" error? If I use appending linkage, I
only get one of them? This is a disaster for this type of information.
And, the name must remain constant across modules so that I can say,
"load the compile time information for module X" and then "get variable
"Types" from that compile time information". I can then peruse the type
information for that module. If I have to mangle the name in each
module, that's a little unfriendly and error prone. Furthermore, I do
NOT want this information to be part of the program. It isn't; it
describes the program.

As such, your point #2 must be accommodated. The blob of data is
normally skipped when the program is executed. But, when it is
requested, that blob of data isn't just returned to the compiler as a
blob. Because it represents a constant graph of types and values, LLVM
first checks its integrity, then instantiates the necessary C++ objects
to represent it and places them into a symbol table which is returned to
the compiler. This means the compiler can quickly look up source
descriptions in that module.

If that approach is too cumbersome for LLVM, then I would vote for just
the "blob" thing and leave it to each compiler writer to interpret the
blob correctly.

Make sense?

> -Chris

Reid.
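
The "normally skipped, decoded only on request" behaviour asked for here can be sketched with a small container format. This is only an illustration under stated assumptions: the section layout, the `appendSection` and `findSection` functions and the section name `xpl.source` are all hypothetical and are not the real LLVM bytecode format or API. The reader steps over every payload by its length and copies out only the section that was asked for by name.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

using Bytes = std::vector<std::uint8_t>;

// Hypothetical layout, NOT the real LLVM bytecode format:
//   [1-byte name length][name][4-byte little-endian payload length][payload]
void appendSection(Bytes& file, const std::string& name, const Bytes& payload) {
    assert(name.size() <= 0xFF && "section name must fit in one length byte");
    file.push_back(static_cast<std::uint8_t>(name.size()));
    file.insert(file.end(), name.begin(), name.end());
    const std::uint32_t len = static_cast<std::uint32_t>(payload.size());
    for (int shift = 0; shift < 32; shift += 8)
        file.push_back(static_cast<std::uint8_t>((len >> shift) & 0xFF));
    file.insert(file.end(), payload.begin(), payload.end());
}

// Walk the sections, skipping every payload except the requested one.
// Returns std::nullopt if the section is absent or the file is truncated.
std::optional<Bytes> findSection(const Bytes& file, const std::string& wanted) {
    std::size_t pos = 0;
    while (pos < file.size()) {
        const std::size_t nameLen = file[pos++];
        if (file.size() - pos < nameLen + 4) return std::nullopt;
        const std::string name(file.begin() + pos, file.begin() + pos + nameLen);
        pos += nameLen;
        std::uint32_t len = 0;
        for (int i = 0; i < 4; ++i)
            len |= static_cast<std::uint32_t>(file[pos + i]) << (8 * i);
        pos += 4;
        if (file.size() - pos < len) return std::nullopt;
        if (name == wanted)
            return Bytes(file.begin() + pos, file.begin() + pos + len);
        pos += len;  // not the requested section: skip without decoding
    }
    return std::nullopt;
}
```

A loader that only wants to run the program would never call `findSection` for the source-description section, so its contents cost nothing beyond file size; a compiler or debugger would request it by the name its own front end wrote.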
Chris,

I've done a little more thinking about this (putting source definitions
into bytecode/assembler files). Despite my previous assertions to the
contrary, using the LLVM type and constant global declarations in the
source definitions section is a little limiting because (a) it limits
the choices for source compiler writers, (b) it might imply a larger
amount of information than would otherwise be possible, and (c) it
implies a contract between LLVM and source compiler writers that LLVM
shouldn't have to support.

So, here's what I suggest:

1. LLVM supports a named section that contains a BLOB. LLVM doesn't
   care about the contents of the BLOB but will assist in its
   maintenance (see below). To LLVM it's a name and a chunk of data.
2. The name of the section is to allow information from different
   compilers to be distinguished.
3. LLVM provides a C++ class that represents the named blob. I'll
   call this class "ExtraInfo".
4. Source compiler writers can subclass ExtraInfo to their heart's
   content.
5. The ExtraInfo class supports pure virtual methods that are invoked
   by LLVM to notify its subclass(es) when an optimization causes a
   function, type or global variable to be deleted. No other
   notifications should be necessary.
6. During compilation, any ExtraInfo subclasses created by the source
   compiler are attached to the Module object and the maintenance
   provided in 5 is invoked automatically as optimizations occur.

Does this sound reasonable?

Reid.
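
The six points above could look roughly like the following in C++. This is a sketch of the proposal only, not an existing LLVM interface: apart from the class name `ExtraInfo`, which comes from the message, every name here (`ToyModule`, `XplSourceInfo`, the method names, the section name `xpl`) is hypothetical, and program entities are identified by plain strings so the example does not depend on any real LLVM types.

```cpp
#include <cassert>
#include <memory>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical interface following the six points above; LLVM does not
// provide this class. Entities are identified by name for simplicity.
class ExtraInfo {
public:
    explicit ExtraInfo(std::string sectionName) : name_(std::move(sectionName)) {}
    virtual ~ExtraInfo() = default;

    // Section name that distinguishes one compiler's data from another's (point 2).
    const std::string& sectionName() const { return name_; }

    // Notifications invoked when an optimization deletes something (point 5).
    virtual void functionDeleted(const std::string& fn) = 0;
    virtual void typeDeleted(const std::string& ty) = 0;
    virtual void globalDeleted(const std::string& gv) = 0;

private:
    std::string name_;
};

// Stand-in for the Module object that owns attached ExtraInfo (point 6).
class ToyModule {
public:
    void attach(std::unique_ptr<ExtraInfo> info) {
        assert(info && "cannot attach a null ExtraInfo");
        infos_.push_back(std::move(info));
    }

    // What a pass would call after erasing a function.
    void notifyFunctionDeleted(const std::string& fn) {
        for (auto& info : infos_) info->functionDeleted(fn);
    }

private:
    std::vector<std::unique_ptr<ExtraInfo>> infos_;
};

// Example subclass a source compiler might write (point 4).
class XplSourceInfo : public ExtraInfo {
public:
    XplSourceInfo() : ExtraInfo("xpl") {}

    void describeFunction(const std::string& fn) { described_.insert(fn); }
    bool describes(const std::string& fn) const { return described_.count(fn) != 0; }

    // Drop the description of a function the optimizer removed.
    void functionDeleted(const std::string& fn) override { described_.erase(fn); }
    void typeDeleted(const std::string&) override {}
    void globalDeleted(const std::string&) override {}

private:
    std::set<std::string> described_;
};
```

The design choice worth noting is that the module only ever sees the abstract base class: the optimizer side needs no knowledge of what a given compiler stores, and each subclass decides for itself how to react to a deletion.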
On Sun, 16 Nov 2003, Reid Spencer wrote:
> header file pre-compilation. I want to load the _source_ AST for a
> given compiler very quickly, without revisiting the source code itself.

Gotcha.

> Sort of. What I'm thinking of is a section that it normally skips over
> (or, even better, never reaches because it's at the end). However, the
> contents of that section would be interpretable by LLVM if someone asked
> for it. That is, the contents of the section contain constant type and
> variable definitions that are _not_ part of the executable program but
> are the _source_ description for the program. Those source descriptions
> are specified using regular LLVM Type and variable definitions but they
> don't factor into the program at all. When a bytecode file is loaded,
> anything defined in such a section is just skipped over. When a compiler

Ok, this is all cool.

> or debugger asks for that section explicitly (the only way it gets
> accessed), LLVM would interpret the bytecodes and give back an instance
> of SymbolTable that only references Value and Type objects. These are
> the types and values that the compiler writer emitted to describe the
> _source_ and their semantics are up to the source compiler writer.

This isn't. I don't understand exactly what you're talking about here.
What "Value" and "Type" objects can there be if LLVM doesn't understand
it? It seems to make more sense to me for the debugger or whatever to ask
for a named section, and get handed an _untyped block_ of binary data...

> > If you just want to do this _today_ you already can. We have an
> > "appending" linkage type which can make this very simple. Basically
> > global arrays with appending linkage automatically merge together when
> > bytecode files are linked (just like sections are merged in a traditional
> > linker). If you want to implement your extra information using globals,
> > that is no problem, they will just always be loaded and processed.
>
> No. These _source_ descriptions are not to be loaded and processed ever
> except by explicit instruction from a compiler or debugger. For normal

Okay...

> program execution they are always ignored. Furthermore, they must NOT be
> merged unless you just mean concatenated into one big "source
> description" segment. I don't see much utility in that myself. If by

That's what I meant. Assuming LLVM doesn't understand the contents of it,
all it can do is concatenate.

> merged you mean that commonly named global symbols are reduced to a
> single copy (like linkonce), then this defeats the point. What if a

I did mean appended.

> compiler wanted to emit a variable named "ModuleOptions" in each
> translation unit that describes the _source_ compiler options used to
> compile the module? If those all get merged away, you lose the ability
> to distinguish different "ModuleOptions" for different modules. This is
> the reason for point #6.

I understand.

> > > 6. There are never any naming conflicts between compile time
> > >    information variables in different modules. Each compile time
> > >    global variable is, effectively, scoped in its module. This
> > >    allows compiler writers to use the same name for various pieces
> > >    of data in every module emitted without clashing.
> >
> > If you use the appending linkage mechanism, you _want_ them to have the
> > same name. :)
> No, you don't, for the reason described above. Is there a way to retain
> the unique identity of each of the variables when using appending
> linkage?

In the example above, the idea is that you would specify a binary blob of
data put into an LLVM global constant array of bytes. The LLVM linker
would concatenate these arrays of bytes without having any idea how to
interpret the bytes. It would be up to your compiler to be able to
interpret the meaning of the bytes and to be able to determine the
'identity of the variables' given the raw data.

> > > 7. The exact same facility for dealing with module scoped types and
> > >    variables is used to deal with the compile time information.
> > >    When asked for it, the VMCore would produce a SymbolTable that
> > >    references all the global types and variables in the compile
> > >    time information.
> >
> > If you use globals directly, you can just use the standard stuff.
>
> Perhaps. I'm unsure of the details, but you'd need to somehow mark these
> globals as "not part of the program, never execute, ignore on load,
> fetch only if requested".

It would be straightforward to make the JIT materialize globals only when
they are referenced.

> > > 8. LLVM assembler and bytecode reader will assure the syntactic
> > >    integrity of the compile time information as it would for any
> > >    other bytecode. It checks types, pointer references, etc. and
> > >    emits warnings (errors?) if the compiler information is not
> > >    syntactically valid.
> >
> > How does it do this if it doesn't understand it? I thought it would just
> > pass it through unmodified?
>
> Read my statement carefully. I said "syntactic integrity" not semantics.
> LLVM would ensure that, within the compile time information (i.e. source
> description) there are (a) no references to undefined types, (b) no
> pointers to undefined symbols, (c) etc. These are all syntactic
> constructs that can be checked by LLVM without ever really understanding
> what the information in the compile time information actually _means_.
> That interpretation is left to the compiler writer. This just gives the

So you mean it checks the LLVM types and LLVM variables? I'm so confused,
I thought you were talking about source level stuff! :)

> compiler writer some assurance that the content of the compile time
> information at least makes some structural sense. Furthermore, this
> information, even though it may represent a very complex data structure,
> is treated as a big constant. There can be no variable parts (despite me
> referencing this as "global variables" previously). There might, however,
> be relocatable parts such as a reference to an actual function or global
> variable.

Ok, that is making more sense. Yes, LLVM already supports this.

> > > 9. LLVM makes no assertions about the semantics or content of the
> > >    compile time information. It can be anything the compiler writer
> > >    wishes to express to retain compilation information. Correctness
> > >    of the information content (beyond syntactics) is left to the
> > >    compiler writer. Exceptions to this rule may be warranted where
> >
> > This seems to contradict #8.
> Not really. You don't want LLVM to specify to _source_ language compiler
> writers what is and isn't valid semantically. In fact, you'd have a
> really hard time doing so. You'd end up with (conceptually) something
> like the GCC "tree" mess, trying to be all things to everyone. Why
> bother? Leave that to the compiler writer. You only want LLVM to check
> syntax/structure/referential integrity, etc.

Ok, I didn't understand what you meant by LLVM checking the structure but
not understanding the semantics. You don't mean the structure _of the
data itself_, just that the LLVM view of it is ok.

> > >    there is general applicability to multiple source languages.
> > >    Debug (file & line number) info would seem to be a natural
> > >    exception.
> >
> > Note that debug information doesn't work with this model. In particular,
> > when the LLVM optimizer transmogrifies the code, it has to update the
> > debug information to remain accurate. This requires understanding (at
> > some level) the debug format.
>
> You're right. Debug information needs to be more closely aligned with
> the actual code in order for it to survive transformation. In fact, this
> raises some suspicions about the viability of my approach in general. If
> the source description information contains references to a function
> that gets eliminated because it's never called, what happens? Same thing
> for types and variables at both global and function scope.

If a global has a pointer to a function, that function will never be
eliminated. Likewise, things like interprocedural constant propagation
(leading to deletion of arguments) will never happen.

> > There are two ways to implement this, as described above:
> > 1. Use global arrays of bytes or something. If you want to, your arrays
> >    can even have pointers to global variables and functions in them.
> > 2. Use an untyped blob of data, attached to the .bc file.
> >
> > #2 is better from the efficiency standpoint (it doesn't need to be loaded
> > if not used), but #1 is already fully implemented (it is used to implement
> > global ctor/dtors)...
>
> I don't think #1 works because of the naming clash issue and because it
> implies that these global arrays become part of the program. I
> explicitly want to forbid that because (at least in the case of XPL), I
> can imagine situations where the source description information is more
> voluminous than the actual program by an order of magnitude (it's that
> way with debug "symbol" information today).

I understand exactly what you're saying. Debug information in general has
this problem. It's a very reasonable, and general, performance
optimization for the JIT to never materialize globals it doesn't need, so
this in and of itself isn't hard. The hard part is that if you have
"external" pointers into the LLVM code, those pointers will be
invalidated very quickly by general transformations. Presumably you don't
want to handcuff the optimizer too much.

> What I want to do is emit the same named global variable (your "arrays
> of bytes or something") in each module to capture information about that
> module. For example, I want to emit a global array of structures that
> describes the types defined in the module. I want to call that global
> array "Types". If I do that in every module, what happens? I get a link
> time "duplicate symbol definition" error?

Yes.

> If I use appending linkage, I only get one of them?

No. The elements of the array will be concatenated together, as described
in: http://llvm.cs.uiuc.edu/docs/LangRef.html#modulestructure

> This is a disaster for this type of information. And, the name must
> remain constant across modules so that I can say, "load the compile time
> information for module X" and then "get variable "Types" from that
> compile time information". I can then peruse the type information for
> that module. If I have to mangle the name in each module, that's a
> little unfriendly and error prone. Furthermore, I do NOT want this
> information to be part of the program. It isn't; it describes the
> program.

I understand. This is exactly what appending linkage is for.

> If that approach is too cumbersome for LLVM, then I would vote for just
> the "blob" thing and leave it to each compiler writer to interpret the
> blob correctly.

This can certainly be done, but the problem is that random blobs on the
side will not be updated, and will be invalidated. It seems to me that
you're trying to address a problem semantically equivalent to debug
information, which I _want to directly address_, but there are other more
important things that need to be done first, as prerequisites.

It is critically important to me to make the LLVM transformations
_implicitly_ update debug information as they do their thing, without
being aware of it. Just like the symbol table is implicitly always kept
up-to-date. Of course, doing this is not easy. ;)

-Chris

--
http://llvm.cs.uiuc.edu/
http://www.nondot.org/~sabre/Projects/
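
One way a compiler could recover "the identity of the variables" from a concatenated byte array is to make every record self-describing. The sketch below assumes a record layout invented for this example (module name, NUL, variable name, NUL, 2-byte little-endian length, payload); `encodeRecord`, `decodeAll` and `SourceInfo` are hypothetical names, and nothing here is an LLVM API. Each module contributes one record per variable to the same appending array, and the decoder splits the linked result back into per-module entries.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <stdexcept>
#include <string>
#include <vector>

using Bytes = std::vector<std::uint8_t>;
// module name -> variable name -> payload
using SourceInfo = std::map<std::string, std::map<std::string, Bytes>>;

// Hypothetical record layout chosen for this sketch:
//   module name, NUL, variable name, NUL, 2-byte little-endian length, payload
Bytes encodeRecord(const std::string& moduleName, const std::string& variable,
                   const Bytes& payload) {
    assert(payload.size() <= 0xFFFF && "payload must fit a 2-byte length");
    Bytes out(moduleName.begin(), moduleName.end());
    out.push_back(0);
    out.insert(out.end(), variable.begin(), variable.end());
    out.push_back(0);
    out.push_back(static_cast<std::uint8_t>(payload.size() & 0xFF));
    out.push_back(static_cast<std::uint8_t>((payload.size() >> 8) & 0xFF));
    out.insert(out.end(), payload.begin(), payload.end());
    return out;
}

// Read a NUL-terminated name starting at pos and advance past the NUL.
static std::string readName(const Bytes& data, std::size_t& pos) {
    std::string name;
    while (pos < data.size() && data[pos] != 0)
        name.push_back(static_cast<char>(data[pos++]));
    if (pos == data.size()) throw std::runtime_error("unterminated name");
    ++pos;  // consume the NUL
    return name;
}

// Split the array the linker concatenated back into per-module variables.
SourceInfo decodeAll(const Bytes& data) {
    SourceInfo info;
    std::size_t pos = 0;
    while (pos < data.size()) {
        const std::string moduleName = readName(data, pos);
        const std::string variable = readName(data, pos);
        if (data.size() - pos < 2) throw std::runtime_error("truncated length");
        const std::size_t len =
            data[pos] | (static_cast<std::size_t>(data[pos + 1]) << 8);
        pos += 2;
        if (data.size() - pos < len) throw std::runtime_error("truncated payload");
        info[moduleName][variable] =
            Bytes(data.begin() + pos, data.begin() + pos + len);
        pos += len;
    }
    return info;
}
```

With this encoding the variable can be called "Types" in every module: after linking, `decodeAll(linked).at("X").at("Types")` yields module X's data, without any per-module name mangling at the LLVM level.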