Chris Lattner wrote:
> On Sun, 16 Nov 2003, Reid Spencer wrote:
>
>> On Sun, 2003-11-16 at 11:17, Chris Lattner wrote:
>>
>>> No, it's all or nothing. Once linked, they cannot be separated (easily).
>>> However, especially when using the JIT, there is little overhead for
>>> running a gigantic program that only has 1% of the functions in it ever
>>> executed...
>>
>> Perhaps in the general case, but what if it's running on an embedded
>> system and the "gigantic program" causes an out-of-memory condition?
>
> The JIT doesn't even load unreferenced functions from the disk, so this
> shouldn't be the case... (thanks to Misha for implementing this :)
>
> Also, the globaldce pass deletes functions which can never be called by
> the program, so large hunks of libraries get summarily removed from the
> program after static linking.
>
>>> There are multiple different ways to approach these questions depending
>>> on what we want to do and what the priorities are. There are several
>>> good solutions, but for now, everything needs to be statically linked.
>>> I expect this to change over the next month or so.
>>
>> When you have time, I'd like to hear what you're planning in this area,
>> as it will directly affect how I build my compiler and VM.
>
> What do you need, and what would you like? At this point there are
> several solutions that make sense, but they have to be balanced against
> practical issues. For example, say we do IPO across the package, and then
> one of the members gets updated. How do we know to invalidate the
> results?
>
> As I think I have mentioned before, one long-term way of implementing
> this is to attach analysis results to bytecode files as well as the code.
> Thus, you could compile libc, say, with LLVM to a "shared object"
> bytecode file. While doing this, the optimizer could notice that "strlen"
> has no side effects, for example, and attach that information to the
> bytecode file.
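[Editor's note] Chris's strlen example can be made concrete. Below is a sketch in C (function names are illustrative, not from the thread) of the transformation LICM could perform once analysis has recorded that strlen is side-effect-free and its argument is not modified in the loop:

```c
#include <assert.h>
#include <string.h>

/* Before: strlen(s) sits in the loop condition. Without knowing that
 * strlen has no side effects, the compiler must re-evaluate it on
 * every iteration. */
int count_spaces_naive(const char *s) {
    int n = 0;
    for (size_t i = 0; i < strlen(s); i++)   /* strlen called each trip */
        if (s[i] == ' ')
            n++;
    return n;
}

/* After: once the analysis annotation says strlen is side-effect-free
 * and s is loop-invariant, LICM can hoist the call out of the loop. */
int count_spaces_hoisted(const char *s) {
    int n = 0;
    size_t len = strlen(s);                  /* hoisted: computed once */
    for (size_t i = 0; i < len; i++)
        if (s[i] == ' ')
            n++;
    return n;
}
```

The point of attaching the analysis result to the bytecode is that the second form can be produced even when strlen's body was never linked into the program.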
While on the subject of annotating bytecode with analysis info, could I
entice someone to also think about carrying other types of source-level
annotations through into bytecode? This is particularly useful for
situations where one wants to use the LLVM infrastructure for its
whole-program optimization capabilities, but wouldn't want to give up on
the ability to debug the final product binary. At the moment, my
understanding is that source-code annotations like file names, line
numbers, etc. aren't carried through. When one gets around to linking the
whole program, you end up with a single .s file of native machine code
(which by now is a giant collection of bits picked up from a multitude of
source files) with no ability to do symbolic debugging on the resulting
binary...

> When linking a program that uses libc, the linker wouldn't pull in any
> function bodies from "shared objects", but would read the analysis
> results and attach them to the function prototypes in the program. This
> would allow the LICM optimizer to hoist strlen calls out of loops when
> it makes sense, for example.
>
> Of course there are situations when it is better to actually link the
> function bodies into the program too. In the strlen example, it might be
> the case that the program will go faster if strlen is inlined into a
> particular call site.
>
> I'm inclined to start simple and work our way up to these cases, but if
> you have certain usage patterns in mind, I would love to hear them, and
> we can hash out what will really get implemented...
>
> -Chris
> While on the subject of annotating bytecode with analysis info, could I
> entice someone to also think about carrying other types of source-level
> annotations through into bytecode? This is particularly useful for
> situations where one wants to use the LLVM infrastructure for its
> whole-program optimization capabilities, but wouldn't want to give up
> on the ability to debug the final product binary. At the moment, my
> understanding is that source-code annotations like file names, line
> numbers, etc. aren't carried through. When one gets around to linking
> the whole program, you end up with a single .s file of native machine
> code (which by now is a giant collection of bits picked up from a
> multitude of source files) with no ability to do symbolic debugging on
> the resulting binary...

Yes, this is very true. This is on my medium-term todo list. LLVM will
definitely support this; it's just that we want to do it right, and we
are focusing on other issues at the moment (like performance).

At the moment, the best way to debug LLVM-compiled code is to use the C
backend, compile with -g, and suffer through the experience. :( Luckily,
when writing LLVM optimizations and such, bugpoint makes things much,
much nicer. :)

-Chris

--
http://llvm.cs.uiuc.edu/
http://www.nondot.org/~sabre/Projects/
On Sun, 2003-11-16 at 13:01, Vipin Gokhale wrote:
> While on the subject of annotating bytecode with analysis info, could I
> entice someone to also think about carrying other types of source-level
> annotations through into bytecode? This is particularly useful for
> situations where one wants to use the LLVM infrastructure for its
> whole-program optimization capabilities, but wouldn't want to give up
> on the ability to debug the final product binary. At the moment, my
> understanding is that source-code annotations like file names, line
> numbers, etc. aren't carried through. When one gets around to linking
> the whole program, you end up with a single .s file of native machine
> code (which by now is a giant collection of bits picked up from a
> multitude of source files) with no ability to do symbolic debugging on
> the resulting binary...

I wholeheartedly second that motion. My purposes are a little different,
however. The language for which I'm compiling (XPL) is fairly high-level.
For example, data structures such as hash tables and red-black trees are
simply referenced as "maps" which map one type to another. What exact
data structure is used underneath is up to the compiler and runtime
optimizer, even allowing transformation of the underlying type at
runtime.

For example, a map that initially contains 3 elements would probably just
be a vector of pairs, because it's pretty straightforward to linearly
scan a small table and it is space-efficient. But, as the map grows in
size, it might transform itself into a sorted vector so binary search can
be used, then into a hash table to reduce the overhead of searching
further, and then again later on into a full red-black tree. Of course,
all of this depends on whether insertions and deletions are more frequent
than lookups, etc.

The point here is that XPL needs to keep track of what a given variable
represents at the source level.
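[Editor's note] Reid's representation-switching map can be sketched roughly as follows. This is a hypothetical illustration, not XPL's actual implementation: all names and the promotion threshold are invented, and only the first transition (linear vector of pairs to sorted vector with binary search) is shown.

```c
#include <assert.h>
#include <stdlib.h>

#define PROMOTE_AT 8  /* illustrative threshold for switching forms */

typedef struct { int key, value; } Pair;

typedef struct {
    Pair *pairs;
    size_t len, cap;
    int sorted;       /* 0 = small linear form, 1 = promoted sorted form */
} AdaptiveMap;

static int pair_cmp(const void *a, const void *b) {
    /* keys are small in this sketch, so subtraction cannot overflow */
    return ((const Pair *)a)->key - ((const Pair *)b)->key;
}

void map_insert(AdaptiveMap *m, int key, int value) {
    if (m->len == m->cap) {
        m->cap = m->cap ? m->cap * 2 : 4;
        m->pairs = realloc(m->pairs, m->cap * sizeof(Pair));
    }
    m->pairs[m->len].key = key;
    m->pairs[m->len].value = value;
    m->len++;
    if (!m->sorted && m->len > PROMOTE_AT) {
        /* Promote: switch representation so lookups can binary-search. */
        qsort(m->pairs, m->len, sizeof(Pair), pair_cmp);
        m->sorted = 1;
    } else if (m->sorted) {
        /* Kept naive for brevity: re-sort after each insert to
         * preserve the sorted-form invariant. */
        qsort(m->pairs, m->len, sizeof(Pair), pair_cmp);
    }
}

int *map_find(AdaptiveMap *m, int key) {
    if (!m->sorted) {                  /* small form: linear scan is cheap */
        for (size_t i = 0; i < m->len; i++)
            if (m->pairs[i].key == key)
                return &m->pairs[i].value;
        return NULL;
    }
    Pair probe = { key, 0 };           /* promoted form: binary search */
    Pair *p = bsearch(&probe, m->pairs, m->len, sizeof(Pair), pair_cmp);
    return p ? &p->value : NULL;
}
```

The design question Reid raises is that after such a switch, the LLVM-level type no longer says "map"; only retained source-level metadata can.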
If the compiler sees a map that is initially small, it might represent it
in LLVM assembly as a vector of pairs. Later on, it gets optimized into
being a hash table. In order to do that and keep track of things, I need
to know that the vector of pairs is >intended< to be a map, not simply a
vector of pairs.

Another reason to do this is to speed up compilation time. XPL works
similarly to Java in that you define a module and "import" other modules
into it. I do not want to recompile a module each time it is imported.
I'd rather just save the static portion of the syntax tree (i.e. the
declarations) somewhere and load it en masse when it's referenced in
another compilation. Currently, I have a partially implemented solution
for this based on my persistent memory module (like an object database
for C++ that allows you to save graphs of objects onto disk via virtual
memory management tricks). When a module is referenced in an import
statement, its disk segment is located and mapped into memory in one
shot... no parsing, no linking together, just instantly available. For
large software projects with 1000s of modules, this is a HUGE compilation
time win.

Since finding LLVM, I'm wondering if it wouldn't be better to store all
the AST information in the bytecode file, so that I don't have
compilation information in one place and the code for it in another. To
do this, I'd need support from LLVM to put "compile time information"
into a bytecode or assembly file. This information would never be used at
runtime and never "optimized out". It just sits in the bytecode file
taking up space until some compiler (or other tool) asks for it.

I've given some thought to this and here's how I think it should go:

 1. Compile time information is placed in a separate section of the
    bytecode file (presumably at the end to reduce runtime I/O).

 2. Nothing in the compile time information is used at runtime. It is
    neither the subject of optimization nor execution.

 3.
    Compile time information sections are completely optional. A given
    language compiler need not utilize them, and they have no bearing on
    correct execution of the program.

 4. Compile time information is loaded only explicitly (presumably by a
    compiler based on LLVM), but possibly also by an optimization pass
    that would like to understand the higher-order semantics better
    (this would require the pass to be language-specific, presumably).

 5. Compile time information is defined as a set of global variables,
    just the same as for the runtime definitions. The full use of LLVM
    Types (especially derived types like structures and pointers) can
    be used to define the global variables.

 6. There are never any naming conflicts between compile time
    information variables in different modules. Each compile time
    global variable is, effectively, scoped in its module. This allows
    compiler writers to use the same name for various pieces of data in
    every module emitted without clashing.

 7. The exact same facilities for dealing with module-scoped types and
    variables are used to deal with the compile time information. When
    asked for it, the VMCore would produce a SymbolTable that
    references all the global types and variables in the compile time
    information.

 8. The LLVM assembler and bytecode reader will assure the syntactic
    integrity of the compile time information as they would for any
    other bytecode: they check types, pointer references, etc. and emit
    warnings (errors?) if the compiler information is not syntactically
    valid.

 9. LLVM makes no assertions about the semantics or content of the
    compile time information. It can be anything the compiler writer
    wishes to express to retain compilation information. Correctness of
    the information content (beyond syntactics) is left to the compiler
    writer. Exceptions to this rule may be warranted where there is
    general applicability to multiple source languages. Debug (file &
    line number) info would seem to be a natural exception.

 10.
    Compile time information sections are marked with a name that
    relates to the high-level compiler that produced them. This avoids
    confusion when one language attempts to read the compile time
    information of another language.

This is somewhat like an open-ended, generalized ELF section for keeping
track of compiler and/or debug information. Because it's based on
existing capabilities of LLVM, I don't think it would be particularly
difficult to implement either.

Reid.
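[Editor's note] Points 1 and 10 above — optional sections, tagged with a producer name, that an unfamiliar reader can step over — could be encoded along these lines. This is a hedged sketch with an invented record layout (it is not LLVM's actual bytecode encoding), and it omits bounds checking for brevity:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Record layout (illustrative):
 *   [1-byte name length][name bytes][4-byte payload length][payload]
 * A reader that does not recognize a section's name skips it using the
 * payload length, exactly as Reid's optional-section rule requires. */

size_t write_section(uint8_t *buf, const char *name,
                     const uint8_t *payload, uint32_t plen) {
    size_t off = 0;
    uint8_t nlen = (uint8_t)strlen(name);
    buf[off++] = nlen;
    memcpy(buf + off, name, nlen);     off += nlen;
    memcpy(buf + off, &plen, 4);       off += 4;
    memcpy(buf + off, payload, plen);  off += plen;
    return off;  /* bytes written; no overflow checks in this sketch */
}

/* Find a section by name; return its payload length, or -1 if absent.
 * Unknown sections are skipped, never parsed. */
long find_section(const uint8_t *buf, size_t len, const char *want,
                  const uint8_t **payload_out) {
    size_t off = 0;
    while (off < len) {
        uint8_t nlen = buf[off++];
        const char *name = (const char *)(buf + off);
        off += nlen;
        uint32_t plen;
        memcpy(&plen, buf + off, 4);   /* memcpy avoids unaligned reads */
        off += 4;
        if (nlen == strlen(want) && memcmp(name, want, nlen) == 0) {
            *payload_out = buf + off;
            return (long)plen;
        }
        off += plen;  /* section not understood: step over it */
    }
    return -1;
}
```

A compiler could name its section after itself (e.g. "xpl.ast"), satisfying point 10, while any other consumer simply skips it.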
> The point here is that XPL needs to keep track of what a given variable
> represents at the source level. If the compiler sees a map that is
> initially small it might represent it in LLVM assembly as a vector of
> pairs. Later on, it gets optimized into being a hash table. In order to
> do that and keep track of things, I need to know that the vector of
> pairs is >intended< to be a map, not simply a vector of pairs.

Absolutely. No matter what source language you're interested in, you want
to know about _source_ variables/types/etc, not about LLVM variables,
types, etc.

> Another reason to do this is to speed up compilation time. XPL works
> similarly to Java in that you define a module and "import" other modules
> into it. I do not want to recompile a module each time it is imported.

Makes sense. On the LLVM side of the fence, we are planning on making the
JIT cache native translations, so you only need to pay the translation
cost the first time a function is executed. This also plays into the
'offline compilation' idea as well.

> Since finding LLVM, I'm wondering if it wouldn't be better to store all
> the AST information in the bytecode file so that I don't have
> compilation information in one place and the code for it in another.
> To do this, I'd need support from LLVM to put "compile time information"
> into a bytecode or assembly file. This information would never be used
> at runtime and never "optimized out". It just sits in the bytecode file
> taking up space until some compiler (or other tool) asks for it.

Makes sense. The LLVM bytecode file is packetized specifically to support
these kinds of applications. The bytecode reader can skip over sections
it doesn't understand. The unimplemented part is figuring out a format to
put this into the .ll file (probably just a hex dump or something), and
having the compiler preserve it through optimization.

> 5.
> Compile time information is defined as a set of global variables
> just the same as for the runtime definitions. The full use of
> LLVM Types (especially derived types like structures and
> pointers) can be used to define the global variables.

If you just want to do this _today_, you already can. We have an
"appending" linkage type which can make this very simple. Basically,
global arrays with appending linkage automatically merge together when
bytecode files are linked (just like 'sections' are merged in a
traditional linker). If you want to implement your extra information
using globals, that is no problem; they will just always be loaded and
processed.

> 6. There are never any naming conflicts between compile time
>    information variables in different modules. Each compile time
>    global variable is, effectively, scoped in its module. This
>    allows compiler writers to use the same name for various pieces
>    of data in every module emitted without clashing.

If you use the appending linkage mechanism, you _want_ them to have the
same name. :)

> 7. The exact same facility for dealing with module scoped types and
>    variables are used to deal with the compile time information.
>    When asked for it, the VMCore would produce a SymbolTable that
>    references all the global types and variables in the compile
>    time information.

If you use globals directly, you can just use the standard stuff.

> 8. LLVM assembler and bytecode reader will assure the syntactic
>    integrity of the compile time information as it would for any
>    other bytecode. It checks types, pointer references, etc. and
>    emits warnings (errors?) if the compiler information is not
>    syntactically valid.

How does it do this if it doesn't understand it? I thought it would just
pass it through unmodified?

> 9. LLVM makes no assertions about the semantics or content of the
>    compile time information. It can be anything the compiler writer
>    wishes to express to retain compilation information.
> Correctness of the information content (beyond syntactics) is left
> to the compiler writer. Exceptions to this rule may be warranted where

This seems to contradict #8.

> there is general applicability to multiple source languages.
> Debug (file & line number) info would seem to be a natural
> exception.

Note that debug information doesn't work with this model. In particular,
when the LLVM optimizer transmogrifies the code, it has to update the
debug information to remain accurate. This requires understanding (at
some level) the debug format.

> 10. Compile time information sections are marked with a name that
>     relates to the high-level compiler that produced them. This
>     avoids confusion when one language attempts to read the compile
>     time information of another language.
>
> This is somewhat like an open-ended, generalized ELF section for
> keeping track of compiler and/or debug information. Because it's based
> on existing capabilities of LLVM, I don't think it would be
> particularly difficult to implement either.

There are two ways to implement this, as described above:

 1. Use global arrays of bytes or something. If you want to, your arrays
    can even have pointers to global variables and functions in them.
 2. Use an untyped blob of data, attached to the .bc file.

#2 is better from the efficiency standpoint (it doesn't need to be loaded
if not used), but #1 is already fully implemented (it is used to
implement global ctors/dtors)...

-Chris

--
http://llvm.cs.uiuc.edu/
http://www.nondot.org/~sabre/Projects/
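[Editor's note] Chris's option 1 can be illustrated in spirit with plain C. In LLVM IR the per-module arrays would carry appending linkage and be concatenated by the linker (as happens for global ctors/dtors); since appending linkage has no C equivalent, the sketch below simulates the post-link result with a single hand-merged array. All names are invented for illustration:

```c
#include <assert.h>
#include <string.h>

/* One metadata entry: a source-level name and the LLVM-level
 * representation currently standing in for it. */
typedef struct {
    const char *source_name;
    const char *llvm_repr;
} VarNote;

/* Stand-in for the merged result of linking two modules, each of which
 * emitted its own appending-linkage array under the same global name. */
static const VarNote merged_notes[] = {
    { "symbols", "vector-of-pairs" },  /* from "module A" */
    { "imports", "hash-table"      },  /* from "module B" */
};

/* A compiler or tool walks the merged array explicitly; the entries are
 * ordinary globals, so (as Chris notes) they are always loaded. */
const char *lookup_repr(const char *name) {
    size_t n = sizeof(merged_notes) / sizeof(merged_notes[0]);
    for (size_t i = 0; i < n; i++)
        if (strcmp(merged_notes[i].source_name, name) == 0)
            return merged_notes[i].llvm_repr;
    return NULL;
}
```

This also shows the trade-off Chris describes: option 1 works today with no new infrastructure, but the data is always resident, unlike a skippable blob.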