On Nov 26, 2007, at 00:47, Jon Harrop wrote:> Here is a complete 104-line native code compiler for a tiny subset > of OCaml that is expressive enough to compile an external Fibonacci > program: > > [...] > > I was kind of hoping that function pointers would just magically > work, so this: > > do (if 1 <= 2 then fib else fib) 40 > > would run, but instead it produces this error: > > $ llc -f run.bc -o run.s > llc: bitcode didn't read correctly. > Reason: Invalid instruction with no BBTry 'Llvm_analysis.assert_valid_module m;' before you write bitcode to figure out where things went awry. ('dump_module m;' may also help.) GIGO applies (but garbage-in/segfault-out is more likely). Unfortunately, even if the bindings were more strongly typed, it would still be structurally possible to build invalid LLVM code, so you've just got to take care not to violate the invariants, then use the verifier as a cross-check. Using an asserts build of LLVM is also helpful, but you should have this if you built from SVN.> I assume I need a real type system that can handle both ints and > function pointers?Something like that. If you're fond of Ocaml's tagged object model (and it does have some nice properties), you'll want to slam every value to some generic value type and use lots of bitcasts. The value type is either the native integer (i32 or i64 depending on target) or a pointer type, probably the recursive pointer type: let value_ty let temp_ty = opaque_type () in let h = handle_to_type temp_ty in refine_type temp_ty (pointer_type temp_ty); type_of_handle h Since LLVM does not have an intptr type, you'd need to know your target and parameterize your codegen accordingly to use the integer types. The bitcasts have no runtime cost, and LLVM optimization passes can simplify them. Your compiler's semantic analysis phase can be responsible for proving type-safety, so it isn't strictly necessary to propagate type annotations on expressions into the codegen phase so long as the AST nodes themselves are not polymorphic. However, the potential exists to significantly reduce memory pressure (avoiding boxing) by using a typed object model instead of a tagged one. In this case, propagating types throughout would be necessary. This decision has very significant implications for the compilation of polymorphic functions.> After that, I'll add compound types and a memory leak. ;-)Indeed. — Gordon
On Monday 26 November 2007 16:21, Gordon Henriksen wrote:> Try 'Llvm_analysis.assert_valid_module m;' before you write bitcode to > figure out where things went awry. ('dump_module m;' may also help.) > GIGO applies (but garbage-in/segfault-out is more likely).Ok, thanks for the tip.> Unfortunately, even if the bindings were more strongly typed, it would > still be structurally possible to build invalid LLVM code, so you've > just got to take care not to violate the invariants, then use the > verifier as a cross-check.I suspect the OCaml bindings to LLVM could express more of these constraints in the static type system. Has anyone tried to leverage this? I'd certainly like to do this for a higher-level representation. I have lots of ideas but I suspect that is OT for this list.> Using an asserts build of LLVM is also helpful, but you should have > this if you built from SVN.I did, yes.> > I assume I need a real type system that can handle both ints and > > function pointers? > > Something like that. > > If you're fond of Ocaml's tagged object model (and it does have some > nice properties), you'll want to slam every value to some generic > value type and use lots of bitcasts. The value type is either the > native integer (i32 or i64 depending on target) or a pointer type, > probably the recursive pointer type: > > let value_ty > let temp_ty = opaque_type () in > let h = handle_to_type temp_ty in > refine_type temp_ty (pointer_type temp_ty); > type_of_handle h > > Since LLVM does not have an intptr type, you'd need to know your > target and parameterize your codegen accordingly to use the integer > types. > > The bitcasts have no runtime cost, and LLVM optimization passes can > simplify them. Your compiler's semantic analysis phase can be > responsible for proving type-safety, so it isn't strictly necessary to > propagate type annotations on expressions into the codegen phase so > long as the AST nodes themselves are not polymorphic. > > However, the potential exists to significantly reduce memory pressure > (avoiding boxing) by using a typed object model instead of a tagged > one. In this case, propagating types throughout would be necessary. > This decision has very significant implications for the compilation of > polymorphic functions.So this is the part where I start asking really basic questions. :-) As I understand it, the simplest approach is to box all values, which means that every value is stored as a struct containing the type (an enum) and the value itself (inline if it is no bigger than a pointer or as a pointer to a larger data structure otherwise). Something like this: enum runtype { Int; Float; Array; } struct array { int length; box *a; } union value { int n; // 64-bit float x; // 64-bit array* u; // >64-bit } struct box { runtype t; value v; } So you'd create an int with: box make_int(int n) { box b; b.t = Int; b.v.n = n; return b; } Is that right or are values stored as a box*? I'd rather box everything rather than tag ints to start with. I'll relegate that to a potential optimization. What of this can LLVM's optimizer optimize away for me? So I have to work out how to generate IR that handles those data structures. -- Dr Jon D Harrop, Flying Frog Consultancy Ltd. http://www.ffconsultancy.com/products/?e
On Nov 26, 2007, at 14:18, Jon Harrop wrote:> On Monday 26 November 2007 16:21, Gordon Henriksen wrote: >> > >> Unfortunately, even if the bindings were more strongly typed, it >> would still be structurally possible to build invalid LLVM code, so >> you've just got to take care not to violate the invariants, then >> use the verifier as a cross-check. > > I suspect the OCaml bindings to LLVM could express more of these > constraints in the static type system. Has anyone tried to leverage > this?Any specific ideas?>>> I assume I need a real type system that can handle both ints and >>> function pointers? >> > > As I understand it, the simplest approach is to box all values, > which means that every value is stored as a struct containing the > type (an enum) and the value itself (inline if it is no bigger than > a pointer or as a pointer to a larger data structure otherwise). > > I'd rather box everything rather than tag ints to start with. I'll > relegate that to a potential optimization.You could go about it that way, sure. Memory pressure will be very high.> What of this can LLVM's optimizer optimize away for me?Not much. LLVM won't really change your memory layout. The features which did perform dramatic memory reorganization were excised from the source tree due to patent infringement issues. (They remain in the repository on a branch for UIUC research.) These were also targeted toward manual memory management. It should be possible for LLVM to perform intra- or interprocedural escape analysis and lower heap allocations to stack allocations, but I don't believe this is implemented. Garbage collection would also complicate it, since LLVM would need to recognize GC allocator functions as such. This would be an excellent project if you are interested in contributing to LLVM itself.> So I have to work out how to generate IR that handles those data > structures.Should be straightforward. Read the getelementptr FAQ. Unions are handled with pointer bitcasts. http://llvm.org/docs/GetElementPtr.html — Gordon