thr3ads.net - llvm dev - [LLVMdev] Changing pointer representation? [Dec 2006]

Jules

2006-Dec-01 10:41 UTC

[LLVMdev] Changing pointer representation?

Having finally found some time to work on this project, I'm currently 
looking at mechanisms of augmenting LLVM to catch out-of-bounds pointer 
references.

For a variety of reasons, I don't think the approach taken by the 
Safecode project is appropriate for mine -- particularly, I have no 
requirement to interface to external code (all code in the system will 
either be compiled using LLVM or written specifically to interface with 
LLVM-compiled code), which invalidates a key assumption of that 
project.  Therefore, having looked at the available options, I've 
decided a so-called "fat pointer" representation is ideal for my
project.

I can see two possible approaches for this:

* Modify the LLVM machine-code backend to use a 64-bit pointer 
representation (32-bit base address, which points to an object 
descriptor, and a 32-bit offset from the base of the object for the data 
item pointed to) on a 32-bit architecture (or 128 bits on a 64-bit 
architecture), and then change the definition of the dereference 
instruction to check the range with the descriptor, or
* Create an optimizer pass that performs a code translation, modifying 
all places where pointers are stored to include base pointers and 
offsets (i.e., replace 'zzz *' with '{ { int, [0 x zzz] }*, int }'), and 
all places where pointers are referenced and dereferenced to track and 
check the base and limits from the descriptors.  It then becomes illegal 
to perform indexing on a pointer that does not point to the base of an 
object.

I'm currently leaning towards the latter, primarily because it seems 
more general; in the end, I'm going to want at least x86 and x86-64 
support, and the former approach will mean I'll need to do the work 
twice for two different platforms.
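
For illustration, the pair type from the second option could look roughly 
like this when written out in C.  This is only a sketch under stated 
assumptions: the names object_t, fatptr_t, fat_alloc, fat_in_bounds and 
fat_load_i32 are hypothetical, the length field is taken to be a byte 
count, and a failed check is reported through a return value rather than 
an exception.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Object with a length header, mirroring '{ int, [0 x zzz] }'.
 * Assumption: 'length' counts payload bytes. */
typedef struct {
    uint32_t      length;
    unsigned char data[];   /* payload follows the header */
} object_t;

/* Fat pointer, mirroring '{ { int, [0 x zzz] }*, int }'.
 * 'offset' is a byte offset from the start of the object, so the
 * first payload byte sits at offsetof(object_t, data). */
typedef struct {
    object_t *base;
    uint32_t  offset;
} fatptr_t;

/* Allocate a zeroed object and return a fat pointer to its first
 * payload byte.  On allocation failure 'base' is NULL. */
fatptr_t fat_alloc(uint32_t nbytes) {
    object_t *obj = malloc(sizeof(object_t) + nbytes);
    fatptr_t p = { obj, (uint32_t)offsetof(object_t, data) };
    if (obj != NULL) {
        obj->length = nbytes;
        memset(obj->data, 0, nbytes);
    }
    return p;
}

/* Return 1 if an access of 'size' bytes through p stays inside the
 * payload, 0 otherwise. */
int fat_in_bounds(fatptr_t p, uint32_t size) {
    uint32_t first = (uint32_t)offsetof(object_t, data);
    if (p.base == NULL || p.offset < first)
        return 0;
    uint32_t payload_off = p.offset - first;
    return payload_off <= p.base->length &&
           size <= p.base->length - payload_off;
}

/* Checked 32-bit load: returns 1 and writes *out on success, 0 where
 * the real system would raise its out-of-bounds error. */
int fat_load_i32(fatptr_t p, int32_t *out) {
    if (!fat_in_bounds(p, sizeof *out))
        return 0;
    memcpy(out, (unsigned char *)p.base + p.offset, sizeof *out);
    return 1;
}
```

With a 4-byte length field the first payload byte lands at offset 4, 
which matches the "lowest allowable offset" described further down.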

I'm also trying to work out what to do about pointers to elements of 
complex structures, and what kind of dereferencing is allowed on those.  
My current feeling is:
* If an object has an associated descriptor, the lowest allowable offset 
will be 4 (because offset 0 contains the length of the object).  This 
means I can reserve offset 0 as an indicator for 'this object doesn't 
have a descriptor' and cause any dereferencing of the result of pointer 
arithmetic on such an offset-0 pointer to fail.  I'd probably swap the 
pointer for a special 'invalid pointer' value on detecting such 
arithmetic.
* All arrays should have a descriptor, wherever they're allocated: as 
part of a complex type, directly on the stack, or on the heap.
* This means I'll need to change the behaviour of:
   * getelementptr, to set 'invalid pointer' values whenever an offset 0
pointer is used with a nonzero index, or if the result of a manipulation 
would be to access offset 0 of a pointer that isn't at offset 0, and to 
skip the descriptor on arrays embedded inside a complex type
   * load and store instructions, to throw an exception on invalid 
pointers and check bounds on pointers with descriptors, and to load and 
store both base and offset whenever storing a pointer's data
   * Any instruction that generates a pointer as its result, to produce 
the base and offset rather than a simple pointer.
      In most cases the offset will be zero.  There's probably an 
optimisation here that avoids producing the offset in many cases, 
perhaps by delaying its production until the pointer is stored in a 
pointer variable.
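
The rules in the list above can be sketched in C as follows.  Everything 
named here (desc_t, fptr_t, FPTR_INVALID, fptr_new_array, fptr_gep, 
fptr_load_u8) is hypothetical, the length is assumed to be a byte count, 
and a return value of 0 stands in for the exception.  The sketch is also 
slightly stricter than the text: any arithmetic result that lands inside 
the length header is made invalid, not only offset 0.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Array object with a length header (assumed to count payload bytes). */
typedef struct {
    uint32_t      length;
    unsigned char data[];
} desc_t;

/* Size of the header; the lowest offset a checked pointer may have. */
#define HDR ((uint32_t)offsetof(desc_t, data))

/* Fat pointer.  offset == 0 means "no descriptor": base points at a
 * plain object and no arithmetic is allowed on it. */
typedef struct {
    void    *base;
    uint32_t offset;
} fptr_t;

/* The special 'invalid pointer' value. */
static const fptr_t FPTR_INVALID = { NULL, UINT32_MAX };

int fptr_is_invalid(fptr_t p) {
    return p.base == NULL && p.offset == UINT32_MAX;
}

/* Allocate a zeroed array with a descriptor; the result points at the
 * first element (offset HDR). */
fptr_t fptr_new_array(uint32_t nbytes) {
    desc_t *d = calloc(1, sizeof(desc_t) + nbytes);
    if (d == NULL)
        return FPTR_INVALID;
    d->length = nbytes;
    fptr_t p = { d, HDR };
    return p;
}

/* getelementptr analogue: move p by 'delta' bytes. */
fptr_t fptr_gep(fptr_t p, int64_t delta) {
    if (fptr_is_invalid(p) || delta == 0)
        return p;
    if (p.offset == 0)                      /* no descriptor: no indexing */
        return FPTR_INVALID;
    int64_t off = (int64_t)p.offset + delta;
    if (off < (int64_t)HDR || off >= (int64_t)UINT32_MAX)
        return FPTR_INVALID;                /* would land in the header */
    p.offset = (uint32_t)off;
    return p;
}

/* Checked byte load: returns 1 and writes *out on success, 0 where the
 * real system would throw. */
int fptr_load_u8(fptr_t p, unsigned char *out) {
    if (fptr_is_invalid(p) || p.base == NULL)
        return 0;
    if (p.offset == 0) {                    /* plain object, nothing to check */
        *out = *(const unsigned char *)p.base;
        return 1;
    }
    const desc_t *d = p.base;
    if (p.offset - HDR >= d->length)        /* wraps to a huge value if offset < HDR */
        return 0;
    *out = d->data[p.offset - HDR];
    return 1;
}
```

Note that fptr_gep itself never fails on an out-of-range upper offset; 
it only poisons pointers that break the offset-0 rules, and the bounds 
check against the descriptor happens at the load.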

It occurs to me that some of the people here have surely worked on this 
kind of thing before, and perhaps can relate some experiences of things 
that have either worked or not worked.  Am I doing anything stupid here?

Thanks!

Jules

Vikram S. Adve

2006-Dec-01 15:47 UTC

[LLVMdev] Changing pointer representation?

If you don't need to interface with externally compiled code at all,  
then using fat pointers is the right way to go.  It should be more  
efficient than even the SAFECode strategy.

--Vikram



Chris Lattner

2006-Dec-01 20:06 UTC

[LLVMdev] Changing pointer representation?

On Fri, 1 Dec 2006, Jules wrote:
> Having finally found some time to work on this project, I'm currently
> looking at mechanisms of augmenting LLVM to catch out-of-bounds pointer
> references.

ok

> I'm currently leaning towards the latter, primarily because it seems
> more general; in the end, I'm going to want at least x86 and x86-64
> support, and the former approach will mean I'll need to do the work
> twice for two different platforms.

This could work, but the transformation is tricky, particularly in the 
face of recursive types.  I suggest:

#3: change the CFE (the C front end) so that it lowers C pointers to your 
pair.  This should be relatively straightforward, and solves the 
recursive type issue.

> It occurs to me that some of the people here have surely worked on this
> kind of thing before, and perhaps can relate some experiences of things
> that have either worked or not worked.  Am I doing anything stupid here?

I think the easiest thing to do is to change how the CFE expands the 
operations you care about.  This keeps the LLVM-level semantics the same 
as they are now (so all optimizations will work, etc.), and the bounds 
checks are exposed to the LLVM optimizers, so they can be eliminated.
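
As a rough illustration of why exposing the checks matters, here is a 
hand-written C sketch of what a front end might emit for a simple loop, 
next to the form an optimizer could reduce it to.  The names int_array_t, 
sum_checked and sum_hoisted are hypothetical, the length is assumed to 
count elements, and the thread does not say which of LLVM's passes, if 
any, would perform this particular rewrite.

```c
#include <stdint.h>

/* Array with a length header (assumed to count elements). */
typedef struct {
    uint32_t length;
    int32_t  data[];
} int_array_t;

/* As a front end might lower 'sum += a[i]': one bounds check per
 * access, written as ordinary control flow.  Returns 0 where the real
 * system would report an out-of-bounds access. */
int sum_checked(const int_array_t *a, uint32_t n, int64_t *out) {
    int64_t sum = 0;
    for (uint32_t i = 0; i < n; i++) {
        if (i >= a->length)
            return 0;
        sum += a->data[i];
    }
    *out = sum;
    return 1;
}

/* An equivalent form with the check hoisted out of the loop.  Because
 * the check is plain code, an optimizer that can prove i < n <= length
 * is free to make this transformation itself. */
int sum_hoisted(const int_array_t *a, uint32_t n, int64_t *out) {
    if (n > a->length)
        return 0;
    int64_t sum = 0;
    for (uint32_t i = 0; i < n; i++)
        sum += a->data[i];
    *out = sum;
    return 1;
}
```

Both functions return the same result for every input; the second 
simply pays for one comparison instead of n.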

An alternative approach would be to do this entirely in the code 
generator, hiding all the action from the optimizers.  This would also 
work, but would be target-specific and would not let you do aggressive 
optimizations of the bounds-check code.

-Chris

-- 
http://nondot.org/sabre/
http://llvm.org/
