Wan Zhiyuan
2015-May-05 21:33 UTC
[LLVMdev] llvm DSA - reproduce the result in PLDI 07 paper
Dear John,

I intend to implement the improvements to DSA. After running DSA on SPEC, I found that DSA gives low precision for mcf and bzip2. I have checked the most imprecise C files in mcf and found that the code seems to be a mixture of "PHI" and "GEP" instructions.

Could you please give me some hints about what the big picture of the improvement should be and how to start?

Thank you!

Regards,
Zhiyuan

On Mon, Apr 6, 2015 at 5:22 PM, John Criswell <jtcriswel at gmail.com> wrote:
> Dear Zhiyuan,
>
> In order to reproduce the results from the paper, you'll need to replicate
> a system from that era. You'll need to use the same version of LLVM and
> DSA that the paper used. I think that was LLVM 1.9 (the release_19 branch
> of LLVM and poolalloc), but I'm not sure. You should check to see if the
> paper specifies the version.
>
> As you'll be using a very old version of LLVM, it may be worth setting up
> a VM with a correspondingly old version of Linux. I suspect newer compilers
> will not compile a version of LLVM that old, so using an old version of
> Linux with an old version of GCC may be needed. I think Fedora Core 2 is
> the OS we were using at the time.
>
> To answer the question of why you can't use a modern version of LLVM and
> poolalloc: it's because LLVM has changed significantly. DSA relies upon
> the type annotations provided in the LLVM IR to "bootstrap" its type
> inference (bootstrap is not quite the right word, but it's the closest one
> of which I could think). As LLVM matured, transformations would ditch the
> type information (e.g., transforming typed GEPs into untyped GEPs into a
> byte array), making DSA's type inference (and thereby its field
> sensitivity) more difficult. Throw into the mix the fact that DSA is
> maintained by an academic research group, and the result is that modern
> DSA doesn't have the accuracy that the original DSA did.
>
> The good news is that I think DSA can be fixed by making its
> type-inferencing code smarter. The bad news is that it'd be a fair amount
> of work to do. So far, no one has had sufficient desire/motivation to
> design and implement the improvements.
>
> Regards,
>
> John Criswell
>
> On 4/6/15 4:56 PM, Wan Zhiyuan wrote:
>> Dear all,
>> I am trying to reproduce the "Percent May Alias" result described in PLDI
>> 07's paper "Making Context-Sensitive Points-to Analysis with Heap Cloning
>> Practical For The Real World"
>> (http://llvm.org/pubs/2007-06-10-PLDI-DSA.html).
>>
>> However, my "Percent May Alias" for all the benchmarks is much greater,
>> especially for "bzip2".
>>
>> The DSA code I use is from "https://github.com/smackers/smack". I have
>> diffed the code between smack and poolalloc_32. They are almost the same
>> except for the "#include" statements.
>>
>> I was wondering whether I need to do some configuration to make DSA work
>> properly.
>>
>> Thank you!
>> Zhiyuan
>>
>> _______________________________________________
>> LLVM Developers mailing list
>> LLVMdev at cs.uiuc.edu
>> http://llvm.cs.uiuc.edu
>> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>
> --
> John Criswell
> Assistant Professor
> Department of Computer Science, University of Rochester
> http://www.cs.rochester.edu/u/criswell
John Criswell
2015-May-06 14:35 UTC
[LLVMdev] llvm DSA - reproduce the result in PLDI 07 paper
On 5/5/15 5:33 PM, Wan Zhiyuan wrote:
> Dear John,
> I intend to implement the improvements to DSA.

There be nasty dragons in DSA. Don't say I didn't warn you. :)

On a side note, I recently discovered that someone (I think Will Dietz) updated the code to work with LLVM mainline, so you should be able to update to a newer version of LLVM and use DSA. I've created a snapshot of the LLVM source tree and a modified version of the poolalloc tree in https://github.com/jtcriswell/llvm-dsa that you may find useful. I think Will is also maintaining a github tree that he updates regularly.

> After running DSA on SPEC, I found DSA gives low precision for mcf and
> bzip2. I have checked the most imprecise C files in mcf and found that
> the code seems to be a mixture of "PHI" and "GEP" instructions.
>
> Could you please give me some hints about what the big picture of the
> improvement should be and how to start?

There may be many reasons why DSA's precision is low. You'll need to figure out what the reason is for each program. In some cases, it may be type inference. In other cases, it may be that DSA is cognizant of the influence of external code (which causes it to correctly give pessimistic results). For programs using function pointers, there's a bug in the algorithm in which DSNodes that should be marked complete are not. It could also be that you're using the wrong DSA pass (e.g., using EQTD when TD will do).

Looking at type inference specifically, the problem, in a nutshell, is that DSA's type inference should not rely upon LLVM's type annotations. It should just create a map from offsets to types. Some work on this has already been done (e.g., DSA can figure out that casting a pointer to a structure into a pointer to the type of the structure's first field still yields offset 0). However, there are still places in which DSA relies upon LLVM's type annotations.

One thing that I would like to look at is changing how DSA analyzes arrays of structures.
Right now, DSA tries to infer a structure type for a memory object and then tries to infer whether the memory object is a singleton structure or an array of structures (you can see this in DSA's interface: there is a map between offsets and types, and an 'A' flag indicating that the object is an array). I think this makes DSA needlessly complicated.

I think it would be better if DSA did what Value Set Analysis (VSA) does. VSA was designed to analyze untyped binary code. For each abstract memory object, it creates a map of 4-tuples to types. Each 4-tuple (a, b, c, d) represents a formula ax+b in which b is the offset, a is the stride, and x is constrained between the values c and d (c and d can be constants or +/- infinity). For example, if you have an array of struct { int a; char *b; } on a 32-bit machine, DSA currently tries to figure out that there's an array of structure elements in which there's an int at offset 0 and a char * at offset 4 within the structure. VSA would instead say that there's a memory object with an int at every offset of the form 8x+0 and a char * at every offset of the form 8x+4 (the stride is the 8-byte element size).

The VSA approach should be agnostic to all sorts of weird casting, embedded arrays, and embedded unions, though this is an educated guess at present.

Regards,

John Criswell

--
John Criswell
Assistant Professor
Department of Computer Science, University of Rochester
http://www.cs.rochester.edu/u/criswell
Wan Zhiyuan
2015-May-07 03:07 UTC
[LLVMdev] llvm DSA - reproduce the result in PLDI 07 paper
Dear John,

Thank you so much for your reply! Please kindly correct me if my understanding is wrong.

From your reply, I learn that there could be four main reasons why DSA's precision is low:

1. type inference;
2. the influence of external code;
3. a bug in handling function pointers;
4. using the wrong DSA pass.

As you mentioned in your previous mail, LLVM has changed a lot. So I was wondering whether DSA's algorithm (presented in the PLDI '07 paper) can achieve comparable precision on top of LLVM 3.5+ after implementing the improvements.

Also, I have diffed the DSA code in poolalloc release_19 and release_32. It seems to me that the changes in DSA mainly come from the following aspects:

1. release_32 introduces the DataLayout to do some type inference;
2. release_32 replaces some data structures with LLVM built-in data structures;
3. release_32 handles different types of LLVM IR instructions, and in a different way;
4. the inheritance of the analysis passes has changed a bit.

Therefore, I was wondering whether the introduction of DataLayout and the changes in LLVM IR cause the imprecision as well.

Thank you!

Regards,
Zhiyuan Wan