thr3ads.net - llvm dev - [llvm-dev] [RFC] Introducing a byte type to LLVM [Jun 2021]

If this information is useful, please help other people find it:
Share via:

Ralf Jung via llvm-dev

2021-Jun-22 19:40 UTC

[llvm-dev] [RFC] Introducing a byte type to LLVM

Hi John,
>     Note that "provenance" as we use it in this discussion is an
*explicit
>     operational artifact* -- it exists as a concrete piece of state in the
>     Abstract Machine. That is very different from something that might just
be
>     used internally in some kind of analysis.
> 
>     There is no problem with "resetting" that provenance on a
"inttoptr", and
>     basically forgetting about where the int comes from. Note that this is
a
>     statement about an operation in the Abstract Machine, not merely a
statement
>     about some analysis: this is not "forgetting" as in
"safely
>     overapproximating the real set of possible behaviors", it is
"forgetting" by
>     *forcing* the provenance to be some kind of wildcard/default
provenance. All
>     analyses then have to correctly account for that.
> 
> But it’s not a truly wildcard provenance. At the very least, it’s
> restricted to the set of provenances that have been exposed, and
> my understanding is that Juneyoung’s non-determinism rule attempts
> to readmit a use-dependency analysis and turn |ptrtoint| back into
> a simple scalar operation.
So, at least in the paper Juneyoung and me and a few other people wrote (the one
referenced a couple times in this thread already: 
https://people.mpi-sws.org/~jung/twinsem/twinsem.pdf), it is a wildcard
provenance.
"exposed" is not a thing in the operational semantics, it is a theorem
we can prove.
> 
>         For example, you have compellingly argued that it’s problematic to
>         do the reduction |a == b ? a : b| to |b| for pointer types. Suppose
>         I instead do this optimization on integers where |a = ptrtoint A|.
>         The result is now simply |b|. If I |inttoptr| that result and
access
>         the memory, there will be no record that that access may validly
>         be to |A|. It does not help that the access may be represented
>         as |inttoptr (ptrtoint B)| for some |B| rather than just directly
>         to |B|, because there is no use-dependence on |A|. All there is
>         is an apparently unrelated and unused |ptrtoint A|.
> 
>     So that would be "ptrtoint A == ptrtoint B ? ptrtoint A : ptrtoint
B" being
>     replaced by "ptrtoint B"? I don't see any problem with
that. Do you have a
>     concrete example?
> 
> I think you can take any example where pointer-type restrictions
> would be necessary to protect against miscompilation and turn it
> into an example here by just inserting |inttoptr| and |ptrtoint|
> appropriately. Quick example:
> 
> |int A = 0x1; int B = 0x2; long a = (long) (A+1); long b = (long) B; long
result
> = (a == b ? a : b); if (a == b) *(((int*) result) - 1) = 0x4; else *((int*)
> result) = 0x8; printf(“%d %d\n”, A, B); |
(Sorry about the bad quote, not sure what Thunderbird does here.)
I assume you mean "&A" and "&B" in lines 3 and 4?

> I submit that this program has unspecified but not undefined behavior,
> with printing “1 8” and “4 2” being the only two valid outcomes.
Okay, I can follow that.
> But I think an optimizer which changes the fifth line to
> |long result = b;| without leaving any trace behind
so then we are at

int A = 0x1;
int B = 0x2;
long a = (long) (&A+1);
long b = (long) &B;
long result = b;
if (a == b)
   *(((int*) result) - 1) = 0x4;
else
   *((int*) result) = 0x8;
printf(“%d %d\n”, A, B);

> could easily
> compile this to print “1 2” because there would be nothing to
> prevent the initialization of |A| from being forwarded to the
> final load.
I don't think that's right, since "a" is still used in the
"if", so a bit of
information about the address is leaked and you can't assume the address was
not
guessed. The furthest you can go is

int A = 0x1;
int B = 0x2;
long a = (long) (&A+1);
long b = (long) &B;
if (a == b)
   *(((int*) (long) &B) - 1) = 0x4;
else
   *((int*) (long) &B) = 0x8;
printf(“%d %d\n”, A, B);

No need for an explicit 'exposed', I think.

That said, things get more tricky once you want to account for C
'restrict' /
LLVM 'noalias', and then it might be necessary to have explicit
'exposed'-like
operations. I haven't seen a properly worked-out model for this, but the 
candidate models I saw would need something like this. So maybe it'd not be
a
bad idea to have such an operation anyway... I just don't think it should
have
any effect for the kind of programs we have been discussing here so far.

Kind regards,
Ralf
> 
> You can prevent this by noting that the provenance of |A| has been
> xposed and allowing the |inttoptr| of |result| to alias |A|, but
> that seems inconsistent with treating |ptrtoint| as a simple scalar
> operation and allowing a use-analysis of a |ptrtoint| to restrict
> which |inttoptr| casts are allowed to recreate provenance for the
> |ptrtoint| operand.
> 
> If you want to keep treating |ptrtoint| as a scalar operation and
> doing use-analyses on it, I think the most palatable option is to
> recognize whenever you’re cutting a use-dependency and
> conservatively record in IR that the original value has now been
> exposed. So if you start with this:
> 
> |%eq = icmp eq i32 %a, %b %result = select i1 %eq, i32 %a, i32 %b |
> 
> You have to transform it like this:
> 
> |%result = %b call void @llvm.expose.i32(i32 %a) |
> 
> You should be able to remove these exposure events in a lot of
> situations, but conservatively they’ll have to be treated as
> escapes.
> 
> Most optimizations never cut use-dependencies on opaque values
> like this and so won’t be affected.
> 
> John.
> 
-- 
Website: https://people.mpi-sws.org/~jung/

John McCall via llvm-dev

2021-Jun-22 20:52 UTC

head link

[llvm-dev] [RFC] Introducing a byte type to LLVM

On 22 Jun 2021, at 15:40, Ralf Jung wrote:>>     Note that "provenance" as we use it in this discussion is
an
>> *explicit
>>     operational artifact* -- it exists as a concrete piece of state 
>> in the
>>     Abstract Machine. That is very different from something that 
>> might just be
>>     used internally in some kind of analysis.
>>
>>     There is no problem with "resetting" that provenance on a
>> "inttoptr", and
>>     basically forgetting about where the int comes from. Note that 
>> this is a
>>     statement about an operation in the Abstract Machine, not merely 
>> a statement
>>     about some analysis: this is not "forgetting" as in
"safely
>>     overapproximating the real set of possible behaviors", it is 
>> "forgetting" by
>>     *forcing* the provenance to be some kind of wildcard/default 
>> provenance. All
>>     analyses then have to correctly account for that.
>>
>> But it’s not a truly wildcard provenance. At the very least, it’s
>> restricted to the set of provenances that have been exposed, and
>> my understanding is that Juneyoung’s non-determinism rule attempts
>> to readmit a use-dependency analysis and turn |ptrtoint| back into
>> a simple scalar operation.
>
> So, at least in the paper Juneyoung and me and a few other people 
> wrote (the one referenced a couple times in this thread already: 
> https://people.mpi-sws.org/~jung/twinsem/twinsem.pdf), it is a 
> wildcard provenance.
> "exposed" is not a thing in the operational semantics, it is a
theorem
> we can prove.
Okay, fine, I won’t say “exposure”. :)  You intend to still be 
able to
prove that the reconstructed provenance cannot be that of certain
memory locations based on a comprehensive use analysis of those
locations.
>>         For example, you have compellingly argued that it’s 
>> problematic to
>>         do the reduction |a == b ? a : b| to |b| for pointer types. 
>> Suppose
>>         I instead do this optimization on integers where |a = 
>> ptrtoint A|.
>>         The result is now simply |b|. If I |inttoptr| that result and 
>> access
>>         the memory, there will be no record that that access may 
>> validly
>>         be to |A|. It does not help that the access may be 
>> represented
>>         as |inttoptr (ptrtoint B)| for some |B| rather than just 
>> directly
>>         to |B|, because there is no use-dependence on |A|. All there 
>> is
>>         is an apparently unrelated and unused |ptrtoint A|.
>>
>>     So that would be "ptrtoint A == ptrtoint B ? ptrtoint A : 
>> ptrtoint B" being
>>     replaced by "ptrtoint B"? I don't see any problem
with that. Do
>> you have a
>>     concrete example?
>>
>> I think you can take any example where pointer-type restrictions
>> would be necessary to protect against miscompilation and turn it
>> into an example here by just inserting |inttoptr| and |ptrtoint|
>> appropriately. Quick example:
>>
>> |int A = 0x1; int B = 0x2; long a = (long) (A+1); long b = (long) B; 
>> long result = (a == b ? a : b); if (a == b) *(((int*) result) - 1) = 
>> 0x4; else *((int*) result) = 0x8; printf(“%d %d\n”, A, B); |
>
> (Sorry about the bad quote, not sure what Thunderbird does here.)
> I assume you mean "&A" and "&B" in lines 3 and
4?
Yes, sorry.
>> I submit that this program has unspecified but not undefined 
>> behavior,
>> with printing “1 8” and “4 2” being the only two valid 
>> outcomes.
>
> Okay, I can follow that.
>
>> But I think an optimizer which changes the fifth line to
>> |long result = b;| without leaving any trace behind
>
> so then we are at
>
> int A = 0x1;
> int B = 0x2;
> long a = (long) (&A+1);
> long b = (long) &B;
> long result = b;
> if (a == b)
>   *(((int*) result) - 1) = 0x4;
> else
>   *((int*) result) = 0x8;
> printf(“%d %d\n”, A, B);
>
>
>> could easily
>> compile this to print “1 2” because there would be nothing to
>> prevent the initialization of |A| from being forwarded to the
>> final load.
>
> I don't think that's right, since "a" is still used in
the "if", so a
> bit of information about the address is leaked and you can't assume 
> the address was not guessed.
Okay, so you have a baseline expectation that pointer comparisons will 
be treated as escapes.  The reason I added a second comparison in my 
example is just because otherwise in the `a != b` case the access is out 
of bounds.  So I guess you’re suggesting that it’s impossible to 
construct an example that avoids that problem that wouldn’t ultimately 
do something uneliminatable that would count as an escape?
> That said, things get more tricky once you want to account for C 
> 'restrict' / LLVM 'noalias', and then it might be necessary
to have
> explicit 'exposed'-like operations. I haven't seen a properly 
> worked-out model for this, but the candidate models I saw would need 
> something like this. So maybe it'd not be a bad idea to have such an 
> operation anyway... I just don't think it should have any effect for 
> the kind of programs we have been discussing here so far.
Honestly, part of why I like this approach is that I think the basic 
idea — recognizing when we’re doing a dependency-breaking value 
replacement and using a different replacement sequence — also moves us 
towards a solution for `consume` dependencies.

John.
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20210622/e057e169/attachment.html>

llvm dev - Jun 2021 - [RFC] Introducing a byte type to LLVM

[llvm-dev] [RFC] Introducing a byte type to LLVM

[llvm-dev] [RFC] Introducing a byte type to LLVM