thr3ads.net - llvm dev - [llvm-dev] [RFC] Adding range metadata to array subscripts. [Mar 2021]

If this information is useful, please help other people find it:
Share via:

Johannes Doerfert via llvm-dev

2021-Mar-27 20:37 UTC

[llvm-dev] [RFC] Adding range metadata to array subscripts.

On 3/27/21 1:30 PM, Florian Hahn wrote:>
>> On Mar 24, 2021, at 19:32, Johannes Doerfert <johannesdoerfert at
gmail.com> wrote:
>>
>>
>> On 3/24/21 12:47 PM, Florian Hahn wrote:
>>>> On Mar 24, 2021, at 15:16, Johannes Doerfert via llvm-dev
<llvm-dev at lists.llvm.org> wrote:
>>>>
>>>>
>>>> On 3/24/21 9:06 AM, Clement Courbet wrote:
>>>>> On Wed, Mar 24, 2021 at 2:20 PM Johannes Doerfert <
>>>>> johannesdoerfert at gmail.com> wrote:
>>>>>
>>>>>> I really like encoding more (range) information in the
IR,
>>>>>> more thoughts inlined.
>>>>>>
>>>>>> On 3/24/21 4:14 AM, Clement Courbet via llvm-dev wrote:
>>>>>>> Hi everyone,
>>>>>>>
>>>>>>> tl;dr: I would like to teach clang to output range
metadata so that LLVM
>>>>>>> can do better alias analysis. I have a proposal as
D99248
>>>>>>> <https://reviews.llvm.org/D99248> (clang
part) and D99247
>>>>>>> <https://reviews.llvm.org/D99247> (llvm
part). But there are other
>>>>>> possible
>>>>>>> options that I'm detailing below.
>>>>>>>
>>>>>>> Consider the following code, adapted from brotli
>>>>>>> <https://en.wikipedia.org/wiki/Brotli>:
>>>>>>>
>>>>>>> ```
>>>>>>>
>>>>>>> struct Histogram {
>>>>>>>
>>>>>>>     int values[256];
>>>>>>>
>>>>>>>     int total;
>>>>>>>
>>>>>>> };
>>>>>>>
>>>>>>> Histogram DoIt(const int* image, int size) {
>>>>>>>
>>>>>>>     Histogram histogram;
>>>>>>>
>>>>>>>     for (int i = 0; i < size; ++i) {
>>>>>>>
>>>>>>>       ++histogram.values[image[i]];  // (A)
>>>>>>>
>>>>>>>       ++histogram.total;             // (B)
>>>>>>>
>>>>>>>     }
>>>>>>>
>>>>>>>     return histogram;
>>>>>>>
>>>>>>> }
>>>>>>> ```
>>>>>>>
>>>>>>> In this code, the compiler does not know anything
about the values of
>>>>>>> images[i], so it assumes that 256 is a possible
value for it. In that
>>>>>> case,
>>>>>>> (A) would change the value of histogram.total, so
(B) has to load, add
>>>>>> one
>>>>>>> and store [godbolt
<https://godbolt.org/z/KxE343>].
>>>>>>>
>>>>>>> Fortunately, C/C++ has a rule that it is invalid
(actually, UB) to use
>>>>>>> values to form a pointer to total and dereference
it. What valid C/C++
>>>>>> code
>>>>>>> is allowed to do with values is:
>>>>>>>    - Form any pointer in [values, values + 256].
>>>>>>>    - Form and dereference any pointer in [values,
values + 256)
>>>>>>>
>>>>>>> Note that the LLVM memory model is much laxer than
that of C/C++. It has
>>>>>> no
>>>>>>> notion of types. In particular, given an LLVM
aggregate definition:
>>>>>>>
>>>>>>> ```
>>>>>>> %struct.S = type { [42 x i32], i32, i32 }
>>>>>>> ```
>>>>>>>
>>>>>>> It is perfectly valid to use an address derived
from a GEP(0,0,%i)  [gep
>>>>>>> reference] representing indexing into the [42 x
i32] array to load the
>>>>>> i32
>>>>>>> member at index 2. It is also valid for %i to be 43
(though not 44 if an
>>>>>>> inbound GEP is used).
>>>>>>> So clang has to give LLVM more information about
the C/C++ rules.
>>>>>>>
>>>>>>> *IR representation:*
>>>>>>> LLVM has several ways of representing ranges of
values:
>>>>>>>    - *!range* metadata can be attached to integer
call and load
>>>>>> instructions
>>>>>>> to indicate the allowed range of values of the
result. LLVM's
>>>>>> ValueTracking
>>>>>>> provides a function for querying the range for any
llvm::Variable.
>>>>>>>    - The *llvm.assume* intrinsic takes a boolean
condition that can also
>>>>>> be
>>>>>>> used by ValueTracking to infer range of values.
>>>>>>>    - The *inrange* attribute of GEP can be used to
indicate C-like
>>>>>> semantics
>>>>>>> for the structure field marked with the inrange
attribute. It can only be
>>>>>>> used for GEP constantexprs (ie.e. GEPs defined
inline), but not for
>>>>>>> standalone GEPs defining instructions.  relevant
discussion
>>>>>>>
<https://reviews.llvm.org/D22793?id=65626#inline-194653>.
>>>>>>>
>>>>>>> Alternatives:
>>>>>>> *(1) *Annotate each array subscript index value
with a range, e.g.:
>>>>>>> ```
>>>>>>> %i = i64 …
>>>>>>> %ri =  call i64 @llvm.annotation.i64(%index),
!range !0
>>>>>>> %gep1 = getelementptr inbounds %struct.S,
%struct.S* %s, i64 0, i32 0,
>>>>>> i32
>>>>>>> %ri
>>>>>>> ...
>>>>>>> !0 = !{i64 0, i64 42}
>>>>>>> ```
>>>>>>> *(2) *(variant of 1) relax the constraint that
!range metadata can only
>>>>>> be
>>>>>>> set on call and load instructions, and set the
!range metadata on the
>>>>>> index
>>>>>>> expression. We still need annotations for function
parameters though:
>>>>>>> ```
>>>>>>> %i = i64 … , !range !0
>>>>>>> %gep1 = getelementptr inbounds %struct.S,
%struct.S* %s, i64 0, i32 0,
>>>>>> i32
>>>>>>> %i
>>>>>>> ...
>>>>>>> !0 = !{i64 0, i64 42}
>>>>>>> ```
>>>>>>> This is slightly more compact.
>>>>>>>
>>>>>>> *(3)* Same as (1), with llvm.assume. This feels
inferior to annotations.
>>>>>>> *(4)* Extend inrange to non-constantexprs GEPs. It
is unclear how this
>>>>>> will
>>>>>>> interfere with optimizations.
>>>>>> I would very much like not to introduce another way to
encode
>>>>>> assumptions other than `llvm.assume`. If you want to
avoid the extra
>>>>>> instructions, use `llvm.assume(i1 true)
["range"(%val, %lb, %ub)]`,
>>>>>> which is in line with our move towards operand bundle
use.
>>>>>>
>>>>> Thanks, I did not know about that. I've just tried it
but it appears that
>>>>> tags have to be attribute names, and `!range` is not a
valid attribute,
>>>>> it's a metadata node. Is there a way to encode this ?
>>>> We need to get rid of that assertion. There are other
non-attributes
>>>> to be used in assume operand bundles in the (near) future, so
the this
>>>> work has to be done anyway.
>>> +1 on trying to use assume, rather than adding another way.
>>>
>>> But are value ranges special for assumes, so that we need to handle
them in a bundle? Is that just so we can easier skip ‘artificial’ assume users?
>> It would make users explicit and we will have non-attribute bundles
anyway.
>> I find it also "conceptually nicer", would you prefer
explicit instructions?
> One disadvantage of using a bundle (or !range metadata) is that we treat
ranges for certain values in a special way and differently to how we treat range
information expressed by the user e.g. via conditions (or builtin assume).
I don't think this is necessarily accurate. We can, and already do 
(https://godbolt.org/z/MaMEb1Koo),
generate bundles from conditions. If we can interpret a condition, why 
could we not rewrite it into a bundle?
I'm not sure why this is any different form other normalization we do. 
(And bundles have benefits over implicit
instruction encodings, for example use tracking and #instructions.)

>   This means we have to handle multiple variants across the codebase, which
can lead to situations where only one or the other is handled, which in turn can
lead to surprising results (of the form: why does a transformation apply if
information provided in a certain way, but does not apply of the equivalent info
is provided in a different way).
> Using instruction potentially also allows us to specify more complex
ranges, in relation to other values.
I don't see how bundles would restrict us in any way. I mean, if we want 
to express property XYZ for %v and %q,
`llvm.assume(i1 true) ["XYZ"(%v, %q)]`, makes it really easy and it is
arguably as generic as you want it to be.

> But I realize that there are some practical consideration that make the
instruction approach less appealing and I am all in favor of the more pragmatic
& practical solution to start with.
I think bundles, and more generic assumptions, are what we need in the 
future. I still believe we should use them
to encode information in assertions [0], among other things, without 
running into the risk of having side-effects
that influence the compilation.

~ Johannes

[0] https://lists.llvm.org/pipermail/llvm-dev/2019-December/137632.html

> Cheers,
> Florian

James Courtier-Dutton via llvm-dev

2021-Mar-28 08:49 UTC

head link

[llvm-dev] [RFC] Adding range metadata to array subscripts.

Hi,

char* test_fill(int size) {
  char *test1 = malloc(size)
  for (n = 0; n <= size; n++) {
    test1[n] = 'A';
  }
}

Would it be worth making the "range" information a little richer and
be able to use algebraic expressions as well as numeric ranges.
Note: the above example code has an off by one overflow, and it would
be helpful if one could catch that at compile time.
In this case, it could catch that n must be less than size, and not
less than or equal to size.
Thus putting the range value on the test1 pointer as being from
address of test1 to test1 + (size - 1)

This can only be achieved if algebraic expressions are used for
ranges, and not just constant values.
Actual use cases can get much more complicated with for example,
non-contiguous ranges. e.g. 0,1,4,5 ok, but 2,3,6,7 not ok.

Another useful thing to catch at compile time, would be a warning that
a pointer is being dereferenced, and we were not able to apply a range
expression to it. I.e. warn about unbounded dereferences.

I think it would be useful to at least consider how we would capture
this more complex range information/metadata in LLVM IR.

Kind Regards

James

> >>>> On 3/24/21 9:06 AM, Clement Courbet wrote:
> >>>>> On Wed, Mar 24, 2021 at 2:20 PM Johannes Doerfert <
> >>>>> johannesdoerfert at gmail.com> wrote:
> >>>>>
> >>>>>> I really like encoding more (range) information in
the IR,
> >>>>>> more thoughts inlined.
> >>>>>>
> >>>>>> On 3/24/21 4:14 AM, Clement Courbet via llvm-dev
wrote:
> >>>>>>> struct Histogram {
> >>>>>>>
> >>>>>>>     int values[256];
> >>>>>>>
> >>>>>>>     int total;
> >>>>>>>
> >>>>>>> };
> >>>>>>>
> >>>>>>> Histogram DoIt(const int* image, int size) {
> >>>>>>>
> >>>>>>>     Histogram histogram;
> >>>>>>>
> >>>>>>>     for (int i = 0; i < size; ++i) {
> >>>>>>>
> >>>>>>>       ++histogram.values[image[i]];  // (A)
> >>>>>>>
> >>>>>>>       ++histogram.total;             // (B)
> >>>>>>>
> >>>>>>>     }
> >>>>>>>
> >>>>>>>     return histogram;
> >>>>>>>
> >>>>>>> }

Florian Hahn via llvm-dev

2021-Mar-30 17:25 UTC

head link

[llvm-dev] [RFC] Adding range metadata to array subscripts.

> On Mar 27, 2021, at 20:37, Johannes Doerfert <johannesdoerfert at
gmail.com> wrote:
> On 3/27/21 1:30 PM, Florian Hahn wrote:
>> 
>>> On Mar 24, 2021, at 19:32, Johannes Doerfert <johannesdoerfert
at gmail.com> wrote:
>>> On 3/24/21 12:47 PM, Florian Hahn wrote:
>>>>> On Mar 24, 2021, at 15:16, Johannes Doerfert via llvm-dev
<llvm-dev at lists.llvm.org> wrote:
>>>>> We need to get rid of that assertion. There are other
non-attributes
>>>>> to be used in assume operand bundles in the (near) future,
so the this
>>>>> work has to be done anyway.
>>>> +1 on trying to use assume, rather than adding another way.
>>>> 
>>>> But are value ranges special for assumes, so that we need to
handle them in a bundle? Is that just so we can easier skip ‘artificial’ assume
users?
>>> It would make users explicit and we will have non-attribute bundles
anyway.
>>> I find it also "conceptually nicer", would you prefer
explicit instructions?
>> One disadvantage of using a bundle (or !range metadata) is that we
treat ranges for certain values in a special way and differently to how we treat
range information expressed by the user e.g. via conditions (or builtin assume).
> 
> I don't think this is necessarily accurate. We can, and already do
(https://godbolt.org/z/MaMEb1Koo <https://godbolt.org/z/MaMEb1Koo>),
> generate bundles from conditions. If we can interpret a condition, why
could we not rewrite it into a bundle?
> I'm not sure why this is any different form other normalization we do.
(And bundles have benefits over implicit
> instruction encodings, for example use tracking and #instructions.)
> 
I’m not arguing that it is not possible to do everything with assume bundles. I
am saying that we end up with at least 2 ways to encode the same information, so
we need to handle 2 parallel encodings (i.e. we always have to handle conditions
from the program control flow, which is represented via instructions)

I think the !nonnull example you shared is illustrates the extra work passes
will have to do. For example, a couple of passes know how to generically handle
information from assumes, exactly because the conditions used for the assumes
are not special and they already have to handle the same conditions for
branches. If we instead convert the condition to a special bundle, all those
passes will need updating to properly interpret !nonnull (and future bundles).
Examples include SCCP, NewGVN, parts of SCEV.

FTR I think assume bundles are great to express interesting properties!

I am just trying to highlight some potential drawbacks when it comes to ranges
or other properties we can express directly in LLVM IR already. I am sure it
would be possible to add some extra abstraction to make it easier to update the
relevant passes, it’s just a cost to consider.

> 
>>  This means we have to handle multiple variants across the codebase,
which can lead to situations where only one or the other is handled, which in
turn can lead to surprising results (of the form: why does a transformation
apply if information provided in a certain way, but does not apply of the
equivalent info is provided in a different way).
>> Using instruction potentially also allows us to specify more complex
ranges, in relation to other values.
> 
> I don't see how bundles would restrict us in any way. I mean, if we
want to express property XYZ for %v and %q,
> `llvm.assume(i1 true) ["XYZ"(%v, %q)]`, makes it really easy and
it is arguably as generic as you want it to be.
> 
I agree that it is possible to encode more interesting properties with assume
bundles, but wouldn’t we end up duplicating all existing compare predicates for
example? And for something like %x + %y < %x we would either fall back to
instructions again or come up with a way to encode that in bundles as well. If
we still use IR instructions for more complex expressions, we’d still need a way
to exclude the ‘assume-only’ uses.
> 
>> But I realize that there are some practical consideration that make the
instruction approach less appealing and I am all in favor of the more pragmatic
& practical solution to start with.
> 
> I think bundles, and more generic assumptions, are what we need in the
future. I still believe we should use them
> to encode information in assertions [0], among other things, without
running into the risk of having side-effects
> that influence the compilation.
Again, I am not saying we shouldn’t, just that there are some potential
drawbacks in some cases.

Cheers,
Florian
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20210330/bddb0b5e/attachment.html>

llvm dev - Mar 2021 - [RFC] Adding range metadata to array subscripts.

[llvm-dev] [RFC] Adding range metadata to array subscripts.

[llvm-dev] [RFC] Adding range metadata to array subscripts.

[llvm-dev] [RFC] Adding range metadata to array subscripts.