thr3ads.net - llvm dev - [llvm-dev] [RFC] Adding range metadata to array subscripts. [Mar 2021]

If this information is useful, please help other people find it:
Share via:

Clement Courbet via llvm-dev

2021-Mar-24 14:06 UTC

[llvm-dev] [RFC] Adding range metadata to array subscripts.

On Wed, Mar 24, 2021 at 2:20 PM Johannes Doerfert <
johannesdoerfert at gmail.com> wrote:
> I really like encoding more (range) information in the IR,
> more thoughts inlined.
>
> On 3/24/21 4:14 AM, Clement Courbet via llvm-dev wrote:
> > Hi everyone,
> >
> > tl;dr: I would like to teach clang to output range metadata so that
LLVM
> > can do better alias analysis. I have a proposal as D99248
> > <https://reviews.llvm.org/D99248> (clang part) and D99247
> > <https://reviews.llvm.org/D99247> (llvm part). But there are
other
> possible
> > options that I'm detailing below.
> >
> > Consider the following code, adapted from brotli
> > <https://en.wikipedia.org/wiki/Brotli>:
> >
> > ```
> >
> > struct Histogram {
> >
> >    int values[256];
> >
> >    int total;
> >
> > };
> >
> > Histogram DoIt(const int* image, int size) {
> >
> >    Histogram histogram;
> >
> >    for (int i = 0; i < size; ++i) {
> >
> >      ++histogram.values[image[i]];  // (A)
> >
> >      ++histogram.total;             // (B)
> >
> >    }
> >
> >    return histogram;
> >
> > }
> > ```
> >
> > In this code, the compiler does not know anything about the values of
> > images[i], so it assumes that 256 is a possible value for it. In that
> case,
> > (A) would change the value of histogram.total, so (B) has to load, add
> one
> > and store [godbolt <https://godbolt.org/z/KxE343>].
> >
> > Fortunately, C/C++ has a rule that it is invalid (actually, UB) to use
> > values to form a pointer to total and dereference it. What valid C/C++
> code
> > is allowed to do with values is:
> >   - Form any pointer in [values, values + 256].
> >   - Form and dereference any pointer in [values, values + 256)
> >
> > Note that the LLVM memory model is much laxer than that of C/C++. It
has
> no
> > notion of types. In particular, given an LLVM aggregate definition:
> >
> > ```
> > %struct.S = type { [42 x i32], i32, i32 }
> > ```
> >
> > It is perfectly valid to use an address derived from a GEP(0,0,%i) 
[gep
> > reference] representing indexing into the [42 x i32] array to load the
> i32
> > member at index 2. It is also valid for %i to be 43 (though not 44 if
an
> > inbound GEP is used).
> > So clang has to give LLVM more information about the C/C++ rules.
> >
> > *IR representation:*
> > LLVM has several ways of representing ranges of values:
> >   - *!range* metadata can be attached to integer call and load
> instructions
> > to indicate the allowed range of values of the result. LLVM's
> ValueTracking
> > provides a function for querying the range for any llvm::Variable.
> >   - The *llvm.assume* intrinsic takes a boolean condition that can
also
> be
> > used by ValueTracking to infer range of values.
> >   - The *inrange* attribute of GEP can be used to indicate C-like
> semantics
> > for the structure field marked with the inrange attribute. It can only
be
> > used for GEP constantexprs (ie.e. GEPs defined inline), but not for
> > standalone GEPs defining instructions.  relevant discussion
> > <https://reviews.llvm.org/D22793?id=65626#inline-194653>.
> >
> > Alternatives:
> > *(1) *Annotate each array subscript index value with a range, e.g.:
> > ```
> > %i = i64 …
> > %ri =  call i64 @llvm.annotation.i64(%index), !range !0
> > %gep1 = getelementptr inbounds %struct.S, %struct.S* %s, i64 0, i32 0,
> i32
> > %ri
> > ...
> > !0 = !{i64 0, i64 42}
> > ```
> > *(2) *(variant of 1) relax the constraint that !range metadata can
only
> be
> > set on call and load instructions, and set the !range metadata on the
> index
> > expression. We still need annotations for function parameters though:
> > ```
> > %i = i64 … , !range !0
> > %gep1 = getelementptr inbounds %struct.S, %struct.S* %s, i64 0, i32 0,
> i32
> > %i
> > ...
> > !0 = !{i64 0, i64 42}
> > ```
> > This is slightly more compact.
> >
> > *(3)* Same as (1), with llvm.assume. This feels inferior to
annotations.
> > *(4)* Extend inrange to non-constantexprs GEPs. It is unclear how this
> will
> > interfere with optimizations.
>
> I would very much like not to introduce another way to encode
> assumptions other than `llvm.assume`. If you want to avoid the extra
> instructions, use `llvm.assume(i1 true) ["range"(%val, %lb,
%ub)]`,
> which is in line with our move towards operand bundle use.
>
Thanks, I did not know about that. I've just tried it but it appears that
tags have to be attribute names, and `!range` is not a valid attribute,
it's a metadata node. Is there a way to encode this ?

> SCEV should be thought about this (as well), unsure what the problem
> is you describe below. If BasicAA needs to know, sure.
>
scev-aa already knows how to use range information. If we add a !range
metadata in clang right now and use SCEV, there is nothing to do on the
LLVM side. My point was that scev-aa is not enabled in the default
pipeline, so we might as well teach BasicAA about this cheap case.

Actually I think it makes sense to teach BasicAA about range information
anyway (D99247) given that it could already be useful in cases like:

```
define dso_local void @DoIt(%struct.Histogram* noalias nocapture sret(
%struct.Histogram) align 4 %0, i32 %1, *i8 zeroext %2*) local_unnamed_addr
#0 {
...
*%6 = zext i8 %2 to i64*
%7 = getelementptr inbounds %struct.Histogram, %struct.Histogram* %0, i64 0
, i32 0, *i64 %6*
```

Where ValueTracking could easily deduce that %6 is in [0,255].


>
> ~ Johannes
>
>
> >
> > *On the clang side*:
> > The clang part is quite trivial as the infrastructure is already in
place
> > to emit dynamic ubsan guards: D99248
<https://reviews.llvm.org/D99248>
> >
> > *On the LLVM Side:*
> > Alternatives:
> > *(A)* - (annotation or assume options) Simply enable scev-aa which
knows
> > how to handle value ranges in general. IIUC it's not enabled in
clang
> > because it has issues with invalidation when code changes, and is
> therefore
> > not cacheable. This makes it too slow to be practical.
> > *(B) *- (annotation or assume options) Teach BasicAA to honor !range
> > metadata (D99247 <https://reviews.llvm.org/D99247>)
> > *(C)* - (inrange option) Teach BasicAA to honor inrange attributes of
> GEP.
> >
> > I was leaning towards (1) and (B) because:
> >   - BasicAA already has basic support for value range analysis
> > (zero/nonzero), this is a small and natural extension.
> >   - The BasicAA improvement already benefits some existing code (as
> > evidenced by the test changes in D99247
<https://reviews.llvm.org/D99247
> >)
> >   - Using range metadata rather than the `inrange` attribute means
that
> > BasicAA will automatically benefit from improvements in value tracking
in
> > the future.
> >
> > Opinions ?
> >
> >
> > _______________________________________________
> > LLVM Developers mailing list
> > llvm-dev at lists.llvm.org
> > https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20210324/024fedfe/attachment.html>

Johannes Doerfert via llvm-dev

2021-Mar-24 15:16 UTC

head link

[llvm-dev] [RFC] Adding range metadata to array subscripts.

On 3/24/21 9:06 AM, Clement Courbet wrote:> On Wed, Mar 24, 2021 at 2:20 PM Johannes Doerfert <
> johannesdoerfert at gmail.com> wrote:
>
>> I really like encoding more (range) information in the IR,
>> more thoughts inlined.
>>
>> On 3/24/21 4:14 AM, Clement Courbet via llvm-dev wrote:
>>> Hi everyone,
>>>
>>> tl;dr: I would like to teach clang to output range metadata so that
LLVM
>>> can do better alias analysis. I have a proposal as D99248
>>> <https://reviews.llvm.org/D99248> (clang part) and D99247
>>> <https://reviews.llvm.org/D99247> (llvm part). But there are
other
>> possible
>>> options that I'm detailing below.
>>>
>>> Consider the following code, adapted from brotli
>>> <https://en.wikipedia.org/wiki/Brotli>:
>>>
>>> ```
>>>
>>> struct Histogram {
>>>
>>>     int values[256];
>>>
>>>     int total;
>>>
>>> };
>>>
>>> Histogram DoIt(const int* image, int size) {
>>>
>>>     Histogram histogram;
>>>
>>>     for (int i = 0; i < size; ++i) {
>>>
>>>       ++histogram.values[image[i]];  // (A)
>>>
>>>       ++histogram.total;             // (B)
>>>
>>>     }
>>>
>>>     return histogram;
>>>
>>> }
>>> ```
>>>
>>> In this code, the compiler does not know anything about the values
of
>>> images[i], so it assumes that 256 is a possible value for it. In
that
>> case,
>>> (A) would change the value of histogram.total, so (B) has to load,
add
>> one
>>> and store [godbolt <https://godbolt.org/z/KxE343>].
>>>
>>> Fortunately, C/C++ has a rule that it is invalid (actually, UB) to
use
>>> values to form a pointer to total and dereference it. What valid
C/C++
>> code
>>> is allowed to do with values is:
>>>    - Form any pointer in [values, values + 256].
>>>    - Form and dereference any pointer in [values, values + 256)
>>>
>>> Note that the LLVM memory model is much laxer than that of C/C++.
It has
>> no
>>> notion of types. In particular, given an LLVM aggregate definition:
>>>
>>> ```
>>> %struct.S = type { [42 x i32], i32, i32 }
>>> ```
>>>
>>> It is perfectly valid to use an address derived from a GEP(0,0,%i) 
[gep
>>> reference] representing indexing into the [42 x i32] array to load
the
>> i32
>>> member at index 2. It is also valid for %i to be 43 (though not 44
if an
>>> inbound GEP is used).
>>> So clang has to give LLVM more information about the C/C++ rules.
>>>
>>> *IR representation:*
>>> LLVM has several ways of representing ranges of values:
>>>    - *!range* metadata can be attached to integer call and load
>> instructions
>>> to indicate the allowed range of values of the result. LLVM's
>> ValueTracking
>>> provides a function for querying the range for any llvm::Variable.
>>>    - The *llvm.assume* intrinsic takes a boolean condition that can
also
>> be
>>> used by ValueTracking to infer range of values.
>>>    - The *inrange* attribute of GEP can be used to indicate C-like
>> semantics
>>> for the structure field marked with the inrange attribute. It can
only be
>>> used for GEP constantexprs (ie.e. GEPs defined inline), but not for
>>> standalone GEPs defining instructions.  relevant discussion
>>> <https://reviews.llvm.org/D22793?id=65626#inline-194653>.
>>>
>>> Alternatives:
>>> *(1) *Annotate each array subscript index value with a range, e.g.:
>>> ```
>>> %i = i64 …
>>> %ri =  call i64 @llvm.annotation.i64(%index), !range !0
>>> %gep1 = getelementptr inbounds %struct.S, %struct.S* %s, i64 0, i32
0,
>> i32
>>> %ri
>>> ...
>>> !0 = !{i64 0, i64 42}
>>> ```
>>> *(2) *(variant of 1) relax the constraint that !range metadata can
only
>> be
>>> set on call and load instructions, and set the !range metadata on
the
>> index
>>> expression. We still need annotations for function parameters
though:
>>> ```
>>> %i = i64 … , !range !0
>>> %gep1 = getelementptr inbounds %struct.S, %struct.S* %s, i64 0, i32
0,
>> i32
>>> %i
>>> ...
>>> !0 = !{i64 0, i64 42}
>>> ```
>>> This is slightly more compact.
>>>
>>> *(3)* Same as (1), with llvm.assume. This feels inferior to
annotations.
>>> *(4)* Extend inrange to non-constantexprs GEPs. It is unclear how
this
>> will
>>> interfere with optimizations.
>> I would very much like not to introduce another way to encode
>> assumptions other than `llvm.assume`. If you want to avoid the extra
>> instructions, use `llvm.assume(i1 true) ["range"(%val, %lb,
%ub)]`,
>> which is in line with our move towards operand bundle use.
>>
> Thanks, I did not know about that. I've just tried it but it appears
that
> tags have to be attribute names, and `!range` is not a valid attribute,
> it's a metadata node. Is there a way to encode this ?
We need to get rid of that assertion. There are other non-attributes
to be used in assume operand bundles in the (near) future, so the this
work has to be done anyway.

>
>> SCEV should be thought about this (as well), unsure what the problem
>> is you describe below. If BasicAA needs to know, sure.
>>
> scev-aa already knows how to use range information. If we add a !range
> metadata in clang right now and use SCEV, there is nothing to do on the
> LLVM side. My point was that scev-aa is not enabled in the default
> pipeline, so we might as well teach BasicAA about this cheap case.
>
> Actually I think it makes sense to teach BasicAA about range information
> anyway (D99247) given that it could already be useful in cases like:
>
> ```
> define dso_local void @DoIt(%struct.Histogram* noalias nocapture sret(
> %struct.Histogram) align 4 %0, i32 %1, *i8 zeroext %2*) local_unnamed_addr
> #0 {
> ...
> *%6 = zext i8 %2 to i64*
> %7 = getelementptr inbounds %struct.Histogram, %struct.Histogram* %0, i64 0
> , i32 0, *i64 %6*
> ```
>
> Where ValueTracking could easily deduce that %6 is in [0,255].
Sounds reasonable.

~ Johannes

>
>
>
>> ~ Johannes
>>
>>
>>> *On the clang side*:
>>> The clang part is quite trivial as the infrastructure is already in
place
>>> to emit dynamic ubsan guards: D99248
<https://reviews.llvm.org/D99248>
>>>
>>> *On the LLVM Side:*
>>> Alternatives:
>>> *(A)* - (annotation or assume options) Simply enable scev-aa which
knows
>>> how to handle value ranges in general. IIUC it's not enabled in
clang
>>> because it has issues with invalidation when code changes, and is
>> therefore
>>> not cacheable. This makes it too slow to be practical.
>>> *(B) *- (annotation or assume options) Teach BasicAA to honor
!range
>>> metadata (D99247 <https://reviews.llvm.org/D99247>)
>>> *(C)* - (inrange option) Teach BasicAA to honor inrange attributes
of
>> GEP.
>>> I was leaning towards (1) and (B) because:
>>>    - BasicAA already has basic support for value range analysis
>>> (zero/nonzero), this is a small and natural extension.
>>>    - The BasicAA improvement already benefits some existing code
(as
>>> evidenced by the test changes in D99247
<https://reviews.llvm.org/D99247
>>> )
>>>    - Using range metadata rather than the `inrange` attribute means
that
>>> BasicAA will automatically benefit from improvements in value
tracking in
>>> the future.
>>>
>>> Opinions ?
>>>
>>>
>>> _______________________________________________
>>> LLVM Developers mailing list
>>> llvm-dev at lists.llvm.org
>>> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

llvm dev - Mar 2021 - [RFC] Adding range metadata to array subscripts.

[llvm-dev] [RFC] Adding range metadata to array subscripts.

[llvm-dev] [RFC] Adding range metadata to array subscripts.