thr3ads.net - llvm dev - [llvm-dev] [RFC] Adding range metadata to array subscripts. [Mar 2021]

If this information is useful, please help other people find it:
Share via:

Nuno Lopes via llvm-dev

2021-Mar-24 11:00 UTC

[llvm-dev] [RFC] Adding range metadata to array subscripts.

In theory, inrange should be perfect for this job. If inrange was extended to
non-const geps, I think it would work for all cases you care about, right? It’s
also the most compact representation, so I would prefer that solution. Better
than adding a new call instruction for every single pointer arithmetic
instruction.

In terms of BasicAA, supporting inrange shouldn’t be any different from !range
metadata I think.

 

Just my 2c.

 

Nuno

 

 

From: Clement Courbet
Sent: 24 March 2021 09:15
To: llvm-dev <llvm-dev at lists.llvm.org>
Subject: [llvm-dev] [RFC] Adding range metadata to array subscripts.

 

Hi everyone,

 

tl;dr: I would like to teach clang to output range metadata so that LLVM can do
better alias analysis. I have a proposal as D99248
<https://reviews.llvm.org/D99248>  (clang part) and D99247
<https://reviews.llvm.org/D99247>  (llvm part). But there are other
possible options that I'm detailing below.

 

Consider the following code, adapted from brotli
<https://en.wikipedia.org/wiki/Brotli> :

 

```

struct Histogram {

  int values[256];

  int total;

};

 

Histogram DoIt(const int* image, int size) {

  Histogram histogram;

  for (int i = 0; i < size; ++i) {

    ++histogram.values[image[i]];  // (A)

    ++histogram.total;             // (B)

  }

  return histogram;

}

```

 

In this code, the compiler does not know anything about the values of images[i],
so it assumes that 256 is a possible value for it. In that case, (A) would
change the value of histogram.total, so (B) has to load, add one and store
[godbolt <https://godbolt.org/z/KxE343> ].

 

Fortunately, C/C++ has a rule that it is invalid (actually, UB) to use values to
form a pointer to total and dereference it. What valid C/C++ code is allowed to
do with values is:

 - Form any pointer in [values, values + 256].
 - Form and dereference any pointer in [values, values + 256)

 

Note that the LLVM memory model is much laxer than that of C/C++. It has no
notion of types. In particular, given an LLVM aggregate definition:

```
%struct.S = type { [42 x i32], i32, i32 }
```

It is perfectly valid to use an address derived from a GEP(0,0,%i)  [gep
reference] representing indexing into the [42 x i32] array to load the i32
member at index 2. It is also valid for %i to be 43 (though not 44 if an inbound
GEP is used).

So clang has to give LLVM more information about the C/C++ rules.

 

IR representation:
LLVM has several ways of representing ranges of values:
 - !range metadata can be attached to integer call and load instructions to
indicate the allowed range of values of the result. LLVM's ValueTracking
provides a function for querying the range for any llvm::Variable.
 - The llvm.assume intrinsic takes a boolean condition that can also be used by
ValueTracking to infer range of values.
 - The inrange attribute of GEP can be used to indicate C-like semantics for the
structure field marked with the inrange attribute. It can only be used for GEP
constantexprs (ie.e. GEPs defined inline), but not for standalone GEPs defining
instructions.  relevant discussion
<https://reviews.llvm.org/D22793?id=65626#inline-194653> .

 

Alternatives: 
(1) Annotate each array subscript index value with a range, e.g.:

```
%i = i64 …
%ri =  call i64 @llvm.annotation.i64(%index), !range !0
%gep1 = getelementptr inbounds %struct.S, %struct.S* %s, i64 0, i32 0, i32 %ri
...
!0 = !{i64 0, i64 42}

```
(2) (variant of 1) relax the constraint that !range metadata can only be set on
call and load instructions, and set the !range metadata on the index expression.
We still need annotations for function parameters though:

```
%i = i64 … , !range !0
%gep1 = getelementptr inbounds %struct.S, %struct.S* %s, i64 0, i32 0, i32 %i
...
!0 = !{i64 0, i64 42}

```
This is slightly more compact.

(3) Same as (1), with llvm.assume. This feels inferior to annotations.
(4) Extend inrange to non-constantexprs GEPs. It is unclear how this will
interfere with optimizations.

 

 

On the clang side:

The clang part is quite trivial as the infrastructure is already in place to
emit dynamic ubsan guards: D99248 <https://reviews.llvm.org/D99248>

 

On the LLVM Side:

Alternatives:
(A) - (annotation or assume options) Simply enable scev-aa which knows how to
handle value ranges in general. IIUC it's not enabled in clang because it
has issues with invalidation when code changes, and is therefore not cacheable.
This makes it too slow to be practical.
(B) - (annotation or assume options) Teach BasicAA to honor !range metadata
(D99247 <https://reviews.llvm.org/D99247> )
(C) - (inrange option) Teach BasicAA to honor inrange attributes of GEP.

 

I was leaning towards (1) and (B) because:

 - BasicAA already has basic support for value range analysis (zero/nonzero),
this is a small and natural extension.

 - The BasicAA improvement already benefits some existing code (as evidenced by
the test changes in D99247 <https://reviews.llvm.org/D99247> )

 - Using range metadata rather than the `inrange` attribute means that BasicAA
will automatically benefit from improvements in value tracking in the future.

 

Opinions ?

-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20210324/703b6a0d/attachment.html>

Clement Courbet via llvm-dev

2021-Mar-24 11:19 UTC

head link

[llvm-dev] [RFC] Adding range metadata to array subscripts.

On Wed, Mar 24, 2021 at 12:00 PM Nuno Lopes <nunoplopes at sapo.pt> wrote:
> In theory, inrange should be perfect for this job. If inrange was extended
> to non-const geps, I think it would work for all cases you care about,
> right?
>
Agreed, but inrange was not extended to non-const geps. I'm not sure
exactly why. From what I understand of the previous discussions, there were
non-trivial issues with supporting `inrange` for non-const geps, though
it's not exactly clear to me what these were. Peter, what was the blocker
for that ?

> It’s also the most compact representation, so I would prefer that
> solution. Better than adding a new call instruction for every single
> pointer arithmetic instruction.
>
> In terms of BasicAA, supporting inrange shouldn’t be any different from
> !range metadata I think.
>
Yes, for the purpose of BasicAA `inrange` is a special case of range, where
the range happens to be the size of the underlying `inrange` field.

>
>
> Just my 2c.
>
>
>
> Nuno
>
>
>
>
>
> *From:* Clement Courbet
> *Sent:* 24 March 2021 09:15
> *To:* llvm-dev <llvm-dev at lists.llvm.org>
> *Subject:* [llvm-dev] [RFC] Adding range metadata to array subscripts.
>
>
>
> Hi everyone,
>
>
>
> tl;dr: I would like to teach clang to output range metadata so that LLVM
> can do better alias analysis. I have a proposal as D99248
> <https://reviews.llvm.org/D99248> (clang part) and D99247
> <https://reviews.llvm.org/D99247> (llvm part). But there are other
> possible options that I'm detailing below.
>
>
>
> Consider the following code, adapted from brotli
> <https://en.wikipedia.org/wiki/Brotli>:
>
>
>
> ```
>
> struct Histogram {
>
>   int values[256];
>
>   int total;
>
> };
>
>
>
> Histogram DoIt(const int* image, int size) {
>
>   Histogram histogram;
>
>   for (int i = 0; i < size; ++i) {
>
>     ++histogram.values[image[i]];  // (A)
>
>     ++histogram.total;             // (B)
>
>   }
>
>   return histogram;
>
> }
>
> ```
>
>
>
> In this code, the compiler does not know anything about the values of
> images[i], so it assumes that 256 is a possible value for it. In that case,
> (A) would change the value of histogram.total, so (B) has to load, add one
> and store [godbolt <https://godbolt.org/z/KxE343>].
>
>
>
> Fortunately, C/C++ has a rule that it is invalid (actually, UB) to use
> values to form a pointer to total and dereference it. What valid C/C++ code
> is allowed to do with values is:
>
>  - Form any pointer in [values, values + 256].
>  - Form and dereference any pointer in [values, values + 256)
>
>
>
> Note that the LLVM memory model is much laxer than that of C/C++. It has
> no notion of types. In particular, given an LLVM aggregate definition:
>
> ```
> %struct.S = type { [42 x i32], i32, i32 }
> ```
>
> It is perfectly valid to use an address derived from a GEP(0,0,%i)  [gep
> reference] representing indexing into the [42 x i32] array to load the i32
> member at index 2. It is also valid for %i to be 43 (though not 44 if an
> inbound GEP is used).
>
> So clang has to give LLVM more information about the C/C++ rules.
>
>
>
> *IR representation:*
> LLVM has several ways of representing ranges of values:
>  - *!range* metadata can be attached to integer call and load
> instructions to indicate the allowed range of values of the result.
LLVM's
> ValueTracking provides a function for querying the range for any
> llvm::Variable.
>  - The *llvm.assume* intrinsic takes a boolean condition that can also be
> used by ValueTracking to infer range of values.
>  - The *inrange* attribute of GEP can be used to indicate C-like
> semantics for the structure field marked with the inrange attribute. It can
> only be used for GEP constantexprs (ie.e. GEPs defined inline), but not for
> standalone GEPs defining instructions.  relevant discussion
> <https://reviews.llvm.org/D22793?id=65626#inline-194653>.
>
>
>
> Alternatives:
> *(1) *Annotate each array subscript index value with a range, e.g.:
>
> ```
> %i = i64 …
> %ri =  call i64 @llvm.annotation.i64(%index), !range !0
> %gep1 = getelementptr inbounds %struct.S, %struct.S* %s, i64 0, i32 0, i32
> %ri
> ...
> !0 = !{i64 0, i64 42}
>
> ```
> *(2) *(variant of 1) relax the constraint that !range metadata can only
> be set on call and load instructions, and set the !range metadata on the
> index expression. We still need annotations for function parameters though:
>
> ```
> %i = i64 … , !range !0
> %gep1 = getelementptr inbounds %struct.S, %struct.S* %s, i64 0, i32 0, i32
> %i
> ...
> !0 = !{i64 0, i64 42}
>
> ```
> This is slightly more compact.
>
> *(3)* Same as (1), with llvm.assume. This feels inferior to annotations.
> *(4)* Extend inrange to non-constantexprs GEPs. It is unclear how this
> will interfere with optimizations.
>
>
>
>
>
> *On the clang side*:
>
> The clang part is quite trivial as the infrastructure is already in place
> to emit dynamic ubsan guards: D99248
<https://reviews.llvm.org/D99248>
>
>
>
> *On the LLVM Side:*
>
> Alternatives:
> *(A)* - (annotation or assume options) Simply enable scev-aa which knows
> how to handle value ranges in general. IIUC it's not enabled in clang
> because it has issues with invalidation when code changes, and is therefore
> not cacheable. This makes it too slow to be practical.
> *(B) *- (annotation or assume options) Teach BasicAA to honor !range
> metadata (D99247 <https://reviews.llvm.org/D99247>)
> *(C)* - (inrange option) Teach BasicAA to honor inrange attributes of GEP.
>
>
>
> I was leaning towards (1) and (B) because:
>
>  - BasicAA already has basic support for value range analysis
> (zero/nonzero), this is a small and natural extension.
>
>  - The BasicAA improvement already benefits some existing code (as
> evidenced by the test changes in D99247
<https://reviews.llvm.org/D99247>)
>
>  - Using range metadata rather than the `inrange` attribute means that
> BasicAA will automatically benefit from improvements in value tracking in
> the future.
>
>
>
> Opinions ?
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20210324/4bf15a62/attachment.html>

Clement Courbet via llvm-dev

2021-Mar-25 14:42 UTC

head link

[llvm-dev] [RFC] Adding range metadata to array subscripts.

On Wed, Mar 24, 2021 at 12:00 PM Nuno Lopes <nunoplopes at sapo.pt> wrote:
> In theory, inrange should be perfect for this job. If inrange was extended
> to non-const geps, I think it would work for all cases you care about,
> right? It’s also the most compact representation, so I would prefer that
> solution. Better than adding a new call instruction for every single
> pointer arithmetic instruction.
>
> In terms of BasicAA, supporting inrange shouldn’t be any different from
> !range metadata I think.
>
>
>
So I tried implementing this on top of `inrange`, and I found out that you
can only have one `inrange` index, which doesn't allow us to represent
access into multidimensional arrays : https://reviews.llvm.org/D99341

> Just my 2c.
>
>
>
> Nuno
>
>
>
>
>
> *From:* Clement Courbet
> *Sent:* 24 March 2021 09:15
> *To:* llvm-dev <llvm-dev at lists.llvm.org>
> *Subject:* [llvm-dev] [RFC] Adding range metadata to array subscripts.
>
>
>
> Hi everyone,
>
>
>
> tl;dr: I would like to teach clang to output range metadata so that LLVM
> can do better alias analysis. I have a proposal as D99248
> <https://reviews.llvm.org/D99248> (clang part) and D99247
> <https://reviews.llvm.org/D99247> (llvm part). But there are other
> possible options that I'm detailing below.
>
>
>
> Consider the following code, adapted from brotli
> <https://en.wikipedia.org/wiki/Brotli>:
>
>
>
> ```
>
> struct Histogram {
>
>   int values[256];
>
>   int total;
>
> };
>
>
>
> Histogram DoIt(const int* image, int size) {
>
>   Histogram histogram;
>
>   for (int i = 0; i < size; ++i) {
>
>     ++histogram.values[image[i]];  // (A)
>
>     ++histogram.total;             // (B)
>
>   }
>
>   return histogram;
>
> }
>
> ```
>
>
>
> In this code, the compiler does not know anything about the values of
> images[i], so it assumes that 256 is a possible value for it. In that case,
> (A) would change the value of histogram.total, so (B) has to load, add one
> and store [godbolt <https://godbolt.org/z/KxE343>].
>
>
>
> Fortunately, C/C++ has a rule that it is invalid (actually, UB) to use
> values to form a pointer to total and dereference it. What valid C/C++ code
> is allowed to do with values is:
>
>  - Form any pointer in [values, values + 256].
>  - Form and dereference any pointer in [values, values + 256)
>
>
>
> Note that the LLVM memory model is much laxer than that of C/C++. It has
> no notion of types. In particular, given an LLVM aggregate definition:
>
> ```
> %struct.S = type { [42 x i32], i32, i32 }
> ```
>
> It is perfectly valid to use an address derived from a GEP(0,0,%i)  [gep
> reference] representing indexing into the [42 x i32] array to load the i32
> member at index 2. It is also valid for %i to be 43 (though not 44 if an
> inbound GEP is used).
>
> So clang has to give LLVM more information about the C/C++ rules.
>
>
>
> *IR representation:*
> LLVM has several ways of representing ranges of values:
>  - *!range* metadata can be attached to integer call and load
> instructions to indicate the allowed range of values of the result.
LLVM's
> ValueTracking provides a function for querying the range for any
> llvm::Variable.
>  - The *llvm.assume* intrinsic takes a boolean condition that can also be
> used by ValueTracking to infer range of values.
>  - The *inrange* attribute of GEP can be used to indicate C-like
> semantics for the structure field marked with the inrange attribute. It can
> only be used for GEP constantexprs (ie.e. GEPs defined inline), but not for
> standalone GEPs defining instructions.  relevant discussion
> <https://reviews.llvm.org/D22793?id=65626#inline-194653>.
>
>
>
> Alternatives:
> *(1) *Annotate each array subscript index value with a range, e.g.:
>
> ```
> %i = i64 …
> %ri =  call i64 @llvm.annotation.i64(%index), !range !0
> %gep1 = getelementptr inbounds %struct.S, %struct.S* %s, i64 0, i32 0, i32
> %ri
> ...
> !0 = !{i64 0, i64 42}
>
> ```
> *(2) *(variant of 1) relax the constraint that !range metadata can only
> be set on call and load instructions, and set the !range metadata on the
> index expression. We still need annotations for function parameters though:
>
> ```
> %i = i64 … , !range !0
> %gep1 = getelementptr inbounds %struct.S, %struct.S* %s, i64 0, i32 0, i32
> %i
> ...
> !0 = !{i64 0, i64 42}
>
> ```
> This is slightly more compact.
>
> *(3)* Same as (1), with llvm.assume. This feels inferior to annotations.
> *(4)* Extend inrange to non-constantexprs GEPs. It is unclear how this
> will interfere with optimizations.
>
>
>
>
>
> *On the clang side*:
>
> The clang part is quite trivial as the infrastructure is already in place
> to emit dynamic ubsan guards: D99248
<https://reviews.llvm.org/D99248>
>
>
>
> *On the LLVM Side:*
>
> Alternatives:
> *(A)* - (annotation or assume options) Simply enable scev-aa which knows
> how to handle value ranges in general. IIUC it's not enabled in clang
> because it has issues with invalidation when code changes, and is therefore
> not cacheable. This makes it too slow to be practical.
> *(B) *- (annotation or assume options) Teach BasicAA to honor !range
> metadata (D99247 <https://reviews.llvm.org/D99247>)
> *(C)* - (inrange option) Teach BasicAA to honor inrange attributes of GEP.
>
>
>
> I was leaning towards (1) and (B) because:
>
>  - BasicAA already has basic support for value range analysis
> (zero/nonzero), this is a small and natural extension.
>
>  - The BasicAA improvement already benefits some existing code (as
> evidenced by the test changes in D99247
<https://reviews.llvm.org/D99247>)
>
>  - Using range metadata rather than the `inrange` attribute means that
> BasicAA will automatically benefit from improvements in value tracking in
> the future.
>
>
>
> Opinions ?
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20210325/75c657c3/attachment.html>

llvm dev - Mar 2021 - [RFC] Adding range metadata to array subscripts.

[llvm-dev] [RFC] Adding range metadata to array subscripts.

[llvm-dev] [RFC] Adding range metadata to array subscripts.

[llvm-dev] [RFC] Adding range metadata to array subscripts.