thr3ads.net - llvm dev - [llvm-dev] [RFC] Non-Temporal hints from Loop Vectorizer [May 2016]

Hahnfeld, Jonas via llvm-dev

2016-May-03 10:40 UTC

[llvm-dev] [RFC] Non-Temporal hints from Loop Vectorizer

Hello all,

I've been wondering why Clang doesn't generate non-temporal stores when
compiling the STREAM benchmark [1] and therefore doesn't yield optimal
results.

It turned out that the Loop Vectorizer correctly vectorizes the arithmetic
operations and also merges the loads and stores into vector operations.
However, it doesn't add the '!nontemporal' metadata, which would be needed
for maximal bandwidth on X86.
I briefly looked into this: for non-temporal memory instructions to work,
the memory address would have to be aligned to the vector length, which
currently isn't the case either.

To summarize, the following things would be needed to give non-temporal
hints:
1) Ensure correct alignment of merged vector memory instructions
This could be implemented by executing the first (scalar) loop iterations
until the addresses for loads and stores are aligned, similar to what already
happens for the remainder of the loop. The larger alignment would also allow
aligned vector instructions instead of the currently unaligned ones.

2) Give non-temporal hints when different array elements are only used once
per loop iteration
We probably need to analyze the different loads and stores per loop iteration
for this...
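
The peeling idea in point 1 can be sketched in plain C as follows. This is
only an illustration under stated assumptions, not what the Loop Vectorizer
emits: VEC_BYTES, peel_count and triad_peeled are hypothetical names, a
32-byte vector width is assumed, and only the store address is aligned (the
load addresses generally cannot all be aligned at the same time).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define VEC_BYTES 32u /* assumed vector width in bytes (hypothetical) */

/* Number of leading scalar iterations needed before `p` reaches a
 * VEC_BYTES boundary, capped at n. Assumes `p` is aligned to
 * sizeof(double), as it is for any ordinary array of doubles. */
static size_t peel_count(const double *p, size_t n)
{
    uintptr_t misalign = (uintptr_t)p % VEC_BYTES;
    size_t peel = misalign ? (VEC_BYTES - misalign) / sizeof(double) : 0;
    return peel < n ? peel : n;
}

/* STREAM-style triad a[i] = b[i] + s * c[i], split into three loops. */
static void triad_peeled(double *a, const double *b, const double *c,
                         double s, size_t n)
{
    const size_t lanes = VEC_BYTES / sizeof(double);
    size_t i = 0;
    size_t peel = peel_count(a, n);

    /* 1) scalar prologue: run until the store address is aligned */
    for (; i < peel; ++i)
        a[i] = b[i] + s * c[i];

    /* 2) main body: each block of `lanes` stores starts on an aligned
     *    address, so an aligned vector store could be used here */
    for (; i + lanes <= n; i += lanes) {
        assert((uintptr_t)&a[i] % VEC_BYTES == 0);
        for (size_t j = 0; j < lanes; ++j)
            a[i + j] = b[i + j] + s * c[i + j];
    }

    /* 3) scalar remainder, as the vectorizer already does today */
    for (; i < n; ++i)
        a[i] = b[i] + s * c[i];
}
```

Calling triad_peeled(a, b, c, 3.0, n) gives the same result as the plain
loop; the prologue only shifts where the blocked body starts.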

Any thoughts or any ongoing work that I'm missing?

Thanks,
Jonas


[1] https://www.cs.virginia.edu/stream/

--
Jonas Hahnfeld, MATSE-Auszubildender

IT Center
Group: High Performance Computing
Division: Computational Science and Engineering
RWTH Aachen University
Seffenter Weg 23
D 52074  Aachen (Germany)
Hahnfeld at itc.rwth-aachen.de
www.itc.rwth-aachen.de



Adam Nemet via llvm-dev

2016-May-03 17:21 UTC


> On May 3, 2016, at 3:40 AM, Hahnfeld, Jonas via llvm-dev <llvm-dev at lists.llvm.org> wrote:
> 2) Give non-temporal hints when different array elements are only used once
> per loop iteration
> We probably need to analyze the different loads and stores per loop iteration
> for this…
You probably also want to ensure that you stay in the loop long enough, i.e.
have some sort of a dynamic-trip count check or PGO data indicating this.

You essentially want to ensure that reads after the loop would not have hit in
the cache even with regular stores.  (If you are writing a large area in the
loop, a large percentage of those lines are already evicted by the time you
exit the loop.)
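
A dynamic trip-count check of this kind could look roughly like the sketch
below. It is only an illustration: ASSUMED_LLC_BYTES and prefer_nontemporal
are hypothetical names, and the 8 MiB threshold is an assumed tuning value
rather than anything LLVM provides.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical tuning value: assumed last-level cache size in bytes. */
#define ASSUMED_LLC_BYTES ((size_t)8 * 1024 * 1024)

/* Returns true when the loop writes more bytes than the assumed cache
 * holds, i.e. when most written lines would be evicted before the loop
 * exits even with regular stores. */
static bool prefer_nontemporal(size_t trip_count, size_t bytes_per_iter)
{
    assert(bytes_per_iter > 0);
    /* Guard against overflow in the multiplication below. */
    if (trip_count > SIZE_MAX / bytes_per_iter)
        return true;
    return trip_count * bytes_per_iter > ASSUMED_LLC_BYTES;
}
```

A caller would evaluate prefer_nontemporal(n, sizeof(double)) once before
the loop and branch to either a non-temporal or a regular-store version.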

Adam

Hal Finkel via llvm-dev

2016-May-03 17:25 UTC


----- Original Message -----
> From: "Adam Nemet via llvm-dev" <llvm-dev at lists.llvm.org>
> To: "Jonas Hahnfeld" <Hahnfeld at itc.rwth-aachen.de>
> Cc: "llvm-dev (llvm-dev at lists.llvm.org)" <llvm-dev at lists.llvm.org>
> Sent: Tuesday, May 3, 2016 12:21:07 PM
> Subject: Re: [llvm-dev] [RFC] Non-Temporal hints from Loop Vectorizer
> You probably also want to ensure that you stay in the loop long
> enough, i.e. have some sort of a dynamic-trip count check or PGO
> data indicating this.
This sounds right. Also, I'll point out that LLVM essentially does not have
a memory-hierarchy model on which such decisions could be based. Work in
this area would be welcome.

 -Hal
> You essentially want to ensure that reads after the loop were not
> hitting in the cache even with regular stores.  (If you are writing
> a large area in the loop, a large percentage of those lines are
> already evicted by the time you exit the loop.)
> 
> Adam
-- 
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory

via llvm-dev

2016-May-03 17:26 UTC


Non-temporal hints are extremely dangerous to use in practice IMO; they hurt far
more than they help in real programs. Compilers really should not be using them
without the programmer’s knowledge, even if it helps some microbenchmark,
because in real programs, the only one who really knows the memory access
patterns is the programmer. Unless you can be certain the output of a function
or loop is literally bigger than the entire cache and won’t be used again for a
long time, forcibly evicting it from cache tends to be a very costly mistake.

—escha

Adam Nemet via llvm-dev

2016-May-03 17:29 UTC


> On May 3, 2016, at 10:21 AM, Adam Nemet <anemet at apple.com> wrote:
> 
> 
>> On May 3, 2016, at 3:40 AM, Hahnfeld, Jonas via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>> However it doesn't add the '!nontemporal' metadata which would be needed for
>> maximal bandwidth on X86.
Also, MichaelZ introduced builtins last year to manually force the generation
of non-temporal loads and stores: __builtin_nontemporal_load and
__builtin_nontemporal_store.  I believe these are documented.
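
For illustration, a minimal sketch of using the store builtin by hand. The
NT_STORE macro and scale_nt function are hypothetical names; the
__has_builtin guard is an assumption added so the code still compiles, with
ordinary stores, on compilers that lack the Clang builtin.

```c
#include <assert.h>
#include <stddef.h>

#ifndef __has_builtin
#define __has_builtin(x) 0 /* compilers without __has_builtin */
#endif

#if __has_builtin(__builtin_nontemporal_store)
/* Clang: emits a store carrying the !nontemporal hint. */
#define NT_STORE(val, ptr) __builtin_nontemporal_store((val), (ptr))
#else
/* Fallback: an ordinary store, so the sketch stays portable. */
#define NT_STORE(val, ptr) (*(ptr) = (val))
#endif

/* dst[i] = s * src[i], asking for non-temporal stores where available. */
static void scale_nt(double *dst, const double *src, double s, size_t n)
{
    assert(n == 0 || (dst != NULL && src != NULL));
    for (size_t i = 0; i < n; ++i)
        NT_STORE(s * src[i], &dst[i]);
}
```

The hint does not change the result of the loop. On x86, non-temporal
stores are weakly ordered, so code that hands the buffer to another thread
may need a store fence first.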

via llvm-dev

2016-May-03 19:45 UTC


Steve Canon is on vacation, so I’m going to quote, word for word, his take on
the compiler autogenerating nontemporal hints:

"nope nope nope nope nope nope nope nope nope nope nope nope nope nope nope
nope nope nope nope nope nope nope nope nope nope n” — Steve Canon

—escha

JF Bastien via llvm-dev

2016-May-03 20:20 UTC


I agree with the other replies on this thread. I'd also suggest looking at
my RFC, which I still have to implement:

https://groups.google.com/d/topic/llvm-dev/ZJ8SVCJPpcc/discussion

