thr3ads.net - llvm dev - [llvm-dev] Non-Temporal hints from Loop Vectorizer [Jan 2018]

hameeza ahmed via llvm-dev

2018-Jan-20 18:29 UTC

[llvm-dev] Non-Temporal hints from Loop Vectorizer

I have already seen the usage of __builtin_nontemporal_store, but I want to
automate the identification of non-temporal loads/stores. I think I need to
write a pass. Is it possible to detect non-temporal loops without Polly?

On Sat, Jan 20, 2018 at 11:26 PM, Simon Pilgrim <llvm-dev at redking.me.uk> wrote:
> On 20/01/2018 18:16, hameeza ahmed wrote:
>
> Actually i am working on vector accelerator which will perform those
> instructions which are non temporal.
>
> for instance if i have this loop
>
> for(i=0;i<2048;i++)
> a[i]=b[i]+c[i];
>
> currently it emits following IR;
>
>
>   %0 = getelementptr inbounds [2048 x i32], [2048 x i32]* @b, i64 0, i64 %index
>   %1 = bitcast i32* %0 to <16 x i32>*
>   %wide.load = load <16 x i32>, <16 x i32>* %1, align 16, !tbaa !1
>   %8 = getelementptr inbounds [2048 x i32], [2048 x i32]* @c, i64 0, i64 %index
>   %9 = bitcast i32* %8 to <16 x i32>*
>   %wide.load14 = load <16 x i32>, <16 x i32>* %9, align 16, !tbaa !1
>   %16 = add nsw <16 x i32> %wide.load14, %wide.load
>   %20 = getelementptr inbounds [2048 x i32], [2048 x i32]* @a, i64 0, i64 %index
>   %21 = bitcast i32* %20 to <16 x i32>*
>   store <16 x i32> %16, <16 x i32>* %21, align 16, !tbaa !1
>
>
> However, i want it to emit following IR
>
>   %0 = getelementptr inbounds [2048 x i32], [2048 x i32]* @b, i64 0, i64 %index
>   %1 = bitcast i32* %0 to <16 x i32>*
>   %wide.load = load <16 x i32>, <16 x i32>* %1, align 16, !tbaa !1, !nontemporal !1
>   %8 = getelementptr inbounds [2048 x i32], [2048 x i32]* @c, i64 0, i64 %index
>   %9 = bitcast i32* %8 to <16 x i32>*
>   %wide.load14 = load <16 x i32>, <16 x i32>* %9, align 16, !tbaa !1, !nontemporal !1
>   %16 = add nsw <16 x i32> %wide.load14, %wide.load, !nontemporal !1
>   %20 = getelementptr inbounds [2048 x i32], [2048 x i32]* @a, i64 0, i64 %index
>   %21 = bitcast i32* %20 to <16 x i32>*
>   store <16 x i32> %16, <16 x i32>* %21, align 16, !tbaa !1, !nontemporal !1
>
> so that i can offload load, add, store to accelerator hardware. is it
> possible here? do i need a separate pass to detect whether the loop has non
> temporal data or polly will help here? what do you say?
>
> From C/C++ you just need to use the
> __builtin_nontemporal_store/__builtin_nontemporal_load builtins to tag
> the stores/loads with the nontemporal flag.
>
> for(i=0;i<2048;i++) {
>   __builtin_nontemporal_store( __builtin_nontemporal_load(b+i) +
> __builtin_nontemporal_load(c + i), a + i );
> }
>
> There may be an attribute you can tag pointers with instead but I don't
> know off hand.
>
> On Sat, Jan 20, 2018 at 11:02 PM, Simon Pilgrim <llvm-dev at redking.me.uk> wrote:
>
>> On 20/01/2018 17:44, hameeza ahmed via llvm-dev wrote:
>>
>>> Hello,
>>>
>>> My work deals with non-temporal loads and stores. I found non-temporal
>>> metadata in the LLVM documentation, but it's not shown in the IR.
>>>
>>> How to get non-temporal meta data?
>>>
>> llvm\test\CodeGen\X86\nontemporal-loads.ll shows how to create nt vector
>> loads in IR - is that what you're after?
>>
>> Simon.
>>
>
>

Hal Finkel via llvm-dev

2018-Jan-21 20:59 UTC

On 01/20/2018 12:29 PM, hameeza ahmed via llvm-dev wrote:
> I have already seen the usage of __builtin_nontemporal_store, but I want to
> automate the identification of non-temporal loads/stores. I think I need to
> write a pass. Is it possible to detect non-temporal loops without Polly?

Yes, but we don't have anything that does that right now. The cost
modeling is non-trivial, however. In the loop below, which of those
accesses would you expect to be nontemporal? All of those accesses span
only 8 KB, and that's certainly smaller than many L1 caches. Turning
those into nontemporal accesses could certainly lead to a performance
regression for that loop, subsequent code, or both. If we do this more
generally, I suspect that we'd need to split the loop so that small trip
counts don't use them at all, and for larger trip counts, we don't
disturb data-reuse opportunities that would otherwise exist.

 -Hal
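
The loop-splitting idea above could be sketched by hand roughly as follows. This is only an illustration under stated assumptions, not something LLVM does today: the function name vector_sum, the nt_min_elems threshold parameter, and the NT_LOAD/NT_STORE wrapper macros are all hypothetical, and a suitable threshold would have to come from a target-specific cost model. The only real API used is Clang's __builtin_nontemporal_load/__builtin_nontemporal_store; on compilers without those builtins the macros fall back to ordinary accesses so the sketch still builds.

```c
#include <stddef.h>
#include <stdint.h>

/* Compilers without __has_builtin get a stub that always answers "no". */
#ifndef __has_builtin
#define __has_builtin(x) 0
#endif

/* Hypothetical wrappers: use Clang's nontemporal builtins when available,
   otherwise fall back to plain loads and stores. */
#if __has_builtin(__builtin_nontemporal_load) && \
    __has_builtin(__builtin_nontemporal_store)
#define NT_LOAD(p)     __builtin_nontemporal_load(p)
#define NT_STORE(v, p) __builtin_nontemporal_store((v), (p))
#else
#define NT_LOAD(p)     (*(p))
#define NT_STORE(v, p) (*(p) = (v))
#endif

/* Computes a[i] = b[i] + c[i] for i in [0, n).
   nt_min_elems is an assumed, caller-supplied threshold: below it the
   ordinary loop runs so small working sets stay cache-friendly; at or
   above it the accesses are tagged nontemporal.
   Returns 1 if the nontemporal version ran, 0 otherwise. */
static int vector_sum(int32_t *a, const int32_t *b, const int32_t *c,
                      size_t n, size_t nt_min_elems)
{
    if (n < nt_min_elems) {
        /* Small trip count: plain accesses. */
        for (size_t i = 0; i < n; i++)
            a[i] = b[i] + c[i];
        return 0;
    }
    /* Large trip count: streaming accesses. */
    for (size_t i = 0; i < n; i++)
        NT_STORE(NT_LOAD(b + i) + NT_LOAD(c + i), a + i);
    return 1;
}
```

With a threshold of, say, 1u << 20 elements (an arbitrary value chosen for illustration), the 2048-element loop from this thread would take the ordinary path. An automatic pass would also have to check that the data is not reused shortly afterwards, which this sketch does not attempt.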
>
> On Sat, Jan 20, 2018 at 11:26 PM, Simon Pilgrim <llvm-dev at redking.me.uk> wrote:
>
>     On 20/01/2018 18:16, hameeza ahmed wrote:
>>     Actually i am working on vector accelerator which will perform
>>     those instructions which are non temporal.
>>
>>     for instance if i have this loop
>>
>>     for(i=0;i<2048;i++)
>>     a[i]=b[i]+c[i];
-- 
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory


hameeza ahmed via llvm-dev

2018-Jan-22 21:26 UTC

Thank You.

If I execute the same vector sum code with a much greater number of
iterations, say 100000000000, will the non-temporal loads and stores be
effective?
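
To put the two trip counts side by side, the amount of data the loop streams through can be estimated with a little arithmetic. The sketch below is only that estimate: footprint_bytes and exceeds_cache are hypothetical helper names, the 32 KiB cache size in the usage note is an assumed example rather than a property of any particular target, and a footprint larger than the cache does not by itself show that nontemporal accesses will pay off. Note also that 100000000000 does not fit in a 32-bit int, so the loop counter would need a 64-bit type.

```c
#include <stdint.h>

/* Bytes touched by a loop that streams num_arrays arrays of n elements,
   each elem_size bytes wide, exactly once. */
static uint64_t footprint_bytes(uint64_t n, uint64_t elem_size,
                                uint64_t num_arrays)
{
    return n * elem_size * num_arrays;
}

/* Crude check: does the footprint exceed an assumed cache capacity?
   cache_bytes is supplied by the caller; nothing here queries hardware. */
static int exceeds_cache(uint64_t footprint, uint64_t cache_bytes)
{
    return footprint > cache_bytes;
}
```

For the 2048-element loop with three i32 arrays, footprint_bytes(2048, 4, 3) is 24576 bytes (8 KB per array), which does not exceed an assumed 32 KiB cache. For 100000000000 elements it is 1200000000000 bytes, or 400 GB per array, far beyond any cache size.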

