hameeza ahmed via llvm-dev
2018-Jan-20 18:29 UTC
[llvm-dev] Non-Temporal hints from Loop Vectorizer
I have already seen the use of __builtin_nontemporal_store, but I want to
automate the identification of non-temporal loads/stores. I think I need to
write a pass. Is it possible to detect non-temporal loops without Polly?

On Sat, Jan 20, 2018 at 11:26 PM, Simon Pilgrim <llvm-dev at redking.me.uk> wrote:

> On 20/01/2018 18:16, hameeza ahmed wrote:
>> Actually, I am working on a vector accelerator which will execute those
>> instructions that are non-temporal.
>>
>> For instance, if I have this loop
>>
>> for(i=0;i<2048;i++)
>>   a[i]=b[i]+c[i];
>>
>> it currently emits the following IR:
>>
>> %0 = getelementptr inbounds [2048 x i32], [2048 x i32]* @b, i64 0, i64 %index
>> %1 = bitcast i32* %0 to <16 x i32>*
>> %wide.load = load <16 x i32>, <16 x i32>* %1, align 16, !tbaa !1
>> %8 = getelementptr inbounds [2048 x i32], [2048 x i32]* @c, i64 0, i64 %index
>> %9 = bitcast i32* %8 to <16 x i32>*
>> %wide.load14 = load <16 x i32>, <16 x i32>* %9, align 16, !tbaa !1
>> %16 = add nsw <16 x i32> %wide.load14, %wide.load
>> %20 = getelementptr inbounds [2048 x i32], [2048 x i32]* @a, i64 0, i64 %index
>> %21 = bitcast i32* %20 to <16 x i32>*
>> store <16 x i32> %16, <16 x i32>* %21, align 16, !tbaa !1
>>
>> However, I want it to emit the following IR:
>>
>> %0 = getelementptr inbounds [2048 x i32], [2048 x i32]* @b, i64 0, i64 %index
>> %1 = bitcast i32* %0 to <16 x i32>*
>> %wide.load = load <16 x i32>, <16 x i32>* %1, align 16, !tbaa !1, !nontemporal !1
>> %8 = getelementptr inbounds [2048 x i32], [2048 x i32]* @c, i64 0, i64 %index
>> %9 = bitcast i32* %8 to <16 x i32>*
>> %wide.load14 = load <16 x i32>, <16 x i32>* %9, align 16, !tbaa !1, !nontemporal !1
>> %16 = add nsw <16 x i32> %wide.load14, %wide.load, !nontemporal !1
>> %20 = getelementptr inbounds [2048 x i32], [2048 x i32]* @a, i64 0, i64 %index
>> %21 = bitcast i32* %20 to <16 x i32>*
>> store <16 x i32> %16, <16 x i32>* %21, align 16, !tbaa !1, !nontemporal !1
>>
>> so that I can offload the load, add, and store to the accelerator
>> hardware. Is that possible here? Do I need a separate pass to detect
>> whether the loop has non-temporal data, or will Polly help here? What do
>> you say?
>
> From C/C++ you just need to use the
> __builtin_nontemporal_store/__builtin_nontemporal_load builtins to tag the
> stores/loads with the nontemporal flag.
>
> for(i=0;i<2048;i++) {
>   __builtin_nontemporal_store( __builtin_nontemporal_load(b+i) +
>     __builtin_nontemporal_load(c + i), a + i );
> }
>
> There may be an attribute you can tag pointers with instead, but I don't
> know offhand.
>
>> On Sat, Jan 20, 2018 at 11:02 PM, Simon Pilgrim <llvm-dev at redking.me.uk> wrote:
>>
>>> On 20/01/2018 17:44, hameeza ahmed via llvm-dev wrote:
>>>> Hello,
>>>>
>>>> My work deals with non-temporal loads and stores. I found the
>>>> non-temporal metadata in the LLVM documentation, but it is not shown
>>>> in the IR.
>>>>
>>>> How do I get the non-temporal metadata?
>>>
>>> llvm\test\CodeGen\X86\nontemporal-loads.ll shows how to create nt vector
>>> loads in IR - is that what you're after?
>>>
>>> Simon.
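For reference, here is a self-contained sketch of the builtin-based loop from
the quoted reply. It assumes Clang for the __builtin_nontemporal_* builtins.
The NT_LOAD/NT_STORE macros and the function name vector_sum_nt are
illustrative and do not appear in the thread, and the fallback to plain
accesses on other compilers is an assumption made only so the file builds
everywhere. Note that the LLVM Language Reference defines !nontemporal on load
and store instructions (referencing a node that holds a single i32 1), so only
the loads and the store in the desired IR above would carry it, not the add.

```c
#include <stddef.h>

/* Clang provides __builtin_nontemporal_load/store. On compilers without
   them we fall back to ordinary accesses (assumption, for portability). */
#ifndef __has_builtin
#define __has_builtin(x) 0
#endif

#if __has_builtin(__builtin_nontemporal_load) && \
    __has_builtin(__builtin_nontemporal_store)
#define NT_LOAD(p)     __builtin_nontemporal_load(p)
#define NT_STORE(v, p) __builtin_nontemporal_store((v), (p))
#else
#define NT_LOAD(p)     (*(p))
#define NT_STORE(v, p) (*(p) = (v))
#endif

#define N 2048

int a[N], b[N], c[N];

/* Element-wise sum in which every load and store is tagged nontemporal
   when the builtins are available. */
void vector_sum_nt(void) {
    for (size_t i = 0; i < N; i++) {
        NT_STORE(NT_LOAD(b + i) + NT_LOAD(c + i), a + i);
    }
}
```

Compiling this with Clang using -O2 -S -emit-llvm should make it possible to
check whether the vectorized loads and stores keep the !nontemporal tag; the
thread itself does not show that output.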
Hal Finkel via llvm-dev
2018-Jan-21 20:59 UTC
[llvm-dev] Non-Temporal hints from Loop Vectorizer
On 01/20/2018 12:29 PM, hameeza ahmed via llvm-dev wrote:
> I have already seen the use of __builtin_nontemporal_store, but I want to
> automate the identification of non-temporal loads/stores. I think I need
> to write a pass. Is it possible to detect non-temporal loops without
> Polly?

Yes, but we don't have anything that does that right now. The cost modeling
is non-trivial, however. In the loop below, which of those accesses would you
expect to be nontemporal? All of those accesses span only 8 KB, and that's
certainly smaller than many L1 caches. Turning those into nontemporal
accesses could certainly lead to a performance regression for that loop,
subsequent code, or both. If we do this more generally, I suspect that we'd
need to split the loop so that small trip counts don't use them at all, and
for larger trip counts, we don't disturb data-reuse opportunities that would
otherwise exist.

 -Hal

> On Sat, Jan 20, 2018 at 11:26 PM, Simon Pilgrim <llvm-dev at redking.me.uk> wrote:
>
>> On 20/01/2018 18:16, hameeza ahmed wrote:
>>> Actually, I am working on a vector accelerator which will execute those
>>> instructions that are non-temporal.
>>>
>>> For instance, if I have this loop
>>>
>>> for(i=0;i<2048;i++)
>>>   a[i]=b[i]+c[i];

-- 
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory
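The loop-splitting idea in the reply above can be sketched by hand as a
versioned loop: a runtime footprint check picks either the plain loop or the
nontemporal one. This is only an illustration of the idea, not an existing
LLVM transformation. ASSUMED_CACHE_BYTES (32 KiB), footprint_exceeds_cache,
vector_sum_versioned and the NT_* macros are all hypothetical names and
values, and a real cost model would also have to account for data reuse in
the surrounding code.

```c
#include <stdbool.h>
#include <stddef.h>

/* Clang-only builtins, with a plain-access fallback (assumption). */
#ifndef __has_builtin
#define __has_builtin(x) 0
#endif

#if __has_builtin(__builtin_nontemporal_load) && \
    __has_builtin(__builtin_nontemporal_store)
#define NT_LOAD(p)     __builtin_nontemporal_load(p)
#define NT_STORE(v, p) __builtin_nontemporal_store((v), (p))
#else
#define NT_LOAD(p)     (*(p))
#define NT_STORE(v, p) (*(p) = (v))
#endif

/* Hypothetical threshold standing in for the target's L1 data cache. */
#define ASSUMED_CACHE_BYTES ((size_t)32 * 1024)

/* True when n elements across num_arrays arrays cannot fit in the assumed
   cache. Division is used so that n * bytes cannot overflow. */
bool footprint_exceeds_cache(size_t n, size_t elem_size, size_t num_arrays) {
    size_t bytes_per_iter = elem_size * num_arrays;
    if (bytes_per_iter == 0) {
        return false;
    }
    return n > ASSUMED_CACHE_BYTES / bytes_per_iter;
}

/* Two versions of the same loop: small trip counts keep ordinary cached
   accesses, large ones use the nontemporal hints. */
void vector_sum_versioned(int *dst, int *lhs, int *rhs, size_t n) {
    if (footprint_exceeds_cache(n, sizeof(int), 3)) {
        for (size_t i = 0; i < n; i++) {
            NT_STORE(NT_LOAD(lhs + i) + NT_LOAD(rhs + i), dst + i);
        }
    } else {
        for (size_t i = 0; i < n; i++) {
            dst[i] = lhs[i] + rhs[i];
        }
    }
}
```

With these assumed numbers, the 2048-element loop from the thread (three
arrays of 8 KB each, 24 KB in total) stays on the ordinary path.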
hameeza ahmed via llvm-dev
2018-Jan-22 21:26 UTC
[llvm-dev] Non-Temporal hints from Loop Vectorizer
Thank you.

If I execute the same vector sum code with a greater number of iterations,
such as 100000000000, will the non-temporal loads and stores be effective?

On Mon, Jan 22, 2018 at 1:59 AM, Hal Finkel <hfinkel at anl.gov> wrote:

> On 01/20/2018 12:29 PM, hameeza ahmed via llvm-dev wrote:
>> I have already seen the use of __builtin_nontemporal_store, but I want to
>> automate the identification of non-temporal loads/stores. I think I need
>> to write a pass. Is it possible to detect non-temporal loops without
>> Polly?
>
> Yes, but we don't have anything that does that right now. The cost
> modeling is non-trivial, however. In the loop below, which of those
> accesses would you expect to be nontemporal? All of those accesses span
> only 8 KB, and that's certainly smaller than many L1 caches. Turning those
> into nontemporal accesses could certainly lead to a performance regression
> for that loop, subsequent code, or both. If we do this more generally, I
> suspect that we'd need to split the loop so that small trip counts don't
> use them at all, and for larger trip counts, we don't disturb data-reuse
> opportunities that would otherwise exist.
>
> -Hal
>
>> On Sat, Jan 20, 2018 at 11:26 PM, Simon Pilgrim <llvm-dev at redking.me.uk> wrote:
>>
>>> On 20/01/2018 18:16, hameeza ahmed wrote:
>>>> for(i=0;i<2048;i++)
>>>>   a[i]=b[i]+c[i];
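To put the two trip counts side by side, the footprint arithmetic can be
written out as below. With 4-byte i32 elements, 2048 iterations touch
2048 x 4 = 8192 bytes (8 KB) per array, while 100000000000 iterations would
touch 4 x 10^11 bytes (about 400 GB) per array, assuming the arrays grow with
the trip count. The helper stream_footprint_bytes is a hypothetical name used
only for this illustration. It shows only how much data is streamed; whether
nontemporal accesses actually pay off at that size depends on the target and
is not settled in the thread.

```c
#include <stdint.h>

/* Total bytes touched by `streams` arrays of `n` elements of `elem_size`
   bytes each. elem_size * streams is assumed small (e.g. 4 * 3); the
   result saturates at UINT64_MAX instead of wrapping around. */
uint64_t stream_footprint_bytes(uint64_t n, uint64_t elem_size,
                                uint64_t streams) {
    uint64_t bytes_per_iter = elem_size * streams;
    if (bytes_per_iter != 0 && n > UINT64_MAX / bytes_per_iter) {
        return UINT64_MAX;
    }
    return n * bytes_per_iter;
}
```

For the three arrays a, b and c, stream_footprint_bytes(2048, 4, 3) gives
24576 bytes, and stream_footprint_bytes(100000000000, 4, 3) gives
1200000000000 bytes (1.2 TB).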