Sanjay Patel via llvm-dev
2015-Nov-04 15:46 UTC
[llvm-dev] Vectorizing structure reads, writes, etc on X86-64 AVX
Hi Jay -

I see the slow, small accesses using an older clang [Apple LLVM version
7.0.0 (clang-700.1.76)], but this looks fixed on trunk. I made a change
that comes into play if you don't specify a particular CPU:
http://llvm.org/viewvc/llvm-project?view=revision&revision=245950

$ ./clang -O1 -mavx copy.c -S -o -
...
        movslq  %edi, %rax
        movq    _spr_dynamic@GOTPCREL(%rip), %rcx
        movq    (%rcx), %rcx
        shlq    $5, %rax
        movslq  %esi, %rdx
        movq    _spr_static@GOTPCREL(%rip), %rsi
        movq    (%rsi), %rsi
        shlq    $5, %rdx
        vmovups (%rsi,%rdx), %ymm0    <--- 32-byte load
        vmovups %ymm0, (%rcx,%rax)    <--- 32-byte store
        popq    %rbp
        vzeroupper
        retq

On Wed, Nov 4, 2015 at 8:11 AM, Jay McCarthy <jay.mccarthy at gmail.com> wrote:
> Thanks, Hal.
>
> That code is very readable. Basically, the following has to be true:
> - not a memset or memzero [check]
> - no implicit floats [check]
> - size greater than 16 [check, it's 32]
> - !isUnalignedMem16Slow [check?]
> - int256, fp256, sse2, or sse1 is around [check]
>
> That last condition is:
> - src & dst alignment is 0 or greater than 16
>
> I think this is true, because I'm reading from a giant array of these
> things, so the memory should be aligned to the object size. In case
> that's wrong, I added an explicit alignment attribute.
>
> I think part of the problem is that the memcpy that gets generated
> isn't for the structure, but for the structures bitcast into character
> arrays:
>
>   %17 = bitcast %struct.sprite* %9 to i8*
>   %18 = bitcast %struct.sprite* %16 to i8*
>   call void @llvm.memcpy.p0i8.p0i8.i64(i8* %17, i8* %18, i64 32, i32 4, i1 false)
>
> So even though the original struct pointers were aligned at 32, the
> byte arrays that are created lose that alignment information.
>
> If this is correct, would you recommend this as just an error that
> will be fixed with a little test case?
> BTW, here's a tiny C program that demonstrates the "problem":
>
> typedef struct {
>   float dx; float dy;
>   float mx; float my;
>   float theta; float a;
>   short spr; short pal;
>   char layer;
>   char r; char g; char b;
> } sprite;
>
> sprite *spr_static;  // or array of [1024] // or add __attribute__ ((align_value(32)))
> sprite *spr_dynamic; // or array of [1024] // or add __attribute__ ((align_value(32)))
>
> void copy(int i, int j) {
>   spr_dynamic[i] = spr_static[j];
> }
>
> Thanks!
>
> Jay
>
> On Tue, Nov 3, 2015 at 1:33 PM, Hal Finkel <hfinkel at anl.gov> wrote:
> >
> > ----- Original Message -----
> >> From: "Sanjay Patel via llvm-dev" <llvm-dev at lists.llvm.org>
> >> To: "Jay McCarthy" <jay.mccarthy at gmail.com>
> >> Cc: "llvm-dev" <llvm-dev at lists.llvm.org>
> >> Sent: Tuesday, November 3, 2015 12:30:51 PM
> >> Subject: Re: [llvm-dev] Vectorizing structure reads, writes, etc on X86-64 AVX
> >>
> >> If the memcpy version isn't getting optimized into larger memory
> >> operations, that definitely sounds like a bug worth filing.
> >>
> >> Lowering of memcpy is affected by the size of the copy, alignments of
> >> the source and dest, and CPU target. You may be able to narrow down
> >> the problem by changing those parameters.
> >>
> >
> > The relevant target-specific logic is in
> > X86TargetLowering::getOptimalMemOpType; looking at that might help in
> > understanding what's going on.
> >
> > -Hal
> >
> >> On Tue, Nov 3, 2015 at 11:01 AM, Jay McCarthy <jay.mccarthy at gmail.com> wrote:
> >>
> >> Thank you for your reply. FWIW, I wrote the .ll by hand after taking
> >> the C program, using clang to emit the llvm and seeing the memcpy. The
> >> memcpy version that clang generates gets compiled into assembly that
> >> uses the large sequence of movs and does not use the vector hardware
> >> at all. When I started debugging, I took that clang-produced .ll and
> >> started to write it different ways trying to get different results.
> >>
> >> Jay
> >>
> >> On Tue, Nov 3, 2015 at 12:23 PM, Sanjay Patel <spatel at rotateright.com> wrote:
> >> > Hi Jay -
> >> >
> >> > I'm surprised by the codegen for your examples too, but LLVM has an
> >> > expectation that a front-end and IR optimizer will use llvm.memcpy
> >> > liberally:
> >> >
> >> > http://llvm.org/docs/doxygen/html/SelectionDAGBuilder_8cpp_source.html#l00094
> >> > http://llvm.org/docs/doxygen/html/SelectionDAGBuilder_8cpp_source.html#l03156
> >> >
> >> > "Any ld-ld-st-st sequence over this should have been converted to
> >> > llvm.memcpy by the frontend."
> >> > "The optimizer should really avoid this case by converting large
> >> > object/array copies to llvm.memcpy"
> >> >
> >> > So for example with clang:
> >> >
> >> > $ cat copy.c
> >> > struct bagobytes {
> >> >   int i0;
> >> >   int i1;
> >> > };
> >> >
> >> > void foo(struct bagobytes* a, struct bagobytes* b) {
> >> >   *b = *a;
> >> > }
> >> >
> >> > $ clang -O2 copy.c -S -emit-llvm -Xclang -disable-llvm-optzns -o -
> >> > define void @foo(%struct.bagobytes* %a, %struct.bagobytes* %b) #0 {
> >> > ...
> >> >   call void @llvm.memcpy.p0i8.p0i8.i64(i8* %2, i8* %3, i64 8, i32 4, i1 false), !tbaa.struct !6
> >> >   ret void
> >> > }
> >> >
> >> > It may still be worth filing a bug (or seeing if one is already
> >> > open) for one of your simple examples.
> >> >
> >> > On Thu, Oct 29, 2015 at 6:08 PM, Jay McCarthy via llvm-dev
> >> > <llvm-dev at lists.llvm.org> wrote:
> >> >>
> >> >> I am a first-time poster, so I apologize if this is an obvious
> >> >> question or out of scope for LLVM. I am an LLVM user. I don't really
> >> >> know anything about hacking on LLVM, but I do know a bit about
> >> >> compilation generally.
> >> >>
> >> >> I am on x86-64, and I am interested in structure reads, writes, and
> >> >> constants being optimized to use vector registers when the alignment
> >> >> and sizes are right.
> >> >> I have created a gist of a small example:
> >> >>
> >> >> https://gist.github.com/jeapostrophe/d54d3a6a871e5127a6ed
> >> >>
> >> >> The assembly is produced with:
> >> >>
> >> >>   llc -O3 -march=x86-64 -mcpu=corei7-avx
> >> >>
> >> >> The key idea is that we have a structure like this:
> >> >>
> >> >>   %athing = type { float, float, float, float, float, float,
> >> >>                    i16, i16, i8, i8, i8, i8 }
> >> >>
> >> >> That works out to be 32 bytes, so it can fit in a YMM register.
> >> >>
> >> >> If I have two pointers to arrays of these things:
> >> >>
> >> >>   @one = external global %athing
> >> >>   @two = external global %athing
> >> >>
> >> >> and then I do a copy from one to the other:
> >> >>
> >> >>   %a = load %athing* @two
> >> >>   store %athing %a, %athing* @one
> >> >>
> >> >> then the code that is generated uses the XMM registers for the floats
> >> >> but does 12 loads and then 12 stores.
> >> >>
> >> >> In contrast, if I manually cast to a properly sized float vector, I get
> >> >> the desired single load and single store:
> >> >>
> >> >>   %two_vector = bitcast %athing* @two to <8 x float>*
> >> >>   %b = load <8 x float>* %two_vector
> >> >>   %one_vector = bitcast %athing* @one to <8 x float>*
> >> >>   store <8 x float> %b, <8 x float>* %one_vector
> >> >>
> >> >> The rest of the file demonstrates that the code for modifying these
> >> >> vectors is pretty good, but it has examples of bad ways to initialize
> >> >> the structure and a good way to initialize it. If I try to store a
> >> >> constant struct, I get 13 stores. If I try to assemble a vector by
> >> >> casting <2 x i16> to float, then <4 x i8> to float, and installing them
> >> >> into a single <8 x float>, I do get the desired single store, but I
> >> >> get very complicated constants that are loaded from memory.
> >> >> In contrast, if I bitcast the <8 x float> to <16 x i16> and <32 x i8>
> >> >> as I go, then I get the desired initialization with no loads and just
> >> >> modifications of the single YMM register. (Even this last one,
> >> >> however, doesn't have the best assembly, because the words and bytes
> >> >> are not inserted into the vector simultaneously, but individually.)
> >> >>
> >> >> I am kind of surprised that the obvious code didn't get optimized the
> >> >> way I expected, and even the tedious version of the initialization
> >> >> isn't optimal either. I would like to know whether a transformation of
> >> >> one to the other is feasible in LLVM (I know anything is possible, but
> >> >> what is feasible in this situation?), or whether I should implement a
> >> >> transformation like this in my front-end and settle for the
> >> >> initialization that comes out.
> >> >>
> >> >> Thank you for your time,
> >> >>
> >> >> Jay
> >> >>
> >> >> --
> >> >> Jay McCarthy
> >> >> Associate Professor
> >> >> PLT @ CS @ UMass Lowell
> >> >> http://jeapostrophe.github.io
> >> >>
> >> >> "Wherefore, be not weary in well-doing,
> >> >> for ye are laying the foundation of a great work.
> >> >> And out of small things proceedeth that which is great."
> >> >> - D&C 64:33
> >> >> _______________________________________________
> >> >> LLVM Developers mailing list
> >> >> llvm-dev at lists.llvm.org
> >> >> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
> >>
> >> --
> >> Jay McCarthy
> >> Associate Professor
> >> PLT @ CS @ UMass Lowell
> >> http://jeapostrophe.github.io
> >>
> >> "Wherefore, be not weary in well-doing,
> >> for ye are laying the foundation of a great work.
> >> And out of small things proceedeth that which is great."
> >> - D&C 64:33
> >
> > --
> > Hal Finkel
> > Assistant Computational Scientist
> > Leadership Computing Facility
> > Argonne National Laboratory
>
> --
> Jay McCarthy
> Associate Professor
> PLT @ CS @ UMass Lowell
> http://jeapostrophe.github.io
>
> "Wherefore, be not weary in well-doing,
> for ye are laying the foundation of a great work.
> And out of small things proceedeth that which is great."
> - D&C 64:33
Jay McCarthy via llvm-dev
2015-Nov-04 15:53 UTC
[llvm-dev] Vectorizing structure reads, writes, etc on X86-64 AVX
Oh, that's great. I'll just update and go from there. Thanks so much,
and sorry for the noise.

Jay

On Wed, Nov 4, 2015 at 10:46 AM, Sanjay Patel <spatel at rotateright.com> wrote:
> Hi Jay -
>
> I see the slow, small accesses using an older clang [Apple LLVM version
> 7.0.0 (clang-700.1.76)], but this looks fixed on trunk. I made a change
> that comes into play if you don't specify a particular CPU:
> http://llvm.org/viewvc/llvm-project?view=revision&revision=245950
>
> $ ./clang -O1 -mavx copy.c -S -o -
> ...
>         vmovups (%rsi,%rdx), %ymm0    <--- 32-byte load
>         vmovups %ymm0, (%rcx,%rax)    <--- 32-byte store
>
> [snip]

--
Jay McCarthy
Associate Professor
PLT @ CS @ UMass Lowell
http://jeapostrophe.github.io

"Wherefore, be not weary in well-doing,
for ye are laying the foundation of a great work.
And out of small things proceedeth that which is great."
- D&C 64:33
Sanjay Patel via llvm-dev
2015-Nov-04 16:07 UTC
[llvm-dev] Vectorizing structure reads, writes, etc on X86-64 AVX
No problem. Please do file bugs if you see anything that looks
suspicious. The x86 memcpy lowering still has that FIXME comment that I
haven't gotten back around to, and we have at least one other potential
improvement:
https://llvm.org/bugs/show_bug.cgi?id=24678

On Wed, Nov 4, 2015 at 8:53 AM, Jay McCarthy <jay.mccarthy at gmail.com> wrote:
> Oh, that's great. I'll just update and go from there. Thanks so much,
> and sorry for the noise.
>
> Jay
>
> [snip]