Jay McCarthy via llvm-dev
2015-Nov-03 18:01 UTC
[llvm-dev] Vectorizing structure reads, writes, etc on X86-64 AVX
Thank you for your reply. FWIW, I wrote the .ll by hand after taking
the C program, using clang to emit the LLVM IR, and seeing the memcpy.
The memcpy version that clang generates gets compiled into assembly
that uses the large sequence of movs and does not use the vector
hardware at all. When I started debugging, I took that clang-produced
.ll and started to rewrite it in different ways, trying to get
different results.

Jay

On Tue, Nov 3, 2015 at 12:23 PM, Sanjay Patel <spatel at rotateright.com> wrote:
> Hi Jay -
>
> I'm surprised by the codegen for your examples too, but LLVM has an
> expectation that a front-end and IR optimizer will use llvm.memcpy
> liberally:
> http://llvm.org/docs/doxygen/html/SelectionDAGBuilder_8cpp_source.html#l00094
> http://llvm.org/docs/doxygen/html/SelectionDAGBuilder_8cpp_source.html#l03156
>
> "Any ld-ld-st-st sequence over this should have been converted to
> llvm.memcpy by the frontend."
> "The optimizer should really avoid this case by converting large
> object/array copies to llvm.memcpy"
>
> So for example with clang:
>
> $ cat copy.c
> struct bagobytes {
>   int i0;
>   int i1;
> };
>
> void foo(struct bagobytes* a, struct bagobytes* b) {
>   *b = *a;
> }
>
> $ clang -O2 copy.c -S -emit-llvm -Xclang -disable-llvm-optzns -o -
> define void @foo(%struct.bagobytes* %a, %struct.bagobytes* %b) #0 {
> ...
>   call void @llvm.memcpy.p0i8.p0i8.i64(i8* %2, i8* %3, i64 8, i32 4,
>                                        i1 false), !tbaa.struct !6
>   ret void
> }
>
> It may still be worth filing a bug (or seeing if one is already open)
> for one of your simple examples.
>
> On Thu, Oct 29, 2015 at 6:08 PM, Jay McCarthy via llvm-dev
> <llvm-dev at lists.llvm.org> wrote:
>>
>> I am a first-time poster, so I apologize if this is an obvious
>> question or out of scope for LLVM. I am an LLVM user. I don't really
>> know anything about hacking on LLVM, but I do know a bit about
>> compilation generally.
>>
>> I am on x86-64 and I am interested in structure reads, writes, and
>> constants being optimized to use vector registers when the alignment
>> and sizes are right. I have created a gist of a small example:
>>
>> https://gist.github.com/jeapostrophe/d54d3a6a871e5127a6ed
>>
>> The assembly is produced with
>>
>>   llc -O3 -march=x86-64 -mcpu=corei7-avx
>>
>> The key idea is that we have a structure like this:
>>
>>   %athing = type { float, float, float, float, float, float,
>>                    i16, i16, i8, i8, i8, i8 }
>>
>> That works out to be 32 bytes, so it can fit in a YMM register.
>>
>> If I have two pointers to arrays of these things:
>>
>>   @one = external global %athing
>>   @two = external global %athing
>>
>> and then I do a copy from one to the other
>>
>>   %a = load %athing* @two
>>   store %athing %a, %athing* @one
>>
>> then the code that is generated uses the XMM registers for the
>> floats, but does 12 loads and then 12 stores.
>>
>> In contrast, if I manually cast to a properly sized float vector, I
>> get the desired single load and single store:
>>
>>   %two_vector = bitcast %athing* @two to <8 x float>*
>>   %b = load <8 x float>* %two_vector
>>   %one_vector = bitcast %athing* @one to <8 x float>*
>>   store <8 x float> %b, <8 x float>* %one_vector
>>
>> The rest of the file demonstrates that the code for modifying these
>> vectors is pretty good, but it has examples of bad ways to initialize
>> the structure and a good way to initialize it. If I try to store a
>> constant struct, I get 13 stores.
>> If I try to assemble a vector by casting a <2 x i16> to float and a
>> <4 x i8> to float and installing them into a single <8 x float>, I do
>> get the desired single store, but I get very complicated constants
>> that are loaded from memory. In contrast, if I bitcast the
>> <8 x float> to <16 x i16> and <32 x i8> as I go, then I get the
>> desired initialization with no loads and just modifications of the
>> single YMM register. (Even this last one, however, doesn't have the
>> best assembly, because the words and bytes are not inserted into the
>> vector all at once, but individually.)
>>
>> I am kind of surprised that the obvious code didn't get optimized the
>> way I expected, and even the tedious version of the initialization
>> isn't optimal either. I would like to know whether a transformation
>> from one to the other is feasible in LLVM (I know anything is
>> possible, but what is feasible in this situation?), or whether I
>> should implement a transformation like this in my front-end and
>> settle for the initialization that comes out.
>>
>> Thank you for your time,
>>
>> Jay

--
Jay McCarthy
Associate Professor
PLT @ CS @ UMass Lowell
http://jeapostrophe.github.io

"Wherefore, be not weary in well-doing,
for ye are laying the foundation of a great work.
And out of small things proceedeth that which is great."
- D&C 64:33
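For concreteness, the "good" initialization described in the quoted
message (bitcasting to <16 x i16> and <32 x i8> as the integer fields
are inserted) might look roughly like the sketch below. The function
name @init_one, the field values, and the lane indices are
illustrative assumptions based on the struct layout above and
x86-64's little-endian byte order; the poster's actual examples are
in the linked gist.

  %athing = type { float, float, float, float, float, float, i16, i16, i8, i8, i8, i8 }

  @one = external global %athing

  define void @init_one() {
    ; lanes 0-5 hold the six float fields; the last 8 bytes are filled in below
    %f = bitcast <8 x float> <float 1.0, float 2.0, float 3.0, float 4.0,
                              float 5.0, float 6.0, float undef,
                              float undef> to <16 x i16>
    ; bytes 24-27: the two i16 fields are i16 lanes 12 and 13
    %w0 = insertelement <16 x i16> %f,  i16 7, i32 12
    %w1 = insertelement <16 x i16> %w0, i16 8, i32 13
    ; bytes 28-31: the four i8 fields are i8 lanes 28-31
    %b  = bitcast <16 x i16> %w1 to <32 x i8>
    %b0 = insertelement <32 x i8> %b,  i8  9, i32 28
    %b1 = insertelement <32 x i8> %b0, i8 10, i32 29
    %b2 = insertelement <32 x i8> %b1, i8 11, i32 30
    %b3 = insertelement <32 x i8> %b2, i8 12, i32 31
    %v  = bitcast <32 x i8> %b3 to <8 x float>
    %p  = bitcast %athing* @one to <8 x float>*
    store <8 x float> %v, <8 x float>* %p
    ret void
  }

This keeps everything in a single YMM-sized value, but, as the quoted
message notes, the generated assembly still inserts the words and
bytes individually rather than all at once.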
Sanjay Patel via llvm-dev
2015-Nov-03 18:30 UTC
[llvm-dev] Vectorizing structure reads, writes, etc on X86-64 AVX
If the memcpy version isn't getting optimized into larger memory
operations, that definitely sounds like a bug worth filing.

Lowering of memcpy is affected by the size of the copy, the alignments
of the source and dest, and the CPU target. You may be able to narrow
down the problem by changing those parameters.

On Tue, Nov 3, 2015 at 11:01 AM, Jay McCarthy <jay.mccarthy at gmail.com> wrote:
> Thank you for your reply. FWIW, I wrote the .ll by hand after taking
> the C program, using clang to emit the LLVM IR, and seeing the memcpy.
> The memcpy version that clang generates gets compiled into assembly
> that uses the large sequence of movs and does not use the vector
> hardware at all. When I started debugging, I took that clang-produced
> .ll and started to rewrite it in different ways, trying to get
> different results.
>
> Jay
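As a starting point for the kind of narrowing-down suggested above, a
minimal memcpy-based reproducer might look like the following sketch.
The function name @copy_memcpy is made up; the intrinsic signature
follows the clang output quoted earlier in the thread, i64 32 is the
size of %athing, and i32 4 is its natural alignment.

  %athing = type { float, float, float, float, float, float, i16, i16, i8, i8, i8, i8 }

  @one = external global %athing
  @two = external global %athing

  declare void @llvm.memcpy.p0i8.p0i8.i64(i8*, i8*, i64, i32, i1)

  define void @copy_memcpy() {
    %dst = bitcast %athing* @one to i8*
    %src = bitcast %athing* @two to i8*
    ; size = 32 bytes, alignment = 4; the knobs to vary are the i64 size,
    ; the i32 alignment, and the llc -mcpu/-mattr flags
    call void @llvm.memcpy.p0i8.p0i8.i64(i8* %dst, i8* %src, i64 32, i32 4, i1 false)
    ret void
  }

Compiling variants of this with, e.g., llc -O3 -march=x86-64
-mcpu=corei7-avx (and again with different sizes, alignments, and CPU
targets) should show which combinations are lowered to wide vector
moves and which produce the long scalar mov sequence.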
Hal Finkel via llvm-dev
2015-Nov-03 18:33 UTC
[llvm-dev] Vectorizing structure reads, writes, etc on X86-64 AVX
----- Original Message -----
> From: "Sanjay Patel via llvm-dev" <llvm-dev at lists.llvm.org>
> To: "Jay McCarthy" <jay.mccarthy at gmail.com>
> Cc: "llvm-dev" <llvm-dev at lists.llvm.org>
> Sent: Tuesday, November 3, 2015 12:30:51 PM
> Subject: Re: [llvm-dev] Vectorizing structure reads, writes, etc on X86-64 AVX
>
> If the memcpy version isn't getting optimized into larger memory
> operations, that definitely sounds like a bug worth filing.
>
> Lowering of memcpy is affected by the size of the copy, the alignments
> of the source and dest, and the CPU target. You may be able to narrow
> down the problem by changing those parameters.

The relevant target-specific logic is in
X86TargetLowering::getOptimalMemOpType; looking at that might help in
understanding what's going on.

 -Hal
--
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory
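To connect this to something observable at the IR level: the size and
alignment values that reach getOptimalMemOpType come from the memcpy
call site and from what is known about its operands, so one way to
probe its decisions is to give the globals and the call an explicit
32-byte alignment. A variant of the earlier sketch (the alignments are
chosen for illustration, not taken from the original gist):

  %athing = type { float, float, float, float, float, float, i16, i16, i8, i8, i8, i8 }

  @one = external global %athing, align 32
  @two = external global %athing, align 32

  declare void @llvm.memcpy.p0i8.p0i8.i64(i8*, i8*, i64, i32, i1)

  define void @copy_aligned() {
    %dst = bitcast %athing* @one to i8*
    %src = bitcast %athing* @two to i8*
    ; 32-byte size and 32-byte alignment: with AVX enabled, the backend
    ; is free to use a single 32-byte (YMM) load/store pair here
    call void @llvm.memcpy.p0i8.p0i8.i64(i8* %dst, i8* %src, i64 32, i32 32, i1 false)
    ret void
  }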