Peter Bel via llvm-dev
2017-Jun-28 09:43 UTC
[llvm-dev] Wide load/store optimization question
Hi,

I've looked through both the AMDGPU and Sparc backends, and it seems they
do not perform the transformation I want either. The only backend that does
it is AArch64, but it has no register-pair constraints.

To illustrate with an example, I have the following C code:

void test()
{
  int a = 1; int b = 2; int c = 3; int d = 4;
  a++; b++; c++; d++;
}

Without any frontend optimization it compiles to the following IR:

define void @test(i32* %z) #0 {
  %1 = alloca i32*, align 4
  %a = alloca i32, align 4
  %b = alloca i32, align 4
  %c = alloca i32, align 4
  %d = alloca i32, align 4
  store i32* %z, i32** %1, align 4
  store i32 1, i32* %a, align 4
  store i32 2, i32* %b, align 4
  store i32 3, i32* %c, align 4
  store i32 4, i32* %d, align 4
  %2 = load i32, i32* %a, align 4
  %3 = add nsw i32 %2, 1
  store i32 %3, i32* %a, align 4
  %4 = load i32, i32* %b, align 4
  %5 = add nsw i32 %4, 1
  store i32 %5, i32* %b, align 4
  .....
}

which produces the following assembly:

  mov r2, #1
  str r2, [fp, #-2]
  mov r3, #2
  mov r2, #3
  str r3, [fp, #-3]
  str r2, [fp, #-4]
  mov r3, #4
  ldr r2, [fp, #-2]
  str r3, [fp, #-5]
  .....

What I want to do is merge neighboring stores and loads. For example,

  mov r3, #2
  mov r2, #3
  str r3, [fp, #-5]
  str r2, [fp, #-4]

could be converted to

  mov r3, #2
  mov r2, #3
  strd r2, [fp, #-4]

The main problem is that the offset for r3 in the snippet above was -3,
not -5.

Currently I'm doing the following: before register allocation I create a
REG_SEQUENCE with the target register class, assign the vregs in question
as its subregisters, and create a load/store instruction for the sequence
with the memory references merged. That solves the register constraint
problem, but the frame allocation problem remains. I'll probably need to
use fixed stack objects and manually pre-allocate the frame, which I would
really rather not do, since it can break other passes.

Petr

On Sat, Jun 17, 2017 at 10:31 AM, 陳韋任 <chenwj.cs97g at g2.nctu.edu.tw> wrote:

> That question makes no sense.
>> - Every virtual register has a register class assigned.
>> - You can construct special register classes that represent register
>> tuples, so that when the allocator chooses an entry from that register
>> class it has really chosen a tuple of machine registers (even though it
>> looks like a single register with funny aliasing as far as llvm codegen
>> is concerned).
>
> And we still have to lower load i64 to load v2i32, right?
>
> --
> Wei-Ren Chen (陳韋任)
> Homepage: https://people.cs.nctu.edu.tw/~chenwj
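For concreteness, here is a rough sketch of what the pre-RA REG_SEQUENCE step
described above could look like. Everything prefixed MYTGT:: (the pair
register class, the gsub_0/gsub_1 subregister indices, the STRD opcode) and
the helper itself are placeholders, not existing LLVM definitions; the real
pieces are TargetOpcode::REG_SEQUENCE, BuildMI and the MachineRegisterInfo
calls, and exact signatures vary between LLVM versions.

  // Merge two adjacent 32-bit frame stores (StLo stores the low word, StHi
  // the high word) into one doubleword store.  LoVReg/HiVReg are the stored
  // vregs, PairFI the frame index of the merged, dword-aligned slot.  All
  // MYTGT::* names are placeholders for whatever the target defines.
  #include "llvm/CodeGen/MachineFunction.h"
  #include "llvm/CodeGen/MachineInstrBuilder.h"
  #include "llvm/CodeGen/MachineRegisterInfo.h"
  #include "llvm/CodeGen/TargetInstrInfo.h"

  using namespace llvm;

  static void mergePairedStore(MachineBasicBlock &MBB, MachineInstr &StLo,
                               MachineInstr &StHi, unsigned LoVReg,
                               unsigned HiVReg, int PairFI,
                               const TargetInstrInfo *TII) {
    MachineRegisterInfo &MRI = MBB.getParent()->getRegInfo();
    DebugLoc DL = StLo.getDebugLoc();

    // Tie the two 32-bit vregs into one pair-class vreg, so the register
    // allocator is the one that has to pick a legal (e.g. even/odd) pair.
    unsigned PairReg = MRI.createVirtualRegister(&MYTGT::GPRPairRegClass);
    BuildMI(MBB, StLo, DL, TII->get(TargetOpcode::REG_SEQUENCE), PairReg)
        .addReg(LoVReg)
        .addImm(MYTGT::gsub_0)
        .addReg(HiVReg)
        .addImm(MYTGT::gsub_1);

    // Emit the paired store against the merged frame index and carry over
    // the memory operands of both original stores so aliasing info stays
    // correct for later passes.
    MachineInstrBuilder MIB = BuildMI(MBB, StLo, DL, TII->get(MYTGT::STRD))
                                  .addReg(PairReg)
                                  .addFrameIndex(PairFI)
                                  .addImm(0);
    for (MachineMemOperand *MMO : StLo.memoperands())
      MIB.addMemOperand(MMO);
    for (MachineMemOperand *MMO : StHi.memoperands())
      MIB.addMemOperand(MMO);

    StLo.eraseFromParent();
    StHi.eraseFromParent();
  }

The point of routing both values through a single pair-class vreg is that the
register allocator, not the pass, ends up choosing a legal register pair.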
James Y Knight via llvm-dev
2017-Jun-28 13:19 UTC
[llvm-dev] Wide load/store optimization question
Well, that is now a slightly different question.

Once the compiler can do 64-bit loads/stores for a 64-bit integer type
(e.g. C long long), then an optimization pass should be merging the
loads/stores before register allocation, so that appropriate registers can
be chosen.

On Wed, Jun 28, 2017 at 5:43 AM, Peter Bel via llvm-dev
<llvm-dev at lists.llvm.org> wrote:

> Currently I'm doing the following: before register allocation I create a
> REG_SEQUENCE with the target register class, assign the vregs in question
> as its subregisters, and create a load/store instruction for the sequence
> with the memory references merged. That solves the register constraint
> problem, but the frame allocation problem remains. I'll probably need to
> use fixed stack objects and manually pre-allocate the frame, which I would
> really rather not do, since it can break other passes.
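As a side note, the usual place to hang such a pre-RA pass is the target's
TargetPassConfig. A minimal sketch, assuming a hypothetical
MyTargetPassConfig and pass factory (neither exists in tree; constructors
and the other hooks are omitted):

  // Sketch: running a hypothetical load/store pairing pass before the
  // register allocator, from the target's pass configuration.
  #include "llvm/CodeGen/TargetPassConfig.h"
  #include "llvm/Pass.h"

  namespace llvm {
  FunctionPass *createMyTargetLoadStorePairingPass(); // hypothetical factory

  class MyTargetPassConfig : public TargetPassConfig {
  public:
    using TargetPassConfig::TargetPassConfig; // inherit base constructors

    void addPreRegAlloc() override {
      // Loads/stores still operate on virtual registers at this point, so
      // a pairing constraint can be expressed through a register class and
      // left for the register allocator to satisfy.
      addPass(createMyTargetLoadStorePairingPass());
    }
  };
  } // namespace llvm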
Peter Bel via llvm-dev
2017-Jul-03 09:42 UTC
[llvm-dev] Wide load/store optimization question
That's what I've managed to figure out so far. Since a vreg should have only
one def and one kill (please correct me if I'm wrong), there shouldn't be any
collision when merging them, though it might increase register pressure.

But the frame index reference problem is still there. I need both references
to be consecutive, and the lower one has to be dword-aligned. If I just add
dword alignment to the lower subregister, I might end up with both
subregisters dword-aligned and an empty word between them, and there are a
number of other odd cases possible. In short, I just don't know how to glue
two frame indexes together into a single block.

It's possible to go with fixed frame objects, but I'd prefer to keep that as
a last resort, since it may cripple some of the later passes.

Petr

On Wed, Jun 28, 2017 at 4:19 PM, James Y Knight <jyknight at google.com> wrote:

> Well, that is now a slightly different question.
>
> Once the compiler can do 64-bit loads/stores for a 64-bit integer type
> (e.g. C long long), then an optimization pass should be merging the
> loads/stores before register allocation, so that appropriate registers
> can be chosen.
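In case it is useful, one possible direction for the "glue two frame indexes
together" problem, sketched under the assumption of a reasonably recent
MachineFrameInfo API (not code from any in-tree backend), is to replace the
two 4-byte stack objects with a single 8-byte, dword-aligned one and rewrite
the frame-index operands that referred to them:

  // Sketch: fold two 4-byte stack objects (FIa = low word, FIb = high word)
  // into one 8-byte, dword-aligned object so a paired load/store can
  // address both words through a single frame index.
  #include "llvm/CodeGen/MachineFrameInfo.h"
  #include "llvm/CodeGen/MachineFunction.h"
  #include "llvm/Support/Alignment.h"

  using namespace llvm;

  static int mergeFrameObjects(MachineFunction &MF, int FIa, int FIb) {
    MachineFrameInfo &MFI = MF.getFrameInfo();

    // One 8-byte slot with 8-byte alignment guarantees the two words are
    // adjacent and the pair starts on a dword boundary.
    int PairFI = MFI.CreateStackObject(/*Size=*/8, Align(8),
                                       /*isSpillSlot=*/false);

    // Redirect every reference to the old frame indexes to the new one.
    for (MachineBasicBlock &MBB : MF)
      for (MachineInstr &MI : MBB)
        for (MachineOperand &MO : MI.operands())
          if (MO.isFI() && (MO.getIndex() == FIa || MO.getIndex() == FIb)) {
            // NOTE: references to FIb now address the upper word of the
            // pair, so the target-specific offset operand of MI has to be
            // biased by +4 here; that part is omitted because it depends
            // on the addressing mode of each instruction.
            MO.setIndex(PairFI);
          }

    // The old slots are no longer referenced; mark them dead so frame
    // layout doesn't allocate space for them.
    MFI.RemoveStackObject(FIa);
    MFI.RemoveStackObject(FIb);
    return PairFI;
  }

The awkward part is exactly the one described above: every reference that
used to point at the upper slot needs its offset adjusted, and that is
target- and addressing-mode-specific.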