Jonathan Smith via llvm-dev
2020-Oct-26 22:51 UTC
[llvm-dev] Possible bug in x86 frame lowering with SSE instructions?
Hello, everyone.

I'm looking for some insight into a bug I encountered while testing some custom IR passes on Solaris (x86) and Linux. I don't know if it's a bug with the x86 backend or the way the frame is set up by Solaris -- or if I'm simply doing something I shouldn't be doing. The bug manifests even if I don't run any of my passes, so I'm certain those aren't the issue.

Given the following test C code:

    int main(int argc, char **argv) {
        int x[10] = {1,2,3};
        return 0;
    }

I compile it to IR with the following arguments:

    clang --target=i386-sun-solaris -S -emit-llvm -Xclang -disable-O0-optnone -x c -c array-test.c -o array-test.ll

This yields the following IR:

    target datalayout = "e-m:e-p:32:32-p270:32:32-p271:32:32-p272:64:64-f64:32:64-f80:32-n8:16:32-S128"
    target triple = "i386-sun-solaris"

    ; Function Attrs: noinline nounwind
    define dso_local i32 @main(i32 %0, i8** %1) #0 {
      %3 = alloca i32, align 4
      %4 = alloca i32, align 4
      %5 = alloca i8**, align 4
      %6 = alloca [10 x i32], align 4
      store i32 0, i32* %3, align 4
      store i32 %0, i32* %4, align 4
      store i8** %1, i8*** %5, align 4
      %7 = bitcast [10 x i32]* %6 to i8*
      call void @llvm.memset.p0i8.i32(i8* align 4 %7, i8 0, i32 40, i1 false)
      %8 = bitcast i8* %7 to [10 x i32]*
      %9 = getelementptr inbounds [10 x i32], [10 x i32]* %8, i32 0, i32 0
      store i32 1, i32* %9, align 4
      %10 = getelementptr inbounds [10 x i32], [10 x i32]* %8, i32 0, i32 1
      store i32 2, i32* %10, align 4
      %11 = getelementptr inbounds [10 x i32], [10 x i32]* %8, i32 0, i32 2
      store i32 3, i32* %11, align 4
      ret i32 0
    }

    ; Function Attrs: argmemonly nounwind willreturn writeonly
    declare void @llvm.memset.p0i8.i32(i8* nocapture writeonly, i8, i32, i1 immarg) #1

    attributes #0 = { noinline nounwind
      "correctly-rounded-divide-sqrt-fp-math"="false"
      "disable-tail-calls"="false" "frame-pointer"="all"
      "less-precise-fpmad"="false" "min-legal-vector-width"="0"
      "no-infs-fp-math"="false" "no-jump-tables"="false"
      "no-nans-fp-math"="false" "no-signed-zeros-fp-math"="false"
      "no-trapping-math"="true" "stack-protector-buffer-size"="8"
      "target-cpu"="pentium4"
      "target-features"="+cx8,+fxsr,+mmx,+sse,+sse2,+x87"
      "unsafe-fp-math"="false" "use-soft-float"="false" }
    attributes #1 = { argmemonly nounwind willreturn writeonly }

Normally, I would run custom passes at this point via opt. But the error I'm getting occurs with or without this step.

Without changing anything else, I run this IR through llc with the following arguments:

    llc --x86-asm-syntax=intel --filetype=asm array-test.ll -o=array-test.s

This results in the following assembly:

            .text
            .intel_syntax noprefix
            .file   "/home/user/code/array-test.ll"
            .globl  main                    # -- Begin function main
            .p2align        4, 0x90
            .type   main,@function
    main:                                   # @main
    # %bb.0:
            push    ebp
            mov     ebp, esp
            sub     esp, 56
            mov     dword ptr [ebp - 4], 0
            xorps   xmm0, xmm0
            movaps  xmmword ptr [ebp - 56], xmm0
            movaps  xmmword ptr [ebp - 40], xmm0
            mov     dword ptr [ebp - 20], 0
            mov     dword ptr [ebp - 24], 0
            mov     dword ptr [ebp - 56], 1
            mov     dword ptr [ebp - 52], 2
            mov     dword ptr [ebp - 48], 3
            xor     eax, eax
            add     esp, 56
            pop     ebp
            ret
    .Lfunc_end0:
            .size   main, .Lfunc_end0-main
                                            # -- End function
            .ident  "clang version 12.0.0 (https://github.com/llvm/llvm-project.git 62dbbcf6d7c67b02fd540a5a1e55c494bf88adea)"
            .section        ".note.GNU-stack","",@progbits

Other than the target being i386-sun-solaris, this is the exact same code that is generated if I target i386-pc-linux-gnu.

If I run this on Linux (Ubuntu 18.04 in this case), there are no problems. If I run this on Solaris, however, a segfault occurs on the first `movaps` instruction. I believe the issue is because the stack is 4-byte aligned on Solaris whereas it's 8-byte aligned on Linux, so the 56- and 40-byte offsets for the array stores just happen to work on Linux -- while they end up being 8 bytes off on Solaris.

Running llc with --stackrealign fixes the problem:

    main:                                   # @main
    # %bb.0:
            push    ebp
            mov     ebp, esp
            and     esp, -16
            sub     esp, 64
            mov     dword ptr [esp + 12], 0
            xorps   xmm0, xmm0
            movaps  xmmword ptr [esp + 16], xmm0
            movaps  xmmword ptr [esp + 32], xmm0
            mov     dword ptr [esp + 52], 0
            mov     dword ptr [esp + 48], 0
            mov     dword ptr [esp + 16], 1
            mov     dword ptr [esp + 20], 2
            mov     dword ptr [esp + 24], 3
            xor     eax, eax
            mov     esp, ebp
            pop     ebp
            ret

Running clang with -fomit-frame-pointer also fixes the problem, but I have no idea why. Adding --stack-alignment=16 does *not* fix the problem. If I explicitly add the -O0 flag to llc, the `X86TargetLowering::getOptimalMemOpType()` function doesn't lower the array stores to `movaps`:

    main:                                   # @main
    # %bb.0:
            push    ebp
            mov     ebp, esp
            push    esi
            sub     esp, 68
            mov     eax, dword ptr [ebp + 12]
            mov     ecx, dword ptr [ebp + 8]
            xor     edx, edx
            mov     dword ptr [ebp - 8], 0
            lea     esi, [ebp - 48]
            mov     dword ptr [esp], esi
            mov     dword ptr [esp + 4], 0
            mov     dword ptr [esp + 8], 40
            mov     dword ptr [ebp - 52], eax       # 4-byte Spill
            mov     dword ptr [ebp - 56], ecx       # 4-byte Spill
            mov     dword ptr [ebp - 60], edx       # 4-byte Spill
            call    memset
            mov     dword ptr [ebp - 48], 1
            mov     dword ptr [ebp - 44], 2
            mov     dword ptr [ebp - 40], 3
            mov     eax, dword ptr [ebp - 60]       # 4-byte Reload
            add     esp, 68
            pop     esi
            pop     ebp
            ret

I've spent the better part of ten hours trying to debug the X86 backend code (and I am, admittedly, not the best at knowing where to look). I determined the `X86FrameLowering::emitPrologue()` function will *only* emit the proper offset adjustment if `X86RegisterInfo::needsStackRealignment()` returns `true`, and the only thing that seems to force it to return `true` is if --stackrealign is used (which sets the "stackrealign" function attribute on `main`).

I don't know if this is truly a bug in the X86 backend (an assumption about the ABI on Linux vs. Solaris? Maybe? I'm truly guessing...) or if this is a result of me using -disable-O0-optnone in Clang without -O0 in llc.

Any insight would be helpful, and thanks for reading my rather verbose message.
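[Editor's note] The --stackrealign listing in the message above works because `and esp, -16` rounds the stack pointer down to a multiple of 16 before the frame is allocated, so the `[esp + 16]` and `[esp + 32]` slots are aligned no matter how the caller left the stack. A minimal C sketch of that address arithmetic follows. The function names are made up for illustration; nothing here is LLVM code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Models the --stackrealign prologue quoted in the message:
 *   push ebp ; mov ebp, esp ; and esp, -16 ; sub esp, 64
 * and returns the address of [esp + offset].
 * entry_esp is a hypothetical value of ESP at function entry. */
static uint32_t realigned_slot(uint32_t entry_esp, uint32_t offset) {
    uint32_t ebp = entry_esp - 4;        /* push ebp ; mov ebp, esp */
    uint32_t esp = ebp & ~UINT32_C(15);  /* and esp, -16            */
    esp -= 64;                           /* sub esp, 64             */
    return esp + offset;
}

/* movaps requires its memory operand to be a multiple of 16. */
static bool is_16_byte_aligned(uint32_t addr) {
    return (addr & 15u) == 0;
}

/* Checks both movaps targets, [esp + 16] and [esp + 32]. */
static bool realigned_stores_ok(uint32_t entry_esp) {
    return is_16_byte_aligned(realigned_slot(entry_esp, 16)) &&
           is_16_byte_aligned(realigned_slot(entry_esp, 32));
}
```

Calling `realigned_stores_ok` with any entry value, whether it is 4-, 8-, or 16-byte aligned, yields `true`, which is consistent with the observation that --stackrealign avoids the fault on both systems.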
Wang, Pengfei via llvm-dev
2020-Oct-27 06:21 UTC
[llvm-dev] Possible bug in x86 frame lowering with SSE instructions?
Hi Jonathan,

It seems the trunk code solves this problem: https://godbolt.org/z/Y1Wdbj

I took a look at the x86 ABI: https://gitlab.com/x86-psABIs/i386-ABI/-/tree/hjl/x86/1.1#
It says "In other words, the value (%esp + 4) is always a multiple of 16 (32 or 64) when control is transferred to the function entry point."
So if the OS follows the ABI, the value of ESP should always be 0xXXXXXXXC on entry to a function, and it becomes 0xXXXXXXX8 after "push ebp", which happens to be 8-byte aligned.

Thanks
Pengfei
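[Editor's note] The ABI arithmetic above can be checked against the failing prologue from the original report (`push ebp ; mov ebp, esp ; sub esp, 56`, with `movaps` stores to `[ebp - 56]` and `[ebp - 40]`). The C sketch below only models the addresses. The entry values used with it are hypothetical, and the thread does not establish what alignment Solaris actually provides at entry to `main`.

```c
#include <stdbool.h>
#include <stdint.h>

/* push ebp ; mov ebp, esp  =>  ebp = entry_esp - 4.
 * entry_esp is a hypothetical value of ESP at function entry. */
static uint32_t frame_pointer(uint32_t entry_esp) {
    return entry_esp - 4;
}

/* True when both movaps targets from the non-realigned listing,
 * [ebp - 56] and [ebp - 40], are multiples of 16. */
static bool movaps_targets_aligned(uint32_t entry_esp) {
    uint32_t ebp = frame_pointer(entry_esp);
    return ((ebp - 56) % 16 == 0) && ((ebp - 40) % 16 == 0);
}
```

With an ABI-conformant entry value ending in 0xC, EBP ends in 0x8, and since 56 and 40 are both congruent to 8 modulo 16, both store addresses come out as multiples of 16. For an entry value that is only 4-byte aligned (for example one ending in 0x4, 0x8, or 0x0), the function returns `false`, which matches a fault on the first `movaps`.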
Jonathan Smith via llvm-dev
2020-Oct-27 09:52 UTC
[llvm-dev] Possible bug in x86 frame lowering with SSE instructions?
Interesting. Thank you.

I'm still curious to know what commit fixed this problem, although it sounds like it's also a problem with how Solaris is implementing the ABI. I suppose it's time for me to go hunting through commits.