Jack Howarth
2015-Feb-14 17:11 UTC
[LLVMdev] trunk's optimizer generates slower code than 3.5
Using the SciMark 2.0 code from http://math.nist.gov/scimark2/scimark2_1c.zip compiled with the same...

make CFLAGS="-O3 -march=native"

I am able to reproduce the 22% performance regression in the run time of the Sparse matmult benchmark. For 10 runs of the scimark2 benchmark, I get 998.439+/-0.4828 with the release llvm clang 3.5.1 compiler and 1217.363+/-1.1004 for the current clang 3.6svn from the 3.6 branch. Not good.
Jack

On Sat, Feb 14, 2015 at 11:19 AM, Jack Howarth <howarth.mailing.lists at gmail.com> wrote:
> Do any of the build-bots routinely run the SciMark v2.0 benchmark?
> If so, might not an examination of those logs reveal the commit range
> at which the optimizations in that benchmark degraded?
> Jack
>
> On Sat, Feb 14, 2015 at 11:13 AM, Jack Howarth
> <howarth.mailing.lists at gmail.com> wrote:
>> The regressions in the performance of generated code, introduced
>> by the llvm 3.6 release, don't seem to be limited to this "8 queens
>> puzzle" solver test case. See...
>>
>> http://www.phoronix.com/scan.php?page=article&item=llvm-clang-3.5-3.6-rc1&num=1
>>
>> where a big hit in the performance of the Sparse Matrix Multiply test
>> of the SciMark v2.0 benchmark was observed as well as others.
>> Do you really want to release 3.6 with this level of performance regression?
>> Jack
>>
>> On Fri, Feb 13, 2015 at 2:47 PM, Jack Howarth
>> <howarth.mailing.lists at gmail.com> wrote:
>>> Also confirmed with the llvm 3.5.1 release and the llvm 3.6 release
>>> branch on x86_64-apple-darwin14...
>>>
>>> % clang-3.5 -O3 -mssse3 -fomit-frame-pointer -fno-stack-protector
>>> -fno-exceptions -o 8 8.c
>>> % time ./8 9
>>> 352 solutions
>>> 3.603u 0.002s 0:03.60 100.0% 0+0k 0+0io 2pf+0w
>>> % time ./8 10
>>> 724 solutions
>>> 104.217u 0.059s 1:44.30 99.9% 0+0k 0+0io 2pf+0w
>>>
>>> % clang-3.6 -O3 -mssse3 -fomit-frame-pointer -fno-stack-protector
>>> -fno-exceptions -o 8 8.c
>>> % time ./8 9
>>> 352 solutions
>>> 4.050u 0.001s 0:04.05 100.0% 0+0k 0+0io 2pf+0w
>>> % time ./8 10
>>> 724 solutions
>>> 114.808u 0.041s 1:54.86 99.9% 0+0k 0+0io 2pf+0w
>>>
>>> On Fri, Feb 13, 2015 at 3:37 AM, 191919 <191919 at gmail.com> wrote:
>>>> I submitted the problem report to clang's bugzilla but no one seems to
>>>> care so I have to send it to the mailing list.
>>>>
>>>> clang 3.7 svn (trunk 229055 at the time of this report)
>>>> generates slower code than 3.5 (Apple LLVM version 6.0
>>>> (clang-600.0.56) (based on LLVM 3.5svn)) for the following code.
>>>>
>>>> It is an "8 queens puzzle" solver written as an educational example.
>>>> Compiled by either clang 3.5 or 3.7, it gives the correct answer, but
>>>> clang 3.5 generates code which runs 20% faster than 3.6/3.7.
>>>>
>>>> ##########################################
>>>> # clang 3.5 which comes with Xcode 6.1.1
>>>> ##########################################
>>>> $ clang -O3 -mssse3 -fomit-frame-pointer -fno-stack-protector
>>>> -fno-exceptions -o 8 8.c
>>>> $ time ./8 9 # 9 queens
>>>> 352 solutions
>>>> ./8 9 1.63s user 0.00s system 99% cpu 1.632 total
>>>> $ time ./8 10 # 10 queens
>>>> 724 solutions
>>>> ./8 10 45.11s user 0.01s system 99% cpu 45.121 total
>>>>
>>>> ##########################################
>>>> # clang 3.7 svn trunk
>>>> ##########################################
>>>> $ /opt/bin/clang -O3 -mssse3 -fomit-frame-pointer -fno-stack-protector
>>>> -fno-exceptions -o 8 8.c
>>>> $ time ./8 9 # 9 queens
>>>> 352 solutions
>>>> ./8 9 2.07s user 0.00s system 99% cpu 2.078 total
>>>> $ time ./8 10 # 10 queens
>>>> 724 solutions
>>>> ./8 10 56.63s user 0.02s system 99% cpu 56.650 total
>>>>
>>>> The source code is below. I also attached the executable files as well
>>>> as the assembly code files for clang 3.5 and 3.6, produced by IDA.
>>>>
>>>> The performance is even worse when compiling as 32-bit code, while
>>>> gcc-4.9.2 is not affected.
>>>>
>>>> ########## clang-3.5
>>>> $ clang -m32 -O3 -fomit-frame-pointer -fno-stack-protector
>>>> -fno-exceptions -o 8 8.c
>>>> $ time ./8 9
>>>> 352 solutions
>>>> ./8 9 1.95s user 0.00s system 99% cpu 1.950 total
>>>>
>>>> ########## clang-3.7
>>>> $ /opt/bin/clang -m32 -O3 -fomit-frame-pointer -fno-stack-protector
>>>> -fno-exceptions -o 8 8.c
>>>> $ time ./8 9
>>>> 352 solutions
>>>> ./8 9 2.48s user 0.00s system 99% cpu 2.480 total
>>>>
>>>> ######### gcc-4.9.2
>>>> $ /opt/bin/gcc -m32 -O3 -fomit-frame-pointer -fno-stack-protector
>>>> -fno-exceptions -o 8 8.c
>>>> $ time ./8 9
>>>> 352 solutions
>>>> ./8 9 1.44s user 0.00s system 99% cpu 1.442 total
>>>>
>>>>
>>>> ```
>>>> #include <stdio.h>
>>>> #include <stdlib.h>
>>>>
>>>> static inline int validate(int* a, int d)
>>>> {
>>>>     int i, j, x;
>>>>     for (i = 0; i < d; ++i)
>>>>     {
>>>>         for (j = i+1, x = 1; j < d; ++j, ++x)
>>>>         {
>>>>             const int d = a[i] - a[j];
>>>>             if (d == 0 || d == -x || d == x) return 0;
>>>>         }
>>>>     }
>>>>     return 1;
>>>> }
>>>>
>>>> static inline int solve(int d)
>>>> {
>>>>     int r = 0;
>>>>     int* a = (int*) calloc(sizeof(int), d+1);
>>>>     int p = d - 1;
>>>>
>>>>     for (;;)
>>>>     {
>>>>         a[p]++;
>>>>
>>>>         if (a[p] > d-1)
>>>>         {
>>>>             int bp = p - 1;
>>>>             while (bp >= 0)
>>>>             {
>>>>                 a[bp]++;
>>>>                 if (a[bp] <= d-1) break;
>>>>                 a[bp] = 0;
>>>>                 --bp;
>>>>             }
>>>>             if (bp < 0)
>>>>                 break;
>>>>             a[p] = 0;
>>>>         }
>>>>         if (validate(a, d))
>>>>         {
>>>>             ++r;
>>>>         }
>>>>     }
>>>>
>>>>     free(a);
>>>>     return r;
>>>> }
>>>>
>>>> int main(int argc, char** argv)
>>>> {
>>>>     if (argc != 2) return -1;
>>>>     int r = solve((int) strtol(argv[1], NULL, 10));
>>>>     printf("%d solutions\n", r);
>>>> }
>>>> ```
>>>>
>>>> clang 3.5's result:
>>>>
>>>> ```
>>>> public _main
>>>> _main proc near
>>>>
>>>> var_48 = qword ptr -48h
>>>> var_40 = qword ptr -40h
>>>> var_34 = dword ptr -34h
>>>>
>>>>     push rbp
>>>>     push r15
>>>>     push r14
>>>>     push r13
>>>>     push r12
>>>>     push rbx
>>>>     sub rsp, 18h
>>>>     mov ebx, 0FFFFFFFFh
>>>>     cmp edi, 2
>>>> jnz loc_100000F29 >>>> mov rdi, [rsi+8] ; char * >>>> xor r14d, r14d >>>> xor esi, esi ; char ** >>>> mov edx, 0Ah ; int >>>> call _strtol >>>> mov r15, rax >>>> shl rax, 20h >>>> mov rsi, offset __mh_execute_header >>>> add rsi, rax >>>> sar rsi, 20h ; size_t >>>> mov edi, 4 ; size_t >>>> call _calloc >>>> lea edx, [r15-1] >>>> movsxd r8, edx >>>> mov ecx, r15d >>>> add ecx, 0FFFFFFFEh >>>> js loc_100000DFA >>>> test r15d, r15d >>>> mov r11d, [rax+r8*4] >>>> jle loc_100000EAE >>>> mov ecx, r15d >>>> add ecx, 0FFFFFFFEh >>>> mov [rsp+48h+var_34], ecx >>>> movsxd rcx, ecx >>>> lea rcx, [rax+rcx*4] >>>> mov [rsp+48h+var_40], rcx >>>> lea rcx, [rax+4] >>>> mov [rsp+48h+var_48], rcx >>>> xor r14d, r14d >>>> jmp short loc_100000D33 >>>> ; --------------------------------------------------------------------------- >>>> align 10h >>>> >>>> loc_100000D30: ; CODE XREF: _main+129 j >>>> ; _main+131 j ... >>>> add r14d, ebx >>>> >>>> loc_100000D33: ; CODE XREF: _main+92 j >>>> cmp r11d, edx >>>> lea edi, [r11+1] >>>> mov [rax+r8*4], edi >>>> mov rcx, [rsp+48h+var_40] >>>> mov esi, [rsp+48h+var_34] >>>> mov r11d, edi >>>> jl short loc_100000D84 >>>> nop dword ptr [rax+00h] >>>> >>>> loc_100000D50: ; CODE XREF: _main+DA j >>>> mov edi, [rcx] >>>> lea ebp, [rdi+1] >>>> mov [rcx], ebp >>>> cmp edi, edx >>>> jl short loc_100000D71 >>>> mov dword ptr [rcx], 0 >>>> add rcx, 0FFFFFFFFFFFFFFFCh >>>> test esi, esi >>>> lea esi, [rsi-1] >>>> jg short loc_100000D50 >>>> jmp loc_100000F0E >>>> ; --------------------------------------------------------------------------- >>>> >>>> loc_100000D71: ; CODE XREF: _main+C9 j >>>> test esi, esi >>>> js loc_100000F0E >>>> mov dword ptr [rax+r8*4], 0 >>>> xor r11d, r11d >>>> >>>> loc_100000D84: ; CODE XREF: _main+BA j >>>> cmp r15d, 1 >>>> mov esi, 0 >>>> mov r9, [rsp+48h+var_48] >>>> mov r12d, 1 >>>> jle short loc_100000DF0 >>>> >>>> loc_100000D99: ; CODE XREF: _main+15E j >>>> mov r10d, [rax+rsi*4] >>>> mov ecx, 0FFFFFFFFh >>>> mov edi, 1 
>>>> mov r13, r9 >>>> nop word ptr [rax+rax+00h] >>>> >>>> loc_100000DB0: ; CODE XREF: _main+14F j >>>> xor ebx, ebx >>>> mov ebp, r10d >>>> sub ebp, [r13+0] >>>> jz loc_100000D30 >>>> cmp ecx, ebp >>>> jz loc_100000D30 >>>> cmp edi, ebp >>>> jz loc_100000D30 >>>> add r13, 4 >>>> inc rdi >>>> dec ecx >>>> mov ebx, edi >>>> add ebx, esi >>>> cmp ebx, r15d >>>> jl short loc_100000DB0 >>>> inc r12 >>>> add r9, 4 >>>> inc rsi >>>> cmp r12d, r15d >>>> jl short loc_100000D99 >>>> >>>> loc_100000DF0: ; CODE XREF: _main+107 j >>>> mov ebx, 1 >>>> jmp loc_100000D30 >>>> ; --------------------------------------------------------------------------- >>>> >>>> loc_100000DFA: ; CODE XREF: _main+5E j >>>> mov ecx, [rax+r8*4] >>>> lea r9d, [rcx+1] >>>> mov [rax+r8*4], r9d >>>> cmp ecx, r8d >>>> jge loc_100000F0E >>>> lea r12, [rax+4] >>>> xor r14d, r14d >>>> db 2Eh >>>> nop word ptr [rax+rax+00000000h] >>>> >>>> loc_100000E20: ; CODE XREF: _main+216 j >>>> test r15d, r15d >>>> setle cl >>>> cmp r15d, 2 >>>> jl short loc_100000E90 >>>> test cl, cl >>>> mov r13d, 0 >>>> mov r11, r12 >>>> mov r10d, 1 >>>> jnz short loc_100000E90 >>>> >>>> loc_100000E3F: ; CODE XREF: _main+1F0 j >>>> mov edi, [rax+r13*4] >>>> mov edx, 0FFFFFFFFh >>>> mov ecx, 1 >>>> mov rsi, r11 >>>> >>>> loc_100000E50: ; CODE XREF: _main+1E1 j >>>> xor ebx, ebx >>>> mov ebp, edi >>>> sub ebp, [rsi] >>>> jz short loc_100000E95 >>>> cmp edx, ebp >>>> jz short loc_100000E95 >>>> cmp ecx, ebp >>>> jz short loc_100000E95 >>>> add rsi, 4 >>>> inc rcx >>>> dec edx >>>> mov ebx, ecx >>>> add ebx, r13d >>>> cmp ebx, r15d >>>> jl short loc_100000E50 >>>> inc r10 >>>> add r11, 4 >>>> inc r13 >>>> cmp r10d, r15d >>>> jl short loc_100000E3F >>>> db 66h, 66h, 66h, 66h, 2Eh >>>> nop word ptr [rax+rax+00000000h] >>>> >>>> loc_100000E90: ; CODE XREF: _main+19A j >>>> ; _main+1AD j >>>> mov ebx, 1 >>>> >>>> loc_100000E95: ; CODE XREF: _main+1C6 j >>>> ; _main+1CA j ... 
>>>> add r14d, ebx >>>> cmp r9d, r8d >>>> lea ecx, [r9+1] >>>> mov [rax+r8*4], ecx >>>> mov r9d, ecx >>>> jl loc_100000E20 >>>> jmp short loc_100000F0E >>>> ; --------------------------------------------------------------------------- >>>> >>>> loc_100000EAE: ; CODE XREF: _main+6B j >>>> add r15d, 0FFFFFFFEh >>>> movsxd rcx, r15d >>>> lea rcx, [rax+rcx*4] >>>> xor r14d, r14d >>>> jmp short loc_100000EC6 >>>> ; --------------------------------------------------------------------------- >>>> align 20h >>>> >>>> loc_100000EC0: ; CODE XREF: _main+247 j >>>> ; _main+27C j >>>> inc r14d >>>> mov r11d, ebp >>>> >>>> loc_100000EC6: ; CODE XREF: _main+22C j >>>> lea ebp, [r11+1] >>>> mov [rax+r8*4], ebp >>>> cmp r11d, r8d >>>> mov rsi, rcx >>>> mov edi, r15d >>>> jl short loc_100000EC0 >>>> nop dword ptr [rax+00000000h] >>>> >>>> loc_100000EE0: ; CODE XREF: _main+26A j >>>> mov ebp, [rsi] >>>> lea ebx, [rbp+1] >>>> mov [rsi], ebx >>>> cmp ebp, edx >>>> jl short loc_100000EFE >>>> mov dword ptr [rsi], 0 >>>> add rsi, 0FFFFFFFFFFFFFFFCh >>>> test edi, edi >>>> lea edi, [rdi-1] >>>> jg short loc_100000EE0 >>>> jmp short loc_100000F0E >>>> ; --------------------------------------------------------------------------- >>>> >>>> loc_100000EFE: ; CODE XREF: _main+259 j >>>> test edi, edi >>>> js short loc_100000F0E >>>> mov dword ptr [rax+r8*4], 0 >>>> xor ebp, ebp >>>> jmp short loc_100000EC0 >>>> ; --------------------------------------------------------------------------- >>>> >>>> loc_100000F0E: ; CODE XREF: _main+DC j >>>> ; _main+E3 j ... 
>>>> mov rdi, rax ; void * >>>> call _free >>>> lea rdi, aDSolutions ; "%d solutions\n" >>>> xor ebx, ebx >>>> xor eax, eax >>>> mov esi, r14d >>>> call _printf >>>> >>>> loc_100000F29: ; CODE XREF: _main+16 j >>>> mov eax, ebx >>>> add rsp, 18h >>>> pop rbx >>>> pop r12 >>>> pop r13 >>>> pop r14 >>>> pop r15 >>>> pop rbp >>>> retn >>>> _main endp >>>> ``` >>>> >>>> clang 3.6's result: >>>> >>>> ``` >>>> public _main >>>> _main proc near >>>> >>>> var_60 = qword ptr -60h >>>> var_58 = qword ptr -58h >>>> var_50 = qword ptr -50h >>>> var_48 = qword ptr -48h >>>> var_40 = qword ptr -40h >>>> var_38 = qword ptr -38h >>>> >>>> push rbp >>>> push r15 >>>> push r14 >>>> push r13 >>>> push r12 >>>> push rbx >>>> sub rsp, 38h >>>> mov ebx, 0FFFFFFFFh >>>> cmp edi, 2 >>>> jnz loc_100000F23 >>>> mov rbx, offset __mh_execute_header >>>> mov rdi, [rsi+8] ; char * >>>> xor r13d, r13d >>>> xor esi, esi ; char ** >>>> mov edx, 0Ah ; int >>>> call _strtol >>>> mov r14, rax >>>> shl rax, 20h >>>> mov [rsp+68h+var_38], rax >>>> lea rsi, [rax+rbx] >>>> sar rsi, 20h ; size_t >>>> mov edi, 4 ; size_t >>>> call _calloc >>>> lea r11d, [r14-1] >>>> movsxd r12, r11d >>>> mov [rsp+68h+var_40], r12 >>>> movsxd rcx, r14d >>>> mov [rsp+68h+var_50], rcx >>>> add ecx, 0FFFFFFFEh >>>> js loc_100000E1A >>>> mov ecx, r14d >>>> add ecx, 0FFFFFFFEh >>>> movsxd rcx, ecx >>>> inc rcx >>>> mov [rsp+68h+var_58], rcx >>>> mov rcx, rax >>>> add rcx, 4 >>>> mov [rsp+68h+var_60], rcx >>>> xor ebp, ebp >>>> jmp short loc_100000D17 >>>> ; --------------------------------------------------------------------------- >>>> align 10h >>>> >>>> loc_100000D10: ; CODE XREF: _main+15B j >>>> ; _main+163 j ... 
>>>> mov rbp, [rsp+68h+var_48] >>>> add ebp, edi >>>> >>>> loc_100000D17: ; CODE XREF: _main+93 j >>>> cmp r13d, r11d >>>> lea edx, [r13+1] >>>> mov [rax+r12*4], edx >>>> mov rcx, [rsp+68h+var_58] >>>> mov r13d, edx >>>> jl short loc_100000D6B >>>> nop dword ptr [rax+00h] >>>> >>>> loc_100000D30: ; CODE XREF: _main+DE j >>>> mov edx, [rax+rcx*4-4] >>>> lea esi, [rdx+1] >>>> mov [rax+rcx*4-4], esi >>>> cmp edx, r11d >>>> jl short loc_100000D60 >>>> mov dword ptr [rax+rcx*4-4], 0 >>>> dec rcx >>>> test rcx, rcx >>>> jg short loc_100000D30 >>>> jmp loc_100000F09 >>>> ; --------------------------------------------------------------------------- >>>> align 20h >>>> >>>> loc_100000D60: ; CODE XREF: _main+CE j >>>> mov dword ptr [rax+r12*4], 0 >>>> xor r13d, r13d >>>> >>>> loc_100000D6B: ; CODE XREF: _main+BA j >>>> mov [rsp+68h+var_48], rbp >>>> test r14d, r14d >>>> setle cl >>>> mov rdx, offset __mh_execute_header >>>> lea rdx, [rdx+1] >>>> cmp [rsp+68h+var_38], rdx >>>> jl loc_100000E10 >>>> test cl, cl >>>> mov edx, 0 >>>> mov r10, [rsp+68h+var_60] >>>> mov r9d, 1 >>>> jnz short loc_100000E10 >>>> >>>> loc_100000DA3: ; CODE XREF: _main+195 j >>>> mov esi, [rax+rdx*4] >>>> mov r15d, 0FFFFFFFFh >>>> mov r8d, 1 >>>> mov rcx, r10 >>>> db 66h, 66h, 2Eh >>>> nop dword ptr [rax+rax+00000000h] >>>> >>>> loc_100000DC0: ; CODE XREF: _main+184 j >>>> mov ebx, [rcx] >>>> mov ebp, esi >>>> sub ebp, ebx >>>> xor edi, edi >>>> cmp r8d, ebp >>>> jz loc_100000D10 >>>> cmp esi, ebx >>>> jz loc_100000D10 >>>> cmp r15d, ebp >>>> jz loc_100000D10 >>>> add rcx, 4 >>>> inc r8 >>>> dec r15d >>>> mov edi, r8d >>>> add edi, edx >>>> cmp edi, r14d >>>> jl short loc_100000DC0 >>>> inc r9 >>>> add r10, 4 >>>> inc rdx >>>> cmp r9, [rsp+68h+var_50] >>>> jl short loc_100000DA3 >>>> nop word ptr [rax+rax+00000000h] >>>> >>>> loc_100000E10: ; CODE XREF: _main+119 j >>>> ; _main+131 j >>>> mov edi, 1 >>>> jmp loc_100000D10 >>>> ; 
--------------------------------------------------------------------------- >>>> >>>> loc_100000E1A: ; CODE XREF: _main+6E j >>>> test r14d, r14d >>>> jle loc_100000F00 >>>> mov dword ptr [rax+r12*4], 1 >>>> xor ebp, ebp >>>> cmp r14d, 2 >>>> jl loc_100000F09 >>>> mov rcx, rax >>>> add rcx, 4 >>>> mov [rsp+68h+var_48], rcx >>>> xor ebp, ebp >>>> mov r15d, 1 >>>> nop dword ptr [rax+rax+00h] >>>> >>>> loc_100000E50: ; CODE XREF: _main+288 j >>>> mov rbx, rbp >>>> mov rcx, offset __mh_execute_header >>>> cmp [rsp+68h+var_38], rcx >>>> mov edx, 0 >>>> mov r13, [rsp+68h+var_48] >>>> mov r8d, 1 >>>> mov r9d, 1 >>>> jle short loc_100000EE0 >>>> >>>> loc_100000E7A: ; CODE XREF: _main+25A j >>>> mov r12d, [rax+rdx*4] >>>> mov edi, 0FFFFFFFFh >>>> mov ecx, 1 >>>> mov rsi, r13 >>>> nop dword ptr [rax+rax+00h] >>>> >>>> loc_100000E90: ; CODE XREF: _main+249 j >>>> mov r10d, [rsi] >>>> mov ebp, r12d >>>> sub ebp, r10d >>>> xor r9d, r9d >>>> cmp ecx, ebp >>>> jz short loc_100000EE0 >>>> cmp r12d, r10d >>>> jz short loc_100000EE0 >>>> cmp edi, ebp >>>> jz short loc_100000EE0 >>>> add rsi, 4 >>>> inc rcx >>>> dec edi >>>> mov ebp, ecx >>>> add ebp, edx >>>> cmp ebp, r14d >>>> jl short loc_100000E90 >>>> inc r8 >>>> add r13, 4 >>>> inc rdx >>>> cmp r8, [rsp+68h+var_50] >>>> jl short loc_100000E7A >>>> mov r9d, 1 >>>> db 66h, 66h, 66h, 66h, 2Eh >>>> nop word ptr [rax+rax+00000000h] >>>> >>>> loc_100000EE0: ; CODE XREF: _main+208 j >>>> ; _main+22E j ... >>>> mov rbp, rbx >>>> add ebp, r9d >>>> cmp r15d, r11d >>>> lea ecx, [r15+1] >>>> mov rdx, [rsp+68h+var_40] >>>> mov [rax+rdx*4], ecx >>>> mov r15d, ecx >>>> jl loc_100000E50 >>>> jmp short loc_100000F09 >>>> ; --------------------------------------------------------------------------- >>>> >>>> loc_100000F00: ; CODE XREF: _main+1AD j >>>> xor ebp, ebp >>>> test r11d, r11d >>>> cmovns ebp, r11d >>>> >>>> loc_100000F09: ; CODE XREF: _main+E0 j >>>> ; _main+1C1 j ... 
>>>> mov rdi, rax ; void * >>>> call _free >>>> lea rdi, aDSolutions ; "%d solutions\n" >>>> xor ebx, ebx >>>> xor eax, eax >>>> mov esi, ebp >>>> call _printf >>>> >>>> loc_100000F23: ; CODE XREF: _main+16 j >>>> mov eax, ebx >>>> add rsp, 38h >>>> pop rbx >>>> pop r12 >>>> pop r13 >>>> pop r14 >>>> pop r15 >>>> pop rbp >>>> retn >>>> _main endp >>>> ``` >>>> >>>> gcc-4.9.2's result: >>>> ``` >>>> >>>> _main proc near >>>> >>>> var_48 = qword ptr -48h >>>> var_40 = dword ptr -40h >>>> var_3C = dword ptr -3Ch >>>> >>>> cmp edi, 2 >>>> jz short loc_100000D69 >>>> or eax, 0FFFFFFFFh >>>> retn >>>> ; --------------------------------------------------------------------------- >>>> >>>> loc_100000D69: ; CODE XREF: _main+3 j >>>> push r15 >>>> mov edx, 0Ah ; int >>>> push r14 >>>> push r13 >>>> push r12 >>>> push rbp >>>> push rbx >>>> sub rsp, 18h >>>> mov rdi, [rsi+8] ; char * >>>> xor esi, esi ; char ** >>>> call _strtol >>>> mov edi, 4 ; size_t >>>> lea esi, [rax+1] >>>> mov r14, rax >>>> mov ebx, eax >>>> lea r15d, [r14-2] >>>> movsxd rsi, esi ; size_t >>>> call _calloc >>>> mov [rsp+48h+var_3C], 0 >>>> mov rdi, rax ; void * >>>> lea eax, [r14-1] >>>> cdqe >>>> lea r13, [rdi+rax*4] >>>> movsxd rax, r15d >>>> mov ebp, [r13+0] >>>> shl rax, 2 >>>> lea r12, [rdi+rax] >>>> lea rax, [rdi+rax-4] >>>> mov [rsp+48h+var_48], rax >>>> mov eax, r14d >>>> lea r14d, [r14+1] >>>> nop word ptr [rax+rax+00h] >>>> nop word ptr [rax+rax+00h] >>>> >>>> loc_100000DE0: ; CODE XREF: _main+12B j >>>> ; _main+155 j ... 
>>>> add ebp, 1 >>>> cmp ebx, ebp >>>> mov [r13+0], ebp >>>> jg short loc_100000E62 >>>> test r15d, r15d >>>> js short loc_100000E33 >>>> mov ecx, [r12] >>>> lea edx, [rcx+1] >>>> cmp ebx, edx >>>> mov [r12], edx >>>> jg short loc_100000E58 >>>> mov r8, r12 >>>> mov rcx, [rsp+48h+var_48] >>>> mov esi, r15d >>>> jmp short loc_100000E24 >>>> ; --------------------------------------------------------------------------- >>>> align 10h >>>> >>>> loc_100000E10: ; CODE XREF: _main+D1 j >>>> mov edx, [rcx] >>>> sub r8, 4 >>>> sub rcx, 4 >>>> add edx, 1 >>>> mov [rcx+4], edx >>>> cmp ebx, edx >>>> jg short loc_100000E58 >>>> >>>> loc_100000E24: ; CODE XREF: _main+A9 j >>>> sub esi, 1 >>>> mov dword ptr [r8], 0 >>>> cmp esi, 0FFFFFFFFh >>>> jnz short loc_100000E10 >>>> >>>> loc_100000E33: ; CODE XREF: _main+8E j >>>> call _free >>>> mov esi, [rsp+48h+var_3C] >>>> add rsp, 18h >>>> xor eax, eax >>>> pop rbx >>>> lea rdi, aDSolutions ; "%d solutions\n" >>>> pop rbp >>>> pop r12 >>>> pop r13 >>>> pop r14 >>>> pop r15 >>>> jmp _printf >>>> ; --------------------------------------------------------------------------- >>>> >>>> loc_100000E58: ; CODE XREF: _main+9D j >>>> ; _main+C2 j >>>> mov dword ptr [r13+0], 0 >>>> xor ebp, ebp >>>> >>>> loc_100000E62: ; CODE XREF: _main+89 j >>>> test ebx, ebx >>>> jle loc_100000EE6 >>>> lea r11, [rdi+8] >>>> xor r10d, r10d >>>> >>>> loc_100000E71: ; CODE XREF: _main+184 j >>>> add r10d, 1 >>>> cmp r10d, eax >>>> jz short loc_100000EE6 >>>> mov r8d, [r11-8] >>>> mov edx, r8d >>>> sub edx, [r11-4] >>>> add edx, 1 >>>> cmp edx, 2 >>>> jbe loc_100000DE0 >>>> mov r9d, r14d >>>> mov rcx, r11 >>>> mov edx, 1 >>>> mov [rsp+48h+var_40], r10d >>>> sub r9d, r10d >>>> jmp short loc_100000ED3 >>>> ; --------------------------------------------------------------------------- >>>> align 10h >>>> >>>> loc_100000EB0: ; CODE XREF: _main+179 j >>>> mov esi, r8d >>>> sub esi, [rcx] >>>> jz loc_100000DE0 >>>> mov r10d, esi >>>> add rcx, 4 >>>> add r10d, edx >>>> 
jz loc_100000DE0 >>>> cmp esi, edx >>>> jz loc_100000DE0 >>>> >>>> loc_100000ED3: ; CODE XREF: _main+144 j >>>> add edx, 1 >>>> cmp edx, r9d >>>> jnz short loc_100000EB0 >>>> mov r10d, [rsp+48h+var_40] >>>> add r11, 4 >>>> jmp short loc_100000E71 >>>> ; --------------------------------------------------------------------------- >>>> >>>> loc_100000EE6: ; CODE XREF: _main+104 j >>>> ; _main+118 j >>>> add [rsp+48h+var_3C], 1 >>>> jmp loc_100000DE0 >>>> _main endp >>>> ``` >>>> >>>> MSVC 10.0's result: >>>> >>>> ``` >>>> >>>> _main proc near ; CODE XREF: ___tmainCRTStartup+106 p >>>> >>>> var_80 = dword ptr -80h >>>> var_7C = dword ptr -7Ch >>>> var_78 = dword ptr -78h >>>> var_74 = dword ptr -74h >>>> var_70 = dword ptr -70h >>>> var_6C = dword ptr -6Ch >>>> var_68 = dword ptr -68h >>>> var_64 = dword ptr -64h >>>> var_60 = dword ptr -60h >>>> var_5C = dword ptr -5Ch >>>> argc = dword ptr 8 >>>> argv = dword ptr 0Ch >>>> envp = dword ptr 10h >>>> >>>> push ebp >>>> mov ebp, esp >>>> and esp, 0FFFFFF80h >>>> push esi >>>> push edi >>>> push ebx >>>> sub esp, 74h >>>> push 3 >>>> call sub_4080F0 >>>> add esp, 4 >>>> stmxcsr [esp+80h+var_80] >>>> or [esp+80h+var_80], 8000h >>>> ldmxcsr [esp+80h+var_80] >>>> cmp [ebp+argc], 2 >>>> jz short loc_40103A >>>> mov eax, 0FFFFFFFFh >>>> add esp, 74h >>>> pop ebx >>>> pop edi >>>> pop esi >>>> mov esp, ebp >>>> pop ebp >>>> retn >>>> ; --------------------------------------------------------------------------- >>>> >>>> loc_40103A: ; CODE XREF: _main+29 j >>>> call ds:GetTickCount >>>> mov esi, eax >>>> mov eax, [ebp+argv] >>>> push dword ptr [eax+4] ; char * >>>> call _atoi >>>> mov edi, eax >>>> lea eax, [edi+1] >>>> push eax ; size_t >>>> push 4 ; size_t >>>> call _calloc >>>> add esp, 0Ch >>>> mov ecx, [eax+edi*4-4] >>>> lea edx, [edi-1] >>>> mov [esp+80h+var_6C], ecx >>>> xor ebx, ebx >>>> mov [esp+80h+var_7C], ebx >>>> lea ecx, [eax+edi*4] >>>> mov [esp+80h+var_74], ecx >>>> lea ecx, [edi-2] >>>> mov [esp+80h+var_70], ecx 
>>>> mov [esp+80h+var_60], edx >>>> mov [esp+80h+var_80], esi >>>> mov ecx, [esp+80h+var_6C] >>>> >>>> loc_401087: ; CODE XREF: _main+142 j >>>> ; _main+193 j >>>> mov edx, [esp+80h+var_60] >>>> inc ecx >>>> mov [eax+edi*4-4], ecx >>>> cmp edi, [eax+edx*4] >>>> jg short loc_4010DC >>>> mov esi, [esp+80h+var_70] >>>> test esi, esi >>>> js short loc_4010CE >>>> xor edx, edx >>>> mov [esp+80h+var_78], eax >>>> xor ebx, ebx >>>> mov eax, [esp+80h+var_74] >>>> >>>> loc_4010A9: ; CODE XREF: _main+C8 j >>>> mov ecx, [eax+ebx*4-8] >>>> inc ecx >>>> cmp ecx, edi >>>> jl loc_40117A >>>> inc edx >>>> lea esi, [ebx+edi-3] >>>> mov dword ptr [eax+ebx*4-8], 0 >>>> dec ebx >>>> cmp edx, [esp+80h+var_60] >>>> jb short loc_4010A9 >>>> mov eax, [esp+80h+var_78] >>>> >>>> loc_4010CE: ; CODE XREF: _main+9B j >>>> ; _main+186 j >>>> test esi, esi >>>> jl short loc_401147 >>>> mov dword ptr [eax+edi*4-4], 0 >>>> xor ecx, ecx >>>> >>>> loc_4010DC: ; CODE XREF: _main+93 j >>>> test edi, edi >>>> jle short loc_40113E >>>> mov [esp+80h+var_6C], ecx >>>> xor edx, edx >>>> mov [esp+80h+var_5C], edi >>>> >>>> loc_4010EA: ; CODE XREF: _main+132 j >>>> lea ecx, [edx+1] >>>> mov ebx, ecx >>>> mov esi, ebx >>>> cmp ecx, [esp+80h+var_5C] >>>> jge short loc_401130 >>>> mov edx, [eax+edx*4] >>>> mov edi, 1 >>>> mov [esp+80h+var_64], esi >>>> mov [esp+80h+var_68], ecx >>>> >>>> loc_401107: ; CODE XREF: _main+122 j >>>> mov esi, [eax+ebx*4] >>>> cmp edx, esi >>>> jz short loc_40118B >>>> sub esi, edx >>>> mov ecx, esi >>>> neg ecx >>>> cmp edi, ecx >>>> jz short loc_40118B >>>> cmp esi, edi >>>> jz short loc_40118B >>>> inc ebx >>>> inc edi >>>> cmp ebx, [esp+80h+var_5C] >>>> jl short loc_401107 >>>> mov ecx, [esp+80h+var_68] >>>> mov esi, [esp+80h+var_64] >>>> cmp ecx, [esp+80h+var_5C] >>>> >>>> loc_401130: ; CODE XREF: _main+F5 j >>>> mov edx, esi >>>> jl short loc_4010EA >>>> xchg ax, ax >>>> mov ecx, [esp+80h+var_6C] >>>> mov edi, [esp+80h+var_5C] >>>> >>>> loc_40113E: ; CODE XREF: _main+DE j >>>> 
inc [esp+80h+var_7C] >>>> jmp loc_401087 >>>> ; --------------------------------------------------------------------------- >>>> >>>> loc_401147: ; CODE XREF: _main+D0 j >>>> mov ebx, [esp+80h+var_7C] >>>> mov esi, [esp+80h+var_80] >>>> push eax ; void * >>>> call _free >>>> add esp, 4 >>>> call ds:GetTickCount >>>> sub eax, esi >>>> push eax >>>> push ebx >>>> push offset aDSolutionsInDM ; "%d solutions in %d msecs.\n" >>>> call _printf >>>> xor eax, eax >>>> add esp, 80h >>>> pop ebx >>>> pop edi >>>> pop esi >>>> mov esp, ebp >>>> pop ebp >>>> retn >>>> ; --------------------------------------------------------------------------- >>>> >>>> loc_40117A: ; CODE XREF: _main+B0 j >>>> mov edx, [esp+80h+var_74] >>>> mov eax, [esp+80h+var_78] >>>> mov [edx+ebx*4-8], ecx >>>> jmp loc_4010CE >>>> ; --------------------------------------------------------------------------- >>>> >>>> loc_40118B: ; CODE XREF: _main+10C j >>>> ; _main+116 j ... >>>> mov ecx, [esp+80h+var_6C] >>>> mov edi, [esp+80h+var_5C] >>>> jmp loc_401087 >>>> _main endp >>>> ``` >>>> _______________________________________________ >>>> LLVM Developers mailing list >>>> LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu >>>> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
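The percentages quoted in this thread mix two conventions: how much longer the newer build runs relative to the older one, and how much run time the older build saves relative to the newer one. The sketch below is only a minimal illustration of that arithmetic, applied to the `user` times reported above. The function names (`slowdown_percent`, `time_saved_percent`, `print_thread_timings`) are made up for this note and are not part of any tool mentioned in the thread.

```c
#include <assert.h>
#include <stdio.h>

/* Percentage by which candidate_s exceeds baseline_s (both in seconds). */
double slowdown_percent(double baseline_s, double candidate_s)
{
    assert(baseline_s > 0.0);
    return (candidate_s / baseline_s - 1.0) * 100.0;
}

/* Percentage of run time the faster build saves relative to the slower one. */
double time_saved_percent(double fast_s, double slow_s)
{
    assert(slow_s > 0.0);
    return (1.0 - fast_s / slow_s) * 100.0;
}

/* Print one old-vs-new comparison row; label is free-form text. */
void print_comparison(const char *label, double old_s, double new_s)
{
    printf("%-32s %8.3f -> %8.3f s  %+.1f%% slower, %.1f%% time saved by old\n",
           label, old_s, new_s,
           slowdown_percent(old_s, new_s),
           time_saved_percent(old_s, new_s));
}

/* User times copied from the messages above (seconds). */
void print_thread_timings(void)
{
    print_comparison("191919, 64-bit, 9 queens",  1.63,    2.07);
    print_comparison("191919, 64-bit, 10 queens", 45.11,   56.63);
    print_comparison("191919, 32-bit, 9 queens",  1.95,    2.48);
    print_comparison("Howarth, 9 queens",         3.603,   4.050);
    print_comparison("Howarth, 10 queens",        104.217, 114.808);
}
```

Calling `print_thread_timings()` from a `main` prints one row per measurement. On these figures the slowdown works out to roughly 10% to 27% depending on machine and problem size, and the "20% faster" in the original report matches the time-saved convention for the 10 queens run.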
Jack Howarth
2015-Feb-14 17:18 UTC
[LLVMdev] trunk's optimizer generates slower code than 3.5
The same 22% performance regression also exists in current llvm/clang trunk for the SciMark2 Sparse matmult benchmark.

On Sat, Feb 14, 2015 at 12:11 PM, Jack Howarth <howarth.mailing.lists at gmail.com> wrote:
> Using the SciMark 2.0 code from
> http://math.nist.gov/scimark2/scimark2_1c.zip compiled with the
> same...
>
> make CFLAGS="-O3 -march=native"
>
> I am able to reproduce the 22% performance regression in the run time
> of the Sparse matmult benchmark.
> For 10 runs of the scimark2 benchmark, I get 998.439+/-0.4828 with
> the release llvm clang 3.5.1 compiler
> and 1217.363+/-1.1004 for the current clang 3.6svn from the 3.6 branch. Not good.
> Jack
>
> On Sat, Feb 14, 2015 at 11:19 AM, Jack Howarth
> <howarth.mailing.lists at gmail.com> wrote:
>> Do any of the build-bots routinely run the SciMark v2.0 benchmark?
>> If so, might not an examination of those logs reveal the commit range
>> at which the optimizations in that benchmark degraded?
>> Jack
>>
>> On Sat, Feb 14, 2015 at 11:13 AM, Jack Howarth
>> <howarth.mailing.lists at gmail.com> wrote:
>>> The regressions in the performance of generated code, introduced
>>> by the llvm 3.6 release, don't seem to be limited to this "8 queens
>>> puzzle" solver test case. See...
>>>
>>> http://www.phoronix.com/scan.php?page=article&item=llvm-clang-3.5-3.6-rc1&num=1
>>>
>>> where a big hit in the performance of the Sparse Matrix Multiply test
>>> of the SciMark v2.0 benchmark was observed as well as others.
>>> Do you really want to release 3.6 with this level of performance regression?
>>> Jack
>>>
>>> On Fri, Feb 13, 2015 at 2:47 PM, Jack Howarth
>>> <howarth.mailing.lists at gmail.com> wrote:
>>>> Also confirmed with the llvm 3.5.1 release and the llvm 3.6 release
>>>> branch on x86_64-apple-darwin14...
>>>>
>>>> % clang-3.5 -O3 -mssse3 -fomit-frame-pointer -fno-stack-protector
>>>> -fno-exceptions -o 8 8.c
>>>> % time ./8 9
>>>> 352 solutions
>>>> 3.603u 0.002s 0:03.60 100.0% 0+0k 0+0io 2pf+0w
>>>> % time ./8 10
>>>> 724 solutions
>>>> 104.217u 0.059s 1:44.30 99.9% 0+0k 0+0io 2pf+0w
>>>>
>>>> % clang-3.6 -O3 -mssse3 -fomit-frame-pointer -fno-stack-protector
>>>> -fno-exceptions -o 8 8.c
>>>> % time ./8 9
>>>> 352 solutions
>>>> 4.050u 0.001s 0:04.05 100.0% 0+0k 0+0io 2pf+0w
>>>> % time ./8 10
>>>> 724 solutions
>>>> 114.808u 0.041s 1:54.86 99.9% 0+0k 0+0io 2pf+0w
>>>>
>>>> On Fri, Feb 13, 2015 at 3:37 AM, 191919 <191919 at gmail.com> wrote:
>>>>> I submitted the problem report to clang's bugzilla but no one seems to
>>>>> care so I have to send it to the mailing list.
>>>>>
>>>>> clang 3.7 svn (trunk 229055 at the time of this report)
>>>>> generates slower code than 3.5 (Apple LLVM version 6.0
>>>>> (clang-600.0.56) (based on LLVM 3.5svn)) for the following code.
>>>>>
>>>>> It is an "8 queens puzzle" solver written as an educational example.
>>>>> Compiled by either clang 3.5 or 3.7, it gives the correct answer, but
>>>>> clang 3.5 generates code which runs 20% faster than 3.6/3.7.
>>>>>
>>>>> ##########################################
>>>>> # clang 3.5 which comes with Xcode 6.1.1
>>>>> ##########################################
>>>>> $ clang -O3 -mssse3 -fomit-frame-pointer -fno-stack-protector
>>>>> -fno-exceptions -o 8 8.c
>>>>> $ time ./8 9 # 9 queens
>>>>> 352 solutions
>>>>> ./8 9 1.63s user 0.00s system 99% cpu 1.632 total
>>>>> $ time ./8 10 # 10 queens
>>>>> 724 solutions
>>>>> ./8 10 45.11s user 0.01s system 99% cpu 45.121 total
>>>>>
>>>>> ##########################################
>>>>> # clang 3.7 svn trunk
>>>>> ##########################################
>>>>> $ /opt/bin/clang -O3 -mssse3 -fomit-frame-pointer -fno-stack-protector
>>>>> -fno-exceptions -o 8 8.c
>>>>> $ time ./8 9 # 9 queens
>>>>> 352 solutions
>>>>> ./8 9 2.07s user 0.00s system 99% cpu 2.078 total
>>>>> $ time ./8 10 # 10 queens
>>>>> 724 solutions
>>>>> ./8 10 56.63s user 0.02s system 99% cpu 56.650 total
>>>>>
>>>>> The source code is below. I also attached the executable files as well
>>>>> as the assembly code files for clang 3.5 and 3.6, produced by IDA.
>>>>>
>>>>> The performance is even worse when compiling as 32-bit code, while
>>>>> gcc-4.9.2 is not affected.
>>>>>
>>>>> ########## clang-3.5
>>>>> $ clang -m32 -O3 -fomit-frame-pointer -fno-stack-protector
>>>>> -fno-exceptions -o 8 8.c
>>>>> $ time ./8 9
>>>>> 352 solutions
>>>>> ./8 9 1.95s user 0.00s system 99% cpu 1.950 total
>>>>>
>>>>> ########## clang-3.7
>>>>> $ /opt/bin/clang -m32 -O3 -fomit-frame-pointer -fno-stack-protector
>>>>> -fno-exceptions -o 8 8.c
>>>>> $ time ./8 9
>>>>> 352 solutions
>>>>> ./8 9 2.48s user 0.00s system 99% cpu 2.480 total
>>>>>
>>>>> ######### gcc-4.9.2
>>>>> $ /opt/bin/gcc -m32 -O3 -fomit-frame-pointer -fno-stack-protector
>>>>> -fno-exceptions -o 8 8.c
>>>>> $ time ./8 9
>>>>> 352 solutions
>>>>> ./8 9 1.44s user 0.00s system 99% cpu 1.442 total
>>>>>
>>>>>
>>>>> ```
>>>>> #include <stdio.h>
>>>>> #include <stdlib.h>
>>>>>
>>>>> static inline int validate(int* a, int d)
>>>>> {
>>>>>     int i, j, x;
>>>>>     for (i = 0; i < d; ++i)
>>>>>     {
>>>>>         for (j = i+1, x = 1; j < d; ++j, ++x)
>>>>>         {
>>>>>             const int d = a[i] - a[j];
>>>>>             if (d == 0 || d == -x || d == x) return 0;
>>>>>         }
>>>>>     }
>>>>>     return 1;
>>>>> }
>>>>>
>>>>> static inline int solve(int d)
>>>>> {
>>>>>     int r = 0;
>>>>>     int* a = (int*) calloc(sizeof(int), d+1);
>>>>>     int p = d - 1;
>>>>>
>>>>>     for (;;)
>>>>>     {
>>>>>         a[p]++;
>>>>>
>>>>>         if (a[p] > d-1)
>>>>>         {
>>>>>             int bp = p - 1;
>>>>>             while (bp >= 0)
>>>>>             {
>>>>>                 a[bp]++;
>>>>>                 if (a[bp] <= d-1) break;
>>>>>                 a[bp] = 0;
>>>>>                 --bp;
>>>>>             }
>>>>>             if (bp < 0)
>>>>>                 break;
>>>>>             a[p] = 0;
>>>>>         }
>>>>>         if (validate(a, d))
>>>>>         {
>>>>>             ++r;
>>>>>         }
>>>>>     }
>>>>>
>>>>>     free(a);
>>>>>     return r;
>>>>> }
>>>>>
>>>>> int main(int argc, char** argv)
>>>>> {
>>>>>     if (argc != 2) return -1;
>>>>>     int r = solve((int) strtol(argv[1], NULL, 10));
>>>>>     printf("%d solutions\n", r);
>>>>> }
>>>>> ```
>>>>>
>>>>> clang 3.5's result:
>>>>>
>>>>> ```
>>>>> public _main
>>>>> _main proc near
>>>>>
>>>>> var_48 = qword ptr -48h
>>>>> var_40 = qword ptr -40h
>>>>> var_34 = dword ptr -34h
>>>>>
>>>>>     push rbp
>>>>>     push r15
>>>>>     push r14
push r13 >>>>> push r12 >>>>> push rbx >>>>> sub rsp, 18h >>>>> mov ebx, 0FFFFFFFFh >>>>> cmp edi, 2 >>>>> jnz loc_100000F29 >>>>> mov rdi, [rsi+8] ; char * >>>>> xor r14d, r14d >>>>> xor esi, esi ; char ** >>>>> mov edx, 0Ah ; int >>>>> call _strtol >>>>> mov r15, rax >>>>> shl rax, 20h >>>>> mov rsi, offset __mh_execute_header >>>>> add rsi, rax >>>>> sar rsi, 20h ; size_t >>>>> mov edi, 4 ; size_t >>>>> call _calloc >>>>> lea edx, [r15-1] >>>>> movsxd r8, edx >>>>> mov ecx, r15d >>>>> add ecx, 0FFFFFFFEh >>>>> js loc_100000DFA >>>>> test r15d, r15d >>>>> mov r11d, [rax+r8*4] >>>>> jle loc_100000EAE >>>>> mov ecx, r15d >>>>> add ecx, 0FFFFFFFEh >>>>> mov [rsp+48h+var_34], ecx >>>>> movsxd rcx, ecx >>>>> lea rcx, [rax+rcx*4] >>>>> mov [rsp+48h+var_40], rcx >>>>> lea rcx, [rax+4] >>>>> mov [rsp+48h+var_48], rcx >>>>> xor r14d, r14d >>>>> jmp short loc_100000D33 >>>>> ; --------------------------------------------------------------------------- >>>>> align 10h >>>>> >>>>> loc_100000D30: ; CODE XREF: _main+129 j >>>>> ; _main+131 j ... 
>>>>> add r14d, ebx >>>>> >>>>> loc_100000D33: ; CODE XREF: _main+92 j >>>>> cmp r11d, edx >>>>> lea edi, [r11+1] >>>>> mov [rax+r8*4], edi >>>>> mov rcx, [rsp+48h+var_40] >>>>> mov esi, [rsp+48h+var_34] >>>>> mov r11d, edi >>>>> jl short loc_100000D84 >>>>> nop dword ptr [rax+00h] >>>>> >>>>> loc_100000D50: ; CODE XREF: _main+DA j >>>>> mov edi, [rcx] >>>>> lea ebp, [rdi+1] >>>>> mov [rcx], ebp >>>>> cmp edi, edx >>>>> jl short loc_100000D71 >>>>> mov dword ptr [rcx], 0 >>>>> add rcx, 0FFFFFFFFFFFFFFFCh >>>>> test esi, esi >>>>> lea esi, [rsi-1] >>>>> jg short loc_100000D50 >>>>> jmp loc_100000F0E >>>>> ; --------------------------------------------------------------------------- >>>>> >>>>> loc_100000D71: ; CODE XREF: _main+C9 j >>>>> test esi, esi >>>>> js loc_100000F0E >>>>> mov dword ptr [rax+r8*4], 0 >>>>> xor r11d, r11d >>>>> >>>>> loc_100000D84: ; CODE XREF: _main+BA j >>>>> cmp r15d, 1 >>>>> mov esi, 0 >>>>> mov r9, [rsp+48h+var_48] >>>>> mov r12d, 1 >>>>> jle short loc_100000DF0 >>>>> >>>>> loc_100000D99: ; CODE XREF: _main+15E j >>>>> mov r10d, [rax+rsi*4] >>>>> mov ecx, 0FFFFFFFFh >>>>> mov edi, 1 >>>>> mov r13, r9 >>>>> nop word ptr [rax+rax+00h] >>>>> >>>>> loc_100000DB0: ; CODE XREF: _main+14F j >>>>> xor ebx, ebx >>>>> mov ebp, r10d >>>>> sub ebp, [r13+0] >>>>> jz loc_100000D30 >>>>> cmp ecx, ebp >>>>> jz loc_100000D30 >>>>> cmp edi, ebp >>>>> jz loc_100000D30 >>>>> add r13, 4 >>>>> inc rdi >>>>> dec ecx >>>>> mov ebx, edi >>>>> add ebx, esi >>>>> cmp ebx, r15d >>>>> jl short loc_100000DB0 >>>>> inc r12 >>>>> add r9, 4 >>>>> inc rsi >>>>> cmp r12d, r15d >>>>> jl short loc_100000D99 >>>>> >>>>> loc_100000DF0: ; CODE XREF: _main+107 j >>>>> mov ebx, 1 >>>>> jmp loc_100000D30 >>>>> ; --------------------------------------------------------------------------- >>>>> >>>>> loc_100000DFA: ; CODE XREF: _main+5E j >>>>> mov ecx, [rax+r8*4] >>>>> lea r9d, [rcx+1] >>>>> mov [rax+r8*4], r9d >>>>> cmp ecx, r8d >>>>> jge loc_100000F0E >>>>> lea r12, [rax+4] >>>>> 
xor r14d, r14d >>>>> db 2Eh >>>>> nop word ptr [rax+rax+00000000h] >>>>> >>>>> loc_100000E20: ; CODE XREF: _main+216 j >>>>> test r15d, r15d >>>>> setle cl >>>>> cmp r15d, 2 >>>>> jl short loc_100000E90 >>>>> test cl, cl >>>>> mov r13d, 0 >>>>> mov r11, r12 >>>>> mov r10d, 1 >>>>> jnz short loc_100000E90 >>>>> >>>>> loc_100000E3F: ; CODE XREF: _main+1F0 j >>>>> mov edi, [rax+r13*4] >>>>> mov edx, 0FFFFFFFFh >>>>> mov ecx, 1 >>>>> mov rsi, r11 >>>>> >>>>> loc_100000E50: ; CODE XREF: _main+1E1 j >>>>> xor ebx, ebx >>>>> mov ebp, edi >>>>> sub ebp, [rsi] >>>>> jz short loc_100000E95 >>>>> cmp edx, ebp >>>>> jz short loc_100000E95 >>>>> cmp ecx, ebp >>>>> jz short loc_100000E95 >>>>> add rsi, 4 >>>>> inc rcx >>>>> dec edx >>>>> mov ebx, ecx >>>>> add ebx, r13d >>>>> cmp ebx, r15d >>>>> jl short loc_100000E50 >>>>> inc r10 >>>>> add r11, 4 >>>>> inc r13 >>>>> cmp r10d, r15d >>>>> jl short loc_100000E3F >>>>> db 66h, 66h, 66h, 66h, 2Eh >>>>> nop word ptr [rax+rax+00000000h] >>>>> >>>>> loc_100000E90: ; CODE XREF: _main+19A j >>>>> ; _main+1AD j >>>>> mov ebx, 1 >>>>> >>>>> loc_100000E95: ; CODE XREF: _main+1C6 j >>>>> ; _main+1CA j ... 
>>>>> add r14d, ebx >>>>> cmp r9d, r8d >>>>> lea ecx, [r9+1] >>>>> mov [rax+r8*4], ecx >>>>> mov r9d, ecx >>>>> jl loc_100000E20 >>>>> jmp short loc_100000F0E >>>>> ; --------------------------------------------------------------------------- >>>>> >>>>> loc_100000EAE: ; CODE XREF: _main+6B j >>>>> add r15d, 0FFFFFFFEh >>>>> movsxd rcx, r15d >>>>> lea rcx, [rax+rcx*4] >>>>> xor r14d, r14d >>>>> jmp short loc_100000EC6 >>>>> ; --------------------------------------------------------------------------- >>>>> align 20h >>>>> >>>>> loc_100000EC0: ; CODE XREF: _main+247 j >>>>> ; _main+27C j >>>>> inc r14d >>>>> mov r11d, ebp >>>>> >>>>> loc_100000EC6: ; CODE XREF: _main+22C j >>>>> lea ebp, [r11+1] >>>>> mov [rax+r8*4], ebp >>>>> cmp r11d, r8d >>>>> mov rsi, rcx >>>>> mov edi, r15d >>>>> jl short loc_100000EC0 >>>>> nop dword ptr [rax+00000000h] >>>>> >>>>> loc_100000EE0: ; CODE XREF: _main+26A j >>>>> mov ebp, [rsi] >>>>> lea ebx, [rbp+1] >>>>> mov [rsi], ebx >>>>> cmp ebp, edx >>>>> jl short loc_100000EFE >>>>> mov dword ptr [rsi], 0 >>>>> add rsi, 0FFFFFFFFFFFFFFFCh >>>>> test edi, edi >>>>> lea edi, [rdi-1] >>>>> jg short loc_100000EE0 >>>>> jmp short loc_100000F0E >>>>> ; --------------------------------------------------------------------------- >>>>> >>>>> loc_100000EFE: ; CODE XREF: _main+259 j >>>>> test edi, edi >>>>> js short loc_100000F0E >>>>> mov dword ptr [rax+r8*4], 0 >>>>> xor ebp, ebp >>>>> jmp short loc_100000EC0 >>>>> ; --------------------------------------------------------------------------- >>>>> >>>>> loc_100000F0E: ; CODE XREF: _main+DC j >>>>> ; _main+E3 j ... 
>>>>> mov rdi, rax ; void * >>>>> call _free >>>>> lea rdi, aDSolutions ; "%d solutions\n" >>>>> xor ebx, ebx >>>>> xor eax, eax >>>>> mov esi, r14d >>>>> call _printf >>>>> >>>>> loc_100000F29: ; CODE XREF: _main+16 j >>>>> mov eax, ebx >>>>> add rsp, 18h >>>>> pop rbx >>>>> pop r12 >>>>> pop r13 >>>>> pop r14 >>>>> pop r15 >>>>> pop rbp >>>>> retn >>>>> _main endp >>>>> ``` >>>>> >>>>> clang 3.6's result: >>>>> >>>>> ``` >>>>> public _main >>>>> _main proc near >>>>> >>>>> var_60 = qword ptr -60h >>>>> var_58 = qword ptr -58h >>>>> var_50 = qword ptr -50h >>>>> var_48 = qword ptr -48h >>>>> var_40 = qword ptr -40h >>>>> var_38 = qword ptr -38h >>>>> >>>>> push rbp >>>>> push r15 >>>>> push r14 >>>>> push r13 >>>>> push r12 >>>>> push rbx >>>>> sub rsp, 38h >>>>> mov ebx, 0FFFFFFFFh >>>>> cmp edi, 2 >>>>> jnz loc_100000F23 >>>>> mov rbx, offset __mh_execute_header >>>>> mov rdi, [rsi+8] ; char * >>>>> xor r13d, r13d >>>>> xor esi, esi ; char ** >>>>> mov edx, 0Ah ; int >>>>> call _strtol >>>>> mov r14, rax >>>>> shl rax, 20h >>>>> mov [rsp+68h+var_38], rax >>>>> lea rsi, [rax+rbx] >>>>> sar rsi, 20h ; size_t >>>>> mov edi, 4 ; size_t >>>>> call _calloc >>>>> lea r11d, [r14-1] >>>>> movsxd r12, r11d >>>>> mov [rsp+68h+var_40], r12 >>>>> movsxd rcx, r14d >>>>> mov [rsp+68h+var_50], rcx >>>>> add ecx, 0FFFFFFFEh >>>>> js loc_100000E1A >>>>> mov ecx, r14d >>>>> add ecx, 0FFFFFFFEh >>>>> movsxd rcx, ecx >>>>> inc rcx >>>>> mov [rsp+68h+var_58], rcx >>>>> mov rcx, rax >>>>> add rcx, 4 >>>>> mov [rsp+68h+var_60], rcx >>>>> xor ebp, ebp >>>>> jmp short loc_100000D17 >>>>> ; --------------------------------------------------------------------------- >>>>> align 10h >>>>> >>>>> loc_100000D10: ; CODE XREF: _main+15B j >>>>> ; _main+163 j ... 
>>>>> mov rbp, [rsp+68h+var_48] >>>>> add ebp, edi >>>>> >>>>> loc_100000D17: ; CODE XREF: _main+93 j >>>>> cmp r13d, r11d >>>>> lea edx, [r13+1] >>>>> mov [rax+r12*4], edx >>>>> mov rcx, [rsp+68h+var_58] >>>>> mov r13d, edx >>>>> jl short loc_100000D6B >>>>> nop dword ptr [rax+00h] >>>>> >>>>> loc_100000D30: ; CODE XREF: _main+DE j >>>>> mov edx, [rax+rcx*4-4] >>>>> lea esi, [rdx+1] >>>>> mov [rax+rcx*4-4], esi >>>>> cmp edx, r11d >>>>> jl short loc_100000D60 >>>>> mov dword ptr [rax+rcx*4-4], 0 >>>>> dec rcx >>>>> test rcx, rcx >>>>> jg short loc_100000D30 >>>>> jmp loc_100000F09 >>>>> ; --------------------------------------------------------------------------- >>>>> align 20h >>>>> >>>>> loc_100000D60: ; CODE XREF: _main+CE j >>>>> mov dword ptr [rax+r12*4], 0 >>>>> xor r13d, r13d >>>>> >>>>> loc_100000D6B: ; CODE XREF: _main+BA j >>>>> mov [rsp+68h+var_48], rbp >>>>> test r14d, r14d >>>>> setle cl >>>>> mov rdx, offset __mh_execute_header >>>>> lea rdx, [rdx+1] >>>>> cmp [rsp+68h+var_38], rdx >>>>> jl loc_100000E10 >>>>> test cl, cl >>>>> mov edx, 0 >>>>> mov r10, [rsp+68h+var_60] >>>>> mov r9d, 1 >>>>> jnz short loc_100000E10 >>>>> >>>>> loc_100000DA3: ; CODE XREF: _main+195 j >>>>> mov esi, [rax+rdx*4] >>>>> mov r15d, 0FFFFFFFFh >>>>> mov r8d, 1 >>>>> mov rcx, r10 >>>>> db 66h, 66h, 2Eh >>>>> nop dword ptr [rax+rax+00000000h] >>>>> >>>>> loc_100000DC0: ; CODE XREF: _main+184 j >>>>> mov ebx, [rcx] >>>>> mov ebp, esi >>>>> sub ebp, ebx >>>>> xor edi, edi >>>>> cmp r8d, ebp >>>>> jz loc_100000D10 >>>>> cmp esi, ebx >>>>> jz loc_100000D10 >>>>> cmp r15d, ebp >>>>> jz loc_100000D10 >>>>> add rcx, 4 >>>>> inc r8 >>>>> dec r15d >>>>> mov edi, r8d >>>>> add edi, edx >>>>> cmp edi, r14d >>>>> jl short loc_100000DC0 >>>>> inc r9 >>>>> add r10, 4 >>>>> inc rdx >>>>> cmp r9, [rsp+68h+var_50] >>>>> jl short loc_100000DA3 >>>>> nop word ptr [rax+rax+00000000h] >>>>> >>>>> loc_100000E10: ; CODE XREF: _main+119 j >>>>> ; _main+131 j >>>>> mov edi, 1 >>>>> jmp loc_100000D10 
>>>>> ; --------------------------------------------------------------------------- >>>>> >>>>> loc_100000E1A: ; CODE XREF: _main+6E j >>>>> test r14d, r14d >>>>> jle loc_100000F00 >>>>> mov dword ptr [rax+r12*4], 1 >>>>> xor ebp, ebp >>>>> cmp r14d, 2 >>>>> jl loc_100000F09 >>>>> mov rcx, rax >>>>> add rcx, 4 >>>>> mov [rsp+68h+var_48], rcx >>>>> xor ebp, ebp >>>>> mov r15d, 1 >>>>> nop dword ptr [rax+rax+00h] >>>>> >>>>> loc_100000E50: ; CODE XREF: _main+288 j >>>>> mov rbx, rbp >>>>> mov rcx, offset __mh_execute_header >>>>> cmp [rsp+68h+var_38], rcx >>>>> mov edx, 0 >>>>> mov r13, [rsp+68h+var_48] >>>>> mov r8d, 1 >>>>> mov r9d, 1 >>>>> jle short loc_100000EE0 >>>>> >>>>> loc_100000E7A: ; CODE XREF: _main+25A j >>>>> mov r12d, [rax+rdx*4] >>>>> mov edi, 0FFFFFFFFh >>>>> mov ecx, 1 >>>>> mov rsi, r13 >>>>> nop dword ptr [rax+rax+00h] >>>>> >>>>> loc_100000E90: ; CODE XREF: _main+249 j >>>>> mov r10d, [rsi] >>>>> mov ebp, r12d >>>>> sub ebp, r10d >>>>> xor r9d, r9d >>>>> cmp ecx, ebp >>>>> jz short loc_100000EE0 >>>>> cmp r12d, r10d >>>>> jz short loc_100000EE0 >>>>> cmp edi, ebp >>>>> jz short loc_100000EE0 >>>>> add rsi, 4 >>>>> inc rcx >>>>> dec edi >>>>> mov ebp, ecx >>>>> add ebp, edx >>>>> cmp ebp, r14d >>>>> jl short loc_100000E90 >>>>> inc r8 >>>>> add r13, 4 >>>>> inc rdx >>>>> cmp r8, [rsp+68h+var_50] >>>>> jl short loc_100000E7A >>>>> mov r9d, 1 >>>>> db 66h, 66h, 66h, 66h, 2Eh >>>>> nop word ptr [rax+rax+00000000h] >>>>> >>>>> loc_100000EE0: ; CODE XREF: _main+208 j >>>>> ; _main+22E j ... 
>>>>> mov rbp, rbx >>>>> add ebp, r9d >>>>> cmp r15d, r11d >>>>> lea ecx, [r15+1] >>>>> mov rdx, [rsp+68h+var_40] >>>>> mov [rax+rdx*4], ecx >>>>> mov r15d, ecx >>>>> jl loc_100000E50 >>>>> jmp short loc_100000F09 >>>>> ; --------------------------------------------------------------------------- >>>>> >>>>> loc_100000F00: ; CODE XREF: _main+1AD j >>>>> xor ebp, ebp >>>>> test r11d, r11d >>>>> cmovns ebp, r11d >>>>> >>>>> loc_100000F09: ; CODE XREF: _main+E0 j >>>>> ; _main+1C1 j ... >>>>> mov rdi, rax ; void * >>>>> call _free >>>>> lea rdi, aDSolutions ; "%d solutions\n" >>>>> xor ebx, ebx >>>>> xor eax, eax >>>>> mov esi, ebp >>>>> call _printf >>>>> >>>>> loc_100000F23: ; CODE XREF: _main+16 j >>>>> mov eax, ebx >>>>> add rsp, 38h >>>>> pop rbx >>>>> pop r12 >>>>> pop r13 >>>>> pop r14 >>>>> pop r15 >>>>> pop rbp >>>>> retn >>>>> _main endp >>>>> ``` >>>>> >>>>> gcc-4.9.2's result: >>>>> ``` >>>>> >>>>> _main proc near >>>>> >>>>> var_48 = qword ptr -48h >>>>> var_40 = dword ptr -40h >>>>> var_3C = dword ptr -3Ch >>>>> >>>>> cmp edi, 2 >>>>> jz short loc_100000D69 >>>>> or eax, 0FFFFFFFFh >>>>> retn >>>>> ; --------------------------------------------------------------------------- >>>>> >>>>> loc_100000D69: ; CODE XREF: _main+3 j >>>>> push r15 >>>>> mov edx, 0Ah ; int >>>>> push r14 >>>>> push r13 >>>>> push r12 >>>>> push rbp >>>>> push rbx >>>>> sub rsp, 18h >>>>> mov rdi, [rsi+8] ; char * >>>>> xor esi, esi ; char ** >>>>> call _strtol >>>>> mov edi, 4 ; size_t >>>>> lea esi, [rax+1] >>>>> mov r14, rax >>>>> mov ebx, eax >>>>> lea r15d, [r14-2] >>>>> movsxd rsi, esi ; size_t >>>>> call _calloc >>>>> mov [rsp+48h+var_3C], 0 >>>>> mov rdi, rax ; void * >>>>> lea eax, [r14-1] >>>>> cdqe >>>>> lea r13, [rdi+rax*4] >>>>> movsxd rax, r15d >>>>> mov ebp, [r13+0] >>>>> shl rax, 2 >>>>> lea r12, [rdi+rax] >>>>> lea rax, [rdi+rax-4] >>>>> mov [rsp+48h+var_48], rax >>>>> mov eax, r14d >>>>> lea r14d, [r14+1] >>>>> nop word ptr [rax+rax+00h] >>>>> nop word ptr 
[rax+rax+00h] >>>>> >>>>> loc_100000DE0: ; CODE XREF: _main+12B j >>>>> ; _main+155 j ... >>>>> add ebp, 1 >>>>> cmp ebx, ebp >>>>> mov [r13+0], ebp >>>>> jg short loc_100000E62 >>>>> test r15d, r15d >>>>> js short loc_100000E33 >>>>> mov ecx, [r12] >>>>> lea edx, [rcx+1] >>>>> cmp ebx, edx >>>>> mov [r12], edx >>>>> jg short loc_100000E58 >>>>> mov r8, r12 >>>>> mov rcx, [rsp+48h+var_48] >>>>> mov esi, r15d >>>>> jmp short loc_100000E24 >>>>> ; --------------------------------------------------------------------------- >>>>> align 10h >>>>> >>>>> loc_100000E10: ; CODE XREF: _main+D1 j >>>>> mov edx, [rcx] >>>>> sub r8, 4 >>>>> sub rcx, 4 >>>>> add edx, 1 >>>>> mov [rcx+4], edx >>>>> cmp ebx, edx >>>>> jg short loc_100000E58 >>>>> >>>>> loc_100000E24: ; CODE XREF: _main+A9 j >>>>> sub esi, 1 >>>>> mov dword ptr [r8], 0 >>>>> cmp esi, 0FFFFFFFFh >>>>> jnz short loc_100000E10 >>>>> >>>>> loc_100000E33: ; CODE XREF: _main+8E j >>>>> call _free >>>>> mov esi, [rsp+48h+var_3C] >>>>> add rsp, 18h >>>>> xor eax, eax >>>>> pop rbx >>>>> lea rdi, aDSolutions ; "%d solutions\n" >>>>> pop rbp >>>>> pop r12 >>>>> pop r13 >>>>> pop r14 >>>>> pop r15 >>>>> jmp _printf >>>>> ; --------------------------------------------------------------------------- >>>>> >>>>> loc_100000E58: ; CODE XREF: _main+9D j >>>>> ; _main+C2 j >>>>> mov dword ptr [r13+0], 0 >>>>> xor ebp, ebp >>>>> >>>>> loc_100000E62: ; CODE XREF: _main+89 j >>>>> test ebx, ebx >>>>> jle loc_100000EE6 >>>>> lea r11, [rdi+8] >>>>> xor r10d, r10d >>>>> >>>>> loc_100000E71: ; CODE XREF: _main+184 j >>>>> add r10d, 1 >>>>> cmp r10d, eax >>>>> jz short loc_100000EE6 >>>>> mov r8d, [r11-8] >>>>> mov edx, r8d >>>>> sub edx, [r11-4] >>>>> add edx, 1 >>>>> cmp edx, 2 >>>>> jbe loc_100000DE0 >>>>> mov r9d, r14d >>>>> mov rcx, r11 >>>>> mov edx, 1 >>>>> mov [rsp+48h+var_40], r10d >>>>> sub r9d, r10d >>>>> jmp short loc_100000ED3 >>>>> ; --------------------------------------------------------------------------- >>>>> align 10h 
>>>>> >>>>> loc_100000EB0: ; CODE XREF: _main+179 j >>>>> mov esi, r8d >>>>> sub esi, [rcx] >>>>> jz loc_100000DE0 >>>>> mov r10d, esi >>>>> add rcx, 4 >>>>> add r10d, edx >>>>> jz loc_100000DE0 >>>>> cmp esi, edx >>>>> jz loc_100000DE0 >>>>> >>>>> loc_100000ED3: ; CODE XREF: _main+144 j >>>>> add edx, 1 >>>>> cmp edx, r9d >>>>> jnz short loc_100000EB0 >>>>> mov r10d, [rsp+48h+var_40] >>>>> add r11, 4 >>>>> jmp short loc_100000E71 >>>>> ; --------------------------------------------------------------------------- >>>>> >>>>> loc_100000EE6: ; CODE XREF: _main+104 j >>>>> ; _main+118 j >>>>> add [rsp+48h+var_3C], 1 >>>>> jmp loc_100000DE0 >>>>> _main endp >>>>> ``` >>>>> >>>>> MSVC 10.0's result: >>>>> >>>>> ``` >>>>> >>>>> _main proc near ; CODE XREF: ___tmainCRTStartup+106 p >>>>> >>>>> var_80 = dword ptr -80h >>>>> var_7C = dword ptr -7Ch >>>>> var_78 = dword ptr -78h >>>>> var_74 = dword ptr -74h >>>>> var_70 = dword ptr -70h >>>>> var_6C = dword ptr -6Ch >>>>> var_68 = dword ptr -68h >>>>> var_64 = dword ptr -64h >>>>> var_60 = dword ptr -60h >>>>> var_5C = dword ptr -5Ch >>>>> argc = dword ptr 8 >>>>> argv = dword ptr 0Ch >>>>> envp = dword ptr 10h >>>>> >>>>> push ebp >>>>> mov ebp, esp >>>>> and esp, 0FFFFFF80h >>>>> push esi >>>>> push edi >>>>> push ebx >>>>> sub esp, 74h >>>>> push 3 >>>>> call sub_4080F0 >>>>> add esp, 4 >>>>> stmxcsr [esp+80h+var_80] >>>>> or [esp+80h+var_80], 8000h >>>>> ldmxcsr [esp+80h+var_80] >>>>> cmp [ebp+argc], 2 >>>>> jz short loc_40103A >>>>> mov eax, 0FFFFFFFFh >>>>> add esp, 74h >>>>> pop ebx >>>>> pop edi >>>>> pop esi >>>>> mov esp, ebp >>>>> pop ebp >>>>> retn >>>>> ; --------------------------------------------------------------------------- >>>>> >>>>> loc_40103A: ; CODE XREF: _main+29 j >>>>> call ds:GetTickCount >>>>> mov esi, eax >>>>> mov eax, [ebp+argv] >>>>> push dword ptr [eax+4] ; char * >>>>> call _atoi >>>>> mov edi, eax >>>>> lea eax, [edi+1] >>>>> push eax ; size_t >>>>> push 4 ; size_t >>>>> call _calloc 
>>>>> add esp, 0Ch >>>>> mov ecx, [eax+edi*4-4] >>>>> lea edx, [edi-1] >>>>> mov [esp+80h+var_6C], ecx >>>>> xor ebx, ebx >>>>> mov [esp+80h+var_7C], ebx >>>>> lea ecx, [eax+edi*4] >>>>> mov [esp+80h+var_74], ecx >>>>> lea ecx, [edi-2] >>>>> mov [esp+80h+var_70], ecx >>>>> mov [esp+80h+var_60], edx >>>>> mov [esp+80h+var_80], esi >>>>> mov ecx, [esp+80h+var_6C] >>>>> >>>>> loc_401087: ; CODE XREF: _main+142 j >>>>> ; _main+193 j >>>>> mov edx, [esp+80h+var_60] >>>>> inc ecx >>>>> mov [eax+edi*4-4], ecx >>>>> cmp edi, [eax+edx*4] >>>>> jg short loc_4010DC >>>>> mov esi, [esp+80h+var_70] >>>>> test esi, esi >>>>> js short loc_4010CE >>>>> xor edx, edx >>>>> mov [esp+80h+var_78], eax >>>>> xor ebx, ebx >>>>> mov eax, [esp+80h+var_74] >>>>> >>>>> loc_4010A9: ; CODE XREF: _main+C8 j >>>>> mov ecx, [eax+ebx*4-8] >>>>> inc ecx >>>>> cmp ecx, edi >>>>> jl loc_40117A >>>>> inc edx >>>>> lea esi, [ebx+edi-3] >>>>> mov dword ptr [eax+ebx*4-8], 0 >>>>> dec ebx >>>>> cmp edx, [esp+80h+var_60] >>>>> jb short loc_4010A9 >>>>> mov eax, [esp+80h+var_78] >>>>> >>>>> loc_4010CE: ; CODE XREF: _main+9B j >>>>> ; _main+186 j >>>>> test esi, esi >>>>> jl short loc_401147 >>>>> mov dword ptr [eax+edi*4-4], 0 >>>>> xor ecx, ecx >>>>> >>>>> loc_4010DC: ; CODE XREF: _main+93 j >>>>> test edi, edi >>>>> jle short loc_40113E >>>>> mov [esp+80h+var_6C], ecx >>>>> xor edx, edx >>>>> mov [esp+80h+var_5C], edi >>>>> >>>>> loc_4010EA: ; CODE XREF: _main+132 j >>>>> lea ecx, [edx+1] >>>>> mov ebx, ecx >>>>> mov esi, ebx >>>>> cmp ecx, [esp+80h+var_5C] >>>>> jge short loc_401130 >>>>> mov edx, [eax+edx*4] >>>>> mov edi, 1 >>>>> mov [esp+80h+var_64], esi >>>>> mov [esp+80h+var_68], ecx >>>>> >>>>> loc_401107: ; CODE XREF: _main+122 j >>>>> mov esi, [eax+ebx*4] >>>>> cmp edx, esi >>>>> jz short loc_40118B >>>>> sub esi, edx >>>>> mov ecx, esi >>>>> neg ecx >>>>> cmp edi, ecx >>>>> jz short loc_40118B >>>>> cmp esi, edi >>>>> jz short loc_40118B >>>>> inc ebx >>>>> inc edi >>>>> cmp ebx, 
[esp+80h+var_5C] >>>>> jl short loc_401107 >>>>> mov ecx, [esp+80h+var_68] >>>>> mov esi, [esp+80h+var_64] >>>>> cmp ecx, [esp+80h+var_5C] >>>>> >>>>> loc_401130: ; CODE XREF: _main+F5 j >>>>> mov edx, esi >>>>> jl short loc_4010EA >>>>> xchg ax, ax >>>>> mov ecx, [esp+80h+var_6C] >>>>> mov edi, [esp+80h+var_5C] >>>>> >>>>> loc_40113E: ; CODE XREF: _main+DE j >>>>> inc [esp+80h+var_7C] >>>>> jmp loc_401087 >>>>> ; --------------------------------------------------------------------------- >>>>> >>>>> loc_401147: ; CODE XREF: _main+D0 j >>>>> mov ebx, [esp+80h+var_7C] >>>>> mov esi, [esp+80h+var_80] >>>>> push eax ; void * >>>>> call _free >>>>> add esp, 4 >>>>> call ds:GetTickCount >>>>> sub eax, esi >>>>> push eax >>>>> push ebx >>>>> push offset aDSolutionsInDM ; "%d solutions in %d msecs.\n" >>>>> call _printf >>>>> xor eax, eax >>>>> add esp, 80h >>>>> pop ebx >>>>> pop edi >>>>> pop esi >>>>> mov esp, ebp >>>>> pop ebp >>>>> retn >>>>> ; --------------------------------------------------------------------------- >>>>> >>>>> loc_40117A: ; CODE XREF: _main+B0 j >>>>> mov edx, [esp+80h+var_74] >>>>> mov eax, [esp+80h+var_78] >>>>> mov [edx+ebx*4-8], ecx >>>>> jmp loc_4010CE >>>>> ; --------------------------------------------------------------------------- >>>>> >>>>> loc_40118B: ; CODE XREF: _main+10C j >>>>> ; _main+116 j ... >>>>> mov ecx, [esp+80h+var_6C] >>>>> mov edi, [esp+80h+var_5C] >>>>> jmp loc_401087 >>>>> _main endp >>>>> ``` >>>>> _______________________________________________ >>>>> LLVM Developers mailing list >>>>> LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu >>>>> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
Jack Howarth
2015-Feb-14 17:31 UTC
[LLVMdev] trunk's optimizer generates slower code than 3.5
Oops. I misspoke. The 22% performance regression is in fact eliminated in current llvm/clang trunk. Hopefully this is due to a single fix that can be backported rather than some large change in the code.

On Sat, Feb 14, 2015 at 12:18 PM, Jack Howarth <howarth.mailing.lists at gmail.com> wrote:
> The same 22% performance regression also exists in current llvm/clang
> trunk for the SciMark2 Sparse matmult benchmark.
>
> On Sat, Feb 14, 2015 at 12:11 PM, Jack Howarth <howarth.mailing.lists at gmail.com> wrote:
>> Using the SciMark 2.0 code from
>> http://math.nist.gov/scimark2/scimark2_1c.zip compiled with the
>> same...
>>
>> make CFLAGS="-O3 -march=native"
>>
>> I am able to reproduce the 22% performance regression in the run time
>> of the Sparse matmult benchmark.
>> For 10 runs of the scimark2 benchmark, I get 998.439+/-0.4828 with
>> the release llvm clang 3.5.1 compiler
>> and 1217.363+/-1.1004 for the current clang 3.6svn from the 3.6 branch. Not good.
>> Jack
>>
>> On Sat, Feb 14, 2015 at 11:19 AM, Jack Howarth <howarth.mailing.lists at gmail.com> wrote:
>>> Do any of the build-bots routinely run the SciMark v2.0 benchmark?
>>> If so, might not an examination of those logs reveal the commit range
>>> at which the optimizations in that benchmark degraded?
>>> Jack
>>>
>>> On Sat, Feb 14, 2015 at 11:13 AM, Jack Howarth <howarth.mailing.lists at gmail.com> wrote:
>>>> The regressions in the performance of generated code, introduced
>>>> by the llvm 3.6 release, don't seem to be limited to this "8 queens
>>>> puzzle" solver test case. See...
>>>>
>>>> http://www.phoronix.com/scan.php?page=article&item=llvm-clang-3.5-3.6-rc1&num=1
>>>>
>>>> where a big hit in the performance of the Sparse Matrix Multiply test
>>>> of the SciMark v2.0 benchmark was observed, as well as others.
>>>> Do you really want to release 3.6 with this level of performance regression?
>>>> Jack
>>>>
>>>> On Fri, Feb 13, 2015 at 2:47 PM, Jack Howarth <howarth.mailing.lists at gmail.com> wrote:
>>>>> Also confirmed with the llvm 3.5.1 release and the llvm 3.6 release
>>>>> branch on x86_64-apple-darwin14...
>>>>>
>>>>> % clang-3.5 -O3 -mssse3 -fomit-frame-pointer -fno-stack-protector -fno-exceptions -o 8 8.c
>>>>> % time ./8 9
>>>>> 352 solutions
>>>>> 3.603u 0.002s 0:03.60 100.0% 0+0k 0+0io 2pf+0w
>>>>> % time ./8 10
>>>>> 724 solutions
>>>>> 104.217u 0.059s 1:44.30 99.9% 0+0k 0+0io 2pf+0w
>>>>>
>>>>> % clang-3.6 -O3 -mssse3 -fomit-frame-pointer -fno-stack-protector -fno-exceptions -o 8 8.c
>>>>> % time ./8 9
>>>>> 352 solutions
>>>>> 4.050u 0.001s 0:04.05 100.0% 0+0k 0+0io 2pf+0w
>>>>> % time ./8 10
>>>>> 724 solutions
>>>>> 114.808u 0.041s 1:54.86 99.9% 0+0k 0+0io 2pf+0w
>>>>>
>>>>> On Fri, Feb 13, 2015 at 3:37 AM, 191919 <191919 at gmail.com> wrote:
>>>>>> I submitted the problem report to clang's bugzilla but no one seems to
>>>>>> care, so I have to send it to the mailing list.
>>>>>>
>>>>>> clang 3.7 svn (trunk 229055 at the time I reported this problem)
>>>>>> generates slower code than 3.5 (Apple LLVM version 6.0
>>>>>> (clang-600.0.56) (based on LLVM 3.5svn)) for the following code.
>>>>>>
>>>>>> It is an "8 queens puzzle" solver written as an educational example.
>>>>>> Compiled with either clang 3.5 or 3.7 it gives the correct answer, but
>>>>>> clang 3.5 generates code which runs 20% faster than 3.6/3.7.
>>>>>>
>>>>>> ##########################################
>>>>>> # clang 3.5 which comes with Xcode 6.1.1
>>>>>> ##########################################
>>>>>> $ clang -O3 -mssse3 -fomit-frame-pointer -fno-stack-protector -fno-exceptions -o 8 8.c
>>>>>> $ time ./8 9 # 9 queens
>>>>>> 352 solutions
>>>>>> ./8 9 1.63s user 0.00s system 99% cpu 1.632 total
>>>>>> $ time ./8 10 # 10 queens
>>>>>> 724 solutions
>>>>>> ./8 10 45.11s user 0.01s system 99% cpu 45.121 total
>>>>>>
>>>>>> ##########################################
>>>>>> # clang 3.7 svn trunk
>>>>>> ##########################################
>>>>>> $ /opt/bin/clang -O3 -mssse3 -fomit-frame-pointer -fno-stack-protector -fno-exceptions -o 8 8.c
>>>>>> $ time ./8 9 # 9 queens
>>>>>> 352 solutions
>>>>>> ./8 9 2.07s user 0.00s system 99% cpu 2.078 total
>>>>>> $ time ./8 10 # 10 queens
>>>>>> 724 solutions
>>>>>> ./8 10 56.63s user 0.02s system 99% cpu 56.650 total
>>>>>>
>>>>>> The source code is below. I have also attached the executable files as well as the assembly listings produced by IDA for clang 3.5 and 3.6.
>>>>>>
>>>>>> The performance is even worse when compiling as 32-bit code, while gcc-4.9.2 is not affected.
>>>>>> >>>>>> ########## clang-3.5 >>>>>> $ clang -m32 -O3 -fomit-frame-pointer -fno-stack-protector >>>>>> -fno-exceptions -o 8 8.c >>>>>> $ time ./8 9 >>>>>> 352 solutions >>>>>> ./8 9 1.95s user 0.00s system 99% cpu 1.950 total >>>>>> >>>>>> ########## clang-3.7 >>>>>> $ /opt/bin/clang -m32 -O3 -fomit-frame-pointer -fno-stack-protector >>>>>> -fno-exceptions -o 8 8.c >>>>>> $ time ./8 9 >>>>>> 352 solutions >>>>>> ./8 9 2.48s user 0.00s system 99% cpu 2.480 total >>>>>> >>>>>> ######### gcc-4.9.2 >>>>>> $ /opt/bin/gcc -m32 -O3 -fomit-frame-pointer -fno-stack-protector >>>>>> -fno-exceptions -o 8 8.c >>>>>> $ time ./8 9 >>>>>> 352 solutions >>>>>> ./8 9 1.44s user 0.00s system 99% cpu 1.442 total >>>>>> >>>>>> >>>>>> ``` >>>>>> #include <stdio.h> >>>>>> #include <stdlib.h> >>>>>> >>>>>> static inline int validate(int* a, int d) >>>>>> { >>>>>> int i, j, x; >>>>>> for (i = 0; i < d; ++i) >>>>>> { >>>>>> for (j = i+1, x = 1; j < d; ++j, ++x) >>>>>> { >>>>>> const int d = a[i] - a[j]; >>>>>> if (d == 0 || d == -x || d == x) return 0; >>>>>> } >>>>>> } >>>>>> return 1; >>>>>> } >>>>>> >>>>>> static inline int solve(int d) >>>>>> { >>>>>> int r = 0; >>>>>> int* a = (int*) calloc(sizeof(int), d+1); >>>>>> int p = d - 1; >>>>>> >>>>>> for (;;) >>>>>> { >>>>>> a[p]++; >>>>>> >>>>>> if (a[p] > d-1) >>>>>> { >>>>>> int bp = p - 1; >>>>>> while (bp >= 0) >>>>>> { >>>>>> a[bp]++; >>>>>> if (a[bp] <= d-1) break; >>>>>> a[bp] = 0; >>>>>> --bp; >>>>>> } >>>>>> if (bp < 0) >>>>>> break; >>>>>> a[p] = 0; >>>>>> } >>>>>> if (validate(a, d)) >>>>>> { >>>>>> ++r; >>>>>> } >>>>>> } >>>>>> >>>>>> free(a); >>>>>> return r; >>>>>> } >>>>>> >>>>>> int main(int argc, char** argv) >>>>>> { >>>>>> if (argc != 2) return -1; >>>>>> int r = solve((int) strtol(argv[1], NULL, 10)); >>>>>> printf("%d solutions\n", r); >>>>>> } >>>>>> ``` >>>>>> >>>>>> clang 3.5's result: >>>>>> >>>>>> ``` >>>>>> public _main >>>>>> _main proc near >>>>>> >>>>>> var_48 = qword ptr -48h >>>>>> var_40 = qword ptr 
-40h >>>>>> var_34 = dword ptr -34h >>>>>> >>>>>> push rbp >>>>>> push r15 >>>>>> push r14 >>>>>> push r13 >>>>>> push r12 >>>>>> push rbx >>>>>> sub rsp, 18h >>>>>> mov ebx, 0FFFFFFFFh >>>>>> cmp edi, 2 >>>>>> jnz loc_100000F29 >>>>>> mov rdi, [rsi+8] ; char * >>>>>> xor r14d, r14d >>>>>> xor esi, esi ; char ** >>>>>> mov edx, 0Ah ; int >>>>>> call _strtol >>>>>> mov r15, rax >>>>>> shl rax, 20h >>>>>> mov rsi, offset __mh_execute_header >>>>>> add rsi, rax >>>>>> sar rsi, 20h ; size_t >>>>>> mov edi, 4 ; size_t >>>>>> call _calloc >>>>>> lea edx, [r15-1] >>>>>> movsxd r8, edx >>>>>> mov ecx, r15d >>>>>> add ecx, 0FFFFFFFEh >>>>>> js loc_100000DFA >>>>>> test r15d, r15d >>>>>> mov r11d, [rax+r8*4] >>>>>> jle loc_100000EAE >>>>>> mov ecx, r15d >>>>>> add ecx, 0FFFFFFFEh >>>>>> mov [rsp+48h+var_34], ecx >>>>>> movsxd rcx, ecx >>>>>> lea rcx, [rax+rcx*4] >>>>>> mov [rsp+48h+var_40], rcx >>>>>> lea rcx, [rax+4] >>>>>> mov [rsp+48h+var_48], rcx >>>>>> xor r14d, r14d >>>>>> jmp short loc_100000D33 >>>>>> ; --------------------------------------------------------------------------- >>>>>> align 10h >>>>>> >>>>>> loc_100000D30: ; CODE XREF: _main+129 j >>>>>> ; _main+131 j ... 
>>>>>> add r14d, ebx >>>>>> >>>>>> loc_100000D33: ; CODE XREF: _main+92 j >>>>>> cmp r11d, edx >>>>>> lea edi, [r11+1] >>>>>> mov [rax+r8*4], edi >>>>>> mov rcx, [rsp+48h+var_40] >>>>>> mov esi, [rsp+48h+var_34] >>>>>> mov r11d, edi >>>>>> jl short loc_100000D84 >>>>>> nop dword ptr [rax+00h] >>>>>> >>>>>> loc_100000D50: ; CODE XREF: _main+DA j >>>>>> mov edi, [rcx] >>>>>> lea ebp, [rdi+1] >>>>>> mov [rcx], ebp >>>>>> cmp edi, edx >>>>>> jl short loc_100000D71 >>>>>> mov dword ptr [rcx], 0 >>>>>> add rcx, 0FFFFFFFFFFFFFFFCh >>>>>> test esi, esi >>>>>> lea esi, [rsi-1] >>>>>> jg short loc_100000D50 >>>>>> jmp loc_100000F0E >>>>>> ; --------------------------------------------------------------------------- >>>>>> >>>>>> loc_100000D71: ; CODE XREF: _main+C9 j >>>>>> test esi, esi >>>>>> js loc_100000F0E >>>>>> mov dword ptr [rax+r8*4], 0 >>>>>> xor r11d, r11d >>>>>> >>>>>> loc_100000D84: ; CODE XREF: _main+BA j >>>>>> cmp r15d, 1 >>>>>> mov esi, 0 >>>>>> mov r9, [rsp+48h+var_48] >>>>>> mov r12d, 1 >>>>>> jle short loc_100000DF0 >>>>>> >>>>>> loc_100000D99: ; CODE XREF: _main+15E j >>>>>> mov r10d, [rax+rsi*4] >>>>>> mov ecx, 0FFFFFFFFh >>>>>> mov edi, 1 >>>>>> mov r13, r9 >>>>>> nop word ptr [rax+rax+00h] >>>>>> >>>>>> loc_100000DB0: ; CODE XREF: _main+14F j >>>>>> xor ebx, ebx >>>>>> mov ebp, r10d >>>>>> sub ebp, [r13+0] >>>>>> jz loc_100000D30 >>>>>> cmp ecx, ebp >>>>>> jz loc_100000D30 >>>>>> cmp edi, ebp >>>>>> jz loc_100000D30 >>>>>> add r13, 4 >>>>>> inc rdi >>>>>> dec ecx >>>>>> mov ebx, edi >>>>>> add ebx, esi >>>>>> cmp ebx, r15d >>>>>> jl short loc_100000DB0 >>>>>> inc r12 >>>>>> add r9, 4 >>>>>> inc rsi >>>>>> cmp r12d, r15d >>>>>> jl short loc_100000D99 >>>>>> >>>>>> loc_100000DF0: ; CODE XREF: _main+107 j >>>>>> mov ebx, 1 >>>>>> jmp loc_100000D30 >>>>>> ; --------------------------------------------------------------------------- >>>>>> >>>>>> loc_100000DFA: ; CODE XREF: _main+5E j >>>>>> mov ecx, [rax+r8*4] >>>>>> lea r9d, [rcx+1] >>>>>> mov 
[rax+r8*4], r9d >>>>>> cmp ecx, r8d >>>>>> jge loc_100000F0E >>>>>> lea r12, [rax+4] >>>>>> xor r14d, r14d >>>>>> db 2Eh >>>>>> nop word ptr [rax+rax+00000000h] >>>>>> >>>>>> loc_100000E20: ; CODE XREF: _main+216 j >>>>>> test r15d, r15d >>>>>> setle cl >>>>>> cmp r15d, 2 >>>>>> jl short loc_100000E90 >>>>>> test cl, cl >>>>>> mov r13d, 0 >>>>>> mov r11, r12 >>>>>> mov r10d, 1 >>>>>> jnz short loc_100000E90 >>>>>> >>>>>> loc_100000E3F: ; CODE XREF: _main+1F0 j >>>>>> mov edi, [rax+r13*4] >>>>>> mov edx, 0FFFFFFFFh >>>>>> mov ecx, 1 >>>>>> mov rsi, r11 >>>>>> >>>>>> loc_100000E50: ; CODE XREF: _main+1E1 j >>>>>> xor ebx, ebx >>>>>> mov ebp, edi >>>>>> sub ebp, [rsi] >>>>>> jz short loc_100000E95 >>>>>> cmp edx, ebp >>>>>> jz short loc_100000E95 >>>>>> cmp ecx, ebp >>>>>> jz short loc_100000E95 >>>>>> add rsi, 4 >>>>>> inc rcx >>>>>> dec edx >>>>>> mov ebx, ecx >>>>>> add ebx, r13d >>>>>> cmp ebx, r15d >>>>>> jl short loc_100000E50 >>>>>> inc r10 >>>>>> add r11, 4 >>>>>> inc r13 >>>>>> cmp r10d, r15d >>>>>> jl short loc_100000E3F >>>>>> db 66h, 66h, 66h, 66h, 2Eh >>>>>> nop word ptr [rax+rax+00000000h] >>>>>> >>>>>> loc_100000E90: ; CODE XREF: _main+19A j >>>>>> ; _main+1AD j >>>>>> mov ebx, 1 >>>>>> >>>>>> loc_100000E95: ; CODE XREF: _main+1C6 j >>>>>> ; _main+1CA j ... 
>>>>>> add r14d, ebx >>>>>> cmp r9d, r8d >>>>>> lea ecx, [r9+1] >>>>>> mov [rax+r8*4], ecx >>>>>> mov r9d, ecx >>>>>> jl loc_100000E20 >>>>>> jmp short loc_100000F0E >>>>>> ; --------------------------------------------------------------------------- >>>>>> >>>>>> loc_100000EAE: ; CODE XREF: _main+6B j >>>>>> add r15d, 0FFFFFFFEh >>>>>> movsxd rcx, r15d >>>>>> lea rcx, [rax+rcx*4] >>>>>> xor r14d, r14d >>>>>> jmp short loc_100000EC6 >>>>>> ; --------------------------------------------------------------------------- >>>>>> align 20h >>>>>> >>>>>> loc_100000EC0: ; CODE XREF: _main+247 j >>>>>> ; _main+27C j >>>>>> inc r14d >>>>>> mov r11d, ebp >>>>>> >>>>>> loc_100000EC6: ; CODE XREF: _main+22C j >>>>>> lea ebp, [r11+1] >>>>>> mov [rax+r8*4], ebp >>>>>> cmp r11d, r8d >>>>>> mov rsi, rcx >>>>>> mov edi, r15d >>>>>> jl short loc_100000EC0 >>>>>> nop dword ptr [rax+00000000h] >>>>>> >>>>>> loc_100000EE0: ; CODE XREF: _main+26A j >>>>>> mov ebp, [rsi] >>>>>> lea ebx, [rbp+1] >>>>>> mov [rsi], ebx >>>>>> cmp ebp, edx >>>>>> jl short loc_100000EFE >>>>>> mov dword ptr [rsi], 0 >>>>>> add rsi, 0FFFFFFFFFFFFFFFCh >>>>>> test edi, edi >>>>>> lea edi, [rdi-1] >>>>>> jg short loc_100000EE0 >>>>>> jmp short loc_100000F0E >>>>>> ; --------------------------------------------------------------------------- >>>>>> >>>>>> loc_100000EFE: ; CODE XREF: _main+259 j >>>>>> test edi, edi >>>>>> js short loc_100000F0E >>>>>> mov dword ptr [rax+r8*4], 0 >>>>>> xor ebp, ebp >>>>>> jmp short loc_100000EC0 >>>>>> ; --------------------------------------------------------------------------- >>>>>> >>>>>> loc_100000F0E: ; CODE XREF: _main+DC j >>>>>> ; _main+E3 j ... 
>>>>>> mov rdi, rax ; void * >>>>>> call _free >>>>>> lea rdi, aDSolutions ; "%d solutions\n" >>>>>> xor ebx, ebx >>>>>> xor eax, eax >>>>>> mov esi, r14d >>>>>> call _printf >>>>>> >>>>>> loc_100000F29: ; CODE XREF: _main+16 j >>>>>> mov eax, ebx >>>>>> add rsp, 18h >>>>>> pop rbx >>>>>> pop r12 >>>>>> pop r13 >>>>>> pop r14 >>>>>> pop r15 >>>>>> pop rbp >>>>>> retn >>>>>> _main endp >>>>>> ``` >>>>>> >>>>>> clang 3.6's result: >>>>>> >>>>>> ``` >>>>>> public _main >>>>>> _main proc near >>>>>> >>>>>> var_60 = qword ptr -60h >>>>>> var_58 = qword ptr -58h >>>>>> var_50 = qword ptr -50h >>>>>> var_48 = qword ptr -48h >>>>>> var_40 = qword ptr -40h >>>>>> var_38 = qword ptr -38h >>>>>> >>>>>> push rbp >>>>>> push r15 >>>>>> push r14 >>>>>> push r13 >>>>>> push r12 >>>>>> push rbx >>>>>> sub rsp, 38h >>>>>> mov ebx, 0FFFFFFFFh >>>>>> cmp edi, 2 >>>>>> jnz loc_100000F23 >>>>>> mov rbx, offset __mh_execute_header >>>>>> mov rdi, [rsi+8] ; char * >>>>>> xor r13d, r13d >>>>>> xor esi, esi ; char ** >>>>>> mov edx, 0Ah ; int >>>>>> call _strtol >>>>>> mov r14, rax >>>>>> shl rax, 20h >>>>>> mov [rsp+68h+var_38], rax >>>>>> lea rsi, [rax+rbx] >>>>>> sar rsi, 20h ; size_t >>>>>> mov edi, 4 ; size_t >>>>>> call _calloc >>>>>> lea r11d, [r14-1] >>>>>> movsxd r12, r11d >>>>>> mov [rsp+68h+var_40], r12 >>>>>> movsxd rcx, r14d >>>>>> mov [rsp+68h+var_50], rcx >>>>>> add ecx, 0FFFFFFFEh >>>>>> js loc_100000E1A >>>>>> mov ecx, r14d >>>>>> add ecx, 0FFFFFFFEh >>>>>> movsxd rcx, ecx >>>>>> inc rcx >>>>>> mov [rsp+68h+var_58], rcx >>>>>> mov rcx, rax >>>>>> add rcx, 4 >>>>>> mov [rsp+68h+var_60], rcx >>>>>> xor ebp, ebp >>>>>> jmp short loc_100000D17 >>>>>> ; --------------------------------------------------------------------------- >>>>>> align 10h >>>>>> >>>>>> loc_100000D10: ; CODE XREF: _main+15B j >>>>>> ; _main+163 j ... 
>>>>>> mov rbp, [rsp+68h+var_48] >>>>>> add ebp, edi >>>>>> >>>>>> loc_100000D17: ; CODE XREF: _main+93 j >>>>>> cmp r13d, r11d >>>>>> lea edx, [r13+1] >>>>>> mov [rax+r12*4], edx >>>>>> mov rcx, [rsp+68h+var_58] >>>>>> mov r13d, edx >>>>>> jl short loc_100000D6B >>>>>> nop dword ptr [rax+00h] >>>>>> >>>>>> loc_100000D30: ; CODE XREF: _main+DE j >>>>>> mov edx, [rax+rcx*4-4] >>>>>> lea esi, [rdx+1] >>>>>> mov [rax+rcx*4-4], esi >>>>>> cmp edx, r11d >>>>>> jl short loc_100000D60 >>>>>> mov dword ptr [rax+rcx*4-4], 0 >>>>>> dec rcx >>>>>> test rcx, rcx >>>>>> jg short loc_100000D30 >>>>>> jmp loc_100000F09 >>>>>> ; --------------------------------------------------------------------------- >>>>>> align 20h >>>>>> >>>>>> loc_100000D60: ; CODE XREF: _main+CE j >>>>>> mov dword ptr [rax+r12*4], 0 >>>>>> xor r13d, r13d >>>>>> >>>>>> loc_100000D6B: ; CODE XREF: _main+BA j >>>>>> mov [rsp+68h+var_48], rbp >>>>>> test r14d, r14d >>>>>> setle cl >>>>>> mov rdx, offset __mh_execute_header >>>>>> lea rdx, [rdx+1] >>>>>> cmp [rsp+68h+var_38], rdx >>>>>> jl loc_100000E10 >>>>>> test cl, cl >>>>>> mov edx, 0 >>>>>> mov r10, [rsp+68h+var_60] >>>>>> mov r9d, 1 >>>>>> jnz short loc_100000E10 >>>>>> >>>>>> loc_100000DA3: ; CODE XREF: _main+195 j >>>>>> mov esi, [rax+rdx*4] >>>>>> mov r15d, 0FFFFFFFFh >>>>>> mov r8d, 1 >>>>>> mov rcx, r10 >>>>>> db 66h, 66h, 2Eh >>>>>> nop dword ptr [rax+rax+00000000h] >>>>>> >>>>>> loc_100000DC0: ; CODE XREF: _main+184 j >>>>>> mov ebx, [rcx] >>>>>> mov ebp, esi >>>>>> sub ebp, ebx >>>>>> xor edi, edi >>>>>> cmp r8d, ebp >>>>>> jz loc_100000D10 >>>>>> cmp esi, ebx >>>>>> jz loc_100000D10 >>>>>> cmp r15d, ebp >>>>>> jz loc_100000D10 >>>>>> add rcx, 4 >>>>>> inc r8 >>>>>> dec r15d >>>>>> mov edi, r8d >>>>>> add edi, edx >>>>>> cmp edi, r14d >>>>>> jl short loc_100000DC0 >>>>>> inc r9 >>>>>> add r10, 4 >>>>>> inc rdx >>>>>> cmp r9, [rsp+68h+var_50] >>>>>> jl short loc_100000DA3 >>>>>> nop word ptr [rax+rax+00000000h] >>>>>> >>>>>> loc_100000E10: ; CODE 
XREF: _main+119 j >>>>>> ; _main+131 j >>>>>> mov edi, 1 >>>>>> jmp loc_100000D10 >>>>>> ; --------------------------------------------------------------------------- >>>>>> >>>>>> loc_100000E1A: ; CODE XREF: _main+6E j >>>>>> test r14d, r14d >>>>>> jle loc_100000F00 >>>>>> mov dword ptr [rax+r12*4], 1 >>>>>> xor ebp, ebp >>>>>> cmp r14d, 2 >>>>>> jl loc_100000F09 >>>>>> mov rcx, rax >>>>>> add rcx, 4 >>>>>> mov [rsp+68h+var_48], rcx >>>>>> xor ebp, ebp >>>>>> mov r15d, 1 >>>>>> nop dword ptr [rax+rax+00h] >>>>>> >>>>>> loc_100000E50: ; CODE XREF: _main+288 j >>>>>> mov rbx, rbp >>>>>> mov rcx, offset __mh_execute_header >>>>>> cmp [rsp+68h+var_38], rcx >>>>>> mov edx, 0 >>>>>> mov r13, [rsp+68h+var_48] >>>>>> mov r8d, 1 >>>>>> mov r9d, 1 >>>>>> jle short loc_100000EE0 >>>>>> >>>>>> loc_100000E7A: ; CODE XREF: _main+25A j >>>>>> mov r12d, [rax+rdx*4] >>>>>> mov edi, 0FFFFFFFFh >>>>>> mov ecx, 1 >>>>>> mov rsi, r13 >>>>>> nop dword ptr [rax+rax+00h] >>>>>> >>>>>> loc_100000E90: ; CODE XREF: _main+249 j >>>>>> mov r10d, [rsi] >>>>>> mov ebp, r12d >>>>>> sub ebp, r10d >>>>>> xor r9d, r9d >>>>>> cmp ecx, ebp >>>>>> jz short loc_100000EE0 >>>>>> cmp r12d, r10d >>>>>> jz short loc_100000EE0 >>>>>> cmp edi, ebp >>>>>> jz short loc_100000EE0 >>>>>> add rsi, 4 >>>>>> inc rcx >>>>>> dec edi >>>>>> mov ebp, ecx >>>>>> add ebp, edx >>>>>> cmp ebp, r14d >>>>>> jl short loc_100000E90 >>>>>> inc r8 >>>>>> add r13, 4 >>>>>> inc rdx >>>>>> cmp r8, [rsp+68h+var_50] >>>>>> jl short loc_100000E7A >>>>>> mov r9d, 1 >>>>>> db 66h, 66h, 66h, 66h, 2Eh >>>>>> nop word ptr [rax+rax+00000000h] >>>>>> >>>>>> loc_100000EE0: ; CODE XREF: _main+208 j >>>>>> ; _main+22E j ... 
>>>>>> mov rbp, rbx >>>>>> add ebp, r9d >>>>>> cmp r15d, r11d >>>>>> lea ecx, [r15+1] >>>>>> mov rdx, [rsp+68h+var_40] >>>>>> mov [rax+rdx*4], ecx >>>>>> mov r15d, ecx >>>>>> jl loc_100000E50 >>>>>> jmp short loc_100000F09 >>>>>> ; --------------------------------------------------------------------------- >>>>>> >>>>>> loc_100000F00: ; CODE XREF: _main+1AD j >>>>>> xor ebp, ebp >>>>>> test r11d, r11d >>>>>> cmovns ebp, r11d >>>>>> >>>>>> loc_100000F09: ; CODE XREF: _main+E0 j >>>>>> ; _main+1C1 j ... >>>>>> mov rdi, rax ; void * >>>>>> call _free >>>>>> lea rdi, aDSolutions ; "%d solutions\n" >>>>>> xor ebx, ebx >>>>>> xor eax, eax >>>>>> mov esi, ebp >>>>>> call _printf >>>>>> >>>>>> loc_100000F23: ; CODE XREF: _main+16 j >>>>>> mov eax, ebx >>>>>> add rsp, 38h >>>>>> pop rbx >>>>>> pop r12 >>>>>> pop r13 >>>>>> pop r14 >>>>>> pop r15 >>>>>> pop rbp >>>>>> retn >>>>>> _main endp >>>>>> ``` >>>>>> >>>>>> gcc-4.9.2's result: >>>>>> ``` >>>>>> >>>>>> _main proc near >>>>>> >>>>>> var_48 = qword ptr -48h >>>>>> var_40 = dword ptr -40h >>>>>> var_3C = dword ptr -3Ch >>>>>> >>>>>> cmp edi, 2 >>>>>> jz short loc_100000D69 >>>>>> or eax, 0FFFFFFFFh >>>>>> retn >>>>>> ; --------------------------------------------------------------------------- >>>>>> >>>>>> loc_100000D69: ; CODE XREF: _main+3 j >>>>>> push r15 >>>>>> mov edx, 0Ah ; int >>>>>> push r14 >>>>>> push r13 >>>>>> push r12 >>>>>> push rbp >>>>>> push rbx >>>>>> sub rsp, 18h >>>>>> mov rdi, [rsi+8] ; char * >>>>>> xor esi, esi ; char ** >>>>>> call _strtol >>>>>> mov edi, 4 ; size_t >>>>>> lea esi, [rax+1] >>>>>> mov r14, rax >>>>>> mov ebx, eax >>>>>> lea r15d, [r14-2] >>>>>> movsxd rsi, esi ; size_t >>>>>> call _calloc >>>>>> mov [rsp+48h+var_3C], 0 >>>>>> mov rdi, rax ; void * >>>>>> lea eax, [r14-1] >>>>>> cdqe >>>>>> lea r13, [rdi+rax*4] >>>>>> movsxd rax, r15d >>>>>> mov ebp, [r13+0] >>>>>> shl rax, 2 >>>>>> lea r12, [rdi+rax] >>>>>> lea rax, [rdi+rax-4] >>>>>> mov [rsp+48h+var_48], rax >>>>>> mov eax, 
r14d >>>>>> lea r14d, [r14+1] >>>>>> nop word ptr [rax+rax+00h] >>>>>> nop word ptr [rax+rax+00h] >>>>>> >>>>>> loc_100000DE0: ; CODE XREF: _main+12B j >>>>>> ; _main+155 j ... >>>>>> add ebp, 1 >>>>>> cmp ebx, ebp >>>>>> mov [r13+0], ebp >>>>>> jg short loc_100000E62 >>>>>> test r15d, r15d >>>>>> js short loc_100000E33 >>>>>> mov ecx, [r12] >>>>>> lea edx, [rcx+1] >>>>>> cmp ebx, edx >>>>>> mov [r12], edx >>>>>> jg short loc_100000E58 >>>>>> mov r8, r12 >>>>>> mov rcx, [rsp+48h+var_48] >>>>>> mov esi, r15d >>>>>> jmp short loc_100000E24 >>>>>> ; --------------------------------------------------------------------------- >>>>>> align 10h >>>>>> >>>>>> loc_100000E10: ; CODE XREF: _main+D1 j >>>>>> mov edx, [rcx] >>>>>> sub r8, 4 >>>>>> sub rcx, 4 >>>>>> add edx, 1 >>>>>> mov [rcx+4], edx >>>>>> cmp ebx, edx >>>>>> jg short loc_100000E58 >>>>>> >>>>>> loc_100000E24: ; CODE XREF: _main+A9 j >>>>>> sub esi, 1 >>>>>> mov dword ptr [r8], 0 >>>>>> cmp esi, 0FFFFFFFFh >>>>>> jnz short loc_100000E10 >>>>>> >>>>>> loc_100000E33: ; CODE XREF: _main+8E j >>>>>> call _free >>>>>> mov esi, [rsp+48h+var_3C] >>>>>> add rsp, 18h >>>>>> xor eax, eax >>>>>> pop rbx >>>>>> lea rdi, aDSolutions ; "%d solutions\n" >>>>>> pop rbp >>>>>> pop r12 >>>>>> pop r13 >>>>>> pop r14 >>>>>> pop r15 >>>>>> jmp _printf >>>>>> ; --------------------------------------------------------------------------- >>>>>> >>>>>> loc_100000E58: ; CODE XREF: _main+9D j >>>>>> ; _main+C2 j >>>>>> mov dword ptr [r13+0], 0 >>>>>> xor ebp, ebp >>>>>> >>>>>> loc_100000E62: ; CODE XREF: _main+89 j >>>>>> test ebx, ebx >>>>>> jle loc_100000EE6 >>>>>> lea r11, [rdi+8] >>>>>> xor r10d, r10d >>>>>> >>>>>> loc_100000E71: ; CODE XREF: _main+184 j >>>>>> add r10d, 1 >>>>>> cmp r10d, eax >>>>>> jz short loc_100000EE6 >>>>>> mov r8d, [r11-8] >>>>>> mov edx, r8d >>>>>> sub edx, [r11-4] >>>>>> add edx, 1 >>>>>> cmp edx, 2 >>>>>> jbe loc_100000DE0 >>>>>> mov r9d, r14d >>>>>> mov rcx, r11 >>>>>> mov edx, 1 >>>>>> mov 
[rsp+48h+var_40], r10d >>>>>> sub r9d, r10d >>>>>> jmp short loc_100000ED3 >>>>>> ; --------------------------------------------------------------------------- >>>>>> align 10h >>>>>> >>>>>> loc_100000EB0: ; CODE XREF: _main+179 j >>>>>> mov esi, r8d >>>>>> sub esi, [rcx] >>>>>> jz loc_100000DE0 >>>>>> mov r10d, esi >>>>>> add rcx, 4 >>>>>> add r10d, edx >>>>>> jz loc_100000DE0 >>>>>> cmp esi, edx >>>>>> jz loc_100000DE0 >>>>>> >>>>>> loc_100000ED3: ; CODE XREF: _main+144 j >>>>>> add edx, 1 >>>>>> cmp edx, r9d >>>>>> jnz short loc_100000EB0 >>>>>> mov r10d, [rsp+48h+var_40] >>>>>> add r11, 4 >>>>>> jmp short loc_100000E71 >>>>>> ; --------------------------------------------------------------------------- >>>>>> >>>>>> loc_100000EE6: ; CODE XREF: _main+104 j >>>>>> ; _main+118 j >>>>>> add [rsp+48h+var_3C], 1 >>>>>> jmp loc_100000DE0 >>>>>> _main endp >>>>>> ``` >>>>>> >>>>>> MSVC 10.0's result: >>>>>> >>>>>> ``` >>>>>> >>>>>> _main proc near ; CODE XREF: ___tmainCRTStartup+106 p >>>>>> >>>>>> var_80 = dword ptr -80h >>>>>> var_7C = dword ptr -7Ch >>>>>> var_78 = dword ptr -78h >>>>>> var_74 = dword ptr -74h >>>>>> var_70 = dword ptr -70h >>>>>> var_6C = dword ptr -6Ch >>>>>> var_68 = dword ptr -68h >>>>>> var_64 = dword ptr -64h >>>>>> var_60 = dword ptr -60h >>>>>> var_5C = dword ptr -5Ch >>>>>> argc = dword ptr 8 >>>>>> argv = dword ptr 0Ch >>>>>> envp = dword ptr 10h >>>>>> >>>>>> push ebp >>>>>> mov ebp, esp >>>>>> and esp, 0FFFFFF80h >>>>>> push esi >>>>>> push edi >>>>>> push ebx >>>>>> sub esp, 74h >>>>>> push 3 >>>>>> call sub_4080F0 >>>>>> add esp, 4 >>>>>> stmxcsr [esp+80h+var_80] >>>>>> or [esp+80h+var_80], 8000h >>>>>> ldmxcsr [esp+80h+var_80] >>>>>> cmp [ebp+argc], 2 >>>>>> jz short loc_40103A >>>>>> mov eax, 0FFFFFFFFh >>>>>> add esp, 74h >>>>>> pop ebx >>>>>> pop edi >>>>>> pop esi >>>>>> mov esp, ebp >>>>>> pop ebp >>>>>> retn >>>>>> ; --------------------------------------------------------------------------- >>>>>> >>>>>> loc_40103A: ; CODE 
XREF: _main+29 j >>>>>> call ds:GetTickCount >>>>>> mov esi, eax >>>>>> mov eax, [ebp+argv] >>>>>> push dword ptr [eax+4] ; char * >>>>>> call _atoi >>>>>> mov edi, eax >>>>>> lea eax, [edi+1] >>>>>> push eax ; size_t >>>>>> push 4 ; size_t >>>>>> call _calloc >>>>>> add esp, 0Ch >>>>>> mov ecx, [eax+edi*4-4] >>>>>> lea edx, [edi-1] >>>>>> mov [esp+80h+var_6C], ecx >>>>>> xor ebx, ebx >>>>>> mov [esp+80h+var_7C], ebx >>>>>> lea ecx, [eax+edi*4] >>>>>> mov [esp+80h+var_74], ecx >>>>>> lea ecx, [edi-2] >>>>>> mov [esp+80h+var_70], ecx >>>>>> mov [esp+80h+var_60], edx >>>>>> mov [esp+80h+var_80], esi >>>>>> mov ecx, [esp+80h+var_6C] >>>>>> >>>>>> loc_401087: ; CODE XREF: _main+142 j >>>>>> ; _main+193 j >>>>>> mov edx, [esp+80h+var_60] >>>>>> inc ecx >>>>>> mov [eax+edi*4-4], ecx >>>>>> cmp edi, [eax+edx*4] >>>>>> jg short loc_4010DC >>>>>> mov esi, [esp+80h+var_70] >>>>>> test esi, esi >>>>>> js short loc_4010CE >>>>>> xor edx, edx >>>>>> mov [esp+80h+var_78], eax >>>>>> xor ebx, ebx >>>>>> mov eax, [esp+80h+var_74] >>>>>> >>>>>> loc_4010A9: ; CODE XREF: _main+C8 j >>>>>> mov ecx, [eax+ebx*4-8] >>>>>> inc ecx >>>>>> cmp ecx, edi >>>>>> jl loc_40117A >>>>>> inc edx >>>>>> lea esi, [ebx+edi-3] >>>>>> mov dword ptr [eax+ebx*4-8], 0 >>>>>> dec ebx >>>>>> cmp edx, [esp+80h+var_60] >>>>>> jb short loc_4010A9 >>>>>> mov eax, [esp+80h+var_78] >>>>>> >>>>>> loc_4010CE: ; CODE XREF: _main+9B j >>>>>> ; _main+186 j >>>>>> test esi, esi >>>>>> jl short loc_401147 >>>>>> mov dword ptr [eax+edi*4-4], 0 >>>>>> xor ecx, ecx >>>>>> >>>>>> loc_4010DC: ; CODE XREF: _main+93 j >>>>>> test edi, edi >>>>>> jle short loc_40113E >>>>>> mov [esp+80h+var_6C], ecx >>>>>> xor edx, edx >>>>>> mov [esp+80h+var_5C], edi >>>>>> >>>>>> loc_4010EA: ; CODE XREF: _main+132 j >>>>>> lea ecx, [edx+1] >>>>>> mov ebx, ecx >>>>>> mov esi, ebx >>>>>> cmp ecx, [esp+80h+var_5C] >>>>>> jge short loc_401130 >>>>>> mov edx, [eax+edx*4] >>>>>> mov edi, 1 >>>>>> mov [esp+80h+var_64], esi >>>>>> mov 
[esp+80h+var_68], ecx >>>>>> >>>>>> loc_401107: ; CODE XREF: _main+122 j >>>>>> mov esi, [eax+ebx*4] >>>>>> cmp edx, esi >>>>>> jz short loc_40118B >>>>>> sub esi, edx >>>>>> mov ecx, esi >>>>>> neg ecx >>>>>> cmp edi, ecx >>>>>> jz short loc_40118B >>>>>> cmp esi, edi >>>>>> jz short loc_40118B >>>>>> inc ebx >>>>>> inc edi >>>>>> cmp ebx, [esp+80h+var_5C] >>>>>> jl short loc_401107 >>>>>> mov ecx, [esp+80h+var_68] >>>>>> mov esi, [esp+80h+var_64] >>>>>> cmp ecx, [esp+80h+var_5C] >>>>>> >>>>>> loc_401130: ; CODE XREF: _main+F5 j >>>>>> mov edx, esi >>>>>> jl short loc_4010EA >>>>>> xchg ax, ax >>>>>> mov ecx, [esp+80h+var_6C] >>>>>> mov edi, [esp+80h+var_5C] >>>>>> >>>>>> loc_40113E: ; CODE XREF: _main+DE j >>>>>> inc [esp+80h+var_7C] >>>>>> jmp loc_401087 >>>>>> ; --------------------------------------------------------------------------- >>>>>> >>>>>> loc_401147: ; CODE XREF: _main+D0 j >>>>>> mov ebx, [esp+80h+var_7C] >>>>>> mov esi, [esp+80h+var_80] >>>>>> push eax ; void * >>>>>> call _free >>>>>> add esp, 4 >>>>>> call ds:GetTickCount >>>>>> sub eax, esi >>>>>> push eax >>>>>> push ebx >>>>>> push offset aDSolutionsInDM ; "%d solutions in %d msecs.\n" >>>>>> call _printf >>>>>> xor eax, eax >>>>>> add esp, 80h >>>>>> pop ebx >>>>>> pop edi >>>>>> pop esi >>>>>> mov esp, ebp >>>>>> pop ebp >>>>>> retn >>>>>> ; --------------------------------------------------------------------------- >>>>>> >>>>>> loc_40117A: ; CODE XREF: _main+B0 j >>>>>> mov edx, [esp+80h+var_74] >>>>>> mov eax, [esp+80h+var_78] >>>>>> mov [edx+ebx*4-8], ecx >>>>>> jmp loc_4010CE >>>>>> ; --------------------------------------------------------------------------- >>>>>> >>>>>> loc_40118B: ; CODE XREF: _main+10C j >>>>>> ; _main+116 j ... 
>>>>>> mov ecx, [esp+80h+var_6C] >>>>>> mov edi, [esp+80h+var_5C] >>>>>> jmp loc_401087 >>>>>> _main endp >>>>>> ```
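For readers comparing the four listings: each one appears to implement the same two pieces, an odometer-style increment over the placement array and a pairwise test with three comparisons per pair (same column, and the two diagonals). The following is a minimal C sketch of that structure, reconstructed from the disassembly and the "8 queens puzzle" description. It is not the original 8.c, and the names has_conflict, next_placement and count_solutions are hypothetical.

```c
#include <stdlib.h>

/* Hypothetical helper: returns 1 if any two queens attack each other.
   a[i] is the column of the queen in row i. The three comparisons mirror
   the three compare-and-branch pairs seen in each compiler's inner loop. */
static int has_conflict(const int *a, int n)
{
    for (int i = 0; i + 1 < n; i++) {
        for (int j = i + 1; j < n; j++) {
            int d = a[i] - a[j];   /* column difference */
            int dist = j - i;      /* row distance */
            if (d == 0 || d == dist || d == -dist)
                return 1;
        }
    }
    return 0;
}

/* Hypothetical helper: advance a[] like an odometer in base n.
   Returns 0 once every digit has wrapped back to zero. */
static int next_placement(int *a, int n)
{
    for (int k = n - 1; k >= 0; k--) {
        if (++a[k] < n)
            return 1;
        a[k] = 0;                  /* carry into the next digit */
    }
    return 0;
}

/* Brute-force count over all n^n placements (assumed enumeration order;
   the original program may differ). Returns -1 if allocation fails. */
int count_solutions(int n)
{
    if (n <= 0)
        return 0;
    int *a = calloc((size_t)n, sizeof *a);
    if (!a)
        return -1;
    int count = 0;
    do {
        if (!has_conflict(a, n))
            count++;
    } while (next_placement(a, n));
    free(a);
    return count;
}
```

With this sketch, count_solutions(9) and count_solutions(10) should match the 352 and 724 solutions reported earlier in the thread, though the run time grows as n^n, so small inputs are better for a quick check.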