Don Quixote de la Mancha
2011-Nov-12 00:14 UTC
[LLVMdev] Thumb-2 code generation error in Apple LLVM at all optimization levels
This would be best reported to Apple's Radar bug database at http://bugreport.apple.com/ but its whole website has been down for a while.

I have a 100% reproducible Thumb-2 code generation error that occurs at all of the optimization levels available in the Xcode 4.2 for Snow Leopard build settings GUI: -O0, -O1, -O2, -O3 and -Os. However, the bad machine code only occurs in Release builds, never in Debug builds! I tried the Debug builds at all levels of optimization as well.

$ xcodebuild -version
Xcode 4.2
Build version 4C199

$ /Developer/Platforms/iPhoneOS.platform/Developer/usr/bin/clang --version
Apple clang version 3.0 (tags/Apple/clang-211.9) (based on LLVM 3.0svn)
Target: i386-apple-darwin10.8.0
Thread model: posix

I'm not really clear where to find the part of the toolchain that emits the Thumb-2 assembly, so I can't tell you that tool's precise version.

$ uname -a
Darwin frylock.local 10.8.0 Darwin Kernel Version 10.8.0: Tue Jun 7 16:33:36 PDT 2011; root:xnu-1504.15.3~1/RELEASE_I386 i386

Xcode's iPhone and iPad Simulators run iOS Apps that, on my 32-bit MacBook Pro, are built as i386 code. The iOS frameworks (shared libraries, sort of) that simulated Apps link to are actually shims that interface to Mac OS X's frameworks.

The i386 code for my simulated App is generated correctly at -Os for both Release and Debug builds. That suggests that the problem is in the Thumb-2 code generation back-end, and not in the LLVM IR.

I've seen lots of reports that the Thumb code that the Apple LLVM compiler generates for ARMv6 is quite buggy, so that one must disable Thumb code generation for ARMv6 targets. However, my first-generation iPad has a Cortex A8 CPU, which is ARMv7, as does my iPhone 4.

It's quite possible that disabling Thumb code generation for at least this one source file will correct the bad machine code, but Google has not blessed me with the insight as to how to do that. It's not done the same way for LLVM as for GCC.
Have any of you this insight to spare?

It's going to take me a little while to cook up a minimal test case as I was up all night <strike>trolling the Internet</strike> working on my iOS App, so I'm pretty beat. But when I have more details for you, I will post a more detailed report as well as a minimal test case that builds as a complete iOS App at what is now just a placeholder page:

Apple Xcode 4.2 LLVM Compiler Bug Reports
http://www.dulcineatech.com/bug-reports/xcode/4.2/llvm/

My App Warp Life is so named because it goes very, very fast, with many more optimizations coming soon. The UI has a speed control slider whose value is scaled, then passed to the usleep() iOS system call. usleep() suspends the process for the given number of microseconds.

I realized just recently that calling usleep() with delays that are themselves insignificant might actually slow my App down quite a bit, because there is all manner of overhead in making and returning from even the most trivial system calls.

After measuring my game's frame rate at the best optimizations I could find, for various kinds of test data, I set a threshold of 1/250th of a second. I never call usleep() if the configured delay setting is less than that.

The full source of the entire method, and the Release and Debug build assembly code, are at the end of this mail. For clarity I show only the pertinent lines of code right here:

    useconds_t usecs = (useconds_t)( self.delay * (float)500000 );

    if ( usecs >= 4000 ){ // ~ 1/250 sec
        usleep( usecs ); // usecs is ZERO!!!!
    }

self.delay is an Objective-C 2.0 property that holds the current value of the speed slider. When set to maximum speed, usecs will always be zero. Even so, the branch is ALWAYS taken, despite the source code ensuring that the branch is only taken when usecs is greater than or equal to four thousand.

Here is the Thumb-2 assembly for the Release build. I think the (float)500000 delay scaling factor is meant to be held in floating point register d8.
I thought at first it might not be initialized at all, but upon closer examination I think it may actually be initialized from a program counter-relative 32-bit .long constant immediately following my method's code.

    .loc 1 388 3
    ldr r0, [r5]
    ldr r1, [r4, r0]
    adds r1, #1
    str r1, [r4, r0]
    .loc 1 390 64
    mov r0, r4
    ldr r1, [r6]
    blx _objc_msgSend
    vmov s0, r0
    vmul.f32 d0, d0, d8
    vcvt.u32.f32 d0, d0
    vmov r0, s0
Ltmp272:
    .loc 1 392 9
    cmp.w r0, #4000
Ltmp273:
    .loc 1 393 13
    it hs
    blxhs _usleep

cmp.w *looks* like a 16-bit comparison with an immediate constant, but in reality the constant is twelve bits. The ARM and Thumb instruction sets have quite severe restrictions on the allowed ranges of immediate values because the richness of the ARM and Thumb instruction set makes it hard to find enough bits in the instruction words to express a wider range of immediate values than is presently possible.

I don't know what the "it hs" instruction does. I suspect that's where the problem lies, but "it" is a very common word, and "hs" is quite common as well, as it is a frequent misspelling of "has". Perhaps someone who knows Thumb-2 assembly better than I do could comment.

The assembly for my Debug build is quite unlike that for the Release build, for every single one of the available optimization levels. There are quite a few instructions separating the load of the #4000 immediate into r0 and the call to usleep().

I have not yet ensured that there aren't build configuration differences between my Debug and Release builds, but I don't recall setting any. My guess is that the totally different machine code in Debug is there to make source code debugging work better.
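The guess about d8 can be checked away from the device. The Release listing loads s16, the low half of d8, from a literal written as `.long 1223959552`. The following is a minimal C sketch, not code from the App, that reinterprets that bit pattern as an IEEE 754 single-precision value; the helper names `float_from_bits` and `scaled_delay_usecs` are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Reinterpret a 32-bit pattern as an IEEE 754 single-precision float.
 * memcpy is used so the conversion does not rely on pointer casts. */
static float float_from_bits(uint32_t bits)
{
    float f;
    assert(sizeof f == sizeof bits);
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Hypothetical helper: scale a slider value the way the posted source
 * does, but using the literal taken from the assembly listing. */
static uint32_t scaled_delay_usecs(float delay)
{
    return (uint32_t)(delay * float_from_bits(1223959552u));
}
```

The pattern 1223959552 is 0x48F42400, which decodes to exactly 500000.0, so the literal does hold the intended scaling factor, and `scaled_delay_usecs(0.0f)` yields 0 as the post expects at maximum speed. This only shows that the constant itself is right; it says nothing about how the surrounding instructions use it.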
Here is my method's full Objective-C source:

- (void) cycleContinuously
{
    startDate = [[NSDate alloc] init];
    generation = 0;

    while ( mRunning ){
        [self cycle];

        ++generation;

        useconds_t usecs = (useconds_t)( self.delay * (float)500000 );

        if ( usecs >= 4000 ){ // ~ 1/250 sec
            usleep( usecs );
        }
    }

    NSDate *endDate = [[NSDate alloc] init];

    NSTimeInterval elapsed = [endDate timeIntervalSinceDate: startDate];

    [startDate release];
    [endDate release];

    printf( "Speed: %f gen/sec\n", ( (float)generation ) / elapsed );

    return;
}

The assembly for the problem area of my code is completely identical for each available optimization setting for Release builds. I haven't made such detailed comparisons for the Debug builds yet.

Here is the Release assembly at -Os:

    .align 2
    .code 16
    .thumb_func "-[LifeGrid cycleContinuously]"
"-[LifeGrid cycleContinuously]":
Ltmp265:
Lfunc_begin24:
    .loc 1 380 0
    .loc 1 380 1 prologue_end
    push {r4, r5, r6, r7, lr}
    add r7, sp, #12
    push.w {r8, r10, r11}
    vpush {d8}
    sub sp, #4
    .loc 1 382 2
Ltmp266:
    movw r1, :lower16:(L_OBJC_SELECTOR_REFERENCES_7-(LPC24_0+4))
Ltmp267:
    mov r4, r0
Ltmp268:
    movt r1, :upper16:(L_OBJC_SELECTOR_REFERENCES_7-(LPC24_0+4))
    movw r0, :lower16:(L_OBJC_CLASSLIST_REFERENCES_$_62-(LPC24_1+4))
    movt r0, :upper16:(L_OBJC_CLASSLIST_REFERENCES_$_62-(LPC24_1+4))
LPC24_0:
    add r1, pc
LPC24_1:
    add r0, pc
    ldr r1, [r1]
    ldr r0, [r0]
    blx _objc_msgSend
    movw r1, :lower16:(L_OBJC_SELECTOR_REFERENCES_-(LPC24_2+4))
    movt r1, :upper16:(L_OBJC_SELECTOR_REFERENCES_-(LPC24_2+4))
LPC24_2:
    add r1, pc
    ldr r1, [r1]
    blx _objc_msgSend
    movw r11, :lower16:(_OBJC_IVAR_$_LifeGrid.startDate-(LPC24_3+4))
    movt r11, :upper16:(_OBJC_IVAR_$_LifeGrid.startDate-(LPC24_3+4))
LPC24_3:
    add r11, pc
    ldr.w r1, [r11]
    .loc 1 383 2
    movw r5, :lower16:(_OBJC_IVAR_$_LifeGrid.generation-(LPC24_4+4))
    movt r5, :upper16:(_OBJC_IVAR_$_LifeGrid.generation-(LPC24_4+4))
LPC24_4:
    add r5, pc
    .loc 1 382 2
    str r0, [r4, r1]
    movs r1, #0
    .loc 1 383 2
    ldr r0, [r5]
    .loc 1 385 2
    movw r8, :lower16:(_OBJC_IVAR_$_LifeGrid.mRunning-(LPC24_5+4))
    movt r8, :upper16:(_OBJC_IVAR_$_LifeGrid.mRunning-(LPC24_5+4))
LPC24_5:
    add r8, pc
    .loc 1 383 2
    str r1, [r4, r0]
    .loc 1 385 2
    ldr.w r0, [r8]
    ldrb r0, [r4, r0]
    cbz r0, LBB24_3
Ltmp269:
    .loc 1 386 3
    movw r10, :lower16:(L_OBJC_SELECTOR_REFERENCES_59-(LPC24_6+4))
    vldr.32 s16, LCPI24_0
    movt r10, :upper16:(L_OBJC_SELECTOR_REFERENCES_59-(LPC24_6+4))
    .loc 1 390 64
    movw r6, :lower16:(L_OBJC_SELECTOR_REFERENCES_64-(LPC24_7+4))
    movt r6, :upper16:(L_OBJC_SELECTOR_REFERENCES_64-(LPC24_7+4))
    .loc 1 386 3
LPC24_6:
    add r10, pc
    .loc 1 390 64
LPC24_7:
    add r6, pc
LBB24_2:
Ltmp270:
    .loc 1 386 3
    ldr.w r1, [r10]
Ltmp271:
    mov r0, r4
    blx _objc_msgSend
    .loc 1 388 3
    ldr r0, [r5]
    ldr r1, [r4, r0]
    adds r1, #1
    str r1, [r4, r0]
    .loc 1 390 64
    mov r0, r4
    ldr r1, [r6]
    blx _objc_msgSend
    vmov s0, r0
    vmul.f32 d0, d0, d8
    vcvt.u32.f32 d0, d0
    vmov r0, s0
Ltmp272:
    .loc 1 392 9
    cmp.w r0, #4000
Ltmp273:
    .loc 1 393 13
    it hs
    blxhs _usleep
Ltmp274:
    .loc 1 385 2
    ldr.w r0, [r8]
    ldrb r0, [r4, r0]
    cmp r0, #0
    bne LBB24_2
LBB24_3:
Ltmp275:
    .loc 1 382 2
    movw r0, :lower16:(L_OBJC_SELECTOR_REFERENCES_7-(LPC24_8+4))
    movt r0, :upper16:(L_OBJC_SELECTOR_REFERENCES_7-(LPC24_8+4))
LPC24_8:
    add r0, pc
    .loc 1 397 41
    ldr r1, [r0]
Ltmp276:
    .loc 1 382 2
    movw r0, :lower16:(L_OBJC_CLASSLIST_REFERENCES_$_62-(LPC24_9+4))
    movt r0, :upper16:(L_OBJC_CLASSLIST_REFERENCES_$_62-(LPC24_9+4))
LPC24_9:
    add r0, pc
    .loc 1 397 41
    ldr r0, [r0]
    blx _objc_msgSend
    .loc 1 382 2
    movw r1, :lower16:(L_OBJC_SELECTOR_REFERENCES_-(LPC24_10+4))
    movt r1, :upper16:(L_OBJC_SELECTOR_REFERENCES_-(LPC24_10+4))
LPC24_10:
    add r1, pc
    .loc 1 397 41
    ldr r1, [r1]
    blx _objc_msgSend
    .loc 1 399 69
    movw r1, :lower16:(L_OBJC_SELECTOR_REFERENCES_66-(LPC24_11+4))
    movt r1, :upper16:(L_OBJC_SELECTOR_REFERENCES_66-(LPC24_11+4))
    .loc 1 397 41
    mov r6, r0
    .loc 1 399 69
    ldr.w r0, [r11]
LPC24_11:
    add r1, pc
    ldr r1, [r1]
    ldr r2, [r4, r0]
    mov r0, r6
    blx _objc_msgSend
    str r0, [sp]
    .loc 1 401 2
    movw r8, :lower16:(L_OBJC_SELECTOR_REFERENCES_68-(LPC24_12+4))
    movt r8, :upper16:(L_OBJC_SELECTOR_REFERENCES_68-(LPC24_12+4))
    ldr.w r0, [r11]
LPC24_12:
    add r8, pc
    .loc 1 399 69
    mov r10, r1
    .loc 1 401 2
    ldr.w r1, [r8]
    ldr r0, [r4, r0]
    blx _objc_msgSend
    .loc 1 402 2
    ldr.w r1, [r8]
    mov r0, r6
    blx _objc_msgSend
    .loc 1 404 2
    ldr r0, [r5]
    add r0, r4
    vldr.32 s0, [r0]
    vcvt.f32.s32 d0, d0
    .loc 1 399 69
    ldr r0, [sp]
    vmov d17, r0, r10
Ltmp277:
    .loc 1 404 2
    movw r0, :lower16:(L_.str69-(LPC24_13+4))
    movt r0, :upper16:(L_.str69-(LPC24_13+4))
    vcvt.f64.f32 d16, s0
LPC24_13:
    add r0, pc
    vdiv.f64 d16, d16, d17
    vmov r1, r2, d16
    blx _printf
Ltmp278:
    .loc 1 407 1
    add sp, #4
    vpop {d8}
    pop.w {r8, r10, r11}
    pop {r4, r5, r6, r7, pc}
Ltmp279:
    .align 2
LCPI24_0:
    .long 1223959552
Ltmp280:
Lfunc_end24:
Ltmp281:
Leh_func_end24:

Here is the Debug assembly at -Os:

    .align 2
    .code 16
    .thumb_func "-[LifeGrid cycleContinuously]"
"-[LifeGrid cycleContinuously]":
Ltmp112:
Lfunc_begin24:
    .loc 1 380 0
    push {r4, r7, lr}
    add r7, sp, #4
    sub sp, #44
    mov r4, sp
    bic r4, r4, #7
    mov sp, r4
    movs r2, #0
    movt r2, #0
    str r0, [sp, #40]
    str r1, [sp, #36]
    .loc 1 382 2 prologue_end
Ltmp113:
    ldr.n r0, LCPI24_4
LPC24_4:
    add r0, pc
    ldr r0, [r0]
    ldr.n r1, LCPI24_3
LPC24_3:
    add r1, pc
    ldr r1, [r1]
    str r2, [sp, #12]
    blx _objc_msgSend
    ldr.n r1, LCPI24_2
LPC24_2:
    add r1, pc
    ldr r1, [r1]
    blx _objc_msgSend
    ldr r1, [sp, #40]
    ldr.n r2, LCPI24_1
LPC24_1:
    add r2, pc
    ldr r2, [r2]
    add r1, r2
    str r0, [r1]
    .loc 1 383 2
    ldr r0, [sp, #40]
    ldr.n r1, LCPI24_0
LPC24_0:
    add r1, pc
    ldr r1, [r1]
    add r0, r1
    ldr r1, [sp, #12]
    str r1, [r0]
LBB24_1:
    .loc 1 385 2
    ldr r0, [sp, #40]
    movw r1, :lower16:(_OBJC_IVAR_$_LifeGrid.mRunning-(LPC24_14+4))
    movt r1, :upper16:(_OBJC_IVAR_$_LifeGrid.mRunning-(LPC24_14+4))
LPC24_14:
    add r1, pc
    ldr r1, [r1]
    ldrb r0, [r0, r1]
    movs r1, #0
    cmp r0, #0
    it ne
    movne r1, #1
    tst.w r1, #1
    beq LBB24_5
    movw r0, #4000
    movt r0, #0
    .loc 1 386 3
Ltmp114:
    ldr r1, [sp, #40]
    movw r2, :lower16:(L_OBJC_SELECTOR_REFERENCES_59-(LPC24_15+4))
    movt r2, :upper16:(L_OBJC_SELECTOR_REFERENCES_59-(LPC24_15+4))
LPC24_15:
    add r2, pc
    ldr r2, [r2]
    str r0, [sp, #8]
    mov r0, r1
    mov r1, r2
    blx _objc_msgSend
    .loc 1 388 3
    ldr r0, [sp, #40]
    movw r1, :lower16:(_OBJC_IVAR_$_LifeGrid.generation-(LPC24_16+4))
    movt r1, :upper16:(_OBJC_IVAR_$_LifeGrid.generation-(LPC24_16+4))
LPC24_16:
    add r1, pc
    ldr r1, [r1]
    mov r2, r1
    ldr r2, [r0, r2]
    adds r2, #1
    str r2, [r0, r1]
    .loc 1 390 64
    ldr r0, [sp, #40]
Ltmp115:
    movw r1, :lower16:(L_OBJC_SELECTOR_REFERENCES_64-(LPC24_17+4))
    movt r1, :upper16:(L_OBJC_SELECTOR_REFERENCES_64-(LPC24_17+4))
LPC24_17:
    add r1, pc
    ldr r1, [r1]
    blx _objc_msgSend
    vmov s0, r0
    vmov.f64 d1, d16
    vldr.32 s1, LCPI24_14
    vmov.f64 d2, d1
    vmov.f32 s4, s1
    vmov.f64 d3, d1
    vmov.f32 s6, s0
    vmul.f32 d16, d3, d2
    vmov.f64 d2, d16
    vmov.f32 s0, s4
    vmov.f32 s2, s0
    vcvt.u32.f32 d16, d1
    vmov.f64 d1, d16
    vmov.f32 s0, s2
    vmov r0, s0
    str r0, [sp, #32]
    .loc 1 392 9
    ldr r0, [sp, #32]
    ldr r1, [sp, #8]
    cmp r0, r1
    blo LBB24_4
    .loc 1 393 13
Ltmp116:
    ldr r0, [sp, #32]
    bl _usleep
    str r0, [sp, #4]
Ltmp117:
LBB24_4:
    .loc 1 395 2
    b LBB24_1
Ltmp118:
LBB24_5:
    .loc 1 397 41
    ldr.n r0, LCPI24_13
LPC24_13:
    add r0, pc
    ldr r0, [r0]
    ldr.n r1, LCPI24_12
LPC24_12:
    add r1, pc
    ldr r1, [r1]
    blx _objc_msgSend
    ldr.n r1, LCPI24_11
LPC24_11:
    add r1, pc
    ldr r1, [r1]
    blx _objc_msgSend
    str r0, [sp, #28]
    .loc 1 399 69
    ldr r0, [sp, #28]
    ldr r1, [sp, #40]
    ldr.n r2, LCPI24_10
LPC24_10:
    add r2, pc
    ldr r2, [r2]
    add r1, r2
    ldr r2, [r1]
    ldr.n r1, LCPI24_9
LPC24_9:
    add r1, pc
    ldr r1, [r1]
    blx _objc_msgSend
    vmov d16, r0, r1
    vstr.64 d16, [sp, #16]
    .loc 1 401 2
    ldr r0, [sp, #40]
    ldr.n r1, LCPI24_8
LPC24_8:
    add r1, pc
    ldr r1, [r1]
    add r0, r1
    ldr r0, [r0]
    ldr.n r1, LCPI24_7
LPC24_7:
    add r1, pc
    ldr r1, [r1]
    blx _objc_msgSend
    .loc 1 402 2
    ldr r0, [sp, #28]
    ldr.n r1, LCPI24_6
LPC24_6:
    add r1, pc
    ldr r1, [r1]
    blx _objc_msgSend
    .loc 1 404 2
    ldr r0, [sp, #40]
    ldr.n r1, LCPI24_5
LPC24_5:
    add r1, pc
    ldr r1, [r1]
    add r0, r1
    ldr r0, [r0]
    vmov s0, r0
    vcvt.f32.s32 s0, s0
    vcvt.f64.f32 d16, s0
    vldr.64 d17, [sp, #16]
    vdiv.f64 d16, d16, d17
    vmov r1, r2, d16
    movw r0, :lower16:(L_.str69-(LPC24_18+4))
    movt r0, :upper16:(L_.str69-(LPC24_18+4))
LPC24_18:
    add r0, pc
    blx _printf
    .loc 1 407 1
    str r0, [sp]
    subs r4, r7, #4
    mov sp, r4
    pop {r4, r7, pc}
    .align 2
LCPI24_0:
    .long _OBJC_IVAR_$_LifeGrid.generation-(LPC24_0+4)
    .align 2
LCPI24_1:
    .long _OBJC_IVAR_$_LifeGrid.startDate-(LPC24_1+4)
    .align 2
LCPI24_2:
    .long L_OBJC_SELECTOR_REFERENCES_-(LPC24_2+4)
    .align 2
LCPI24_3:
    .long L_OBJC_SELECTOR_REFERENCES_7-(LPC24_3+4)
    .align 2
LCPI24_4:
    .long L_OBJC_CLASSLIST_REFERENCES_$_62-(LPC24_4+4)
    .align 2
LCPI24_5:
    .long _OBJC_IVAR_$_LifeGrid.generation-(LPC24_5+4)
    .align 2
LCPI24_6:
    .long L_OBJC_SELECTOR_REFERENCES_68-(LPC24_6+4)
    .align 2
LCPI24_7:
    .long L_OBJC_SELECTOR_REFERENCES_68-(LPC24_7+4)
    .align 2
LCPI24_8:
    .long _OBJC_IVAR_$_LifeGrid.startDate-(LPC24_8+4)
    .align 2
LCPI24_9:
    .long L_OBJC_SELECTOR_REFERENCES_66-(LPC24_9+4)
    .align 2
LCPI24_10:
    .long _OBJC_IVAR_$_LifeGrid.startDate-(LPC24_10+4)
    .align 2
LCPI24_11:
    .long L_OBJC_SELECTOR_REFERENCES_-(LPC24_11+4)
    .align 2
LCPI24_12:
    .long L_OBJC_SELECTOR_REFERENCES_7-(LPC24_12+4)
    .align 2
LCPI24_13:
    .long L_OBJC_CLASSLIST_REFERENCES_$_62-(LPC24_13+4)
    .align 2
LCPI24_14:
    .long 1223959552
Ltmp119:
Lfunc_end24:
Ltmp120:
Leh_func_end24:

Man I gotta catch some ZZZs, I'm totally thrashed. I'll do my best just to take a little nap, but the chances are pretty good I won't get outta bed until Monday!

--
Don Quixote de la Mancha
Dulcinea Technologies Corporation
Software of Elegance and Beauty
http://www.dulcineatech.com
quixote at dulcineatech.com
Owen Anderson
2011-Nov-12 01:26 UTC
[LLVMdev] Thumb-2 code generation error in Apple LLVM at all optimization levels
On Nov 11, 2011, at 4:14 PM, Don Quixote de la Mancha wrote:

> cmp.w *looks* like a 16-bit comparison with an immediate constant, but
> in reality the constant is twelve bits. The ARM and Thumb instruction
> sets have quite severe restrictions on the allowed ranges of immediate
> values because the richness of the ARM and Thumb instruction set makes
> it hard to find enough bits in the instruction words to express a
> wider range of immediate values than is presently possible.

This is not quite right. It does have a 12-bit immediate field, but it is decomposed into an 8-bit base immediate and a 4-bit right-rotate value. Your example of #4000 is encoded as a base value of 0xfa and a rotate of 0xe, which is correct.

> I don't know what the "it hs" instruction does. I suspect that's
> where the problem lies, but "it" is a very common word, and "hs" is
> quite common as well, as it is a frequent misspelling of "has".
> Perhaps someone who knows Thumb-2 assembly better than I do could
> comment.

The IT instruction is how you express predication in Thumb2. Unlike ARM instructions, where the predicate is part of the instruction, Thumb2 instructions use IT to set the predicates for following instructions. In this case, it applies the "hs" predicate to the subsequent call to _usleep. I'd have to double check, but I'm fairly confident that the hs condition code is equivalent to >= for integers.

--Owen
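The base-plus-rotate decomposition described in this reply can be sketched in a few lines of C. This is only an illustration under stated assumptions: it models the classic ARM-style scheme of an 8-bit base rotated right by twice a 4-bit rotate value, which is what the figures 0xfa and 0xe correspond to. Thumb-2 packs its 12 bits somewhat differently in detail, so this is not the exact bit layout of cmp.w. The function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Rotate a 32-bit value left by n bits (n is reduced modulo 32). */
static uint32_t rotate_left32(uint32_t v, unsigned n)
{
    n &= 31u;
    return n ? (v << n) | (v >> (32u - n)) : v;
}

/* Try to express value as (base ROR (2 * rotate)) with an 8-bit base
 * and a 4-bit rotate. Rotating the value left by 2 * rotate undoes the
 * right rotation, so a result that fits in 8 bits is a valid base.
 * Returns false when no such pair exists. */
static bool encode_rotated_imm(uint32_t value, uint32_t *base, uint32_t *rotate)
{
    assert(base != NULL && rotate != NULL);
    for (uint32_t r = 0; r < 16u; ++r) {
        uint32_t candidate = rotate_left32(value, 2u * r);
        if (candidate <= 0xFFu) {
            *base = candidate;
            *rotate = r;
            return true;
        }
    }
    return false;
}
```

Under this model, 4000 (0xFA0) comes out as base 0xFA with rotate 0xE, matching the numbers quoted above, while a value such as 0x101 has no encoding because its set bits span more than eight positions.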
Don Quixote de la Mancha
2011-Nov-12 02:12 UTC
[LLVMdev] Thumb-2 code generation error in Apple LLVM at all optimization levels
On Fri, Nov 11, 2011 at 5:26 PM, Owen Anderson <resistor at mac.com> wrote:

> This is not quite right. It does have a 12-bit immediate field, but it is decomposed into an 8-bit base immediate and a 4-bit right-rotate value. Your example of #4000 is encoded as a base value of 0xfa and a rotate of 0xe, which is correct.
>
>> I don't know what the "it hs" instruction does. I suspect that's
>> where the problem lies, but "it" is a very common word, and "hs" is
>> quite common as well, as it is a frequent misspelling of "has".
>> Perhaps someone who knows Thumb-2 assembly better than I do could
>> comment.
>
> The IT instruction is how you express predication in Thumb2. Unlike ARM instructions, where the predicate is part of the instruction, Thumb2 instructions use IT to set the predicates for following instructions. In this case, it applies the "hs" predicate to the subsequent call to _usleep. I'd have to double check, but I'm fairly confident that the hs condition code is equivalent to >= for integers.

All of my regression testing so far has had my speed slider set to its maximum, so the useconds_t has always been precisely zero. Maybe there's something special about zero that would not be the case for an integer ranging from 1 to 3999.

I'll check that out, but not right now. I'm about to pass right out, but I don't want to, because I am even more hungry than I am tired. There is a pizza joint within walking distance of my apartment. I'm going to go stuff myself silly.

--
Don Quixote de la Mancha
Dulcinea Technologies Corporation
Software of Elegance and Beauty
http://www.dulcineatech.com
quixote at dulcineatech.com
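Whether zero is special for the comparison itself can be reasoned about without the device. The HS ("higher or same") condition is true when the carry flag is set, and CMP sets the carry flag from the subtraction of its second operand from its first, computed as a + ~b + 1. The C sketch below models just that flag; it is an illustration of the architectural rule, not a reproduction of the compiler's output, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Model the carry flag after "CMP a, b": the subtraction is performed
 * as a + ~b + 1, and carry is the carry-out of bit 31. For unsigned
 * operands this is the same as asking whether a >= b. */
static bool carry_after_cmp(uint32_t a, uint32_t b)
{
    uint64_t wide = (uint64_t)a + (uint64_t)(uint32_t)~b + 1u;
    bool carry = (wide >> 32) != 0;
    assert(carry == (a >= b));
    return carry;
}

/* Model "cmp.w r0, #4000; it hs; blxhs _usleep": the call is meant to
 * execute only when the HS condition (carry set) holds. */
static bool would_call_usleep(uint32_t usecs)
{
    return carry_after_cmp(usecs, 4000u);
}
```

In this model `would_call_usleep(0)` and `would_call_usleep(3999)` are both false, and the call only happens from 4000 upward. So if usleep() really is reached with usecs equal to zero, the comparison and predicate as written do not explain it, and the value in r0 at the cmp.w is worth inspecting in a debugger.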