Hi Tim,

Thanks a lot for the reply.

I tested libc.so, which is a shared library. llvm-objdump also reports
some disassembly errors.

Could you please tell me more about the $a, $t and $d symbols? How are
these symbols used to define different regions? Where can I find these
symbols in an ELF object file?

Thanks,
David

I'm now trying to find a decoder of ARM instructions in order

On Thu, Jun 7, 2012 at 3:57 AM, Tim Northover <t.p.northover at gmail.com> wrote:

> Hi David,
>
> > I've tried to use llvm-objdump to disassemble some ARM binaries, such
> > as busybox in Android.
> >
> > ./llvm-objdump -arch=arm -d busybox
>
> It's probably assuming the wrong architecture revision. I don't have
> an Android busybox handy, but I see similar on binaries compiled for
> ARMv7. The trick is to use:
>
>     llvm-objdump -triple=armv7 -d whatever
>
> (ARMv7 covers virtually anything Android will be running on these days).
>
> There are a couple of other things to be wary of at the moment though:
>
> 1. PC-relative data, as you said: ARM code often includes literal data
> inline with code, and this could well *not* have a valid disassembly.
> In relocatable object files, these regions should be marked[*], but I
> believe LLVM has problems with that currently. In executable files
> (like "busybox") the regions won't necessarily even be marked.
>
> 2. ARM object files may contain mixed ARM and Thumb code: two
> different instruction sets. Obviously, disassembling ARM as Thumb or
> the reverse won't give you anything sensible. Again, relocatable files
> mark these regions[*] but executables don't. If you know that what you
> want is Thumb code, you can use the triple "thumbv7" instead for
> llvm-objdump.
>
> So a combination of those probably explains why you're getting
> problems and may improve matters, but it probably won't make things
> perfect (and arguably can't in the case of the ARM/Thumb distinction
> without reconstructing all possible control-flow graphs).
>
> Tim.
>
> [*] The marking is via the symbols $a, $t and $d, which reference the
> beginning of each stretch of ARM code, Thumb code and data.
Hi David,

On Thu, Jun 7, 2012 at 10:17 AM, Fan Dawei <fandawei.s at gmail.com> wrote:
> Could you please tell me more about the $a, $t and $d symbols? How are
> these symbols used to define different regions? Where can I find these
> symbols in an ELF object file?

At the start of each range of ARM code, an assembler or compiler
should produce a "$a" symbol with that address, and put it (naturally
enough) in the ELF symbol table. Similarly, each stretch of Thumb code
gets a "$t" and each stretch of data a "$d".

For example, if I assemble:

        .arm
        mov r0, r3
        ldr r2, Lit
    Lit:
        .word 42
        add r0, r0, r0
        .thumb
        mov r5, r2

then the symbol table contains these entries:

     4: 00000000     0 NOTYPE  LOCAL  DEFAULT    1 $a
    [...]
     6: 00000008     0 NOTYPE  LOCAL  DEFAULT    1 $d
     7: 0000000c     0 NOTYPE  LOCAL  DEFAULT    1 $a
     8: 00000010     0 NOTYPE  LOCAL  DEFAULT    1 $t

which shows that an ARM region begins at offset 0x0 and a data region
at offset 0x8, that we switch back to ARM at 0xc, and that Thumb
finally takes over at 0x10.

GNU objdump hides these symbols by default when printing the symbol
table (you can give it the --special-syms option to show them), but
readelf always shows them.

If you want the really deep details, they're fully documented in the
ARM ELF ABI here (section 4.6.5):

http://infocenter.arm.com/help/topic/com.arm.doc.ihi0044d/IHI0044D_aaelf.pdf

Which is all nice to know, but I'm afraid it probably doesn't offer an
immediate solution to the undefined instructions:

+ libc.so isn't a relocatable object file (well, it is dynamically,
  but that doesn't count).
+ llvm-objdump ignores the symbols anyway at the moment, as far as I
  can tell.

Tim.
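To make the region idea concrete, here is a minimal Python sketch (not
part of LLVM or binutils) that reads `readelf -s` style lines like the
ones in the message above, collects the mapping symbols, and reports
which kind of region a given offset falls in. The names
parse_mapping_symbols, classify and EXAMPLE are hypothetical, and the
sketch assumes every symbol belongs to the same section, so it ignores
the Ndx column.

```python
import bisect

# Mapping-symbol names and the kind of region each one starts.
KINDS = {"$a": "arm", "$t": "thumb", "$d": "data"}

# The symbol-table excerpt from the message, used as sample input.
EXAMPLE = """\
     4: 00000000     0 NOTYPE  LOCAL  DEFAULT    1 $a
[...]
     6: 00000008     0 NOTYPE  LOCAL  DEFAULT    1 $d
     7: 0000000c     0 NOTYPE  LOCAL  DEFAULT    1 $a
     8: 00000010     0 NOTYPE  LOCAL  DEFAULT    1 $t
"""


def parse_mapping_symbols(readelf_text):
    """Return a sorted list of (address, kind) region starts.

    Assumes the eight-column layout shown above:
    Num: Value Size Type Bind Vis Ndx Name
    """
    regions = []
    for line in readelf_text.splitlines():
        fields = line.split()
        if len(fields) != 8:
            continue  # skips "[...]" and blank lines
        # Tolerate a dotted suffix such as "$a.0" by keeping the part
        # before the first period.
        base = fields[7].split(".", 1)[0]
        kind = KINDS.get(base)
        if kind is None:
            continue  # not a mapping symbol (also skips a header row)
        regions.append((int(fields[1], 16), kind))
    regions.sort()
    return regions


def classify(regions, addr):
    """Return the kind of the region containing addr, or None if addr
    lies before the first mapping symbol."""
    starts = [start for start, _ in regions]
    i = bisect.bisect_right(starts, addr) - 1
    return regions[i][1] if i >= 0 else None


if __name__ == "__main__":
    table = parse_mapping_symbols(EXAMPLE)
    for offset in (0x0, 0x8, 0xC, 0x10):
        print(hex(offset), classify(table, offset))
```

With the example table, offset 0x8 falls in the data region and
everything from 0x10 onward is Thumb, matching the description above.
A disassembler that honoured the symbols would pick its decoder (or
skip decoding, for data) based on this lookup.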
Hi Tim,

Thanks a lot for your help! I'm very grateful.

libc.so is a prelinked library; I'll build a non-prelinked one and try
again.

I'm now at the start of a binary translation project. I want to convert
ARM binary code [*] to LLVM IR, which is then translated to binary for
our MIPS-like architecture. That's why I'm looking for a decoder for
ARM binaries.

The ARMMCDisassembler is production quality, as I was told by Evan.
That's why I'm so interested in it. However, I realized today that it
might not be a good choice. Although the disassembled MCInsts have a
clean and simple interface, their opcodes are auto-generated from the
instruction description files. There are a great many of them, and they
do not correspond one-to-one to ARM instructions. I don't think it is a
good idea for our translator to rely on the implementation of the LLVM
ARM back-end, so we have to find another decoder or implement one
ourselves.

Thanks,
David

[*] In most cases, the targets are the shared libraries in Android APKs
developed with the NDK, like libangraybird.so. I think most of them are
prelinked, which is bad for us: because there are no $a, $t and $d
symbols, we cannot statically figure out which regions are ARM code and
which are Thumb code.

On Thu, Jun 7, 2012 at 8:11 PM, Tim Northover <t.p.northover at gmail.com> wrote:
> [...]