comic fans via llvm-dev
2017-Nov-21 15:32 UTC
[llvm-dev] question about xray tls data initialization
With some dirty hacks, I've made the XRay runtime 'build' on Windows, but
unfortunately I don't know enough about the linker and the runtime, and the
resulting executable didn't run. I'd like to share my changes here, in the
hope that somebody can help me make it run on Windows.

In AsmPrinter, I copy/pasted the XRay section setup for the COFF target:

    InstMap = OutContext.getCOFFSection("xray_instr_map", 0,
                                        SectionKind::getReadOnlyWithRel());
    FnSledIndex = OutContext.getCOFFSection("xray_fn_idx", 0,
                                            SectionKind::getReadOnlyWithRel());

In XRayArgs, I allowed the Windows platform to use the XRay args. With this,
the generated code seems to have the sleds and the XRay sections.

In the XRay runtime:

    bool atomic_compare_exchange_strong(volatile atomic_sint32_t *a,
                                        s32 *cmp,
                                        s32 xchg,
                                        memory_order mo)

is missing for MSVC; I took the atomic_uint32_t implementation.

MSVC 14.1 treats BufferQueue::Buffer::Buffer as a constructor instead of a
data member, so I renamed the member: Buf.Buffer => Buf.Data.

FunctionRecord packing: __attribute__((packed)) => #pragma pack(push,1).
MSVC also requires bitfields to have the same type in order to pack them
together (all types => uint32_t).

FD: int => HANDLE. Most of the code logic is still valid (-1 as the invalid
value); the read/write APIs are replaced with their Windows equivalents.

mprotect => VirtualProtect.

readTSC in xray_x86_64.inc also works on Windows.

Reading the TSC from proc is replaced with QueryPerformanceFrequency.

MSVC cannot compile code such as

    void setupNewBuffer(int (*wall_clock_reader)(clockid_t,
                                                 struct timespec *));

A typedef must be used first. XRay uses clock_gettime as the default
implementation, which is not friendly to Windows, so I created a fake one
based on chrono system_clock (ignoring clockid_t).

For the TLS destructor part, I've just commented it out (but
codeproject.com/Articles/8113/Thread-Local-Storage-The-C-Way
describes a thread-exit callback approach for COFF).

The last thing, which I don't understand, is the weak symbols for

    __start_xray_instr_map[]
    __stop_xray_instr_map[]
    __start_xray_fn_idx[]
    __stop_xray_fn_idx[]

I replaced them with __declspec(selectany), but I'm not sure it has the
same meaning.
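The typedef workaround and the chrono-based stand-in for clock_gettime described above could look roughly like the following. This is only a sketch under stated assumptions: `fake_clockid_t`, `fake_timespec`, `WallClockReader`, `fake_clock_gettime`, and `readWallClock` are hypothetical names chosen for illustration and are not taken from the XRay sources or from the actual patch.

```cpp
#include <chrono>
#include <cstdint>

// Hypothetical stand-ins for the POSIX types (not the XRay or POSIX definitions).
using fake_clockid_t = int;
struct fake_timespec {
  std::int64_t tv_sec;   // whole seconds since the system_clock epoch
  std::int64_t tv_nsec;  // nanoseconds within the current second
};

// Typedef first, so the function-pointer type can be named in a parameter list.
typedef int (*WallClockReader)(fake_clockid_t, fake_timespec *);

// Wall-clock reader built on std::chrono::system_clock; the clock id is ignored.
inline int fake_clock_gettime(fake_clockid_t /*ignored*/, fake_timespec *ts) {
  if (ts == nullptr)
    return -1;
  using namespace std::chrono;
  const auto since_epoch = system_clock::now().time_since_epoch();
  const auto secs = duration_cast<seconds>(since_epoch);
  const auto nanos = duration_cast<nanoseconds>(since_epoch - secs);
  ts->tv_sec = static_cast<std::int64_t>(secs.count());
  ts->tv_nsec = static_cast<std::int64_t>(nanos.count());
  return 0;
}

// Accepts the reader through the typedef rather than an inline
// function-pointer declarator in the parameter list.
inline int readWallClock(WallClockReader reader, fake_timespec *out) {
  return reader(0, out);
}
```

A caller would pass `&fake_clock_gettime` wherever a `WallClockReader` is expected. Note that system_clock is a wall clock, so a stand-in like this does not honor monotonic clock ids.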
some random generated code:

        .text
        .intel_syntax noprefix
        .def     call;
        .scl    2;
        .type   32;
        .endef
        .globl  call                    # -- Begin function call
        .p2align 4, 0x90
call:                                   # @call
.seh_proc call
# BB#0:                                 # %entry
        .p2align 1, 0x90
.Lxray_sled_0:
        .ascii  "\353\t"
        nop     word ptr [rax + rax + 512]
        sub     rsp, 16
        .seh_stackalloc 16
        .seh_endprologue
        mov     dword ptr [rsp + 12], ecx
        mov     dword ptr [rsp + 8], 0
        mov     dword ptr [rsp + 4], 0
.LBB0_1:                                # %for.cond
                                        # =>This Inner Loop Header: Depth=1
        mov     eax, dword ptr [rsp + 4]
        cmp     eax, dword ptr [rsp + 12]
        jge     .LBB0_4
# BB#2:                                 # %for.body
                                        # in Loop: Header=BB0_1 Depth=1
        mov     eax, dword ptr [rsp + 4]
        add     eax, dword ptr [rsp + 8]
        mov     dword ptr [rsp + 8], eax
# BB#3:                                 # %for.inc
                                        # in Loop: Header=BB0_1 Depth=1
        mov     eax, dword ptr [rsp + 4]
        add     eax, 1
        mov     dword ptr [rsp + 4], eax
        jmp     .LBB0_1
.LBB0_4:                                # %for.end
        mov     eax, dword ptr [rsp + 8]
        add     rsp, 16
        .p2align 1, 0x90
.Lxray_sled_1:
        ret
        nop     word ptr cs:[rax + rax + 512]
        .seh_handlerdata
        .text
        .seh_endproc
                                        # -- End function
        .section xray_instr_map,"y"
.Lxray_sleds_start0:
        .quad   .Lxray_sled_0
        .quad   call
        .byte   0x00
        .byte   0x00
        .byte   0x00
        .zero   13
        .quad   .Lxray_sled_1
        .quad   call
        .byte   0x01
        .byte   0x00
        .byte   0x00
        .zero   13
.Lxray_sleds_end0:
        .section xray_fn_idx,"y"
        .p2align 4, 0x90
        .quad   .Lxray_sleds_start0
        .quad   .Lxray_sleds_end0
        .text

and parts of obj dump:

SECTION HEADER #5
     /16 name (xray_instr_map)
       0 physical address
       0 virtual address
      40 size of raw data
     198 file pointer to raw data (00000198 to 000001D7)
     1D8 file pointer to relocation table
       0 file pointer to line numbers
       4 number of relocations
       0 number of line numbers
  100000 flags
         1 byte align

RAW DATA #5
  00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  00000020: 56 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  V...............
  00000030: 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................

RELOCATIONS #5
                                                Symbol    Symbol
 Offset    Type              Applied To         Index     Name
 --------  ----------------  -----------------  --------  ------
 00000000  ADDR64            00000000 00000000         0  .text
 00000008  ADDR64            00000000 00000000         E  call
 00000020  ADDR64            00000000 00000056         0  .text
 00000028  ADDR64            00000000 00000000         E  call

SECTION HEADER #6
      /4 name (xray_fn_idx)
       0 physical address
       0 virtual address
      10 size of raw data
     200 file pointer to raw data (00000200 to 0000020F)
     210 file pointer to relocation table
       0 file pointer to line numbers
       2 number of relocations
       0 number of line numbers
  500000 flags
         16 byte align

RAW DATA #6
  00000000: 00 00 00 00 00 00 00 00 40 00 00 00 00 00 00 00  ........@.......

RELOCATIONS #6
                                                Symbol    Symbol
 Offset    Type              Applied To         Index     Name
 --------  ----------------  -----------------  --------  ------
 00000000  ADDR64            00000000 00000000         8  xray_instr_map
 00000008  ADDR64            00000000 00000040         8  xray_instr_map

On Tue, Nov 21, 2017 at 7:46 PM, Dean Michael Berris <dean.berris at gmail.com> wrote:
>
> On 17 Nov 2017, at 00:44, comic fans via llvm-dev <llvm-dev at lists.llvm.org>
> wrote:
>
>> I'm learning the xray library and try if it can be built on windows, in
>> xray_fdr_logging_impl.h
>>
>> line 152 , comment written as
>> // Using pthread_once(...) to initialize the thread-local data structures
>>
>> but at line 175, 183, code written as
>>
>> thread_local pthread_key_t key;
>>
>> // Ensure that we only actually ever do the pthread initialization once.
>> thread_local bool UNUSED Unused = [] {
>>   new (&TLSBuffer) ThreadLocalData();
>>   auto result = pthread_key_create(&key, +[](void *) {
>>     auto &TLD = *reinterpret_cast<ThreadLocalData *>(&TLSBuffer);
>>
>> I'm confused that pthread_key_t and Unused are both thread_local
>> variable, doesn't it mean the following lambda will run for each
>> thread , and create one pthread_key_t for only one tls data(instead of
>> only one pthread_key_t for all thread) ? also what does the '+' before
>> lambda expression mean ? this may be stupid questions, could somebody
>> kindly helped ?
>
> Yeah, that comment is out-of-date (and the implementation is buggy) -- which
> is a shame really. :/
>
> But, the good news, is I think we've fixed this now in the top-of-trunk with
> reviews.llvm.org/D39526 and reviews.llvm.org/D40164.
>
> Curiously though, how far did your exploration into getting XRay to build on
> Windows go?
>
> Cheers
>
> -- Dean
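On the '+' question in the quoted text: a unary plus applied to a captureless lambda makes its conversion to a plain function pointer explicit, and a function pointer is what a C API such as pthread_key_create takes as its destructor argument. On the other question, the initializer of a thread_local variable does run once in every thread that uses it. Below is a minimal sketch of the '+' idiom only. `Destructor`, `register_destructor`, and `demo` are hypothetical names that stand in for a C-style API and are not part of XRay or pthreads.

```cpp
#include <type_traits>

// Hypothetical stand-in for a C-style API that stores a destructor callback.
using Destructor = void (*)(void *);
static Destructor g_registered = nullptr;
static int g_calls = 0;

inline int register_destructor(Destructor d) {
  g_registered = d;
  return 0;
}

inline int demo() {
  // A lambda without captures has a unique closure type...
  auto lambda = [](void *) { ++g_calls; };
  // ...and unary plus converts it to an ordinary function pointer.
  auto fn = +lambda;
  static_assert(std::is_same<decltype(fn), void (*)(void *)>::value,
                "unary plus yields a plain function pointer");
  static_assert(!std::is_same<decltype(lambda), void (*)(void *)>::value,
                "the closure type itself is not a function pointer");
  register_destructor(fn);
  g_registered(nullptr);  // invoke the callback through the stored pointer
  return g_calls;
}
```

The conversion only exists for lambdas that capture nothing, so a lambda with captures does not compile with a leading '+'.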
Dean Michael Berris via llvm-dev
2017-Nov-22 02:37 UTC
[llvm-dev] question about xray tls data initialization
> On 22 Nov 2017, at 02:32, comic fans <comicfans44 at gmail.com> wrote:
>
> with some dirty hack , I've made xray runtime 'built' on windows ,

\o/

> but unfortunately I haven't enough knowledge about linker and the
> runtime, and finally built executable didn't run. I'd like to share
> my changes here , hopes somebody help me to make it run on windows.

Thanks for working on this!

If you're alright with it, maybe you can send some patches to review,
preferably through the LLVM Phabricator instance? You can have me or Reid
(who knows more about COFF and the Windows stuff) as reviewers.

> in AsmPrinter, copy/paster xray for coff target
>
> InstMap = OutContext.getCOFFSection("xray_instr_map", 0,
> SectionKind::getReadOnlyWithRel());
> FnSledIndex = OutContext.getCOFFSection("xray_fn_idx",
> 0,SectionKind::getReadOnlyWithRel());
>
> in XRayArgs , allow windows platform to use xray args. with this,
> generated code seems have sled and xray parts.

Nice, I suspect we can make this change with tests as well, which we can
build on incrementally.

> in xray runtime,
> bool atomic_compare_exchange_strong(volatile atomic_sint32_t *a,
> s32 *cmp,
> s32 xchg,
> memory_order mo)
> is missed for MSVC , I take atomic_uint32_t implementation

This is in compiler-rt/lib/sanitizer_common/... right?

> msvc 14.1 treats BufferQueue::Buffer::Buffer as constructor instead of
> data member, Buf.Buffer=>Buf.Data

Interesting. That's an easy patch to merge. :)

> FunctionRecord pack , __attribute__((packed)) => #pragma
> pack(push,1), msvc also requires bitfields to be same type to pack
> them together( all types => uint32_t)

Are you able to test this on other platforms?

> FD int => HANDLE, most code logic still valid (-1 as invalid value),
> r/w API replaced with windows
>
> mprotect => VirtualProtect
>
> readTSC in xray_x86_64.inc also works for windows
>
> replace read tsc from proc with QueryPerformanceFrequency
>
> msvc can not compile such code
> void setupNewBuffer(int (*wall_clock_reader)(clockid_t,
> struct timespec *));
>
> must use typedef first . xray use clock_gettime as default
> implementation , which is not friendly for windows .create a fake one
> based on chrono system_clock(ignore clockid_t)

This one is definitely something to do, even for potentially supporting
XRay on Darwin where older versions of the SDK (10.11 and lower) don't
define clock_gettime. Probably can be split off as a thing that can be
reviewed and merged regardless.

> for tls destructor part, I've just commented them out.(but
> codeproject.com/Articles/8113/Thread-Local-Storage-The-C-Way
> gives a thread exit callback way for coff)

Interesting, thanks! This one is something that could be abstracted away
on a per-platform basis.

> and last thing , which I don't understand is the weak symbol for
> __start_xray_instr_map[]
> __stop_xray_instr_map[]
> __start_xray_fn_idx[]
> __stop_xray_fn_idx[]
>
> I replace them with __declspec(selectany) , but I'm not sure they
> have same meanings.

The __{start, stop}_xray_{instr_map,fn_idx}[] arrays are usually generated
by the linker on ELF and ELF-like platforms.
I'm not aware what the MSVC COFF linkers do, probably something others who
know better can answer.

> some random generated code:
>         .text
>         .intel_syntax noprefix
>         .def     call;
>         .scl    2;
>         .type   32;
>         .endef
>         .globl  call                    # -- Begin function call
>         .p2align 4, 0x90
> call:                                   # @call
> .seh_proc call
> # BB#0:                                 # %entry
>         .p2align 1, 0x90
> .Lxray_sled_0:
>         .ascii  "\353\t"
>         nop     word ptr [rax + rax + 512]
>         sub     rsp, 16
>         .seh_stackalloc 16
>         .seh_endprologue
>         mov     dword ptr [rsp + 12], ecx
>         mov     dword ptr [rsp + 8], 0
>         mov     dword ptr [rsp + 4], 0
> .LBB0_1:                                # %for.cond
>                                         # =>This Inner Loop Header: Depth=1
>         mov     eax, dword ptr [rsp + 4]
>         cmp     eax, dword ptr [rsp + 12]
>         jge     .LBB0_4
> # BB#2:                                 # %for.body
>                                         # in Loop: Header=BB0_1 Depth=1
>         mov     eax, dword ptr [rsp + 4]
>         add     eax, dword ptr [rsp + 8]
>         mov     dword ptr [rsp + 8], eax
> # BB#3:                                 # %for.inc
>                                         # in Loop: Header=BB0_1 Depth=1
>         mov     eax, dword ptr [rsp + 4]
>         add     eax, 1
>         mov     dword ptr [rsp + 4], eax
>         jmp     .LBB0_1
> .LBB0_4:                                # %for.end
>         mov     eax, dword ptr [rsp + 8]
>         add     rsp, 16
>         .p2align 1, 0x90
> .Lxray_sled_1:
>         ret
>         nop     word ptr cs:[rax + rax + 512]
>         .seh_handlerdata
>         .text
>         .seh_endproc
>                                         # -- End function
>         .section xray_instr_map,"y"
> .Lxray_sleds_start0:
>         .quad   .Lxray_sled_0
>         .quad   call
>         .byte   0x00
>         .byte   0x00
>         .byte   0x00
>         .zero   13
>         .quad   .Lxray_sled_1
>         .quad   call
>         .byte   0x01
>         .byte   0x00
>         .byte   0x00
>         .zero   13
> .Lxray_sleds_end0:
>         .section xray_fn_idx,"y"
>         .p2align 4, 0x90
>         .quad   .Lxray_sleds_start0
>         .quad   .Lxray_sleds_end0
>         .text
>
> and parts of obj dump:
>
> SECTION HEADER #5
>      /16 name (xray_instr_map)
>        0 physical address
>        0 virtual address
>       40 size of raw data
>      198 file pointer to raw data (00000198 to 000001D7)
>      1D8 file pointer to relocation table
>        0 file pointer to line numbers
>        4 number of relocations
>        0 number of line numbers
>   100000 flags
>          1 byte align
>
> RAW DATA #5
>   00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
>   00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
>   00000020: 56 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  V...............
>   00000030: 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
>
> RELOCATIONS #5
>                                                 Symbol    Symbol
>  Offset    Type              Applied To         Index     Name
>  --------  ----------------  -----------------  --------  ------
>  00000000  ADDR64            00000000 00000000         0  .text
>  00000008  ADDR64            00000000 00000000         E  call
>  00000020  ADDR64            00000000 00000056         0  .text
>  00000028  ADDR64            00000000 00000000         E  call
>
> SECTION HEADER #6
>       /4 name (xray_fn_idx)
>        0 physical address
>        0 virtual address
>       10 size of raw data
>      200 file pointer to raw data (00000200 to 0000020F)
>      210 file pointer to relocation table
>        0 file pointer to line numbers
>        2 number of relocations
>        0 number of line numbers
>   500000 flags
>          16 byte align
>
> RAW DATA #6
>   00000000: 00 00 00 00 00 00 00 00 40 00 00 00 00 00 00 00  ........@.......
>
> RELOCATIONS #6
>                                                 Symbol    Symbol
>  Offset    Type              Applied To         Index     Name
>  --------  ----------------  -----------------  --------  ------
>  00000000  ADDR64            00000000 00000000         8  xray_instr_map
>  00000008  ADDR64            00000000 00000040         8  xray_instr_map

This looks like it's actually worked, at least at CodeGen time.

Thanks again for sharing your experience, it'd be really great if you can
have patches that we can review and land to potentially get XRay working
on Windows!

Cheers

-- Dean
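To illustrate the point above about linker-generated `__start_`/`__stop_` symbols on ELF, here is a small sketch. It assumes an ELF target linked with GNU ld or lld, and a section name that is a valid C identifier. The section name `demo_sec`, the `Entry` type, and the helper functions are hypothetical and unrelated to the real xray_instr_map layout. It says nothing about what MSVC's COFF linker provides, which the thread leaves open.

```cpp
#include <cstddef>

struct Entry {
  int id;
};

#if defined(__ELF__)
// Place two entries into a custom section whose name is a valid C identifier.
__attribute__((section("demo_sec"), used)) static const Entry kFirst = {1};
__attribute__((section("demo_sec"), used)) static const Entry kSecond = {2};

// For such sections the linker defines symbols marking the section bounds.
extern "C" const Entry __start_demo_sec[];
extern "C" const Entry __stop_demo_sec[];

// Number of entries the linker gathered into the section.
inline std::size_t countEntries() {
  return static_cast<std::size_t>(__stop_demo_sec - __start_demo_sec);
}

// Walk the section the way a runtime would walk its instrumentation map.
inline int sumEntryIds() {
  int sum = 0;
  for (const Entry *e = __start_demo_sec; e != __stop_demo_sec; ++e)
    sum += e->id;
  return sum;
}
#else
// Fallback for non-ELF targets: no linker-provided bounds are assumed here.
inline std::size_t countEntries() { return 0; }
inline int sumEntryIds() { return 0; }
#endif
```

Because the bounds come from the linker rather than from any object file, a replacement such as `__declspec(selectany)` only supplies a definition of the symbol; whether anything then makes it point at the section contents is a separate question.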
comic fans via llvm-dev
2017-Nov-23 12:34 UTC
[llvm-dev] question about xray tls data initialization
On Wed, Nov 22, 2017 at 10:37 AM, Dean Michael Berris <dean.berris at gmail.com> wrote:
>
> On 22 Nov 2017, at 02:32, comic fans <comicfans44 at gmail.com> wrote:
>
>> with some dirty hack , I've made xray runtime 'built' on windows ,
>
> \o/

With more testing, I've found that the trampoline didn't get built for
Windows :/ Currently CMake doesn't generate a build rule for the asm file,
so it is silently ignored (with the MSVC IDE, but not with Ninja). We must
call enable_language(ASM_MASM) to use MASM, and the trampoline also needs
to be ported.

> If you're alright with it, maybe you can send some patches to review,
> preferably through the LLVM Phabricator instance? You can have me or Reid
> (who knows more about COFF and the Windows stuff) as reviewers.
>
>> in AsmPrinter, copy/paster xray for coff target
>>
>> InstMap = OutContext.getCOFFSection("xray_instr_map", 0,
>> SectionKind::getReadOnlyWithRel());
>> FnSledIndex = OutContext.getCOFFSection("xray_fn_idx",
>> 0,SectionKind::getReadOnlyWithRel());
>>
>> in XRayArgs , allow windows platform to use xray args. with this,
>> generated code seems have sled and xray parts.
>
> Nice, I suspect we can make this change with tests as well, which we can
> build on incrementally.

Where can I find some examples of how to test this XRay part in LLVM?

>> in xray runtime,
>> bool atomic_compare_exchange_strong(volatile atomic_sint32_t *a,
>> s32 *cmp,
>> s32 xchg,
>> memory_order mo)
>> is missed for MSVC , I take atomic_uint32_t implementation
>
> This is in compiler-rt/lib/sanitizer_common/... right?

Yes, sanitizer_atomic_msvc.h doesn't provide this overload. According to
the MSDN documentation for InterlockedCompareExchange, the implementation
for atomic_uint32_t should also work for atomic_sint32_t. This is a
copy/paste, but I think it's short enough. Any better suggestions?

>> FunctionRecord pack , __attribute__((packed)) => #pragma
>> pack(push,1), msvc also requires bitfields to be same type to pack
>> them together( all types => uint32_t)
>
> Are you able to test this on other platforms?

I've tested this on 64-bit Linux (with clang) and it passes check-xray,
but I don't have a Mac to test on. If changing all the attributes to
pragmas is desirable, I can submit a patch for that.
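For reference, the attribute-to-pragma change discussed above might look like the sketch below. `PackedRecord` and its fields are a hypothetical layout loosely inspired by the description in this thread; it is not the actual FunctionRecord definition. The key assumptions are that every bitfield uses the same underlying type (uint32_t) so the fields share one storage unit, and that `#pragma pack(push, 1)` is accepted by the compiler in use (MSVC, GCC, and Clang all support it).

```cpp
#include <cstdint>

#pragma pack(push, 1)
struct PackedRecord {
  // All bitfields share one type so they pack into a single 32-bit unit.
  std::uint32_t Type : 1;
  std::uint32_t Kind : 3;
  std::uint32_t FuncId : 28;
  std::uint32_t TSCDelta;
};
#pragma pack(pop)

// 32 bits of bitfields plus one 32-bit member, with no padding.
static_assert(sizeof(PackedRecord) == 8, "PackedRecord should be 8 bytes");

// Build a record; values wider than a field are truncated to the field width.
inline PackedRecord makeRecord(std::uint32_t kind, std::uint32_t funcId,
                               std::uint32_t delta) {
  PackedRecord r;
  r.Type = 1;
  r.Kind = kind;
  r.FuncId = funcId;
  r.TSCDelta = delta;
  return r;
}
```

A static_assert on the size is a cheap way to catch a compiler that lays the bitfields out differently, which is the portability concern raised in the reply.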