On Aug 7, 2013, at 7:23 PM, Michele Scandale <michele.scandale at gmail.com> wrote:

> On 08/08/2013 03:52 AM, Pete Cooper wrote:
>>> Why should a backend be responsible for (meaning have knowledge of) a
>>> mapping between high-level address spaces and low-level address spaces?
>> That's true. I'm thinking entirely from the perspective of the backend
>> doing CL/CUDA. But actually LLVM is language agnostic. That is still
>> something the metadata could solve. The front-end could generate the
>> metadata I suggested earlier, which will tell the backend how to do the
>> mapping. Then the backend only needs to read the metadata.
>
> From here I understand that in the IR there are addrspace(N) where
> N=0,1,2,3,... according to the target-independent mapping done by the
> frontend to represent different address spaces (for OpenCL 1.2,
> 0 = private, 1 = global, 2 = local, 3 = constant).
>
> Then the frontend emits metadata that contains the map from "language
> address spaces" to "target address spaces" (for X86 this would be
> 0->0, 1->0, 2->0, 3->0).
>
> Finally, instruction selection will use this information to perform the
> selection correctly, tagging the machine instruction with both the
> logical and the physical address space.

Sounds good.

>>> Why should the X86 backend be aware of OpenCL address spaces, or any
>>> other address spaces?
>> The only reason I can think of is that this allows the address-space
>> alias analysis to occur, and all of the optimizations you might want to
>> implement on top of it. Otherwise you'll need the front-end to put
>> everything in address space 0, and you'll have lost some opportunity to
>> optimize in that way for x86.
>
> The mapping phase will allow the backend precondition (no address spaces
> other than zero) to be satisfied. With both pieces of information kept in
> the IR and afterwards, the alias analysis should be feasible.
>>> Like for other aspects, I find it more direct and intuitive to
>>> anticipate target information in the frontend (this is already done
>>> and accepted) than to make the middle-end and back-end dependent on
>>> the source language (no specific language knowledge is required,
>>> because different frontends could be built on top of this).
>>>
>>> Maybe a way to decouple the frontend and the specific target is
>>> possible, in order to have in the target-independent part of the code
>>> generator support for a set of languages with common concepts (like
>>> OpenCL/CUDA), but it's still language dependent!
>> Yes, that could work. Actually the numbers are probably not the
>> important thing. It's the names that really tell you what the address
>> space is for. The backend needs to know what loading from a local
>> means. It's almost unimportant what specific number a front-end chooses
>> for that address space. We know the front-end is really going to choose
>> 2 (from what you said earlier), but the backend just needs to know how
>> to load/store a local.
>>
>> So perhaps the front-end should really be generating metadata which
>> tells the target what address space it chose for a memory space. That is:
>>
>> !private_memory = metadata !{ i32 0 }
>> !global_memory = metadata !{ i32 1 }
>> !local_memory = metadata !{ i32 2 }
>> !constant_memory = metadata !{ i32 3 }
>>
>> Unfortunately you'd have to essentially reserve those metadata names for
>> your use (better names than I chose, of course), but this might be
>> reasonable. You could alternatively use the example I first gave, but
>> just add a name field to it.
>>
>> I guess targets would have to either assert or default to address space
>> 0 when they see an address space without associated metadata.
>
> This part is not clear; still, in the X86 backend private/global/local
> memories are meaningless. Indeed, this is limited to the set of languages
> that support these abstractions.

Yeah.
The address spaces don't mean anything in terms of instruction selection for
x86. You mentioned earlier putting the physical and logical address spaces on
the machine instr. If you wanted, you could use these to perform code motion
on x86 which would otherwise not be possible, but that's the only reason I
can think of for why x86 would benefit from address space information in the
backend.

> IMO a more general solution would be to fully delegate the mapping
> resolution to the frontend, generating the map from logical to physical
> address spaces.
>
> Considering also the fact that addrspace is used to support the C address
> space extension, which maps from C to physically numbered address spaces,
> maybe a default implicit identity function as the mapping would be fine
> when no metadata is provided.

Yeah, I think a default identity mapping is a good idea. x86, for example,
uses address spaces 256 and 257 for the fs and gs segments. Without this
default mapping, tests using those segments would fail.

Thanks,
Pete

> Thanks again.
>
> -Michele
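The default-identity lookup Michele proposes and Pete agrees with above can be sketched in a few lines. This is a hypothetical helper, not actual LLVM API; the std::map stands in for the frontend-emitted mapping metadata:

```cpp
#include <map>

// Hypothetical sketch of the proposed lookup: logical address spaces that
// the frontend's metadata maps are translated, and anything unmapped
// (e.g. x86's 256/257 for the fs/gs segments, which carry no metadata)
// falls through unchanged via the identity mapping.
unsigned mapAddrSpace(const std::map<unsigned, unsigned> &Mapping,
                      unsigned Logical) {
  auto It = Mapping.find(Logical);
  return It == Mapping.end() ? Logical : It->second;
}
```

With an OpenCL-style mapping {0->0, 1->0, 2->0, 3->0}, every language address space lowers to 0 on x86, while the segment address spaces 256 and 257 keep their numbers, so existing tests using those segments keep working.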
On 8 Aug 2013, at 04:23, Pete Cooper <peter_cooper at apple.com> wrote:

> On Aug 7, 2013, at 7:23 PM, Michele Scandale <michele.scandale at gmail.com> wrote:
>
>> On 08/08/2013 03:52 AM, Pete Cooper wrote:
>>
>> From here I understand that in the IR there are addrspace(N) where
>> N=0,1,2,3,... according to the target-independent mapping done by the
>> frontend to represent different address spaces (for OpenCL 1.2,
>> 0 = private, 1 = global, 2 = local, 3 = constant).
>>
>> Then the frontend emits metadata that contains the map from "language
>> address spaces" to "target address spaces" (for X86 this would be
>> 0->0, 1->0, 2->0, 3->0).
>>
>> Finally, instruction selection will use this information to perform the
>> selection correctly, tagging the machine instruction with both the
>> logical and the physical address space.
>
> Sounds good.

What happens when I link together two IR modules from different front ends
that have different language-specific address spaces?

I would be very hesitant about using address spaces until we've fixed their
semantics to disallow bitcasts between different address spaces and require
an explicit address space cast.
To illustrate the problem, consider the following trivial example:

typedef __attribute__((address_space(256))) int* gsptr;

int *toglobal(gsptr foo)
{
	return (int*)foo;
}

int load(int *foo)
{
	return *foo;
}

int loadgs(gsptr foo)
{
	return *foo;
}

int loadgs2(gsptr foo)
{
	return *toglobal(foo);
}

When we compile this to LLVM IR with clang (disabling asynchronous unwind
tables for clarity), at -O2 we get this:

define i32* @toglobal(i32 addrspace(256)* %foo) nounwind readnone ssp {
  %1 = bitcast i32 addrspace(256)* %foo to i32*
  ret i32* %1
}

define i32 @load(i32* nocapture %foo) nounwind readonly ssp {
  %1 = load i32* %foo, align 4, !tbaa !0
  ret i32 %1
}

define i32 @loadgs(i32 addrspace(256)* nocapture %foo) nounwind readonly ssp {
  %1 = load i32 addrspace(256)* %foo, align 4, !tbaa !0
  ret i32 %1
}

define i32 @loadgs2(i32 addrspace(256)* nocapture %foo) nounwind readonly ssp {
  %1 = bitcast i32 addrspace(256)* %foo to i32*
  %2 = load i32* %1, align 4, !tbaa !0
  ret i32 %2
}

Note that in loadgs2, the call to toglobal has been inlined, and so the back
end will just see a bitcast, which SelectionDAG treats as a no-op. The
assembly we get from this is:

_toglobal:                              ## @toglobal
## BB#0:
	pushq	%rbp
	movq	%rsp, %rbp
	movq	%rdi, %rax
	popq	%rbp
	ret

load:                                   ## @load
## BB#0:
	pushq	%rbp
	movq	%rsp, %rbp
	movl	(%rdi), %eax
	popq	%rbp
	ret

	.globl	_loadgs
	.align	4, 0x90
loadgs:                                 ## @loadgs
## BB#0:
	pushq	%rbp
	movq	%rsp, %rbp
	movl	%gs:(%rdi), %eax
	popq	%rbp
	ret

	.globl	_loadgs2
	.align	4, 0x90
loadgs2:                                ## @loadgs2
## BB#0:
	pushq	%rbp
	movq	%rsp, %rbp
	movl	(%rdi), %eax
	popq	%rbp
	ret

loadgs() has been compiled correctly: it uses the parameter as a gs-relative
address and performs the load. The assembly for load() and loadgs2(),
however, is identical: both are treating the parameter as a linear (not
gs-relative) address. The cast has been lost. This is even clearer when you
look at toglobal(), which has just become a no-op.
The correct code for this should be (I believe):

_toglobal:                              ## @toglobal
## BB#0:
	pushq	%rbp
	movq	%rsp, %rbp
	lea	%gs:(%rdi), %rax
	popq	%rbp
	ret

In the inlined version, the lea and movl should be combined into a single
gs-relative movl.

Until we can generate correct code from IR containing address spaces,
discussion of how to optimise this IR seems premature.

David
My view is that modules with different data layouts should be considered
incompatible. Data layouts are inherently target/language specific, and I
don't view this any differently than combining IR modules compiled for
different architectures.

Micah

> -----Original Message-----
> From: llvmdev-bounces at cs.uiuc.edu [mailto:llvmdev-bounces at cs.uiuc.edu]
> On Behalf Of David Chisnall
> Sent: Thursday, August 08, 2013 2:04 AM
> To: Pete Cooper
> Cc: LLVM Developers Mailing List
> Subject: Re: [LLVMdev] Address space extension
>
> On 8 Aug 2013, at 04:23, Pete Cooper <peter_cooper at apple.com> wrote:
>
> > On Aug 7, 2013, at 7:23 PM, Michele Scandale
> > <michele.scandale at gmail.com> wrote:
> >
> >> On 08/08/2013 03:52 AM, Pete Cooper wrote:
> >>
> >> From here I understand that in the IR there are addrspace(N) where
> >> N=0,1,2,3,... according to the target-independent mapping done by the
> >> frontend to represent different address spaces (for OpenCL 1.2,
> >> 0 = private, 1 = global, 2 = local, 3 = constant).
> >>
> >> Then the frontend emits metadata that contains the map from "language
> >> address spaces" to "target address spaces" (for X86 this would be
> >> 0->0, 1->0, 2->0, 3->0).
> >>
> >> Finally, instruction selection will use this information to perform
> >> the selection correctly, tagging the machine instruction with both
> >> logical and physical address spaces.
> > Sounds good.
>
> What happens when I link together two IR modules from different front
> ends that have different language-specific address spaces?
>
> I would be very hesitant about using address spaces until we've fixed
> their semantics to disallow bitcasts between different address spaces
> and require an explicit address space cast.
> To illustrate the problem, consider the following trivial example:
>
> [...]
>
> loadgs() has been compiled correctly. It uses the parameter as a
> gs-relative address and performs the load. The assembly for load() and
> loadgs2(), however, is identical: both are treating the parameter as a
> linear (not gs-relative) address. The cast has been lost.
>
> Until we can generate correct code from IR containing address spaces,
> discussion of how to optimise this IR seems premature.
>
> David
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
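Micah's rule at the top of this message amounts to a simple precondition a linker could check before combining modules. A minimal sketch of that check (a hypothetical helper, not the actual llvm::Linker API, which performs considerably more validation):

```cpp
#include <string>

// Hypothetical precondition check sketching Micah's rule: refuse to
// combine IR modules whose target data layout strings differ, the same
// way one would refuse modules built for different architectures.
bool canLinkModules(const std::string &LayoutA, const std::string &LayoutB) {
  // Any mismatch means the modules are incompatible.
  return LayoutA == LayoutB;
}
```

Under this rule, address-space incoherence between modules surfaces as a hard link-time failure rather than silently merged IR.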
On 08/08/2013 11:04 AM, David Chisnall wrote:
> What happens when I link together two IR modules from different front
> ends that have different language-specific address spaces?

I agree with Micah: if, when linking two IR modules, there are
inconsistencies (e.g. in module1 2 -> 1 and in module2 2 -> 3), then the
modules are incompatible and the link process should fail.

> I would be very hesitant about using address spaces until we've fixed
> their semantics to disallow bitcasts between different address spaces
> and require an explicit address space cast. To illustrate the problem,
> consider the following trivial example:
>
> [...]
>
> Note that in loadgs2, the call to toglobal has been inlined and so the
> back end will just see a bitcast, which SelectionDAG treats as a no-op.
> The assembly we get from this is:
>
> [...]
>
> loadgs() has been compiled correctly. It uses the parameter as a
> gs-relative address and performs the load. The assembly for load() and
> loadgs2(), however, is identical: both are treating the parameter as a
> linear (not gs-relative) address. The cast has been lost. This is even
> clearer when you look at toglobal(), which has just become a no-op.
> The correct code for this should be (I believe):
>
> [...]
>
> In the inlined version, the lea and movl should be combined into a
> single gs-relative movl.
>
> Until we can generate correct code from IR containing address spaces,
> discussion of how to optimise this IR seems premature.

I've done a quick test: the problem is that the BITCAST node is not
generated during SelectionDAG building. If you look at
SelectionDAGBuilder::visitBitCast, you will see that the node is generated
only if the operand value of the bitcast operation and the result value
have different EVTs. The address space information is not handled in EVT,
so pointers in different address spaces are mapped to the same EVT, which
implies a missing BITCAST node.

Maybe rethinking the way address spaces are handled at the interface
between the middle-end and the backend would also allow fixing these kinds
of problems.

BTW, I think this specific problem can be used for a bug report :-).

Thanks.

-Michele
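Michele's diagnosis can be modelled in a few lines. This is a toy model, not LLVM's actual EVT class: because the EVT of a pointer records only its kind and bit width, pointers into different address spaces compare equal, and visitBitCast's "emit only if the EVTs differ" rule silently drops the cast:

```cpp
// Toy model of the bug: a pointer's EVT keeps its kind and bit width but
// drops the address space, so i32* and i32 addrspace(256)* look identical
// to SelectionDAGBuilder::visitBitCast.
struct EVT {
  unsigned Kind;
  unsigned Bits;
  bool operator==(const EVT &O) const {
    return Kind == O.Kind && Bits == O.Bits;
  }
};

struct PointerType {
  unsigned AddrSpace; // 0 = default, 256 = x86 gs segment
};

EVT getEVT(const PointerType &) {
  return EVT{/*Kind=*/1, /*Bits=*/64}; // address space is not recorded
}

// visitBitCast creates a BITCAST node only when the two EVTs differ.
bool emitsBitcastNode(const PointerType &Src, const PointerType &Dst) {
  return !(getEVT(Src) == getEVT(Dst));
}
```

Here emitsBitcastNode({256}, {0}) is false, so the addrspace(256)-to-addrspace(0) cast disappears, which is exactly the failure David's loadgs2() shows.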
On Aug 8, 2013, at 3:04 AM, David Chisnall wrote:

> The correct code for this should be (I believe):
>
> _toglobal:                              ## @toglobal
> ## BB#0:
> 	pushq	%rbp
> 	movq	%rsp, %rbp
> 	lea	%gs:(%rdi), %rax
> 	popq	%rbp
> 	ret

This won't have the effect you're hoping for. LEA stands for "Load Effective
Address"; it only operates on the offset part of a logical (far) address.
It's no different from before, when RDI was MOV'd into RAX. In fact, there
is no instruction you can use to turn a seg:offset logical address into a
linear address. That's why most systems that use the FS and GS registers for
thread-specific data have a field for the linear address of the TSD
structure.

Chip
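Chip's point is that linearising a seg:offset address needs the segment's base, which no ordinary instruction will hand you; the arithmetic itself is trivial once the base is known, which is why systems store the base (e.g. the TSD's own linear address) in a field the program can read. A sketch of that arithmetic (illustrative only; segBase would come from such a stored field, not from an instruction):

```cpp
#include <cstdint>

// Illustrative only: once the segment base is known (read from a stored
// slot such as the TSD self-pointer), the linear address in the flat
// model is just base + offset.
uint64_t toLinearAddress(uint64_t segBase, uint64_t offset) {
  return segBase + offset;
}
```

The hardware performs this same addition internally on every %gs:(%rdi) access, but it never exposes the resulting linear address to software.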