Hello Micah,
I first apologize for the mail length, but I think that using an example would
be better to clarify the case and the objections.
> [Micah Villmow] In the case of OpenCL, you can't correctly use the
standard C calling convention and still be OpenCL compliant, the C calling
convention is too permissive. The second you use OpenCL, you are using an OpenCL
specific calling convention(Kernels can't call kernels that have local
memory, kernels can't have private arguments, non-kernel functions can only
be called by the device and not the host, etc...). While some backends might
allow you to specify the C calling convention when using OpenCL, it is because
you are encoding that OpenCL knowledge in some other way that tells the backend
you really aren't using the C calling convention, but instead using the
OpenCL version of the C calling convention.
Ok that's right, I was considering those constraints verified by the
frontend.
>>> 2) Expose triples that give different address space mappings. For
example,
>> and AMDIL backend and the R600 backend might target the exact same
chip,
>> but can expose different address space mapping for different
architectures or
>> devices because of their output target.
>>> 3) Add device specific options to enable different interpretation
of the
>> address spaces.
>>> 4) Each backend expose their address spaces, and it is the job of
the frontend
>> to map the source language correctly. Queries to TargetInstrInfo could
be
>> added to see if two address spaces are disjoint or not.
>>>
>>> Out of all of them, #4 is what I think would be ideal. It allows
the backend and
>> the frontend to work without tight integration and already uses
existing
>> framework to allow things like AA to work properly.
>>
>> I agree that is a matter of the frontend to specify the relationship
between
>> logical and physical address space. I do not agree that is the frontend
that
>> should apply this translation, because otherwise features the derives
from
>> language specification are lost in the IR and cannot be used for more
>> aggressive optimization. I just want to be able to keep this logical
information
>> and when required get the physical properties I need following what the
>> frontend has specified as mapping.
>>
>> This I think is more important for scenarios where the target device
has only
>> the default address space. If want to target OpenCL on this device I
would like
>> still to optimize OpenCL code as much as I can exploiting both language
and
>> target features.
> [Micah Villmow] If the backend supports, for example, the 4 OpenCL address
spaces, then the backend should define how they are to be represented, either
all as a single address space, as subsets of the default address space, as
disjoint address spaces, or any other representation the backend can think of.
If that is defined by the backend, the IR needs to be in that form BEFORE the
backend is passed the IR. If the frontend wants to represent it as something
else, and then before calling the backend, convert it to the valid
representation, I have no problem with that. It should not be the backends job
to do this translation, otherwise you have to teach the backend all of the valid
transformations for all source languages that could possibly target the backend.
It is unneeded complexity with little benefit for the backend to support this
'feature'. It is a much better solution to just have the frontend do the
right thing and honor the contract that the backend defines. Just !
for clarit
y, I'm grouping all of the target independent optimizations as under the
domain of the frontend. If you want to write a late optimization pass that does
address space fixup, there is nothing in LLVM to stop you from doing
so.>
> I really have no problem with the rest of your proposal as I strongly feel
that better address space support is important to LLVM, only with adding
complexity to the backends by attempting to dynamically determine static
information.
I apologize again but what you are saying is not clear to me: so I will use a
small example to show the current status and what I think would be a good future
status, using two target platforms X86 and R600. The part related to the
description of source language address spaces features through TBAA like
metadata is omitted because I think we agree on that part.
/// test.cl
///
__kernel void convolve(const __global int *input,
__constant int *mask,
__global int *output) {
unsigned x = get_global_id(0);
output[x] = input[x] + mask[get_local_id(0)];
}
The IR for R600 now is:
/// test.r600.ll
///
target datalayout
"e-p:32:32:32-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-v16:16:16-v24:32:32-v32:32:32-v48:64:64-v64:64:64-v96:128:128-v128:128:128-v192:256:256-v256:256:256-v512:512:512-v1024:1024:1024-v2048:2048:2048-n32:64"
target triple = "r600-none-none"
; Function Attrs: nounwind
define void @convolve(i32 addrspace(1)* nocapture readonly %input, i32
addrspace(2)* nocapture readonly %mask, i32 addrspace(1)* nocapture %output) #0
{
entry:
%call = tail call i32 @get_global_id(i32 0) #2
%arrayidx = getelementptr inbounds i32 addrspace(1)* %input, i32 %call
%0 = load i32 addrspace(1)* %arrayidx, align 4, !tbaa !1
%call1 = tail call i32 @get_local_id(i32 0) #2
%arrayidx2 = getelementptr inbounds i32 addrspace(2)* %mask, i32 %call1
%1 = load i32 addrspace(2)* %arrayidx2, align 4, !tbaa !1
%add = add nsw i32 %1, %0
%arrayidx3 = getelementptr inbounds i32 addrspace(1)* %output, i32 %call
store i32 %add, i32 addrspace(1)* %arrayidx3, align 4, !tbaa !1
ret void
}
; Metadata and other declarations omitted
The IR for X86_64 now is:
/// test.x86_64.ll
///
target datalayout
"e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-s0:64:64-f80:128:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"
; Function Attrs: nounwind uwtable
define void @convolve(i32* nocapture readonly %input, i32* nocapture readonly
%mask, i32* nocapture %output) #0 {
entry:
%call = tail call i32 @get_global_id(i32 0) #2
%idxprom = zext i32 %call to i64
%arrayidx = getelementptr inbounds i32* %input, i64 %idxprom
%0 = load i32* %arrayidx, align 4, !tbaa !1
%call1 = tail call i32 @get_local_id(i32 0) #2
%idxprom2 = zext i32 %call1 to i64
%arrayidx3 = getelementptr inbounds i32* %mask, i64 %idxprom2
%1 = load i32* %arrayidx3, align 4, !tbaa !1
%add = add nsw i32 %1, %0
%arrayidx5 = getelementptr inbounds i32* %output, i64 %idxprom
store i32 %add, i32* %arrayidx5, align 4, !tbaa !1
ret void
}
As you can see in the R600 IR the address space information is present because
the R600 address space map in CLANG is non trivial and it map opencl address
spaces to R600 target address spaces:
* private => 0
* global => 1
* constant => 2
* local => 3
The same thing happens for the X86 case, but here the map is trivial because it
reflects the target:
* private => 0
* global => 0
* constant => 0
* local => 0
My proposal would change the semantic of addrspace modifier: the address space
encoded would be logical. Because we need to know the target address spaces
where these logical must be mapped, the mapping must be provided. If no
information are provided the conversion is the identity. Otherwise foreach
address space number that used in the IR a target address space number must be
provided.
What I would like to obtain in the case of X86 would be something like this:
/// test.x86_64.ll
///
target datalayout
"e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-s0:64:64-f80:128:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"
target addrspacemap = "1:0-2:0-3:0"
; Function Attrs: nounwind uwtable
define void @convolve(i32 addrspace(1)* nocapture readonly %input, i32
addrspace(2)* nocapture readonly %mask, i32 addrspace(1)* nocapture %output) #0
{
entry:
%call = tail call i32 @get_global_id(i32 0) #2
%idxprom = zext i32 %call to i64
%arrayidx = getelementptr inbounds i32 addrspace(1)* %input, i64 %idxprom
%0 = load i32 addrspace(1)* %arrayidx, align 4, !tbaa !1
%call1 = tail call i32 @get_local_id(i32 0) #2
%idxprom2 = zext i32 %call1 to i64
%arrayidx3 = getelementptr inbounds i32 addrspace(2)* %mask, i64 %idxprom2
%1 = load i32 addrspace(2)* %arrayidx3, align 4, !tbaa !1
%add = add nsw i32 %1, %0
%arrayidx5 = getelementptr inbounds i32 addrspace(1)* %output, i64 %idxprom
store i32 %add, i32 addrspace(1)* %arrayidx5, align 4, !tbaa !1
ret void
}
As you can see the mapping information will be emitted statically by the
frontend that still must know a-priori the mapping logical to physical. Now the
optimizer can see address space informations and so more aggressive optimization
would be possible.
Have a pass that *before* the instruction selection will transform the code in
order to apply the mapping replacing every non zero address space with zero
would not allow further optimizations in the backend that may exploit high level
informations. Instead I would prefer that the mapping would be done during the
instruction selection.
During the selection of instructions in the IR logical address space numbers are
available, but it's easy to derive the correspondent physical address space
using the *map function* defined by the frontend making the instruction selector
able to pick correctly instructions depending on the physical address spaces. It
may be useful to annotate the logical address space to the selected instruction
for further late optimizations.
The added complexity in the code generation phase is just the usage of the map
function.
As you can see in the backend there is no knowledge of logical address spaces, I
have just to pick address space from the IR and before using them apply a the
map function to compute the physical address space.
In the case of R600 target, the previous IR without any modification and without
any explicit conversion map will produce the same result, because the default
map is the identity and because the frontend choice for logical address spaces
is the one used for by the backend to identify the correspondent physical
address spaces.
It's not a wonderful thing to expose the mapping function in the IR, but
this is
necessary to allow independent tool like 'opt' and 'llc' to work
correctly (the
frontend here does not exist so the information can be passed through command
line or read by the IR, like the target triple).
I hope that the example is helpful to understand the problem I want to solve and
the proposed solution.
Using this example or another equivalent example can you explain me what is your
variant of the solution? How would you solve the current absence of information
in the case of X86 target?
Thanks again.
Regards,
-Michele