thr3ads.net - llvm dev - [LLVMdev] Proposal: AArch64/ARM64 merge from EuroLLVM [Apr 2014]

If this information is useful, please help other people find it:
Share via:

Quentin Colombet

2014-Apr-16 17:16 UTC

[LLVMdev] Proposal: AArch64/ARM64 merge from EuroLLVM

Hi Jiangning,

On Apr 15, 2014, at 11:12 PM, Jiangning Liu <liujiangning1 at gmail.com>
wrote:
> Hi Quentin,
> 
> Thanks for your feedback!
>> ARM64 generates pseudo instructions ARM64::MOVaddr and friends in ISEL
stage, which intends to guarantee address serialization (page address + in-page
address), and exposes adrp finally by pass ExpandPseudoInsts. The assumption of
ARM64 solution is we don't know the in-page offset can be fused into
load/store or not at compile time, and this assumption would turn to be not true
any longer for the solution of using global merge as I proposed with the patch.
> I think this is orthogonal. If you happen to merge globals they will have
the same base address (i.e., the same pseudo instruction) but different offsets.
> CSE and such will work like a charm for the pseudos.
> 
> This is probably not true. Global merge pass happens in PreIsel stage. For
my test case at http://reviews.llvm.org/D3223, after applying the patch, we will
have LLVM IR as below,
> 
>   store i32 %a1, i32* getelementptr inbounds ({ i32, i32, i32 }*
@_MergedGlobals_x, i32 0, i32 0), align 4
>   store i32 %a2, i32* getelementptr inbounds ({ i32, i32, i32 }*
@_MergedGlobals_x, i32 0, i32 1), align 4
> 
> and after ISEL stage, we can see different Machine Instructions generated
for AArch64 and ARM64.
> 
> AArch64:
> 
>         %vreg4<def> = ADRPxi <ga:@_MergedGlobals_x>;
GPR64noxzr:%vreg4
>         LS32_STR %vreg3, %vreg4, <ga:@_MergedGlobals_x>[TF=11]
>         %vreg5<def> = ADDxxi_lsl0_s %vreg4,
<ga:@_MergedGlobals_x>[TF=11]; GPR64noxzr:%vreg5,%vreg4
>         LS32_STR %vreg2, %vreg5<kill>, 1
> 
> ARM64:
>  
>         %vreg2<def> = ADRP <ga:@_MergedGlobals_x>[TF=1];
GPR64common:%vreg2
>         STRWui %vreg0, %vreg2<kill>,
<ga:@_MergedGlobals_x>[TF=18]
>         %vreg3<def> = MOVaddr <ga:@_MergedGlobals_x>[TF=1],
<ga:@_MergedGlobals_x>[TF=18]; GPR64common:%vreg3
>         STRWui %vreg1, %vreg3<kill>, 1
> 
> The problem is MOVaddr generated for ARM64  implies introducing adrp in
ExpandPseudoInsts pass again, although at this moment we don't really see
redundant ADRP yet. AArch64 is using ADDxxi_lsl0_s instead, and it will be
folded into LS32_STR finally.Interesting.
Looks like we are too clever here.
I would have expected ISel to generate one base address and one displacement.

I believe that if we fix that both the LOHs and the global merge become
orthogonal. My guess is that we should be less aggressive at folding offset if
there are several uses.
> 
> Assuming you emit the right instructions at isel time, you will create
ADRP, LOADGot, or ADD with symbols. Since you do not know anything on the
symbols, CSE will match only the ones that are identical.
> 
> This is correct.
>  
> You will have a finer granularity to do CSE, but I am not sure it will help
that much.
> 
> The 'CSE' here is a term only rather than the traditional CSE.
Since global variables are merged into a monolithic data structure, the we will
be able to generate only one base address (page address) for all of those global
variables.
>  
> On the other hand, you lose the rematerialization capability, because that
feature can only handle one instruction at a time. So you will still be able to
rematerialize ADRP but not the LOADGot and ADD with symbols.
> 
> Yes, but this depends on register pressure, and it's hard to tell
rematerialization is always good.Sure, but it can help :).
>> If simply apply the global merge solution to ARM64, probably we should
avoid generating pseudo instruction MOVaddr and friends in ISEL stage, but
I'm not sure if the LOH solution would still work or not, because,
>> 1) ARM64 link-time optimization depends on LOH.
>> 2) We don't see linker plug-in in LLVM trunk and it would be hard
for me to verify any thoughts.
> 
> The LOH solution is also orthogonal. You can see that as a last chance way
to optimize those accesses.
> That said, if you CSE the ADRP and not the LOADGot, you will indeed create
far less candidates for the LOHs because you will have ADRPs with several uses,
which is not supported by LOHs.
> 
> Yes. This is just what I'm worrying about. So essentially those two
optimizations have conflict.Let us try to fix the codegen problem while keeping the pseudos.
>  
> FYI, the LOH optimization is not a link-time optimization in LLVM, this is
really a link-time optimization: on the binary.
> 
> Yes. I see. 
>> Any concrete suggestion of combining those different ADRP CSE solutions
and tests would be appreciated!
> 
> The bottom line is whatever you are doing with merge globals, it is
orthogonal with LOHs.
> That said I think it is best to keep the pseudo instructions.
> 
> Well, if we keep the pseudo instruction MOVaddr, we would have to keep adrp
and expose it to binary, so it would lose the opportunity of removing redundant
adrp at compile-time.
>  
> Of course I may be wrong and the best way to check would be to measure what
happens if you get rid of the pseudo instructions. Do not be too concerned with
the impact on the LOHs.
> 
> Since compile-time ADRP CSE is not so powerful as link-time ADRP removal, I
don't want to hurt link-time solution.Well, this is something that should be measured. Your patch does not kill the
LOHs, it may just reduce the number of potential candidates. For each candidate
that your patch removes, it means we at least spare one ADRP instruction. The
trade-off does not seem bad.

I suggest we:
1. Fix the ISel of pseudo (making the folding less aggressive).
2. Measure the performance with your patch.

I can definitely help for the measurements with the LOHs enabled in parallel
with your patch.
If you want I can help for #1 too.

Side question, did you happen to measure any performance improvement/regression
with your patch?
I’d like to know which tests would be good candidates to measure the impact of
your patch + LOHs enabled.

Thanks,
-Quentin> 
> Thanks,
> -Quentin
> 
>> 
>> Thanks,
>> -Jiangning
>> 
>> _______________________________________________
>> LLVM Developers mailing list
>> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
>> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
> 
> 
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20140416/41f70ead/attachment.html>

Jiangning Liu

2014-Apr-18 02:56 UTC

head link

[LLVMdev] Proposal: AArch64/ARM64 merge from EuroLLVM

Hi Quentin,

Thanks for your kindly help!
> The problem is MOVaddr generated for ARM64  implies introducing adrp in
> ExpandPseudoInsts pass again, although at this moment we don't really
see
> redundant ADRP yet. AArch64 is using ADDxxi_lsl0_s instead, and it will be
> folded into LS32_STR finally.
>
> Interesting.
> Looks like we are too clever here.
> I would have expected ISel to generate one base address and one
> displacement.
>
> I believe that if we fix that both the LOHs and the global merge become
> orthogonal. My guess is that we should be less aggressive at folding offset
> if there are several uses.
>
>Sounds great! I thought I had misunderstanding.
> If simply apply the global merge solution to ARM64, probably we should
>> avoid generating pseudo instruction MOVaddr and friends in ISEL stage,
but
>> I'm not sure if the LOH solution would still work or not, because,
>> 1) ARM64 link-time optimization depends on LOH.
>> 2) We don't see linker plug-in in LLVM trunk and it would be hard
for me
>> to verify any thoughts.
>>
>> The LOH solution is also orthogonal. You can see that as a last chance
>> way to optimize those accesses.
>> That said, if you CSE the ADRP and not the LOADGot, you will indeed
>> create far less candidates for the LOHs because you will have ADRPs
with
>> several uses, which is not supported by LOHs.
>>
>
> Yes. This is just what I'm worrying about. So essentially those two
> optimizations have conflict.
>
> Let us try to fix the codegen problem while keeping the pseudos.
>
Currently, we have the following code for ARM64,

// The MOVaddr instruction should match only when the add is not folded
// into a load or store address.
def MOVaddr
    : Pseudo<(outs GPR64:$dst), (ins i64imm:$hi, i64imm:$low),
             [(set GPR64:$dst, (ARM64addlow (ARM64adrp tglobaladdr:$hi),
                                            tglobaladdr:$low))]>,
      Sched<[WriteAdrAdr]>;

Does it mean ISEL will generate pseudo MOVaddr as long as pattern
"(ARM64addlow (ARM64adrp $hi), $low)" exists? So I think we should
remove
this pseudo, and I don't understand what do you mean by "keeping the
pseudos". Are there any other purposes of introducing pseudo MOVaddr?
> Since compile-time ADRP CSE is not so powerful as link-time ADRP removal,
> I don't want to hurt link-time solution.
>
> Well, this is something that should be measured. Your patch does not kill
> the LOHs, it may just reduce the number of potential candidates. For each
> candidate that your patch removes, it means we at least spare one ADRP
> instruction. The trade-off does not seem bad.
>
> I suggest we:
> 1. Fix the ISel of pseudo (making the folding less aggressive).
> 2. Measure the performance with your patch.
>
> I can definitely help for the measurements with the LOHs enabled in
> parallel with your patch.
> If you want I can help for #1 too.
>
You are so nice and I'm glad that you can help both, because
1) I don't have 64-bit hardware yet
2) I don't have the link plug-in either
3) I will be busy at another high priority bug fix in a week

> Side question, did you happen to measure any performance
> improvement/regression with your patch?
>
I don't have hardware yet, so I myself didn't, but Ana helped me to
measure
it on A53 previously, but the data doesn't show consistency for two
separate measurements, so I was not convinced by that data yet. But the
data does show some sporadic improvements for some tests in EEMBC.

> I’d like to know which tests would be good candidates to measure the
> impact of your patch + LOHs enabled.
>
With my patch only, I expect 256.bzip2 and 252.eon have some performance
change because they have 42% and 52% adrp reduction percentage respectively.

For LOH, I think linker can cover a lot of more cases like the global
variable is not defined in the file being compiled. I need to collect more
data around LOH, and do you have any idea how to measure LOH effect
statically? Counting the number of LOH is enough?

Thanks,
-Jiangning

>
> Thanks,
> -Quentin
>
>
>> Thanks,
>> -Quentin
>>
>>
>> Thanks,
>> -Jiangning
>>
>> _______________________________________________
>> LLVM Developers mailing list
>> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
>> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>>
>>
>>
>
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20140418/9045a196/attachment.html>

Jiangning Liu

2014-Apr-18 03:00 UTC

head link

[LLVMdev] Proposal: AArch64/ARM64 merge from EuroLLVM

Hi Quentin,

BTW, the command line option of enabling CSE ADRP should be,

-fno-common -mllvm -global-merge-on-external=true -mllvm -global-merge=true

If you want to measure the base, the command line should be,

-mllvm -global-merge=false

Thanks,
-Jiangning



2014-04-17 1:16 GMT+08:00 Quentin Colombet <qcolombet at apple.com>:
> Hi Jiangning,
>
> On Apr 15, 2014, at 11:12 PM, Jiangning Liu <liujiangning1 at
gmail.com>
> wrote:
>
> Hi Quentin,
>
> Thanks for your feedback!
>
>> ARM64 generates pseudo instructions ARM64::MOVaddr and friends in ISEL
>> stage, which intends to guarantee address serialization (page address +
>> in-page address), and exposes adrp finally by pass ExpandPseudoInsts.
The
>> assumption of ARM64 solution is we don't know the in-page offset
can be
>> fused into load/store or not at compile time, and this assumption would
>> turn to be not true any longer for the solution of using global merge
as I
>> proposed with the patch.
>>
>> I think this is orthogonal. If you happen to merge globals they will
have
>> the same base address (i.e., the same pseudo instruction) but different
>> offsets.
>> CSE and such will work like a charm for the pseudos.
>>
>
> This is probably not true. Global merge pass happens in PreIsel stage. For
> my test case at http://reviews.llvm.org/D3223, after applying the patch,
> we will have LLVM IR as below,
>
>
>   store i32 %a1, i32* getelementptr inbounds ({ i32, i32, i32 }*
> @_MergedGlobals_x, i32 0, i32 0), align 4
>   store i32 %a2, i32* getelementptr inbounds ({ i32, i32, i32 }*
> @_MergedGlobals_x, i32 0, i32 1), align 4
>
> and after ISEL stage, we can see different Machine Instructions generated
> for AArch64 and ARM64.
>
> AArch64:
>
>         %vreg4<def> = ADRPxi <ga:@_MergedGlobals_x>;
GPR64noxzr:%vreg4
>         LS32_STR %vreg3, %vreg4, <ga:@_MergedGlobals_x>[TF=11]
>         %vreg5<def> = ADDxxi_lsl0_s %vreg4,
<ga:@_MergedGlobals_x>[TF=11];
> GPR64noxzr:%vreg5,%vreg4
>         LS32_STR %vreg2, %vreg5<kill>, 1
>
> ARM64:
>
>         %vreg2<def> = ADRP <ga:@_MergedGlobals_x>[TF=1];
> GPR64common:%vreg2
>         STRWui %vreg0, %vreg2<kill>,
<ga:@_MergedGlobals_x>[TF=18]
>         %vreg3<def> = MOVaddr <ga:@_MergedGlobals_x>[TF=1],
<ga:@_MergedGlobals_x>[TF=18];
> GPR64common:%vreg3
>         STRWui %vreg1, %vreg3<kill>, 1
>
> The problem is MOVaddr generated for ARM64  implies introducing adrp in
> ExpandPseudoInsts pass again, although at this moment we don't really
see
> redundant ADRP yet. AArch64 is using ADDxxi_lsl0_s instead, and it will be
> folded into LS32_STR finally.
>
> Interesting.
> Looks like we are too clever here.
> I would have expected ISel to generate one base address and one
> displacement.
>
> I believe that if we fix that both the LOHs and the global merge become
> orthogonal. My guess is that we should be less aggressive at folding offset
> if there are several uses.
>
>
> Assuming you emit the right instructions at isel time, you will create
>> ADRP, LOADGot, or ADD with symbols. Since you do not know anything on
the
>> symbols, CSE will match only the ones that are identical.
>>
>
> This is correct.
>
>
>> You will have a finer granularity to do CSE, but I am not sure it will
>> help that much.
>>
>
> The 'CSE' here is a term only rather than the traditional CSE.
Since
> global variables are merged into a monolithic data structure, the we will
> be able to generate only one base address (page address) for all of those
> global variables.
>
>
>> On the other hand, you lose the rematerialization capability, because
>> that feature can only handle one instruction at a time. So you will
still
>> be able to rematerialize ADRP but not the LOADGot and ADD with symbols.
>>
>
> Yes, but this depends on register pressure, and it's hard to tell
> rematerialization is always good.
>
> Sure, but it can help :).
>
> If simply apply the global merge solution to ARM64, probably we should
>> avoid generating pseudo instruction MOVaddr and friends in ISEL stage,
but
>> I'm not sure if the LOH solution would still work or not, because,
>> 1) ARM64 link-time optimization depends on LOH.
>> 2) We don't see linker plug-in in LLVM trunk and it would be hard
for me
>> to verify any thoughts.
>>
>> The LOH solution is also orthogonal. You can see that as a last chance
>> way to optimize those accesses.
>> That said, if you CSE the ADRP and not the LOADGot, you will indeed
>> create far less candidates for the LOHs because you will have ADRPs
with
>> several uses, which is not supported by LOHs.
>>
>
> Yes. This is just what I'm worrying about. So essentially those two
> optimizations have conflict.
>
> Let us try to fix the codegen problem while keeping the pseudos.
>
>
>
>> FYI, the LOH optimization is not a link-time optimization in LLVM, this
>> is really a link-time optimization: on the binary.
>>
>
> Yes. I see.
>
>>  Any concrete suggestion of combining those different ADRP CSE
solutions
>> and tests would be appreciated!
>>
>> The bottom line is whatever you are doing with merge globals, it is
>> orthogonal with LOHs.
>> That said I think it is best to keep the pseudo instructions.
>>
>
> Well, if we keep the pseudo instruction MOVaddr, we would have to keep
> adrp and expose it to binary, so it would lose the opportunity of removing
> redundant adrp at compile-time.
>
>
>> Of course I may be wrong and the best way to check would be to measure
>> what happens if you get rid of the pseudo instructions. Do not be too
>> concerned with the impact on the LOHs.
>>
>
> Since compile-time ADRP CSE is not so powerful as link-time ADRP removal,
> I don't want to hurt link-time solution.
>
> Well, this is something that should be measured. Your patch does not kill
> the LOHs, it may just reduce the number of potential candidates. For each
> candidate that your patch removes, it means we at least spare one ADRP
> instruction. The trade-off does not seem bad.
>
> I suggest we:
> 1. Fix the ISel of pseudo (making the folding less aggressive).
> 2. Measure the performance with your patch.
>
> I can definitely help for the measurements with the LOHs enabled in
> parallel with your patch.
> If you want I can help for #1 too.
>
> Side question, did you happen to measure any performance
> improvement/regression with your patch?
> I’d like to know which tests would be good candidates to measure the
> impact of your patch + LOHs enabled.
>
> Thanks,
> -Quentin
>
>
>> Thanks,
>> -Quentin
>>
>>
>> Thanks,
>> -Jiangning
>>
>> _______________________________________________
>> LLVM Developers mailing list
>> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
>> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>>
>>
>>
>
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20140418/75eaf6db/attachment.html>

Quentin Colombet

2014-Apr-18 23:15 UTC

head link

[LLVMdev] Proposal: AArch64/ARM64 merge from EuroLLVM

Hi Jiangning,

On Apr 17, 2014, at 7:56 PM, Jiangning Liu <liujiangning1 at gmail.com>
wrote:
> Hi Quentin,
> 
> Thanks for your kindly help!
>> The problem is MOVaddr generated for ARM64  implies introducing adrp in
ExpandPseudoInsts pass again, although at this moment we don't really see
redundant ADRP yet. AArch64 is using ADDxxi_lsl0_s instead, and it will be
folded into LS32_STR finally.
> 
> Interesting.
> Looks like we are too clever here.
> I would have expected ISel to generate one base address and one
displacement.
> 
> I believe that if we fix that both the LOHs and the global merge become
orthogonal. My guess is that we should be less aggressive at folding offset if
there are several uses.
> 
> 
> Sounds great! I thought I had misunderstanding. 
>>> If simply apply the global merge solution to ARM64, probably we
should avoid generating pseudo instruction MOVaddr and friends in ISEL stage,
but I'm not sure if the LOH solution would still work or not, because,
>>> 1) ARM64 link-time optimization depends on LOH.
>>> 2) We don't see linker plug-in in LLVM trunk and it would be
hard for me to verify any thoughts.
>> 
>> The LOH solution is also orthogonal. You can see that as a last chance
way to optimize those accesses.
>> That said, if you CSE the ADRP and not the LOADGot, you will indeed
create far less candidates for the LOHs because you will have ADRPs with several
uses, which is not supported by LOHs.
>> 
>> Yes. This is just what I'm worrying about. So essentially those two
optimizations have conflict.
> 
> Let us try to fix the codegen problem while keeping the pseudos.
> 
> Currently, we have the following code for ARM64,
> 
> // The MOVaddr instruction should match only when the add is not folded
> // into a load or store address.
> def MOVaddr
>     : Pseudo<(outs GPR64:$dst), (ins i64imm:$hi, i64imm:$low),
>              [(set GPR64:$dst, (ARM64addlow (ARM64adrp tglobaladdr:$hi),
>                                             tglobaladdr:$low))]>,
>       Sched<[WriteAdrAdr]>;
> 
> Does it mean ISEL will generate pseudo MOVaddr as long as pattern
"(ARM64addlow (ARM64adrp $hi), $low)" exists?Yes.
> So I think we should remove this pseudo, and I don't understand what do
you mean by "keeping the pseudos". Are there any other purposes of
introducing pseudo MOVaddr?This is for rematerialization.
What I mean by "keeping the pseudos" is we should try to preserve them
as much as possible if it does not hurt performances. In your example, I think
we could have generated only one pseudo and one displacement but the current
lowering assumed it is better to fold one of them into the store.
>> Since compile-time ADRP CSE is not so powerful as link-time ADRP
removal, I don't want to hurt link-time solution.
> Well, this is something that should be measured. Your patch does not kill
the LOHs, it may just reduce the number of potential candidates. For each
candidate that your patch removes, it means we at least spare one ADRP
instruction. The trade-off does not seem bad.
> 
> I suggest we:
> 1. Fix the ISel of pseudo (making the folding less aggressive).
> 2. Measure the performance with your patch.
> 
> I can definitely help for the measurements with the LOHs enabled in
parallel with your patch.
> If you want I can help for #1 too.
> 
> You are so nice and I'm glad that you can help both, because
> 1) I don't have 64-bit hardware yet
> 2) I don't have the link plug-in either
> 3) I will be busy at another high priority bug fix in a weekI’ll try to make the measurement with and without your patch next week.
>  
> Side question, did you happen to measure any performance
improvement/regression with your patch?
> 
> I don't have hardware yet, so I myself didn't, but Ana helped me to
measure it on A53 previously, but the data doesn't show consistency for two
separate measurements, so I was not convinced by that data yet. But the data
does show some sporadic improvements for some tests in EEMBC.
>  
> I’d like to know which tests would be good candidates to measure the impact
of your patch + LOHs enabled.
> 
> With my patch only, I expect 256.bzip2 and 252.eon have some performance
change because they have 42% and 52% adrp reduction percentage respectively.I don’t have access to EEMBC myself so I’ll focus on
SPEC.> 
> For LOH, I think linker can cover a lot of more cases like the global
variable is not defined in the file being compiled. I need to collect more data
around LOH, and do you have any idea how to measure LOH effect statically?
Counting the number of LOH is enough?The linker is quite good at optimizing the LOHs. I’d say for most 3-instruction
long LOHs, it turns at least one instruction into a nop.
That said, no, you cannot statically measure the effect of LOHs.

Thanks,
-Quentin> 
> Thanks,
> -Jiangning
>  
> 
> Thanks,
> -Quentin
>> 
>> Thanks,
>> -Quentin
>> 
>>> 
>>> Thanks,
>>> -Jiangning
>>> 
>>> _______________________________________________
>>> LLVM Developers mailing list
>>> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
>>> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>> 
>> 
> 
> 
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20140418/73779a71/attachment.html>

llvm dev - Apr 2014 - [LLVMdev] Proposal: AArch64/ARM64 merge from EuroLLVM

[LLVMdev] Proposal: AArch64/ARM64 merge from EuroLLVM

[LLVMdev] Proposal: AArch64/ARM64 merge from EuroLLVM

[LLVMdev] Proposal: AArch64/ARM64 merge from EuroLLVM

[LLVMdev] Proposal: AArch64/ARM64 merge from EuroLLVM