thr3ads.net - llvm dev - [LLVMdev] [ARM] [PIC] optimizing the loading of hidden global variable [Mar 2014]

Tim Northover

2014-Mar-14 14:07 UTC

[LLVMdev] [ARM] [PIC] optimizing the loading of hidden global variable

>> Any thoughts?
>
> I'm now struggling to see how GCC justifies it. What if a different
> translation-unit declared those variables in a different order? I also
> can't get the same behaviour here, do you have a more complete
> command-line?

Ah, I see; the translation-unit that does the optimisation needs to
have them as a definition (i.e. "= {0}") rather than a declaration for
the optimisation to kick in, giving it precedence over other
declarations. And the hidden-visibility means they won't be
R_ARM_COPYed out of their initial location.

After a very brief thought, I'd still go for GlobalMerge now, in
conjunction with an enhanced "alias" so that you could emit something
like:

    @g1 = hidden alias [100 x i32]* bitcast(i32* getelementptr([300 x i32]* @Merged, i32 0, i32 0) to [100 x i32]*)

We certainly don't seem to handle this alias properly now though, and
it may violate the intended uses. Rafael's doing some thinking about
"alias" at the moment, so I've CCed him.

Would that be a horrific abuse of the poor alias system?
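
To make the merged layout concrete, here is a minimal C sketch of what GlobalMerge plus such aliases would roughly amount to: the three arrays become fields of one object, so each is the base address plus a compile-time constant. The names `merged_globals`, `merged_base` and `merged_member` are made up for illustration, and the sketch assumes the arrays end up adjacent and in this order. It is a picture of the idea, not what GlobalMerge literally emits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical C view of the merged object: one 300-element blob that
 * holds what used to be three separate hidden arrays. */
struct merged_globals {
    int32_t g0[100];
    int32_t g1[100];
    int32_t g2[100];
};

static struct merged_globals Merged;

/* With a single object, the member offsets are known at compile time. */
static_assert(offsetof(struct merged_globals, g1) == 400, "g1 is 400 bytes in");
static_assert(offsetof(struct merged_globals, g2) == 800, "g2 is 800 bytes in");

/* The one address that actually has to be materialised. */
struct merged_globals *merged_base(void)
{
    return &Merged;
}

/* Address of former global number `which` (0, 1 or 2), derived from the base. */
int32_t *merged_member(struct merged_globals *base, int which)
{
    assert(which >= 0 && which < 3);
    return (int32_t *)((char *)base + (size_t)which * sizeof base->g0);
}
```

The cost of this layout is that the three arrays are no longer separate objects from the linker's point of view.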

Cheers.

Tim.

Rafael Espíndola

2014-Mar-14 16:03 UTC

> After a very brief thought, I'd still go for GlobalMerge now, in
> conjunction with an enhanced "alias" so that you could emit something
> like:
>
>     @g1 = hidden alias [100 x i32]* bitcast(i32* getelementptr([300 x i32]* @Merged, i32 0, i32 0) to [100 x i32]*)
>
> We certainly don't seem to handle this alias properly now though, and
> it may violate the intended uses. Rafael's doing some thinking about
> "alias" at the moment, so I've CCed him.
>
> Would that be a horrific abuse of the poor alias system?

I think it would :-) Folding objects like this prevents the linker
from deleting one of them if it is unused, for example.

I think it is just a missing optimization in the ARM backend. If it
knows multiple objects are in the same DSO, it can use the address of
one to find the other.

Given:

@g0 = hidden global [100 x i32] zeroinitializer, align 4
@g1 = hidden global [100 x i32] zeroinitializer, align 4
define void @foo() {
  tail call void @bar(i8* bitcast ([100 x i32]* @g0 to i8*))
  tail call void @bar(i8* bitcast ([100 x i32]* @g1 to i8*))
  ret void
}
declare void @bar(i8*)

The command "llc -mtriple=i686-pc-linux -relocation-model=pic" produces

calll .L0$pb
.L0$pb:
popl %ebx
.Ltmp3:
addl $_GLOBAL_OFFSET_TABLE_+(.Ltmp3-.L0$pb), %ebx
leal g0@GOTOFF(%ebx), %eax
movl %eax, (%esp)
calll bar@PLT
leal g1@GOTOFF(%ebx), %eax
movl %eax, (%esp)
calll bar@PLT

Which is OK, since the add of ebx is folded into the leal and the
constant is an immediate on x86.

On ARM, that is not the case. We produce

        ldr     r0, .LCPI0_0
        add     r4, pc, r0 // r4 is the equivalent of ebx in the x86 case.
        ldr     r0, .LCPI0_1 // r0 is the constant that is an immediate in x86.
        add     r0, r0, r4 // that is the add that is folded in x86
...
.LCPI0_0:
        .long   _GLOBAL_OFFSET_TABLE_-(.LPC0_0+8)
.LCPI0_1:
        .long   g0(GOTOFF)

For ARM, codegen already keeps track of offsets so it can implement
the constant islands, so it should be able to see that the two globals
are close enough that the offset between them fits an immediate.
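
As a rough illustration of "fits an immediate", here is a minimal C sketch of the check, assuming ARM (A32) state, where a data-processing immediate is an 8-bit value rotated right by an even amount. Thumb-2 uses a different encoding, which is not modelled. The function names are made up for this sketch and are not LLVM APIs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Rotate a 32-bit value left by n bits, 0 <= n < 32. */
static uint32_t rotl32(uint32_t v, unsigned n)
{
    assert(n < 32);
    return n == 0 ? v : (v << n) | (v >> (32 - n));
}

/* True if v is encodable as an A32 data-processing immediate:
 * an 8-bit value rotated right by an even amount (0, 2, ..., 30). */
bool fits_a32_immediate(uint32_t v)
{
    for (unsigned rot = 0; rot < 32; rot += 2) {
        /* Undo a rotate-right by `rot`; what remains must fit in 8 bits. */
        if (rotl32(v, rot) <= 0xFFu)
            return true;
    }
    return false;
}

/* True if `target` can be formed from `base` with a single ADD or SUB. */
bool reachable_with_one_add(uint32_t base, uint32_t target)
{
    uint32_t delta = target - base; /* wraps modulo 2^32 */
    return fits_a32_immediate(delta) || fits_a32_immediate(0u - delta);
}
```

If g0 and g1 were laid out back to back, the delta would be 400 bytes (100 x 4), which is encodable (0x64 shifted left by 2). Whether the two arrays really end up that close is up to the assembler and linker, which is the open question here.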

Nick, will this work on MachO or can ld64 move _g0, _g1 and _g2 too far apart?

BTW, what will gcc produce for

void init(void *);
extern int g0[100] __attribute__((visibility("hidden")));
extern int g1[100] __attribute__((visibility("hidden")));
extern int g2[100] __attribute__((visibility("hidden")));
void foo() {
  init(&g0);
  init(&g1);
  init(&g2);
}

Cheers,
Rafael

Nick Kledzik

2014-Mar-14 16:35 UTC

On Mar 14, 2014, at 9:03 AM, Rafael Espíndola <rafael.espindola at gmail.com> wrote:
> 
> On ARM, that is not the case. We produce
> 
>        ldr     r0, .LCPI0_0
>        add     r4, pc, r0 // r4 is the equivalent of ebx in the x86 case.
>        ldr     r0, .LCPI0_1 // r0 is the constant that is an immediate in x86.
>        add     r0, r0, r4 // that is the add that is folded in x86
> ...
> .LCPI0_0:
>        .long   _GLOBAL_OFFSET_TABLE_-(.LPC0_0+8)
> .LCPI0_1:
>        .long   g0(GOTOFF)
> 
> For ARM, codegen already keeps track of offsets so it can implement
> the constant islands, so it should be able to see that the two globals
> are close enough that the offset between them fits an immediate.
> 
> Nick, will this work on MachO or can ld64 move _g0, _g1 and _g2 too far apart?

When this is compiled, you only know that g0, g1, and g2 will be in the same
linkage unit.  You don’t know if they will come from the same translation unit.
You don’t know how big the overall __DATA segment will be, so yes, it is
quite possible g0, g1, and g2 will be more than 64KB apart.

Also, 32-bit ARM for Mach-O does not use a GOT.  The compiler does create
GOT-like slots called non-lazy pointers for accessing symbols defined outside
the translation unit.  Given that the arrays are declared hidden, they will
be defined in the linkage unit.  So, ideally, the non-lazy-pointer indirection
could be removed and the code could access the array directly.  The problem
is that Mach-O has no relocation for pointer diffs where one of the pointers is
undefined.  Currently, the only way to get the optimal code (no non-lazy
pointer) is to build with LTO.

-Nick

Weiming Zhao

2014-Mar-14 18:34 UTC

Hi Rafael,

Yes, merging GVs prevents the linker from doing garbage collection. Should it be
implemented as a peephole pass? If we do it too early, the distances between GVs
are not fixed yet.

PS:
Below is the GCC output with "extern" hidden:
	ldr	r2, .L2
	stmfd	sp!, {r3, lr}
	.save {r3, lr}
.LPIC0:
	add	r0, pc, r2
	bl	_Z4initPv(PLT)
	ldr	r1, .L2+4
.LPIC1:
	add	r0, pc, r1
	bl	_Z4initPv(PLT)
	ldr	r0, .L2+8
.LPIC2:
	add	r0, pc, r0
	ldmfd	sp!, {r3, lr}
	b	_Z4initPv(PLT)
.L3:
	.align	2
.L2:
	.word	g0-(.LPIC0+8)
	.word	g1-(.LPIC1+8)
	.word	g2-(.LPIC2+8)
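
For readers unfamiliar with the `sym-(.LPICn+8)` idiom in the literal pool above, here is a small C model of the arithmetic. It assumes ARM (A32) state, where reading pc yields the address of the current instruction plus 8. The function names and the addresses in the usage note are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* In A32 state, reading pc gives the instruction's own address + 8. */
#define A32_PC_READ_OFFSET 8u

/* Value stored by ".word sym-(.LPICn+8)", where .LPICn labels the add. */
uint32_t literal_for(uint32_t sym_addr, uint32_t add_insn_addr)
{
    assert((add_insn_addr & 3u) == 0); /* A32 instructions are word aligned */
    return sym_addr - (add_insn_addr + A32_PC_READ_OFFSET);
}

/* Result of "add r0, pc, rN" executed at add_insn_addr with rN = literal. */
uint32_t add_pc(uint32_t add_insn_addr, uint32_t literal)
{
    assert((add_insn_addr & 3u) == 0);
    return add_insn_addr + A32_PC_READ_OFFSET + literal;
}
```

With an add at 0x8010 and g0 at 0x20000, the stored word is 0x17FE8 and the add yields 0x20000 again. If the whole image is loaded 0x1000 higher, the same word yields 0x21000, which is why the sequence is position independent. Note that each global still needs its own literal load here, so GCC does not derive g1 or g2 from g0 in the "extern" case.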

Thanks,
Weiming

Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The
Linux Foundation

