thr3ads.net - llvm dev - [llvm-dev] [RFC] lld: Dropping TLS relaxations in favor of TLSDESC [Nov 2017]


Rui Ueyama via llvm-dev

2017-Nov-08 03:39 UTC

[llvm-dev] [RFC] lld: Dropping TLS relaxations in favor of TLSDESC

On Tue, Nov 7, 2017 at 6:59 PM, Rafael Avila de Espindola <
rafael.espindola at gmail.com> wrote:
> Rui Ueyama via llvm-dev <llvm-dev at lists.llvm.org> writes:
>
> > tl;dr: TLSDESC has solved most problems in formerly inefficient TLS access
> > models, so I think we can drop TLS relaxation support from lld.
> >
> > lld's code to handle relocations is a mess; the code consists of a lot of
> > cascading "if"s and needs a lot of prior knowledge to understand what it is
> > doing. Honestly it is head-scratching and needs serious refactoring. I'm
> > trying to simplify it to make it manageable again, and I'm now focusing on
> > the TLS relaxations.
> >
> > Thread-local variables in ELF are complicated. The ELF TLS specification [1]
> > defines 4 different access models: General Dynamic, Local Dynamic, Initial
> > Exec and Local Exec.
> >
> > I'm not going into the details of the spec here, but the reason why we have
> > so many different models for the same feature is that they differed in
> > speed, and we have to use (formerly) slow models when we know less about
> > their run-time memory layout at compile-time or link-time. So, there was a
> > trade-off between generality and performance. For example, if you want to
> > use thread-local variables in a dlopen(2)'able DSO, you need to choose the
> > slowest model. If a linker knows at link-time that a more restricted access
> > model is applicable (e.g. if it is linking a main executable, it knows for
> > sure that it is not creating a DSO that will be used via dlopen), the linker
> > is allowed to rewrite instructions that load thread-local variables to use a
> > faster access model.
> >
> > What makes the situation more complicated is the presence of a new method
> > of accessing thread-local variables. After the ELF TLS spec was defined,
> > TLSDESC [2] was proposed and implemented. With that method, the General
> > Dynamic and Local Dynamic models (which were pretty slow in the original
> > spec) are as fast as the much faster Initial Exec model. TLSDESC doesn't have
> > a trade-off between dlopen'ability and access speed. According to [2], it
> > also reduces the size of generated DSOs. So it seems like TLSDESC is strictly
> > a better way of accessing thread-local variables than the old way, and the
> > thread-local variable performance problem (which the ELF TLS spec was trying
> > to address by defining four different access models and relaxations between
> > them) doesn't seem to be a real issue anymore.
> >
> > lld supports all TLS relaxations as defined by the ELF TLS spec. I accepted
> > the patches to implement all these features without thinking hard enough
> > about it, but on second thought, that was likely a wrong decision. Being a
> > new linker, we don't need to trace the history of the evolution of the ELF
> > spec. Instead, we should have implemented whatever makes sense now.
> >
> > So, I'd like to propose we drop TLS relaxations from lld, including Initial
> > Exec → Local Exec. Dropping IE→LE is strictly speaking a degradation, but I
> > don't think that is important. We don't have optimizations for much more
> > frequent variable access patterns such as locally-accessed variables that
> > have GOT slots (for which in theory we can skip the GOT access because GOT
> > slot values are known at link-time), so it is odd that we are only serious
> > about TLS variables, which are usually much less important. Even if it turns
> > out that we want it after implementing more important relaxations, I'd like
> > to drop it for now and reimplement it in a different way later.
> >
> > This should greatly simplify the code because it not only reduces the
> > complexity and amount of the existing code, but also reduces the amount of
> > knowledge you need to have to read the code, without sacrificing the
> > performance of lld-generated files in practice.
> >
> > Thoughts?
>
> I don't think we can do it.
>
> The main thing we have to keep in mind is that not everyone is using
> TLSDESC. In fact, clang doesn't even support -mtls-dialect=gnu2.
>
Oh, okay, that is a surprise to me. There's no reason not to support that
and make it the default; I hadn't even tried that. We definitely should
support that.

> If everyone switches to TLSDESC, then I am OK with dropping
> optimizations for the old model.
>
> But even with TLSDESC we still need linker relaxations. The TLSDESC idea
> solves some of the GD -> IE cost in the case where the .so is not
> dlopened, but that is it. Note that AArch64, which is TLSDESC-only, has
> relaxations.
>
> So I am strongly against removing either non-TLSDESC support or support
> for the relaxations.
>
It's still pretty arguable. By default, compilers use General Dynamic model
with -fpic, and Initial Exec without -fpic. lld doesn't do any relaxation
if -shared is given. So, if you are creating a DSO, thread-local variables
in the DSO are accessed using Global Dynamic model. No relaxations are
involved.

If you are creating an executable and if your executable is not
position-independent, you're using Initial Exec model by default which is
as fast as variables accessed through GOT. If you really want to use Local
Exec model, you can pass -ftls-model=local-exec to compilers.

If you are creating a position-independent executable and you want to use
Initial Exec or Local Exec, you can do that by passing
-ftls-model={initial-exec,local-exec} to compilers.

So I don't see a strong reason to do complicated instruction rewriting in
the linker. I feel more like we should do whatever we are instructed to do
by command line options and input object files. You are, for example, free to
pass the -fPIC option to create object files and still let the linker
create a non-PIC executable, even though this combination doesn't make
much sense and produces a slightly inefficient binary. If you don't like it,
you can fix the compiler options. Thread-local variables can be considered
in the same way, no?

Rafael Avila de Espindola via llvm-dev

2017-Nov-08 04:16 UTC


[llvm-dev] [RFC] lld: Dropping TLS relaxations in favor of TLSDESC

Rui Ueyama <ruiu at google.com> writes:
>> So I am strongly against removing either non-TLSDESC support or support
>> for the relaxations.
>>
>
> It's still pretty arguable. By default, compilers use General Dynamic model
> with -fpic, and Initial Exec without -fpic.

It is more complicated than that. You can get all 4 modes with clang

-------------------------------
__thread int bar = 42;
int *foo(void) {  return &bar; }
-------------------------------
without -fPIC: local exec.

-------------------------------
extern __thread int bar;
int *foo(void) {  return &bar; }
-------------------------------
without -fPIC: initial exec.
with -fPIC: general dynamic

-------------------------------
__attribute__((visibility("hidden"))) extern __thread int bar;
int *foo(void) {  return &bar; }
-------------------------------
with -fPIC: local dynamic.
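
The model can also be pinned per variable instead of being inferred from -fPIC and visibility. What follows is a minimal sketch, assuming GCC or Clang on an ELF target where the `tls_model` variable attribute is available; the variable and function names are made up for illustration and do not come from this thread.

```c
/* Each variable is pinned to one of the four ELF TLS access models.
 * The names below are illustrative only. */
__attribute__((tls_model("global-dynamic"))) __thread int tls_gd = 1;
__attribute__((tls_model("local-dynamic")))  __thread int tls_ld = 2;
__attribute__((tls_model("initial-exec")))   __thread int tls_ie = 3;
__attribute__((tls_model("local-exec")))     __thread int tls_le = 4;

/* Sum the four variables. The compiler is asked to use the model named
 * in each attribute; a linker may still relax an access to a cheaper
 * model when it links an executable. */
int sum_tls(void) {
    return tls_gd + tls_ld + tls_ie + tls_le;
}
```

Compiling this with `-S` and reading the assembly should show a different access sequence per variable; the exact sequences depend on the target and are not reproduced here.
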
> lld doesn't do any relaxation
> if -shared is given. So, if you are creating a DSO, thread-local variables
> in the DSO are accessed using Global Dynamic model. No relaxations are
> involved.
There are not a lot of opportunities there. If one patches one access at
a time LD is as expensive as GD. The linker also doesn't know if the .so
will be used with dlopen or not, so it cannot relax to IE. I guess a
linker could have a command line option for the second part.

Now that I spell that out, it is easy to see TLSDESC's big
advantage: it can optimize the case the static linker cannot.
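
To make that point concrete, here is a toy model of the idea, not the real ABI: each access goes through a descriptor whose resolver is chosen when the module is loaded, so a module that ends up in static TLS gets the cheap path without the static linker having known that. All types and function names are hypothetical, and real TLSDESC resolvers return an offset from the thread pointer rather than an address.

```c
#include <stddef.h>

/* Toy model of a thread: one static TLS block plus a table of
 * per-module blocks (a much simplified DTV). */
struct toy_thread {
    char *static_block;     /* TLS for modules loaded at startup */
    char *module_block[4];  /* TLS for dlopen'ed modules, indexed by id */
};

/* Toy TLS descriptor: a resolver picked at load time plus its argument. */
struct toy_tlsdesc {
    char *(*resolve)(const struct toy_tlsdesc *, struct toy_thread *);
    size_t module;  /* used only by the dynamic resolver */
    size_t offset;  /* offset inside the selected block */
};

/* Fast path: the variable lives in static TLS, so the address is just
 * block + offset, comparable to an Initial Exec access. */
char *toy_resolve_static(const struct toy_tlsdesc *d, struct toy_thread *t) {
    return t->static_block + d->offset;
}

/* Slow path: the module was loaded later, so its block is looked up
 * first, comparable to what General Dynamic does on every access. */
char *toy_resolve_dynamic(const struct toy_tlsdesc *d, struct toy_thread *t) {
    return t->module_block[d->module] + d->offset;
}

/* The access site is the same whichever resolver the descriptor holds. */
int *toy_tls_addr(const struct toy_tlsdesc *d, struct toy_thread *t) {
    return (int *)d->resolve(d, t);
}
```

The design point is that the choice between the two resolvers is made by whoever fills in the descriptor at load time, which is exactly the information the static linker lacks for a .so.
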

> If you are creating an executable and if your executable is not
> position-independent, you're using Initial Exec model by default which is
> as fast as variables accessed through GOT. If you really want to use Local
> Exec model, you can pass -ftls-model=local-exec to compilers.

But then all the used variables have to be defined in the same
executable. You can't have even one from a shared library (think errno).

The nice thing about linker relaxations is that they are very user
friendly. The linker is the first point in the toolchain where some
useful facts are known, and it can optimize the result with no user
intervention.

> So I don't see a strong reason to do complicated instruction rewriting in
> the linker. I feel more like we should do whatever we are instructed to do
> by command line options and input object files. You are, for example, free to
> pass the -fPIC option to create object files and still let the linker
> create a non-PIC executable, even though this combination doesn't make
> much sense and produces a slightly inefficient binary. If you don't like it,
> you can fix the compiler options. Thread-local variables can be considered
> in the same way, no?

They are considered in the same way, we also relax GOT accesses :-)

The proposal is making the linker worse for our users to make our lives
easier. I really don't think we should do it.

It is likely that we can code the existing optimizations in a simpler
way. Even if we cannot, I don't think we should remove them.

Linker relaxations are extremely convenient. We use the example you
gave (-fPIC .o in an executable) all the time in llvm. That way we build
only one .o that is used in lib/ and bin/.

Linker relaxations are also fundamental to how RISC-V works.

Cheers,
Rafael

Rui Ueyama via llvm-dev

2017-Nov-08 04:49 UTC


[llvm-dev] [RFC] lld: Dropping TLS relaxations in favor of TLSDESC

On Tue, Nov 7, 2017 at 8:16 PM, Rafael Avila de Espindola <
rafael.espindola at gmail.com> wrote:
> Rui Ueyama <ruiu at google.com> writes:
>
> >> So I am strongly against removing either non-TLSDESC support or support
> >> for the relaxations.
> >>
> >
> > It's still pretty arguable. By default, compilers use General Dynamic model
> > with -fpic, and Initial Exec without -fpic.
>
> It is more complicated than that. You can get all 4 modes with clang
>
> -------------------------------
> __thread int bar = 42;
> int *foo(void) {  return &bar; }
> -------------------------------
> without -fPIC: local exec.
>
> -------------------------------
> extern __thread int bar;
> int *foo(void) {  return &bar; }
> -------------------------------
> without -fPIC: initial exec.
> with -fPIC: general dynamic
>
> -------------------------------
> __attribute__((visibility("hidden"))) extern __thread int bar;
> int *foo(void) {  return &bar; }
> -------------------------------
> with -fPIC: local dynamic.

The other case is

__attribute__((visibility("hidden"))) extern __thread int bar;
int *foo(void) {  return &bar; }

without -fPIC, which chooses Local Exec.

>
> > lld doesn't do any relaxation
> > if -shared is given. So, if you are creating a DSO, thread-local variables
> > in the DSO are accessed using Global Dynamic model. No relaxations are
> > involved.
>
> There are not a lot of opportunities there. If one patches one access at
> a time LD is as expensive as GD. The linker also doesn't know if the .so
> will be used with dlopen or not, so it cannot relax to IE. I guess a
> linker could have a command line option for the second part.
>
> Now that I spell that out, it is easy to see TLSDESC's big
> advantage: it can optimize the case the static linker cannot.

Because of this fact, DSOs that use thread-local variables, such as libc, are
already compiled with -ftls-model=initial-exec. So the authors of DSOs in
which the performance of thread-local variables matters are already aware of
the issue and how to work around it.

> > If you are creating an executable and if your executable is not
> > position-independent, you're using Initial Exec model by default which is
> > as fast as variables accessed through GOT. If you really want to use Local
> > Exec model, you can pass -ftls-model=local-exec to compilers.
>
> But then all the used variables have to be defined in the same
> executable. You can't have even one from a shared library (think errno).
>
Not really -- you can still use Local Exec on a per-variable basis using the
visibility attribute. I don't think that we can observe a noticeable
difference in performance between Initial Exec and Local Exec except in a
synthetic benchmark, though.
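
As a rough sketch of the per-variable approach, assuming GCC or Clang on an ELF target: a hidden thread-local defined in the executable can sit next to errno, which still comes from libc. The names `call_count` and `record_failure` are hypothetical.

```c
#include <errno.h>

/* Hidden visibility tells the compiler the variable is defined in this
 * module, so it can pick a local model (Local Exec without -fPIC). */
__attribute__((visibility("hidden"))) __thread int call_count = 0;

/* errno remains a libc-provided thread-local and is accessed however
 * libc's headers arrange it; it does not constrain call_count. */
int record_failure(int code) {
    errno = code;
    return ++call_count;
}
```
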

> The nice thing about linker relaxations is that they are very user
> friendly. The linker is the first point in the toolchain where some
> useful facts are known, and it can optimize the result with no user
> intervention.

I think I agree with this point. Automatic linker code relaxation is
convenient and if it makes a difference, we should implement that. But I
doubt that TLS relaxation is actually effective. George implemented them
because there's a spec defining how to relax them, and I accepted the
patches without thinking hard enough, but I didn't see a convincing
benchmark result (or even a non-convincing one) that shows that these
relaxations actually make real-world programs faster. Do you know of
any? It is funny that even the creator of TLSDESC found that their
optimization didn't actually make NPTL faster, as mentioned in the
"Conclusion" section of
http://www.fsfla.org/~lxoliva/writeups/TLS/RFC-TLSDESC-x86.txt.

So I don't think I'm proposing we simplify code by degrading users' code.
It feels more like we are making too much effort on something that doesn't
produce any measurable difference in real life.

> > So I don't see a strong reason to do complicated instruction rewriting in
> > the linker. I feel more like we should do whatever we are instructed to do
> > by command line options and input object files. You are, for example, free
> > to pass the -fPIC option to create object files and still let the linker
> > create a non-PIC executable, even though this combination doesn't make
> > much sense and produces a slightly inefficient binary. If you don't like it,
> > you can fix the compiler options. Thread-local variables can be considered
> > in the same way, no?
>
> They are considered in the same way, we also relax GOT accesses :-)
>
> The proposal is making the linker worse for our users to make our lives
> easier. I really don't think we should do it.
>
> It is likely that we can code the existing optimizations in a simpler
> way. Even if we cannot, I don't think we should remove them.
>
> Linker relaxations are extremely convenient. We use the example you
> gave (-fPIC .o in an executable) all the time in llvm. That way we build
> only one .o that is used in lib/ and bin/.
>
> Linker relaxations are also fundamental to how RISCV works.
>
> Cheers,
> Rafael