thr3ads.net - llvm dev - [llvm-dev] Replace call stack with an equivalent on the heap? [Dec 2017]

If this information is useful, please help other people find it:
Share via:

(IIIT) Siddharth Bhat via llvm-dev

2017-Dec-16 01:51 UTC

[llvm-dev] Replace call stack with an equivalent on the heap?

Hello,

I'm implementing a custom Haskell-to-LLVM compiler, and in my
experimentation, noticed that GHC is much slower than clang certain
examples, such as the ackermann function. However, from reading their
respective IRs (Cmm for GHC and LLVM for clang), I don't really see much of
a difference. Here is a link to the numbers. (n, m) are the parameters to
the ackermann function
<https://gist.github.com/bgamari/bd424e82d96ddb7b9e67c5e51cdcc5ec>

One major difference is that GHC uses a "stack-in-the-heap", in the
sense
that the a is a chunk of heap memory that functions effectively as call
stack. It is managed explicitly by GHC. GHC does not use the native call
stack _at all_. This is to implement continuation passing, in a sense.

I want to check that the difference in the performance is truly from this
"stack in  the heap" behavior, so I want to teach a backend to
generate
code that looks like this.

Is code generating  like this something that I could reasonably teach the
backend? To be specific, I wish to teach the backend to manipulate a
"custom stack". I assume this would affect codegen, because one can no
longer use push/pop/call/ret.


Thanks,
Siddharth
-- 
Sending this from my phone, please excuse any typos!
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20171216/32f2f13c/attachment.html>

Kuba Ober via llvm-dev

2017-Dec-16 03:25 UTC

head link

[llvm-dev] Replace call stack with an equivalent on the heap?

> 15 dec. 2017 kl. 20:51 skrev (IIIT) Siddharth Bhat via llvm-dev
<llvm-dev at lists.llvm.org>:
> 
> One major difference is that GHC uses a "stack-in-the-heap", in
the sense that the a is a chunk of heap memory that functions effectively as
call stack. It is managed explicitly by GHC. GHC does not use the native call
stack _at all_. This is to implement continuation passing, in a sense.
> 
> I want to check that the difference in the performance is truly from this
"stack in  the heap" behavior, so I want to teach a backend to
generate code that looks like this.
If the indirect jumps via this stack are all made from the same location in the
GHC runtime, then perhaps this might kill branch prediction. call/ret addresses
are duplicated in a branch predictor’s own hardware stack. If the runtime
doesn’t use call/ret, not even indirect calls, then this functionality is lost
and general branch prediction logic has to kick in. The predictors may then
either fail for a long time before a pattern is recognized via the jump site in
the runtime, or may even not recognize it and always mispredict. There may be
differences between perceptron and other techniques in this case, so see if it’s
equally bad on chips that use perceptrons (some AMD?) if yours doesn’t.

There should be a smoking gun somewhere in the performance monitoring registers
then, I’d hope. It’d be very obvious - persistent branch misprediction at the
level of functions costs dearly.

I have rather cursory knowledge in this area so perhaps the state of the art is
way ahead of my imagination, or perhaps I’m dead wrong.

Cheers, Kuba

Siddharth Bhat via llvm-dev

2017-Dec-16 12:18 UTC

head link

[llvm-dev] Replace call stack with an equivalent on the heap?

Hey,

Thanks for that idea, I had not considered that. I ran perf on the same
examples as before. However, I do not see a large difference in branch
mis-predicts. Link to gist with numbers here
<https://gist.github.com/bollu/7a9989a727ed4bf1c118dbcf386d4fc1>. Is there
something else that I missing? The instruction mix, perhaps? The C version
has roughly 1 more instruction per cycle. I am not sure how significant
that is, however.

Is there some other way to pin down what is going on in terms of slowdown?
(read the asm? profile?)

Thanks,
~Siddharth

On Sat, 16 Dec 2017 at 04:25 Kuba Ober via llvm-dev <llvm-dev at
lists.llvm.org>
wrote:
>
>
> > 15 dec. 2017 kl. 20:51 skrev (IIIT) Siddharth Bhat via llvm-dev <
> llvm-dev at lists.llvm.org>:
> >
> > One major difference is that GHC uses a "stack-in-the-heap",
in the
> sense that the a is a chunk of heap memory that functions effectively as
> call stack. It is managed explicitly by GHC. GHC does not use the native
> call stack _at all_. This is to implement continuation passing, in a sense.
> >
> > I want to check that the difference in the performance is truly from
> this "stack in  the heap" behavior, so I want to teach a backend
to
> generate code that looks like this.
>
> If the indirect jumps via this stack are all made from the same location
> in the GHC runtime, then perhaps this might kill branch prediction.
> call/ret addresses are duplicated in a branch predictor’s own hardware
> stack. If the runtime doesn’t use call/ret, not even indirect calls, then
> this functionality is lost and general branch prediction logic has to kick
> in. The predictors may then either fail for a long time before a pattern is
> recognized via the jump site in the runtime, or may even not recognize it
> and always mispredict. There may be differences between perceptron and
> other techniques in this case, so see if it’s equally bad on chips that use
> perceptrons (some AMD?) if yours doesn’t.
>
> There should be a smoking gun somewhere in the performance monitoring
> registers then, I’d hope. It’d be very obvious - persistent branch
> misprediction at the level of functions costs dearly.
>
> I have rather cursory knowledge in this area so perhaps the state of the
> art is way ahead of my imagination, or perhaps I’m dead wrong.
>
> Cheers, Kuba
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>-- 
Sending this from my phone, please excuse any typos!
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20171216/13649a78/attachment.html>

Apparently Analagous Threads

Search for more possibly parallel threads

llvm dev - Dec 2017 - Replace call stack with an equivalent on the heap?

[llvm-dev] Replace call stack with an equivalent on the heap?

[llvm-dev] Replace call stack with an equivalent on the heap?

[llvm-dev] Replace call stack with an equivalent on the heap?

Apparently Analagous Threads