(IIIT) Siddharth Bhat via llvm-dev
2017-Dec-16 01:51 UTC
[llvm-dev] Replace call stack with an equivalent on the heap?
Hello, I'm implementing a custom Haskell-to-LLVM compiler, and in my experimentation, noticed that GHC is much slower than clang certain examples, such as the ackermann function. However, from reading their respective IRs (Cmm for GHC and LLVM for clang), I don't really see much of a difference. Here is a link to the numbers. (n, m) are the parameters to the ackermann function <https://gist.github.com/bgamari/bd424e82d96ddb7b9e67c5e51cdcc5ec> One major difference is that GHC uses a "stack-in-the-heap", in the sense that the a is a chunk of heap memory that functions effectively as call stack. It is managed explicitly by GHC. GHC does not use the native call stack _at all_. This is to implement continuation passing, in a sense. I want to check that the difference in the performance is truly from this "stack in the heap" behavior, so I want to teach a backend to generate code that looks like this. Is code generating like this something that I could reasonably teach the backend? To be specific, I wish to teach the backend to manipulate a "custom stack". I assume this would affect codegen, because one can no longer use push/pop/call/ret. Thanks, Siddharth -- Sending this from my phone, please excuse any typos! -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20171216/32f2f13c/attachment.html>
Kuba Ober via llvm-dev
2017-Dec-16 03:25 UTC
[llvm-dev] Replace call stack with an equivalent on the heap?
> 15 dec. 2017 kl. 20:51 skrev (IIIT) Siddharth Bhat via llvm-dev <llvm-dev at lists.llvm.org>: > > One major difference is that GHC uses a "stack-in-the-heap", in the sense that the a is a chunk of heap memory that functions effectively as call stack. It is managed explicitly by GHC. GHC does not use the native call stack _at all_. This is to implement continuation passing, in a sense. > > I want to check that the difference in the performance is truly from this "stack in the heap" behavior, so I want to teach a backend to generate code that looks like this.If the indirect jumps via this stack are all made from the same location in the GHC runtime, then perhaps this might kill branch prediction. call/ret addresses are duplicated in a branch predictor’s own hardware stack. If the runtime doesn’t use call/ret, not even indirect calls, then this functionality is lost and general branch prediction logic has to kick in. The predictors may then either fail for a long time before a pattern is recognized via the jump site in the runtime, or may even not recognize it and always mispredict. There may be differences between perceptron and other techniques in this case, so see if it’s equally bad on chips that use perceptrons (some AMD?) if yours doesn’t. There should be a smoking gun somewhere in the performance monitoring registers then, I’d hope. It’d be very obvious - persistent branch misprediction at the level of functions costs dearly. I have rather cursory knowledge in this area so perhaps the state of the art is way ahead of my imagination, or perhaps I’m dead wrong. Cheers, Kuba
Siddharth Bhat via llvm-dev
2017-Dec-16 12:18 UTC
[llvm-dev] Replace call stack with an equivalent on the heap?
Hey, Thanks for that idea, I had not considered that. I ran perf on the same examples as before. However, I do not see a large difference in branch mis-predicts. Link to gist with numbers here <https://gist.github.com/bollu/7a9989a727ed4bf1c118dbcf386d4fc1>. Is there something else that I missing? The instruction mix, perhaps? The C version has roughly 1 more instruction per cycle. I am not sure how significant that is, however. Is there some other way to pin down what is going on in terms of slowdown? (read the asm? profile?) Thanks, ~Siddharth On Sat, 16 Dec 2017 at 04:25 Kuba Ober via llvm-dev <llvm-dev at lists.llvm.org> wrote:> > > > 15 dec. 2017 kl. 20:51 skrev (IIIT) Siddharth Bhat via llvm-dev < > llvm-dev at lists.llvm.org>: > > > > One major difference is that GHC uses a "stack-in-the-heap", in the > sense that the a is a chunk of heap memory that functions effectively as > call stack. It is managed explicitly by GHC. GHC does not use the native > call stack _at all_. This is to implement continuation passing, in a sense. > > > > I want to check that the difference in the performance is truly from > this "stack in the heap" behavior, so I want to teach a backend to > generate code that looks like this. > > If the indirect jumps via this stack are all made from the same location > in the GHC runtime, then perhaps this might kill branch prediction. > call/ret addresses are duplicated in a branch predictor’s own hardware > stack. If the runtime doesn’t use call/ret, not even indirect calls, then > this functionality is lost and general branch prediction logic has to kick > in. The predictors may then either fail for a long time before a pattern is > recognized via the jump site in the runtime, or may even not recognize it > and always mispredict. There may be differences between perceptron and > other techniques in this case, so see if it’s equally bad on chips that use > perceptrons (some AMD?) if yours doesn’t. > > There should be a smoking gun somewhere in the performance monitoring > registers then, I’d hope. It’d be very obvious - persistent branch > misprediction at the level of functions costs dearly. > > I have rather cursory knowledge in this area so perhaps the state of the > art is way ahead of my imagination, or perhaps I’m dead wrong. > > Cheers, Kuba > _______________________________________________ > LLVM Developers mailing list > llvm-dev at lists.llvm.org > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev >-- Sending this from my phone, please excuse any typos! -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20171216/13649a78/attachment.html>
Reasonably Related Threads
- LLVM and heap-allocated thread stacks
- [LLVMdev] [RFC] BlockFrequency is the wrong metric; we need a new one
- [LLVMdev] Please review my patch to make GHC calling convention work on ARM
- Make LLD output COFF relocatable object file (like ELF's -r does). How much work is required to implement this?
- Make LLD output COFF relocatable object file (like ELF's -r does). How much work is required to implement this?