Kirill Batuzov
2014-Feb-26 11:23 UTC
[LLVMdev] [Valgrind-developers] [GSoC 2014] Using LLVM as a code-generation backend for Valgrind
Hi,

Only one letter got to the valgrind-developers mailing list, so I'll quote the first message of the thread so that those who do not read llvmdev know what this discussion is about.

=== Begin of the first message ===

> Hi,
>
> I've seen on the LLVM Open Projects page [1] an idea about using LLVM to
> generate native code in Valgrind. From what I know, Valgrind uses libVEX
> to translate native instructions into a bitcode, used to add the
> instrumentation and then translated back to native code for execution.
>
> Valgrind and LLVM are two tools that I use nearly every day. I'm also
> very interested in code generation and optimization, so adding the
> possibility to use LLVM to generate native code in libVEX interests me
> very much. Is it a good idea? Could an LLVM backend bring something
> useful to Valgrind (for instance, faster execution or more supported
> targets)?
>
> I've sent this message to the LLVM and Valgrind mailing lists because
> I originally found the idea on the LLVM website, but Valgrind is the
> object of the idea. By the way, does anyone already know if LLVM or
> Valgrind will be a mentoring organization for this year's GSoC?
>
> You can find in [2] the list of my past projects. During the GSoC 2011,
> I had the chance to use the Clang libraries to compile C code, and the
> LLVM JIT to execute it (with instrumented stdlib functions). I have also
> played with the LLVM C bindings to generate code when I explored some
> parts of Mesa.
>
> Denis Steckelmacher
>
> [1] : http://llvm.org/OpenProjects.html#misc_new
> [2] : http://steckdenis.be/page-projects.html

=== End of the first message ===

The idea of using an LLVM backend in a dynamic binary translation (DBT) project has become popular recently. Unfortunately it has not proved to be a good one. I suggest you check the related work in QEMU. The DBT part of both QEMU and Valgrind works in a similar way, and there have been a number of attempts to use LLVM as a QEMU backend. Mostly they resulted in slowdowns.
In [1] the authors reported a 35x slowdown; in [2] there was around a 2x slowdown. Finally, in [3] the authors reported a performance gain, but there are some catches.

1. They used LLVM not only for the backend: they replaced the internal representation with LLVM IR. This is not an option for Valgrind, because you would need to rewrite all existing tools (including third-party ones) to do it.

2. They used SPEC CPU benchmarks to measure their speedup. The important thing about these tests is that they have little code to translate but a lot of computation to do with the translated code. Even so, some of the tests do not do well (like 403.gcc). On real-life applications (like Firefox), where there is a lot of library code to translate and not so much computation to do, the results may be totally different.

LLVM does not do well as a DBT backend for two main reasons. First, in DBT you translate while the application is running, so you need to translate really fast, and a compiler is not optimized for that task. The LLVM JIT? Maybe. Second, in DBT you translate code in small portions, like basic blocks or extended basic blocks. These have a very simple structure: there are no loops, and there is none of the redundancy that comes from translating a high-level language down to a low-level one. There is nothing sophisticated optimizations can do better than very simple ones.

In conclusion, I second what has already been said: this project sounds like fun to do, but do not expect much practical result from it.

> It would also be interesting to cache the LLVM-generated code
> between runs

The tricky part here is building a matching between binary code fragments and cached translations from previous runs. In the worst case, all you know about the binary code is its address (which can vary between runs) and the code bytes themselves.

[1] : "Dynamically Translating x86 to LLVM using QEMU"
      http://infoscience.epfl.ch/record/149975/files/x86-llvm-translator-chipounov_2.pdf
[2] : llvm-qemu project.
      http://code.google.com/p/llvm-qemu/
[3] : "LnQ: Building High Performance Dynamic Binary Translator with Existing Compiler Backends"
      http://people.cs.nctu.edu.tw/~chenwj/slide/paper/lnq.pdf

-- Kirill
Julian Seward
2014-Feb-26 15:16 UTC
[LLVMdev] [Valgrind-developers] [GSoC 2014] Using LLVM as a code-generation backend for Valgrind
On 02/26/2014 12:23 PM, Kirill Batuzov wrote:

I tend to agree with Kirill. It would be great to make Valgrind/Memcheck faster, and there are certainly ways to do that, but using LLVM is not one of them.

> Second, in DBT you translate code in small portions, like basic blocks
> or extended basic blocks. These have a very simple structure: there
> are no loops, and there is no redundancy from translating a high-level
> language to a low-level one. There is nothing sophisticated
> optimizations can do better than very simple ones.

Yes. One of the problems of the "let's use LLVM and it'll all go much faster" concept is that it lacks a careful analysis of what makes Valgrind (and QEMU, probably) run slowly in the first place.

As Kirill says, the short blocks of code that V generates make it impossible for LLVM to do sophisticated loop optimisations etc. Given what Valgrind's JIT has to work with -- straight-line pieces of code -- it generally does a not-bad job of instruction selection and register allocation, and I wouldn't expect that substituting LLVM's implementation thereof would make much of a difference.

What would make Valgrind faster is:

(1) Improve the caching of guest registers in host registers across
    basic block boundaries. Currently all guest registers cached in
    host registers are flushed back into memory at block boundaries,
    and no host register holds any live value across the boundary.
    This is simple but very suboptimal, creating large amounts of
    memory traffic.

(2) Improve the way the guest program counter is represented.
    Currently it is updated before every memory access, so that an
    unwind is possible if one is required. But this again causes lots
    of excess memory traffic. This is closely related to (1).

(3) Add some level of control-flow if-then-else support to the IR, so
    that the fast-case paths for the Memcheck helper functions
    (helperc_LOADV64le etc.) can be generated inline.

(4) Redesign Memcheck's shadow memory implementation to use a 1-level
    map rather than 2 levels as at present. Or something more
    TLB-like.

I suspect that the combination of (1) and (2) causes processor write buffers to fill up and start stalling, although I don't have numbers to prove that. What _is_ very obvious from profiling Memcheck using Cachegrind is that the generated code contains a much higher proportion of memory references than "normal integer code". In particular, it contains perhaps 4 times as many stores as "normal integer code", which can't be a good thing.

(3) is a big exercise -- much work -- but potentially very beneficial. (4) is also important, if only because we need a multithreaded implementation of Memcheck. (1) and (2) are smaller projects and would constitute a refinement of the existing code generation framework.

> In conclusion, I second what has already been said: this project
> sounds like fun to do, but do not expect much practical result from
> it.

The above projects (1)..(4) would also be fun :-) and might generate more immediate speedups for Valgrind.

J
Yan
2014-Feb-26 15:21 UTC
[LLVMdev] [Valgrind-developers] [GSoC 2014] Using LLVM as a code-generation backend for Valgrind
For (3), would something like making all statements conditional (the way LoadG, StoreG, and Exit already are) do, or are we talking about something more complex?

On Wed, Feb 26, 2014 at 7:16 AM, Julian Seward <jseward at acm.org> wrote:

> [...]
> (3) add some level of control-flow if-then-else support to the IR, so
>     that the fast-case paths for the memcheck helper functions
>     (helperc_LOADV64le etc) can be generated inline.
> [...]