Duncan P. N. Exon Smith via llvm-dev
2021-Jan-27 05:23 UTC
[llvm-dev] RFC: Add an llvm::vfs::OutputManager to allow Clang to virtualize compiler outputs
TL;DR: Let's virtualize compiler outputs in Clang. These patches would get us started:
- https://reviews.llvm.org/D95501 Add llvm::vfs::OutputManager
- https://reviews.llvm.org/D95502 Initial adoption of llvm::vfs::OutputManager in Clang.

Questions for the reader
- Should we virtualize compiler outputs in Clang? (Hint: yes.)
- Does this support belong in LLVM? (I think it does, so that non-Clang tools can easily reuse it.)
- Is `llvm::vfs::` a reasonable namespace? (If not, suggestions? I think `llvm::` itself is too broad.)
- Do you have a use case that this won't address well?
- Should that be fixed in the initial patch, or could this be evolved in-tree to address that?
- Any other major concerns / suggestions?
- If you think the above patches should be split up for initial review / commit, how?

(Other feedback welcome too!)

Longer version
There are a number of use cases for capturing compiler outputs, which I'm hoping this proposal is a step toward addressing.

- Tooling wants to capture outputs directly, without going through the filesystem.
- Sometimes, tertiary outputs can be ignored, or still need to be written to disk.
- clang-scan-deps is using a form of stripped down "implicit" modules to determine which modules need to be built explicitly. It doesn't really want to be using the on-disk module cache; in-memory would be better.
- clang's ModuleManager manually maintains an in-memory modules cache for implicit modules. This involves copying the PCM outputs into memory. It'd be better for these modules to be file-backed, instead of copies of the stream.

The patch has a bunch of details written / tested (https://reviews.llvm.org/D95501). Here are the high-level structures in the design:

- OutputManager: a shared manager for creating outputs without knowing about the storage.
- OutputConfig: configuration set on the OutputManager that can be (partially) overridden for specific outputs.
- Output: opaque object with a raw_pwrite_stream, output path, and `erase`/`close` functionality. Internally, it has a linked list of output destinations.
- OutputBackend: abstract interface installed in an OutputManager to create the "guts" of an output. While an OutputManager only has one installed, these can be layered / forked / mirrored.
- OutputDestination: abstract interface paired with an OutputBackend, whose lifetime is managed by an Output.
- ContentBuffer: the actual content, held so that multiple backends can use the data efficiently. For example, if the installed backend is a mirror between an on-disk and an in-memory backend, the in-memory backend will either get the content moved directly into an llvm::MemoryBuffer, or get a just-written mmap'ed file.

The patch includes a few backends:

- NullOutputBackend, for ignoring all outputs.
- OnDiskOutputBackend, for writing to disk (the default), initially based on the logic in `clang::CompilerInstance`.
- InMemoryOutputBackend, for writing to an `InMemoryFileSystem`.
- MirroringOutputBackend, for writing to multiple backends. OutputDestination's API is designed around supporting this.
- FilteringOutputBackend, for filtering which outputs get written to the underlying backend.

Why doesn't this inherit from llvm::vfs::FileSystem?
Separating the input and output abstractions seems a bit cleaner. It's easy enough to join them, when useful: e.g., write to an `InMemoryFileSystem` (via an `InMemoryOutputBackend`) and install this same FS in the `FileManager` (maybe indirectly via an `OverlayFileSystem`).

Other work in the area
See also: https://reviews.llvm.org/D78058 (thanks to Marc Rasi for posting that patch, and to Sam McCall for some feedback on an earlier version of this proposal).

Thanks for reading!
Duncan
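
To make the layering of backends described in the proposal concrete, here is a minimal standard-library sketch of how a backend, a destination, and an output might relate. It is only an illustration under stated assumptions: the names `Backend`, `Destination`, `Output`, `commit`, and `create` are hypothetical stand-ins, `std::string` replaces `raw_pwrite_stream` and `ContentBuffer`, and none of the signatures are taken from D95501.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-ins for the RFC's structures; not the API in D95501.

// Receives the finished content of one output.
struct Destination {
  virtual ~Destination() = default;
  virtual void commit(const std::string &Content) = 0;
};

// Creates destinations; returning nullptr means "ignore this output".
struct Backend {
  virtual ~Backend() = default;
  virtual std::unique_ptr<Destination> create(const std::string &Path) = 0;
};

// Discards every output.
struct NullBackend : Backend {
  std::unique_ptr<Destination> create(const std::string &) override {
    return nullptr;
  }
};

// Keeps outputs in a map from path to content.
struct InMemoryBackend : Backend {
  std::map<std::string, std::string> Files;

  struct Dest : Destination {
    InMemoryBackend &Owner;
    std::string Path;
    Dest(InMemoryBackend &Owner, std::string Path)
        : Owner(Owner), Path(std::move(Path)) {}
    void commit(const std::string &Content) override {
      Owner.Files[Path] = Content;
    }
  };

  std::unique_ptr<Destination> create(const std::string &Path) override {
    return std::make_unique<Dest>(*this, Path);
  }
};

// Forwards each output to two underlying backends.
struct MirroringBackend : Backend {
  Backend &First;
  Backend &Second;
  MirroringBackend(Backend &First, Backend &Second)
      : First(First), Second(Second) {}

  struct Dest : Destination {
    std::unique_ptr<Destination> A, B;
    Dest(std::unique_ptr<Destination> A, std::unique_ptr<Destination> B)
        : A(std::move(A)), B(std::move(B)) {}
    void commit(const std::string &Content) override {
      if (A) A->commit(Content); // Either side may have ignored the output.
      if (B) B->commit(Content);
    }
  };

  std::unique_ptr<Destination> create(const std::string &Path) override {
    return std::make_unique<Dest>(First.create(Path), Second.create(Path));
  }
};

// Passes an output to the underlying backend only if Keep(Path) is true.
struct FilteringBackend : Backend {
  Backend &Underlying;
  std::function<bool(const std::string &)> Keep;
  FilteringBackend(Backend &Underlying,
                   std::function<bool(const std::string &)> Keep)
      : Underlying(Underlying), Keep(std::move(Keep)) {}

  std::unique_ptr<Destination> create(const std::string &Path) override {
    if (!Keep(Path))
      return nullptr;
    return Underlying.create(Path);
  }
};

// Buffers content; close() commits it, erase() throws it away.
class Output {
  std::unique_ptr<Destination> Sink;
  std::string Buffer;

public:
  Output(Backend &B, const std::string &Path) : Sink(B.create(Path)) {}
  void write(const std::string &Data) { Buffer += Data; }
  void close() {
    if (Sink) Sink->commit(Buffer);
    Sink.reset();
  }
  void erase() {
    Sink.reset();
    Buffer.clear();
  }
};
```

In this sketch the client only ever talks to `Output`; whether the bytes are mirrored, filtered, or dropped is decided by whichever backend was installed, which is the separation the proposal is after.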
Sam McCall via llvm-dev
2021-Jan-28 11:19 UTC
[llvm-dev] [cfe-dev] RFC: Add an llvm::vfs::OutputManager to allow Clang to virtualize compiler outputs
Really glad to see this work; virtualizing the module cache is something we've wanted to experiment with for tooling, but never got to.

I want to get into the patches in more detail, but some high-level thoughts...

On Wed, Jan 27, 2021 at 6:23 AM Duncan P. N. Exon Smith via cfe-dev <cfe-dev at lists.llvm.org> wrote:

> *TL;DR*: Let's virtualize compiler outputs in Clang. These patches would get us started:
> - https://reviews.llvm.org/D95501 Add llvm::vfs::OutputManager
> - https://reviews.llvm.org/D95502 Initial adoption of llvm::vfs::OutputManager in Clang.
>
> *Questions for the reader*
> - Should we virtualize compiler outputs in Clang? (Hint: *yes*.)

Definitely agree.

> - Does this support belong in LLVM? (I think it does, so that non-Clang tools can easily reuse it.)

Ideally, the core abstraction (path -> pwrite_stream) certainly belongs in LLVM, as well as the most common implementations of it.

Based on experience with VirtualFileSystem, I'd like this interface to be as narrow as possible to make it feasible to implement/wrap correctly, and to reason about how the wider system interacts with it. This roughly corresponds to OutputBackend + OutputDestination in the patch, except:
- the OutputConfig seems like it belongs to particular backends, not the overall backend abstraction
- OutputDestination has a lot of stuff in it; I'll need to dig further into the patch to try to understand why

As for OutputManager itself, I think this belongs in clang, if it's needed. Its main job seems to be to store a set of default options and manage the lifecycle of backends, and it's not obvious that those sorts of concerns will generalize across tools or that there's much to be gained from sharing code here.

> - Is `llvm::vfs::` a reasonable namespace? (If not, suggestions? I think `llvm::` itself is too broad.)

llvm::vfs:: is definitely the right namespace for the core writing stuff IMO.
If more ancillary bits (parts some but not all tools might use) need to go in llvm, llvm:: seems to be the best namespace we have (as with e.g. SourceMgr), but maybe we should add a new one. But as mentioned, I'd prefer those to live in clang:: at least for now.

> - Do you have a use case that this won't address well?
> - Should that be fixed in the initial patch, or could this be evolved in-tree to address that?
> - Any other major concerns / suggestions?

Thread-safety of the core plug-in interface is something that would be nice to explicitly address, as this has been a pain point with vfs::FileSystem. It's tempting to say "not threadsafe, you should lock", but this results in throwing an unnecessary global lock around all FS accesses in multithreaded programs, in the common case that the real FS is being used.

Relatedly, working-directory/relative-path handling should be considered.

(And a question/concern about the relationship between input and output virtualization, elaborated at the bottom.)

> - If you think the above patches should be split up for initial review / commit, how?

Obviously my favorite would be to see a minimal core writable VFS interface extracted and land that first. What's built on top of it is less critical, and I'm not concerned about it landing in larger chunks.

> (Other feedback welcome too!)
>
> *Longer version*
> There are a number of use cases for capturing compiler outputs, which I'm hoping this proposal is a step toward addressing.
>
> - Tooling wants to capture outputs directly, without going through the filesystem.
> - Sometimes, tertiary outputs can be ignored, or still need to be written to disk.
> - clang-scan-deps is using a form of stripped down "implicit" modules to determine which modules need to be built explicitly. It doesn't really want to be using the on-disk module cache; in-memory would be better.
> - clang's ModuleManager manually maintains an in-memory modules cache for implicit modules.
>   This involves copying the PCM outputs into memory. It'd be better for these modules to be file-backed, instead of copies of the stream.
>
> The patch has a bunch of details written / tested (https://reviews.llvm.org/D95501). Here are the high-level structures in the design:
>
> - OutputManager: a shared manager for creating outputs without knowing about the storage.
> - OutputConfig: configuration set on the OutputManager that can be (partially) overridden for specific outputs.
> - Output: opaque object with a raw_pwrite_stream, output path, and `erase`/`close` functionality. Internally, it has a linked list of output destinations.
> - OutputBackend: abstract interface installed in an OutputManager to create the "guts" of an output. While an OutputManager only has one installed, these can be layered / forked / mirrored.
> - OutputDestination: abstract interface paired with an OutputBackend, whose lifetime is managed by an Output.
> - ContentBuffer: the actual content, held so that multiple backends can use the data efficiently. For example, if the installed backend is a mirror between an on-disk and an in-memory backend, the in-memory backend will either get the content moved directly into an llvm::MemoryBuffer, or get a just-written mmap'ed file.
>
> The patch includes a few backends:
>
> - NullOutputBackend, for ignoring all outputs.
> - OnDiskOutputBackend, for writing to disk (the default), initially based on the logic in `clang::CompilerInstance`.
> - InMemoryOutputBackend, for writing to an `InMemoryFileSystem`.
> - MirroringOutputBackend, for writing to multiple backends. OutputDestination's API is designed around supporting this.
> - FilteringOutputBackend, for filtering which outputs get written to the underlying backend.
>
> *Why doesn't this inherit from llvm::vfs::FileSystem?*
> Separating the input and output abstractions seems a bit cleaner.
> It's easy enough to join them, when useful: e.g., write to an `InMemoryFileSystem` (via an `InMemoryOutputBackend`) and install this same FS in the `FileManager` (maybe indirectly via an `OverlayFileSystem`).

I agree with separating the interfaces. In hosted environments your input VFS is often read-only and outputs go somewhere else.

But I wonder, is there an implicit assumption that data written to OutputManager is readable via the (purportedly independent) vfs::FileSystem? This seems like a helpful assumption for module caching, but is extremely constraining and eliminates many of the simplest and most interesting possibilities.

If you're going to require the FileSystem and OutputBackend to be linked, then I think they *should* be the same object. But if it's mostly module caching that requires that, then it seems simpler and less invasive to virtualize module caching directly (put(module, key) + get(key)) rather than file access.

> *Other work in the area*
> See also: https://reviews.llvm.org/D78058 (thanks to Marc Rasi for posting that patch, and to Sam McCall for some feedback on an earlier version of this proposal).
>
> Thanks for reading!
> Duncan
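
As a rough illustration of the alternative raised in this reply, virtualizing module caching directly through `put(module, key)` and `get(key)`, the sketch below shows what such an interface might look like. Everything in it is an assumption made for illustration: `ModuleCache`, `InMemoryModuleCache`, and `getOrBuild` are hypothetical names, modules are modelled as plain byte strings, and nothing here corresponds to existing Clang or LLVM API. The in-memory implementation owns its own mutex, which is one possible way to address the thread-safety concern without forcing callers to wrap every access in a global lock.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <memory>
#include <mutex>
#include <string>
#include <utility>

// Hypothetical interface; names and signatures are illustrative only.
class ModuleCache {
public:
  virtual ~ModuleCache() = default;
  // Stores the serialized module under Key, replacing any previous entry.
  virtual void put(const std::string &Key, std::string Module) = 0;
  // Returns the stored module, or nullptr if Key is unknown.
  virtual std::shared_ptr<const std::string>
  get(const std::string &Key) const = 0;
};

// In-memory implementation that does its own locking, so callers need no
// external lock. Readers share the stored bytes instead of copying them.
class InMemoryModuleCache : public ModuleCache {
  mutable std::mutex Mutex;
  std::map<std::string, std::shared_ptr<const std::string>> Modules;

public:
  void put(const std::string &Key, std::string Module) override {
    auto Stored = std::make_shared<const std::string>(std::move(Module));
    std::lock_guard<std::mutex> Lock(Mutex);
    Modules[Key] = std::move(Stored);
  }

  std::shared_ptr<const std::string>
  get(const std::string &Key) const override {
    std::lock_guard<std::mutex> Lock(Mutex);
    auto It = Modules.find(Key);
    if (It == Modules.end())
      return nullptr;
    return It->second;
  }
};

// Returns the cached module for Key, building and storing it on a miss.
// Two threads that miss at the same time may both build; the later put wins.
std::shared_ptr<const std::string>
getOrBuild(ModuleCache &Cache, const std::string &Key,
           const std::function<std::string()> &Build) {
  if (auto Hit = Cache.get(Key))
    return Hit;
  Cache.put(Key, Build());
  return Cache.get(Key);
}
```

An interface of this shape says nothing about files or paths, so an implementation is free to keep modules in memory, on disk, or in a remote store, without the input `vfs::FileSystem` having to see what was written.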