thr3ads.net - llvm dev - [llvm-dev] [cfe-dev] RFC: Add an llvm::vfs::OutputManager to allow Clang to virtualize compiler outputs [Jan 2021]

Sam McCall via llvm-dev

2021-Jan-28 11:19 UTC

[llvm-dev] [cfe-dev] RFC: Add an llvm::vfs::OutputManager to allow Clang to virtualize compiler outputs

Really glad to see this work; virtualizing the module cache is something we've
wanted to experiment with for tooling, but never got to. I want to get into
the patches in more detail, but first some high-level thoughts...

On Wed, Jan 27, 2021 at 6:23 AM Duncan P. N. Exon Smith via cfe-dev <
cfe-dev at lists.llvm.org> wrote:
> *TL;DR*: Let's virtualize compiler outputs in Clang. These patches would
> get us started:
> - https://reviews.llvm.org/D95501 Add llvm::vfs::OutputManager
> - https://reviews.llvm.org/D95502 Initial adoption of
> llvm::vfs::OutputManager in Clang.
>
> *Questions for the reader*
> - Should we virtualize compiler outputs in Clang? (Hint: *yes*.)

Definitely agree.

> - Does this support belong in LLVM? (I think it does, so that non-Clang
> tools can easily reuse it.)

Ideally, the core abstraction (path -> pwrite_stream) certainly belongs in
LLVM, as well as the most common implementations of it.
Based on experience with VirtualFileSystem, I'd like this interface to be
as narrow as possible to make it feasible to implement/wrap correctly, and
to reason about how the wider system interacts with it.

This roughly corresponds to OutputBackend + OutputDestination in the patch,
except:
 - the OutputConfig seems like it belongs to particular backends, not the
overall backend abstraction
 - OutputDestination has a lot of stuff in it, I'll need to dig further
into the patch to try to understand why
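
For readers following along, the "narrow interface" idea (path -> stream) could be sketched roughly as below. This is only an illustration using the standard library: `SketchOutputBackend`, `InMemorySketchBackend`, and `createFile` are hypothetical names, `std::ostream` stands in for `raw_pwrite_stream`, and none of this is the API in D95501.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical narrow interface: a single virtual call maps a path to a stream.
class SketchOutputBackend {
public:
  virtual ~SketchOutputBackend() = default;
  // The caller writes to the returned stream; in this sketch the content is
  // committed when the stream is destroyed.
  virtual std::unique_ptr<std::ostream> createFile(const std::string &Path) = 0;
};

// One possible implementation that keeps every output in memory.
class InMemorySketchBackend : public SketchOutputBackend {
  // Stream that copies its buffer into the destination string on destruction.
  class CommittingStream : public std::ostringstream {
  public:
    explicit CommittingStream(std::string &Dest) : Dest(Dest) {}
    ~CommittingStream() override { Dest = str(); }

  private:
    std::string &Dest;
  };

public:
  std::unique_ptr<std::ostream> createFile(const std::string &Path) override {
    assert(!Path.empty() && "output path must not be empty");
    // References to std::map values stay valid when other entries are added.
    return std::make_unique<CommittingStream>(Files[Path]);
  }

  std::map<std::string, std::string> Files;
};
```

With an interface this small, wrapping or replacing a backend only requires reasoning about one method, which is the property being asked for above.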

As for OutputManager itself, I think this belongs in clang, if it's needed.
Its main job seems to be to store a set of default options and manage the
lifecycle of backends, and it's not obvious that those sorts of concerns
will generalize across tools or that there's much to be gained from sharing
code here.

> - Is `llvm::vfs::` a reasonable namespace? (If not, suggestions? I think
> `llvm::` itself is too broad.)

llvm::vfs:: is definitely the right namespace for the core writing stuff
IMO.
If more ancillary bits (parts some but not all tools might use) need to go
in llvm, llvm:: seems to be the best namespace we have (like e.g.
SourceMgr) but maybe we should add a new one. But as mentioned, I'd prefer
those to live in clang:: at least for now.

> - Do you have a use case that this won't address well?
> - Should that be fixed in the initial patch, or could this be evolved
> in-tree to address that?
> - Any other major concerns / suggestions?

Thread-safety of the core plug-in interface is something that would be nice
to explicitly address, as this has been a pain-point with vfs::FileSystem.
It's tempting to say "not threadsafe, you should lock", but this results in
throwing an unnecessary global lock around all FS accesses in multithreaded
programs, in the common case that the real FS is being used.

Relatedly, working-directory/relative-path handling should be considered.

(And a question/concern about the relationship between input and output
virtualization, elaborated at the bottom)

> - If you think the above patches should be split up for initial review /
> commit, how?

Obviously my favorite would be to see a minimal core writable VFS interface
extracted and landed first. What's built on top of it is less critical,
and I'm not concerned about it landing in larger chunks.

>
> (Other feedback welcome too!)
>
> *Longer version*
> There are a number of use cases for capturing compiler outputs, which I'm
> hoping this proposal is a step toward addressing.
>
> - Tooling wants to capture outputs directly, without going through the
> filesystem.
> - Sometimes, tertiary outputs can be ignored, or still need to be written
> to disk.
> - clang-scan-deps is using a form of stripped down "implicit" modules to
> determine which modules need to be built explicitly. It doesn't really want
> to be using the on-disk module cache; in-memory would be better.
> - clang's ModuleManager manually maintains an in-memory modules cache for
> implicit modules. This involves copying the PCM outputs into memory. It'd
> be better for these modules to be file-backed, instead of copies of the
> stream.
>
> The patch has a bunch of details written / tested (
> https://reviews.llvm.org/D95501). Here are the high-level structures in
> the design:
>
> - OutputManager: a shared manager for creating outputs without knowing
> about the storage.
> - OutputConfig: configuration set on the OutputManager that can be
> (partially) overridden for specific outputs.
> - Output: opaque object with a raw_pwrite_stream, output path, and
> `erase`/`close` functionality. Internally, it has a linked list of output
> destinations.
> - OutputBackend: abstract interface installed in an OutputManager to create
> the "guts" of an output. While an OutputManager only has one installed,
> these can be layered / forked / mirrored.
> - OutputDestination: abstract interface paired with an OutputBackend, whose
> lifetime is managed by an Output.
> - ContentBuffer: actual content to allow efficient use of data by multiple
> backends. For example, if the installed backend is a mirror between an on-disk
> and in-memory backend, the in-memory backend will either get the content
> moved directly into an llvm::MemoryBuffer, or a just-written mmap'ed file.
>
> The patch includes a few backends:
>
> - NullOutputBackend, for ignoring all outputs.
> - OnDiskOutputBackend, for writing to disk (the default), initially based
> on the logic in `clang::CompilerInstance`.
> - InMemoryOutputBackend, for writing to an `InMemoryFileSystem`.
> - MirroringOutputBackend, for writing to multiple backends.
> OutputDestination's API is designed around supporting this.
> - FilteringOutputBackend, for filtering which outputs get written to
> the underlying backend.
>
> *Why doesn't this inherit from llvm::vfs::FileSystem?*
> Separating the input and output abstractions seems a bit cleaner. It's
> easy enough to join them, when useful: e.g., write to an
> `InMemoryFileSystem` (via an `InMemoryOutputBackend`) and install this same
> FS in the `FileManager` (maybe indirectly via an `OverlayFileSystem`).

I agree with separating the interfaces. In hosted environments your input
VFS is often read-only and outputs go somewhere else.

But I wonder, is there an implicit assumption that data written to
OutputManager is readable via the (purportedly independent)
vfs::FileSystem? This seems like a helpful assumption for module caching,
but is extremely constraining and eliminates many of the simplest and most
interesting possibilities.

If you're going to require the FileSystem and OutputBackend to be linked,
then I think they *should* be the same object.
But if it's mostly module caching that requires that, then it seems simpler
and less invasive to virtualize module caching directly (put(module, key) +
get(key)) rather than file access.
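
As a rough illustration of what virtualizing module caching directly might look like, here is a minimal sketch of a put/get interface. The names `SketchModuleCache` and `InMemorySketchModuleCache` are hypothetical, and modules and keys are modelled as plain strings purely for brevity; Clang's real module cache types are not used here.

```cpp
#include <cassert>
#include <map>
#include <optional>
#include <string>
#include <utility>

// Hypothetical interface: the cache is addressed by key, not by file path.
class SketchModuleCache {
public:
  virtual ~SketchModuleCache() = default;
  // Argument order follows the put(module, key) shape mentioned above.
  virtual void put(std::string Module, const std::string &Key) = 0;
  // Returns std::nullopt when no module is stored under Key.
  virtual std::optional<std::string> get(const std::string &Key) const = 0;
};

// Simplest possible backing store: a map held in memory.
class InMemorySketchModuleCache : public SketchModuleCache {
public:
  void put(std::string Module, const std::string &Key) override {
    assert(!Key.empty() && "cache key must not be empty");
    Modules[Key] = std::move(Module);
  }

  std::optional<std::string> get(const std::string &Key) const override {
    auto It = Modules.find(Key);
    if (It == Modules.end())
      return std::nullopt;
    return It->second;
  }

private:
  std::map<std::string, std::string> Modules;
};
```

Nothing in this shape requires written data to become visible through a `vfs::FileSystem`, which is the decoupling being argued for.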

> *Other work in the area*
> See also: https://reviews.llvm.org/D78058 (thanks to Marc Rasi for
> posting that patch, and to Sam McCall for some feedback on an earlier
> version of this proposal).
>
> Thanks for reading!
> Duncan

Duncan P. N. Exon Smith via llvm-dev

2021-Jan-28 19:24 UTC

Thanks for the detailed response Sam!
> This roughly corresponds to OutputBackend + OutputDestination in the patch,
> except:
>  - the OutputConfig seems like it belongs to particular backends, not the
> overall backend abstraction

Ideally the OnDiskOutputConfig would almost entirely be settings on
OnDiskOutputBackend since no one else will care. It isn't that way in the
initial patch because Clang decides most of this stuff on a per-output basis.
Maybe there's a refactoring that could fix it.

Here are the problems I hit that led to this design:

1. Some properties need to be associated with specific outputs, not backends,
because they relate to properties of the outputs themselves. Here are two:
- ClientIntentOutputConfig::NeedsSeeking
- OnDiskOutputConfig::OpenFlagText

Maybe also something like:
- OnDiskOutputConfig::CreateDirectories
(and maybe others)

I couldn't think of a good way to solve this besides passing in
configuration at output creation time.

2. Some "configurers" may be handed a pre-constructed opaque
OutputManager / OutputBackend and need to configure an internal
OnDiskOutputBackend. In particular, I found that some tooling wants to turn off:
- OnDiskOutputConfig::RemoveFileOnSignal
(other flags might benefit)

This is "documented" by the call chains of
clang::CompilerInstance::createOutputFile.

I reused the solution from #1 since it needed (?) to exist anyway. Another
option would be to add an OutputBackend visitor, to find and reconfigure any
"contained" OnDiskOutputBackend. This seems pretty heavyweight though.

>  - OutputDestination has a lot of stuff in it, I'll need to dig further
> into the patch to try to understand why

Firstly, `Output` has the user-facing abstraction. `OutputDestination` has
low-level bits. Another reasonable name might be `OutputImpl` but
`OutputDestination` seemed more descriptive.

Secondly, most of the low-level bits avoid unnecessary copies of content
buffers. Maybe there are simpler ideas for how to do this, but here are the
design goals I was working around:
- Avoid buffering content unless necessary.
- Avoid duplicating content buffers unless necessary.
- Support multiple destinations (for mirrored backends) without breaking the
above rules. The obvious "interesting" case is in-memory + on-disk (in
either order).
- Make sub-classes of `OutputDestination` as small as possible given the above.

Thirdly, there's some boilerplate related to lifetimes of multiple
destinations. Probably having an explicit `MirroringOutputDestination` would be
better.

> As for OutputManager itself, I think this belongs in clang, if it's
> needed. Its main job seems to be to store a set of default options and manage
> the lifecycle of backends, and it's not obvious that those sorts of concerns
> will generalize across tools or that there's much to be gained from sharing
> code here.

In the end, its main job is to wrap an OutputDestination (low-level abstraction)
in an Output (high-level abstraction). Output does a bunch of work for
OutputDestination, such as managing the intermediate content buffer.

Probably it's better to have the OutputBackend return an Output directly
(and get rid of the OutputManager).

(At one point OutputManager was managing multiple backends directly and was
involved in the store operation(s); but since I factored that logic out to
MirroringOutputBackend and OutputDestination it probably doesn't need to
exist.)

> - Any other major concerns / suggestions?
> Thread-safety of the core plug-in interface is something that would be nice
> to explicitly address, as this has been a pain-point with vfs::FileSystem.
> It's tempting to say "not threadsafe, you should lock", but this results in
> throwing an unnecessary global lock around all FS accesses in multithreaded
> programs, in the common case that the real FS is being used.

Right, I hit multi-threading limitations myself when prototyping a follow-up
patch (didn't get around to posting it until this AM):
https://reviews.llvm.org/D95628

My intuition is:

Thread-safe:
- OutputBackend::createDestination
- Concurrently touching more than one Output/OutputDestination

Thread-unsafe:
- Using a single Output/OutputDestination concurrently

This all seems cheap and easy to maintain because of the limited interface. The
problem I hit in the above patch is that for the InMemoryOutputBackend you also
need any readers of the InMemoryFileSystem to be thread-safe.
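
The split described above (creating outputs is safe to do concurrently, a single output is not) could look roughly like the following sketch. It is an assumption-laden illustration, not the patch: `LockedSketchBackend`, `SingleThreadedFile`, `commit`, and `read` are hypothetical names, and a `std::map` stands in for the `InMemoryFileSystem`.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <mutex>
#include <string>
#include <utility>

class LockedSketchBackend;

// NOT thread-safe: each file is meant to be owned and used by one thread.
class SingleThreadedFile {
public:
  SingleThreadedFile(LockedSketchBackend &Backend, std::string Path)
      : Backend(Backend), Path(std::move(Path)) {}

  void append(const std::string &Data) {
    assert(!Closed && "write after close");
    Buffer += Data; // No lock needed: the buffer is private to this file.
  }

  void close(); // Commits the buffer to the backend under the backend's lock.

private:
  LockedSketchBackend &Backend;
  std::string Path;
  std::string Buffer;
  bool Closed = false;
};

// Thread-safe: shared state is only touched while holding the mutex.
class LockedSketchBackend {
public:
  // Touches no shared state, so it may be called from any thread.
  std::unique_ptr<SingleThreadedFile> createFile(std::string Path) {
    return std::make_unique<SingleThreadedFile>(*this, std::move(Path));
  }

  void commit(const std::string &Path, std::string Content) {
    std::lock_guard<std::mutex> Lock(Mutex);
    Files[Path] = std::move(Content);
  }

  // Readers take the same lock; an unknown path yields an empty string.
  std::string read(const std::string &Path) const {
    std::lock_guard<std::mutex> Lock(Mutex);
    auto It = Files.find(Path);
    return It == Files.end() ? std::string() : It->second;
  }

private:
  mutable std::mutex Mutex;
  std::map<std::string, std::string> Files;
};

inline void SingleThreadedFile::close() {
  assert(!Closed && "file closed twice");
  Closed = true;
  Backend.commit(Path, std::move(Buffer));
}
```

The lock is held only for the final commit and for reads, so writers never contend while streaming content, and the reader side shows why the in-memory case also needs thread-safe readers.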

Relatedly, https://reviews.llvm.org/D95583 (a prep patch for the above) allows
the llvm::LockManager to be skipped. This is not really about outputs: it's
inter-process coordination to avoid doing redundant work in competing Clang
command-line invocations (at one point it was needed for correctness, but the
main benefit now is to avoid taxing the on-disk filesystem). It does, however,
touch on the idea of exclusive write access.

> Relatedly, working-directory/relative-path handling should be considered.

Yeah, you're probably right. Any specific thoughts on what to do here? It
seems like llvm::vfs::FileSystem gets them pretty wrong for Windows; see (e.g.)
the discussion at the end of https://reviews.llvm.org/D70701.

Even for POSIX, working directories could complicate concurrency guarantees. A
simple solution is to not have an API for changing working directories, and
instead model that by creating a proxy backend (ChangeDirectoryOutputBackend)
that prepends the (new) working directory to new outputs; since each backend has
an immutable view of the CWD, concurrent access should be fine.
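
A minimal sketch of that proxy idea follows, assuming POSIX-style paths only. `ChangeDirectorySketchBackend` borrows its name from the text above, but `SketchBackend`, `RecordingBackend`, `write`, and `resolve` are hypothetical and exist only for illustration.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>

// Hypothetical, deliberately tiny backend interface.
class SketchBackend {
public:
  virtual ~SketchBackend() = default;
  virtual void write(const std::string &Path, const std::string &Content) = 0;
};

// Stand-in for a real destination: remembers what was written where.
class RecordingBackend : public SketchBackend {
public:
  void write(const std::string &Path, const std::string &Content) override {
    Files[Path] = Content;
  }
  std::map<std::string, std::string> Files;
};

// Proxy with a working directory fixed at construction. There is no setter,
// so the view of the CWD never changes after the proxy is created.
class ChangeDirectorySketchBackend : public SketchBackend {
public:
  ChangeDirectorySketchBackend(SketchBackend &Underlying, std::string Cwd)
      : Underlying(Underlying), Cwd(std::move(Cwd)) {
    assert(!this->Cwd.empty() && this->Cwd.front() == '/' &&
           "working directory must be an absolute POSIX path");
  }

  void write(const std::string &Path, const std::string &Content) override {
    Underlying.write(resolve(Path), Content);
  }

  // Absolute paths pass through; relative paths are prefixed with the CWD.
  std::string resolve(const std::string &Path) const {
    if (!Path.empty() && Path.front() == '/')
      return Path;
    if (Cwd.back() == '/')
      return Cwd + Path;
    return Cwd + "/" + Path;
  }

private:
  SketchBackend &Underlying;
  const std::string Cwd;
};
```

"Changing directory" then means constructing another proxy around the same underlying backend rather than mutating shared state.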

Three other thoughts related to paths:

1. I wonder if this abstraction treats the "path" as too central a
property of the output. Can this evolve to allow a compiler to build a directory
structure bottom-up without having to know the destination a priori? (E.g., an
inner layer makes a blob, a middle layer makes a tree out of a few of those, and
an outer layer decides where to put the tree.)

I think it can. One approach is to use proxy backends:
- Inner layer writes to '-' / stdout. (Why not pass a pre-constructed
Output/raw_pwrite_stream? See note below.)
- Middle layer passes the instances of the inner layer a proxy backend that maps
stdout to the various blob names. (E.g., `/name1` and `/name2`.)
- Outer layer passes the middle layer a proxy backend that "does the right
thing" with the tree. (E.g., writes to `/Users/dexonsmith/name{1,2}`.)

=> If writing on-disk, "the right thing" is to prepend the correct
output directory for the tree and pass each file to a regular
OnDiskOutputBackend.
=> If writing to (e.g.) Git's CAS, "the right thing" is to call
git-hash-object on Output::close and track the name-to-hash mapping as outputs
come in, and then call "git-mktree" when the middle layer is "done" (maybe a
callback in the backend-passed-to-the-middle-layer's destructor).

(IOW, a refactoring where instead of passing absolute paths / directories /
filenames down through all the layers, proxy output backends build up the
path/destination piece-by-piece.)
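
The layering above can be sketched with plain function objects standing in for output backends. This is a simplification for illustration only: `WriteFn`, `renameStdout`, `prependRoot`, and `emitBlob` are hypothetical helpers, and a real backend would hand out streams rather than take whole strings.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <utility>

// A "backend" reduced to its essence: something that accepts (path, content).
using WriteFn =
    std::function<void(const std::string &Path, const std::string &Content)>;

// Middle-layer proxy: maps the inner layer's "-" (stdout) to a blob name.
inline WriteFn renameStdout(WriteFn Next, std::string BlobName) {
  assert(!BlobName.empty() && "blob name must not be empty");
  return [Next = std::move(Next), BlobName = std::move(BlobName)](
             const std::string &Path, const std::string &Content) {
    Next(Path == "-" ? BlobName : Path, Content);
  };
}

// Outer-layer proxy: decides where the tree lives by prepending a root.
inline WriteFn prependRoot(WriteFn Next, std::string Root) {
  return [Next = std::move(Next), Root = std::move(Root)](
             const std::string &Path, const std::string &Content) {
    Next(Root + Path, Content);
  };
}

// Inner layer: writes to "-" and knows nothing about the final destination.
inline void emitBlob(const WriteFn &Out, const std::string &Content) {
  Out("-", Content);
}
```

Each layer only sees the piece of the destination it is responsible for; swapping the outermost function for one that hashes content instead of writing files would leave the inner layers untouched.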

I think it's doable with the abstraction as-proposed. But let me know if
anyone has concerns. For example:
- Is `Output::getPath()` an abstraction leak?
- Should we have a `createOutput` that doesn't take a path?
- ...

Why not pass a pre-constructed Output/raw_pwrite_stream to the inner layer?
- The inner layer needs an output backend if it (sometimes) dumps
"side" files (such as AST record layouts into ".ast-dump" or
textual IR into ".ll"). This avoids needing to know the on-disk file
path ("/path/to/output" => "/path/to/output.ll"), or to
even know whether there's a disk.

2. How should we virtualize stdout / stderr?
- "'-' means stdout" is probably good enough since LLVM makes
that assumption all over. Unless someone disagrees?
- I'm not sure what to do with stderr. No one ever "closes" this
stream.
- Are there other outputs that don't have path names?

3. Do we need to virtualize llvm::sys::fs::create_directories?
- If so, why?

> (And a question/concern about the relationship between input and output
> virtualization, elaborated at the bottom)

> Why doesn't this inherit from llvm::vfs::FileSystem?
> Separating the input and output abstractions seems a bit cleaner. It's
> easy enough to join them, when useful: e.g., write to an `InMemoryFileSystem`
> (via an `InMemoryOutputBackend`) and install this same FS in the `FileManager`
> (maybe indirectly via an `OverlayFileSystem`).
>
> I agree with separating the interfaces. In hosted environments your input
> VFS is often read-only and outputs go somewhere else.
>
> But I wonder, is there an implicit assumption that data written to
> OutputManager is readable via the (purportedly independent) vfs::FileSystem?
> This seems like a helpful assumption for module caching, but is extremely
> constraining and eliminates many of the simplest and most interesting
> possibilities.
>
> If you're going to require the FileSystem and OutputBackend to be
> linked, then I think they *should* be the same object.

No, I don't think that should be a requirement / expectation. It's a
specific requirement for Clang's implicitly built modules, and I think Clang
should be responsible for hooking them together when necessary.

> But if it's mostly module caching that requires that, then it seems
> simpler and less invasive to virtualize module caching directly (put(module,
> key) + get(key)) rather than file access.

Agreed, explicitly virtualizing module caching might be a good thing to do.
Either way this is Clang's job to coordinate; I just think the output
manager should efficiently support mirroring outputs to an additional/custom
backend that Clang installs.

Note: implicit modules doesn't currently rely on reading the modules it has
just built from disk. It uses InMemoryModuleCache to avoid having to read
anything it has written to disk and to ensure consistency between
CompilerInstances across an implicit build. It's pretty awkward though.

Duncan P. N. Exon Smith via llvm-dev

2021-Feb-03 03:36 UTC

Update: I've incorporated much of Sam's feedback into the main patch
(https://reviews.llvm.org/D95501).

- Simplify OutputConfig, restricting it to semantic information about a specific
output. Sink all backend configuration to flags on the backends themselves.
- Remove OutputManager, instead exposing OutputBackend directly.
- Merge Output and OutputDestination into a single class called OutputFile, and
rename the API for creating them to OutputBackend::createFile().
- Add support for working directories via OutputDirectory, and add
OutputBackend::getDirectory() and OutputBackend::createDirectory().
- Add support for createUniqueFile() and createUniqueDirectory(), both heavily
used in clang. Backends without read access can use StableUniqueEntityAdaptor to
implement these.

The main thing not settled is the threading guarantees. Restating the straw man
I proposed:

- All OutputBackend APIs are safe to call concurrently. Since OutputDirectory
*is* an OutputBackend, it can be used concurrently as well.
- An OutputFile cannot be used concurrently, but two files from the same backend
can be.

Interested in everyone's thoughts on that; if that sounds reasonable I can
update the patch to make it so.

Two other points:

- Sam proposed dropping OutputConfig. I don't think we can, as the API
client needs to communicate semantic information about specific outputs, not
about the backends generally.
- Sam suggested OutputDestination (now OutputFile) seemed bloated. I talked
through some of the details in my previous response; most of the complexity is
there to make MirroringOutputBackend work efficiently and avoid duplication in
subclasses. As a counterpoint, the only API a concrete subclass needs to
override is storeContentBuffer().
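
To picture the "one override" claim, here is a rough sketch of a base class that owns the intermediate buffer and leaves a single hook for subclasses. Only the name `storeContentBuffer` comes from the message above; its signature, the `SketchOutputFile`, `StringSinkFile`, and `MirroringSketchFile` classes, and the use of `std::string` as the buffer are all assumptions made for illustration.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Base class does the buffering; subclasses only decide where content lands.
class SketchOutputFile {
public:
  virtual ~SketchOutputFile() = default;

  void write(const std::string &Data) {
    assert(!Closed && "write after close");
    Buffer += Data;
  }

  void close() {
    assert(!Closed && "file closed twice");
    Closed = true;
    storeContentBuffer(Buffer);
  }

  // The single hook a concrete destination has to implement in this sketch.
  virtual void storeContentBuffer(const std::string &Content) = 0;

private:
  std::string Buffer;
  bool Closed = false;
};

// Simplest destination: copy the content into a caller-owned string.
class StringSinkFile final : public SketchOutputFile {
public:
  explicit StringSinkFile(std::string &Sink) : Sink(Sink) {}
  void storeContentBuffer(const std::string &Content) override {
    Sink = Content;
  }

private:
  std::string &Sink;
};

// Mirror: hands the same buffer to every destination, bypassing each
// destination's own intermediate buffer.
class MirroringSketchFile final : public SketchOutputFile {
public:
  void addDestination(std::unique_ptr<SketchOutputFile> Dest) {
    Destinations.push_back(std::move(Dest));
  }
  void storeContentBuffer(const std::string &Content) override {
    for (auto &Dest : Destinations)
      Dest->storeContentBuffer(Content);
  }

private:
  std::vector<std::unique_ptr<SketchOutputFile>> Destinations;
};
```

In this sketch the mirror buffers content once and each destination receives a reference to that same buffer, which is the kind of duplication-avoidance the design goals describe.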

Sam, please take another look and let me know if you have more high-level
comments.
Others, if you have feedback, please let me know!

Thanks!
Duncan
> On 2021 Jan  28, at 11:24, Duncan P. N. Exon Smith via llvm-dev
<llvm-dev at lists.llvm.org> wrote:
> 
> Thanks for the detailed response Sam!
> 
>> This roughly corresponds to OutputBackend + OutputDestination in the
patch, except:
>>  - the OutputConfig seems like it belongs to particular backends, not
the overall backend abstraction
> 
> Ideally the OnDiskOutputConfig would almost entirely be settings on
OnDiskOutputBackend since no one else will care. It isn't that way in the
initial patch because Clang decides most of this stuff on a per-output basis.
Maybe there's a refactoring that could fix it.
> 
> Here are the problems I hit that led to this design:
> 
> 1. Some properties need to be associated with specific outputs, not
backends, because they relate to properties of the outputs themselves. Here are
two:
> - ClientIntentOutputConfig::NeedsSeeking
> - OnDiskOutputConfig::OpenFlagText
> 
> Maybe also something like:
> - OnDiskOutputConfig::CreateDirectories
> (and maybe others)
> 
> I couldn't think of a good way to solve this besides passing in
configuration at output creation time.
> 
> 2. Some "configurers" may be handed a pre-constructed opaque
> OutputManager / OutputBackend and need to configure an internal
> OnDiskOutputBackend. In particular, I found that some tooling wants to turn off:
> - OnDiskOutputConfig::RemoveFileOnSignal
> (other flags might benefit)
> 
> This is "documented" by the call chains of
clang::CompilerInstance::createOutputFile.
> 
> I reused the solution from #1 since it needed (?) to exist anyway. Another
option would be to add an OutputBackend visitor, to find and reconfigure any
"contained" OnDiskOutputBackend. This seems pretty heavyweight though.
> 
>>  - OutputDestination has a lot of stuff in it, I'll need to dig
further into the patch to try to understand why
> 
> Firstly, `Output` has the user-facing abstraction. `OutputDestination` has
> low-level bits. Another reasonable name might be `OutputImpl` but
> `OutputDestination` seemed more descriptive.
> 
> Secondly, most of the low-level bits avoid unnecessary copies of content
buffers. Maybe there are simpler ideas for how to do this, but here are the
design goals I was working around:
> - Avoid buffering content unless necessary.
> - Avoid duplicating content buffers unless necessary.
> - Support multiple destinations (for mirrored backends) without breaking
the above rules. The obvious "interesting" case is in-memory + on-disk
(in either order).
> - Make sub-classes of `OutputDestination` as small as possible given the
above.
> 
> Thirdly, there's some boilerplate related to lifetimes of multiple
destinations. Probably having an explicit `MirroringOutputDestination` would be
better.
> 
>> As for OutputManager itself, I think this belongs in clang, if it's
needed. Its main job seems to be to store a set of default options and manage
the lifecycle of backends, and it's not obvious that those sorts of concerns
will generalize across tools or that there's much to be gained from sharing
code here.
> 
> In the end, its main job is to wrap an OutputDestination (low-level
abstraction) in an Output (high-level abstraction). Output does a bunch of work
for OutputDestination, such as manage the intermediate content buffer.
> 
> Probably it's better to have the OutputBackend return an Output
directly (and get rid of the OutputManager).
> 
> (At one point OutputManager was managing multiple backends directly and was
involved in the store operation(s); but since I factored that logic out to
MirroringOutputBackend and OutputDestination it probably doesn't need to
exist.)
> 
>> - Any other major concerns / suggestions?
>> Thread-safety of the core plug-in interface is something that would be
nice to explicitly address, as this has been a pain-point with vfs::FileSystem.
>> It's tempting to say "not threadsafe, you should lock",
but this results in throwing an unnecessary global lock around all FS accesses
in multithreaded programs, in the common case that the real FS is being used.
> 
> Right, I hit multi-threading limitations myself when prototyping a
> follow-up patch (didn't get around to posting it until this AM):
> https://reviews.llvm.org/D95628
> 
> My intuition is:
> 
> Thread-safe:
> - OutputBackend::createDestination
> - Concurrently touching more than one Output/OutputDestination
> 
> Thread-unsafe:
> - Using a single Output/OutputDestination concurrently
> 
> This all seems cheap and easy to maintain because of the limited interface.
The problem I hit in the above patch is that for the InMemoryOutputBackend you
also need any readers of the InMemoryFileSystem to be thread-safe.
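
The split described above (creating destinations concurrently is safe, sharing one destination is not, and an in-memory store needs thread-safe readers) can be sketched as follows. This is only an illustration in standard C++ under simplifying assumptions: `InMemoryStore`, `Destination`, and `Backend` are hypothetical stand-ins, not the classes in D95501 or D95628, and content is a plain string rather than a stream.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <mutex>
#include <string>
#include <utility>

// Hypothetical stand-in for an in-memory file store. Writers and readers both
// take the lock, since readers may race with outputs being stored.
class InMemoryStore {
  mutable std::mutex M;
  std::map<std::string, std::string> Files;

public:
  void put(std::string Path, std::string Content) {
    std::lock_guard<std::mutex> Lock(M);
    Files[std::move(Path)] = std::move(Content);
  }
  bool get(const std::string &Path, std::string &Content) const {
    std::lock_guard<std::mutex> Lock(M);
    auto I = Files.find(Path);
    if (I == Files.end())
      return false;
    Content = I->second;
    return true;
  }
};

// Thread-unsafe by design: a single destination is owned by one thread.
class Destination {
  InMemoryStore &Store;
  std::string Path;
  std::string Buffer;
  bool Closed = false;

public:
  Destination(InMemoryStore &Store, std::string Path)
      : Store(Store), Path(std::move(Path)) {}
  void append(const std::string &Data) {
    assert(!Closed && "cannot write to a closed destination");
    Buffer += Data;
  }
  // Publishes the buffered content; the only step that touches shared state.
  void close() {
    assert(!Closed && "destination closed twice");
    Closed = true;
    Store.put(Path, std::move(Buffer));
  }
};

// Safe to call concurrently: creating a destination mutates no backend state.
class Backend {
  InMemoryStore &Store;

public:
  explicit Backend(InMemoryStore &Store) : Store(Store) {}
  std::unique_ptr<Destination> createDestination(std::string Path) const {
    return std::unique_ptr<Destination>(new Destination(Store, std::move(Path)));
  }
};
```

In this sketch the only lock is inside the store, so a backend that writes to the real filesystem would not pay for it.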
> 
> Relatedly, https://reviews.llvm.org/D95583 (a prep patch for the above)
allows the llvm::LockManager to be skipped. This is not really about outputs:
it's inter-process coordination to avoid doing redundant work in competing Clang
command-line invocations. At one point it was needed for correctness, but the
main benefit now is to avoid taxing the on-disk filesystem. Still, it does touch
on the idea of exclusive write access.
> 
>> Relatedly, working-directory/relative-path handling should be
considered.
> 
> Yeah, you're probably right. Any specific thoughts on what to do here?
It seems like llvm::vfs::FileSystem gets them pretty wrong for Windows; see
(e.g.) the discussion at the end of https://reviews.llvm.org/D70701.
> 
> Even for POSIX, working directories could complicate concurrency
guarantees. A simple solution is to not have an API for changing working
directories, and instead model that by creating a proxy backend
(ChangeDirectoryOutputBackend) that prepends the (new) working directory to new
outputs; since each backend has an immutable view of the CWD concurrent access
should be fine.
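
A minimal sketch of that proxy idea, assuming a much-simplified backend interface that takes a path and the finished content. Only the name `ChangeDirectoryOutputBackend` comes from the message above; the `OutputBackend` and `RecordingBackend` classes here are hypothetical, and path handling is POSIX-style only.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>
#include <utility>

// Hypothetical, simplified backend interface (not the one in D95501).
class OutputBackend {
public:
  virtual ~OutputBackend() = default;
  virtual void write(const std::string &Path, const std::string &Content) = 0;
};

// Toy backend that records outputs in a map, standing in for a real one.
class RecordingBackend : public OutputBackend {
public:
  std::map<std::string, std::string> Files;
  void write(const std::string &Path, const std::string &Content) override {
    Files[Path] = Content;
  }
};

// Proxy that resolves relative paths against a working directory fixed at
// construction. There is no setter, so the proxy's view of the CWD never
// changes after it is built.
class ChangeDirectoryOutputBackend : public OutputBackend {
  const std::string CWD;
  std::shared_ptr<OutputBackend> Underlying;

public:
  ChangeDirectoryOutputBackend(std::string CWD,
                               std::shared_ptr<OutputBackend> Underlying)
      : CWD(std::move(CWD)), Underlying(std::move(Underlying)) {
    assert(!this->CWD.empty() && this->CWD.front() == '/' &&
           "working directory must be absolute");
  }

  // Absolute paths pass through; relative paths get the CWD prepended.
  std::string resolve(const std::string &Path) const {
    if (!Path.empty() && Path.front() == '/')
      return Path;
    if (CWD.back() == '/')
      return CWD + Path;
    return CWD + "/" + Path;
  }

  void write(const std::string &Path, const std::string &Content) override {
    Underlying->write(resolve(Path), Content);
  }
};
```

Changing directory then means wrapping the same underlying backend in a new proxy. Whether concurrent writes are safe still depends on the underlying backend; the proxy itself adds no mutable state.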
> 
> Three other thoughts related to paths:
> 
> 1. I wonder if this abstraction treats the "path" as too central
a property of the output. Can this evolve to allow a compiler to build a
directory structure bottom-up without having to know the destination a priori?
(E.g., an inner layer makes a blob, a middle layer makes a tree out of a few of
those, and an outer layer decides where to put the tree.)
> 
> I think it can. One approach is to use proxy backends:
> - Inner layer writes to '-' / stdout. (Why not pass a
pre-constructed Output/raw_pwrite_stream? See note below.)
> - Middle layer passes the instances of the inner layer a proxy backend that
maps stdout to the various blob names. (E.g., `/name1` and `/name2`.)
> - Outer layer passes the middle layer a proxy backend that "does the
right thing" with the tree. (E.g., writes to
`/Users/dexonsmith/name{1,2}`.)
> 
> => If writing on-disk, "the right thing" is to prepend the
correct output directory for the tree and pass each file to a regular
OnDiskOutputBackend.
> => If writing to (e.g.) Git's CAS, "the right thing" is to
call git-hash-object on Output::close and track the name-to-hash mapping as
outputs come in, and then call "git-mktree" when the middle layer is
"done" (maybe a callback in the
backend-passed-to-the-middle-layer's destructor).
> 
> (IOW, a refactoring where instead of passing absolute paths / directories /
filenames down through all the layers, proxy output backends build up the
path/destination piece-by-piece.)
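
The layering above can be sketched with plain callables standing in for backends. Everything here is hypothetical (the `Backend` alias, the function names, and the string content); it only shows how each layer adds its own piece of the path, and it covers the on-disk case rather than the Git CAS one.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>

// Hypothetical stand-in for an output backend: a callable taking (path, content).
using Backend = std::function<void(const std::string &, const std::string &)>;

// Inner layer: produces one blob and only knows the name "-" (stdout).
void runInnerLayer(const Backend &Out, const std::string &Blob) {
  Out("-", Blob);
}

// Proxy that renames "-" to a blob name chosen by the middle layer.
Backend mapStdoutTo(std::string Name, Backend Underlying) {
  assert(!Name.empty() && "blob name must not be empty");
  return [Name, Underlying](const std::string &Path,
                            const std::string &Content) {
    Underlying(Path == "-" ? Name : Path, Content);
  };
}

// Proxy that prepends an output directory chosen by the outer layer.
Backend prependDirectory(std::string Dir, Backend Underlying) {
  return [Dir, Underlying](const std::string &Path,
                           const std::string &Content) {
    Underlying(Dir + Path, Content);
  };
}

// Middle layer: builds a tree from two blobs without knowing its destination.
void runMiddleLayer(const Backend &Out) {
  runInnerLayer(mapStdoutTo("/name1", Out), "blob one");
  runInnerLayer(mapStdoutTo("/name2", Out), "blob two");
}

// Outer layer: decides where the tree goes. A map stands in for the disk.
std::map<std::string, std::string> runOuterLayer(const std::string &Dir) {
  std::map<std::string, std::string> Disk;
  Backend Store = [&Disk](const std::string &Path, const std::string &Content) {
    Disk[Path] = Content;
  };
  runMiddleLayer(prependDirectory(Dir, Store));
  return Disk;
}
```

Calling `runOuterLayer("/Users/dexonsmith")` stores the two blobs under `/Users/dexonsmith/name1` and `/Users/dexonsmith/name2`, even though neither inner layer ever saw that directory.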
> 
> I think it's doable with the abstraction as-proposed. But let me know
if anyone has concerns. For example:
> - Is `Output::getPath()` an abstraction leak?
> - Should we have a `createOutput` that doesn't take a path?
> - ...
> 
> Why not pass a pre-constructed Output/raw_pwrite_stream to the inner layer?
> - The inner layer needs an output backend if it (sometimes) dumps
"side" files (such as AST record layouts into ".ast-dump" or
textual IR into ".ll"). This avoids needing to know the on-disk file
path ("/path/to/output" => "/path/to/output.ll"), or to
even know whether there's a disk.
> 
> 2. How should we virtualize stdout / stderr?
> - "'-' means stdout" is probably good enough since LLVM
makes that assumption all over. Unless someone disagrees?
> - I'm not sure what to do with stderr. No one ever "closes"
this stream.
> - Are there other outputs that don't have path names?
> 
> 3. Do we need to virtualize llvm::sys::fs::create_directories?
> - If so, why?
> 
>> (And a question/concern about the relationship between input and output
virtualization, elaborated at the bottom)
> 
>> Why doesn't this inherit from llvm::vfs::FileSystem?
>> Separating the input and output abstractions seems a bit cleaner.
It's easy enough to join them, when useful: e.g., write to an
`InMemoryFileSystem` (via an `InMemoryOutputBackend`) and install this same FS
in the `FileManager` (maybe indirectly via an `OverlayFileSystem`).
>> I agree with separating the interfaces. In hosted environments your
input VFS is often read-only and outputs go somewhere else.
>> 
>> But I wonder, is there an implicit assumption that data written to
OutputManager is readable via the (purportedly independent) vfs::FileSystem?
This seems like a helpful assumption for module caching, but is extremely
constraining and eliminates many of the simplest and most interesting
possibilities.
>> 
>> If you're going to require the FileSystem and OutputBackend to be
linked, then I think they *should* be the same object.
> 
> No, I don't think that should be a requirement / expectation. It's
a specific requirement for Clang's implicitly built modules, and I think
Clang should be responsible for hooking them together when necessary.
> 
>> But if it's mostly module caching that requires that, then it seems
simpler and less invasive to virtualize module caching directly (put(module,
key) + get(key)) rather than file access.
> 
> Agreed, explicitly virtualizing module caching might be a good thing to do.
Either way this is Clang's job to coordinate; I just think the output
manager should efficiently support mirroring outputs to an additional/custom
backend that Clang installs.
> 
> Note: implicit modules doesn't currently rely on reading the modules it
has just built from disk. It uses InMemoryModuleCache to avoid having to read
anything it has written to disk and to ensure consistency between
CompilerInstances across an implicit build. It's pretty awkward though.
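
For comparison, the "put(module, key) + get(key)" alternative could look roughly like the sketch below. It is an assumption-laden illustration, not Clang's InMemoryModuleCache: the `ModuleCache` interface, the `MapModuleCache` class, the string key, and the first-writer-wins policy are all hypothetical choices made to keep the example small.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>
#include <utility>

// Hypothetical interface: modules are stored and fetched by an opaque key,
// with no file paths involved.
class ModuleCache {
public:
  virtual ~ModuleCache() = default;
  virtual void put(const std::string &Key, std::string PCM) = 0;
  // Returns null when no module is stored under Key.
  virtual std::shared_ptr<const std::string>
  get(const std::string &Key) const = 0;
};

// Simplest possible implementation: an in-process map.
class MapModuleCache : public ModuleCache {
  std::map<std::string, std::shared_ptr<const std::string>> Modules;

public:
  void put(const std::string &Key, std::string PCM) override {
    assert(!Key.empty() && "cache key must not be empty");
    // Assumed policy: the first writer wins, so every reader sees the same
    // bytes for a given key. emplace does nothing if the key already exists.
    Modules.emplace(Key, std::make_shared<std::string>(std::move(PCM)));
  }

  std::shared_ptr<const std::string>
  get(const std::string &Key) const override {
    auto I = Modules.find(Key);
    if (I == Modules.end())
      return nullptr;
    return I->second;
  }
};
```

An interface this narrow leaves the storage choice (memory, disk, a remote cache) entirely to the implementation, which is the appeal of virtualizing the cache rather than file access.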
> 
>> On 2021 Jan 28, at 03:19, Sam McCall <sammccall at google.com> wrote:
>> 
>> Really glad to see this work, virtualizing module cache is something
we've wanted to experiment with for tooling, but never got to. I want to get
into the patches in more detail, but some high-level thoughts...
>> 
>> On Wed, Jan 27, 2021 at 6:23 AM Duncan P. N. Exon Smith via cfe-dev
<cfe-dev at lists.llvm.org> wrote:
>> TL;DR: Let's virtualize compiler outputs in Clang. These patches
would get us started:
>> - https://reviews.llvm.org/D95501 Add llvm::vfs::OutputManager
>> - https://reviews.llvm.org/D95502 Initial adoption of
llvm::vfs::OutputManager in Clang.
>> 
>> Questions for the reader
>> - Should we virtualize compiler outputs in Clang? (Hint: yes.)
>> Definitely agree.
>>  
>> 	- Does this support belong in LLVM? (I think it does, so that
non-Clang tools can easily reuse it.)
>> Ideally, the core abstraction (path -> pwrite_stream) certainly
belongs in LLVM, as well as the most common implementations of it.
>> Based on experience with VirtualFileSystem, I'd like this interface
to be as narrow as possible to make it feasible to implement/wrap correctly, and
to reason about how the wider system interacts with it.
>> 
>> This roughly corresponds to OutputBackend + OutputDestination in the
patch, except:
>>  - the OutputConfig seems like it belongs to particular backends, not
the overall backend abstraction
>>  - OutputDestination has a lot of stuff in it, I'll need to dig
further into the patch to try to understand why
>> 
>> As for OutputManager itself, I think this belongs in clang, if it's
needed. Its main job seems to be to store a set of default options and manage
the lifecycle of backends, and it's not obvious that those sorts of concerns
will generalize across tools or that there's much to be gained from sharing
code here.
>>  
>> 	- Is `llvm::vfs::` a reasonable namespace? (If not, suggestions? I
think `llvm::` itself is too broad.)
>> llvm::vfs:: is definitely the right namespace for the core writing
stuff IMO.
>> If more ancillary bits (parts some but not all tools might use) need to
go in llvm, llvm:: seems to be the best namespace we have (like e.g. SourceMgr)
but maybe we should add a new one. But as mentioned, I'd prefer those to
live in clang:: at least for now.
>>  
>> - Do you have a use case that this won't address well?
>> 	- Should that be fixed in the initial patch, or could this be evolved
in-tree to address that?
>> - Any other major concerns / suggestions?
>> Thread-safety of the core plug-in interface is something that would be
nice to explicitly address, as this has been a pain-point with vfs::FileSystem.
>> It's tempting to say "not threadsafe, you should lock",
but this results in throwing an unnecessary global lock around all FS accesses
in multithreaded programs, in the common case that the real FS is being used.
>> 
>> Relatedly, working-directory/relative-path handling should be
considered.
>> 
>> (And a question/concern about the relationship between input and output
virtualization, elaborated at the bottom)
>>  
>> - If you think the above patches should be split up for initial review
/ commit, how?
>> Obviously my favorite would be to see a minimal core writable VFS
interface extracted and land that first. What's built on top of it is less
critical, and I'm not concerned about it landing in larger chunks.
>>  
>> 
>> (Other feedback welcome too!)
>> 
>> Longer version
>> There are a number of use cases for capturing compiler outputs, which
I'm hoping this proposal is a step toward addressing.
>> 
>> - Tooling wants to capture outputs directly, without going through the
filesystem.
>> 	- Sometimes, tertiary outputs can be ignored, or still need to be
written to disk.
>> - clang-scan-deps is using a form of stripped down "implicit"
modules to determine which modules need to be built explicitly. It doesn't
really want to be using the on-disk module cache—in-memory would be better.
>> - clang's ModuleManager manually maintains an in-memory modules
cache for implicit modules. This involves copying the PCM outputs into memory.
It'd be better for these modules to be file-backed, instead of copies of the
stream.
>> 
>> The patch has a bunch of details written / tested
(https://reviews.llvm.org/D95501). Here are the high-level structures in the
design:
>> 
>> - OutputManager—a shared manager for creating outputs without knowing
about the storage.
>> - OutputConfig—configuration set on the OutputManager that can be
(partially) overridden for specific outputs.
>> - Output—opaque object with a raw_pwrite_stream, output path, and
`erase`/`close` functionality. Internally, it has a linked list of output
destinations.
>> - OutputBackend—abstract interface installed in an OutputManager to
create the "guts" of an output. While an OutputManager only has one
installed, these can be layered / forked / mirrored.
>> - OutputDestination—abstract interface paired with an OutputBackend,
whose lifetime is managed by an Output.
>> - ContentBuffer—actual content to allow efficient use of data by
multiple backends. For example, if the installed backend is a mirror between an
on-disk and in-memory backend, the in-memory backend will either get the content
moved directly into an llvm::MemoryBuffer, or get a just-written mmap'ed file.
>> 
>> The patch includes a few backends:
>> 
>> - NullOutputBackend, for ignoring all outputs.
>> - OnDiskOutputBackend, for writing to disk (the default), initially
based on the logic in `clang::CompilerInstance`.
>> - InMemoryOutputBackend, for writing to an `InMemoryFileSystem`.
>> - MirroringOutputBackend, for writing to multiple backends.
OutputDestination's API is designed around supporting this.
>> - FilteringOutputBackend, for filtering which outputs get written to
the underlying backend.
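
To illustrate how such backends compose, here is a small sketch in standard C++. It assumes a simplified interface where a backend receives a path and the complete content; the class names echo the list above but are hypothetical stand-ins, and the real patch works with streams and OutputDestination objects instead.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified backend interface (not the one in D95501).
class OutputBackend {
public:
  virtual ~OutputBackend() = default;
  virtual void write(const std::string &Path, const std::string &Content) = 0;
};

// Ignores every output.
class NullBackend : public OutputBackend {
public:
  void write(const std::string &, const std::string &) override {}
};

// Records outputs in a map; stands in for an on-disk or in-memory backend.
class RecordingBackend : public OutputBackend {
public:
  std::map<std::string, std::string> Files;
  void write(const std::string &Path, const std::string &Content) override {
    Files[Path] = Content;
  }
};

// Forwards every output to all of its underlying backends, in order.
class MirroringBackend : public OutputBackend {
  std::vector<std::shared_ptr<OutputBackend>> Backends;

public:
  explicit MirroringBackend(std::vector<std::shared_ptr<OutputBackend>> Backends)
      : Backends(std::move(Backends)) {}
  void write(const std::string &Path, const std::string &Content) override {
    for (const auto &B : Backends)
      B->write(Path, Content);
  }
};

// Forwards only the outputs whose path passes the predicate.
class FilteringBackend : public OutputBackend {
  std::function<bool(const std::string &)> Keep;
  std::shared_ptr<OutputBackend> Underlying;

public:
  FilteringBackend(std::function<bool(const std::string &)> Keep,
                   std::shared_ptr<OutputBackend> Underlying)
      : Keep(std::move(Keep)), Underlying(std::move(Underlying)) {
    assert(this->Keep && this->Underlying && "filter needs both parts");
  }
  void write(const std::string &Path, const std::string &Content) override {
    if (Keep(Path))
      Underlying->write(Path, Content);
  }
};
```

For example, a tool could mirror everything to one backend while a filter keeps only `.pcm` files in a second, in-memory one.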
>> 
>> Why doesn't this inherit from llvm::vfs::FileSystem?
>> Separating the input and output abstractions seems a bit cleaner.
It's easy enough to join them, when useful: e.g., write to an
`InMemoryFileSystem` (via an `InMemoryOutputBackend`) and install this same FS
in the `FileManager` (maybe indirectly via an `OverlayFileSystem`).
>> I agree with separating the interfaces. In hosted environments your
input VFS is often read-only and outputs go somewhere else.
>> 
>> But I wonder, is there an implicit assumption that data written to
OutputManager is readable via the (purportedly independent) vfs::FileSystem?
This seems like a helpful assumption for module caching, but is extremely
constraining and eliminates many of the simplest and most interesting
possibilities.
>> 
>> If you're going to require the FileSystem and OutputBackend to be
linked, then I think they *should* be the same object.
>> But if it's mostly module caching that requires that, then it seems
simpler and less invasive to virtualize module caching directly (put(module,
key) + get(key)) rather than file access.
>> 
>> Other work in the area
>> See also: https://reviews.llvm.org/D78058 (thanks to Marc Rasi for posting
that patch, and to Sam McCall for some feedback on an earlier version of this
proposal).
>> 
>> Thanks for reading!
>> Duncan
