Haven't got to this but would like to take a look/review it before it goes
in.
*skimming over some of the description*
Sounds like 'stream' might not be the right terminology - since they
return
pointers into data that (I think) remains valid for the life of the stream?
(this also makes me wonder a bit about memory usage if the cross-block
operation is used a lot (causing allocations) but the values are then
discarded by the user - the memory can't be reused, it's effectively
leaked)
Also - the wrappers ThinStreamReader/Writer - do they benefit substantially
from being thin-stream aware, rather than abstractions around any stream of
bytes? (since they transform the data anyway)
Oh, reinterpret casting... hrm. That kind of file reading/writing scheme
usually makes me a bit uncomfortable due to portability concerns (having to
align, byte swap, etc, structs to match the on-disk format can make those
structures problematic to work with - if you have to byte swap anyway,
you'd need to copy the data out of the underlying buffer anyway, right?)
On Wed, Feb 22, 2017 at 10:13 AM Zachary Turner via llvm-dev <
llvm-dev at lists.llvm.org> wrote:
> I take this as no objections. I've changed the name from ThinStream to
> BinaryStream as it more accurately conveys what it is used for, and
I've
> got the tests and comments mostly ready to go, so I'll commit this
later
> today if there's no objections?
>
> On Sat, Feb 18, 2017 at 5:09 PM Zachary Turner <zturner at
google.com> wrote:
>
> Some background:
>
> A while back while working on code to read / write PDB files, I came up
> with Yet Another Stream Abstraction. Note that LLVM already has a few.
> Off the top of my head, theres:
>
> 1) `MemoryBuffer` and its associated class hierarchy
> 2) `raw_ostream` and it's associated classes.
> 3) `DataExtractor` which is used for reading from a StringRef.
>
> There's probably more, and indivdiual subprojects might have even
created
> their own.
>
> The reason I couldn't use any of these and needed to invent another is
> because PDB files are not laid out contiguously in memory. You can think
> of it as a file system where there is an MFT that defines the blocks that
> individual files live on, and then you have to re-sequence the blocks in
> order to read out a "file".
>
> To add to the complexity, each "file" inside of here is actually
a list of
> variable length records where records, or even individual fields of records
> might cross a block boundary and require multiple reads to piece together.
> I needed a way to view these streams as being contiguous, and reading data
> out of them allowing all the magic of broken fields and records,
> discontiguous records, etc to just disappear behind the interface. Also, I
> needed the ability to read and write using a single interface without
> worrying about the details of the block structure. None of LLVM's
existing
> abstractions provided convenient mechanisms for writing this kind of data.
>
> So I came up with the following set of abstractions:
>
> *ThinStream* - An abstract base class that provides read-only access to
> data. The interface is:
> virtual Error readBytes(uint32_t Offset, uint32_t Size,
> ArrayRef<uint8_t> &Buffer) const = 0;
> virtual Error readLongestContiguousChunk(uint32_t Offset,
> ArrayRef<uint8_t> &Buffer) const = 0;
> virtual uint32_t getLength() const = 0;
>
> An important distinction between `ThinStream` and existing stream
> implementations is that API encourages implementations to be *zero copy*.
> Instead of giving it a buffer to write into, it just returns you a slice of
> the existing buffer. This makes it *very efficient*. Similar to
> `ArrayRef` / `MutableArrayRef`, I also provide *WritableThinStream* for
> cases where your data is not read-only. This is another area where
> functionality is provided that was not present in existing abstractions
> (i.e. writeability).
>
> I have several implementations of this class and some abstractions for
> working with them. *ByteStream* and *MutableByteStream* provide a
> concrete implementatino where the backing store is an `ArrayRef` or
> `MutableArrayRef`. *MappedBlockStream* (which is PDB specific) provides
> an implementation that seeks around a file, piecing together blocks from
> various locations in a PDB file. When a call to `readBytes` spans a block
> boundary, it does multiple reads, allocates a contiguous buffer from a
> `BumpPtrAllocator` and returns a reference to that (subsequent requests for
> the same offset return from the cached allocation). There is also
> *FileBufferThinStream* which adapts an `llvm::FileOutputBuffer` so you
> can write to a file system object. One could easily imagine an
> implementation that adapts `llvm::MemoryBuffer` so that you could read and
> write from mmap'ed files.
>
> But all of these just allow reading and writing raw bytes. To handle
> reading and writing semantic data, there are two additional classes.
> *ThinStreamReader* and *ThinStreamWriter.*
>
> These accept any subclass of `ThinStream` and maintain an offset and allow
> you to read integers in any endianness, strings, objects, arrays of objects
> (both fixed length records and variable length records), and various other
> things.
>
> Finally, there are *ThinStreamRef* and *WritableThinStreamRef* which you
> can think of as `ArrayRef` and `MutableArrayRef` for ThinStreams. They
> allow slicing, dropping, etc and copy-semantics so that you can easily pass
> streams around without worrying about reference lifetime issues.
>
> To re-iterate: *When using a ThinStreamReader, copies are the exception,
> not the rule and for most implementations of ThinStream will never occur.*
> Suppose you have a struct:
>
> struct Header {
> char Magic[48];
> ulittle16_t A;
> ulittle16_t B;
> ulittle64_t C;
> };
>
> To read this using a `ThinStreamReader`, you would write this:
>
> ThinStreamReader Reader(Stream);
> const Header *H;
> if (auto EC = Reader.readObject(H))
> return EC;
>
> and `ThinStreamReader` just reinterpret_casts the underlying bytes to your
> structure. The same is true for null terminated strings, arrays of
> objects, and everything else. *It is up to the user to ensure that reads
> and writes happen at proper alignments.* (LLVM can still assert though
> if you do a misaligned read/write). But when reading and writing records
> in binary file formats, this is not a new issue.
>
> The proposal:
> This code has been used in LLVM's PDB library for some time, and I want
to
> move it up to Support.
>
> I've discussed this with some people offline. beanz@ has expressed
> interest for some work he wants to do on libobject. It can also replace a
> few thousand lines of (untested) code in LLDB that does essentially the
> same thing. I suspect it can be used anytime anyone is reading from or
> writing to a binary file format, perhaps even being faster than existing
> implementations due to the zero-copy aspect.
>
> I have a (somewhat large) patch locally that gets LLVM working with the
> code raised up to Support. If you want to look at the existing
> implementation, the following files are relevant:
>
> // Files that would move up to Support
> include/DebugInfo/MSF/ByteStream.h
> include/DebugInfo/MSF/StreamInterface.h
> include/DebugInfo/MSF/StreamReader.h
> include/DebugInfo/MSF/StreamWriter.h
> include/DebugInfo/MSF/StreamArray.h
> include/DebugInfo/MSF/StreamRef.h
> lib/DebugInfo/MSF/StreamReader.cpp
> lib/DebugInfo/MSF/StreamWriter.cpp
> lib/DebugInfo/MSF/StreamRef.cpp
>
> // Files that would remain PDB specific
> include/DebugInfo/MSF/MappedBlockStream.h
> lib/DebugInfo/MSF/MappedBlockStream.cpp
>
> (In the existing implementation, Thin is not used in the class names)
>
> The code is lacking doxygen style comments, but I would add complete
> documentation as the result of any move.
>
> Questions / comments welcome.
>
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20170222/6c9bbe22/attachment-0001.html>