I would like to concatenate N files into one big file, taking advantage of ZFS copy-on-write semantics so that the concatenation is done without actually copying any (large amount of) file content:

  cat f1 f2 f3 f4 f5 > f15

Is this already possible when source and target are on the same ZFS filesystem?

I am looking into the ZFS source code to understand whether there are sufficient (private) interfaces to build a simple "zcat -o f15 f1 f2 f3 f4 f5" userland application in C. Does anybody have advice on this?

TIA
Per
Darren J Moffat
2009-Dec-03 12:08 UTC
[zfs-discuss] file concatenation with ZFS copy-on-write
Per Baatrup wrote:
> I would like to concatenate N files into one big file taking advantage of ZFS copy-on-write semantics so that the file concatenation is done without actually copying any (large amount of) file content.
>   cat f1 f2 f3 f4 f5 > f15
> Is this already possible when source and target are on the same ZFS filesystem?
>
> I am looking into the ZFS source code to understand if there are sufficient (private) interfaces to make a simple "zcat -o f15 f1 f2 f3 f4 f5" userland application in C code. Does anybody have advice on this?

The answer to this is likely deduplication, which ZFS now has.

The reason dedup should help here is that after the 'cat', f15 will be made up of blocks that match the blocks of f1 f2 f3 f4 f5.

Copy-on-write isn't what helps you here; it is dedup.

--
Darren J Moffat
Peter Tribble
2009-Dec-03 12:20 UTC
[zfs-discuss] file concatenation with ZFS copy-on-write
On Thu, Dec 3, 2009 at 12:08 PM, Darren J Moffat <darrenm at opensolaris.org> wrote:
> Per Baatrup wrote:
>> I would like to concatenate N files into one big file taking advantage of ZFS copy-on-write semantics so that the file concatenation is done without actually copying any (large amount of) file content.
>>   cat f1 f2 f3 f4 f5 > f15
>> Is this already possible when source and target are on the same ZFS filesystem?
>>
>> I am looking into the ZFS source code to understand if there are sufficient (private) interfaces to make a simple "zcat -o f15 f1 f2 f3 f4 f5" userland application in C code. Does anybody have advice on this?
>
> The answer to this is likely deduplication which ZFS now has.
>
> The reason dedup should help here is that after the 'cat' f15 will be made up of blocks that match the blocks of f1 f2 f3 f4 f5.

Is that likely to happen? Dedup is at the block level, so the blocks in f2 will only match the same data in f15 if they're aligned, which is only going to happen if f1 ends on a block boundary.

Besides, you still have to read all the data off the disk, manipulate it, and write it all back.

--
-Peter Tribble
http://www.petertribble.co.uk/ - http://ptribble.blogspot.com/
Darren J Moffat
2009-Dec-03 12:23 UTC
[zfs-discuss] file concatenation with ZFS copy-on-write
Peter Tribble wrote:
> On Thu, Dec 3, 2009 at 12:08 PM, Darren J Moffat <darrenm at opensolaris.org> wrote:
>> Per Baatrup wrote:
>>> I would like to concatenate N files into one big file taking advantage of ZFS copy-on-write semantics so that the file concatenation is done without actually copying any (large amount of) file content.
>>>   cat f1 f2 f3 f4 f5 > f15
>>> Is this already possible when source and target are on the same ZFS filesystem?
>>>
>>> I am looking into the ZFS source code to understand if there are sufficient (private) interfaces to make a simple "zcat -o f15 f1 f2 f3 f4 f5" userland application in C code. Does anybody have advice on this?
>> The answer to this is likely deduplication which ZFS now has.
>>
>> The reason dedup should help here is that after the 'cat' f15 will be made up of blocks that match the blocks of f1 f2 f3 f4 f5.
>
> Is that likely to happen? Dedup is at the block level, so the blocks in f2 will only match the same data in f15 if they're aligned, which is only going to happen if f1 ends on a block boundary.

Correct, you will only get the maximum benefit if the source files end on a block boundary. Which is why I said "likely deduplication".

> Besides, you still have to read all the data off the disk, manipulate it, and write it all back.

Yep.

--
Darren J Moffat
Bob Friesenhahn
2009-Dec-03 12:29 UTC
[zfs-discuss] file concatenation with ZFS copy-on-write
On Thu, 3 Dec 2009, Darren J Moffat wrote:
>
> The answer to this is likely deduplication which ZFS now has.
>
> The reason dedup should help here is that after the 'cat' f15 will be made up of blocks that match the blocks of f1 f2 f3 f4 f5.
>
> Copy-on-write isn't what helps you here; it is dedup.

Isn't this only true if the file sizes are such that the concatenated blocks are perfectly aligned on the same zfs block boundaries they used before? This seems unlikely to me.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Erik Ableson
2009-Dec-03 13:06 UTC
[zfs-discuss] file concatenation with ZFS copy-on-write
On 3 Dec 2009, at 13:29, Bob Friesenhahn <bfriesen at simple.dallas.tx.us> wrote:
> On Thu, 3 Dec 2009, Darren J Moffat wrote:
>>
>> The answer to this is likely deduplication which ZFS now has.
>>
>> The reason dedup should help here is that after the 'cat' f15 will be made up of blocks that match the blocks of f1 f2 f3 f4 f5.
>>
>> Copy-on-write isn't what helps you here; it is dedup.
>
> Isn't this only true if the file sizes are such that the concatenated blocks are perfectly aligned on the same zfs block boundaries they used before? This seems unlikely to me.

It's also worth noting that if the block alignment works out for the dedup, the actual write traffic will be trivial, consisting only of pointer references, so the heavy lifting will be the read operations.

Much depends on the contents of the files: fixed-size binary blobs that align nicely with 16/32/64k boundaries, or variable-sized text files.

Regards,
Erik Ableson
Darren J Moffat wrote:
> Per Baatrup wrote:
>> I would like to concatenate N files into one big file taking advantage of ZFS copy-on-write semantics so that the file concatenation is done without actually copying any (large amount of) file content.
>>   cat f1 f2 f3 f4 f5 > f15
>> Is this already possible when source and target are on the same ZFS filesystem?
>>
>> I am looking into the ZFS source code to understand if there are sufficient (private) interfaces to make a simple "zcat -o f15 f1 f2 f3 f4 f5" userland application in C code. Does anybody have advice on this?
>
> The answer to this is likely deduplication which ZFS now has.
>
> The reason dedup should help here is that after the 'cat' f15 will be made up of blocks that match the blocks of f1 f2 f3 f4 f5.
>
> Copy-on-write isn't what helps you here; it is dedup.

Well, to be precise, dedup is implemented on top of the COW features in ZFS's block allocator :) So yes, COW helps: it is the actual optimization feature. However, for this use case, it is DEDUP that obviates the need to do any 'special case' handling for this specific type of job, because DEDUP generalizes the detection of re-usable storage blocks.

On all accounts: yes, yes and yes inclusive :)
Per Baatrup wrote:
> I would like to concatenate N files into one big file taking advantage of ZFS copy-on-write semantics so that the file concatenation is done without actually copying any (large amount of) file content.
>   cat f1 f2 f3 f4 f5 > f15
> Is this already possible when source and target are on the same ZFS filesystem?
>
> I am looking into the ZFS source code to understand if there are sufficient (private) interfaces to make a simple "zcat -o f15 f1 f2 f3 f4 f5" userland application in C code. Does anybody have advice on this?
>
> TIA
> Per

You are right that a lot of blocks should be re-usable. This is essentially ZFS's (new) dedup feature, so why bother writing complicated (non-POSIX) userland extensions when you already have the storage optimization in recent versions of ZFS...?

Also, what you are proposing would, in a filesystem implementation, not actually be at the file level but (normally) at the block-allocation level. This idea will break down badly unless your 'files' are all aligned and in multiples of the blocksize (which is dynamic/configurable in ZFS).

Lastly, you might post at zfs-code :)
"dedup" operates on the block level leveraging the existing FFS checksums. Read "What to dedup: Files, blocks, or bytes" here http://blogs.sun.com/bonwick/entry/zfs_dedup The trick should be that the zcat userland app already knows that it will generate duplicate files so data read and writes could be avoided all together. -- This message posted from opensolaris.org
After reading all the comments it appears that there may be a 'real' problem with unaligned block sizes that DEDUP simply will not handle.

What you seem to be after, then, is the opposite of sparse files: 'virtual files' that can be chained together as a linked list of _fragments_ of allocation blocks as well as full allocation blocks. This could then be leveraged by a specialized concatenation driver (userland) to avoid realigning the blocks and missing the 'opportunity' to DEDUP or COW the existing blocks.

As always in computing, a specialized per-use-case driver will be able to yield the best optimizations. However, there will be a balance point, since the optimization based on leaving parts of allocation blocks unused is obviously not going to be healthy for, say:

  cat s1 s2 s3 s4 .... s999 > all_s_files
  rm s*

where s1...s999 all use (much) less than a block size. All the 'gain' in DEDUP is quickly offset by the enormous waste of block space after deletion of the constituent files.

Per Baatrup wrote:
> I would like to concatenate N files into one big file taking advantage of ZFS copy-on-write semantics so that the file concatenation is done without actually copying any (large amount of) file content.
>   cat f1 f2 f3 f4 f5 > f15
> Is this already possible when source and target are on the same ZFS filesystem?
>
> I am looking into the ZFS source code to understand if there are sufficient (private) interfaces to make a simple "zcat -o f15 f1 f2 f3 f4 f5" userland application in C code. Does anybody have advice on this?
>
> TIA
> Per
Michael Schuster
2009-Dec-03 13:25 UTC
[zfs-discuss] file concatenation with ZFS copy-on-write
Per Baatrup wrote:
> "dedup" operates on the block level, leveraging the existing ZFS checksums. Read "What to dedup: Files, blocks, or bytes" here: http://blogs.sun.com/bonwick/entry/zfs_dedup
>
> The trick should be that the zcat userland app already knows that it will generate duplicate data, so data reads and writes could be avoided altogether.

You'd probably be better off avoiding "zcat" - it's been in use since almost forever. From the man-page:

  zcat
       The zcat utility writes to standard output the uncompressed
       form of files that have been compressed using compress. It
       is the equivalent of uncompress -c. Input files are not
       affected.

:-)

cheers
Michael
--
Michael Schuster  http://blogs.sun.com/recursion
Recursion, n.: see 'Recursion'
Per Baatrup Petersen
2009-Dec-03 13:37 UTC
[zfs-discuss] file concatenation with ZFS copy-on-write
Thank you for the feedback, Michael.

"zcat" was my acronym for a special ZFS-aware version of "cat" and I did not know that it was an existing command. Simply forgot to check. Should rename it to "zfscat" or something similar.

Kind regards
Per

Michael Schuster wrote:
> Per Baatrup wrote:
>> "dedup" operates on the block level, leveraging the existing ZFS checksums. Read "What to dedup: Files, blocks, or bytes" here: http://blogs.sun.com/bonwick/entry/zfs_dedup
>>
>> The trick should be that the zcat userland app already knows that it will generate duplicate data, so data reads and writes could be avoided altogether.
>
> You'd probably be better off avoiding "zcat" - it's been in use since almost forever. From the man-page:
>
>   zcat
>        The zcat utility writes to standard output the uncompressed
>        form of files that have been compressed using compress. It
>        is the equivalent of uncompress -c. Input files are not
>        affected.
>
> :-)
>
> cheers
> Michael
"zcat" was my acronym for a special ZFS aware version of "cat" and the name was obviously a big mistake as I did not know it was an existing command and simply forgot to check. Should rename if to "zfscat" or something similar? -- This message posted from opensolaris.org
Roland Rambau
2009-Dec-03 15:24 UTC
[zfs-discuss] file concatenation with ZFS copy-on-write
gang,

actually a simpler version of that idea would be a "zcp":

if I just cp a file, I know that all blocks of the new file will be duplicates; so the cp could take full advantage of the dedup without a need to check/read/write any actual data

-- Roland

Per Baatrup wrote:
> "dedup" operates on the block level, leveraging the existing ZFS checksums. Read "What to dedup: Files, blocks, or bytes" here: http://blogs.sun.com/bonwick/entry/zfs_dedup
>
> The trick should be that the zcat userland app already knows that it will generate duplicate data, so data reads and writes could be avoided altogether.
michael schuster
2009-Dec-03 15:32 UTC
[zfs-discuss] file concatenation with ZFS copy-on-write
Roland Rambau wrote:
> gang,
>
> actually a simpler version of that idea would be a "zcp":
>
> if I just cp a file, I know that all blocks of the new file
> will be duplicates; so the cp could take full advantage of
> the dedup without a need to check/read/write any actual data

I think they call it 'ln' ;-) and that even works on ufs.

Michael
--
Michael Schuster  http://blogs.sun.com/recursion
Recursion, n.: see 'Recursion'
Darren J Moffat
2009-Dec-03 15:40 UTC
[zfs-discuss] file concatenation with ZFS copy-on-write
Bob Friesenhahn wrote:
> On Thu, 3 Dec 2009, Darren J Moffat wrote:
>>
>> The answer to this is likely deduplication which ZFS now has.
>>
>> The reason dedup should help here is that after the 'cat' f15 will be made up of blocks that match the blocks of f1 f2 f3 f4 f5.
>>
>> Copy-on-write isn't what helps you here; it is dedup.
>
> Isn't this only true if the file sizes are such that the concatenated blocks are perfectly aligned on the same zfs block boundaries they used before? This seems unlikely to me.

Yes that would be the case.

--
Darren J Moffat
Actually, 'ln -s source target' would not be the same as "zcp source target", as writing to the source file after the operation would change the target file as well, whereas for "zcp" this would only change the source file, due to the copy-on-write semantics of ZFS.
michael schuster
2009-Dec-03 15:48 UTC
[zfs-discuss] file concatenation with ZFS copy-on-write
Per Baatrup wrote:
> Actually, 'ln -s source target' would not be the same as "zcp source target", as writing to the source file after the operation would change the target file as well, whereas for "zcp" this would only change the source file, due to the copy-on-write semantics of ZFS.

I actually was thinking of creating a hard link (without the -s option), but your point is valid for hard and soft links.

cheers
Michael
--
Michael Schuster  http://blogs.sun.com/recursion
Recursion, n.: see 'Recursion'
michael schuster wrote:
> Roland Rambau wrote:
>> gang,
>>
>> actually a simpler version of that idea would be a "zcp":
>>
>> if I just cp a file, I know that all blocks of the new file
>> will be duplicates; so the cp could take full advantage of
>> the dedup without a need to check/read/write any actual data
>
> I think they call it 'ln' ;-) and that even works on ufs.
>
> Michael

+1

More and more it sounds like an optimization that will either A. not add much over dedup or B. have value only in specific situations - and completely misbehave in other situations (even the same situations after passage of time).

Why not just make a special-purpose application (completely user-land) for it? I know, 'ln' is remotely akin to this idea, but 'ln' is POSIX and people know what to expect.

What you'd practically need to do is whip up a vfs layer that exposes the underlying blocks of a filesystem and possibly name them by their SHA256 or MD5 hash. Then you'd need (another?) vfs abstraction that allows 'virtual' files to be assembled from these blocks in multiple independent chains. I know there is already a fuse implementation of the first vfs driver (the name evades me, but I think it was something like chunkfs[1]) and one could at least whip up a reasonable read-only proof-of-concept of the second part.

The reason _I_ wouldn't do that is because I'm already happy with e.g.:

  mkfifo /var/run/my_part_collector
  (while true; do cat /local/data/my_part_* > /var/run/my_part_collector; done)&
  wc -l /var/run/my_part_collector

The equivalent of this could be (better) expressed in C, perl or any language of your choice. I believe this is all POSIX.

[1] The reason this exists is obviously for backup and synchronization implementations: it will make it possible to back up files using rsync when the encryption key is not available to the backup process (with an ECB-mode crypto algo); it should make it 'simple' to synchronize one's large monolithic files with e.g. Amazon S3 cloud storage, etc.
Bob Friesenhahn
2009-Dec-03 15:58 UTC
[zfs-discuss] file concatenation with ZFS copy-on-write
On Thu, 3 Dec 2009, Erik Ableson wrote:
>
> Much depends on the contents of the files. Fixed size binary blobs that align nicely with 16/32/64k boundaries, or variable sized text files.

Note that the default zfs block size is 128K and so that will therefore be the default dedup block size.

Most files are less than 128K and occupy a short tail block so concatenating them will not usually enjoy the benefits of deduplication.

It is not wise to riddle zfs with many special-purpose features since zfs would then be encumbered by these many features, which tend to defeat future improvements.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Darren J Moffat
2009-Dec-03 16:03 UTC
[zfs-discuss] file concatenation with ZFS copy-on-write
Bob Friesenhahn wrote:
> On Thu, 3 Dec 2009, Erik Ableson wrote:
>>
>> Much depends on the contents of the files. Fixed size binary blobs that align nicely with 16/32/64k boundaries, or variable sized text files.
>
> Note that the default zfs block size is 128K and so that will therefore be the default dedup block size.
>
> Most files are less than 128K and occupy a short tail block so concatenating them will not usually enjoy the benefits of deduplication.

Most? I think that is a bit of a sweeping statement. I know of some environments where "most" files are multiple gigabytes in size and others where 1K is the upper bound of the file system. So I don't think you can say at all that "most" files are < 128K.

--
Darren J Moffat
On Thu, Dec 3, 2009 at 9:58 AM, Bob Friesenhahn <bfriesen at simple.dallas.tx.us> wrote:
> On Thu, 3 Dec 2009, Erik Ableson wrote:
>>
>> Much depends on the contents of the files. Fixed size binary blobs that align nicely with 16/32/64k boundaries, or variable sized text files.
>
> Note that the default zfs block size is 128K and so that will therefore be the default dedup block size.
>
> Most files are less than 128K and occupy a short tail block so concatenating them will not usually enjoy the benefits of deduplication.
>
> It is not wise to riddle zfs with many special-purpose features since zfs would then be encumbered by these many features, which tend to defeat future improvements.

Well, it could be done in a way such that it could be fs-agnostic (perhaps extending /bin/cat with a new flag such as -o outputfile, or detecting if stdout is a file vs tty, though corner cases might get tricky). If a particular fs supported such a feature, it could take advantage of it, but if it didn't, it could fall back to doing a read+append. Sort of like how mv figures out if the source & target are the same or different filesystems and acts accordingly.

There are a few use cases I've encountered where having this would have been _very_ useful (usually when trying to get large crashdumps to Sun quickly). In general, it would allow one to manipulate very large files by breaking them up into smaller subsets while still having the end result be a single file.
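A minimal sketch of what that fs-agnostic fallback could look like, assuming no ZFS-specific interface is available; the "zfscat" name, the -o flag and the st_dev test (the same check mv relies on) are all illustrative only:

/*
 * Sketch only: a filesystem-agnostic "cat -o" that checks whether source
 * and target share a filesystem (st_dev, as mv does) and, lacking any
 * public ZFS block-sharing interface, always falls back to a plain
 * read+append copy.  All names here are hypothetical.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

int
main(int argc, char **argv)             /* zfscat -o target f1 f2 ... fN */
{
        struct stat tgt, src;
        char buf[128 * 1024];           /* one default ZFS record */
        ssize_t n;
        int i, in, out;

        if (argc < 4 || strcmp(argv[1], "-o") != 0) {
                (void) fprintf(stderr, "usage: zfscat -o target file...\n");
                return (1);
        }
        out = open(argv[2], O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (out < 0 || fstat(out, &tgt) < 0)
                return (1);
        for (i = 3; i < argc; i++) {
                in = open(argv[i], O_RDONLY);
                if (in < 0 || fstat(in, &src) < 0)
                        return (1);
                if (src.st_dev == tgt.st_dev) {
                        /* same filesystem: this is where a block-sharing
                         * interface would be used, if one existed */
                }
                while ((n = read(in, buf, sizeof (buf))) > 0)
                        if (write(out, buf, n) != n)
                                return (1);
                (void) close(in);
        }
        (void) close(out);
        return (0);
}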
Bob Friesenhahn
2009-Dec-03 16:31 UTC
[zfs-discuss] file concatenation with ZFS copy-on-write
On Thu, 3 Dec 2009, Jason King wrote:
>
> Well it could be done in a way such that it could be fs-agnostic (perhaps extending /bin/cat with a new flag such as -o outputfile, or detecting if stdout is a file vs tty, though corner cases might get tricky). If a particular fs supported such a feature, it could take advantage of it, but if it didn't, it could fall back to doing a read+append. Sort of like how mv figures out if the source & target are the same or different filesystems and acts accordingly.

The most common way that I concatenate files into a larger file is by using a utility such as 'tar', which outputs a different format. I rarely use 'cat' to concatenate files.

If it is desired to concatenate files in a way which works best for deduplication, then a tar-like format can be invented which takes care to always start new file output on a filesystem block boundary. With zfs deduplication this should be faster and take less space than compressing the entire result, as long as the output is stored in the same pool. If output is written to a destination filesystem which uses a different block size, then the ideal block size will be that of the destination filesystem so that large archive files can still be usefully deduplicated.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
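The alignment step such a tar-like format would need is small; a rough sketch, assuming the dataset default 128K recordsize (a real tool would have to query the actual recordsize of the destination):

/*
 * Before writing each member, pad the archive with zeros up to the next
 * recordsize boundary so that the member's data starts block-aligned and
 * has a chance to dedup against the original file.  RECORDSIZE is an
 * assumption here (the 128K dataset default).
 */
#include <sys/types.h>
#include <unistd.h>

#define	RECORDSIZE	(128 * 1024)

static int
pad_to_boundary(int fd, off_t bytes_written_so_far)
{
	static const char zeros[RECORDSIZE];	/* zero-filled, lives in BSS */
	size_t gap = (RECORDSIZE - (size_t)(bytes_written_so_far % RECORDSIZE)) % RECORDSIZE;

	if (gap == 0)
		return (0);
	return (write(fd, zeros, gap) == (ssize_t)gap ? 0 : -1);
}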
Roland Rambau
2009-Dec-03 17:10 UTC
[zfs-discuss] file concatenation with ZFS copy-on-write
Michael,

michael schuster wrote:
> Roland Rambau wrote:
>> gang,
>>
>> actually a simpler version of that idea would be a "zcp":
>>
>> if I just cp a file, I know that all blocks of the new file
>> will be duplicates; so the cp could take full advantage of
>> the dedup without a need to check/read/write any actual data
>
> I think they call it 'ln' ;-) and that even works on ufs.

Quite similar, but with a critical difference: with hard links any modifications through either link are seen by both links, since it stays a single file (note that editors like vi do an implicit cp; they do NOT update the original file).

That "zcp" (actually it should be just a feature of 'cp') would be blockwise copy-on-write. It would have exactly the same semantics as cp but just avoid any data movement, since we can easily predict what the effect of a cp followed by a dedup should be.

-- Roland
Roland,

Clearly an extension of "cp" would be very nice when managing large files. Today we are relying heavily on snapshots for this, but this requires discipline in storing files in separate ZFS filesystems, avoiding snapshotting too many files that change frequently.

The reason I was speaking about "cat" instead of "cp" is that in addition to copying a single file I would like also to concatenate several files into a single file. Can this be accomplished with your "(z)cp"?

--Per
A Darren Dunham
2009-Dec-03 17:58 UTC
[zfs-discuss] file concatenation with ZFS copy-on-write
On Thu, Dec 03, 2009 at 09:36:23AM -0800, Per Baatrup wrote:
> The reason I was speaking about "cat" instead of "cp" is that in
> addition to copying a single file I would like also to concatenate
> several files into a single file. Can this be accomplished with your
> "(z)cp"?

Unless you have special data formats, I think it's unlikely that the last ZFS block in the file will be exactly full. But to append without copying, you'd need some way of ignoring a portion of the data in a non-final ZFS block and stitching together the bytestream. I don't think that's possible with the ZFS layout.

--
Darren
Nicolas Williams
2009-Dec-03 18:08 UTC
[zfs-discuss] file concatenation with ZFS copy-on-write
On Thu, Dec 03, 2009 at 03:57:28AM -0800, Per Baatrup wrote:
> I would like to concatenate N files into one big file taking
> advantage of ZFS copy-on-write semantics so that the file
> concatenation is done without actually copying any (large amount of)
> file content.
>   cat f1 f2 f3 f4 f5 > f15
> Is this already possible when source and target are on the same ZFS
> filesystem?
>
> I am looking into the ZFS source code to understand if there are
> sufficient (private) interfaces to make a simple "zcat -o f15 f1 f2
> f3 f4 f5" userland application in C code. Does anybody have advice on
> this?

There have been plenty of answers already. Quite aside from dedup, the fact that all blocks in a file must have the same uncompressed size means that if any of f2..f5 have different block sizes from f1, or any of f1..f5's last blocks are partial, then ZFS could not perform this concatenation as efficiently as you wish.

In other words: dedup _is_ what you're looking for...

...but also ZFS most likely could not do any better with any other, more specific non-dedup solution.

Nico
--
Roland Rambau
2009-Dec-03 18:37 UTC
[zfs-discuss] file concatenation with ZFS copy-on-write
Per,

Per Baatrup wrote:
> Roland,
>
> Clearly an extension of "cp" would be very nice when managing large files. Today we are relying heavily on snapshots for this, but this requires discipline in storing files in separate ZFS filesystems, avoiding snapshotting too many files that change frequently.
>
> The reason I was speaking about "cat" instead of "cp" is that in addition to copying a single file I would like also to concatenate several files into a single file. Can this be accomplished with your "(z)cp"?

No - "zcp" is a simpler case than what you proposed, and that's why I pointed it out as a discussion case. (And it is clearly NOT the same as 'ln'.)

Btw. I would be surprised to hear that this can be implemented with current APIs; you would need a call like (my fantasy here) "write_existing_block()", where the data argument is not a pointer to a buffer in memory but instead a reference to an already existing data block in the pool. Based on such a call (and a corresponding one for read that returns those references in the pool), IMHO an implementation of the commands would be straightforward (the actual work would be in the implementation of those calls).

This can certainly be done - I just doubt it already exists.

-- Roland
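To make the shape of that fantasy interface concrete, a purely hypothetical pair of prototypes; nothing like this exists in the ZFS source today, and the names and types are invented only to match the description above:

#include <sys/types.h>

/* opaque handle to a data block that already exists in the pool */
typedef struct zblock_ref zblock_ref_t;

/* return references to the blocks backing [off, off + len) of fd */
extern int read_block_refs(int fd, off_t off, size_t len,
    zblock_ref_t *refs, int maxrefs);

/* append already-existing blocks to fd without moving their data */
extern int write_existing_blocks(int fd, const zblock_ref_t *refs, int nrefs);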
> if any of f2..f5 have different block sizes from f1

This restriction does not sound so bad to me if it only refers to changes to the blocksize of a particular ZFS filesystem or copying between different ZFSes in the same pool. This can probably be managed with a "-f" switch on the userland app to force the copy when it would otherwise fail.

> any of f1..f5's last blocks are partial

Does this mean that f1,f2,f3,f4 need to be an exact multiple of the ZFS blocksize? This is a severe restriction that will fail in all but very special cases. Is this related to the disk format or is it a restriction in the implementation? (Do you know where to look in the source code?)

> ...but also ZFS most likely could not do any better with any other, more
> specific non-dedup solution

Probably lots of I/O traffic, digest calculations+lookups, could be saved, as we already know it will be a duplicate. (In our case the files are gigabytes in size.)

--Per
> Btw. I would be surprised to hear that this can be implemented
> with current APIs;

I agree. However, it looks like an opportunity to dive into the ZFS source code.
Nicolas Williams
2009-Dec-03 21:26 UTC
[zfs-discuss] file concatenation with ZFS copy-on-write
On Thu, Dec 03, 2009 at 12:44:16PM -0800, Per Baatrup wrote:
> > if any of f2..f5 have different block sizes from f1
>
> This restriction does not sound so bad to me if it only refers to
> changes to the blocksize of a particular ZFS filesystem or copying
> between different ZFSes in the same pool. This can probably be managed
> with a "-f" switch on the userland app to force the copy when it would
> otherwise fail.

Why expose such details?

If you have dedup on and if the file blocks and sizes align then

	cat f1 f2 f3 f4 f5 > f6

will do the right thing and consume only space for new metadata. If the file blocks and sizes do not align then

	cat f1 f2 f3 f4 f5 > f6

will still work correctly.

Or do you mean that you want a way to do that cat ONLY if it would consume no new space for data? (That might actually be a good justification for a ZFS cat command, though I think, too, that one could script it.)

> > any of f1..f5's last blocks are partial
>
> Does this mean that f1,f2,f3,f4 need to be an exact multiple of the ZFS
> blocksize? This is a severe restriction that will fail in all but very
> special cases.

Say f1 is 1MB, f2 is 128KB, f3 is 510 bytes, f4 is 514 bytes, and f5 is 10MB, and the recordsize for their containing datasets is 128KB; then the new file will consume 10MB + 128KB more than f1..f5 did, but 1MB + 128KB will be de-duplicated. This is not really "a severe restriction". To make ZFS do better than that would require much extra metadata and complexity in the filesystem that users who don't need to do space-efficient file concatenation (most users, that is) won't want to pay for.

> Is this related to the disk format or is it a restriction in the
> implementation? (Do you know where to look in the source code?)

Both.

> > ...but also ZFS most likely could not do any better with any other, more
> > specific non-dedup solution
>
> Probably lots of I/O traffic, digest calculations+lookups, could be
> saved, as we already know it will be a duplicate. (In our case the
> files are gigabytes in size.)

ZFS hashes, and records hashes of, blocks, not sub-blocks. Look at my example above. To efficiently dedup the concatenation of the 10MB of f5 would require being able to have something like "sub-block pointers".

Alternatively, if you want a concatenation-specific feature, ZFS would have to have a metadata notion of concatenation, but then the Unix way of concatenating files couldn't be used for this since the necessary context is lost in the I/O redirection.

Nico
--
Daniel Carosone
2009-Dec-03 21:45 UTC
[zfs-discuss] file concatenation with ZFS copy-on-write
> > Isn't this only true if the file sizes are such that the concatenated
> > blocks are perfectly aligned on the same zfs block boundaries they used
> > before? This seems unlikely to me.
>
> Yes that would be the case.

While eagerly awaiting b128 to appear in IPS, I have been giving this issue (block size and alignment vs dedup) some thought recently. I have a different, but sufficiently similar, scenario where the effectiveness of dedup will depend heavily on this factor.

For this case, though, the alignment question for short tails is relatively easily dealt with. The key is that the record size of the file is "up to 128k" and may be shorter depending on various circumstances, such as the write pattern used.

To simplify, let us assume that the original files were all written quickly and sequentially, that is, that they have n 128k blocks plus a shorter tail. When concatenating them, it should be sufficient to write out the target file in 128k chunks from the source, then the first tail, then issue an fsync before moving on to the chunks from the second file.

If the source files were not written in this pattern (e.g. log files, accumulating small varying-size writes), the best thing to do is to rewrite those "in place" as well, with the same pattern as being written to the joined file. This can also improve compression efficiency, by allowing larger block sizes than the original.

Issues/questions:

 * This is an optimistic method of alignment; is there any mechanism to get stronger results - i.e., to know the size of each record of the original, or to produce a specific record size/alignment on output?
 * There's already the very useful seek interface for finding holes and data; perhaps something similar is useful here. Or a direct-io-related option to read, that can return short reads only up to the end of the current record?
 * Perhaps a pause of some kind (to wait for the txg to close) is also necessary, to ensure the tail doesn't get combined with new data and reblocked?
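A rough sketch of that optimistic copy loop, assuming the default 128K recordsize; whether the resulting records really line up with the originals is exactly the open question raised above:

/*
 * Stream one source file into the target in full 128K chunks, then its
 * short tail, then fsync so the tail is committed before the next source
 * begins and is not coalesced with the following data.  RECORDSIZE is an
 * assumption (the dataset default); error handling is minimal.
 */
#include <fcntl.h>
#include <unistd.h>

#define	RECORDSIZE	(128 * 1024)

static int
append_reblocked(int out, const char *path)
{
	char buf[RECORDSIZE];
	ssize_t n;
	int in = open(path, O_RDONLY);

	if (in < 0)
		return (-1);
	while ((n = read(in, buf, sizeof (buf))) > 0) {
		if (write(out, buf, n) != n) {
			(void) close(in);
			return (-1);
		}
	}
	(void) close(in);
	if (n < 0)
		return (-1);
	/* flush so the tail becomes its own (short) record, per the idea above */
	return (fsync(out));
}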
A Darren Dunham
2009-Dec-03 22:55 UTC
[zfs-discuss] file concatenation with ZFS copy-on-write
On Thu, Dec 03, 2009 at 12:44:16PM -0800, Per Baatrup wrote:
> > any of f1..f5's last blocks are partial
>
> Does this mean that f1,f2,f3,f4 need to be an exact multiple of the ZFS
> blocksize? This is a severe restriction that will fail in all but very
> special cases. Is this related to the disk format or is it a
> restriction in the implementation? (Do you know where to look in the
> source code?)

I'm sure it's related to the FS structure. How do you find a particular point in a file quickly? You don't read up to that point; you want to go to it directly. To do so, you have to know how the file is indexed. If every block contains the same amount of data, this is a simple math equation. If some blocks have more or less data, then you have to keep track of them and their size.

I doubt ZFS has any space or ability to include non-full blocks in the middle of a file.

--
Darren
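Spelled out, the "simple math equation" a fixed block size buys looks like this (128K assumed); it is exactly this two-operation lookup that variable-sized mid-file blocks would break:

#include <stdint.h>
#include <sys/types.h>

#define	RECORDSIZE	(128 * 1024)	/* assumed fixed block size */

static void
locate(off_t offset, uint64_t *block, uint32_t *off_in_block)
{
	*block = (uint64_t)offset / RECORDSIZE;			/* which block holds the byte */
	*off_in_block = (uint32_t)(offset % RECORDSIZE);	/* where inside that block */
}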
Michael Schuster
2009-Dec-04 06:45 UTC
[zfs-discuss] file concatenation with ZFS copy-on-write
Nicolas Williams wrote:
> On Thu, Dec 03, 2009 at 12:44:16PM -0800, Per Baatrup wrote:
>>> if any of f2..f5 have different block sizes from f1
>>
>> This restriction does not sound so bad to me if it only refers to
>> changes to the blocksize of a particular ZFS filesystem or copying
>> between different ZFSes in the same pool. This can probably be managed
>> with a "-f" switch on the userland app to force the copy when it would
>> otherwise fail.
>
> Why expose such details?
>
> If you have dedup on and if the file blocks and sizes align then
>
>	cat f1 f2 f3 f4 f5 > f6
>
> will do the right thing and consume only space for new metadata.

I think Per's concern was not only with the space consumed but also the effort involved in the process (think large files); if I read his emails correctly, he'd like what amounts to manipulation of metadata only, to have the data blocks of what were originally 5 files end up in one file. The traditional concat operation will cause all the data to be read and written back, at which point dedup will kick in, and so most of the processing has already been spent.

(Per, please correct/comment)

Michael
--
Michael Schuster  http://blogs.sun.com/recursion
Recursion, n.: see 'Recursion'
Michael,

Your explanation is 100% correct: I am concerned about the effort when managing quite large files, e.g. 500MB. In my specific case we have DVD/Blu-ray chapter files of 500MB - 2GB (parts of a movie) that are concatenated into the complete movie (3-20GB).

From my point of view (large files) it is not so important whether there is a minor issue with handling the last block in terms of disk-space efficiency.

--Per
Jeffry Molanus
2009-Dec-04 10:21 UTC
[zfs-discuss] file concatenation with ZFS copy-on-write
Actually, I asked about this a while ago, only I called it file-level cloning. Consider you have 100 VMs and you want to clone just one?

BTRFS added a specialized ioctl() call to make the FS aware that it has to clone; this obviously saves copy time and dedup time.

Regards, Jeffry

> -----Original message-----
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-bounces at opensolaris.org] On behalf of Roland Rambau
> Sent: Thursday, 3 December 2009 16:25
> To: Per Baatrup
> CC: zfs-discuss at opensolaris.org
> Subject: Re: [zfs-discuss] file concatenation with ZFS copy-on-write
>
> gang,
>
> actually a simpler version of that idea would be a "zcp":
>
> if I just cp a file, I know that all blocks of the new file
> will be duplicates; so the cp could take full advantage of
> the dedup without a need to check/read/write any actual data
>
> -- Roland
>
> Per Baatrup wrote:
>> "dedup" operates on the block level, leveraging the existing ZFS checksums. Read "What to dedup: Files, blocks, or bytes" here: http://blogs.sun.com/bonwick/entry/zfs_dedup
>>
>> The trick should be that the zcat userland app already knows that it will generate duplicate data, so data reads and writes could be avoided altogether.
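For comparison, the btrfs call referred to above amounts to a single ioctl; a sketch, assuming the BTRFS_IOC_CLONE definition from the btrfs ioctl headers of that era (ZFS exposes nothing equivalent, which is the gap being discussed):

/*
 * Clone src's extents into dst on a btrfs filesystem; no file data is
 * read or written.  The ioctl number is taken from the btrfs headers
 * and reproduced here as an assumption for illustration.
 */
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

#ifndef BTRFS_IOC_CLONE
#define	BTRFS_IOC_CLONE	_IOW(0x94, 9, int)	/* from btrfs ioctl.h */
#endif

int
main(int argc, char **argv)			/* btrfs-clone src dst */
{
	int src, dst;

	if (argc != 3)
		return (1);
	src = open(argv[1], O_RDONLY);
	dst = open(argv[2], O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (src < 0 || dst < 0)
		return (1);
	/* share src's extents with dst instead of copying them */
	if (ioctl(dst, BTRFS_IOC_CLONE, src) < 0) {
		perror("BTRFS_IOC_CLONE");
		return (1);
	}
	return (0);
}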
I was thinking in the same direction about the efficiency of the offset calculations. Trying to get into the ZFS source code to understand this part, but did not have time to get there yet. This issue may be a showstopper for the proposal as it would restrict the functionality to quite rare cases.
Bob Friesenhahn
2009-Dec-04 15:21 UTC
[zfs-discuss] file concatenation with ZFS copy-on-write
On Fri, 4 Dec 2009, Jeffry Molanus wrote:
> Actually, I asked about this a while ago, only I called it file-level cloning. Consider you have 100 VMs and you want to clone just one?
>
> BTRFS added a specialized ioctl() call to make the FS aware that it has to clone; this obviously saves copy time and dedup time.

The best things that I see in Solaris for efficiently copying/concatenating files are the functions sendfile() and sendfilev() in libsendfile. Unfortunately, these are not portable interfaces, and the amount of data which can be copied in one call is limited by the maximum value supported by the size_t type. A 64-bit program should be able to request "sending" the content of a large file into another large file in one call, but a 32-bit program would need to use multiple calls.

If Solaris sendfile is similar to the sendfile in other OSs, the data copy is done in kernel space (with potentially zero-copy for the read) but the data still needs to be copied from/to the filesystem.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
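A sketch of that sendfile() approach (Solaris libsendfile, compile with -lsendfile); the copy loop stays in the kernel, but the data is still read and rewritten rather than shared, and the helper shown is illustrative only:

/*
 * Append one source file to an already-open target using sendfile().
 * A 64-bit build can move the whole file in one call, as noted above;
 * a 32-bit build would need to chunk this because of size_t limits.
 */
#include <fcntl.h>
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <unistd.h>

static int
append_with_sendfile(int out, const char *path)
{
	struct stat st;
	off_t off = 0;
	int in = open(path, O_RDONLY);

	if (in < 0 || fstat(in, &st) < 0)
		return (-1);
	/* kernel-side copy of the entire source into the target */
	if (sendfile(out, in, &off, (size_t)st.st_size) != st.st_size) {
		(void) close(in);
		return (-1);
	}
	(void) close(in);
	return (0);
}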
Richard Elling
2009-Dec-04 17:50 UTC
[zfs-discuss] file concatenation with ZFS copy-on-write
On Dec 4, 2009, at 2:21 AM, Jeffry Molanus wrote:
> Actually, I asked about this a while ago, only I called it file-level cloning. Consider you have 100 VMs and you want to clone just one?

In my experience, cloning is done for basic provisioning, so how would you get to the case where you could not clone any particular VM?
 -- richard

> BTRFS added a specialized ioctl() call to make the FS aware that it has to clone; this obviously saves copy time and dedup time.
>
> Regards, Jeffry
>
>> -----Original message-----
>> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-bounces at opensolaris.org] On behalf of Roland Rambau
>> Sent: Thursday, 3 December 2009 16:25
>> To: Per Baatrup
>> CC: zfs-discuss at opensolaris.org
>> Subject: Re: [zfs-discuss] file concatenation with ZFS copy-on-write
>>
>> gang,
>>
>> actually a simpler version of that idea would be a "zcp":
>>
>> if I just cp a file, I know that all blocks of the new file
>> will be duplicates; so the cp could take full advantage of
>> the dedup without a need to check/read/write any actual data
>>
>> -- Roland
>>
>> Per Baatrup wrote:
>>> "dedup" operates on the block level, leveraging the existing ZFS checksums. Read "What to dedup: Files, blocks, or bytes" here: http://blogs.sun.com/bonwick/entry/zfs_dedup
>>>
>>> The trick should be that the zcat userland app already knows that it will generate duplicate data, so data reads and writes could be avoided altogether.
Richard Elling
2009-Dec-04 17:58 UTC
[zfs-discuss] file concatenation with ZFS copy-on-write
The way I see it, a filename is a handle to a specific set of blocks. For applications that can handle multiple files, no worries. For applications that can't (inferring DVD players?), I sense that a file system is probably not the best place to fix the tail-block issue. This affects all file systems, because all of them must use at least 512 bytes for a physical block.

Suppose we consider a shortcut, say a symbolic link with multiple sources. When read, it will appear to the application as a single file, but be comprised of the concatenated contents of multiple files, respecting the proper EOFs. This could work as long as the files are read-only. Would that be too much of a constraint?
 -- richard

On Dec 3, 2009, at 5:23 AM, sgheeren wrote:
> After reading all the comments it appears that there may be a 'real'
> problem with unaligned block sizes that DEDUP simply will not handle.
>
> What you seem to be after, then, is the opposite of sparse files:
> 'virtual files' that can be chained together as a linked list of
> _fragments_ of allocation blocks as well as full allocation blocks.
> This could then be leveraged by a specialized concatenation driver
> (userland) to avoid realigning the blocks and missing the
> 'opportunity' to DEDUP or COW the existing blocks.
>
> As always in computing, a specialized per-use-case driver will be
> able to yield the best optimizations. However, there will be a
> balance point, since the optimization based on leaving parts of
> allocation blocks unused is obviously not going to be healthy for, say:
>
>   cat s1 s2 s3 s4 .... s999 > all_s_files
>   rm s*
>
> where s1...s999 all use (much) less than a block size. All the
> 'gain' in DEDUP is quickly offset by the enormous waste of block
> space after deletion of the constituent files.
>
> Per Baatrup wrote:
>> I would like to concatenate N files into one big file taking
>> advantage of ZFS copy-on-write semantics so that the file
>> concatenation is done without actually copying any (large amount
>> of) file content.
>>   cat f1 f2 f3 f4 f5 > f15
>> Is this already possible when source and target are on the same ZFS
>> filesystem?
>>
>> I am looking into the ZFS source code to understand if there are
>> sufficient (private) interfaces to make a simple "zcat -o f15 f1
>> f2 f3 f4 f5" userland application in C code. Does anybody have
>> advice on this?
>>
>> TIA
>> Per
Jeffry Molanus
2009-Dec-04 19:54 UTC
[zfs-discuss] file concatenation with ZFS copy-on-write
> In my experience, cloning is done for basic provisioning, so how would
> you get to the case where you could not clone any particular VM?
> -- richard

Well, a situation where this might come in handy is when you have your typical ISP that has multiple ESX hosts with multiple datastores. ESX has limits on how many datastores it can have, so cloning filesystems over and over will only get you so far (16, I believe?). Or a VDI environment for schools, for instance? Instead of cloning a complete zfs filesystem, you could clone freshmen-gold.vmdk once for each newly subscribed student.

Let's assume the scenario of the school. You have an NFS export containing gold images with different pre-installed applications or whatever. How would you rapidly deploy 500 new gold images? Copy them 500 times? If you clone them on the ESX side, you would also have to copy them. Moreover, why copy->dedup if you can prevent the dedup process altogether? Since the dedup process is inline, it could affect the storage performance as it goes along.

Regards, Jeffry
Richard Elling
2009-Dec-04 21:07 UTC
[zfs-discuss] file concatenation with ZFS copy-on-write
On Dec 4, 2009, at 11:54 AM, Jeffry Molanus wrote:
>
>> In my experience, cloning is done for basic provisioning, so how would
>> you get to the case where you could not clone any particular VM?
>> -- richard
>
> Well, a situation where this might come in handy is when you have your typical ISP that has multiple ESX hosts with multiple datastores. ESX has limits on how many datastores it can have, so cloning filesystems over and over will only get you so far (16, I believe?). Or a VDI environment for schools, for instance? Instead of cloning a complete zfs filesystem, you could clone freshmen-gold.vmdk once for each newly subscribed student.

For ESX, the current limit of NFS datastores is 64.

> Let's assume the scenario of the school. You have an NFS export containing gold images with different pre-installed applications or whatever. How would you rapidly deploy 500 new gold images? Copy them 500 times? If you clone them on the ESX side, you would also have to copy them. Moreover, why copy->dedup if you can prevent the dedup process altogether? Since the dedup process is inline, it could affect the storage performance as it goes along.

In my experience, this is why people are using iSCSI instead of NFS for ESX. ESX has a LUN limit of 256, so eventually you'll need another box :-)

This is a bit of a shame, but the real solution resides in VMware's court. Apparently the claim is that NFS mounts consume memory, hence the restriction. Again, this is a bit of a shame because of the relative cost of RAM versus the thousands of man-hours spent gnashing teeth on the issue.
 -- richard
Hi all

I wonder if there has been any new development on this matter over the past 6 months.

Today I pondered the idea of a zfs-aware "mv", capable of doing zero read/write of file data when moving files between datasets of one pool. This seems like the "(z)cp" idea proposed in this thread and seems like a trivial job for Sun - who have all the APIs and functional implementations for cloning and dedup as a means to reference the same block from different files. Such an extension to cp should be cheaper than generic dedup and useful for copying any "templated" file sets.

I thought of local zones first, but most people may init them by packages (though zoneadm says it is copying thousands of files), so /etc/skel might be a better example of the use case - though nearly useless ,)

jim