Hi,

The ZFS On-Disk Format Specification [1] isn't very clear on what is compressed on disk and what's not. Could someone please clarify that? Am I correct that, for example, the MOS pointed to by ub_rootbp is compressed with the algorithm defined by the comp field of blkptr_t?

Pekka

1. http://opensolaris.org/os/community/zfs/docs/ondiskformatfinal.pdf
On Fri, Dec 16, 2005 at 04:40:27PM +0200, Pekka J Enberg wrote:
> The ZFS On-Disk Format Specification [1] isn't very clear on what is
> compressed on disk and what's not. Could someone please clarify that?
> Am I correct that, for example, the MOS pointed to by ub_rootbp is
> compressed with the algorithm defined by the comp field of blkptr_t?

Pekka -

Any block can be compressed or uncompressed, as expressed in the relevant blkptr_t (ub_rootbp, in your example). Which blocks are compressed is entirely an implementation detail, and doesn't affect the on-disk specification.

For example, the initial release of ZFS had all metadata compressed. However, this made failure diagnosis and disaster recovery more difficult, and was reversed in build 29:

    6354299 Disable metadata compression, at least temporarily

However, this could be changed at a future date; it doesn't matter as far as the on-disk spec goes.

- Eric

--
Eric Schrock, Solaris Kernel Development
http://blogs.sun.com/eschrock
> Any block can be compressed or uncompressed, as expressed in the relevant
> blkptr_t (ub_rootbp, in your example). Which blocks are compressed is
> entirely an implementation detail, and doesn't affect the on-disk
> specification.

Is there a mechanism in place to try to compress part of the block and only store it compressed if a reasonable ratio is achieved? reiser4 implements something similar, although at the file level, I think, which is probably preferable. I know ideally one would create a compressed filesystem for ASCII config files, source trees, etc. and an uncompressed filesystem for mp3s, compressed tarballs, divx video, etc., but in the case of building software from source it's a pain to keep the source trees and compressed tarballs on different filesystems.

This message posted from opensolaris.org
On Sat, Dec 17, 2005 at 08:56:36PM -0800, Jake Maciejewski wrote:
> Is there a mechanism in place to try to compress part of the block and
> only store it compressed if a reasonable ratio is achieved? reiser4
> implements something similar, although at the file level, I think,
> which is probably preferable. I know ideally one would create a
> compressed filesystem for ASCII config files, source trees, etc. and
> an uncompressed filesystem for mp3s, compressed tarballs, divx video,
> etc., but in the case of building software from source it's a pain to
> keep the source trees and compressed tarballs on different
> filesystems.

That is exactly what we do. We only store the block compressed if we achieve at least 25% savings. If a block does not compress very well, it gets stored uncompressed. It's an implementation detail, not an on-disk format thing. Most probably, we will make it a tunable at some point in the future as we implement other compression algorithms.

The Reiser method is totally untenable for larger files. Imagine having to recompress, say, a large log file every time you appended to it. That would really suck. ZFS, by comparison, only compresses the blocks that are written; in this case, the last block.

Compressing the file as a whole also sucks if you want to change compression algorithms on the fly. Imagine you had a large file, changed the compression algorithm, then re-wrote one block of it. With ZFS, we would just store that one block as being compressed with a different algorithm. With a full-file method, you'd have to re-compress the entire file, which could take quite some time.

There is no need to make separate filesystems.
Since the uncompressible files you mentioned are large, write-once kinds of things, you only pay a small CPU tax on the initial write to find out they are uncompressible. Subsequent reads are fast, since the block pointer indicates that the blocks are stored uncompressed.

--Bill
Cool. I hoped that was the way compression was implemented, but I didn't see anything in the docs explicitly stating so. Unless I overlooked it, something should probably be added to the man page and admin guide.

Regarding the reiser4 method, I speak with no authority on the issue and compression support hasn't been finalized, so don't be quick to judge. What I meant, though, wasn't that the file would be compressed as a single unit, but rather that if part of the file doesn't compress, chances are none of it is worth compressing, hence avoiding testing each block.
Apologies for hijacking the thread, but how do du(1) and ls know the compressed size of files in a directory?
On Mon, 2005-12-19 at 10:45, John Smith wrote:
> Apologies for hijacking the thread, but how do du(1) and ls know the
> compressed size of files in a directory?

The unix stat(2) call and its variants return two size-relevant values:

    st_size      "file size in bytes"
    st_blocks    "number of 512-byte blocks allocated"

ls -l shows (among other things) st_size; du and ls -s show st_blocks. It's the second field, st_blocks, which reflects the actual on-disk footprint of the file rather than the apparent size. On UFS, st_blocks includes overhead such as indirect blocks:

    : 1 %; mkfile 1m f
    : 1 %; ls -l f
    -rw-------   1 sommerfeld staff    1048576 Dec 19 13:02 f
    : 1 %; ls -ls f
    2064 -rw-------   1 sommerfeld staff    1048576 Dec 19 13:02 f
    : 1 %; bc
    1048576/512
    2048

ZFS is similar, with the added wrinkle that, because of compression, you can often store more than 512 bytes of file content in a single 512-byte block.

- Bill
On Mon, Dec 19, 2005 at 07:45:52AM -0800, John Smith wrote:
> Apologies for hijacking the thread, but how do du(1) and ls know the
> compressed size of files in a directory?

They look at the st_blocks field of the stat structure, as returned by the stat(2) system call:

    st_blocks    The total number of physical blocks of size
                 512 bytes actually allocated on disk. This
                 field is not defined for block special or
                 character special files.

You may find it interesting that the compression ratio (as reported by 'zfs get ratio') is not calculated by comparing st_size ("the address of the end of the file") to st_blocks. That would inflate the compression ratio for sparse files. Rather, ZFS internally tracks the compressed and uncompressed size of each block, and the sums of each of these for each filesystem.

One additional bit of trickiness is that any compression applied to metadata is not counted, since metadata compression is an implementation detail and not controlled by the 'compression' property. For details, see the code in dsl_dataset_block_born() and dsl_dataset_block_kill(), which use the macros BP_GET_PSIZE() and BP_GET_UCSIZE() to determine the compressed and uncompressed sizes, respectively. BP_GET_UCSIZE() is where we determine if it is metadata, in which case we report the compressed size.

--matt
Bill Sommerfeld <sommerfeld at sun.com> wrote:
> On Mon, 2005-12-19 at 10:45, John Smith wrote:
> > Apologies for hijacking the thread, but how do du(1) and ls know the
> > compressed size of files in a directory?
>
> The unix stat(2) call and its variants return two size-relevant values:
>     st_size      "file size in bytes"
>     st_blocks    "number of 512-byte blocks allocated"
>
> ls -l shows (among other things) st_size; du and ls -s show st_blocks.

I should make an important note: although this is 100% correct, it will currently fool 'star -diff', as I forgot to use the same (correct) sparse check in diff mode as in create mode. It will definitely fool GNU tar in any operation mode.

Jörg

--
EMail: joerg at schily.isdn.cs.tu-berlin.de (home)  Jörg Schilling  D-13353 Berlin
       js at cs.tu-berlin.de (uni)
       schilling at fokus.fraunhofer.de (work)
Blog:  http://schily.blogspot.com/
URL:   http://cdrecord.berlios.de/old/private/ ftp://ftp.berlios.de/pub/schily
Matthew Ahrens <ahrens at sun.com> wrote:
> On Mon, Dec 19, 2005 at 07:45:52AM -0800, John Smith wrote:
> > Apologies for hijacking the thread, but how do du(1) and ls know the
> > compressed size of files in a directory?
>
> They look at the st_blocks field of the stat structure, as returned
> by the stat(2) system call:
>
>     st_blocks    The total number of physical blocks of size
>                  512 bytes actually allocated on disk. This
>                  field is not defined for block special or
>                  character special files.

As it seems that only a few people know this, it would be nice if this text could mention that the value in st_blocks is in units of DEV_BSIZE. This would help people to understand why, e.g., things look hosed on HP-UX...

Jörg

--
EMail: joerg at schily.isdn.cs.tu-berlin.de (home)  Jörg Schilling  D-13353 Berlin
       js at cs.tu-berlin.de (uni)
       schilling at fokus.fraunhofer.de (work)
Blog:  http://schily.blogspot.com/
URL:   http://cdrecord.berlios.de/old/private/ ftp://ftp.berlios.de/pub/schily