Hello! Assuming the default recordsize (FSB) in ZFS is 128k:

1 - If I have a file of 10k, ZFS will allocate an FSB of 10k. Right? Since ZFS block sizes are not static like in other filesystems, I don't have that old internal fragmentation...

2 - If the above is right, I don't need to adjust the recordsize (FSB) if I will handle a lot of tiny files. Right?

3 - If the two above are right, then tuning the recordsize is only important for files greater than the FSB. Let's say, 129k... but then another question: if the file is 129k, will ZFS allocate one filesystem block of 128k and another of... 1k? Or two of 128k?

4 - The last one... ;-) For the FSB allocation, how does ZFS know the file size, so it can tell whether the file is smaller than the FSB? Is it something related to the txg? When the write goes to disk, does ZFS know (somehow) whether that write is a whole file or just a piece of it?

Thanks a lot!

 Leal.
-- 
This message posted from opensolaris.org
On Fri, 5 Sep 2008, Marcelo Leal wrote:

> 4 - The last one... ;-) For the FSB allocation, how does ZFS know
> the file size, so it can tell whether the file is smaller than the FSB?
> Is it something related to the txg? When the write goes to disk, does
> ZFS know (somehow) whether that write is a whole file or just a piece of it?

For synchronous writes (file opened with the O_DSYNC option), ZFS must write the data based on what it has been provided in the write, so at any point in time the quality of the result (amount of data in the tail block) depends on the application's requests. However, if the application continues to extend the file via synchronous writes, existing data in the sub-sized "tail" block will be re-written to a new location (due to ZFS COW) with the extra data added. This means that the filesystem block size is more important for synchronous writes, particularly if there is insufficient RAM to cache the already-written block.

For asynchronous writes, ZFS will buffer writes in RAM for up to five seconds before actually writing them. This buffering allows ZFS to make better-informed decisions about how to write the data, so that the data is written to full blocks as contiguously as possible. If the application writes asynchronously but then issues an fsync() call, any cached data will be committed to disk at that time.

It can be seen that for asynchronous writes, the quality of the written data layout is somewhat dependent on how much RAM the system has available and how fast the data is written. With more RAM there can be more useful write caching (up to five seconds), and ZFS can make better decisions when it writes the data, so that the data in a file can be written optimally, even with the pressure of multi-user writes.

Bob
=====================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
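[A minimal sketch of the two write paths described above, not code from the thread; the dataset paths are hypothetical placeholders. With O_DSYNC each write() must be committed as handed over, while buffered writes give ZFS a window to coalesce data before an explicit fsync():]

    import os

    # Paths below are placeholders for files on a ZFS dataset (assumption).

    # Synchronous path: O_DSYNC forces each write() to be committed before
    # it returns, so ZFS must lay out whatever amount of data it is handed.
    fd = os.open("/tank/fs/sync-file", os.O_WRONLY | os.O_CREAT | os.O_DSYNC, 0o644)
    for _ in range(16):
        os.write(fd, b"x" * 8192)   # sixteen 8k appends, each committed on its own
    os.close(fd)

    # Asynchronous path: writes sit in RAM (up to ~5 seconds) so ZFS can
    # coalesce them into full records; fsync() commits whatever is cached.
    fd = os.open("/tank/fs/async-file", os.O_WRONLY | os.O_CREAT, 0o644)
    for _ in range(16):
        os.write(fd, b"x" * 8192)
    os.fsync(fd)                    # everything buffered so far goes to disk now
    os.close(fd)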
> On Fri, 5 Sep 2008, Marcelo Leal wrote:
> > 4 - The last one... ;-) For the FSB allocation, how does ZFS know
> > the file size, so it can tell whether the file is smaller than the FSB?
> > Is it something related to the txg? When the write goes to disk, does
> > ZFS know (somehow) whether that write is a whole file or just a piece of it?
>
> For synchronous writes (file opened with the O_DSYNC option), ZFS must
> write the data based on what it has been provided in the write, so at
> any point in time the quality of the result (amount of data in the tail
> block) depends on the application's requests. However, if the
> application continues to extend the file via synchronous writes,
> existing data in the sub-sized "tail" block will be re-written to a new
> location (due to ZFS COW) with the extra data added. This means that
> the filesystem block size is more important for synchronous writes,
> particularly if there is insufficient RAM to cache the already-written
> block.

If I understand well, the recordsize is really important for big files, because with small files and small updates we have a lot of chances to have the data well organized on disk. I think the problem is the big files... where we have tiny updates. At the pool's creation time the recordsize is 128k, but I don't know if that limit is real when, let's say, we are copying a DVD image. I think the recordsize could be larger there. If so, if for larger files we could have a recordsize of... 1MB, what would happen if we then changed it to 1k?

> For asynchronous writes, ZFS will buffer writes in RAM for up to five
> seconds before actually writing them. This buffering allows ZFS to make
> better-informed decisions about how to write the data, so that the data
> is written to full blocks as contiguously as possible. If the
> application writes asynchronously but then issues an fsync() call, any
> cached data will be committed to disk at that time.
>
> It can be seen that for asynchronous writes, the quality of the written
> data layout is somewhat dependent on how much RAM the system has
> available and how fast the data is written. With more RAM there can be
> more useful write caching (up to five seconds), and ZFS can make better
> decisions when it writes the data, so that the data in a file can be
> written optimally, even with the pressure of multi-user writes.

Agree. Any other ZFS experts to answer the first questions? ;-)

> Bob
> =====================================
> Bob Friesenhahn
> bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
> GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
> _______________________________________________
> zfs-discuss mailing list
> zfs-discuss at opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

Thanks bfriesen!

 Leal
-- 
This message posted from opensolaris.org
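[To make the "big file, tiny update" case concrete: a small random overwrite inside a large file touches a whole record, so with a 128k recordsize ZFS has to read, modify, and COW-rewrite the full 128k block that holds those few bytes. A minimal sketch, assuming a hypothetical path to an already existing large file:]

    import os

    # Hypothetical file on a ZFS dataset with the default 128k recordsize;
    # the file is assumed to already exist and be large.
    path = "/tank/fs/bigfile.img"

    fd = os.open(path, os.O_WRONLY)
    os.lseek(fd, 700 * 1024 * 1024, os.SEEK_SET)   # seek into the middle of the file
    os.write(fd, b"y" * 8192)                      # an 8k update...
    os.fsync(fd)
    os.close(fd)
    # ...but the record containing this offset is 128k, so ZFS must read
    # that whole record (unless it is already cached), apply the 8k change,
    # and COW-write a new 128k block elsewhere. With a smaller recordsize
    # (e.g. 8k, as often used for databases) the read-modify-write is smaller.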
Hello Marcelo,

Monday, September 8, 2008, 1:51:09 PM, you wrote:

ML> If I understand well, the recordsize is really important for big
ML> files, because with small files and small updates we have a lot of
ML> chances to have the data well organized on disk. I think the problem
ML> is the big files... where we have tiny updates. At the pool's
ML> creation time the recordsize is 128k, but I don't know if that limit
ML> is real when, let's say, we are copying a DVD image. I think the
ML> recordsize could be larger there. If so, if for larger files we could
ML> have a recordsize of... 1MB, what would happen if we then changed it to 1k?

The maximum record size currently supported is 128KB.

-- 
Best regards,
 Robert Milkowski                       mailto:milek at task.gda.pl
                                        http://milek.blogspot.com
Hello milek,

Does that information remain true?

  ZFS algorithm for selecting block sizes:
  - The initial block size is the smallest supported block size larger
    than the first write to the file.
  - Grow to the next largest block size for the entire file when the
    total file length increases beyond the current block size (up to
    the maximum block size).
  - Shrink the block size when the entire file will fit in a single
    smaller block.

  ZFS currently supports nine block sizes, from 512 bytes to 128K.
  Larger block sizes could be supported in the future, but see roch's
  blog on why 128k is enough.

Thanks.
-- 
This message posted from opensolaris.org
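[A minimal sketch of one reading of the rule quoted above; this is not code from ZFS itself, just the quoted selection rule written out, with the nine sizes taken as the powers of two from 512 bytes to 128K:]

    # The nine block sizes the quoted text refers to: 512, 1k, 2k, ..., 128k.
    BLOCK_SIZES = [512 << n for n in range(9)]       # 512 .. 131072

    def pick_block_size(file_length: int) -> int:
        """Smallest supported block size that holds the whole file,
        capped at the 128k maximum (per the rule quoted above)."""
        for size in BLOCK_SIZES:
            if file_length <= size:
                return size
        return BLOCK_SIZES[-1]                       # larger files use the 128k maximum

    # Examples matching the sizes asked about earlier in the thread:
    print(pick_block_size(10 * 1024))    # 10k file  -> 16384 under this reading
    print(pick_block_size(129 * 1024))   # 129k file -> 131072 (capped at the 128k maximum)

[Under this reading a 10k file would get a single 16k block rather than a 10k one, but whether that text still matches the current implementation is exactly the question.]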