Ilya
2009-Nov-10 02:42 UTC
[zfs-discuss] Couple questions about ZFS writes and fragmentation
1. Is it true that because block sizes vary (in powers of 2, of course) on each write, there will be very little internal fragmentation?

2. I came upon this statement in a forum post:

"ZFS uses 128K data blocks by default whereas other filesystems typically use 4K or 8K blocks. This naturally reduces the potential for fragmentation by 32X over 4k blocks."

How is this true? I mean, if you have a 128k default block size and you store a 4k file within that block, then you will have a ton of slack space left over.

3. Another statement from a post:

"the seek time for single-user contiguous access is essentially zero since the seeks occur while the application is already busy processing other data. When mirror vdevs are used, any device in the mirror may be used to read the data."

All this is saying is that while you are reading off one physical device, the other device is already seeking to the blocks you need next, so seek time is no longer an issue, right?

4. In terms of where ZFS chooses to write data, is it always going to pick one metaslab and write only to free blocks within that metaslab? Or will it go all over the place?

5. When ZFS looks for a place to write data, does it look somewhere to intelligently see that there are some number of free blocks available within a particular metaslab, and if so, where is this information located?

6. Could anyone clarify this post:

"ZFS uses a copy-on-write model. Copy-on-write tends to cause fragmentation if portions of existing files are updated. If a large portion of a file is overwritten in a short period of time, the result should be reasonably fragment-free, but if parts of the file are updated over a long period of time (like a database) then the file is certain to be fragmented. This is not such a big problem as it appears to be since such files were already typically accessed using random access."

7. An aside question... I was reading a paper about ZFS and it stated that offsets are something like 8 bytes from the first vdev label. Is there any reason why the storage pool is after 2 vdev labels?

Thanks guys
--
This message posted from opensolaris.org
Bob Friesenhahn
2009-Nov-10 04:04 UTC
[zfs-discuss] Couple questions about ZFS writes and fragmentation
On Mon, 9 Nov 2009, Ilya wrote:

> 2. I came upon this statement in a forum post:
>
> "ZFS uses 128K data blocks by default whereas other filesystems
> typically use 4K or 8K blocks. This naturally reduces the potential
> for fragmentation by 32X over 4k blocks."
>
> How is this true? I mean, if you have a 128k default block size and
> you store a 4k file within that block, then you will have a ton of
> slack space left over.

Short files are given a short block. Files larger than 128K are diced into 128K blocks, but the last block may be shorter. The fragmentation discussed is fragmentation at the file level.

> 3. Another statement from a post:
>
> "the seek time for single-user contiguous access is essentially
> zero since the seeks occur while the application is already busy
> processing other data. When mirror vdevs are used, any device in the
> mirror may be used to read the data."
>
> All this is saying is that while you are reading off one physical
> device, the other device is already seeking to the blocks you need
> next, so seek time is no longer an issue, right?

The seek time becomes less of an issue for sequential reads if blocks are read from different disks and the reads are scheduled in advance. It still consumes drive IOPS if the disk needs to seek.

> 6. Could anyone clarify this post:
>
> "ZFS uses a copy-on-write model. Copy-on-write tends to cause
> fragmentation if portions of existing files are updated. If a large
> portion of a file is overwritten in a short period of time, the
> result should be reasonably fragment-free, but if parts of the file
> are updated over a long period of time (like a database) then the
> file is certain to be fragmented. This is not such a big problem as
> it appears to be since such files were already typically accessed
> using random access."

The point here is that ZFS buffers unwritten data in memory for up to 30 seconds. With a large amount of buffered data, ZFS is able to write the data in a more sequential and better-optimized fashion, while wasting fewer IOPS. Databases usually use random I/O and synchronous writes, which tends to scramble the data layout on disk with a copy-on-write model. ZFS is not optimized for database performance. On the other hand, the copy-on-write model reduces the chance of database corruption if there is a power failure or system crash.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
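To make the block-sizing rule above concrete, here is a minimal sketch in Python of how block sizes fall out of the default 128K recordsize as described in the reply. It assumes a 512-byte minimum allocation and ignores compression, ashift, and metadata overhead; it is an illustrative model, not actual ZFS code.

RECORDSIZE = 128 * 1024   # default ZFS recordsize (128 KB)
SECTOR = 512              # minimum allocation unit assumed for this sketch

def roundup(n, unit):
    """Round n up to the next multiple of unit."""
    return -(-n // unit) * unit

def file_blocks(file_size):
    """Approximate block sizes for a file, per the rule described above."""
    if file_size <= RECORDSIZE:
        # A short file gets one short block sized to its data,
        # not a full 128 KB block, so there is no 124 KB of slack.
        return [roundup(max(file_size, 1), SECTOR)]
    # Larger files are diced into 128 KB blocks; the tail may be shorter.
    nfull, tail = divmod(file_size, RECORDSIZE)
    blocks = [RECORDSIZE] * nfull
    if tail:
        blocks.append(roundup(tail, SECTOR))
    return blocks

print(file_blocks(4 * 1024))     # [4096]
print(file_blocks(300 * 1024))   # [131072, 131072, 45056]

In this model the 4 KB file ends up in a single ~4 KB block, which is why the "32X less fragmentation" claim does not cost 124 KB of slack per small file.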
Richard Elling
2009-Nov-10 04:54 UTC
[zfs-discuss] Couple questions about ZFS writes and fragmentation
On Nov 9, 2009, at 6:42 PM, Ilya wrote:

> 1. Is it true that because block sizes vary (in powers of 2, of
> course) on each write, there will be very little internal
> fragmentation?

The block size limit (aka recordsize) is in powers of 2. Block sizes are as needed.

> 2. I came upon this statement in a forum post:
>
> "ZFS uses 128K data blocks by default whereas other filesystems
> typically use 4K or 8K blocks. This naturally reduces the potential
> for fragmentation by 32X over 4k blocks."
>
> How is this true? I mean, if you have a 128k default block size and
> you store a 4k file within that block, then you will have a ton of
> slack space left over.

If a file only uses 4 KB, ZFS only allocates 4 KB to the file.

> 3. Another statement from a post:
>
> "the seek time for single-user contiguous access is essentially
> zero since the seeks occur while the application is already busy
> processing other data. When mirror vdevs are used, any device in the
> mirror may be used to read the data."
>
> All this is saying is that while you are reading off one physical
> device, the other device is already seeking to the blocks you need
> next, so seek time is no longer an issue, right?

This comment makes no sense to me. By the time the I/O request is handled by the disk, the relationship to a user is long gone. Also, seeks only apply to HDDs. Either side of a mirror can be used for reading... that part makes sense.

> 4. In terms of where ZFS chooses to write data, is it always going
> to pick one metaslab and write only to free blocks within that
> metaslab? Or will it go all over the place?

Yes :-)

> 5. When ZFS looks for a place to write data, does it look somewhere
> to intelligently see that there are some number of free blocks
> available within a particular metaslab, and if so, where is this
> information located?

Yes, of course.

> 6. Could anyone clarify this post:
>
> "ZFS uses a copy-on-write model. Copy-on-write tends to cause
> fragmentation if portions of existing files are updated. If a large
> portion of a file is overwritten in a short period of time, the
> result should be reasonably fragment-free, but if parts of the file
> are updated over a long period of time (like a database) then the
> file is certain to be fragmented. This is not such a big problem as
> it appears to be since such files were already typically accessed
> using random access."

YMMV. Allan Packer & Neel did a study on the effect of this on MySQL. But some databases COW themselves, so it is not a given that the application will read data sequentially.
Video: http://www.youtube.com/watch?v=a31NhwzlAxs
Slides: http://blogs.sun.com/realneel/resource/MySQL_Conference_2009_ZFS_MySQL.pdf

> 7. An aside question... I was reading a paper about ZFS and it stated
> that offsets are something like 8 bytes from the first vdev label.
> Is there any reason why the storage pool is after 2 vdev labels?

Historically, the first 8 KB of a slice was used to store the disk label. In the bad old days, people writing applications often did not know this and would clobber the label. So the first 8 KB of the ZFS vdev label is left unused, to preserve any existing disk label. The storage pool data starts at an offset of 4 MB, 3.5 MB past the second label; this area is reserved for a boot block. Where did you see it documented as starting right after the first two labels?
 -- richard
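To put rough numbers on the label layout described in that last answer, here is a small Python sketch of the byte offsets on a single vdev, assuming 256 KB labels, a 3.5 MB boot block region, and allocatable space starting at 4 MB (the figures given above). Treat it as an approximation of the on-disk format, not a definitive reference.

KB = 1024
MB = 1024 * KB
LABEL = 256 * KB            # each vdev label is 256 KB

def vdev_layout(device_size):
    """Approximate (offset, length) of the structures on one vdev."""
    end = (device_size // LABEL) * LABEL   # trailing labels align to 256 KB
    return {
        "label 0":           (0,                LABEL),
        "label 1":           (LABEL,            LABEL),
        "boot block":        (2 * LABEL,        7 * MB // 2),  # 3.5 MB reserved
        "allocatable space": (4 * MB,           end - 2 * LABEL - 4 * MB),
        "label 2":           (end - 2 * LABEL,  LABEL),
        "label 3":           (end - LABEL,      LABEL),
    }

for name, (offset, length) in vdev_layout(500 * 1024**3).items():
    print(f"{name:18s} starts at {offset:>15,d}  length {length:>15,d}")

Running it for a 500 GB device shows labels 0 and 1 at the front, labels 2 and 3 in the last 512 KB, and everything between 4 MB and the trailing labels available to the pool, which is why the pool data appears "after 2 vdev labels".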
Ilya
2009-Nov-10 05:15 UTC
[zfs-discuss] Couple questions about ZFS writes and fragmentation
Wow, this forum is great and uber-fast in response. Appreciate the responses, makes sense.

Only, what does ZFS do to write data? Let's say that you want to write x blocks somewhere: is ZFS going to find a pointer to the space map of some metaslab and then write there? Is it going to find a metaslab closest to the outside of the HDD for higher bandwidth?

And the label thing, heh, I made a mistake in what I read, you are right. Within the vdev layout though, the diagram also showed more vdev labels coming after the storage pool space (vdev 1, vdev 2, boot block, storage space, vdev 3, vdev 4). Would there be more vdev labels after #4, or more storage space?

Thanks again
--
This message posted from opensolaris.org
Richard Elling
2009-Nov-10 05:27 UTC
[zfs-discuss] Couple questions about ZFS writes and fragmentation
On Nov 9, 2009, at 9:15 PM, Ilya wrote:

> Wow, this forum is great and uber-fast in response. Appreciate the
> responses, makes sense.

Nothing on TV tonight and all of my stress tests are passing :-)

> Only, what does ZFS do to write data? Let's say that you want to
> write x blocks somewhere: is ZFS going to find a pointer to the
> space map of some metaslab and then write there? Is it going to find
> a metaslab closest to the outside of the HDD for higher bandwidth?

By default, it does start with the metaslabs on the outer cylinders. But it may also decide to skip to another metaslab. For example, the redundant metadata is spread further away. Similarly, if you have copies=2 or 3, then those will be spatially diverse as well.

> And the label thing, heh, I made a mistake in what I read, you are
> right. Within the vdev layout though, the diagram also showed more
> vdev labels coming after the storage pool space (vdev 1, vdev 2,
> boot block, storage space, vdev 3, vdev 4). Would there be more
> vdev labels after #4, or more storage space?

The 4th label (label 3) is at the end, modulo 256 KB.
 -- richard

> Thanks again
> --
> This message posted from opensolaris.org
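As a rough illustration of the "start with the outer metaslabs, but skip to another when needed" behaviour described above, here is a toy Python allocator. The real allocator weighs metaslabs using their space maps and several heuristics (and places redundant metadata and copies=2/3 differently), so this only captures the basic preference for low-offset metaslabs that still have room; names like pick_metaslab are invented for the example.

from dataclasses import dataclass

@dataclass
class Metaslab:
    offset: int   # starting offset on the disk; lower = outer cylinders
    free: int     # free bytes remaining in this metaslab

def pick_metaslab(metaslabs, request):
    """Toy policy: lowest-offset metaslab that can satisfy the request."""
    for ms in sorted(metaslabs, key=lambda m: m.offset):
        if ms.free >= request:
            return ms
    return None   # no metaslab can hold the request

# Four 1 GB metaslabs; the outermost one is nearly full.
slabs = [Metaslab(offset=i * 2**30, free=2**30) for i in range(4)]
slabs[0].free = 4096

chosen = pick_metaslab(slabs, 128 * 1024)
print(chosen.offset)   # 1073741824 -- the allocator skips to the next metaslab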