Ilya
2009-Nov-10 02:42 UTC
[zfs-discuss] Couple questions about ZFS writes and fragmentation
1. Is it true that because block sizes vary (in powers of 2, of course) on each write, there will be very little internal fragmentation?

2. I came upon this statement in a forum post:

"ZFS uses 128K data blocks by default whereas other filesystems typically use 4K or 8K blocks. This naturally reduces the potential for fragmentation by 32X over 4k blocks."

How is this true? I mean, if you have a 128k default block size and you store a 4k file within that block, then you will have a ton of slack space left over.

3. Another statement from a post:

"the seek time for single-user contiguous access is essentially zero since the seeks occur while the application is already busy processing other data. When mirror vdevs are used, any device in the mirror may be used to read the data."

All this is saying is that while you are reading off one physical device, the other device is already seeking to the blocks you need next, so seek time is no longer an issue, right?

4. In terms of where ZFS chooses to write data, is it always going to pick one metaslab and write only to free blocks within that metaslab? Or will it go all over the place?

5. When ZFS looks for a place to write data, does it look somewhere to intelligently see that there are some number of free blocks available within a particular metaslab, and if so, where is this information located?

6. Could anyone clarify this post:

"ZFS uses a copy-on-write model. Copy-on-write tends to cause fragmentation if portions of existing files are updated. If a large portion of a file is overwritten in a short period of time, the result should be reasonably fragment-free, but if parts of the file are updated over a long period of time (like a database) then the file is certain to be fragmented. This is not such a big problem as it appears to be since such files were already typically accessed using random access."

7. An aside question... I was reading a paper about ZFS and it stated that offsets are something like 8 bytes from the first vdev label. Is there any reason why the storage pool is after 2 vdev labels?

Thanks guys
--
This message posted from opensolaris.org
Bob Friesenhahn
2009-Nov-10 04:04 UTC
[zfs-discuss] Couple questions about ZFS writes and fragmentation
On Mon, 9 Nov 2009, Ilya wrote:

> 2. I came upon this statement in a forum post:
>
> "ZFS uses 128K data blocks by default whereas other filesystems
> typically use 4K or 8K blocks. This naturally reduces the potential
> for fragmentation by 32X over 4k blocks."
>
> How is this true? I mean, if you have a 128k default block size and
> you store a 4k file within that block, then you will have a ton of
> slack space left over.

Short files are given a short block. Files larger than 128K are diced into 128K blocks, but the last block may be shorter. The fragmentation discussed is fragmentation at the file level.

> 3. Another statement from a post:
>
> "the seek time for single-user contiguous access is essentially
> zero since the seeks occur while the application is already busy
> processing other data. When mirror vdevs are used, any device in the
> mirror may be used to read the data."
>
> All this is saying is that while you are reading off one physical
> device, the other device is already seeking to the blocks you need
> next, so seek time is no longer an issue, right?

The seek time becomes less of an issue for sequential reads if blocks are read from different disks and the reads are scheduled in advance. It still consumes drive IOPS if the disk needs to seek.

> 6. Could anyone clarify this post:
>
> "ZFS uses a copy-on-write model. Copy-on-write tends to cause
> fragmentation if portions of existing files are updated. If a large
> portion of a file is overwritten in a short period of time, the
> result should be reasonably fragment-free, but if parts of the file
> are updated over a long period of time (like a database) then the
> file is certain to be fragmented. This is not such a big problem as
> it appears to be since such files were already typically accessed
> using random access."

The point here is that ZFS buffers unwritten data in memory for up to 30 seconds. With a large amount of buffered data, ZFS is able to write the data in a more sequential and better-optimized fashion, while wasting fewer IOPS. Databases usually use random I/O and synchronous writes, which tends to scramble the data layout on disk with a copy-on-write model. ZFS is not optimized for database performance. On the other hand, the copy-on-write model reduces the chance of database corruption if there is a power failure or system crash.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
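To make the block-sizing rule above concrete, here is a minimal sketch in Python of how block sizes fall out of the default 128K recordsize as described in the reply. It assumes a 512-byte minimum allocation and ignores compression, ashift, and metadata overhead; it is an illustrative model, not actual ZFS code.

RECORDSIZE = 128 * 1024   # default ZFS recordsize (128 KB)
SECTOR = 512              # minimum allocation unit assumed for this sketch

def roundup(n, unit):
    """Round n up to the next multiple of unit."""
    return -(-n // unit) * unit

def file_blocks(file_size):
    """Approximate block sizes for a file, per the rule described above."""
    if file_size <= RECORDSIZE:
        # A short file gets one short block sized to its data,
        # not a full 128 KB block, so there is no 124 KB of slack.
        return [roundup(max(file_size, 1), SECTOR)]
    # Larger files are diced into 128 KB blocks; the tail may be shorter.
    nfull, tail = divmod(file_size, RECORDSIZE)
    blocks = [RECORDSIZE] * nfull
    if tail:
        blocks.append(roundup(tail, SECTOR))
    return blocks

print(file_blocks(4 * 1024))     # [4096]
print(file_blocks(300 * 1024))   # [131072, 131072, 45056]

In this model the 4 KB file ends up in a single ~4 KB block, which is why the "32X less fragmentation" claim does not cost 124 KB of slack per small file.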
Richard Elling
2009-Nov-10 04:54 UTC
[zfs-discuss] Couple questions about ZFS writes and fragmentation
On Nov 9, 2009, at 6:42 PM, Ilya wrote:

> 1. Is it true that because block sizes vary (in powers of 2, of
> course) on each write, there will be very little internal
> fragmentation?

The block size limit (aka recordsize) is in powers of 2. Block sizes are as needed.

> 2. I came upon this statement in a forum post:
>
> "ZFS uses 128K data blocks by default whereas other filesystems
> typically use 4K or 8K blocks. This naturally reduces the potential
> for fragmentation by 32X over 4k blocks."
>
> How is this true? I mean, if you have a 128k default block size and
> you store a 4k file within that block, then you will have a ton of
> slack space left over.

If a file only uses 4 KB, ZFS only allocates 4 KB to the file.

> 3. Another statement from a post:
>
> "the seek time for single-user contiguous access is essentially
> zero since the seeks occur while the application is already busy
> processing other data. When mirror vdevs are used, any device in the
> mirror may be used to read the data."
>
> All this is saying is that while you are reading off one physical
> device, the other device is already seeking to the blocks you need
> next, so seek time is no longer an issue, right?

This comment makes no sense to me. By the time the I/O request is handled by the disk, the relationship to a user is long gone. Also, seeks only apply to HDDs. Either side of a mirror can be used for reading... that part makes sense.

> 4. In terms of where ZFS chooses to write data, is it always going
> to pick one metaslab and write only to free blocks within that
> metaslab? Or will it go all over the place?

Yes :-)

> 5. When ZFS looks for a place to write data, does it look somewhere
> to intelligently see that there are some number of free blocks
> available within a particular metaslab, and if so, where is this
> information located?

Yes, of course.

> 6. Could anyone clarify this post:
>
> "ZFS uses a copy-on-write model. Copy-on-write tends to cause
> fragmentation if portions of existing files are updated. If a large
> portion of a file is overwritten in a short period of time, the
> result should be reasonably fragment-free, but if parts of the file
> are updated over a long period of time (like a database) then the
> file is certain to be fragmented. This is not such a big problem as
> it appears to be since such files were already typically accessed
> using random access."

YMMV. Allan Packer & Neel did a study on the effect of this on MySQL. But some databases COW themselves, so it is not a given that the application will read data sequentially.
Video: http://www.youtube.com/watch?v=a31NhwzlAxs
Slides: http://blogs.sun.com/realneel/resource/MySQL_Conference_2009_ZFS_MySQL.pdf

> 7. An aside question... I was reading a paper about ZFS and it stated
> that offsets are something like 8 bytes from the first vdev label.
> Is there any reason why the storage pool is after 2 vdev labels?

Historically, the first 8 KB of a slice was used to store the disk label. In the bad old days, people writing applications often did not know this and would clobber the label. So the first 8 KB of the ZFS vdev label is left unused, to preserve any existing disk label. The storage pool data starts at an offset of 4 MB, 3.5 MB past the second label; this area is reserved for a boot block. Where did you see it documented as starting right after the first two labels?
 -- richard
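To put rough numbers on the label layout described in that last answer, here is a small Python sketch of the byte offsets on a single vdev, assuming 256 KB labels, a 3.5 MB boot block region, and allocatable space starting at 4 MB (the figures given above). Treat it as an approximation of the on-disk format, not a definitive reference.

KB = 1024
MB = 1024 * KB
LABEL = 256 * KB            # each vdev label is 256 KB

def vdev_layout(device_size):
    """Approximate (offset, length) of the structures on one vdev."""
    end = (device_size // LABEL) * LABEL   # trailing labels align to 256 KB
    return {
        "label 0":           (0,                LABEL),
        "label 1":           (LABEL,            LABEL),
        "boot block":        (2 * LABEL,        7 * MB // 2),  # 3.5 MB reserved
        "allocatable space": (4 * MB,           end - 2 * LABEL - 4 * MB),
        "label 2":           (end - 2 * LABEL,  LABEL),
        "label 3":           (end - LABEL,      LABEL),
    }

for name, (offset, length) in vdev_layout(500 * 1024**3).items():
    print(f"{name:18s} starts at {offset:>15,d}  length {length:>15,d}")

Running it for a 500 GB device shows labels 0 and 1 at the front, labels 2 and 3 in the last 512 KB, and everything between 4 MB and the trailing labels available to the pool, which is why the pool data appears "after 2 vdev labels".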
Ilya
2009-Nov-10 05:15 UTC
[zfs-discuss] Couple questions about ZFS writes and fragmentation
Wow, this forum is great and uber-fast in response. Appreciate the responses, makes sense.

Only, what does ZFS do to write data? Let's say that you want to write x blocks somewhere: is ZFS going to find a pointer to the space map of some metaslab and then write there? Is it going to find a metaslab closest to the outside of the HDD for higher bandwidth?

And the label thing, heh, I made a mistake in what I read, you are right. Within the vdev layout though, the diagram also showed more vdev labels coming after the storage pool space (vdev 1, vdev 2, boot block, storage space, vdev 3, vdev 4). Would there be more vdev labels after #4, or more storage space?

Thanks again
--
This message posted from opensolaris.org
Richard Elling
2009-Nov-10 05:27 UTC
[zfs-discuss] Couple questions about ZFS writes and fragmentation
On Nov 9, 2009, at 9:15 PM, Ilya wrote:

> Wow, this forum is great and uber-fast in response. Appreciate the
> responses, makes sense.

Nothing on TV tonight and all of my stress tests are passing :-)

> Only, what does ZFS do to write data? Let's say that you want to
> write x blocks somewhere: is ZFS going to find a pointer to the
> space map of some metaslab and then write there? Is it going to find
> a metaslab closest to the outside of the HDD for higher bandwidth?

By default, it does start with the metaslabs on the outer cylinders. But it may also decide to skip to another metaslab. For example, the redundant metadata is spread further away. Similarly, if you have copies=2 or 3, then those will be spatially diverse as well.

> And the label thing, heh, I made a mistake in what I read, you are
> right. Within the vdev layout though, the diagram also showed more
> vdev labels coming after the storage pool space (vdev 1, vdev 2,
> boot block, storage space, vdev 3, vdev 4). Would there be more
> vdev labels after #4, or more storage space?

The 4th label (label 3) is at the end, modulo 256 KB.
 -- richard

> Thanks again
> --
> This message posted from opensolaris.org
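As a rough illustration of the "start with the outer metaslabs, but skip to another when needed" behaviour described above, here is a toy Python allocator. The real allocator weighs metaslabs using their space maps and several heuristics (and places redundant metadata and copies=2/3 differently), so this only captures the basic preference for low-offset metaslabs that still have room; names like pick_metaslab are invented for the example.

from dataclasses import dataclass

@dataclass
class Metaslab:
    offset: int   # starting offset on the disk; lower = outer cylinders
    free: int     # free bytes remaining in this metaslab

def pick_metaslab(metaslabs, request):
    """Toy policy: lowest-offset metaslab that can satisfy the request."""
    for ms in sorted(metaslabs, key=lambda m: m.offset):
        if ms.free >= request:
            return ms
    return None   # no metaslab can hold the request

# Four 1 GB metaslabs; the outermost one is nearly full.
slabs = [Metaslab(offset=i * 2**30, free=2**30) for i in range(4)]
slabs[0].free = 4096

chosen = pick_metaslab(slabs, 128 * 1024)
print(chosen.offset)   # 1073741824 -- the allocator skips to the next metaslab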