If an 8K file system block is written on a 9-disk raidz vdev, how is the data distributed (written) between all devices in the vdev, since a zfs write is one continuous IO operation? Is it distributed evenly (1.125KB) per device?
Hi Brad,

RAID-Z will carve up the 8K block into chunks at the granularity of the sector size -- today 512 bytes, but soon going to 4K. In this case a 9-disk RAID-Z vdev will look like this:

| P | D00 | D01 | D02 | D03 | D04 | D05 | D06 | D07 |
| P | D08 | D09 | D10 | D11 | D12 | D13 | D14 | D15 |

That's 1K of data per device, with an additional 1K for parity.

Adam

--
Adam Leventhal, Fishworks            http://blogs.sun.com/ahl
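To make the layout concrete, here is a rough Python sketch of the carving described above. It is not the actual vdev_raidz code, just an illustration under the stated assumptions (512-byte sectors, single parity, 9 disks):

# Rough sketch (not the real vdev_raidz.c logic): map an 8K logical
# block onto a 9-disk single-parity RAID-Z with 512-byte sectors.
SECTOR = 512
NDISKS = 9
NPARITY = 1
BLOCK = 8192

data_sectors = BLOCK // SECTOR              # 16 data sectors
data_cols = NDISKS - NPARITY                # 8 data columns per row
rows = -(-data_sectors // data_cols)        # 2 rows, rounded up

for row in range(rows):
    cols = ["P  "]                          # one parity column per row
    for d in range(data_cols):
        idx = row * data_cols + d
        cols.append("D%02d" % idx if idx < data_sectors else "---")
    print("| " + " | ".join(cols) + " |")

# | P   | D00 | D01 | D02 | D03 | D04 | D05 | D06 | D07 |
# | P   | D08 | D09 | D10 | D11 | D12 | D13 | D14 | D15 |

Running it reproduces the two rows in the diagram above: 1K of data on each of the eight data columns, plus one parity sector per row.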
Hi Adam,

From the picture, it looks like the data is distributed evenly (with the exception of parity) across each spindle, then wrapping around again (final 4K) -- is this one single write operation or two?

| P | D00 | D01 | D02 | D03 | D04 | D05 | D06 | D07 |  <--------- one write op??
| P | D08 | D09 | D10 | D11 | D12 | D13 | D14 | D15 |  <--------- one write op??

For a stripe configuration, is this what it would look like for 8K?

| D00 D01 D02 D03 D04 D05 D06 D07 D08 |
| D09 D10 D11 D12 D13 D14 D15 D16 D17 |
Kjetil Torgrim Homme
2010-Jan-05 16:22 UTC
[zfs-discuss] raidz stripe size (not stripe width)
Brad <beneri3 at yahoo.com> writes:

> Hi Adam,

I'm not Adam, but I'll take a stab at it anyway. BTW, your crossposting is a bit confusing to follow, at least when using gmane.org. I think it is better to stick to one mailing list anyway.

> From the picture, it looks like the data is distributed evenly
> (with the exception of parity) across each spindle then wrapping
> around again (final 4K) - is this one single write operation or two?

It is a single write operation per device. Actually, it may be "less than" one write operation, since the transaction group, which probably contains many more updates, is written as a whole.

--
Kjetil T. Homme
Redpill Linpro AS - Changing the game
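To illustrate the "less than one write op" point, here is a hedged sketch; the coalesce() helper and the offsets are invented for illustration, not how the vdev queue is actually implemented:

# Hypothetical sketch of the aggregation idea: within one transaction
# group, pending writes to a single device that are contiguous on disk
# can be merged into fewer physical I/Os. Offsets/sizes are invented.
def coalesce(writes):
    """Merge (offset, size) writes that are back-to-back on one device."""
    merged = []
    for off, size in sorted(writes):
        if merged and merged[-1][0] + merged[-1][1] == off:
            merged[-1] = (merged[-1][0], merged[-1][1] + size)
        else:
            merged.append((off, size))
    return merged

# Three queued column writes, two of them adjacent on the same disk:
print(coalesce([(0, 1024), (1024, 1024), (8192, 1024)]))
# -> [(0, 2048), (8192, 1024)]   two physical writes instead of three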
On Jan 4, 2010, at 7:08 PM, Brad wrote:

> From the picture, it looks like the data is distributed evenly
> (with the exception of parity) across each spindle then wrapping
> around again (final 4K) - is this one single write operation or two?
>
> | P | D00 | D01 | D02 | D03 | D04 | D05 | D06 | D07 |  <--------- one write op??
> | P | D08 | D09 | D10 | D11 | D12 | D13 | D14 | D15 |  <--------- one write op??

One physical write op per vdev, because the columns will likely be coalesced at the vdev. Obviously, one physical write cannot span multiple vdevs.

> For a stripe configuration, is this what it would look like for 8K?
>
> | D00 D01 D02 D03 D04 D05 D06 D07 D08 |
> | D09 D10 D11 D12 D13 D14 D15 D16 D17 |

No. It is very likely the entire write will be to one vdev. Again, this is dynamic striping, not RAID-0. RAID-0 is defined by SNIA as "A disk array data mapping technique in which fixed-length sequences of virtual disk data addresses are mapped to sequences of member disk addresses in a regular rotating pattern." In ZFS, there is no "fixed-length sequence." The next column is chosen approximately every MB or so. You get the benefit of sequential access to the media, with the stochastic spreading across vdevs as well.

When you have multiple top-level vdevs, such as multiple mirrors or multiple raidz sets, you get the ~1MB spread across the top level and the normal allocations within the sets. In other words, any given record should be in one set. Again, this limits hyperspreading and allows you to scale to very large numbers of disks. It seems to work reasonably well in practice.

I attempted to describe this in pictures for my ZFS tutorials. You can be the judge, and suggestions are always welcome. See slide 27 at http://www.slideshare.net/relling/zfs-tutorial-usenix-lisa09-conference

[For the alias: I've only today succeeded in uploading the slides to slideshare... been trying off and on for more than a month :-(]

-- richard
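To illustrate the difference between dynamic striping and RAID-0 described above, here is a hedged Python sketch; the DynamicStripe class and the 1MB rotor size are illustrative assumptions, not the real metaslab allocator. Allocations stay on one top-level vdev for roughly a megabyte before moving to the next, so any single 8K record lands entirely on one set and the spread across sets is statistical rather than a fixed rotation:

# Illustrative sketch (not the real metaslab allocator): allocations
# stick to one top-level vdev for roughly 1 MB before rotating, rather
# than rotating every fixed-size stripe unit as RAID-0 does.
ROTOR_BYTES = 1 << 20           # ~1 MB per top-level vdev (assumed figure)

class DynamicStripe:
    def __init__(self, n_toplevel):
        self.n = n_toplevel     # number of top-level vdevs (mirrors/raidz sets)
        self.rotor = 0          # vdev currently receiving allocations
        self.used = 0           # bytes allocated since the last rotation

    def allocate(self, size):
        """Return the top-level vdev a record of `size` bytes lands on."""
        if self.used >= ROTOR_BYTES:
            self.rotor = (self.rotor + 1) % self.n
            self.used = 0
        self.used += size
        return self.rotor

alloc = DynamicStripe(n_toplevel=4)
print(alloc.allocate(8192))     # 0 -- the whole 8K record goes to one set
print(alloc.allocate(8192))     # 0 -- and the next ~1 MB stays there too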