Bhaskar Jayaraman
2008-Apr-14 10:43 UTC
[zfs-discuss] Will ZFS employ raid0 stripes in an ordinary storage pool?
Hi, I'm doing the following actions on my Solaris 10 system. Please let me know if ZFS will do the following things:

Question 1: Will ZFS employ ordinary raid0 stripes while creating the file dust?

Question 2: Since most of my file /exp/dust1 (~74% = 1 - 400MB/1500MB) resides on /tank/mnt-pt/file4, will ZFS employ stripes while creating this file?

# zpool create exp /tank/mnt-pt/file1 /tank/mnt-pt/file2 /tank/mnt-pt/file3

# df -h
Filesystem                       size   used  avail capacity  Mounted on
/dev/dsk/c0d0s0                   15G   8.3G   6.4G    57%    /
/devices                           0K     0K     0K     0%    /devices
/dev                               0K     0K     0K     0%    /dev
ctfs                               0K     0K     0K     0%    /system/contract
proc                               0K     0K     0K     0%    /proc
mnttab                             0K     0K     0K     0%    /etc/mnttab
swap                              2.3G  1012K   2.3G     1%    /etc/svc/volatile
objfs                              0K     0K     0K     0%    /system/object
sharefs                            0K     0K     0K     0%    /etc/dfs/sharetab
/usr/lib/libc/libc_hwcap3.so.1    15G   8.3G   6.4G    57%    /lib/libc.so.1
fd                                 0K     0K     0K     0%    /dev/fd
swap                              2.3G    44K   2.3G     1%    /tmp
swap                              2.3G    92K   2.3G     1%    /var/run
/dev/dsk/c0d0s4                   15G    15M    15G     1%    /second_root
/dev/dsk/c0d0s7                   32G   4.0G    28G    13%    /export/home
tank                              16G   5.9G    10G    37%    /tank
/dev/dsk/c2t1d0s2                 17G   4.7G    12G    29%    /tank/mnt-pt
exp                              3.4G     1K   3.4G     1%    /exp

# zpool status
  pool: exp
 state: ONLINE
 scrub: none requested
config:

        NAME                  STATE     READ WRITE CKSUM
        exp                   ONLINE       0     0     0
          /tank/mnt-pt/file1  ONLINE       0     0     0
          /tank/mnt-pt/file2  ONLINE       0     0     0
          /tank/mnt-pt/file3  ONLINE       0     0     0

# mkfile 3000m /exp/dust                 <=== Here I have eaten up 3 GB out of the 3.4 GB available

Question 1: Will ZFS employ ordinary raid0 stripes while creating the file dust?

# zpool add exp /tank/mnt-pt/file4       <=== I'm adding another disk to the pool

# zpool status
  pool: exp
 state: ONLINE
 scrub: none requested
config:

        NAME                  STATE     READ WRITE CKSUM
        exp                   ONLINE       0     0     0
          /tank/mnt-pt/file1  ONLINE       0     0     0
          /tank/mnt-pt/file2  ONLINE       0     0     0
          /tank/mnt-pt/file3  ONLINE       0     0     0
          /tank/mnt-pt/file4  ONLINE       0     0     0

# df -h
Filesystem                       size   used  avail capacity  Mounted on
/dev/dsk/c0d0s0                   15G   8.3G   6.4G    57%    /
/devices                           0K     0K     0K     0%    /devices
/dev                               0K     0K     0K     0%    /dev
ctfs                               0K     0K     0K     0%    /system/contract
proc                               0K     0K     0K     0%    /proc
mnttab                             0K     0K     0K     0%    /etc/mnttab
swap                              2.3G  1012K   2.3G     1%    /etc/svc/volatile
objfs                              0K     0K     0K     0%    /system/object
sharefs                            0K     0K     0K     0%    /etc/dfs/sharetab
/usr/lib/libc/libc_hwcap3.so.1    15G   8.3G   6.4G    57%    /lib/libc.so.1
fd                                 0K     0K     0K     0%    /dev/fd
swap                              2.3G    44K   2.3G     1%    /tmp
swap                              2.3G    92K   2.3G     1%    /var/run
/dev/dsk/c0d0s4                   15G    15M    15G     1%    /second_root
/dev/dsk/c0d0s7                   32G   4.0G    28G    13%    /export/home
tank                              16G   5.9G    10G    37%    /tank
/dev/dsk/c2t1d0s2                 17G   4.7G    12G    29%    /tank/mnt-pt
exp                              4.6G   2.9G   1.7G    64%    /exp      <=== Capacity increased to 4.6 GB

# mkfile 1500m /exp/dust1                <=== I add another file to eat up more space

# df -h
Filesystem                       size   used  avail capacity  Mounted on
/dev/dsk/c0d0s0                   15G   8.3G   6.4G    57%    /
/devices                           0K     0K     0K     0%    /devices
/dev                               0K     0K     0K     0%    /dev
ctfs                               0K     0K     0K     0%    /system/contract
proc                               0K     0K     0K     0%    /proc
mnttab                             0K     0K     0K     0%    /etc/mnttab
swap                              2.3G  1012K   2.3G     1%    /etc/svc/volatile
objfs                              0K     0K     0K     0%    /system/object
sharefs                            0K     0K     0K     0%    /etc/dfs/sharetab
/usr/lib/libc/libc_hwcap3.so.1    15G   8.3G   6.4G    57%    /lib/libc.so.1
fd                                 0K     0K     0K     0%    /dev/fd
swap                              2.3G    44K   2.3G     1%    /tmp
swap                              2.3G    92K   2.3G     1%    /var/run
/dev/dsk/c0d0s4                   15G    15M    15G     1%    /second_root
/dev/dsk/c0d0s7                   32G   4.0G    28G    13%    /export/home
tank                              16G   5.9G    10G    37%    /tank
/dev/dsk/c2t1d0s2                 17G   4.7G    12G    29%    /tank/mnt-pt
exp                              4.6G   4.4G   192M    96%    /exp      <=== Now capacity used is 96%

Question 2: Since most of my file /exp/dust1 (~74% = 1 - 400MB/1500MB) resides on /tank/mnt-pt/file4, will ZFS employ stripes while creating this file?
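df -h only reports totals for the pool; the per-vdev picture the questions above are really about is visible with zpool's own tools. A minimal sketch, reusing the exp pool from the transcript above:

# zpool list exp           <=== overall pool size, used and available space
# zpool iostat -v exp      <=== the same capacity figures broken down per backing-file vdev, plus per-vdev I/O counters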
Brandon High
2008-Apr-15 02:33 UTC
[zfs-discuss] Will ZFS employ raid0 stripes in an ordinary storage pool?
On Mon, Apr 14, 2008 at 3:43 AM, Bhaskar Jayaraman <bhaskar.jayaraman at lsi.com> wrote:

> Question 1: Will zfs employ ordinary raid0 stripes while creating the file dust?

Sort of, though it's not raid0. It will balance the writes across the members of its storage pool. So in your 3-disk zpool, the writes will initially be spread across all 3 members. When the zpool gets very full, writes may go to one device more than others due to space requirements.

> Question 2: Since most of my file /exp/dust1 (~74% = 1 - 400MB/1500MB) resides on /tank/mnt-pt/file4, will zfs employ stripes while creating this file?

Yes and no. Some of the data will be written to other members of the zpool, but the majority of the file will be written to the newly added disk. I think that some space will be reserved for the ZIL on all members, so there would be a little less than 400 MB written to the existing members and the rest written to the new device. I did a quick search for references and couldn't find any, so take this with a grain of salt.

-B

--
Brandon High  bhigh at freaks.com
"The good is the enemy of the best." - Nietzsche
Brandon High
2008-Apr-15 18:58 UTC
[zfs-discuss] Will ZFS employ raid0 stripes in an ordinary storage pool?
On Tue, Apr 15, 2008 at 2:44 AM, Jayaraman, Bhaskar <Bhaskar.Jayaraman at lsi.com> wrote:

> Thanks Brandon, so basically there is no way of knowing: -
> 1] How your file will be distributed across the disks
> 2] What will be the stripe size

You could look at the source to try to determine it. I'm not sure if there's a hard & fast rule, however.

The stripe size will be across all vdevs that have space. For each stripe written, more data will land on the empty vdev. Once the previously existing vdevs fill up, writes will go to the new vdev.

> 3] If the blocks are redistributed when a new disk is attached to the storage.

No existing data is redistributed, so far as I know. If there is a lot of churn on the volume, it will eventually balance out, but if you add a new vdev and don't add data to the zpool, there will be no data on the new vdev.

Unless you've dedicated a device, the ZIL is supposed to stripe across all vdevs, so there will be some activity on the new volume.

> If you happen to know, what does "Every block is its own RAID-Z stripe, regardless of blocksize" mean? http://blogs.sun.com/bonwick/category/ZFS

In traditional RAID-5 or RAID-6, the entire stripe has to be read, altered, XOR re-calculated, and written to disk. Because of this, there's the RAID-5 write hole, where data can be lost or corrupted.

By "full stripe", I don't think he meant a write across all vdevs, but a write of data and associated XOR information, which would allow recovery in the event of a device death.

> 4] So if I create files of 64kb or less then I'm assuming zfs will determine some stripe size and stripe my file across the assigned block pointers (let's assume 4 blocks of 16kb each)???
>
> 5] However if I create a file of 256kb then it may stripe it across 4 blocks again with the stripe size at 64kb this time, but we can never be sure how this is decided - is that right?

For a non-protected stripe, I don't think that's how it works. It'll write data to some (but not necessarily all) of the vdevs. I believe the cut-off is 512KB. So a write of 1MB will "stripe" across 2 vdevs (biased toward under-utilized vdevs), while 4 writes of 64KB each could land on the same vdev.

I think RAID-Z is different, since the stripe needs to spread across all devices for protection. I'm not sure how it's done.

> 6] Still I don't see how each block becomes its own stripe unless there is byte-level striping with each byte on a different disk block, which would be very wasteful.

See above.

Again, my answers may not be correct. This is what I've gleaned from my own research.

-B

--
Brandon High  bhigh at freaks.com
"The good is the enemy of the best." - Nietzsche
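Rather than guessing where a particular file's blocks landed, zdb can dump the block pointers for a file. A rough sketch, assuming the exp pool from earlier in the thread; <object-number> is a placeholder for the number printed by ls -i (on ZFS the reported inode number is the object number):

# ls -i /exp/dust1                    <=== prints the object number of the file
# zdb -ddddd exp <object-number>      <=== dumps the object, including block pointers; the first field of each DVA identifies the vdev the block lives on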
Bob Friesenhahn
2008-Apr-15 19:12 UTC
[zfs-discuss] Will ZFS employ raid0 stripes in an ordinary storage pool?
On Tue, 15 Apr 2008, Brandon High wrote:

> I think RAID-Z is different, since the stripe needs to spread across
> all devices for protection. I'm not sure how it's done.

My understanding is that RAID-Z is indeed different and does NOT have to spread across all devices for protection. It can use fewer than the total available devices, and since parity is distributed, the parity could be written to any drive. I am sure that someone will correct me if the above is wrong.

Bob
======================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
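For anyone who wants to poke at RAID-Z behaviour the same way the original poster built exp, a throwaway raidz pool can be made from file vdevs. A sketch only; the pool name rz and the backing file paths are invented for illustration:

# mkfile 1g /tank/mnt-pt/r1 /tank/mnt-pt/r2 /tank/mnt-pt/r3
# zpool create rz raidz /tank/mnt-pt/r1 /tank/mnt-pt/r2 /tank/mnt-pt/r3
# zpool status rz          <=== the three files show up under a single raidz1 vdev
# zpool destroy rz         <=== clean up afterwards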
Brandon High
2008-Apr-15 22:24 UTC
[zfs-discuss] Will ZFS employ raid0 stripes in an ordinary storage pool?
On Tue, Apr 15, 2008 at 12:12 PM, Bob Friesenhahn <bfriesen at simple.dallas.tx.us> wrote:

> On Tue, 15 Apr 2008, Brandon High wrote:
> > I think RAID-Z is different, since the stripe needs to spread across
> > all devices for protection. I'm not sure how it's done.
>
> My understanding is that RAID-Z is indeed different and does NOT have to
> spread across all devices for protection. It can use fewer than the total
> available devices, and since parity is distributed, the parity could be
> written to any drive.

I think you're right. The parity information for a block has to be written to a second (or third for raidz2) vdev to qualify as a "full stripe write", but this is not necessarily writing to all devices in the zpool.

-B

--
Brandon High  bhigh at freaks.com
"The good is the enemy of the best." - Nietzsche
Richard Elling
2008-Apr-16 22:19 UTC
[zfs-discuss] Will ZFS employ raid0 stripes in an ordinary storage pool?
Brandon High wrote:
> On Tue, Apr 15, 2008 at 2:44 AM, Jayaraman, Bhaskar
> <Bhaskar.Jayaraman at lsi.com> wrote:
>
>> Thanks Brandon, so basically there is no way of knowing: -
>> 1] How your file will be distributed across the disks
>> 2] What will be the stripe size
>
> You could look at the source to try to determine it. I'm not sure if
> there's a hard & fast rule, however.
>
> The stripe size will be across all vdevs that have space. For each
> stripe written, more data will land on the empty vdev. Once the
> previously existing vdevs fill up, writes will go to the new vdev.

In general, the first time a device is filled, space will be allocated in the spacemap one slab at a time. The default slab size is 1 MByte. So when you look at physical I/O, you may see something like 8 128-kByte sequential writes to one vdev concurrent with 8 128-kByte sequential writes to another vdev, and so on. Reads go where needed. As the space is filled, you may see freed blocks in the spacemap become re-allocated. How it will look depends on many factors, but I can map it in 3-D :-)

>> 3] If the blocks are redistributed when a new disk is attached to the
>> storage.
>
> No existing data is redistributed, so far as I know. If there is a lot
> of churn on the volume, it will eventually balance out, but if you add
> a new vdev and don't add data to the zpool, there will be no data on
> the new vdev.
>
> Unless you've dedicated a device, the ZIL is supposed to stripe across
> all vdevs, so there will be some activity on the new volume.
>
>> If you happen to know, what does "Every block is its own RAID-Z stripe,
>> regardless of blocksize" mean? http://blogs.sun.com/bonwick/category/ZFS
>
> In traditional RAID-5 or RAID-6, the entire stripe has to be read,
> altered, XOR re-calculated, and written to disk. Because of this,
> there's the RAID-5 write hole, where data can be lost or corrupted.
>
> By "full stripe", I don't think he meant a write across all vdevs, but
> a write of data and associated XOR information, which would allow
> recovery in the event of a device death.
>
>> 4] So if I create files of 64kb or less then I'm assuming zfs will
>> determine some stripe size and stripe my file across the assigned block
>> pointers (let's assume 4 blocks of 16kb each)???

No, not normally. ZFS groups writes to try to do 128-kByte writes. So in a single 128-kByte block, there may be parts of different files. By default the transaction group is flushed every 5 seconds, but there are many reasons this may change (see other recent threads here).

It makes better sense if you think of ZFS allocating memory across the devices, like the Solaris VM, rather than trying to look at it as writing data to disks in the traditional disk storage sense. See Jeff Bonwick's papers on the slab allocator and additional commentary on space maps at http://blogs.sun.com/bonwick (see the comments, too :-)
 -- richard

>> 5] However if I create a file of 256kb then it may stripe it across 4
>> blocks again with the stripe size at 64kb this time, but we can never be
>> sure how this is decided - is that right?
>
> For a non-protected stripe, I don't think that's how it works. It'll
> write data to some (but not necessarily all) of the vdevs. I believe
> the cut-off is 512KB. So a write of 1MB will "stripe" across 2 vdevs
> (biased toward under-utilized vdevs), while 4 writes of 64KB each
> could land on the same vdev.
>
> I think RAID-Z is different, since the stripe needs to spread across
> all devices for protection. I'm not sure how it's done.
>
>> 6] Still I don't see how each block becomes its own stripe unless there
>> is byte-level striping with each byte on a different disk block, which
>> would be very wasteful.
>
> See above.
>
> Again, my answers may not be correct. This is what I've gleaned from
> my own research.
>
> -B
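The physical write pattern Richard describes can be watched directly with the DTrace io provider. A sketch, not specific to ZFS internals; note that on a file-backed pool like exp the I/O is attributed to the underlying disk (c2t1d0 here) rather than to the individual file vdevs:

# dtrace -n 'io:::start { @[args[1]->dev_statname] = quantize(args[0]->b_bcount); }'

Run a large mkfile against /exp in another terminal, then interrupt dtrace with Ctrl-C to get a per-device distribution of physical I/O sizes.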
Brandon High
2008-Apr-16 22:48 UTC
[zfs-discuss] Will ZFS employ raid0 stripes in an ordinary storage pool?
On Wed, Apr 16, 2008 at 3:19 PM, Richard Elling <Richard.Elling at sun.com> wrote:

> Brandon High wrote:
> > The stripe size will be across all vdevs that have space. For each
> > stripe written, more data will land on the empty vdev. Once the
> > previously existing vdevs fill up, writes will go to the new vdev.
>
> In general, the first time a device is filled, space will be allocated
> in the spacemap one slab at a time. The default slab size is 1 MByte.
> So when you look at physical I/O, you may see something like 8
> 128-kByte sequential writes to one vdev concurrent with 8 128-kByte
> sequential writes to another vdev, and so on. Reads go where
> needed.

In a case where a new vdev is added to an almost full zpool, more of the writes should land on the empty device though, right? So maybe 2 slabs will land on the new vdev for every one that goes to a previously existing vdev.

One problem that I'm having trouble getting my head around, and I'm sure others are too, is that in terms of block allocation, zfs's dynamic striping is absolutely nothing like raid-0. Correct me if I'm wrong, but while the end result is similar (I/O is distributed across all vdevs in the pool), the details are very different. That it's compared to raid-0 just makes matters worse due to terminology overlap and a preconception of how striping works. Dynamic striping doesn't really seem to be striping at all - it's dynamically distributed block allocation across the members of a zpool.

Likewise, raidz and raidz2 compare favorably to raid-5 and raid-6, but don't share implementation details. Some of the literature out there, like Jeff's blog, mentions a raidz "full stripe write". This is not really correct, since it isn't writing a full stripe due to the way the storage is allocated. The end result is equivalent to a full stripe write in a conventional parity-protected system, but it's not the same thing.

Short of creating a whole new dialect around zfs, I can't think of a way to eliminate the overlap, but it is a little confusing when still figuring out how things work.

-B

--
Brandon High  bhigh at freaks.com
"The good is the enemy of the best." - Nietzsche
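One way to get a feel for how the writes actually split between the old and new vdevs is to watch per-vdev write bandwidth while a large file is being written. A sketch, again reusing the exp pool from this thread:

# zpool iostat -v exp 1    <=== 1-second samples; compare the write bandwidth on file4 against file1-file3 while a mkfile is running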
Mario Goebbels (Webmail)
2008-Apr-16 23:45 UTC
[zfs-discuss] Will ZFS employ raid0 stripes in an ordinary storage pool?
> In a case where a new vdev is added to an almost full zpool, more of
> the writes should land on the empty device though, right? So maybe 2
> slabs will land on the new vdev for every one that goes to a
> previously existing vdev.

(Un)available disk space influences vdev selection. New writes will very likely end up on the empty vdev.

> One problem that I'm having trouble getting my head around, and I'm
> sure others are too, is that in terms of block allocation, zfs's
> dynamic striping is absolutely nothing like raid-0. Correct me if I'm
> wrong, but while the end result is similar (I/O is distributed across
> all vdevs in the pool), the details are very different. That it's
> compared to raid-0 just makes matters worse due to terminology overlap
> and a preconception of how striping works. Dynamic striping doesn't
> really seem to be striping at all - it's dynamically distributed block
> allocation across the members of a zpool.

The only thing that's nothing like RAID-0 is the device selection process for where a stripe/block goes. RAID-0 is dumb modulo, whereas ZFS makes educated guesses based on vdev stats. Since a file in ZFS is still being spread in fixed-length pieces (the default ZFS record size is 128KB, larger than what you usually find as a stripe size in RAID-0s), I think calling it striping is still valid.

Regards,
-mg
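The record size Mario mentions is a per-filesystem property and can be checked or changed with zfs(1M). A sketch against the exp pool from this thread; lowering it is normally only worthwhile for fixed-record workloads such as databases:

# zfs get recordsize exp              <=== shows the default of 128K
# zfs set recordsize=64K exp          <=== affects newly written files only, not blocks already on disk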
Robert Milkowski
2008-Apr-17 13:28 UTC
[zfs-discuss] Will ZFS employ raid0 stripes in an ordinary storage pool?
Hello Richard,

Wednesday, April 16, 2008, 11:19:27 PM, you wrote:

RE> No, not normally. ZFS groups writes to try to do 128kByte writes.
RE> So in a single 128kByte block, there may be parts of different files.
RE> By default the transaction group is flushed every 5 seconds, but there
RE> are many reasons this may change (see other recent threads here).

To clarify - you are not referring to the fs-level block?

--
Best regards,
Robert                          mailto:milek at task.gda.pl
                                http://milek.blogspot.com
Richard Elling
2008-Apr-17 17:08 UTC
[zfs-discuss] Will ZFS employ raid0 stripes in an ordinary storage pool?
Robert Milkowski wrote:

> Hello Richard,
>
> Wednesday, April 16, 2008, 11:19:27 PM, you wrote:
>
> RE> No, not normally. ZFS groups writes to try to do 128kByte writes.
> RE> So in a single 128kByte block, there may be parts of different files.
> RE> By default the transaction group is flushed every 5 seconds, but there
> RE> are many reasons this may change (see other recent threads here).
>
> To clarify - you are not referring to the fs-level block?

I should probably use the term "recordsize" for ZFS in this context. You are correct, "block" is often overloaded. Perhaps "recordsize" less so.
 -- richard