Bhaskar Jayaraman
2008-Apr-14 10:43 UTC
[zfs-discuss] Will ZFS employ raid0 stripes in an ordinary storage pool?
Hi, I'm doing the following actions on my Solaris 10 system. Please let me know if ZFS will do the following things:

Question 1: Will ZFS employ ordinary raid0 stripes while creating the file dust?

Question 2: Since most of my file /exp/dust1 (~74% = 1 - 400MB/1500MB) resides on /tank/mnt-pt/file4, will ZFS employ stripes while creating this file?

# zpool create exp /tank/mnt-pt/file1 /tank/mnt-pt/file2 /tank/mnt-pt/file3

# df -h
Filesystem                       size   used  avail capacity  Mounted on
/dev/dsk/c0d0s0                   15G   8.3G   6.4G    57%    /
/devices                           0K     0K     0K     0%    /devices
/dev                               0K     0K     0K     0%    /dev
ctfs                               0K     0K     0K     0%    /system/contract
proc                               0K     0K     0K     0%    /proc
mnttab                             0K     0K     0K     0%    /etc/mnttab
swap                              2.3G  1012K   2.3G     1%    /etc/svc/volatile
objfs                              0K     0K     0K     0%    /system/object
sharefs                            0K     0K     0K     0%    /etc/dfs/sharetab
/usr/lib/libc/libc_hwcap3.so.1    15G   8.3G   6.4G    57%    /lib/libc.so.1
fd                                 0K     0K     0K     0%    /dev/fd
swap                              2.3G    44K   2.3G     1%    /tmp
swap                              2.3G    92K   2.3G     1%    /var/run
/dev/dsk/c0d0s4                   15G    15M    15G     1%    /second_root
/dev/dsk/c0d0s7                   32G   4.0G    28G    13%    /export/home
tank                              16G   5.9G    10G    37%    /tank
/dev/dsk/c2t1d0s2                 17G   4.7G    12G    29%    /tank/mnt-pt
exp                              3.4G     1K   3.4G     1%    /exp

# zpool status
  pool: exp
 state: ONLINE
 scrub: none requested
config:

        NAME                  STATE     READ WRITE CKSUM
        exp                   ONLINE       0     0     0
          /tank/mnt-pt/file1  ONLINE       0     0     0
          /tank/mnt-pt/file2  ONLINE       0     0     0
          /tank/mnt-pt/file3  ONLINE       0     0     0

# mkfile 3000m /exp/dust                 <=== Here I have eaten up 3 GB out of the 3.4 GB available

Question 1: Will ZFS employ ordinary raid0 stripes while creating the file dust?

# zpool add exp /tank/mnt-pt/file4       <=== I'm adding another disk to the pool

# zpool status
  pool: exp
 state: ONLINE
 scrub: none requested
config:

        NAME                  STATE     READ WRITE CKSUM
        exp                   ONLINE       0     0     0
          /tank/mnt-pt/file1  ONLINE       0     0     0
          /tank/mnt-pt/file2  ONLINE       0     0     0
          /tank/mnt-pt/file3  ONLINE       0     0     0
          /tank/mnt-pt/file4  ONLINE       0     0     0

# df -h
Filesystem                       size   used  avail capacity  Mounted on
/dev/dsk/c0d0s0                   15G   8.3G   6.4G    57%    /
/devices                           0K     0K     0K     0%    /devices
/dev                               0K     0K     0K     0%    /dev
ctfs                               0K     0K     0K     0%    /system/contract
proc                               0K     0K     0K     0%    /proc
mnttab                             0K     0K     0K     0%    /etc/mnttab
swap                              2.3G  1012K   2.3G     1%    /etc/svc/volatile
objfs                              0K     0K     0K     0%    /system/object
sharefs                            0K     0K     0K     0%    /etc/dfs/sharetab
/usr/lib/libc/libc_hwcap3.so.1    15G   8.3G   6.4G    57%    /lib/libc.so.1
fd                                 0K     0K     0K     0%    /dev/fd
swap                              2.3G    44K   2.3G     1%    /tmp
swap                              2.3G    92K   2.3G     1%    /var/run
/dev/dsk/c0d0s4                   15G    15M    15G     1%    /second_root
/dev/dsk/c0d0s7                   32G   4.0G    28G    13%    /export/home
tank                              16G   5.9G    10G    37%    /tank
/dev/dsk/c2t1d0s2                 17G   4.7G    12G    29%    /tank/mnt-pt
exp                              4.6G   2.9G   1.7G    64%    /exp      <=== Capacity increased to 4.6 GB

# mkfile 1500m /exp/dust1                <=== I add another file to eat up more space

# df -h
Filesystem                       size   used  avail capacity  Mounted on
/dev/dsk/c0d0s0                   15G   8.3G   6.4G    57%    /
/devices                           0K     0K     0K     0%    /devices
/dev                               0K     0K     0K     0%    /dev
ctfs                               0K     0K     0K     0%    /system/contract
proc                               0K     0K     0K     0%    /proc
mnttab                             0K     0K     0K     0%    /etc/mnttab
swap                              2.3G  1012K   2.3G     1%    /etc/svc/volatile
objfs                              0K     0K     0K     0%    /system/object
sharefs                            0K     0K     0K     0%    /etc/dfs/sharetab
/usr/lib/libc/libc_hwcap3.so.1    15G   8.3G   6.4G    57%    /lib/libc.so.1
fd                                 0K     0K     0K     0%    /dev/fd
swap                              2.3G    44K   2.3G     1%    /tmp
swap                              2.3G    92K   2.3G     1%    /var/run
/dev/dsk/c0d0s4                   15G    15M    15G     1%    /second_root
/dev/dsk/c0d0s7                   32G   4.0G    28G    13%    /export/home
tank                              16G   5.9G    10G    37%    /tank
/dev/dsk/c2t1d0s2                 17G   4.7G    12G    29%    /tank/mnt-pt
exp                              4.6G   4.4G   192M    96%    /exp      <=== Now capacity used is 96%

Question 2: Since most of my file /exp/dust1 (~74% = 1 - 400MB/1500MB) resides on /tank/mnt-pt/file4, will ZFS employ stripes while creating this file?
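df -h only reports totals for the pool; the per-vdev picture the questions above are really about is visible with zpool's own tools. A minimal sketch, reusing the exp pool from the transcript above:

# zpool list exp           <=== overall pool size, used and available space
# zpool iostat -v exp      <=== the same capacity figures broken down per backing-file vdev, plus per-vdev I/O counters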
Brandon High
2008-Apr-15 02:33 UTC
[zfs-discuss] Will ZFS employ raid0 stripes in an ordinary storage pool?
On Mon, Apr 14, 2008 at 3:43 AM, Bhaskar Jayaraman <bhaskar.jayaraman at lsi.com> wrote:

> Question 1: Will zfs employ ordinary raid0 stripes while creating the file dust?

Sort of, though it's not raid0. It will balance the writes across the members of its storage pool. So in your 3-disk zpool, the writes will initially be spread across all 3 members. When the zpool gets very full, writes may go to one device more than others due to space requirements.

> Question 2: Since most of my file /exp/dust1 (~74% = 1 - 400MB/1500MB) resides on /tank/mnt-pt/file4, will zfs employ stripes while creating this file?

Yes and no. Some of the data will be written to other members of the zpool, but the majority of the file will be written to the newly added disk. I think that some space will be reserved for the ZIL on all members, so there would be a little less than 400 MB written to the existing members and the rest written to the new device. I did a quick search for references and couldn't find any, so take this with a grain of salt.

-B

--
Brandon High  bhigh at freaks.com
"The good is the enemy of the best." - Nietzsche
Brandon High
2008-Apr-15 18:58 UTC
[zfs-discuss] Will ZFS employ raid0 stripes in an ordinary storage pool?
On Tue, Apr 15, 2008 at 2:44 AM, Jayaraman, Bhaskar <Bhaskar.Jayaraman at lsi.com> wrote:

> Thanks Brandon, so basically there is no way of knowing: -
> 1] How your file will be distributed across the disks
> 2] What will be the stripe size

You could look at the source to try to determine it. I'm not sure if there's a hard & fast rule, however.

The stripe size will be across all vdevs that have space. For each stripe written, more data will land on the empty vdev. Once the previously existing vdevs fill up, writes will go to the new vdev.

> 3] If the blocks are redistributed when a new disk is attached to the storage.

No existing data is redistributed, so far as I know. If there is a lot of churn on the volume, it will eventually balance out, but if you add a new vdev and don't add data to the zpool, there will be no data on the new vdev.

Unless you've dedicated a device, the ZIL is supposed to stripe across all vdevs, so there will be some activity on the new volume.

> If you happen to know, what does "Every block is its own RAID-Z stripe, regardless of blocksize" mean? http://blogs.sun.com/bonwick/category/ZFS

In traditional RAID-5 or RAID-6, the entire stripe has to be read, altered, XOR re-calculated, and written to disk. Because of this, there's the RAID-5 write hole, where data can be lost or corrupted.

By "full stripe", I don't think he meant a write across all vdevs, but a write of data and associated XOR information, which would allow recovery in the event of a device death.

> 4] So if I create files of 64kb or less then I'm assuming zfs will determine some stripe size and stripe my file across the assigned block pointers (let's assume 4 blocks of 16kb each)???
>
> 5] However if I create a file of 256kb then it may stripe it across 4 blocks again with the stripe size at 64kb this time, but we can never be sure how this is decided - is that right?

For a non-protected stripe, I don't think that's how it works. It'll write data to some (but not necessarily all) of the vdevs. I believe the cut-off is 512KB. So a write of 1MB will "stripe" across 2 vdevs (biased toward under-utilized vdevs), while 4 writes of 64KB each could land on the same vdev.

I think RAID-Z is different, since the stripe needs to spread across all devices for protection. I'm not sure how it's done.

> 6] Still I don't see how each block becomes its own stripe unless there is byte-level striping with each byte on a different disk block, which would be very wasteful.

See above.

Again, my answers may not be correct. This is what I've gleaned from my own research.

-B

--
Brandon High  bhigh at freaks.com
"The good is the enemy of the best." - Nietzsche
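Rather than guessing where a particular file's blocks landed, zdb can dump the block pointers for a file. A rough sketch, assuming the exp pool from earlier in the thread; <object-number> is a placeholder for the number printed by ls -i (on ZFS the reported inode number is the object number):

# ls -i /exp/dust1                    <=== prints the object number of the file
# zdb -ddddd exp <object-number>      <=== dumps the object, including block pointers; the first field of each DVA identifies the vdev the block lives on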
Bob Friesenhahn
2008-Apr-15 19:12 UTC
[zfs-discuss] Will ZFS employ raid0 stripes in an ordinary storage pool?
On Tue, 15 Apr 2008, Brandon High wrote:

> I think RAID-Z is different, since the stripe needs to spread across
> all devices for protection. I'm not sure how it's done.

My understanding is that RAID-Z is indeed different and does NOT have to spread across all devices for protection. It can use fewer than the total available devices, and since parity is distributed, the parity could be written to any drive. I am sure that someone will correct me if the above is wrong.

Bob
======================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
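For anyone who wants to poke at RAID-Z behaviour the same way the original poster built exp, a throwaway raidz pool can be made from file vdevs. A sketch only; the pool name rz and the backing file paths are invented for illustration:

# mkfile 1g /tank/mnt-pt/r1 /tank/mnt-pt/r2 /tank/mnt-pt/r3
# zpool create rz raidz /tank/mnt-pt/r1 /tank/mnt-pt/r2 /tank/mnt-pt/r3
# zpool status rz          <=== the three files show up under a single raidz1 vdev
# zpool destroy rz         <=== clean up afterwards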
Brandon High
2008-Apr-15 22:24 UTC
[zfs-discuss] Will ZFS employ raid0 stripes in an ordinary storage pool?
On Tue, Apr 15, 2008 at 12:12 PM, Bob Friesenhahn <bfriesen at simple.dallas.tx.us> wrote:

> On Tue, 15 Apr 2008, Brandon High wrote:
> > I think RAID-Z is different, since the stripe needs to spread across
> > all devices for protection. I'm not sure how it's done.
>
> My understanding is that RAID-Z is indeed different and does NOT have to
> spread across all devices for protection. It can use fewer than the total
> available devices, and since parity is distributed, the parity could be
> written to any drive.

I think you're right. The parity information for a block has to be written to a second (or third for raidz2) vdev to qualify as a "full stripe write", but this is not necessarily writing to all devices in the zpool.

-B

--
Brandon High  bhigh at freaks.com
"The good is the enemy of the best." - Nietzsche
Richard Elling
2008-Apr-16 22:19 UTC
[zfs-discuss] Will ZFS employ raid0 stripes in an ordinary storage pool?
Brandon High wrote:
> On Tue, Apr 15, 2008 at 2:44 AM, Jayaraman, Bhaskar
> <Bhaskar.Jayaraman at lsi.com> wrote:
>
>> Thanks Brandon, so basically there is no way of knowing: -
>> 1] How your file will be distributed across the disks
>> 2] What will be the stripe size
>
> You could look at the source to try to determine it. I'm not sure if
> there's a hard & fast rule, however.
>
> The stripe size will be across all vdevs that have space. For each
> stripe written, more data will land on the empty vdev. Once the
> previously existing vdevs fill up, writes will go to the new vdev.

In general, the first time a device is filled, space will be allocated in the spacemap one slab at a time. The default slab size is 1 MByte. So when you look at physical I/O, you may see something like 8 128-kByte sequential writes to one vdev concurrent with 8 128-kByte sequential writes to another vdev, and so on. Reads go where needed. As the space is filled, you may see freed blocks in the spacemap become re-allocated. How it will look depends on many factors, but I can map it in 3-D :-)

>> 3] If the blocks are redistributed when a new disk is attached to the
>> storage.
>
> No existing data is redistributed, so far as I know. If there is a lot
> of churn on the volume, it will eventually balance out, but if you add
> a new vdev and don't add data to the zpool, there will be no data on
> the new vdev.
>
> Unless you've dedicated a device, the ZIL is supposed to stripe across
> all vdevs, so there will be some activity on the new volume.
>
>> If you happen to know, what does "Every block is its own RAID-Z stripe,
>> regardless of blocksize" mean? http://blogs.sun.com/bonwick/category/ZFS
>
> In traditional RAID-5 or RAID-6, the entire stripe has to be read,
> altered, XOR re-calculated, and written to disk. Because of this,
> there's the RAID-5 write hole, where data can be lost or corrupted.
>
> By "full stripe", I don't think he meant a write across all vdevs, but
> a write of data and associated XOR information, which would allow
> recovery in the event of a device death.
>
>> 4] So if I create files of 64kb or less then I'm assuming zfs will
>> determine some stripe size and stripe my file across the assigned block
>> pointers (let's assume 4 blocks of 16kb each)???

No, not normally. ZFS groups writes to try to do 128-kByte writes. So in a single 128-kByte block, there may be parts of different files. By default the transaction group is flushed every 5 seconds, but there are many reasons this may change (see other recent threads here).

It makes better sense if you think of ZFS allocating memory across the devices, like the Solaris VM, rather than trying to look at it as writing data to disks in the traditional disk storage sense. See Jeff Bonwick's papers on the slab allocator and additional commentary on space maps at http://blogs.sun.com/bonwick (see the comments, too :-)
 -- richard

>> 5] However if I create a file of 256kb then it may stripe it across 4
>> blocks again with the stripe size at 64kb this time, but we can never be
>> sure how this is decided - is that right?
>
> For a non-protected stripe, I don't think that's how it works. It'll
> write data to some (but not necessarily all) of the vdevs. I believe
> the cut-off is 512KB. So a write of 1MB will "stripe" across 2 vdevs
> (biased toward under-utilized vdevs), while 4 writes of 64KB each
> could land on the same vdev.
>
> I think RAID-Z is different, since the stripe needs to spread across
> all devices for protection. I'm not sure how it's done.
>
>> 6] Still I don't see how each block becomes its own stripe unless there
>> is byte-level striping with each byte on a different disk block, which
>> would be very wasteful.
>
> See above.
>
> Again, my answers may not be correct. This is what I've gleaned from
> my own research.
>
> -B
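The physical write pattern Richard describes can be watched directly with the DTrace io provider. A sketch, not specific to ZFS internals; note that on a file-backed pool like exp the I/O is attributed to the underlying disk (c2t1d0 here) rather than to the individual file vdevs:

# dtrace -n 'io:::start { @[args[1]->dev_statname] = quantize(args[0]->b_bcount); }'

Run a large mkfile against /exp in another terminal, then interrupt dtrace with Ctrl-C to get a per-device distribution of physical I/O sizes.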
Brandon High
2008-Apr-16 22:48 UTC
[zfs-discuss] Will ZFS employ raid0 stripes in an ordinary storage pool?
On Wed, Apr 16, 2008 at 3:19 PM, Richard Elling <Richard.Elling at sun.com> wrote:

> Brandon High wrote:
> > The stripe size will be across all vdevs that have space. For each
> > stripe written, more data will land on the empty vdev. Once the
> > previously existing vdevs fill up, writes will go to the new vdev.
>
> In general, the first time a device is filled, space will be allocated
> in the spacemap one slab at a time. The default slab size is 1 MByte.
> So when you look at physical I/O, you may see something like 8
> 128-kByte sequential writes to one vdev concurrent with 8 128-kByte
> sequential writes to another vdev, and so on. Reads go where
> needed.

In a case where a new vdev is added to an almost full zpool, more of the writes should land on the empty device though, right? So maybe 2 slabs will land on the new vdev for every one that goes to a previously existing vdev.

One problem that I'm having trouble getting my head around, and I'm sure others are too, is that in terms of block allocation, zfs's dynamic striping is absolutely nothing like raid-0. Correct me if I'm wrong, but while the end result is similar (I/O is distributed across all vdevs in the pool), the details are very different. That it's compared to raid-0 just makes matters worse due to terminology overlap and a preconception of how striping works. Dynamic striping doesn't really seem to be striping at all - it's dynamically distributed block allocation across the members of a zpool.

Likewise, raidz and raidz2 compare favorably to raid-5 and raid-6, but don't share implementation details. Some of the literature out there, like Jeff's blog, mentions a raidz "full stripe write". This is not really correct, since it isn't writing a full stripe due to the way the storage is allocated. The end result is equivalent to a full stripe write in a conventional parity-protected system, but it's not the same thing.

Short of creating a whole new dialect around zfs, I can't think of a way to eliminate the overlap, but it is a little confusing when still figuring out how things work.

-B

--
Brandon High  bhigh at freaks.com
"The good is the enemy of the best." - Nietzsche
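One way to get a feel for how the writes actually split between the old and new vdevs is to watch per-vdev write bandwidth while a large file is being written. A sketch, again reusing the exp pool from this thread:

# zpool iostat -v exp 1    <=== 1-second samples; compare the write bandwidth on file4 against file1-file3 while a mkfile is running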
Mario Goebbels (Webmail)
2008-Apr-16 23:45 UTC
[zfs-discuss] Will ZFS employ raid0 stripes in an ordinary storage pool?
> In a case where a new vdev is added to an almost full zpool, more of
> the writes should land on the empty device though, right? So maybe 2
> slabs will land on the new vdev for every one that goes to a
> previously existing vdev.

(Un)available disk space influences vdev selection. New writes will very likely end up on the empty vdev.

> One problem that I'm having trouble getting my head around, and I'm
> sure others are too, is that in terms of block allocation, zfs's
> dynamic striping is absolutely nothing like raid-0. Correct me if I'm
> wrong, but while the end result is similar (I/O is distributed across
> all vdevs in the pool), the details are very different. That it's
> compared to raid-0 just makes matters worse due to terminology overlap
> and a preconception of how striping works. Dynamic striping doesn't
> really seem to be striping at all - it's dynamically distributed block
> allocation across the members of a zpool.

The only thing that's nothing like RAID-0 is the device selection process for where a stripe/block goes. RAID-0 is dumb modulo, whereas ZFS makes educated guesses based on vdev stats. Since a file in ZFS is still being spread in fixed-length pieces (the default ZFS record size is 128KB, larger than what you usually find as a stripe size in RAID-0s), I think calling it striping is still valid.

Regards,
-mg
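The record size Mario mentions is a per-filesystem property and can be checked or changed with zfs(1M). A sketch against the exp pool from this thread; lowering it is normally only worthwhile for fixed-record workloads such as databases:

# zfs get recordsize exp              <=== shows the default of 128K
# zfs set recordsize=64K exp          <=== affects newly written files only, not blocks already on disk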
Robert Milkowski
2008-Apr-17 13:28 UTC
[zfs-discuss] Will ZFS employ raid0 stripes in an ordinary storage pool?
Hello Richard,

Wednesday, April 16, 2008, 11:19:27 PM, you wrote:

RE> No, not normally. ZFS groups writes to try to do 128kByte writes.
RE> So in a single 128kByte block, there may be parts of different files.
RE> By default the transaction group is flushed every 5 seconds, but there
RE> are many reasons this may change (see other recent threads here).

To clarify - you are not referring to the fs-level block?

--
Best regards,
Robert                          mailto:milek at task.gda.pl
                                http://milek.blogspot.com
Richard Elling
2008-Apr-17 17:08 UTC
[zfs-discuss] Will ZFS employ raid0 stripes in an ordinary storage pool?
Robert Milkowski wrote:

> Hello Richard,
>
> Wednesday, April 16, 2008, 11:19:27 PM, you wrote:
>
> RE> No, not normally. ZFS groups writes to try to do 128kByte writes.
> RE> So in a single 128kByte block, there may be parts of different files.
> RE> By default the transaction group is flushed every 5 seconds, but there
> RE> are many reasons this may change (see other recent threads here).
>
> To clarify - you are not referring to the fs-level block?

I should probably use the term "recordsize" for ZFS in this context. You are correct, "block" is often overloaded. Perhaps "recordsize" less so.
 -- richard