thr3ads.net - zfs discuss - [zfs-discuss] Basic question about striping and ZFS [Nov 2009]

Ilya

2009-Nov-05 01:00 UTC

[zfs-discuss] Basic question about striping and ZFS

I'm researching ZFS and had a question relating to RAID-Z and striping. I was
glancing over Jeff's blog
(http://blogs.sun.com/bonwick/entry/raid_z):

"RAID-Z is a data/parity scheme like RAID-5, but it uses dynamic stripe
width. Every block is its own RAID-Z stripe, regardless of blocksize. This means
that every RAID-Z write is a full-stripe write. This, when combined with the
copy-on-write transactional semantics of ZFS, completely eliminates the RAID
write hole. RAID-Z is also faster than traditional RAID because it never has to
do read-modify-write."

So firstly, is this literally referring to the blocks of a file for example?
Also by stripe, is this referring to the stripe UNITS (within a whole stripe) or
the ENTIRE stripe across disks?

So, let's say that you have a file of 64 KB per sector (stripe units
consisting of blocks of whatever size, totaling 64 KB) across four disks.

Disk 0: Stripe 1
Disk 1: Stripe 2
Disk 2: Stripe 3
Disk 3: Parity

When Jeff's blog mentions that "every block has its own stripe", what
exactly does he mean in the context of this example? And let's say that I am
modifying/writing out bytes in the first stripe; how does this affect the
other stripes/parity?
-- 
This message posted from opensolaris.org

Ilya

2009-Nov-05 02:06 UTC

Forgot to add: are those four stripe units (for that one file) above considered
the stripe itself? Or is each of those stripe units on the separate disks
considered a separate stripe?
-- 
This message posted from opensolaris.org

Cindy Swearingen

2009-Nov-05 16:24 UTC

Hi Ilya,

You might check the slides on this page:

http://hub.opensolaris.org/bin/view/Community+Group+zfs/docs

Particularly, slides 14-18.

In this case, graphic illustrations are probably the best way
to answer your questions.

Let us know if they do not answer your questions.

Cindy

On 11/04/09 18:00, Ilya wrote:
> I'm researching ZFS and had a question relating to RAID-Z and striping.
> I was glancing over Jeff's blog
> (http://blogs.sun.com/bonwick/entry/raid_z):
>
> "RAID-Z is a data/parity scheme like RAID-5, but it uses dynamic
> stripe width. Every block is its own RAID-Z stripe, regardless of
> blocksize. This means that every RAID-Z write is a full-stripe write.
> This, when combined with the copy-on-write transactional semantics of
> ZFS, completely eliminates the RAID write hole. RAID-Z is also faster
> than traditional RAID because it never has to do read-modify-write."
>
> So firstly, is this literally referring to the blocks of a file for
> example? Also by stripe, is this referring to the stripe UNITS (within
> a whole stripe) or the ENTIRE stripe across disks?
>
> So, let's say that you have a file of 64 KB per sector (stripe units
> consisting of blocks of whatever size, totaling 64 KB) across four
> disks.
>
> Disk 0: Stripe 1
> Disk 1: Stripe 2
> Disk 2: Stripe 3
> Disk 3: Parity
>
> When Jeff's blog mentions that "every block has its own stripe", what
> exactly does he mean in the context of this example? And let's say
> that I am modifying/writing out bytes in the first stripe; how does
> this affect the other stripes/parity?

Kjetil Torgrim Homme

2009-Nov-05 17:36 UTC

Cindy Swearingen <Cindy.Swearingen at Sun.COM> writes:
> You might check the slides on this page:
>
> http://hub.opensolaris.org/bin/view/Community+Group+zfs/docs
>
> Particularly, slides 14-18.
>
> In this case, graphic illustrations are probably the best way
> to answer your questions.
thanks, Cindy.  can you explain the meaning of the blocks marked X in
the illustration on page 18?

also, I'd appreciate it if you or someone else also could clarify these
questions:

1) is it correct that in a file, only the last block can be smaller than
the recordsize?

example: I have a file with 112 KiB data, and append 32 KiB.  there is
no free space immediately following the existing partial block.  will
ZFS rewrite the file as a full 128 KiB record and an additional 16 KiB
partial record?  

if ZFS does allow several partial blocks, dedup will be less likely to
happen, I think.

2) the basic block of a disk is typically a sector consisting of 512
bytes.  the recordsize in ZFS is a power of two, but the maximum width
of a RAID-Z can vary from 3 drives to ... lots.  let's say we have 6
drives.  one drive is used for parity, so 128 KiB is split across 5
drives, which is stored like

  stripe  1   P D D D D D
   .      :   : : : : : : 
   .      :   : : : : : : 
  stripe 51   P D D D D D
  stripe 52   P D . . . .

so a full sized record will use one additional sector for parity, and
the write sizes on each drive are a bit screwy.  correct?

2a) has anyone tested if using power-of-two-plus-one drives in a RAID-Z
    impacts performance?

2b) are drives happy with writing/reading "unaligned" 512 byte blocks?

-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game
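
The arithmetic behind question 2 can be checked with a short sketch. It
assumes 512-byte sectors and one parity sector per row of data sectors, as
drawn in the diagram above. The helper name raidz_rows is hypothetical, and
the code only models the counting in the question, not the actual ZFS
allocator.

```python
import math

SECTOR = 512  # bytes; the sector size assumed in the question


def raidz_rows(record_bytes, ndisks, nparity=1, sector=SECTOR):
    """Count data sectors, rows, parity sectors and data sectors in the last row.

    Hypothetical helper for illustration only.
    """
    data_disks = ndisks - nparity
    data_sectors = math.ceil(record_bytes / sector)
    # Each row ("stripe" in the diagram) holds up to data_disks data sectors.
    rows = math.ceil(data_sectors / data_disks)
    # Every row, including a partial last one, carries its own parity.
    parity_sectors = rows * nparity
    last_row_data = data_sectors - (rows - 1) * data_disks
    return data_sectors, rows, parity_sectors, last_row_data


# 128 KiB record on a 6-drive raidz1 (5 data + 1 parity)
print(raidz_rows(128 * 1024, ndisks=6))  # (256, 52, 52, 1)
```

With 6 drives, the 256 data sectors fill 51 full rows and leave one sector
for a 52nd row, which still needs a parity sector. With 5 drives (4 data + 1
parity) the same record divides evenly into 64 rows.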

Bob Friesenhahn

2009-Nov-05 18:23 UTC

On Thu, 5 Nov 2009, Kjetil Torgrim Homme wrote:
> 1) is it correct that in a file, only the last block can be smaller than
> the recordsize?
>
> example: I have a file with 112 KiB data, and append 32 KiB.  there is
> no free space immediately following the existing partial block.  will
> ZFS rewrite the file as a full 128 KiB record and an additional 16 KiB
> partial record?

When compression is enabled, the uncompressed data block size is the
same as the recordsize, but the on-disk size is usually smaller so the
answer is clearly 'no'.  I believe that zfs will re-write existing
blocks (based on the uncompressed size) rather than chain many short
blocks.  Otherwise slowly-written files (e.g. log files) would be
quite horribly fragmented.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/

Ilya

2009-Nov-05 19:55 UTC

Hey

Thanks for the slides but some things are still unclear.

Slide 18 shows variably sized extents but doesn't explain the process
of a full-on write. What I'm looking for is one example. I still
don't understand how it works with variable-sized extents. So if you
have 2 stripe units on one disk and 1 stripe unit for the parity, and you
modify only half of the first stripe unit, what happens when you do a
full-stripe write?

I also didn't see a distinction between parity and metadata
reconstruction. I still do not know the process behind the metadata
reconstruction for bad data and when parity is used for bad data.
-- 
This message posted from opensolaris.org

A Darren Dunham

2009-Nov-05 21:19 UTC

On Thu, Nov 05, 2009 at 11:55:58AM -0800, Ilya wrote:
> Slide 18 shows variably sized extents but doesn't explain the process
> of full-on write. What I'm looking for is one example. I still don't
> understand how it works with variable sized extents. So if you have 2
> stripe units on one disk and 1 stripe unit for the parity and you
> modify half of the first stripe unit only, when you do a full-stripe
> write, what happens in terms of a full-stripe write?

You never modify a ZFS block (or part of a ZFS block).  You write a new
replacement block elsewhere and a new metadata tree is constructed that
references the new block and not the old one (other than via snapshots).

I'm not sure I understand your picture of "2 stripe units on one disk
and 1 stripe unit for parity".  That doesn't seem correct.  Are you
looking at a particular portion of that graphic that you could
reference?

> I also didn't see a distinction between parity and metadata
> reconstruction. I still do not know the process behind the metadata
> reconstruction for bad data and when parity is used for bad data.

Not sure what you mean by metadata reconstruction.

The checksums stored in parent blocks can be used to validate child
blocks (either metadata or data).

If the checksum fails, and there is redundant information (copy, parity,
mirror), then the system tries to see if the data is available through
the redundant data.  It will read the other half of the mirror, or try
to read the data using a parity reconstruction.  It will then validate
that other read via the checksum.  If that checksum succeeds, you've
read the data and the system should attempt to rewrite the redundant
info (assuming it was a bad block and not a disk failure that has left
the pool in a degraded state).

If the checksum fails and there is no redundant copy, then the data is
not returned.

-- 
Darren
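
The read path described above can be sketched for the mirror case. This is a
minimal illustration, assuming SHA-256 as a stand-in for whatever checksum
the pool uses and a plain Python list as a stand-in for the mirror sides. The
names cksum and self_healing_read are hypothetical and are not ZFS
interfaces.

```python
import hashlib


def cksum(data):
    """Stand-in checksum; an assumption made for this sketch."""
    return hashlib.sha256(data).digest()


def self_healing_read(copies, expected):
    """Return the first copy that matches the parent's checksum.

    copies:   mutable list of bytes, one entry per mirror side
    expected: checksum stored in the parent block
    """
    bad = []
    for i, data in enumerate(copies):
        if cksum(data) == expected:
            # Rewrite every copy that failed validation from the good one.
            for j in bad:
                copies[j] = data
            return data
        bad.append(i)
    # No redundant copy validated, so the data is not returned.
    raise IOError("checksum mismatch on every copy")


good = b"file contents"
mirror = [b"bit-rotted!!!", good]
assert self_healing_read(mirror, cksum(good)) == good
assert mirror[0] == good  # the damaged side was repaired
```

The checksum only decides which copy to trust. The repair itself always comes
from redundant data.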

Ilya

2009-Nov-06 03:13 UTC

So then of what use is the parity? 

And how is the metadata used to reconstruct bad data? I understand obviously
what the metadata contains but I don't get how ZFS traverses through a
file system and USES the metadata to construct bad blocks.

I understand that you write everything to separate blocks. My question was this:

If you have initially two stripes over two disks like this:

Disk 1: XXXX (Stripe Unit 1)
Disk 2: XXXX (Stripe Unit 2)

You then want to modify something in the first stripe unit, with a modification
that is smaller than the stripe unit, so now Disk 1 and Disk 2 stripes look
like this:

Disk 1: XXYY (the y's indicate modified bits or bytes or whatever)
Disk 2: XXXX

So now, with a full-stripe write, you then make new blocks for both stripes and
just copy the data over to the new blocks. Now, tell me if I am right about what
happens on a full-stripe write:

You read in Disk 1 and Disk 2 stripes in the file system cache. You then apply
the modifications to the Disk 1 stripe within the cache. After this, you compute
the parity within the cache and finally you write out both Disk 1 Stripe and
Disk 2 stripe to new blocks. Since the modifications to the disk 1 stripe (the
Ys) were smaller than the total stripe size, the new sector which will be
written to will be of a smaller stripe size than the originals.

Is this correct?
-- 
This message posted from opensolaris.org

Kjetil Torgrim Homme

2009-Nov-06 10:29 UTC

Ilya <kruhex at gmail.com> writes:
> So then of what use is the parity?

parity is used to reconstruct a bad sector.

> And how is the metadata used to reconstruct bad data?

it is not.  the checksum (which is part of the block pointer) is used to
identify bad data, not correct it.  (the checksum does enable ZFS to
find out *which* block is bad -- traditional RAID can't do this, they
will need a hardware error to identify which block is bad.)

> I understand obviously what the metadata contains but I don't get how
> ZFS traverses through a file system and USES the metadata to construct
> bad blocks.

I don't know the ZFS metadata layout accurately, but it's all
checksummed pointers, so the whole thing is actually a kind of Merkle
Tree.

> I understand that you write everything to separate blocks. My question
> was this:
>
> If you have initially two stripes over two disks like this:
>
> Disk 1: XXXX (Stripe Unit 1)
> Disk 2: XXXX (Stripe Unit 2)
>
> You then want to modify something in the first stripe unit with
> modifications which are smaller so now Disk 1 and Disk 2 stripes look
> like this:
>
> Disk 1: XXYY (the y's indicate modified bits or bytes or whatever)
> Disk 2: XXXX

well, they don't -- you're talking about the file contents.

> So now, with a full-stripe write, you then make new blocks for both
> stripes and just copy the data over to the new blocks. Now, tell me if
> I am [right] with what happens on a full-stripe write:
>
> You read in Disk 1 and Disk 2 stripes in the file system cache. You
> then apply the modifications to the Disk 1 stripe within the
> cache. After this, you compute the parity within the cache and finally
> you write out both Disk 1 Stripe and Disk 2 stripe to new
> blocks. Since the modifications to the disk 1 stripe (the Ys) were
> smaller than the total stripe size, the new sector which will be
> written to will be of a smaller stripe size than the originals.
>
> Is this correct?

it depends on the size of YY.  ZFS will rewrite a full block
(recordsize).  if each unit in your illustration is 512 byte sectors,
it will become

          LBA ->
  Disk 1: ....XXYY
  Disk 2: ....XXXX

("." for free space) (actually, I can't quite make out how you intend
your graph to be interpreted.  shouldn't the YY be striped?)

in addition, the old pointer and checksum will need to be updated.  the
block containing that pointer will also be rewritten, and in turn its
parent block needs updating, all the way to the top of the tree.

if each unit is 32 KiB, YY will consume half a block, and it will look
something like:

          LBA ->
  Disk 1: XX..YY
  Disk 2: XX..XX


well, that's my understanding, anyway.  I learnt on Usenet a long time
ago that an expedient way to get a correct answer from the experts was
to post a wrong answer ;-)
-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game
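
The point above about rewriting the pointer chain "all the way to the top of
the tree" can be illustrated with a toy copy-on-write tree. This is only a
sketch under stated assumptions: a dict stands in for the disk, SHA-256
stands in for the block checksum, indirect blocks are JSON lists of
pointers, and every function name is hypothetical. It does not reflect the
real ZFS on-disk layout.

```python
import hashlib
import json

disk = {}       # address -> bytes; a stand-in for the pool
next_addr = 0   # trivial allocator: every write gets a fresh address


def write_block(payload):
    """Write to a new address and return a checksummed pointer to it."""
    global next_addr
    addr = next_addr
    next_addr += 1
    disk[addr] = payload
    return {"addr": addr, "cksum": hashlib.sha256(payload).hexdigest()}


def write_indirect(pointers):
    """An indirect block is just a list of child pointers."""
    return write_block(json.dumps(pointers).encode())


def read_block(ptr):
    """Read a block and validate it against the checksum held by its parent."""
    payload = disk[ptr["addr"]]
    if hashlib.sha256(payload).hexdigest() != ptr["cksum"]:
        raise IOError("checksum mismatch")
    return payload


def read_leaf(ptr, path):
    for index in path:
        ptr = json.loads(read_block(ptr))[index]
    return read_block(ptr)


def cow_update(ptr, path, new_data):
    """Replace the leaf at `path`; return the pointer to the new root."""
    if not path:
        return write_block(new_data)  # new leaf at a new address
    pointers = json.loads(read_block(ptr))
    pointers[path[0]] = cow_update(pointers[path[0]], path[1:], new_data)
    return write_indirect(pointers)   # the parent changes, so it is rewritten too


leaves = [write_block(b"XXXX"), write_block(b"XXXX")]
root = write_indirect([write_indirect(leaves)])
new_root = cow_update(root, [0, 0], b"XXYY")

assert read_leaf(new_root, [0, 0]) == b"XXYY"
assert read_leaf(root, [0, 0]) == b"XXXX"  # the old tree is still intact
```

Nothing is overwritten in place: the update adds one new leaf and one new
block per ancestor, while the unmodified leaf is shared by both trees. The
old root remains readable, which is the property snapshots rely on.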

Kjetil Torgrim Homme

2009-Nov-06 10:33 UTC

Bob Friesenhahn <bfriesen at simple.dallas.tx.us> writes:
> On Thu, 5 Nov 2009, Kjetil Torgrim Homme wrote:
>>
>> 1) is it correct that in a file, only the last block can be smaller
>> than the recordsize?
>>
>> example: I have a file with 112 KiB data, and append 32 KiB.  there is
>> no free space immediately following the existing partial block.  will
>> ZFS rewrite the file as a full 128 KiB record and an additional 16 KiB
>> partial record?
>
> When compression is enabled, the uncompressed data block size is the
> same as the recordsize, but the on-disk size is usually smaller so the
> answer is clearly 'no'.  I believe that zfs will re-write existing
> blocks (based on the uncompressed size) rather than chain many short
> blocks.  Otherwise slowly-written files (e.g. log files) would be
> quite horribly fragmented.

right, I did mean logical (uncompressed) size in that question.  thanks!

-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game

Kjetil Torgrim Homme

2009-Nov-06 11:34 UTC

Kjetil Torgrim Homme <kjetilho at linpro.no> writes:
> 2a) has anyone tested if using power-of-two-plus-one drives in a RAID-Z
>     impacts performance?
I see now in
<http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide>:

  * Start a single-parity RAIDZ (raidz) configuration at 3 disks (2+1) 
  * Start a double-parity RAIDZ (raidz2) configuration at 5 disks (3+2) 
  * Start a triple-parity RAIDZ (raidz3) configuration at 8 disks (5+3) 
  * (N+P) with P = 1 (raidz), 2 (raidz2), or 3 (raidz3) and N equals 2,
    4, or 8

which answers my question, at least indirectly :-) I think it's a bit
funny that the three lines above it seem to not follow the
recommendation, though.  I suggest changing the wording to something
like:

  * In a single-parity RAIDZ (raidz) configuration, use no less than 3
    disks (2+1)    [...]
  * For best performance, use (N+P) disks with P = 1 (raidz), 2
    (raidz2), or 3 (raidz3) and N equals 2, 4, or 8
> 2b) are drives happy with writing/reading "unaligned" 512 byte blocks?

using power-of-two-plus-one doesn't stop this from happening.  I guess
what I'm asking is if it could be worthwhile to use a larger basic block
on the drives.  this could waste quite a bit of space, of course,
depending on usage pattern.

one thing I really don't like (intuitively) about RAID-Z is the
eagerness to stripe even very small files.  when writing a page (4 KiB)
to a 9 disk wide RAID-Z, all 9 disks are involved in the operation,
writing 512 bytes to each.  that might not be too bad since it won't be
the only write in that transaction anyway, but if the data needs to be
read, eight of your disks need to seek and read in tandem.  the upside
is that the overhead is just 512 bytes.  with a 4 KiB basic block, a
single page write will waste 4 KiB.  the gain is that only one disk
needs to seek to read it.

-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game
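
The trade-off in the last paragraph above can be put into numbers with a
small sketch. It counts only data and parity sectors for a single write and
deliberately ignores any allocation padding. The function name and the
dictionary keys are hypothetical.

```python
import math


def small_write_cost(write_bytes, ndisks, sector, nparity=1):
    """Disks involved and parity overhead for one small RAID-Z write."""
    data_disks_total = ndisks - nparity
    data_sectors = math.ceil(write_bytes / sector)
    rows = math.ceil(data_sectors / data_disks_total)
    # A write narrower than the vdev only touches as many data disks as it
    # has data sectors.
    data_disks_used = min(data_sectors, data_disks_total)
    return {
        "disks_to_read": data_disks_used,
        "disks_to_write": data_disks_used + nparity,
        "parity_bytes": rows * nparity * sector,
    }


# One 4 KiB page on a 9-disk raidz1, with 512-byte and 4 KiB basic blocks
print(small_write_cost(4096, ndisks=9, sector=512))
# {'disks_to_read': 8, 'disks_to_write': 9, 'parity_bytes': 512}
print(small_write_cost(4096, ndisks=9, sector=4096))
# {'disks_to_read': 1, 'disks_to_write': 2, 'parity_bytes': 4096}
```

Under these assumptions the larger basic block trades 3.5 KiB of extra
parity per page for reads that touch one disk instead of eight.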

A Darren Dunham

2009-Nov-06 18:45 UTC

On Thu, Nov 05, 2009 at 07:13:32PM -0800, Ilya wrote:
> So then of what use is the parity?

The parity allows a ZFS data block to be regenerated if a portion of it
is lost due to a disk failure or bad block.
> And how is the metadata used to reconstruct bad data?
You go and try to construct the ZFS data block by reading disks.  With
parity stored, there are multiple ways to reconstruct the original data.
D0 + D1 + D2 + D3 => Data
D0 + D1 + D2 + P  => Data
D0 + D1 + P  + D3 => Data
[...]

If you have a corrupt block instead of a device failure, you might have
to try some of those before finding valid data.  The metadata parent is
able to validate the block.
> I understand obviously what the metadata contains but I don't get how
> ZFS traverses through a file system and USES the metadata to construct
> bad blocks.

It uses the metadata to validate good blocks and detect bad blocks.

-- 
Darren
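
The "try some of those before finding valid data" step above can be sketched
for single parity. The example assumes plain XOR parity over equally sized
columns and SHA-256 as a stand-in checksum. The names xor_blocks, checksum
and read_with_parity are hypothetical, and the real RAID-Z code is more
involved.

```python
import hashlib
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def checksum(data):
    """Stand-in for the checksum stored in the metadata parent."""
    return hashlib.sha256(data).digest()


def read_with_parity(columns, parity, expected):
    """Return data matching `expected`, trying each combination in turn."""
    # D0 + D1 + D2 + D3 => Data (no reconstruction needed)
    candidate = b"".join(columns)
    if checksum(candidate) == expected:
        return candidate
    # Otherwise assume one column at a time is bad and rebuild it from the
    # remaining columns plus parity, e.g. D0 + D1 + P + D3 => Data.
    for bad in range(len(columns)):
        others = [c for i, c in enumerate(columns) if i != bad]
        rebuilt = xor_blocks(others + [parity])
        candidate = b"".join(columns[:bad] + [rebuilt] + columns[bad + 1:])
        if checksum(candidate) == expected:
            return candidate
    raise IOError("checksum mismatch and no valid reconstruction")


cols = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]
parity = xor_blocks(cols)
expected = checksum(b"".join(cols))

damaged = list(cols)
damaged[2] = b"\x00\x00\x00\x00"  # silent corruption, no I/O error reported
assert read_with_parity(damaged, parity, expected) == b"AAAABBBBCCCCDDDD"
```

Parity supplies the candidate data and the checksum from the parent decides
which candidate is correct, so the loop can find the bad column without any
hardware error pointing at it.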

Kjetil Torgrim Homme

2009-Nov-23 15:39 UTC

Kjetil Torgrim Homme <kjetilho at linpro.no> writes:
> Cindy Swearingen <Cindy.Swearingen at Sun.COM> writes:
>> You might check the slides on this page:
>>
>> http://hub.opensolaris.org/bin/view/Community+Group+zfs/docs
>>
>> Particularly, slides 14-18.
>>
>> In this case, graphic illustrations are probably the best way
>> to answer your questions.
>
> thanks, Cindy.  can you explain the meaning of the blocks marked X in
> the illustration on page 18?
I found the explanation in an older (2009-09-03) message to this list
from Adam Leventhal:

|   RAID-Z writes full stripes every time; note that without careful
|   accounting it would be possible to effectively fragment the vdev
|   such that single sectors were free but useless since single-parity
|   RAID-Z requires two adjacent sectors to store data (one for data,
|   one for parity). To address this, RAID-Z rounds up its allocation to
|   the next (nparity + 1).  This ensures that all space is accounted
|   for. RAID-Z will thus skip sectors that are unused based on this
|   rounding. For example, under raidz1 a write of 1024 bytes would
|   result in 512 bytes of parity, 512 bytes of data on two devices and
|   512 bytes skipped.

-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game
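
The rounding rule in the quoted explanation can be written down as a short
sketch. It assumes 512-byte sectors, one set of parity sectors per row of
data, and an allocation rounded up to a multiple of (nparity + 1) sectors,
as the quote describes. The function name raidz_alloc is hypothetical.

```python
def raidz_alloc(psize, ndisks, nparity=1, sector=512):
    """Return (data, parity, skipped) sector counts for one block."""
    data = -(-psize // sector)                         # ceiling division
    parity = nparity * -(-data // (ndisks - nparity))  # parity for each row
    used = data + parity
    unit = nparity + 1
    total = -(-used // unit) * unit                    # round up to multiple of unit
    return data, parity, total - used


# The example from the quote: a 1024-byte write under raidz1
print(raidz_alloc(1024, ndisks=3))  # (2, 1, 1)
```

The result is two data sectors, one parity sector and one skipped sector,
matching the 512 bytes of parity, 512 bytes of data on each of two devices
and 512 bytes skipped in the quote. A 128 KiB block on a 6-drive raidz1 uses
256 + 52 = 308 sectors, which is already even, so nothing is skipped.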
