1) Is there a big performance difference between raidz1 and raidz2? In traditional RAID I would think there would be, since it would have two drives of parity to write/read at the same time.

2) Can the ZFS boot/root volumes use the same devices as the normal volumes? Like one raidz pool that is used for data (as normal) but also used for ZFS boot?

3) Does anyone have any heartburn about the new Seagate 1.5TB disks? I've had at least one person make comments about being nervous about them.

4) What are the largest raidz pools people have successfully used (with snv_94 or later) - 10 disk raidz2? What are the reasons for not creating larger raidz1/raidz2 groups? Is it only performance and the possibility of multi-disk failure? Am I really going to suffer that bad a performance hit if I created a 10 or 12 disk raidz2 group?

I'm trying to build a small machine (mid-tower hopefully) and maximize the amount of usable space... Thanks.
2008/9/15 Ben Rockwood:
> On Thumpers I've created single pools of 44 disks, in 11 disk RAIDZ2's.
> I've come to regret this. I recommend keeping pools reasonably sized
> and to keep stripes thinner than this.

Could you clarify why you came to regret it? I was intending to create a single pool for 8 1TB disks.
On Sun, Sep 14, 2008 at 9:51 PM, Ben Rockwood <benr at cuddletech.com> wrote:
> Frankly, most of the answers to the questions you've asked are based on
> your hardware configuration more than the theoretical optimum. If I
> were building out a setup I'd probably go with PCIe SATA adapters that can
> host up to 4 drives... buy 3 of them, hang 600GB+ disks off them, and
> create a RAIDZ per controller. I'd use RAIDZ only if this were a home
> box that I could replace drives on in a reasonable timeframe and did
> backups of. If that wasn't the case, RAIDZ2.

See, for my goal - a quiet, as-compact-as-possible storage solution - that won't fly. I'm going to need to do a 7 disk raidz1 or 8 disk raidz2, I think, to be able to stretch it. I don't want to be spending a lot of money on redundancy - these machines do not need to be highly available, and I can power the box down if a disk fails until I can physically replace the disk. I don't want to wind up with a lot of disks sitting around for parity's sake. If this were hardware RAID I'd be looking at maybe a 12 disk RAID6/dual-parity RAID5 (10 disks usable), for example...

> When building your box, don't forget to stash as much memory as possible
> in the box for cache (ZFS ARC), it's the best way to improve your read
> performance in real-world workloads.

This will be for home storage for DVDs, etc. Very low traffic, maybe 4 streams maximum. Shared mainly using CIFS, maybe some NFS. I'll be putting 4GB of RAM in it. I figure that should be decent.
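For readers sketching out the 8-disk raidz2 option mentioned above, pool creation is a single command. This is only a sketch; the pool name and the device names are hypothetical and would need to match what format reports on your own system:

    # One 8-wide raidz2 vdev: 6 disks' worth of usable space, survives any 2 disk failures.
    # Device names are examples only - substitute the ones "format" lists.
    zpool create tank raidz2 c1t0d0 c1t1d0 c1t2d0 c1t3d0 \
                             c1t4d0 c1t5d0 c1t6d0 c1t7d0

    # Verify the layout and redundancy level.
    zpool status tank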
I'm looking at this raidoptimizer spreadsheet that Richard generated, and now I'm wondering - couldn't I do a large stripe (like 10 disks) and then just have 3 disks marked as spare? Does it work like that? And if so, what would the pros/cons be of doing a large stripe with spares vs. raidz1 or raidz2?

I'm also showing a 15 disk raidz2 with 206 million MTTDL[1] years and 122,000 MTTDL[2] years, 13TB (at 1TB per disk), and it can suffer 2 disk failures... of course only 91 iops, but the max theoretical bandwidth is over 1,000MB/sec... so many options. Trying to find a decent mix and match of iops, bandwidth, MTTDLs...
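To make the comparison concrete, here is roughly what the two layouts asked about would look like on the command line (a sketch only; pool and device names are made up). Note that a hot spare only helps when the vdev has redundancy to resilver from, which is why the first layout is risky:

    # Layout A: a plain 10-disk stripe with 3 hot spares.
    # No parity anywhere - a spare cannot rebuild data that was never redundant,
    # so losing one disk outright can still lose data.
    zpool create bigpool c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 \
                         c1t5d0 c1t6d0 c1t7d0 c1t8d0 c1t9d0 \
                         spare c1t10d0 c1t11d0 c1t12d0

    # Layout B: a 12-disk raidz2 with one hot spare.
    # Two disks of parity, and the spare resilvers in automatically on failure.
    zpool create bigpool raidz2 c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 c1t5d0 \
                                c1t6d0 c1t7d0 c1t8d0 c1t9d0 c1t10d0 c1t11d0 \
                                spare c1t12d0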
Pavan Chandrashekar - Sun Microsystems
2008-Sep-15 07:30 UTC
[zfs-discuss] [storage-discuss] A few questions
Ben Rockwood wrote:
> mike wrote:
>> 1) Is there a big performance difference between raidz1 and raidz2? In
>> traditional RAID I would think there would be since it would have two
>> drives of parity to write/read to at the same time.
>
> I haven't benchmarked this specific case in some time, but in my
> experience I couldn't call it "big".
>
>> 2) Can the ZFS boot/root volumes be using the same devices as the
>> normal volumes? Like one raidz pool that is used for data (like
>> normal) but also used for ZFS boot?
>
> That would involve putting pools on disk partitions, and I'm not certain
> how that would affect the bootloader code.

A performance downside to putting pools on a partition, as opposed to the entire disk, is that ZFS turns off the write cache in the former case. You might want to check whether that is a concern in your setup.

Pavan
mike wrote:
> I'm looking at this raidoptimizer spreadsheet that Richard generated,
> and now I'm wondering -
>
> Couldn't I do a large stripe (like 10 disks) and then just have 3
> disks marked as spare?
>
> Does it work like that? and if so, what would the pros/cons be doing a
> large stripe with spares vs. raidz1 or raidz2?

That would be rather silly; you wouldn't have any redundancy.

Ian
On Mon, Sep 15, 2008 at 01:00:49PM +0530, Pavan Chandrashekar - Sun Microsystems wrote:
> A performance downside to putting pools on a partition, as opposed to
> the entire disk, is that ZFS turns off the write cache in the former
> case. You might want to check whether that is a concern in your setup.

You might want to double-check that fact. I forget who I worked through on this one, but it was determined that in the case of SATA, that was not true. I forget the details now, but basically I went to turn the cache on (since there was only one partition, and it was "owned" by ZFS) for a pair of disks that were using a partition instead of the whole disk, and it was already on.

Someone else on the list pointed out that would likely be the case and also checked several systems of their own.

From what we can tell, SCSI/FC/SAS/etc all do indeed disable the disk's cache, whereas SATA does not.

-brian

--
"Coding in C is like sending a 3 year old to do groceries. You gotta
tell them exactly what you want or you'll end up with a cupboard full of
pop tarts and pancake mix." -- IRC User (http://www.bash.org/?841435)
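If you would rather check this on your own hardware than take either claim on faith, the Solaris format utility in expert mode exposes the drive's cache settings. The session below is only a sketch; the exact menus depend on the driver and disk, so treat it as an assumption to verify on your system:

    # Run format in expert mode, select the disk in question, then inspect the cache.
    format -e
    (choose the disk when prompted)
    format> cache
    cache> write_cache
    write_cache> display      # shows whether the write cache is currently enabled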
Brian Hechinger wrote:
> On Mon, Sep 15, 2008 at 01:00:49PM +0530, Pavan Chandrashekar - Sun Microsystems wrote:
>> A performance downside to putting pools on a partition, as opposed to
>> the entire disk, is that ZFS turns off the write cache in the former
>> case. You might want to check whether that is a concern in your setup.
>
> You might want to double-check that fact. I forget who I worked through on this one,
> but it was determined that in the case of SATA, that was not true. I forget the
> details now, but basically I went to turn the cache on (since there was only one
> partition, and it was "owned" by ZFS) for a pair of disks that were using a partition
> instead of the whole disk, and it was already on.
>
> Someone else on the list pointed out that would likely be the case and also checked
> several systems of their own.
>
> From what we can tell, SCSI/FC/SAS/etc all do indeed disable the disk's cache, whereas
> SATA does not.

Methinks it will depend on the disk model and firmware rev, not the host interface technology.

-- richard
mike wrote:
> I'm looking at this raidoptimizer spreadsheet that Richard generated,
> and now I'm wondering -

cool :-)

> Couldn't I do a large stripe (like 10 disks) and then just have 3
> disks marked as spare?
>
> Does it work like that? and if so, what would the pros/cons be doing a
> large stripe with spares vs. raidz1 or raidz2?
>
> I'm also showing a 15 disk raidz2 with 206 million MTTDL[1] years and
> 122,000 MTTDL[2] years, 13TB (at 1TB per disk) and can suffer 2 disk
> failures... course only 91 iops. but the max theoretical bandwidth is
> over 1,000MB/sec... so many options. Trying to find a decent mix and
> match of iops, bandwidth, MTTDLs...

performance, space, RAS -- it is a trade-off

-- richard
On Mon, Sep 15, 2008 at 08:39:01AM -0700, Richard Elling wrote:
> Methinks it will depend on the disk model and firmware rev, not
> the host interface technology.

Hmmmm. This of course would take more research then, since I can't really say anything other than for what I have here (Seagate SATA disks).

-brian
On Mon, Sep 15, 2008 at 8:40 AM, Richard Elling <Richard.Elling at sun.com> wrote:
> performance, space, RAS -- it is a trade-off

How about you whip up a "weight" factor for people like myself... :) This is how I would weigh my priorities:

#1 Available space
#2 Redundancy
#3 Speed (as long as I can get at least 30-40MB/sec over CIFS I think that should be fine, any faster is awesome)

(Totally joking about making it a feature. But I would appreciate any tips for this.)

Everyone else: I get the stripe vs. raidz comparison now. Essentially a stripe has no ditto blocks then?
On Mon, Sep 15, 2008 at 13:18, mike <mike503 at gmail.com> wrote:
> Everyone else: I get the stripe vs. raidz comparison now. Essentially
> a stripe has no ditto blocks then?

"Ditto blocks" it has; this is ZFS's name for redundant metadata. There are multiple copies of directory entries and such, so that corrupting a single block can't cause you to lose the entire filesystem below that block, even on a single disk. But a stripe lacks parity blocks, which ZFS needs to recreate damaged data. Raidz{,2} have this, and mirrors have this, but single disks do not.

Will
On Mon, Sep 15, 2008 at 10:28 AM, Will Murnane <will.murnane at gmail.com> wrote:
> "Ditto blocks" it has; this is ZFS's name for redundant metadata.
> There are multiple copies of directory entries and such, so that
> corrupting a single block can't cause you to lose the entire
> filesystem below that block, even on a single disk. But a stripe
> lacks parity blocks, which ZFS needs to recreate damaged data.
> Raidz{,2} have this, and mirrors have this, but single disks do not.

okay. ditto is only metadata. gotcha.
>>>>> "wm" == Will Murnane <will.murnane at gmail.com> writes:

    wm> corrupting a single block can't cause you to lose the entire
    wm> filesystem below that block, even on a single disk.

an entire pool, on the other hand....
mike wrote:
> On Mon, Sep 15, 2008 at 10:28 AM, Will Murnane <will.murnane at gmail.com> wrote:
>> "Ditto blocks" it has; this is ZFS's name for redundant metadata.
>> There are multiple copies of directory entries and such, so that
>> corrupting a single block can't cause you to lose the entire
>> filesystem below that block, even on a single disk. But a stripe
>> lacks parity blocks, which ZFS needs to recreate damaged data.
>> Raidz{,2} have this, and mirrors have this, but single disks do not.
>
> okay. ditto is only metadata. gotcha.

By default, but you can set the number of copies for your data by setting the "copies" parameter on your file system. In other words, "copies=1" is the default.

    zfs get copies myfilesystemname

will show the current setting for your file system.

It is a little difficult to understand copies without pictures, so I blogged about it and drew some pictures.
http://blogs.sun.com/relling/entry/zfs_copies_and_data_protection

The real takeaway is that you have many different ways to protect your data; which to choose is not always clear.

-- richard
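As a quick illustration of the property described above (the dataset name here is made up - substitute your own file system):

    # Ask for two copies of every data block on this dataset.
    # Note: the setting only affects blocks written after it is changed.
    zfs set copies=2 tank/media

    # Confirm the current value.
    zfs get copies tank/media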
Wade.Stuart at fallon.com
2008-Sep-15 18:13 UTC
[zfs-discuss] [storage-discuss] A few questions
zfs-discuss-bounces at opensolaris.org wrote on 09/15/2008 12:58:44 PM:

> mike wrote:
>> okay. ditto is only metadata. gotcha.
>
> By default, but you can set the number of copies for your data
> by setting the "copies" parameter on your file system. In other
> words, "copies=1" is the default.
>     zfs get copies myfilesystemname
> will show the current setting for your file system.
>
> It is a little difficult to understand copies without pictures, so I
> blogged about it and drew some pictures.
> http://blogs.sun.com/relling/entry/zfs_copies_and_data_protection

The big takeaway here is that while copies=N guarantees there will be N copies of blocks, it _does not_ guarantee that they will always be on separate disks. If you lose a disk, there is still a chance that you lose the pool. raidz(2) > copies=N

-Wade
On Mon, Sep 15, 2008 at 01:13:28PM -0500, Wade.Stuart at fallon.com wrote:
> The big takeaway here is that while copies=N guarantees there will be N
> copies of blocks, it _does not_ guarantee that they will always be on
> separate disks. If you lose a disk, there is still a chance that you lose
> the pool. raidz(2) > copies=N

It's also worth pointing out that ZFS currently doesn't survive toplevel vdev failure. If it happens while the pool is up, it will enter the I/O failure state when trying to write the labels, despite the fact it could theoretically continue on with the rest of the vdevs. If it happens while the pool is exported or the system is down, it treats this like the root vdev is faulted. Both of these are bugs that are being worked on - the only reason a pool should be faulted is because critical pool-wide metadata is not available.

- Eric

--
Eric Schrock, Fishworks                    http://blogs.sun.com/eschrock
2008/9/15 gm_sjo:
> 2008/9/15 Ben Rockwood:
>> On Thumpers I've created single pools of 44 disks, in 11 disk RAIDZ2's.
>> I've come to regret this. I recommend keeping pools reasonably sized
>> and to keep stripes thinner than this.
>
> Could you clarify why you came to regret it? I was intending to create
> a single pool for 8 1TB disks.

Sorry, just bouncing this back for Ben in case he missed it.
On Tue, Sep 16, 2008 at 10:03 PM, Ben Rockwood <benr at cuddletech.com> wrote:
> gm_sjo wrote:
>> 2008/9/15 gm_sjo:
>>> 2008/9/15 Ben Rockwood:
>>>> On Thumpers I've created single pools of 44 disks, in 11 disk RAIDZ2's.
>>>> I've come to regret this. I recommend keeping pools reasonably sized
>>>> and to keep stripes thinner than this.
>>>
>>> Could you clarify why you came to regret it? I was intending to create
>>> a single pool for 8 1TB disks.
>>
>> Sorry, just bouncing this back for Ben in case he missed it.
>
> No, I didn't miss it, I just was hoping I could get some benchmarking in
> to justify my points.
>
> You want to keep stripes wide to reduce wasted disk space.... but you
> also want to keep them narrow to reduce the elements involved in parity
> calculation. In light home use I don't see a problem with an 8 disk
> RAIDZ/RAIDZ2. If you're serving in a multi-user environment your primary
> concern is to reduce the movement of the disk heads, and thus narrower
> stripes become advantageous.

I'm not sure that the width of the stripe is directly a problem. But what is true is that the random read performance of raidz1/2 is basically that of a single drive, so having more vdevs is better. Given a fixed number of drives, more vdevs implies narrower stripes, but that's a side-effect rather than a cause.

For what it's worth, we put all the disks on our thumpers into a single pool - mostly it's 5x 8+1 raidz1 vdevs with a hot spare and 2 drives for the OS - and we would happily go much bigger.

--
-Peter Tribble
http://www.petertribble.co.uk/ - http://ptribble.blogspot.com/
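For reference, a pool laid out along the lines Peter describes - several raidz1 vdevs plus a hot spare in one pool - is built by listing multiple vdevs in a single command. This is only an illustrative sketch with invented device names, shortened to three vdevs to keep it readable:

    # Three 9-disk (8+1) raidz1 vdevs in one pool, plus a hot spare.
    # Writes are striped across the vdevs; each vdev handles its own parity.
    zpool create thumper \
        raidz1 c0t0d0 c0t1d0 c0t2d0 c0t3d0 c0t4d0 c0t5d0 c0t6d0 c0t7d0 c1t0d0 \
        raidz1 c1t1d0 c1t2d0 c1t3d0 c1t4d0 c1t5d0 c1t6d0 c1t7d0 c2t0d0 c2t1d0 \
        raidz1 c2t2d0 c2t3d0 c2t4d0 c2t5d0 c2t6d0 c2t7d0 c3t0d0 c3t1d0 c3t2d0 \
        spare  c3t3d0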
On Tue, Sep 16, 2008 at 2:28 PM, Peter Tribble <peter.tribble at gmail.com> wrote:
> For what it's worth, we put all the disks on our thumpers into a single pool -
> mostly it's 5x 8+1 raidz1 vdevs with a hot spare and 2 drives for the OS - and
> we would happily go much bigger.

So you have 9 drive raidz1 (8 disks usable) + hot spare, or 8 drive raidz1 (7 disks usable) + hot spare?

It sounds like people -can- build larger pools but, due to their storage needs (performance, availability, etc.), choose NOT to. For home usage with maybe 4 clients maximum, where I can deal with downtime when swapping out a drive, I think I can live with "decent" performance (not "insane") and try to maximize my space (without making ZFS's redundancy features useless).
Am I right in thinking, though, that for every raidz1/2 vdev you're effectively losing the storage of one/two disks in that vdev?
On Wed, Sep 17, 2008 at 8:40 AM, gm_sjo <saqmaster at gmail.com> wrote:
> Am I right in thinking, though, that for every raidz1/2 vdev you're
> effectively losing the storage of one/two disks in that vdev?

Well yeah - you've got to have some allowance for redundancy.

--
-Peter Tribble
http://www.petertribble.co.uk/ - http://ptribble.blogspot.com/
2008/9/17 Peter Tribble:
> On Wed, Sep 17, 2008 at 8:40 AM, gm_sjo <saqmaster at gmail.com> wrote:
>> Am I right in thinking, though, that for every raidz1/2 vdev you're
>> effectively losing the storage of one/two disks in that vdev?
>
> Well yeah - you've got to have some allowance for redundancy.

This is what I'm struggling to get my head around - the chances of losing two disks at the same time are pretty darn remote (within a reasonable time-to-replace delta), so what advantage is there (other than potentially pointless uber-redundancy) in running multiple raidz/2 vdevs? Are you not in fact losing performance by reducing the number of spindles used for a given pool?
On Wed, Sep 17, 2008 at 10:11 AM, gm_sjo <saqmaster at gmail.com> wrote:
> This is what I'm struggling to get my head around - the chances of
> losing two disks at the same time are pretty darn remote (within a
> reasonable time-to-replace delta), so what advantage is there (other
> than potentially pointless uber-redundancy) in running multiple
> raidz/2 vdevs? Are you not in fact losing performance by reducing the
> number of spindles used for a given pool?

No. The number of spindles is constant. The snag is that for random reads, the performance of a raidz1/2 vdev is essentially that of a single disk. (The writes are fast because they're always full-stripe; but so are the reads.) So your effective random read performance is that of a single disk times the number of raidz vdevs.

It's a tradeoff, as in all things. Fewer vdevs means less wasted space, but lower performance.

--
-Peter Tribble
http://www.petertribble.co.uk/ - http://ptribble.blogspot.com/
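Peter's rule of thumb is easy to put into rough numbers. A minimal sketch; the ~100 IOPS per disk figure is an assumption for a typical 7200rpm SATA drive, not a measurement:

    # Random-read IOPS scales with the number of raidz vdevs, not the number of disks.
    iops_per_disk=100     # assumed per-spindle random-read rate
    vdevs=5               # e.g. the 5x (8+1) raidz1 layout described above
    echo "~$((iops_per_disk * vdevs)) random-read IOPS for the whole pool"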
gm_sjo wrote:
> Are you not in fact losing performance by reducing the
> number of spindles used for a given pool?

This depends. Usually, RAIDZ1/2 isn't a good performer when it comes to random-access read I/O, for instance. If I wanted to scale performance by adding spindles, I would use mirrors (RAID 10). If you want to scale filesystem sizes, RAIDZ is your friend.

I once had the problem that I needed high random I/O performance and at least an 11 TB filesystem on an X4500. Mirroring was out of the question (not enough disk space left), and RAIDZ gave me only about 25% of the performance of the existing Linux ext2 boxes I had to compete with. But in the end, striping 13 RAIDZ sets of 3 drives each + 1 hot spare delivered acceptable results in both categories. But it took me a lot of benchmarks to get there.

--
Ralf Ramge
Senior Solaris Administrator, SCNA, SCSA
Nils Goroll
2008-Sep-18 10:09 UTC
[zfs-discuss] [storage-discuss] A few questions : RAID set width
Hi all,

Ben Rockwood wrote:
> You want to keep stripes wide to reduce wasted disk space.... but you
> also want to keep them narrow to reduce the elements involved in parity
> calculation.

I second Ben's argument, and the main point IMHO is how the RAID behaves in the degraded state. When a disk fails, that disk's data has to be reconstructed by reading from ALL the other disks of the RAID set. Effectively, in the degraded case, N disks of a RAID are reduced to the performance of one disk only.

Also, this situation lasts until the RAID is reconstructed after replacing the failed disk, which is an argument for not using too-large disks (see another thread on this list).

Nils
Nils Goroll
2008-Sep-18 10:15 UTC
[zfs-discuss] [storage-discuss] A few questions - small read I/O performance on RAIDZ
Hi Peter,

Sorry, I read your post only after posting a reply myself.

Peter Tribble wrote:
> No. The number of spindles is constant. The snag is that for random reads,
> the performance of a raidz1/2 vdev is essentially that of a single disk. (The
> writes are fast because they're always full-stripe; but so are the reads.)

Can you elaborate on this?

My understanding is that with RAIDZ the writes are always full-stripe for as much data as can be agglomerated into a single contiguous write, but I thought this did not imply that all of the data has to be read at once, except with a degraded RAID.

What about, for instance, writing 16MB chunks and reading 8K randomly? Wouldn't RAIDZ access only the disks containing the 8K bits?

Nils
Robert Milkowski
2008-Sep-18 10:53 UTC
[zfs-discuss] [storage-discuss] A few questions - small read I/O performance on RAIDZ
Hello Nils,

Thursday, September 18, 2008, 11:15:37 AM, you wrote:

NG> Can you elaborate on this?

NG> My understanding is that with RAIDZ the writes are always full-stripe for as
NG> much data as can be agglomerated into a single contiguous write, but I thought
NG> this did not imply that all of the data has to be read at once, except with a
NG> degraded RAID.

NG> What about, for instance, writing 16MB chunks and reading 8K randomly? Wouldn't
NG> RAIDZ access only the disks containing the 8K bits?

Basically, the way RAID-Z works is that it spreads each FS block across all disks in a given vdev (minus the parity disks). When you read data back, before it gets to the application ZFS will verify its checksum (the filesystem checksum, not a RAID-Z one), so it needs the entire FS block... which is spread across all data disks in the vdev.

--
Best regards,
Robert Milkowski                          mailto:milek at task.gda.pl
                                          http://milek.blogspot.com
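A quick bit of arithmetic makes this point concrete: even a small logical read touches every data disk in the vdev, because each record is split across all of them. The numbers below are only an illustration (one 128K record on an 8-disk raidz1; real layouts also round to sector boundaries and add parity sectors):

    recordsize=$((128 * 1024))   # one default-sized ZFS record, in bytes
    data_disks=7                 # 8-disk raidz1 = 7 data disks + 1 parity
    echo "each data disk holds roughly $((recordsize / data_disks)) bytes of this record"
    # Reading the record back - even to satisfy an 8K request - means reading
    # (and checksumming) pieces from all 7 data disks.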
Hi Robert,

> Basically, the way RAID-Z works is that it spreads each FS block across all
> disks in a given vdev (minus the parity disks). When you read data back,
> before it gets to the application ZFS will verify its checksum (the
> filesystem checksum, not a RAID-Z one), so it needs the entire FS block...
> which is spread across all data disks in the vdev.

Thank you very much for correcting my long-time misconception.

On the other hand, isn't there room for improvement here? If it were possible to break large writes into smaller blocks with individual checksums (for instance, those which are larger than a preferred_read_size parameter), we could still write all of these with a single RAIDZ(2) line, avoid the RAIDx write penalty, and improve read performance, because we'd only need to issue a single read I/O for each requested block - needing to access the full RAIDZ line only in the degraded RAID case.

I think that this could make a big difference for write-once, read-many, random-access-type applications like DSS systems etc.

Is this feasible at all?

Nils
On Thu, 18 Sep 2008, Nils Goroll wrote:
> On the other hand, isn't there room for improvement here? If it were possible to
> break large writes into smaller blocks with individual checksums (for instance,
> those which are larger than a preferred_read_size parameter), we could still
> write all of these with a single RAIDZ(2) line, avoid the RAIDx write penalty,
> and improve read performance, because we'd only need to issue a single read I/O
> for each requested block - needing to access the full RAIDZ line only in the
> degraded RAID case.
>
> I think that this could make a big difference for write-once, read-many, random-
> access-type applications like DSS systems etc.

I imagine that this is indeed possible, but that the law of diminishing returns would prevail. The level of per-block overhead would become much greater, so sequential throughput would be reduced and more disk space would be wasted. You can be sure that the ZFS inventors thoroughly explored all of these issues, and it would surprise me if someone didn't prototype it to see how it actually performs.

ZFS is designed for the present and the future. Legacy filesystems were designed for the past. In the present, the cost of memory is dramatically reduced, and in the future it will be even more so. This means that systems will contain massive cache RAM, which dramatically reduces the number of read (and write) accesses. Also, solid state disks (SSDs) will eventually become common, and SSDs don't exhibit a seek penalty, so designing the filesystem to avoid seeks does not carry over into the long-term future.

Bob
=====================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
On Thu, Sep 18, 2008 at 01:26:09PM +0200, Nils Goroll wrote:
> Thank you very much for correcting my long-time misconception.
>
> On the other hand, isn't there room for improvement here? If it were
> possible to break large writes into smaller blocks with individual
> checksums (for instance, those which are larger than a
> preferred_read_size parameter), we could still write all of these with
> a single RAIDZ(2) line, avoid the RAIDx write penalty, and improve read
> performance, because we'd only need to issue a single read I/O for each
> requested block - needing to access the full RAIDZ line only in the
> degraded RAID case.

Don't forget that the parent block contains the checksum so that it can be compared. There isn't room in the parent for an arbitrary number of checksums, as would be required with an arbitrary number of columns.

--
Darren
Nils Goroll wrote:
> Hi Robert,
>
>> Basically, the way RAID-Z works is that it spreads each FS block across all
>> disks in a given vdev (minus the parity disks). When you read data back,
>> before it gets to the application ZFS will verify its checksum (the
>> filesystem checksum, not a RAID-Z one), so it needs the entire FS block...
>> which is spread across all data disks in the vdev.
>
> Thank you very much for correcting my long-time misconception.
>
> On the other hand, isn't there room for improvement here? If it were possible to
> break large writes into smaller blocks with individual checksums (for instance,
> those which are larger than a preferred_read_size parameter), we could still
> write all of these with a single RAIDZ(2) line, avoid the RAIDx write penalty,
> and improve read performance, because we'd only need to issue a single read I/O
> for each requested block - needing to access the full RAIDZ line only in the
> degraded RAID case.
>
> I think that this could make a big difference for write-once, read-many, random-
> access-type applications like DSS systems etc.
>
> Is this feasible at all?

Someone in the community was supposedly working on this, at one time. It gets brought up about every 4-5 months or so. Lots of detail in the archives.

-- richard
Hi Richard,

> Someone in the community was supposedly working on this, at one
> time. It gets brought up about every 4-5 months or so. Lots of detail
> in the archives.

Thank you for the pointer, and sorry for the noise. I will definitely browse the archives to find out more regarding this question.

Bob and Darren, thank you as well for your comments. I don't expect it to be easy to optimize RAIDZ for random read I/O, but I do not agree with the argument that caching heals all I/O problems. Yes, it would be desirable to always have so large a cache as to eliminate almost all read I/O, but those of us who are responsible for deploying such systems know that physical read I/O performance does matter - for random access patterns in particular, but also for more sequential access patterns on ZFS, due to the inherent fragmentation that comes with COW (plus its relevance for cache warm-up times etc.).

In short, I consider this optimization approach worth exploring, but I don't think I'll be able to do it myself. I would appreciate any pointers to background information regarding this question.

Thank you,

Nils