I find it baffling that RaidZ(2,3) was designed to split a record-size block into N (N = # of member devices) pieces and send the uselessly tiny requests to spinning rust, when we know the massive delays entailed in head seeks and rotational delay. The ZFS mirror and load-balanced configurations do the obviously correct thing: they don't split records, and they gain more by utilizing parallel access. I can't imagine the code path for RAIDZ would be so hard to fix.

I've read posts back to '06, and all I see is lamenting about the horrendous drop in IOPS, about sizing RAIDZ to ~4+P and trying to claw back performance by combining multiple such vdevs. I understand RAIDZ will never equal mirroring, but it could get damn close if it didn't break requests down and, better yet, utilized copies=N and properly placed the copies on disparate spindles. This is somewhat analogous to what the likes of 3PAR do, and it's not rocket science.

An 8-disk mirror and a RAIDZ 8+2P with copies=2 give me the same amount of storage, but the latter is a hell of a lot more resilient, and max IOPS should be higher to boot. A non-broken-up RAIDZ 4+P would still be half the IOPS of the 8-disk mirror, but I'd at least save a bundle of coin in either reduced spindle count or slower drives.

With all the great things ZFS is capable of, why hasn't this been redesigned long ago? What glaringly obvious truth am I missing?
On Mon, Jan 4, 2010 at 2:27 AM, matthew patton <pattonme at yahoo.com> wrote:
> I find it baffling that RaidZ(2,3) was designed to split a record-size block into N (N = # of member devices) pieces and send the uselessly tiny requests to spinning rust [...]
> With all the great things ZFS is capable of, why hasn't this been redesigned long ago? What glaringly obvious truth am I missing?

It is the sacrifice that was made to remove the write-hole vulnerability that existed in RAID5/6. Personally, I am thinking now that the write hole isn't so bad, and that with COW writes and a write log the vulnerability really could be marginalized.

If you are running copies=2, you could use hardware RAID5/6 with battery-backed write cache for the RAID and present it as a couple of LUNs to ZFS, which should provide higher performance with data resiliency in place. Say, with 14 drives: two 7-drive RAID6s, make a 2-vdev zpool out of them with copies=2. That should provide more than enough data resiliency and performance, at a cost in capacity; if the drives are large enough, though, that could be overcome.

-Ross
Chris Siebenmann <cks at cs.toronto.edu> wrote:
> People have already mentioned the RAID-[56] write hole, but it's more than that; in a never-overwrite system with multiple blocks in one RAID stripe, how do you handle updates to some of the blocks?
>
> See: http://utcc.utoronto.ca/~cks/space/blog/solaris/ZFSSensibleRAID

Oh, that's easy. NetApp's been doing this since forever. A little extra metadata is nothing to worry about; snapshots, by comparison, create massive metadata footprints. Since it's all COW, why not do full-stripe writes all the time? Assume a 4-disk raidZ (3+P):

A1 A2 A3 AP
B1 B2 B3 BP
C1 C2 C3 CP
...

Then the transaction group timer fires and dumps a bunch of records needing syncing: A2', B2', C3', D2', A1', B3'. I write these out in totally new/empty stripes as:

A1' A2' C3' XP
B2' B3' D2' XP

I don't have to read any of the original blocks, and parity is calculated in memory. Then I just modify the metadata to mark the original blocks as invalid/superseded. But for XOR and stripe-recovery purposes we can leave the original stripe perfectly alone. When a full stripe is no longer valid (all blocks superseded) and isn't part of a snapshot, it gets put on the "clean/ready for reuse" list.

After a while one could potentially end up with all of the "A" blocks sitting on just one spindle. They are still fully protected, but a sequential read of A1-A3 will obviously be much slower than if they were properly spread across 3 spindles. This is where array scrubbing would step in and rebalance the A series. On the other hand, the elevator algorithm applied to the transaction group could order things such that all 'x1' blocks go on spindle 1, 'x2' blocks on spindle 2, etc. If there aren't enough for a particular spindle, just use an empty block to fill the hole, or, if that is too wasteful, resort to the less optimal ordering for the leftovers and let scrubbing eventually take care of it.

Note that my representation mimics RAID4 in layout. You can of course move the parity block around; indeed, the parity-block spindle is a simple function of stripe index and array width, e.g. for stripe N on width W, parity is on spindle W - (N mod W). Is distributed parity worth doing? No, I don't think so.
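A minimal Python sketch of the scheme described above: the full-stripe packing and the W - (N mod W) parity rotation come straight from the text, while everything else (names, 1-based spindle numbering, equal-sized records, zero-padding of the tail stripe) is assumed for illustration; this is not how vdev_raidz.c actually allocates.

  from functools import reduce

  def parity_spindle(stripe_n: int, width: int) -> int:
      """1-based parity spindle for stripe N on width W: W - (N mod W)."""
      return width - (stripe_n % width)

  def full_stripe_writes(dirty: list[bytes], width: int, next_stripe: int):
      """Pack dirty records into fresh stripes; parity is XORed purely in
      memory, with no read-modify-write of live stripes."""
      data_cols = width - 1
      out = []
      for i in range(0, len(dirty), data_cols):
          chunk = dirty[i:i + data_cols]
          chunk += [bytes(len(dirty[0]))] * (data_cols - len(chunk))  # pad tail
          parity = reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), chunk)
          stripe_n = next_stripe + len(out)
          out.append((stripe_n, chunk, parity, parity_spindle(stripe_n, width)))
      return out

  # txg flush of 6 dirty records onto a 4-wide (3+P) layout -> 2 full stripes
  stripes = full_stripe_writes([b"A1'", b"A2'", b"C3'", b"D2'", b"B2'", b"B3'"], 4, 0)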
On Jan 3, 2010, at 11:27 PM, matthew patton wrote:
> I find it baffling that RaidZ(2,3) was designed to split a record-size block into N (N = # of member devices) pieces and send the uselessly tiny requests to spinning rust [...] I can't imagine the code path for RAIDZ would be so hard to fix.

Knock yourself out :-)
http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/vdev_raidz.c

> I've read posts back to '06 and all I see is lamenting about the horrendous drop in IOPS [...] This is somewhat analogous to what the likes of 3PAR do, and it's not rocket science.

That is not the issue for small, random reads. For all reads, the checksum is verified. When you spread the record across multiple disks, you then need to read the record back from those disks. In general, this means that as long as the recordsize is larger than the requested small read, your performance will approach the N/(N-P) * IOPS limit. At the pathological edge, you can set recordsize to 512 bytes and you end up with mirroring (!) The small, random read performance model I developed only calculates the above IOPS limit and does not consider recordsize.

The physical I/O is much more difficult to correlate to the logical I/O because of all of the coalescing and caching that occurs at all of the lower levels in the stack.

> An 8-disk mirror and a RAIDZ 8+2P with copies=2 give me the same amount of storage [...] why hasn't this been redesigned long ago? What glaringly obvious truth am I missing?

Performance, dependability, space: pick two.
-- richard
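For concreteness, a small calculator for the stated N/(N-P) * IOPS limit, reading it as: each logical read costs N-P disk I/Os (one per data column), out of an aggregate budget of N disk-worths of IOPS. The 100 IOPS per disk is an assumed figure, and real results are blurred by the coalescing and caching mentioned above.

  def raidz_small_read_iops(n: int, p: int, disk_iops: float = 100.0) -> float:
      """N/(N-P) * IOPS limit for small random reads on a raidz vdev."""
      return n / (n - p) * disk_iops

  print(raidz_small_read_iops(8, 2))  # 8-disk raidz2: ~133 logical IOPS
  print(8 * 100.0)                    # 8 disks of mirrors: each read hits 1 disk, ~800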
Richard Elling writes:
> That is not the issue for small, random reads. For all reads, the checksum is verified. When you spread the record across multiple disks, you then need to read the record back from those disks. [...]
> Performance, dependability, space: pick two.

If you store record X in one column like RAID-5 or -6 does, then you need to generate parity for that record X by grouping it with other unrelated records Y, Z, T, etc. When X is freed in the filesystem, it still holds parity information protecting Y, Z, and T, so you can't get rid of what was stored at X. If you try to store new data in X and in the associated parity but fail in mid-stream, you hit the RAID-5 write hole. Moreover, now that X is not referenced in the filesystem, no checksum is associated with it any longer, and if bit rot occurs in X and the disk holding Y dies, resilvering would generate garbage for Y. This seems to force us to chunk up disks with every unit checksummed, even if freed. Secure deletion becomes a problem as well.
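A toy XOR illustration of that failure mode; this is plain parity arithmetic, not ZFS code, and the four-byte records are purely illustrative:

  def xor(*blocks: bytes) -> bytes:
      out = bytearray(blocks[0])
      for b in blocks[1:]:
          out = bytearray(x ^ y for x, y in zip(out, b))
      return bytes(out)

  X, Y, Z, T = b"XXXX", b"YYYY", b"ZZZZ", b"TTTT"
  P = xor(X, Y, Z, T)                  # parity written when the stripe was live

  X_rotted = b"XXXQ"                   # X is freed, unchecksummed, silently rots
  Y_rebuilt = xor(X_rotted, Z, T, P)   # disk holding Y dies; resilver uses stale X
  assert Y_rebuilt != Y                # garbage: rot in "free" space corrupted Y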
And you can end up madly searching for free stripes and repositioning old blocks in partial stripes, even if the pool is just 10% filled up. Can one do this with RAID-DP?

http://blogs.sun.com/roch/entry/need_inodes

That said, I truly am for an evolution for random read workloads. RAID-Z on 4K sectors is quite appealing: it means that small objects become nearly mirrored, with good random read performance, while large objects are stored efficiently.

-r
On 05/01/2010 16:00, Roch wrote:
> That said, I truly am for an evolution for random read workloads. RAID-Z on 4K sectors is quite appealing: it means that small objects become nearly mirrored, with good random read performance, while large objects are stored efficiently.

Have you got any benchmarks available (comparing 512B to 4K to classical RAID-5)?

The problem is that while RAID-Z is really good for some workloads, it is really bad for others. Sometimes having L2ARC might effectively mitigate the problem, but for some workloads it won't (due to the huge size of the working set). In such environments RAID-Z2 offers much worse performance than a similarly configured NetApp (RAID-DP, same number of disk drives). If ZFS provided another RAID-5/RAID-6-like protection with different characteristics, so that writing to a pool would be slower but reading from it would be much faster (comparable to RAID-DP), some customers would be very happy. Then maybe a new kind of cache device would be needed to buffer writes to NV storage to make writes faster (like "HW" arrays have been doing for years).

A possible *workaround* is to use SVM to set up RAID-5 and create a zfs pool on top of it. How does SVM handle the R5 write hole? IIRC SVM doesn't offer RAID-6.

-- 
Robert Milkowski
http://milek.blogspot.com
On Tue, Jan 05, 2010 at 04:49:00PM +0000, Robert Milkowski wrote:
> A possible *workaround* is to use SVM to set up RAID-5 and create a zfs pool on top of it. How does SVM handle the R5 write hole? IIRC SVM doesn't offer RAID-6.

As far as I know, it does not address it. It's possible that adding a transaction volume would help by replaying anything that affected the volume, but I don't know that sufficient information is present. Symantec Volume Manager offers an explicit RAID-5 log device; there doesn't appear to be any corresponding object in SVM.

-- Darren
On Jan 5, 2010, at 17:49, Robert Milkowski wrote:
> Have you got any benchmarks available (comparing 512B to 4K to classical RAID-5)?

Using an 8K 'soft' sector prototype on an otherwise plain raid-z layout, we got 8X more random reads than with 512B sectors, as would be expected.

> The problem is that while RAID-Z is really good for some workloads, it is really bad for others.

The bigger sector makes raid-z like mirroring for small records, so performance of raid-z will be very good, and it's also space-efficient for large objects.

> Sometimes having L2ARC might effectively mitigate the problem, but for some workloads it won't (due to the huge size of the working set). [...] If ZFS provided another RAID-5/RAID-6-like protection with different characteristics [...] some customers would be very happy.

Agreed.

> Then maybe a new kind of cache device would be needed to buffer writes to NV storage to make writes faster (like "HW" arrays have been doing for years).

Writes are not the problem, and we have log devices to offload them. It's really about maintaining the integrity of a RAID-5-type layout in the presence of bit rot, even if such bit rot occurs within free space.

> A possible *workaround* is to use SVM to set up RAID-5 and create a zfs pool on top of it. How does SVM handle the R5 write hole? IIRC SVM doesn't offer RAID-6.

It doesn't.
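Rough arithmetic consistent with that 8X result; the geometry here (8 data columns plus parity, 8 KB records) is an assumption chosen for the example rather than anything Roch stated:

  def data_disks_touched(record: int, sector: int, data_cols: int) -> int:
      """Data-column I/Os needed to read one record from a raidz stripe."""
      return min(max(1, record // sector), data_cols)

  DATA_COLS = 8                                        # assumed vdev geometry
  small = data_disks_touched(8192, 512, DATA_COLS)     # 8: every data disk seeks
  large = data_disks_touched(8192, 8192, DATA_COLS)    # 1: record fits one column
  print(small / large)                                 # ~8x more concurrent random reads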
On Jan 5, 2010, at 8:49 AM, Robert Milkowski wrote:
> Have you got any benchmarks available (comparing 512B to 4K to classical RAID-5)?

Not fair! A 512-byte random write workload will absolutely clobber a RAID-5 implementation; it is the RAID-5 pathological worst case. For many arrays, even a 4 KB random write workload will suck most heinously. The raidz pathological worst case is a random read from a many-column raidz where files have 128 KB records. The inflated-read problem is why it makes sense to match recordsize for fixed-record workloads; this includes CIFS workloads, which use 4 KB records. It is also why having many columns in the raidz does not improve performance for large records. Hence the 3-to-9 raidz disk limit recommendation in the zpool man page.
http://www.baarf.com

> The problem is that while RAID-Z is really good for some workloads, it is really bad for others. Sometimes having L2ARC might effectively mitigate the problem, but for some workloads it won't (due to the huge size of the working set). In such environments RAID-Z2 offers much worse performance than a similarly configured NetApp (RAID-DP, same number of disk drives). [...]

This still does not address the record checksum. This is only a problem for small, random read workloads, which means L2ARC is a good solution. If the L2ARC is a set of HDDs, then you could gain some advantage, but IMHO HDD and good performance do not belong in the same sentence anymore. Game over -- SSDs win.

> A possible *workaround* is to use SVM to set up RAID-5 and create a zfs pool on top of it. How does SVM handle the R5 write hole? IIRC SVM doesn't offer RAID-6.

IIRC, SVM does a prewrite. Dog slow. Also, SVM is, AFAICT, on life support. The source is out there if anyone wants to carry it forward. Actually, many of us would be quite happy for SVM to fade from our memory :-)
-- richard
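To put a number on the inflated-read problem: because the checksum covers the whole record, the full record comes off the disks no matter how small the request. A sketch of the ratio (matching recordsize to the application I/O size, e.g. zfs set recordsize=4k on the relevant filesystem, removes the inflation):

  def read_inflation(recordsize: int, request: int) -> float:
      """Bytes read (and checksummed) from the vdev per logical read."""
      return recordsize / request

  print(read_inflation(128 * 1024, 4 * 1024))  # 32x for 4 KB reads of 128 KB records
  print(read_inflation(4 * 1024, 4 * 1024))    # 1x once recordsize matches the workload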
On 05/01/2010 18:37, Roch Bourbonnais wrote:
> Writes are not the problem, and we have log devices to offload them. It's really about maintaining the integrity of a RAID-5-type layout in the presence of bit rot, even if such bit rot occurs within free space.

How is it addressed in RAID-DP?

-- 
Robert Milkowski
http://milek.blogspot.com
On 05/01/2010 18:49, Richard Elling wrote:
> This still does not address the record checksum. This is only a problem for small, random read workloads, which means L2ARC is a good solution. If the L2ARC is a set of HDDs, then you could gain some advantage, but IMHO HDD and good performance do not belong in the same sentence anymore. Game over -- SSDs win.

As I wrote, sometimes the working set is so big that, L2ARC or not, there is virtually no difference, and it is not practical to deploy an L2ARC several TBs in size or bigger. For such a workload RAID-DP behaves much better (many small random reads, not that many writes).
On 6/01/2010 3:00 AM, Roch wrote:
[snipped for space]
> That said, I truly am for an evolution for random read workloads. RAID-Z on 4K sectors is quite appealing: it means that small objects become nearly mirrored, with good random read performance, while large objects are stored efficiently.

Sold! Let's do that then! :-)

Seriously, are there design or architectural reasons why this isn't done by default, or at least offered as an option? Or is it just a "no one's had time to implement it yet" thing? I understand that 4K sectors might be less space-efficient for lots of small files, but I suspect lots of us would happily make that trade-off!

Thanks,
Tristan
On Jan 5, 2010, at 11:30 AM, Robert Milkowski wrote:
> As I wrote, sometimes the working set is so big that, L2ARC or not, there is virtually no difference, and it is not practical to deploy an L2ARC several TBs in size or bigger. For such a workload RAID-DP behaves much better (many small random reads, not that many writes).

If you are doing small, random reads on dozens of TB of data, then you've got a much bigger problem on your hands... kinda like counting grains of sand on the beach during low tide :-). Hopefully, you do not have to randomly update that data, because your file system isn't COW :-). Fortunately, most workloads are not of that size and scope.

Since there are already 1 TB SSDs on the market, the only thing keeping the HDD market alive is the low $/TB. Moore's Law predicts that cost advantage will pass. SSDs are already the low $/IOPS winners.
-- richard
On Jan 5, 2010, at 11:56 AM, Tristan Ball wrote:
> Seriously, are there design or architectural reasons why this isn't done by default, or at least offered as an option? Or is it just a "no one's had time to implement it yet" thing?

Waiting for the hardware to become widely available might be a long wait. See also PSARC 2008/769:
http://arc.opensolaris.org/caselog/PSARC/2008/769/inception.materials/design_doc

> I understand that 4K sectors might be less space-efficient for lots of small files, but I suspect lots of us would happily make that trade-off!

+1 (for better reliability, too!)
-- richard
On Tue, 5 Jan 2010, Richard Elling wrote:
> Since there are already 1 TB SSDs on the market, the only thing keeping the HDD market alive is the low $/TB. Moore's Law predicts that cost advantage will pass. SSDs are already the low $/IOPS winners.

SSD vendors are still working to stabilize their designs, and most of them seem unworthy of use in anything more than a laptop computer. A number of computer vendors (e.g. Apple & Dell) who offered SSDs in their computers encountered an unexpectedly high rate of product failure.

According to Sun's own engineers, Moore's Law is very bad for enterprise SSDs. FLASH devices built to very small geometries are more likely to wear out and forget, and current design trends are moving in a direction contrary to the requirements of enterprise SSDs. See
http://www.eetimes.com/showArticle.jhtml?articleID=219200284

Perhaps innovative designers like Suncast will figure out how to build reliable SSDs based on parts which are more likely to wear out and forget.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
On 6/01/2010 7:19 AM, Richard Elling wrote:
> If you are doing small, random reads on dozens of TB of data, then you've got a much bigger problem on your hands... [...] Since there are already 1 TB SSDs on the market, the only thing keeping the HDD market alive is the low $/TB. Moore's Law predicts that cost advantage will pass. SSDs are already the low $/IOPS winners.

These workloads (small random reads over huge datasets) might be getting more common in some environments, because it seems to be what you get when you consolidate virtual machine storage. We've got a moderately large number of virtual machines (a mix of Debian, Win2K, and Win2K3) running a very large set of applications, and our reads are all over the place! :-( I have to say I remain impressed at how well the ARC behaves, but even then our hit rate is often not wonderful.

I _dream_ about being able to afford to build out my entire storage from cheap/large SSDs. My guess is that in about 2 years I'll be able to, which is one of the reasons we've essentially put a hold on buying "enterprise storage" or fast FC/SCSI disks. A large part of the justification for FC/SCSI disks is their performance, and they're going to be completely eclipsed within the lifetime of any serious mid-range to high-end storage array. Until that day we make do with large SATA drives, mirrored, with relatively high spindle counts to avoid long per-disk queues. :-)

T

PS: OK, I know other tier-1 storage vendors have started integrating SSDs as well, but they hadn't when we started our current round of storage upgrades, and I still think opensolaris + SATA HDDs + SSDs gives us a cleaner, cheaper, and easier upgrade path than most tier-1 vendors can provide.
On 05/01/2010 20:19, Richard Elling wrote:
> If you are doing small, random reads on dozens of TB of data, then you've got a much bigger problem on your hands... kinda like counting grains of sand on the beach during low tide :-). [...] Fortunately, most workloads are not of that size and scope.

Well, nevertheless some environments are like that (and no, I'm not speculating), and the truth is that NetApp with RAID-DP and the same number of disk drives proved to be faster than RAID-Z2, even with the help of SSDs as L2ARC. The point is that NetApp provided the capacity of RAID-6 and the protection of dual parity while delivering better performance than RAID-Z2 in that environment. In other workloads RAID-Z2 will be better, but not in this particular environment.

All I'm saying is that having yet another RAID type in ZFS, one which offers capacity similar to RAID-5/RAID-6 but with different performance characteristics (small random reads on par with RAID-DP while sacrificing write performance), would be beneficial for some environments. RAID-Z with a bigger sector size could improve performance, but the provided capacity could be much less than RAID-5/6, so it might not necessarily be an apples-to-apples comparison (though still useful for some environments).

-- 
Robert Milkowski
http://milek.blogspot.com
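A back-of-the-envelope look at that capacity trade-off; the 8+1 geometry and the simple per-object parity accounting below are assumptions for illustration, not ZFS's actual allocator:

  import math

  def raidz_efficiency(obj_bytes: int, sector: int, data_cols: int, parity: int) -> float:
      """Fraction of allocated sectors that hold user data for one object."""
      d = math.ceil(obj_bytes / sector)
      p = math.ceil(d / data_cols) * parity   # parity sectors per stripe group
      return d / (d + p)

  # 4K sectors, 8 data columns + 1 parity (assumed geometry)
  print(raidz_efficiency(4 * 1024, 4096, 8, 1))    # 0.50: small objects ~ mirrored
  print(raidz_efficiency(128 * 1024, 4096, 8, 1))  # ~0.89: large objects ~ RAID-5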
On 05/01/2010 20:19, Richard Elling wrote:
> [...] Fortunately, most workloads are not of that size and scope.

Forgot to mention it in my last email: yes, I agree. The environment I'm talking about is rather unusual, and in most other cases where RAID-5/6 was considered, the performance of RAID-Z1/2 was good enough or even better.

-- 
Robert Milkowski
http://milek.blogspot.com
Many large-scale photo hosts start with NetApp as the default "good enough" way to handle multiple-TB storage. With a 1-5% cache on top, the workload is truly random-read over many TBs. But these workloads almost assume a frontend cache to take care of hot traffic, so L2ARC is just a nice implementation of that, not a silver bullet.

I agree that RAID-DP is much more scalable for reads than RAIDZx, and this basically turns into a cost concern at scale. The raw cost/GB for ZFS is much lower, so even a 3-way mirror could be used instead of NetApp. But this certainly reduces the cost advantage significantly.

mike

p.s. I managed the team that built blogger.com's photo hosting, and picasaweb.google.com, so I've seen some of this stuff at scale (neither of these uses NetApp). For large photos, it's pretty simple: the more independent spindles, the better.
On 05/01/2010 23:31, Michael Herf wrote:
> The raw cost/GB for ZFS is much lower, so even a 3-way mirror could be used instead of NetApp. But this certainly reduces the cost advantage significantly.

This is true to some extent. I didn't want to bring it up, as I wanted to focus only on the technical aspects.

-- 
Robert Milkowski
http://milek.blogspot.com
On Jan 5, 2010, at 16:06, Bob Friesenhahn wrote:
> Perhaps innovative designers like Suncast will figure out how to build reliable SSDs based on parts which are more likely to wear out and forget.

At which point we'll probably start seeing the memristor make an appearance in various devices. :)
Michael Herf wrote:
> I agree that RAID-DP is much more scalable for reads than RAIDZx, and this basically turns into a cost concern at scale. The raw cost/GB for ZFS is much lower, so even a 3-way mirror could be used instead of NetApp. But this certainly reduces the cost advantage significantly.

Has anyone compared RAID-Z2 against something like LSI MegaRAID RAID-6? If a sub-$1,000 RAID controller can save thousands of dollars worth of disks, it would somewhat put the lie to the idea that ZFS kills hardware RAID.

Wes Felter
On Jan 6, 2010, at 1:30 PM, Wes Felter wrote:
> Has anyone compared RAID-Z2 against something like LSI MegaRAID RAID-6? If a sub-$1,000 RAID controller can save thousands of dollars worth of disks, it would somewhat put the lie to the idea that ZFS kills hardware RAID.

ZFS doesn't kill "hardware RAID." First, there is no such thing as "hardware RAID"; there is only software RAID. Second, "hardware RAID" systems are pretty much useless without a file system, database, or some other application which can translate a set of blocks into something useful. Rather, ZFS works very nicely with "hardware RAID" systems or JBODs, iSCSI, et al. You can happily add the advantages of ZFS features on top of an LSI MegaRAID RAID-6 controller at little or no additional cost :-)
-- richard
On Wed, Jan 6, 2010 at 4:30 PM, Wes Felter <wesley at felter.org> wrote:
> Has anyone compared RAID-Z2 against something like LSI MegaRAID RAID-6? If a sub-$1,000 RAID controller can save thousands of dollars worth of disks, it would somewhat put the lie to the idea that ZFS kills hardware RAID.

A hardware RAID6 controller with a big battery-backed write cache will beat RAIDZ2 hands down. It avoids the write-hole problem with the battery cache, but you still have the possibility of silent data corruption to deal with. You could put two LSI MegaRAID controllers into a 2U box, each going to a storage enclosure set up as a RAID6 array, then build a zpool out of a mirrored vdev of the two LUNs. That takes RAIDZ2 out of the picture while providing integrity and performance. Extra cost is always assumed if you want both; if you want to add that redundancy, it will cost you double.

-Ross
Wilkinson, Alex
2010-Jan-07 06:39 UTC
[zfs-discuss] rethinking RaidZ and Record size [SEC=UNCLASSIFIED]
On Wed, Jan 06, 2010 at 02:22:19PM -0800, Richard Elling wrote:
> Rather, ZFS works very nicely with "hardware RAID" systems or JBODs, iSCSI, et al. [...]

I'm not sure how ZFS works very nicely with, say, an EMC Cx310 array?

-Alex
Richard Elling
2010-Jan-07 07:00 UTC
[zfs-discuss] rethinking RaidZ and Record size [SEC=UNCLASSIFIED]
On Jan 6, 2010, at 10:39 PM, Wilkinson, Alex wrote:
> I'm not sure how ZFS works very nicely with, say, an EMC Cx310 array?

Why would ZFS be any different than other file systems on a Cx310?
-- richard
Wilkinson, Alex
2010-Jan-07 07:09 UTC
[zfs-discuss] rethinking RaidZ and Record size [SEC=UNCLASSIFIED]
On Wed, Jan 06, 2010 at 11:00:49PM -0800, Richard Elling wrote:
> Why would ZFS be any different than other file systems on a Cx310?

Well, not specifically the filesystem, but using ZFS as a volume manager. Please see:
http://mail.opensolaris.org/pipermail/zfs-discuss/2009-April/028089.html

-Alex
Richard Elling
2010-Jan-07 17:57 UTC
[zfs-discuss] rethinking RaidZ and Record size [SEC=UNCLASSIFIED]
On Jan 6, 2010, at 11:09 PM, Wilkinson, Alex wrote:
> Well, not specifically the filesystem, but using ZFS as a volume manager. Please see:
> http://mail.opensolaris.org/pipermail/zfs-discuss/2009-April/028089.html

Choice is good, no? :-) If you choose not to use ZFS for RAID, then that is a perfectly reasonable choice. Many people have a mix of protected and unprotected storage of all sorts. But it is good to know that if you have very, very important data, you can protect it in many complementary ways.
-- richard