Hiya,

I have been playing with ZFS for a few days now on a test PC, and I plan to use it for my home media server after being very impressed!

I've got the basics of creating zpools and zfs filesystems with compression and dedup etc, but I'm wondering if there's a better way to handle security. I'm using Windows 7 clients, by the way. I have used this 'guide' to do the permissions - http://www.slepicka.net/?p=37

Also, at present I have 5x 1TB drives to use in my home server, so I plan to create a RAID-Z1 pool which will have my shares on it (Movies, Music, Pictures etc). I then plan to increase this in sets of 5 (so another 5x 1TB drives in Jan and another 5 in Feb/March, so that I can avoid all disks being from the same batch). I did plan on creating separate zpools with each set of 5 drives;

drives 1-5: volume0 zpool
drives 6-10: volume1 zpool
drives 11-15: volume2 zpool

so that I can sustain 3 simultaneous drive failures, as long as it's one drive from each set. However, I think this will mean each zpool will have independent shares, which I don't want. I have used this guide - http://southbrain.com/south/tutorials/zpools.html - which says you can combine zpools into a 'parent' zpool, but can this be done in my scenario (staggered)? It looks like the child zpools have to be created before the parent is. So basically I'd need to be able to:

Create volume0 zpool now
Create volume1 zpool in Jan, then combine volume0 and volume1 into a parent zpool
Create volume2 in Feb/March and add it to the parent zpool

I know I could just add each disk to the volume0 zpool, but I've read it's a bugger to do and that creating separate zpools with new disks is a much better way to go.

I think that's it for now. Sorry for the mammoth first post!

Thanks
--
This message posted from opensolaris.org
> Also, at present I have 5x 1TB drives to use in my home server so I
> plan to create a RAID-Z1 pool which will have my shares on it (Movies,
> Music, Pictures etc). I then plan to increase this in sets of 5 (so
> another 5x 1TB drives in Jan and another 5 in Feb/March so that I can
> avoid all disks being from the same batch). I did plan on creating
> separate zpools with each set of 5 drives;
>
> drives 1-5: volume0 zpool
> drives 6-10: volume1 zpool
> drives 11-15: volume2 zpool

Although this seems a good idea to start with, there are issues with it performance-wise. If you fill up VDEV0 (drives 1-5) and then attach VDEV1 (drives 6-10), new writes will still initially be striped across the two VDEVs, leading to a performance impact on writes. There is currently no way of balancing how full the VDEVs are without manually doing a backup/restore, or copying the data from one place to another within the pool and then removing the original.

> so that I can sustain 3 simultaneous drive failures, as long as it's
> one drive from each set. However I think this will mean each zpool
> will have independent shares which I don't want. I have used this
> guide - http://southbrain.com/south/tutorials/zpools.html - which says
> you can combine zpools into a 'parent' zpool, but can this be done in
> my scenario (staggered)? It looks like the child zpools have to be
> created before the parent is. So basically I'd need to be able to:

For the scheme to work as above, start with something like

# zpool create mypool raidz1 c0t1d0 c0t2d0 c0t3d0 c2t4d0 c2t5d0

Later, you'll add the new vdev

# zpool add mypool raidz1 c0t6d0 c0t7d0 c0t8d0 c2t9d0 c2t10d0

This will work as described above. However, I would do this somewhat differently. Start off with, say, 6x 1TB drives in RAIDz2 and set autoexpand=on on the pool (remember compression=on on the pool's root filesystem too).

# zpool create mypool raidz2 c0t1d0 c0t2d0 c0t3d0 c2t4d0 c2t5d0 c2t6d0
# zpool set autoexpand=on mypool
# zfs set compression=on mypool

Compression is lzjb, and it won't compress much for audio or video, but then, it won't hurt much either. When this starts to get close to full, get new, larger drives and replace the older 1TB drives one by one. Once all have been replaced by larger, say 1.5TB, drives, whoops, your array is larger. This will scale better performance-wise and you won't need that many controllers. Also, with RAIDz2, you can lose any two drives.

Vennlige hilsener / Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 97542685
roy at karlsbakk.net
http://blogg.karlsbakk.net/
--
In all pedagogy it is essential that the curriculum be presented intelligibly. It is an elementary imperative for all pedagogues to avoid excessive use of idioms of foreign origin. In most cases adequate and relevant synonyms exist in Norwegian.
Thanks for the reply.

In that case, wouldn't it be better to, as you say, start with a 6-drive Z2, then just keep adding drives until the case is full, for a single Z2 zpool? Or even Z3, if that's available now?

I have an 11x 5.25" bay case, with 3x 5-in-3 hot swap caddies giving me 15 drive bays. Hence the plan to start with 5, then 10, then all the way to 15.

This seems a more logical (and cheaper) solution than continually replacing with bigger drives as they come to market.
--
This message posted from opensolaris.org
On Thu, Dec 16, 2010 at 12:59 AM, Lanky Doodle <lanky_doodle at hotmail.com> wrote:

> I have been playing with ZFS for a few days now on a test PC, and I plan to use it for my home media server after being very impressed!

Works great for that. Have a similar setup at home, using FreeBSD.

> Also, at present I have 5x 1TB drives to use in my home server so I plan to create a RAID-Z1 pool which will have my shares on it (Movies, Music, Pictures etc). I then plan to increase this in sets of 5 (so another 5x 1TB drives in Jan and another 5 in Feb/March so that I can avoid all disks being from the same batch). I did plan on creating separate zpools with each set of 5 drives;

No no no. Create 1 pool. Create the pool initially with a single 5-drive raidz vdev. Later, add the next five drives to the system, and create a new raidz vdev *in the same pool*. Voila. You now have the equivalent of a RAID50, as ZFS will stripe writes to both vdevs, increasing the overall size *and* speed of the pool.

Later, add the next five drives to the system, and create a new raidz vdev in the same pool. Voila. You now have a pool with 3 vdevs, with reads/writes being striped across all three. You can still lose 3 drives (1 per vdev) before losing the pool.

The commands to do this are along the lines of:

# zpool create mypool raidz disk1 disk2 disk3 disk4 disk5
# zpool add mypool raidz disk6 disk7 disk8 disk9 disk10
# zpool add mypool raidz disk11 disk12 disk13 disk14 disk15

Creating 1 pool gives you the best performance and the most flexibility. Use separate filesystems on top of that pool if you want to tweak all the different properties. Going with 1 pool also increases your chances for dedupe, as dedupe is done at the pool level.

--
Freddie Cash
fjwcash at gmail.com
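A quick way to sanity-check the layout after each expansion (using Freddie's placeholder pool and disk names) is the standard status and list subcommands:

# zpool status mypool
# zpool list mypool

The first shows each raidz vdev and its member disks; the second shows the total pool size growing as vdevs are added.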
Hi Lanky,

Other follow-up posters have given you good advice. I don't see where you are getting the idea that you can combine pools with pools. You can't do this, and I don't see that the southbrain tutorial illustrates this either. All of his examples for creating redundant pools are reasonable.

As others have said, you can create a RAIDZ pool with one vdev of say 5 disks, and then later add another 5 disks, and so on.

Thanks,

Cindy

On 12/16/10 01:59, Lanky Doodle wrote:
> Hiya,
>
> I have been playing with ZFS for a few days now on a test PC, and I plan to use it for my home media server after being very impressed!
>
> I've got the basics of creating zpools and zfs filesystems with compression and dedup etc, but I'm wondering if there's a better way to handle security. I'm using Windows 7 clients by the way.
>
> I have used this 'guide' to do the permissions - http://www.slepicka.net/?p=37
>
> Also, at present I have 5x 1TB drives to use in my home server so I plan to create a RAID-Z1 pool which will have my shares on it (Movies, Music, Pictures etc). I then plan to increase this in sets of 5 (so another 5x 1TB drives in Jan and another 5 in Feb/March so that I can avoid all disks being from the same batch). I did plan on creating separate zpools with each set of 5 drives;
>
> drives 1-5: volume0 zpool
> drives 6-10: volume1 zpool
> drives 11-15: volume2 zpool
>
> so that I can sustain 3 simultaneous drive failures, as long as it's one drive from each set. However I think this will mean each zpool will have independent shares which I don't want. I have used this guide - http://southbrain.com/south/tutorials/zpools.html - which says you can combine zpools into a 'parent' zpool, but can this be done in my scenario (staggered), as it looks like the child zpools have to be created before the parent is done? So basically I'd need to be able to:
>
> Create volume0 zpool now
> Create volume1 zpool in Jan, then combine volume0 and volume1 into a parent zpool
> Create volume2 in Feb/March and add to parent zpool
>
> I know I could just add each disk to volume0 zpool but I've read it's a bugger to do and that creating separate zpools with new disks is a much better way to go.
>
> I think that's it for now. Sorry for the mammoth first post!
>
> Thanks
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Lanky Doodle
>
> In that case, wouldn't it be better to, as you say, start with a 6-drive Z2, then
> just keep adding drives until the case is full, for a single Z2 zpool?

Doesn't work that way. You can create a vdev, and later, you can add more vdevs. So you can create a raidz now, and later you can add another raidz. But you cannot create a raidz now, and later just add onesy-twosy disks to increase the size incrementally.

> Or even Z3, if that's available now?

Raidz3 is available now.

There is only one thing to be aware of. ZFS resilvering is very inefficient for typical usage scenarios. The time to resilver divides by the number of vdevs in the pool (meaning 10 mirrors will resilver 10x faster than an equivalently sized raidzN), and the time to resilver is doubled when you have several disks within the vdev. Due to this inefficiency, we're talking about 12 hours (on my server) to resilver a 1TB disk which is around 70% used. This would have been ~3 weeks if I had one big raidz3. So it matters. Your multiple raidz vdevs of 5-6 disks each are a reasonable compromise.

> I have an 11x 5.25" bay case, with 3x 5-in-3 hot swap caddies giving me 15
> drive bays. Hence the plan to start with 5, then 10, then all the way to 15.
>
> This seems a more logical (and cheaper) solution than keep replacing with
> bigger drives as they come to market.

'Course, you can also replace with bigger drives as they come to market, too. ;-) If you've got 5 disks in a raidz... first scrub it. Then, replace one disk with a larger disk, and wait for the resilver. Replace each disk, one by one, with larger disks. And eventually, when you do the last one... your pool becomes larger. (Depending on your defaults, manual intervention may be required to make the pool autoexpand when the devices have all been upgraded.)
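As a minimal sketch of that replace-one-by-one procedure (the pool name and cXtYdZ device names are placeholders; use the names zpool status reports on your own system):

# zpool scrub mypool
# zpool replace mypool c0t1d0

Wait for the resilver to complete (watch zpool status), then repeat the replace for each remaining disk. With zpool set autoexpand=on mypool, the extra capacity appears once the last small disk has been swapped out; otherwise the expansion needs the manual step Edward mentions.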
Thanks for all the replies.

The bit about combining zpools came from this command on the southbrain tutorial;

zpool create mail \
mirror c6t600D0230006C1C4C0C50BE5BC9D49100d0 c6t600D0230006B66680C50AB7821F0E900d0 \
mirror c6t600D0230006B66680C50AB0187D75000d0 c6t600D0230006C1C4C0C50BE27386C4900d0

I admit I was getting confused between zpools and vdevs, thinking in the above command that each mirror was a zpool and not a vdev.

Just so I'm correct, a normal command would look like

zpool create mypool raidz disk1 disk2 disk3 disk4 disk5

which would result in a zpool called mypool, made up of a single 5-disk raidz vdev? This means that zpools don't actually 'contain' physical devices, which is what I originally thought.
--
This message posted from opensolaris.org
On 12/17/2010 2:12 AM, Lanky Doodle wrote:
> Thanks for all the replies.
>
> The bit about combining zpools came from this command on the southbrain tutorial;
>
> zpool create mail \
> mirror c6t600D0230006C1C4C0C50BE5BC9D49100d0 c6t600D0230006B66680C50AB7821F0E900d0 \
> mirror c6t600D0230006B66680C50AB0187D75000d0 c6t600D0230006C1C4C0C50BE27386C4900d0
>
> I admit I was getting confused between zpools and vdevs, thinking in the above command that each mirror was a zpool and not a vdev.
>
> Just so I'm correct, a normal command would look like
>
> zpool create mypool raidz disk1 disk2 disk3 disk4 disk5
>
> which would result in a zpool called mypool, made up of a single 5-disk raidz vdev? This means that zpools don't actually 'contain' physical devices, which is what I originally thought.

You are correct that the above will have a single vdev of 5 disks.

Here's a shorthand note: a zpool is made of 1 or more vdevs. Each vdev can be a raidz, a mirror, or a single device (either a file or a disk). So you *can* have a zpool which has solely physical drives, e.g.

zpool create tank disk1 disk2 disk3

will create a pool with 3 disks, with data being striped across the devices as desired.

--
Erik Trimble
Java System Support
Mailstop: usca22-123
Phone: x17195
Santa Clara, CA
OK cool. One last question.

Reading the Admin Guide for ZFS, it says:

"A more complex conceptual RAID-Z configuration would look similar to the following:

raidz c1t0d0 c2t0d0 c3t0d0 c4t0d0 c5t0d0 c6t0d0 c7t0d0
raidz c8t0d0 c9t0d0 c10t0d0 c11t0d0 c12t0d0 c13t0d0 c14t0d0

If you are creating a RAID-Z configuration with many disks, as in this example, a RAID-Z configuration with 14 disks is better split into two 7-disk groupings. RAID-Z configurations with single-digit groupings of disks should perform better."

This is relevant as my final setup was planned to be 15 disks, so only one more than the example.

So, do I drop one disk and go with two 7-drive vdevs, or stick to three 5-drive vdevs?

Also, does anyone have anything to add re the security of CIFS when used with Windows clients?

Thanks again guys, and gals...
--
This message posted from opensolaris.org
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Lanky Doodle
>
> This is relevant as my final setup was planned to be 15 disks, so only one
> more than the example.
>
> So, do I drop one disk and go with two 7-drive vdevs, or stick to three 5-drive vdevs?

Both ways are fine. Consider the balance between redundancy and drive space.

Also, in the event of a resilver, the 3x5 raidz will be faster. In rough numbers, suppose you have 1TB drives, 70% full. Then your resilver might be 8 days instead of 12 days. That's important when you consider the fact that during that window, you have degraded redundancy. Another failed disk in the same vdev would destroy the entire pool.

Also, if a 2nd disk fails during the resilver, it's more likely to be in the same vdev if you have only 2 vdevs. Your odds are better with smaller vdevs, both because the resilver completes faster, and because the probability of a 2nd failure in the same vdev is smaller.

For both performance and reliability reasons, I recommend nothing except single-drive mirrors, except in extreme data-is-not-important situations. At least, that's my recommendation until someday, when the resilver efficiency is improved, or "fixed."
Thanks!

By single drive mirrors, I assume, in a 14 disk setup, you mean 7 sets of 2-disk mirrors - I am thinking of traditional RAID1 here.

Or do you mean 1 massive mirror with all 14 disks?

This is always a tough one for me. I too prefer RAID1 where redundancy is king, but the trade off for me would be 5TB of 'wasted' space - a total of 7TB in mirrors versus 12TB in 3x RAIDZ.

Decisions, decisions.....
--
This message posted from opensolaris.org
You should take a look at the ZFS best practices guide for RAIDZ and mirrored configuration recommendations:

http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide

It's easy for me to say because I don't have to buy the storage, but mirrored storage pools are currently more flexible, provide good performance, and replacing/resilvering data on disks is faster.

Thanks,

Cindy

On 12/17/10 09:48, Lanky Doodle wrote:
> Thanks!
>
> By single drive mirrors, I assume, in a 14 disk setup, you mean 7 sets of 2-disk mirrors - I am thinking of traditional RAID1 here.
>
> Or do you mean 1 massive mirror with all 14 disks?
>
> This is always a tough one for me. I too prefer RAID1 where redundancy is king, but the trade off for me would be 5TB of 'wasted' space - a total of 7TB in mirrors versus 12TB in 3x RAIDZ.
>
> Decisions, decisions.....
at December 17 2010, 17:48 <Lanky Doodle> wrote in [1]:

> By single drive mirrors, I assume, in a 14 disk setup, you mean 7
> sets of 2 disk mirrors - I am thinking of traditional RAID1 here.
>
> Or do you mean 1 massive mirror with all 14 disks?

Edward means a set of two-way mirrors. Do you remember what he wrote:

>> Also, in the event of a resilver, the 3x5 raidz will be faster. In rough
>> numbers, suppose you have 1TB drives, 70% full. Then your resilver might be
>> 8 days instead of 12 days. That's important when you consider the fact that
>> during that window, you have degraded redundancy. Another failed disk in
>> the same vdev would destroy the entire pool.
>
>> Also if a 2nd disk fails during resilver, it's more likely to be in the same
>> vdev, if you have only 2 vdevs. Your odds are better with smaller vdevs,
>> both because the resilver completes faster, and the probability of a 2nd
>> failure in the same vdev is smaller.

And that scenario is a horrible notion. While the resilver is running, you have to hope that nothing else fails. In his example that is between 192 and 288 hours - a long, a very long time. And be aware that a disk will break at some point.

> This is always a tough one for me. I too prefer RAID1 where
> redundancy is king, but the trade off for me would be 5TB of
> 'wasted' space - a total of 7TB in mirrors versus 12TB in 3x RAIDZ.

You lose the most space when you make a pool of mirrors, BUT the I/O is much faster, it is more secure, and you have all the features of zfs too.
http://blogs.sun.com/relling/entry/zfs_raid_recommendations_space_performance

> Decisions, decisions.....

My suggestion is to make a two-way mirror of small disks or SSDs for the OS. This is not easy to do after installation; you have to look for a howto. Sorry, I can't find the link at the moment. For Solaris 11 Express, Oracle announced that in the text installer you can set the root pool to a mirror during installation. At the moment I am trying it out in a VM but I haven't found this option. :-(

zpool create lankyserver mirror disk1 disk2 mirror disk3 disk4

When you need more space you can add another pair of disks to your lankyserver pool. The disks within each pair should have the same capacity.

zpool add lankyserver mirror disk5 disk6 mirror disk7 disk8 ...

Consider that it is a good decision to plan for one spare disk. You can use the zpool add command to add a spare disk at a later time.
http://docs.sun.com/app/docs/doc/819-2240/zpool-1m?a=view

When you build a raidz pool, the pool only uses as much of each disk as the smallest disk provides; the rest of any bigger disk is wasted. With a mirrored pool only the disks within a pair must match, so you can use one pair of 1 TB disks and one pair of 2 TB disks in the same pool. In that case your spare disk _must have_ the biggest capacity.

Read this for your decision:
http://constantin.glez.de/blog/2010/01/home-server-raid-greed-and-why-mirroring-still-best

--
Best Regards
Alexander
December, 17 2010
........
[1] mid:382802084.111292604519623.JavaMail.Twebapp at sf-app1
........
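For reference, the post-install howto Alexander mentions boils down to attaching a second disk to the root pool and putting a boot loader on it. A minimal sketch, assuming an x86 box and placeholder device names (the root pool lives on slice 0 of each disk; SPARC would use installboot instead):

# zpool attach rpool c0t0d0s0 c0t1d0s0
# installgrub /boot/grub/stage1 /boot/grub/stage2 /dev/rdsk/c0t1d0s0

Let the resilver finish (zpool status rpool) before counting on being able to boot from the second disk.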
On Fri, 17 Dec 2010, Edward Ned Harvey wrote:
>
> Also if a 2nd disk fails during resilver, it's more likely to be in the same
> vdev, if you have only 2 vdevs. Your odds are better with smaller vdevs,
> both because the resilver completes faster, and the probability of a 2nd
> failure in the same vdev is smaller.

While I agree that smaller vdevs are more reliable, I find your statement about the failure being more likely to be in the same vdev if you have only 2 vdevs to be rather useless. The probability of vdev failure does not have anything to do with the number of vdevs. However, the probability of vdev failure increases tremendously if there is only one vdev and there is a second disk failure.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Alexander Lesle
>
> at December 17 2010, 17:48 <Lanky Doodle> wrote in [1]:
>
> > By single drive mirrors, I assume, in a 14 disk setup, you mean 7
> > sets of 2 disk mirrors - I am thinking of traditional RAID1 here.
>
> > Or do you mean 1 massive mirror with all 14 disks?
>
> Edward means a set of two-way mirrors.

Correct.
mirror disk0 disk1 mirror disk2 disk3 mirror disk4 disk5 ...

You would normally call this a stripe of mirrors. Even though the ZFS concept of striping is more advanced than traditional raid striping... we still call this a ZFS stripe for lack of any other term. A ZFS stripe has all the good characteristics of raid concatenation and striping, without any of the bad characteristics. It can utilize bandwidth on multiple disks when it wants to, or use a single device when it wants to for small blocks. It can dynamically add randomly sized devices, and it can be done one at a time. You gain everything good about a traditional raid stripe or concatenation, without any of the negatives.

> For Solaris 11 Express, Oracle announced that in the text installer you can
> set the root pool to a mirror during installation. At the moment I am trying
> it out in a VM but I haven't found this option. :-(

Actually, even in Solaris 10, I habitually install the root filesystem onto a ZFS mirror. You just select 2 disks, and it's automatically a mirror.

> zpool create lankyserver mirror disk1 disk2 mirror disk3 disk4
>
> When you need more space you can add another pair of disks to your
> lankyserver pool. The disks within each pair should have the same capacity.
>
> zpool add lankyserver mirror disk5 disk6 mirror disk7 disk8 ...

Correct.
> From: Bob Friesenhahn [mailto:bfriesen at simple.dallas.tx.us]
> Sent: Friday, December 17, 2010 9:16 PM
>
> While I agree that smaller vdevs are more reliable, I find your statement
> about the failure being more likely to be in the same vdev if you have
> only 2 vdevs to be rather useless. The probability of vdev failure does
> not have anything to do with the number of vdevs. However, the probability
> of vdev failure increases tremendously if there is only one vdev and there
> is a second disk failure.

I'm not sure you got what I meant. I'll rephrase and see if it's more clear:

Correct, the number of vdevs doesn't affect the probability of a failure in a specific vdev, but the number of disks in a vdev does. Lanky said he was considering 2x 7-disk raidz versus 3x 5-disk raidz. So when I said he's more likely to have a 2nd disk fail in the same vdev if he only has 2 vdevs... that was meant to be taken in context, not as a generalization about pools in general.

Consider a single disk. Let P be the probability of the disk failing within 1 day.

If you have 5 disks in a raidz vdev, and one fails, there are 4 remaining. If the resilver will last 8 days, then the probability of a 2nd disk failing is 4*8*P = 32P

If you have 7 disks in a raidz vdev, and one fails, there are 6 remaining. If the resilver will last 12 days, then the probability of a 2nd disk failing is 6*12*P = 72P
On the subject of where to install ZFS, I was planning to use either Compact Flash or a USB drive (both of which would be mounted internally); using up 2 of the drive bays for a mirrored install is possibly a waste of physical space, considering a) it's a home media server and b) the config can be backed up to a protected ZFS pool - if the CF or USB drive failed I would just replace it and restore the config.

Can you have an equivalent of a global hot spare in ZFS? If I did go down the mirror route (mirror disk0 disk1 mirror disk2 disk3 mirror disk4 disk5 etc) all the way up to 14 disks, that would leave the 15th disk spare.

Now this is getting really complex, but can you have server failover in ZFS, much like DFS-R in Windows - you point clients to a clustered ZFS namespace so if a complete server failed nothing is interrupted?

I am still undecided as to mirror vs RAID-Z. I am going to be ripping uncompressed Blu-Rays so space is vital. I use RAID-DP on NetApp kit at work and I'm guessing RAID-Z2 is the equivalent? I have 5TB of space at the moment, so going to the expense of mirroring for only 2TB extra doesn't seem much of a pay off. Maybe a compromise of 2x 7-disk RAID-Z1 with a global hotspare is the way to go?

Put it this way, I currently use Windows Home Server, which has no true disk failure protection, so any of ZFS's redundancy schemes is going to be a step up; is there an equivalent system in ZFS where if 1 disk fails you only lose that disk's data, like unRAID?

Thanks everyone for your input so far :)
--
This message posted from opensolaris.org
On Dec 18, 2010, at 12:23 PM, Lanky Doodle wrote:

> Now this is getting really complex, but can you have server failover in ZFS, much like DFS-R in Windows - you point clients to a clustered ZFS namespace so if a complete server failed nothing is interrupted?

This is the purpose of an Amber Road dual-head cluster (7310C, 7410C, etc.) -- not only does the storage pool fail over, but the server IP address fails over as well, so that NFS shares etc. remain active when one storage head goes down. Amber Road uses ZFS, but the clustering and failover are not related to the filesystem type.

Mark
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Lanky Doodle
>
> On the subject of where to install ZFS, I was planning to use either Compact
> Flash or a USB drive (both of which would be mounted internally); using up 2 of
> the drive bays for a mirrored install is possibly a waste of physical space,
> considering a) it's a home media server and b) the config can be backed up to
> a protected ZFS pool - if the CF or USB drive failed I would just replace it
> and restore the config.

All of the above is correct. One thing you should keep in mind, however: if your unmirrored rpool (USB fob) fails... although yes, you can restore, assuming you have been sufficiently backing it up... you will suffer an ungraceful halt. Maybe you can live with that.

> Can you have an equivalent of a global hot spare in ZFS? If I did go down the
> mirror route (mirror disk0 disk1 mirror disk2 disk3 mirror disk4 disk5 etc) all
> the way up to 14 disks, that would leave the 15th disk spare.

Check the zpool man page for "spare," but I know you can have spares assigned to a pool, and I'm pretty sure you can assign any given spare to multiple pools, effectively making it a global hotspare. So yes is the answer.

> Now this is getting really complex, but can you have server failover in ZFS,
> much like DFS-R in Windows - you point clients to a clustered ZFS namespace
> so if a complete server failed nothing is interrupted?

If that's somehow possible, it's something I don't know. I don't believe you can do that with ZFS.

> I am still undecided as to mirror vs RAID-Z. I am going to be ripping
> uncompressed Blu-Rays so space is vital.

For both read and write, raidz works extremely well for sequential operations. It sounds like you're probably going to be doing mostly sequential operations, so raidz should perform very well for you.

A lot of people will avoid raidzN because it doesn't perform very well for random reads, so they opt for mirrors instead. But in your case, not so much. In your case, the only reason I can think of to avoid raidz would be if you're worrying about resilver times. That's a valid concern, but you can choose any number of disks you want... you could make raidz vdevs of 3 disks each... it's just a compromise between the mirror and the larger raidz vdev.

> I use RAID-DP on NetApp kit at work
> and I'm guessing RAID-Z2 is the equivalent?

Yup, RAID-DP and raidz2 are conceptually pretty much the same.

> Put it this way, I currently use Windows Home Server, which has no true disk
> failure protection, so any of ZFS's redundancy schemes is going to be a step
> up; is there an equivalent system in ZFS where if 1 disk fails you only lose
> that disk's data, like unRAID?

No. Not unless you make that many separate volumes.
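As a concrete sketch of the spare syntax (pool and device names here are placeholders; see the zpool man page entry Edward refers to):

# zpool add mypool spare c5t0d0
# zpool status mypool

The spare then appears in its own "spares" section of the status output, and the ZFS Administration Guide notes that a spare can be shared across multiple pools on the same system, which is what makes it effectively global.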
Thanks Edward.

I do agree about a mirrored rpool (equivalent to a Windows OS volume); not doing it goes against one of my principles when building enterprise servers.

Is there any argument against using the rpool for all data storage as well as being the install volume? Say for example I chucked 15x 1TB disks in there and created a mirrored rpool during installation, using 2 disks. If I added another 6 mirrors (12 disks) to it, that would give me an rpool of 7TB. The 15th disk being a spare.

Or, say I selected 3 disks during install, does this create a 3-way mirrored rpool or does it give you the option of creating raidz? If so, I could then create a further 4x 3-drive raidzs, giving me a 10TB rpool.

Or, I could use 2 smaller disks (say 80GB) for the rpool, then create 4x 3-drive raidzs, giving me an 8TB data pool. Again this gives me a spare disk.

Either of these 3 should keep resilvering times to a minimum, compared to, say, one big raidz2 of 13 disks.

Why does resilvering take so long in raidz anyway?
--
This message posted from opensolaris.org
Oh, does anyone know if resilvering efficiency is improved or fixed in Solaris 11 Express, as that is what I'm using?
--
This message posted from opensolaris.org
> Why does resilvering take so long in raidz anyway?

Because it's broken. There were some changes a while back that made it more broken. There has been a lot of discussion, anecdotes and some data on this list.

The resilver doesn't do a single pass of the drives, but uses a "smarter" temporal algorithm based on metadata. However, the current implementation has difficulty finishing the job if there's a steady flow of updates to the pool. As far as I'm aware, the only way to get bounded resilver times is to stop the workload until resilvering is completed. The problem exists for mirrors too, but is not as marked because mirror reconstruction is inherently simpler.

I believe Oracle is aware of the problem, but most of the core ZFS team has left. And of course, a fix for Oracle Solaris no longer means a fix for the rest of us.
Hi,

Which brings up an interesting question... IF it were fixed in, for example, illumos or FreeBSD, is there a plan for how to handle possibly incompatible zfs implementations?

Currently the basic version numbering only works because it implies a single stream of development; now, with multiple possible streams, does ZFS need to move to a feature-bit system, or are we going to have forks or multiple incompatible versions?

Thanks,
Deano

-----Original Message-----
From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-bounces at opensolaris.org] On Behalf Of Phil Harman
Sent: 20 December 2010 10:43
To: Lanky Doodle
Cc: zfs-discuss at opensolaris.org
Subject: Re: [zfs-discuss] A few questions

> Why does resilvering take so long in raidz anyway?

Because it's broken. There were some changes a while back that made it more broken. There has been a lot of discussion, anecdotes and some data on this list.

The resilver doesn't do a single pass of the drives, but uses a "smarter" temporal algorithm based on metadata. However, the current implementation has difficulty finishing the job if there's a steady flow of updates to the pool. As far as I'm aware, the only way to get bounded resilver times is to stop the workload until resilvering is completed. The problem exists for mirrors too, but is not as marked because mirror reconstruction is inherently simpler.

I believe Oracle is aware of the problem, but most of the core ZFS team has left. And of course, a fix for Oracle Solaris no longer means a fix for the rest of us.

_______________________________________________
zfs-discuss mailing list
zfs-discuss at opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
> I believe Oracle is aware of the problem, but most of
> the core ZFS team has left. And of course, a fix for
> Oracle Solaris no longer means a fix for the rest of
> us.

OK, that is a bit concerning then. As good as ZFS may be, I'm not sure I want to commit to a file system that is 'broken' and may not be fully fixed, if at all.

Hmmmmmnnn...
--
This message posted from opensolaris.org
On 20/12/2010 11:03, Deano wrote:
> Hi,
>
> Which brings up an interesting question... IF it were fixed in, for example,
> illumos or FreeBSD, is there a plan for how to handle possibly incompatible
> zfs implementations?
>
> Currently the basic version numbering only works because it implies a single
> stream of development; now, with multiple possible streams, does ZFS need to
> move to a feature-bit system, or are we going to have forks or multiple
> incompatible versions?
>
> Thanks,
> Deano

Changes to the resilvering implementation don't necessarily require changes to the on-disk format (although they could). Of course, there might be an issue moving a pool mid-resilver from one implementation to another.

With arguably considerably more ZFS expertise outside Oracle than in, there's a good chance the community will get to a fix first. It would then be interesting to see whether NIH prevails, or perhaps even a new spirit of "share and share alike".

"You may say I'm a dreamer ..."
On 20/12/2010 11:29, Lanky Doodle wrote:
>> I believe Oracle is aware of the problem, but most of
>> the core ZFS team has left. And of course, a fix for
>> Oracle Solaris no longer means a fix for the rest of
>> us.
>
> OK, that is a bit concerning then. As good as ZFS may be, I'm not sure I want to commit to a file system that is 'broken' and may not be fully fixed, if at all.
>
> Hmmmmmnnn...

My home server is still running snv_82, and my iMac is running Apple's last public beta release for Leopard. The way I see it, the on-disk format is sound, and the basic "always consistent on disk" promise seems to be worth something. My files are read-mostly, and performance isn't an issue for me. ZFS has protected my data for several years now in the face of various hardware issues.

I'll upgrade my NAS appliance to OpenSolaris snv_134b sometime soon, but as far as I can tell, I can't use Oracle Solaris 11 Express for licensing reasons (I have backups of business data). I'll be watching Illumos with interest, but snv_82 has served me well for 3 years, so I figure snv_134b probably has quite a lot of useful life left in it, and maybe by then btrfs will be ready for prime time?
Phil Harman <phil.harman at gmail.com> wrote:

> Changes to the resilvering implementation don't necessarily require
> changes to the on-disk format (although they could). Of course, there
> might be an issue moving a pool mid-resilver from one implementation to
> another.

We seem to be coming to a problem similar to the one with UFS 20 years ago. At that time, Sun enhanced the UFS on-disk format, but the *BSDs did not follow this change even though the format change was "documented" in the related include files.

For future ZFS development, there may be a need to allow one implementation to support on-disk versions 1..21 + 24 and another implementation to support on-disk versions 1..23 + 25.

These thoughts are of course void in case Oracle continues the OSS decisions for Solaris and other Solaris variants can import the code related to recent enhancements.

Jörg

--
EMail: joerg at schily.isdn.cs.tu-berlin.de (home) Jörg Schilling D-13353 Berlin
       js at cs.tu-berlin.de (uni)
       joerg.schilling at fokus.fraunhofer.de (work) Blog: http://schily.blogspot.com/
URL: http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily
On Dec 20, 2010, at 2:42 AM, Phil Harman <phil.harman at gmail.com> wrote:

>> Why does resilvering take so long in raidz anyway?
>
> Because it's broken. There were some changes a while back that made it more broken.

"broken" is the wrong term here. It functions as designed and correctly resilvers devices. Disagreeing with the design is quite different than proving a defect.

> There has been a lot of discussion, anecdotes and some data on this list.

"slow because I use devices with poor random write(!) performance" is very different than "broken."

> The resilver doesn't do a single pass of the drives, but uses a "smarter" temporal algorithm based on metadata.

A design that only does a single pass does not handle the temporal changes. Many RAID implementations use a mix of spatial and temporal resilvering and suffer with that design decision.

> However, the current implementation has difficulty finishing the job if there's a steady flow of updates to the pool.

Please define current. There are many releases of ZFS, and many improvements have been made over time. What has not improved is the random write performance of consumer-grade HDDs.

> As far as I'm aware, the only way to get bounded resilver times is to stop the workload until resilvering is completed.

I know of no RAID implementation that bounds resilver times for HDDs. I believe it is not possible. OTOH, whether a resilver takes 10 seconds or 10 hours makes little difference in data availability. Indeed, this is why we often throttle resilvering activity. See previous discussions on this forum regarding the dueling RFEs.

> The problem exists for mirrors too, but is not as marked because mirror reconstruction is inherently simpler.

Resilver time is bounded by the random write performance of the resilvering device. Mirroring or raidz make no difference.

> I believe Oracle is aware of the problem, but most of the core ZFS team has left. And of course, a fix for Oracle Solaris no longer means a fix for the rest of us.

Some "improvements" were made post-b134 and pre-b148.
-- richard
Thanks relling.

I suppose at the end of the day any file system/volume manager has its flaws, so perhaps it's better to look at the positives of each and decide based on them.

So, back to my question above, is there a deciding argument *against* putting data on the install volume (rpool)? Forget about mirroring for a sec;

1) Select 3 disks during install, creating a raidz1. Create a further 4x 3-drive raidz1s, giving me a 10TB rpool with no spare disks
2) Select 5 disks during install, creating a raidz1. Create a further 2x 5-drive raidz1s, giving me a 12TB rpool with no spare disks
3) Select 7 disks during install, creating a raidz1. Create a further 7-drive raidz1, giving me a 12TB rpool with 1 spare disk

As there is no space gain between 2) and 3), there is no point going for 3), other than having a spare disk, and resilver times would be slower. So it comes down to 1) and 2). Neither offers spare disks, but 1) would offer faster resilver times and up to 5 simultaneous disk failures, while 2) would offer 2TB of extra space with up to 3 simultaneous disk failures.

FYI, I am using Samsung SpinPoint F2s, which have variable RPM speeds (http://www.scan.co.uk/products/1tb-samsung-hd103si-ecogreen-f2-sata-3gb-s-32mb-cache-89-ms-ncq)

I may wait at least until I get the next 4 drives in (I actually have 6 at the mo, not 5), taking me to 10, before migrating to ZFS, so plenty of time to think about it and hopefully time for them to fix resilvering! ;-)

Thanks again...
--
This message posted from opensolaris.org
On 20/12/2010 13:59, Richard Elling wrote:
> On Dec 20, 2010, at 2:42 AM, Phil Harman <phil.harman at gmail.com> wrote:
>
>>> Why does resilvering take so long in raidz anyway?
>> Because it's broken. There were some changes a while back that made
>> it more broken.
>
> "broken" is the wrong term here. It functions as designed and correctly
> resilvers devices. Disagreeing with the design is quite different than
> proving a defect.

It might be the wrong term in general, but I think it does apply in the budget home media server context of this thread.

I think we can agree that ZFS currently doesn't play well on cheap disks. I think we can also agree that the performance of ZFS resilvering is known to be suboptimal under certain conditions.

For a long time at Sun, the rule was "correctness is a constraint, performance is a goal". However, in the real world, performance is often also a constraint (just as a quick but erroneous answer is a wrong answer, so also a slow but correct answer can be "wrong"). Then one brave soul at Sun ventured that "if Linux is faster, it's a Solaris bug!" and to his surprise, the idea caught on.

I later went on to tell people that ZFS delivered RAID "where I = inexpensive", so I'm just a little frustrated when that promise becomes less respected over time. First it was USB drives (which I agreed with), now it's SATA (and I'm not so sure).

>> There has been a lot of discussion, anecdotes and some data on this
>> list.
>
> "slow because I use devices with poor random write(!) performance"
> is very different than "broken."

Again, context is everything. For example, if someone was building a business critical NAS appliance from consumer grade parts, I'd be the first to say "are you nuts?!"

>> The resilver doesn't do a single pass of the drives, but uses a
>> "smarter" temporal algorithm based on metadata.
>
> A design that only does a single pass does not handle the temporal
> changes. Many RAID implementations use a mix of spatial and temporal
> resilvering and suffer with that design decision.

Actually, it's easy to see how a combined spatial and temporal approach could be implemented to an advantage for mirrored vdevs.

>> However, the current implementation has difficulty finishing the job if
>> there's a steady flow of updates to the pool.
>
> Please define current. There are many releases of ZFS, and
> many improvements have been made over time. What has not
> improved is the random write performance of consumer-grade
> HDDs.

I was led to believe this was not yet fixed in Solaris 11, and that there are therefore doubts about what Solaris 10 update may see the fix, if any.

>> As far as I'm aware, the only way to get bounded resilver times is to
>> stop the workload until resilvering is completed.
>
> I know of no RAID implementation that bounds resilver times
> for HDDs. I believe it is not possible. OTOH, whether a resilver
> takes 10 seconds or 10 hours makes little difference in data
> availability. Indeed, this is why we often throttle resilvering
> activity. See previous discussions on this forum regarding the
> dueling RFEs.

I don't share your disbelief or your "little difference" analysis. If it is true that no current implementation succeeds, isn't that a great opportunity to change the rules? Wasn't resilver time vs availability a major factor in Adam Leventhal's paper introducing the need for RAIDZ3?

The appropriateness or otherwise of resilver throttling depends on the context. If I can tolerate further failures without data loss (e.g. RAIDZ2 with one failed device, or RAIDZ3 with two failed devices), or if I can recover business critical data in a timely manner, then great. But there may come a point where I would rather take a short term performance hit to close the window on total data loss.

>> The problem exists for mirrors too, but is not as marked because
>> mirror reconstruction is inherently simpler.
>
> Resilver time is bounded by the random write performance of
> the resilvering device. Mirroring or raidz make no difference.

This only holds in a quiesced system.

>> I believe Oracle is aware of the problem, but most of the core ZFS
>> team has left. And of course, a fix for Oracle Solaris no longer
>> means a fix for the rest of us.
>
> Some "improvements" were made post-b134 and pre-b148.

That is, indeed, good news.
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Lanky Doodle
>
> > I believe Oracle is aware of the problem, but most of
> > the core ZFS team has left. And of course, a fix for
> > Oracle Solaris no longer means a fix for the rest of
> > us.
>
> OK, that is a bit concerning then. As good as ZFS may be, I'm not sure I want
> to commit to a file system that is 'broken' and may not be fully fixed, if at all.

ZFS is not "broken." It is, however, a weak spot, that resilver is very inefficient. For example:

On my server, which is made up of 10krpm SATA drives, 1TB each... my drives can each sustain 1Gbit/sec sequential read/write. This means, if I needed to resilver the entire drive (in a mirror) sequentially, it would take 8,000 sec = 133 minutes. About 2 hours. In reality, I have ZFS mirrors, and disks are around 70% full, and resilver takes 12-14 hours.

So although resilver is "broken" by some standards, it is bounded, and you can limit it to something which is survivable, by using mirrors instead of raidz. For most people, even using 5-disk or 7-disk raidzN will still be fine. But you start getting unsustainable if you get up to a 21-disk raidz3, for example.
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Lanky Doodle
>
> Is there any argument against using the rpool for all data storage as well as
> being the install volume?

Generally speaking, you can't do it. The rpool is only supported on mirrors, not raidz. I believe this is because you need the rpool in order to load the kernel, and until the kernel is loaded, there's just no reasonable way to have a fully zfs-aware, supports-every-feature bootloader able to read the rpool in order to fetch the kernel.

Normally, you'll dedicate 2 disks to the OS, and then build additional separate data pools. If you absolutely need all the disk space of the OS disks, then you partition the OS into a smaller section of the OS disks and assign the remaining space to some pool. But doing that partitioning scheme can be complex, and if you're not careful, risky. I don't advise it unless you truly have your back against a wall for more disk space.

> Why does resilvering take so long in raidz anyway?

There are some really long and sometimes complex threads in this mailing list discussing that. Fundamentally...

First of all, it's not always true. It depends on your usage behavior and the type of disks you're using. But "typical" usage includes reading and writing a lot of files, essentially randomly over time, creating and deleting snapshots, and using spindle disks, so the "typical" usage behavior does have a resilver performance problem.

The root cause of the problem is that ZFS does not resilver the whole disk... it only resilvers the used portions of the disk. Sounds like a performance enhancer, right? It would be, if the disks were mostly empty, or if ZFS resilvered the used portions in order according to disk layout. Unfortunately, it resilvers according to the temporal order the blocks were written, and usually a disk is significantly full (say, 50% or more), so the disks have to thrash all around, performing all sorts of random reads, until eventually all the used parts have been read in random order.

It's worse on raidzN than on mirrors, because the number of items which must be read is higher in raidzN, assuming you're using larger vdevs and therefore more items exist scattered about inside that vdev. You therefore have a higher number of things which must be randomly read before you reach completion.
> -----Original Message-----
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Edward Ned Harvey
> Sent: Monday, December 20, 2010 11:46 AM
> To: 'Lanky Doodle'; zfs-discuss at opensolaris.org
> Subject: Re: [zfs-discuss] A few questions
>
> ZFS is not "broken." It is, however, a weak spot, that resilver is very
> inefficient. For example:
>
> On my server, which is made up of 10krpm SATA drives, 1TB each... my drives
> can each sustain 1Gbit/sec sequential read/write. This means, if I needed
> to resilver the entire drive (in a mirror) sequentially, it would take
> 8,000 sec = 133 minutes. About 2 hours. In reality, I have ZFS mirrors,
> and disks are around 70% full, and resilver takes 12-14 hours.
>
> So although resilver is "broken" by some standards, it is bounded, and you
> can limit it to something which is survivable, by using mirrors instead of
> raidz. For most people, even using 5-disk or 7-disk raidzN will still be
> fine. But you start getting unsustainable if you get up to a 21-disk raidz3,
> for example.

This argument keeps coming up on the list, but I don't see where anyone has made a good suggestion about whether this can even be 'fixed' or how it would be done.

As I understand it, you have two basic types of array reconstruction: in a mirror you can make a block-by-block copy and that's easy, but in a parity array you have to perform a calculation on the existing data and/or existing parity to reconstruct the missing piece. This is pretty easy when you can guarantee that all your stripes are the same width, start/end on the same sectors/boundaries/whatever, and thus know a piece of them lives on all drives in the set. I don't think this is possible with ZFS since we have variable stripe width. A failed disk d may or may not contain data from stripe s (or transaction t). This information has to be discovered by looking at the transaction records. Right?

Can someone speculate as to how you could rebuild a variable-stripe-width array without replaying all the available transactions? I am no filesystem engineer, but I can't wrap my head around how this could be handled any better than it already is. I've read that resilvering is throttled - presumably to keep performance degradation to a minimum during the process - maybe this could be a tunable (e.g. priority: low, normal, high)?

Do we know if resilvers on a mirror are actually handled differently from those on a raidz?

Sorry if this has already been explained. I think this is an issue that everyone who uses ZFS should understand completely before jumping in, because the behavior (while not 'wrong') is clearly NOT the same as with more conventional arrays.

-Will
On 12/20/2010 9:20 AM, Saxon, Will wrote:
> This argument keeps coming up on the list, but I don't see where anyone has made a good suggestion about whether this can even be 'fixed' or how it would be done.
>
> As I understand it, you have two basic types of array reconstruction: in a mirror you can make a block-by-block copy and that's easy, but in a parity array you have to perform a calculation on the existing data and/or existing parity to reconstruct the missing piece. This is pretty easy when you can guarantee that all your stripes are the same width, start/end on the same sectors/boundaries/whatever, and thus know a piece of them lives on all drives in the set. I don't think this is possible with ZFS since we have variable stripe width. A failed disk d may or may not contain data from stripe s (or transaction t). This information has to be discovered by looking at the transaction records. Right?
>
> Can someone speculate as to how you could rebuild a variable-stripe-width array without replaying all the available transactions? I am no filesystem engineer, but I can't wrap my head around how this could be handled any better than it already is. I've read that resilvering is throttled - presumably to keep performance degradation to a minimum during the process - maybe this could be a tunable (e.g. priority: low, normal, high)?
>
> Do we know if resilvers on a mirror are actually handled differently from those on a raidz?
>
> Sorry if this has already been explained. I think this is an issue that everyone who uses ZFS should understand completely before jumping in, because the behavior (while not 'wrong') is clearly NOT the same as with more conventional arrays.
>
> -Will

The "problem" is NOT the checksum/error correction overhead. That's relatively trivial. The problem isn't really even variable-width slabs (i.e. the variable number of disks one crosses).

The problem boils down to this: when ZFS does a resilver, it walks the METADATA tree to determine what order to rebuild things in. That means it resilvers the very first slab ever written, then the next oldest, etc. The problem here is that slab "age" has nothing to do with where that data physically resides on the actual disks. If you've used the zpool as a WORM device, then, sure, there should be a strict correlation between increasing slab age and locality on the disk. However, in any reasonable case, files get deleted regularly. This means that for a slab B written immediately after slab A, the chances are it WON'T be physically near slab A.

In the end, the problem is that using metadata order, while reducing the total amount of work to do in the resilver (as you only resilver live data, not every bit on the drive), increases the physical inefficiency for each slab. That is, seek time between cylinders begins to dominate your slab reconstruction time. In RAIDZ, this problem is magnified by both the much larger average vdev size vs mirrors, and the necessity that all drives containing a slab's data return it before the corrected data can be written to the resilvering drive.

Thus, current ZFS resilvering tends to be seek-time limited, NOT throughput limited. This is really the "fault" of the underlying media, not ZFS. For instance, if you have a raidz of SSDs (where seek time is negligible, but throughput isn't), they resilver really, really fast. In fact, they resilver at the maximum write throughput rate. However, HDs are severely seek-limited, so that dominates HD resilver time.

The "answer" isn't simple, as the problem is media-specific.

--
Erik Trimble
Java System Support
Mailstop: usca22-123
Phone: x17195
Santa Clara, CA
Timezone: US/Pacific (GMT-0800)
On 12/20/2010 9:20 AM, Saxon, Will wrote:>> -----Original Message----- >> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss- >> bounces at opensolaris.org] On Behalf Of Edward Ned Harvey >> Sent: Monday, December 20, 2010 11:46 AM >> To: ''Lanky Doodle''; zfs-discuss at opensolaris.org >> Subject: Re: [zfs-discuss] A few questions >> >>> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss- >>> bounces at opensolaris.org] On Behalf Of Lanky Doodle >>> >>>> I believe Oracle is aware of the problem, but most of >>>> the core ZFS team has left. And of course, a fix for >>>> Oracle Solaris no longer means a fix for the rest of >>>> us. >>> OK, that is a bit concerning then. As good as ZFS may be, i''m not sure I >> want >>> to committ to a file system that is ''broken'' and may not be fully fixed, >> if at all. >> >> ZFS is not "broken." It is, however, a weak spot, that resilver is very >> inefficient. For example: >> >> On my server, which is made up of 10krpm SATA drives, 1TB each... My >> drives >> can each sustain 1Gbit/sec sequential read/write. This means, if I needed >> to resilver the entire drive (in a mirror) sequentially, it would take ... >> 8,000 sec = 133 minutes. About 2 hours. In reality, I have ZFS mirrors, >> and disks are around 70% full, and resilver takes 12-14 hours. >> >> So although resilver is "broken" by some standards, it is bounded, and you >> can limit it to something which is survivable, by using mirrors instead of >> raidz. For most people, even using 5-disk, or 7-disk raidzN will still be >> fine. But you start getting unsustainable if you get up to 21-disk radiz3 >> for example. > This argument keeps coming up on the list, but I don''t see where anyone has made a good suggestion about whether this can even be ''fixed'' or how it would be done. > > As I understand it, you have two basic types of array reconstruction: in a mirror you can make a block-by-block copy and that''s easy, but in a parity array you have to perform a calculation on the existing data and/or existing parity to reconstruct the missing piece. This is pretty easy when you can guarantee that all your stripes are the same width, start/end on the same sectors/boundaries/whatever and thus know a piece of them lives on all drives in the set. I don''t think this is possible with ZFS since we have variable stripe width. A failed disk d may or may not contain data from stripe s (or transaction t). This information has to be discovered by looking at the transaction records. Right? > > Can someone speculate as to how you could rebuild a variable stripe width array without replaying all the available transactions? I am no filesystem engineer but I can''t wrap my head around how this could be handled any better than it already is. I''ve read that resilvering is throttled - presumably to keep performance degradation to a minimum during the process - maybe this could be a tunable (e.g. priority: low, normal, high)? > > Do we know if resilvers on a mirror are actually handled differently from those on a raidz? > > Sorry if this has already been explained. I think this is an issue that everyone who uses ZFS should understand completely before jumping in, because the behavior (while not ''wrong'') is clearly NOT the same as with more conventional arrays. > > -Will >As far as a possible fix, here''s what I can see: [Note: I''m not a kernel or FS-level developer. 
I would love to be able to fix this myself, but I have neither the aptitude nor the [extensive] time to learn such a skill.]

We can either (a) change how ZFS does resilvering or (b) repack the zpool layouts to avoid the problem in the first place.

In case (a), my vote would be to seriously increase the number of in-flight resilver slabs, AND allow for out-of-time-order slab resilvering. By that, I mean that ZFS would read several disk-sequential slabs, and then mark them as "done". This would mean a *lot* of scanning the metadata tree (since leaves all over the place could be "done"). Frankly, I can't say how bad that would be; the problem is that for ANY resilver, ZFS would have to scan the entire metadata tree to see if it had work to do, rather than simply look for the latest completed leaf and then assume everything after that needs to be done. There'd also be the matter of determining *if* one should read a disk sector...

In case (b), we need the ability to move slabs around on the physical disk (via the mythical "Block Pointer Re-write" method). If there is that underlying mechanism, then a "defrag" utility can be run to repack the zpool to the point where chronological creation time = physical layout, which then substantially mitigates the seek time problem.

I can't fix (a) - I don't understand the codebase well enough. Neither can I do the BP-rewrite implementation. However, if I can get BP-rewrite, I've got a prototype defragger that seems to work well (under simulation). I'm sure it could use some performance improvement, but it works reasonably well on a simulated fragmented pool.

Please, Santa, can a good little boy get a BP-rewrite code commit in his stocking for Christmas?

-- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA Timezone: US/Pacific (GMT-0800)
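As a thought experiment, option (a) might look something like the sketch below: gather a window of live block pointers from the metadata walk, issue them in physical-offset order, and track completion per block rather than with a single "last finished" watermark. The data structures and the rebuild callback are hypothetical; this is not the actual ZFS code path.

    # Out-of-order resilver sketch: batch the metadata walk, sort each batch by
    # physical offset, and remember completion per block.
    def resilver_out_of_order(live_blocks, rebuild, window=10000):
        # live_blocks: iterable of (birth_txg, phys_offset, length) tuples
        # rebuild(offset, length): read the surviving copies and write the new disk
        done = set()
        pending = []

        def flush():
            # issue the batch in disk order, turning random I/O into near-sequential I/O
            for birth, offset, length in sorted(pending, key=lambda b: b[1]):
                rebuild(offset, length)
                done.add((birth, offset))
            pending.clear()

        for blk in live_blocks:
            pending.append(blk)
            if len(pending) >= window:
                flush()
        flush()
        return done

The cost Erik describes falls straight out of this: after an interruption, "done" is no longer a simple prefix of the birth-time order, so a restarted resilver has to re-walk the metadata tree and test each block against the completion record.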
Erik, just a hypothetical what-if ... In the case of resilvering on a mirrored disk, why not take a snapshot, and then resilver by doing a pure block copy from the snapshot? It would be sequential, so long as the original data was unmodified; and random access in dealing with the modified blocks only, right. After the original snapshot had been replicated, a second pass would be done, in order to update the clone to 100% live data. Not knowing enough about the inner workings of ZFS snapshots, I don''t know why this would not be doable. (I''m biased towards mirrors for busy filesystems.) I''m supposing that a block-level snapshot is not doable -- or is it? Mark On Dec 20, 2010, at 1:27 PM, Erik Trimble wrote:> On 12/20/2010 9:20 AM, Saxon, Will wrote: >>> -----Original Message----- >>> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss- >>> bounces at opensolaris.org] On Behalf Of Edward Ned Harvey >>> Sent: Monday, December 20, 2010 11:46 AM >>> To: ''Lanky Doodle''; zfs-discuss at opensolaris.org >>> Subject: Re: [zfs-discuss] A few questions >>> >>>> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss- >>>> bounces at opensolaris.org] On Behalf Of Lanky Doodle >>>> >>>>> I believe Oracle is aware of the problem, but most of >>>>> the core ZFS team has left. And of course, a fix for >>>>> Oracle Solaris no longer means a fix for the rest of >>>>> us. >>>> OK, that is a bit concerning then. As good as ZFS may be, i''m not sure I >>> want >>>> to committ to a file system that is ''broken'' and may not be fully fixed, >>> if at all. >>> >>> ZFS is not "broken." It is, however, a weak spot, that resilver is very >>> inefficient. For example: >>> >>> On my server, which is made up of 10krpm SATA drives, 1TB each... My >>> drives >>> can each sustain 1Gbit/sec sequential read/write. This means, if I needed >>> to resilver the entire drive (in a mirror) sequentially, it would take ... >>> 8,000 sec = 133 minutes. About 2 hours. In reality, I have ZFS mirrors, >>> and disks are around 70% full, and resilver takes 12-14 hours. >>> >>> So although resilver is "broken" by some standards, it is bounded, and you >>> can limit it to something which is survivable, by using mirrors instead of >>> raidz. For most people, even using 5-disk, or 7-disk raidzN will still be >>> fine. But you start getting unsustainable if you get up to 21-disk radiz3 >>> for example. >> This argument keeps coming up on the list, but I don''t see where anyone has made a good suggestion about whether this can even be ''fixed'' or how it would be done. >> >> As I understand it, you have two basic types of array reconstruction: in a mirror you can make a block-by-block copy and that''s easy, but in a parity array you have to perform a calculation on the existing data and/or existing parity to reconstruct the missing piece. This is pretty easy when you can guarantee that all your stripes are the same width, start/end on the same sectors/boundaries/whatever and thus know a piece of them lives on all drives in the set. I don''t think this is possible with ZFS since we have variable stripe width. A failed disk d may or may not contain data from stripe s (or transaction t). This information has to be discovered by looking at the transaction records. Right? >> >> Can someone speculate as to how you could rebuild a variable stripe width array without replaying all the available transactions? I am no filesystem engineer but I can''t wrap my head around how this could be handled any better than it already is. 
I''ve read that resilvering is throttled - presumably to keep performance degradation to a minimum during the process - maybe this could be a tunable (e.g. priority: low, normal, high)? >> >> Do we know if resilvers on a mirror are actually handled differently from those on a raidz? >> >> Sorry if this has already been explained. I think this is an issue that everyone who uses ZFS should understand completely before jumping in, because the behavior (while not ''wrong'') is clearly NOT the same as with more conventional arrays. >> >> -Will > the "problem" is NOT the checksum/error correction overhead. that''s relatively trivial. The problem isn''t really even variable width (i.e. variable number of disks one crosses) slabs. > > The problem boils down to this: > > When ZFS does a resilver, it walks the METADATA tree to determine what order to rebuild things from. That means, it resilvers the very first slab ever written, then the next oldest, etc. The problem here is that slab "age" has nothing to do with where that data physically resides on the actual disks. If you''ve used the zpool as a WORM device, then, sure, there should be a strict correlation between increasing slab age and locality on the disk. However, in any reasonable case, files get deleted regularly. This means that the probability that for a slab B, written immediately after slab A, it WON''T be physically near slab A. > > In the end, the problem is that using metadata order, while reducing the total amount of work to do in the resilver (as you only resilver live data, not every bit on the drive), increases the physical inefficiency for each slab. That is, seek time between cyclinders begins to dominate your slab reconstruction time. In RAIDZ, this problem is magnified by both the much larger average vdev size vs mirrors, and the necessity that all drives containing a slab information return that data before the corrected data can be written to the resilvering drive. > > Thus, current ZFS resilvering tends to be seek-time limited, NOT throughput limited. This is really the "fault" of the underlying media, not ZFS. For instance, if you have a raidZ of SSDs (where seek time is negligible, but throughput isn''t), they resilver really, really fast. In fact, they resilver at the maximum write throughput rate. However, HDs are severely seek-limited, so that dominates HD resilver time. > > > The "answer" isn''t simple, as the problem is media-specific. > > -- > Erik Trimble > Java System Support > Mailstop: usca22-123 > Phone: x17195 > Santa Clara, CA > Timezone: US/Pacific (GMT-0800) > > _______________________________________________ > zfs-discuss mailing list > zfs-discuss at opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
On 12/20/2010 11:56 AM, Mark Sandrock wrote:
> Erik,
>
> just a hypothetical what-if ...
>
> In the case of resilvering on a mirrored disk, why not take a snapshot, and then resilver by doing a pure block copy from the snapshot? It would be sequential, so long as the original data was unmodified; and random access in dealing with the modified blocks only, right.
>
> After the original snapshot had been replicated, a second pass would be done, in order to update the clone to 100% live data.
>
> Not knowing enough about the inner workings of ZFS snapshots, I don't know why this would not be doable. (I'm biased towards mirrors for busy filesystems.)
>
> I'm supposing that a block-level snapshot is not doable -- or is it?
>
> Mark

Snapshots on ZFS are true snapshots - they take a picture of the current state of the system. They DON'T copy any data around when created. So, a ZFS snapshot would be just as fragmented as the ZFS filesystem was at the time.

The problem is this: Let's say I write blocks A, B, C, and D on a clean zpool (what kind, it doesn't matter). I now delete block C. Later on, I write block E. There is a probability (increasing dramatically as time goes on) that the on-disk layout will now look like:

A, B, E, D

rather than

A, B, [space], D, E

So, in the first case, I can do a sequential read to get A & B, but then must do a seek to get D, and a seek to get E.

The "fragmentation" problem is mainly due to file deletion, NOT to file re-writing. (Though, in ZFS, being a C-O-W filesystem, re-writing generally looks like a delete-then-write process, rather than a modify process.)

-- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA Timezone: US/Pacific (GMT-0800)
On Mon, 20 Dec 2010 11:27:41 PST Erik Trimble <erik.trimble at oracle.com> wrote:
> The problem boils down to this:
>
> When ZFS does a resilver, it walks the METADATA tree to determine what order to rebuild things from. That means, it resilvers the very first slab ever written, then the next oldest, etc. The problem here is that slab "age" has nothing to do with where that data physically resides on the actual disks. If you've used the zpool as a WORM device, then, sure, there should be a strict correlation between increasing slab age and locality on the disk. However, in any reasonable case, files get deleted regularly. This means that the probability that for a slab B, written immediately after slab A, it WON'T be physically near slab A.
>
> In the end, the problem is that using metadata order, while reducing the total amount of work to do in the resilver (as you only resilver live data, not every bit on the drive), increases the physical inefficiency for each slab. That is, seek time between cylinders begins to dominate your slab reconstruction time. In RAIDZ, this problem is magnified by both the much larger average vdev size vs mirrors, and the necessity that all drives containing a slab information return that data before the corrected data can be written to the resilvering drive.
>
> Thus, current ZFS resilvering tends to be seek-time limited, NOT throughput limited. This is really the "fault" of the underlying media, not ZFS. For instance, if you have a raidZ of SSDs (where seek time is negligible, but throughput isn't), they resilver really, really fast. In fact, they resilver at the maximum write throughput rate. However, HDs are severely seek-limited, so that dominates HD resilver time.

You guys may be interested in a solution I used in a totally different situation. There an identical tree data structure had to be maintained on every node of a distributed system. When a new node was added, it needed to be initialized with an identical copy before it could be put in operation. But this had to be done while the rest of the system was operational, and there may even be updates from a central node during the 'mirroring' operation. Some of these updates could completely change the tree! Starting at the root was not going to work, since a subtree that was being copied may stop existing in the middle and its space be reused! In a way this is a similar problem (but worse!). I needed something foolproof and simple.

My algorithm started copying sequentially from the start. If N blocks were already copied when an update comes along, updates of any block with block# > N are ignored (since the sequential copy would get to them eventually). Updates of any block# <= N were queued up (a further update of the same block would overwrite the old update, to reduce work). Periodically they would be flushed out to the new node. This was paced so as to not affect the normal operation much.

I should think a variation would work for active filesystems. You sequentially read some amount of data from all the disks from which data for the new disk is to be prepared, and write it out sequentially. Each time, read enough data so that reading time dominates any seek time. Handle concurrent updates as above. If you dedicate N% of time to resilvering, the total time to complete the resilver will be 100/N times the sequential read time of the whole disk. (For example, 1TB disk, 100MBps IO speed, 25% for resilver => under 12 hours). 
How much worse this gets depends on the amount of updates during resilvering. At the time of resilvering your FS is more likely to be near full than near empty so I wouldn''t worry about optimizing the mostly empty FS case. Bakul
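Bakul's scheme translates into code fairly directly. Here is a minimal sketch with hypothetical interfaces: latest write wins on the pending queue, no pacing, and no locking shown (a real implementation would need both).

    # Sequential copy with concurrent updates, per the description above.
    # read_src(start, n) returns n blocks; write_dst(start, blocks) writes them.
    class SequentialResilver:
        def __init__(self, nblocks):
            self.nblocks = nblocks
            self.copied_upto = 0      # N: blocks [0, N) already copied
            self.pending = {}         # block# -> latest data, only for already-copied blocks

        def on_update(self, blkno, data):
            if blkno < self.copied_upto:
                self.pending[blkno] = data     # must be re-copied; newer update overwrites older
            # else: ignore, the sequential pass will reach it anyway

        def run(self, read_src, write_dst, chunk=1024):
            while self.copied_upto < self.nblocks:
                n = min(chunk, self.nblocks - self.copied_upto)
                write_dst(self.copied_upto, read_src(self.copied_upto, n))
                self.copied_upto += n
                for blkno, data in self.pending.items():   # periodic flush (pacing omitted)
                    write_dst(blkno, [data])
                self.pending.clear()

His arithmetic also checks out: a 1 TB disk at 100 MB/s sequential, with 25% of the time given to the resilver, is 1e12 / (0.25 * 1e8) = 40,000 seconds, or a little over 11 hours.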
On Dec 20, 2010, at 2:05 PM, Erik Trimble wrote:> On 12/20/2010 11:56 AM, Mark Sandrock wrote: >> Erik, >> >> just a hypothetical what-if ... >> >> In the case of resilvering on a mirrored disk, why not take a snapshot, and then >> resilver by doing a pure block copy from the snapshot? It would be sequential, >> so long as the original data was unmodified; and random access in dealing with >> the modified blocks only, right. >> >> After the original snapshot had been replicated, a second pass would be done, >> in order to update the clone to 100% live data. >> >> Not knowing enough about the inner workings of ZFS snapshots, I don''t know why >> this would not be doable. (I''m biased towards mirrors for busy filesystems.) >> >> I''m supposing that a block-level snapshot is not doable -- or is it? >> >> Mark > Snapshots on ZFS are true snapshots - they take a picture of the current state of the system. They DON''T copy any data around when created. So, a ZFS snapshot would be just as fragmented as the ZFS filesystem was at the time.But if one does a raw (block) copy, there isn''t any fragmentation -- except for the COW updates. If there were no updates to the snapshot, then it becomes a 100% sequential block copy operation. But even with COW updates, presumably the large majority of the copy would still be sequential i/o. Maybe for the 2nd pass, the filesystem would have to be locked, so the operation would ever complete, but if this is fairly short in relation to the overall resilvering time, then it could still be a win in many cases. I''m probably not explaining it well, and may be way off, but it seemed an interesting notion. Mark> > > The problem is this: > > Let''s say I write block A, B, C, and D on a clean zpool (what kind, it doesn''t matter). I now delete block C. Later on, I write block E. There is a probability (increasing dramatically as times goes on), that the on-disk layout will now look like: > > A, B, E, D > > rather than > > A, B, [space], D, E > > > So, in the first case, I can do a sequential read to get A & B, but then must do a seek to get D, and a seek to get E. > > The "fragmentation" problem is mainly due to file deletion, NOT to file re-writing. (though, in ZFS, being a C-O-W filesystem, re-writing generally looks like a delete-then-write process, rather than a modify process). > > > -- > Erik Trimble > Java System Support > Mailstop: usca22-123 > Phone: x17195 > Santa Clara, CA > Timezone: US/Pacific (GMT-0800) >
> From: Erik Trimble [mailto:erik.trimble at oracle.com] > > We can either (a) change how ZFS does resilvering or (b) repack the > zpool layouts to avoid the problem in the first place. > > In case (a), my vote would be to seriously increase the number of > in-flight resilver slabs, AND allow for out-of-time-order slab > resilvering.Question for any clueful person: Suppose you have a mirror to resilver, made of disk1 and disk2, where disk2 failed and is resilvering. If you have an algorithm to create a list of all the used blocks of disk1 in disk order, then you''re able to resilver the mirror extremely fast, because all the reads will be sequential in nature, plus you get to skip past all the unused space. Now suppose you have a raidz with 3 disks (disk1, disk2, disk3, where disk3 is resilvering). You find some way of ordering all the used blocks of disk1... Which means disk1 will be able to read in optimal order and speed. Does that necessarily imply disk2 will also work well? Does the on-disk order of blocks of disk1 necessarily match the order of blocks on disk2? If there is no correlation between on-disk order of blocks for different disks within the same vdev, then all hope is lost; it''s essentially impossible to optimize the resilver/scrub order unless the on-disk order of multiple disks is highly correlated or equal by definition.
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss- > bounces at opensolaris.org] On Behalf Of Erik Trimble > > > In the case of resilvering on a mirrored disk, why not take a snapshot,and> then > > resilver by doing a pure block copy from the snapshot? It would be > sequential, > > So, a > ZFS snapshot would be just as fragmented as the ZFS filesystem was at > the time.I think Mark was suggesting something like "dd" copy device 1 onto device 2, in order to guarantee a first-pass sequential resilver. And my response would be: Creative thinking and suggestions are always a good thing. In fact, the above suggestion is already faster than the present-day solution for what I''m calling "typical" usage, but there are an awful lot of use cases where the "dd" solution would be worse... Such as a pool which is largely sequential already, or largely empty, or made of high IOPS devices such as SSD. However, there is a desire to avoid resilvering unused blocks, so I hope a better solution is possible... The fundamental requirement for a better optimized solution would be a way to resilver according to disk ordering... And it''s just a question for somebody that actually knows the answer ... How terrible is the idea of figuring out the on-disk order?
On Mon, Dec 20 at 19:19, Edward Ned Harvey wrote:>If there is no correlation between on-disk order of blocks for different >disks within the same vdev, then all hope is lost; it''s essentially >impossible to optimize the resilver/scrub order unless the on-disk order of >multiple disks is highly correlated or equal by definition.Very little is impossible. Drives have been optimally ordering seeks for 35+ years. I''m guessing that the trick (difficult, but not impossible) is how to solve a "travelling salesman" route pathing problem where you have billions or trillions of transactions, and do it fast enough that it was worth doing any extra computation besides just giving the device 32+ queued commands at a time that align with the elements of each ordered transaction ID. Add to that all the complexity of unwinding the error recovery in the event that you fail checksum validation on transaction N-1 after moving past transaction N, which would be a required capability if you wanted to queue more than a single transaction for verification at a time. Oh, and do all of the above without noticably affecting the throughput of the applications already running on the system. --eric -- Eric D. Mudama edmudama at mail.bounceswoosh.org
It well may be that different methods are optimal for different use cases. Mechanical disk vs. SSD; mirrored vs. raidz[123]; sparse vs. populated; etc. It would be interesting to read more in this area, if papers are available. I''ll have to take a look. ... Or does someone have pointers? Mark On Dec 20, 2010, at 6:28 PM, Edward Ned Harvey wrote:>> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss- >> bounces at opensolaris.org] On Behalf Of Erik Trimble >> >>> In the case of resilvering on a mirrored disk, why not take a snapshot, > and >> then >>> resilver by doing a pure block copy from the snapshot? It would be >> sequential, >> >> So, a >> ZFS snapshot would be just as fragmented as the ZFS filesystem was at >> the time. > > I think Mark was suggesting something like "dd" copy device 1 onto device 2, > in order to guarantee a first-pass sequential resilver. And my response > would be: Creative thinking and suggestions are always a good thing. In > fact, the above suggestion is already faster than the present-day solution > for what I''m calling "typical" usage, but there are an awful lot of use > cases where the "dd" solution would be worse... Such as a pool which is > largely sequential already, or largely empty, or made of high IOPS devices > such as SSD. However, there is a desire to avoid resilvering unused blocks, > so I hope a better solution is possible... > > The fundamental requirement for a better optimized solution would be a way > to resilver according to disk ordering... And it''s just a question for > somebody that actually knows the answer ... How terrible is the idea of > figuring out the on-disk order? >
On Dec 20, 2010, at 7:31 AM, Phil Harman <phil.harman at gmail.com> wrote:> On 20/12/2010 13:59, Richard Elling wrote: >> >> On Dec 20, 2010, at 2:42 AM, Phil Harman <phil.harman at gmail.com> wrote: >> >>> >>>> Why does resilvering take so long in raidz anyway? >>> Because it''s broken. There were some changes a while back that made it more broken. >> >> "broken" is the wrong term here. It functions as designed and correctly >> resilvers devices. Disagreeing with the design is quite different than >> proving a defect. > > It might be the wrong term in general, but I think it does apply in the budget home media server context of this thread.If you only have a few slow drives, you don''t have performance. Like trying to win the Indianapolis 500 with a tricycle...> I think we can agree that ZFS currently doesn''t play well on cheap disks. I think we can also agree that the performance of ZFS resilvering is known to be suboptimal under certain conditions.... and those conditions are also a strength. For example, most file systems are nowhere near full. With ZFS you only resilver data. For those who recall the resilver throttles in SVM or VXVM, you will appreciate not having to resilver non-data.> For a long time at Sun, the rule was "correctness is a constraint, performance is a goal". However, in the real world, performance is often also a constraint (just as a quick but erroneous answer is a wrong answer, so also, a slow but correct answer can also be "wrong"). > > Then one brave soul at Sun once ventured that "if Linux is faster, it''s a Solaris bug!" and to his surprise, the idea caught on. I later went on to tell people that ZFS delievered RAID "where I = inexpensive", so I''m a just a little frustrated when that promise becomes less respected over time. First it was USB drives (which I agreed with), now it''s SATA (and I''m not so sure)."slow" doesn''t begin with an "i" :-)> >> >>> There has been a lot of discussion, anecdotes and some data on this list. >> >> "slow because I use devices with poor random write(!) performance" >> is very different than "broken." > > Again, context is everything. For example, if someone was building a business critical NAS appliance from consumer grade parts, I''d be the first to say "are you nuts?!"Unfortunately, the math does not support your position...> >> >>> The resilver doesn''t do a single pass of the drives, but uses a "smarter" temporal algorithm based on metadata. >> >> A design that only does a single pass does not handle the temporal >> changes. Many RAID implementations use a mix of spatial and temporal >> resilvering and suffer with that design decision. > > Actually, it''s easy to see how a combined spatial and temporal approach could be implemented to an advantage for mirrored vdevs. > >> >>> However, the current implentation has difficulty finishing the job if there''s a steady flow of updates to the pool. >> >> Please define current. There are many releases of ZFS, and >> many improvements have been made over time. What has not >> improved is the random write performance of consumer-grade >> HDDs. > > I was led to believe this was not yet fixed in Solaris 11, and that there are therefore doubts about what Solaris 10 update may see the fix, if any. > >> >>> As far as I''m aware, the only way to get bounded resilver times is to stop the workload until resilvering is completed. >> >> I know of no RAID implementation that bounds resilver times >> for HDDs. I believe it is not possible. 
OTOH, whether a resilver >> takes 10 seconds or 10 hours makes little difference in data >> availability. Indeed, this is why we often throttle resilvering >> activity. See previous discussions on this forum regarding the >> dueling RFEs. > > I don''t share your disbelief or "little difference" analysys. If it is true that no current implementation succeeds, isn''t that a great opportunity to change the rules? Wasn''t resilver time vs availability was a major factor in Adam Leventhal''s paper introducing the need for RAIDZ3?No, it wasn''t. There are two failure modes we can model given the data provided by disk vendors: 1. failures by time (MTBF) 2. failures by bits read (UER) Over time, the MTBF has improved, but the failures by bits read has not improved. Just a few years ago enterprise class HDDs had an MTBF of around 1 million hours. Today, they are in the range of 1.6 million hours. Just looking at the size of the numbers, the probability that a drive will fail in one hour is on the order of 10e-6. By contrast, the failure rate by bits read has not improved much. Consumer class HDDs are usually spec''ed at 1 error per 1e14 bits read. To put this in perspective, a 2TB disk has around 1.6e13 bits. Or, the probability of an unrecoverable read if you read every bit on a 2TB is growing well above 10%. Some of the better enterprise class HDDs are rated two orders of magnitude better, but the only way to get much better is to use more bits for ECC... hence the move towards 4KB sectors. In other words, the probability of losing data by reading data can be larger than losing data next year. This is the case for triple parity RAID.> The appropriateness or otherwise of resilver throttling depends on the context. If I can tolerate further failures without data loss (e.g. RAIDZ2 with one failed device, or RAIDZ3 with two failed devices), or if I can recover business critical data in a timely manner, then great. But there may come a point where I would rather take a short term performance hit to close the window on total data loss.I agree. Back in the bad old days, we were stuck with silly throttles on SVM (10 IOPs, IIRC). The current ZFS throttle (b142, IIRC) is dependent on the competing, non-scrub I/O. This works because in ZFS all I/O is not created equal, unlike the layered RAID implementations such as SVM or RAID arrays. ZFS schedules the regular workload at a higher priority than scrubs or resilvers. Add the new throttles and the scheduler is even more effective. So you get your interactive performance at the cost of longer resilver times. This is probably a good trade-off for most folks.> >> >>> The problem exists for mirrors too, but is not as marked because mirror reconstruction is inherently simpler. >> >> Resilver time is bounded by the random write performance of >> the resilvering device. Mirroring or raidz make no difference. > > This only holds in a quiesced system.The effect will be worse for a mirror because you have direct competition for the single, surviving HDD. For raidz*, we clearly see the read workload spread out across the surving disks at approximatey the 1/N ratio. In other words, if you have a 4+1 raidz, then a resilver will keep the resilvering disk 100% busy writing, and the data disks approximately 25% busy reading. Later releases of ZFS will also prefetch the reads and the writes can be coalesced, skewing the ratio a little bit, but the general case seems to be a reasonable starting point. 
-- richard
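For the curious, the unrecoverable-read-error figure above can be checked in a couple of lines. This assumes errors are independent and occur at exactly the quoted consumer-class rate, which is a simplification.

    # Probability of at least one unrecoverable read while reading a full 2 TB drive,
    # at the commonly quoted consumer-class rate of 1 error per 1e14 bits read.
    bits = 2e12 * 8                       # ~1.6e13 bits on a 2 TB drive
    p_bit = 1.0 / 1e14
    p_at_least_one = 1 - (1 - p_bit) ** bits
    print(p_at_least_one)                 # ~0.15, i.e. "well above 10%"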
On Dec 20, 2010, at 4:19 PM, Edward Ned Harvey <opensolarisisdeadlongliveopensolaris at nedharvey.com> wrote:>> From: Erik Trimble [mailto:erik.trimble at oracle.com] >> >> We can either (a) change how ZFS does resilvering or (b) repack the >> zpool layouts to avoid the problem in the first place. >> >> In case (a), my vote would be to seriously increase the number of >> in-flight resilver slabs, AND allow for out-of-time-order slab >> resilvering. > > Question for any clueful person: > > Suppose you have a mirror to resilver, made of disk1 and disk2, where disk2 > failed and is resilvering. If you have an algorithm to create a list of all > the used blocks of disk1 in disk order, then you''re able to resilver the > mirror extremely fast, because all the reads will be sequential in nature, > plus you get to skip past all the unused space.Sounds like the definition of random access :-)> > Now suppose you have a raidz with 3 disks (disk1, disk2, disk3, where disk3 > is resilvering). You find some way of ordering all the used blocks of > disk1... Which means disk1 will be able to read in optimal order and speed.Sounds like prefetching :-)> Does that necessarily imply disk2 will also work well? Does the on-disk > order of blocks of disk1 necessarily match the order of blocks on disk2?This is an interesting question, that will become more interesting as the physical sector size gets bigger... -- richard>
> It's worse on raidzN than on mirrors, because the number of items which must be read is higher in raidzN, assuming you're using larger vdevs and therefore more items exist scattered about inside that vdev. You therefore have a higher number of things which must be randomly read before you reach completion.

In that case, isn't the answer to have a dedicated parity disk (or 2 or 3 depending on which raidz* is used), a la RAID-DP? Wouldn't this effectively be the 'same' as a mirror when resilvering (the only difference being parity vs actual data), as it's doing so from a single disk? RAID-DP covers the parity disk against failure, so raidz1 probably wouldn't be sensible, as if the parity disk itself fails you're left with no redundancy... -- This message posted from opensolaris.org
On 21/12/2010 05:44, Richard Elling wrote:> On Dec 20, 2010, at 7:31 AM, Phil Harman <phil.harman at gmail.com > <mailto:phil.harman at gmail.com>> wrote: >> On 20/12/2010 13:59, Richard Elling wrote: >>> On Dec 20, 2010, at 2:42 AM, Phil Harman <phil.harman at gmail.com >>> <mailto:phil.harman at gmail.com>> wrote: >>>>> Why does resilvering take so long in raidz anyway? >>>> Because it''s broken. There were some changes a while back that made >>>> it more broken. >>> "broken" is the wrong term here. It functions as designed and correctly >>> resilvers devices. Disagreeing with the design is quite different than >>> proving a defect. >> It might be the wrong term in general, but I think it does apply in >> the budget home media server context of this thread. > If you only have a few slow drives, you don''t have performance. > Like trying to win the Indianapolis 500 with a tricycle...The context of this thread is a budget home media server (certainly not the Indy 500, but perhaps not as humble as tricycle touring either). And whilst it is a habit of the hardware advocate to blame the software ... and vice versa ... it''s not much help to those of us trying to build "good enough" systems across the performance and availability spectrum.>> I think we can agree that ZFS currently doesn''t play well on cheap >> disks. I think we can also agree that the performance of ZFS >> resilvering is known to be suboptimal under certain conditions. > ... and those conditions are also a strength. For example, most file > systems are nowhere near full. With ZFS you only resilver data. For those > who recall the resilver throttles in SVM or VXVM, you will appreciate not > having to resilver non-data.I''d love to see the data and analysis for the assertion that "most files systems are nowhere near full", discounting, of course, any trivial cases. In my experience, in any cost conscious scenario, in the home or the enterprise, the expectation is that I''ll get to use the majority of the space I''ve paid for (generally "through the nose" from the storage silo team in the enterprise scenario). To borrow your illustration, even Indy 500 teams care about fuel consumption. What I don''t appreciate is having to resilver significantly more data than the drive can contain. But when it comes to the crunch, what I''d really appreciate was a bounded resilver time measured in hours not days or weeks.>> For a long time at Sun, the rule was "correctness is a constraint, >> performance is a goal". However, in the real world, performance is >> often also a constraint (just as a quick but erroneous answer is a >> wrong answer, so also, a slow but correct answer can also be "wrong"). >> >> Then one brave soul at Sun once ventured that "if Linux is faster, >> it''s a Solaris bug!" and to his surprise, the idea caught on. I later >> went on to tell people that ZFS delievered RAID "where I = >> inexpensive", so I''m a just a little frustrated when that promise >> becomes less respected over time. First it was USB drives (which I >> agreed with), now it''s SATA (and I''m not so sure). > "slow" doesn''t begin with an "i" :-)Both ZFS and RAID promised to play in the inexpensive space.>>>> There has been a lot of discussion, anecdotes and some data on this >>>> list. >>> "slow because I use devices with poor random write(!) performance" >>> is very different than "broken." >> Again, context is everything. 
For example, if someone was building a >> business critical NAS appliance from consumer grade parts, I''d be the >> first to say "are you nuts?!" > Unfortunately, the math does not support your position...Actually, the math (e.g. raw drive metrics) doesn''t lead me to expect such a disparity.>>>> The resilver doesn''t do a single pass of the drives, but uses a >>>> "smarter" temporal algorithm based on metadata. >>> A design that only does a single pass does not handle the temporal >>> changes. Many RAID implementations use a mix of spatial and temporal >>> resilvering and suffer with that design decision. >> Actually, it''s easy to see how a combined spatial and temporal >> approach could be implemented to an advantage for mirrored vdevs. >>>> However, the current implentation has difficulty finishing the job >>>> if there''s a steady flow of updates to the pool. >>> Please define current. There are many releases of ZFS, and >>> many improvements have been made over time. What has not >>> improved is the random write performance of consumer-grade >>> HDDs. >> I was led to believe this was not yet fixed in Solaris 11, and that >> there are therefore doubts about what Solaris 10 update may see the >> fix, if any. >>>> As far as I''m aware, the only way to get bounded resilver times is >>>> to stop the workload until resilvering is completed. >>> I know of no RAID implementation that bounds resilver times >>> for HDDs. I believe it is not possible. OTOH, whether a resilver >>> takes 10 seconds or 10 hours makes little difference in data >>> availability. Indeed, this is why we often throttle resilvering >>> activity. See previous discussions on this forum regarding the >>> dueling RFEs. >> I don''t share your disbelief or "little difference" analysys. If it >> is true that no current implementation succeeds, isn''t that a great >> opportunity to change the rules? Wasn''t resilver time vs availability >> was a major factor in Adam Leventhal''s paper introducing the need for >> RAIDZ3? > > No, it wasn''t.Maybe we weren''t reading the same paper? From http://dtrace.org/blogs/ahl/2009/12/21/acm_triple_parity_raid (a pointer to Adam''s ACM article)> The need for triple-parity RAID > ... > The time to populate a drive is directly relevant for RAID rebuild. As > disks in RAID systems take longer to reconstruct, the reliability of > the total system decreases due to increased periods running in a > degraded state. Today that can be four hours or longer; that could > easily grow to days or weeks.From http://queue.acm.org/detail.cfm?id=1670144 (Adam''s ACM article)> While bit error rates have nearly kept pace with the growth in disk > capacity, throughput has not been given its due consideration when > determining RAID reliability.Whilst Adam does discuss the lack of progress in bit error rates, his focus (in the article, and in his pointer to it) seems to be on drive capacity vs data rates, how this impact recovery times, and the consequential need to protect against multiple overlapping failures.> There are two failure modes we can model given the data > provided by disk vendors: > 1. failures by time (MTBF) > 2. failures by bits read (UER) > > Over time, the MTBF has improved, but the failures by bits read has not > improved. Just a few years ago enterprise class HDDs had an MTBF > of around 1 million hours. Today, they are in the range of 1.6 million > hours. Just looking at the size of the numbers, the probability that a > drive will fail in one hour is on the order of 10e-6. 
> > By contrast, the failure rate by bits read has not improved much. > Consumer class HDDs are usually spec''ed at 1 error per 1e14 > bits read. To put this in perspective, a 2TB disk has around 1.6e13 > bits. Or, the probability of an unrecoverable read if you read every bit > on a 2TB is growing well above 10%. Some of the better enterprise class > HDDs are rated two orders of magnitude better, but the only way to get > much better is to use more bits for ECC... hence the move towards > 4KB sectors. > > In other words, the probability of losing data by reading data can be > larger than losing data next year. This is the case for triple parity > RAID.MTBF as quoted by HDD vendors has become pretty meaningless. [nit: when a disk fails, it is not considered "repairable", so a better metric is MTTF (because there are no repairable failures)] 1.6 million hours equates to about 180 years, so why do HDD vendors guarantee their drives for considerably less (typically 3-5 years)? It''s because they base the figure on a constant failure rate expected during the normal useful life of the drive (typically 5 years). However, quoting from http://www.asknumbers.com/WhatisReliability.aspx> Field failures do not generally occur at a uniform rate, but follow a > distribution in time commonly described as a "bathtub curve." The life > of a device can be divided into three regions: Infant Mortality > Period, where the failure rate progressively improves; Useful Life > Period, where the failure rate remains constant; and Wearout Period, > where failure rates begin to increase.Crucially, the vendor''s quoted MTBF figures do not take into account "infant mortality" or early "wearout". Until every HDD is fitted with an environmental tell-tale device for shock, vibration, temperature, pressure, humidity, etc we can''t even come close to predicting either factor. And this is just the HDD itself. In a system there are many ways to lose access to an HDD. So I''m exposed when I lose the first drive in a RAIDZ1 (second drive in a RAIDZ2, or third drive in a RAIDZ3). And the longer the resilver takes, the longer I''m exposed. Add to the mix that Indy 500 drives can degrade to tricyle performance before they fail utterly, and yes, low performing drives can still be an issue, even for the elite.>> The appropriateness or otherwise of resilver throttling depends on >> the context. If I can tolerate further failures without data loss >> (e.g. RAIDZ2 with one failed device, or RAIDZ3 with two failed >> devices), or if I can recover business critical data in a timely >> manner, then great. But there may come a point where I would rather >> take a short term performance hit to close the window on total data loss. > > I agree. Back in the bad old days, we were stuck with silly throttles > on SVM (10 IOPs, IIRC). The current ZFS throttle (b142, IIRC) is dependent > on the competing, non-scrub I/O. This works because in ZFS all I/O is not > created equal, unlike the layered RAID implementations such as SVM or > RAID arrays. ZFS schedules the regular workload at a higher priority than > scrubs or resilvers. Add the new throttles and the scheduler is even more > effective. So you get your interactive performance at the cost of longer > resilver times. This is probably a good trade-off for most folks. > >>>> The problem exists for mirrors too, but is not as marked because >>>> mirror reconstruction is inherently simpler. >>> >>> Resilver time is bounded by the random write performance of >>> the resilvering device. 
Mirroring or raidz make no difference.
>>
>> This only holds in a quiesced system.
>
> The effect will be worse for a mirror because you have direct competition for the single, surviving HDD. For raidz*, we clearly see the read workload spread out across the surviving disks at approximately the 1/N ratio. In other words, if you have a 4+1 raidz, then a resilver will keep the resilvering disk 100% busy writing, and the data disks approximately 25% busy reading. Later releases of ZFS will also prefetch the reads and the writes can be coalesced, skewing the ratio a little bit, but the general case seems to be a reasonable starting point.

Mirrored systems need more drives to achieve the same capacity, so mirrored volumes are generally striped by some means, so the equivalent of your 4+1 RAIDZ1 is actually a 4+4. In such a configuration resilvering one drive at 100% would also result in a mean hit of 25%.

Obviously, a drive running at 100% has nothing more to give, so for fun let's throttle the resilver to 25 x 1MB sequential reads per second (which is about 25% of a good drive's sequential throughput). At this rate, a 2TB drive will resilver in under 24 hours, so let's make that the upper bound. It is highly desirable to throttle the resilver and regular I/O rates according to required performance and availability metrics, so something better than 24 hours should be the norm. It should also be possible for the system to report an ETA based on current and historic workload statistics. "You may say I'm a dreamer..."

For mirrored vdevs, ZFS could resilver using an efficient block level copy, whilst keeping a record of progress, and considering copied blocks as already mirrored and ready to be read and updated by normal activity. Obviously, it's much harder to apply this approach for RAIDZ. Since slabs are allocated sequentially, it should also be possible to set a high water mark for the bulk copy, so that fresh pools with little or no data could also be resilvered in minutes or seconds.

I believe such an approach would benefit all ZFS users, not just the elite.

> -- richard

Phil

p.s. just for the record, Nexenta's Hardware Supported List (HSL) is an excellent resource for those wanting to build NAS appliances that actually work... http://www.nexenta.com/corp/supported-hardware/hardware-supported-list ... which includes Hitachi Ultrastar A7K2000 SATA 7200rpm HDDs (enterprise class drives at near consumer prices)
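Phil's 24-hour upper bound is easy to verify; the figures below are just his example numbers (a 2 TB mirror member, 25 x 1 MB/s of sequential reads devoted to the resilver).

    # Throttled-resilver bound from the example above.
    drive_bytes = 2e12
    rate = 25 * 1e6                    # bytes per second devoted to resilver
    print(drive_bytes / rate / 3600)   # ~22 hours, under the proposed 24-hour bound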
On Dec 20, 2010, at 7:31 AM, Phil Harman <phil.harman at gmail.com> wrote:
> If you only have a few slow drives, you don't have performance.
> Like trying to win the Indianapolis 500 with a tricycle...

Well, you can put a jet engine on a tricycle and perhaps win it? Or you can change the race course to only allow a tricycle space to move. In the context of storage we have two factors, hardware and software; having faster and more reliable spindles is no reason to suggest that better software can't be used to beat it. The simple example is the ZIL SSD, where some software plus even a cheap commodity SSD will outperform any amount of expensive spindle drives for sync writes. Before the ZIL software, it was easy to argue that the only way of speeding up writes was more, faster spindles.

The question therefore is, is there room in the software implementation to achieve performance and reliability numbers similar to expensive drives whilst using relatively cheap drives?

ZFS is good, but IMHO it's easy to see how it can be improved to better meet this situation. I can't currently say when this line of thinking and code will move from research to production level use (tho I have a pretty good idea ;) ) but I wouldn't bet on the status quo lasting much longer. In some ways the removal of OpenSolaris may actually be a good thing, as it has catalyzed a number of developers from the view that zfs is Oracle led, to thinking "what can we do with zfs code as a base?"

For example, how about sticking a cheap 80GiB commodity SSD in the storage case. When a resilver or defrag is required, use it as scratch space to give you a block of fast-IOPS storage to accelerate the slow parts. When it's done, secure erase it and power it down, ready for the next time a resilver needs to happen. The hardware is available, it just needs someone to write the software...

Bye, Deano
On 21/12/2010 13:05, Deano wrote:
> On Dec 20, 2010, at 7:31 AM, Phil Harman <phil.harman at gmail.com> wrote:
>> If you only have a few slow drives, you don't have performance.
>> Like trying to win the Indianapolis 500 with a tricycle...

Actually, I didn't say that, Richard did :)
Doh sorry about that, the threading got very confused on my mail reader!

Bye, Deano
> From: edmudama at mail.bounceswoosh.org > [mailto:edmudama at mail.bounceswoosh.org] On Behalf Of Eric D. Mudama > > On Mon, Dec 20 at 19:19, Edward Ned Harvey wrote: > >If there is no correlation between on-disk order of blocks for different > >disks within the same vdev, then all hope is lost; it''s essentially > >impossible to optimize the resilver/scrub order unless the on-disk orderof> >multiple disks is highly correlated or equal by definition. > > Very little is impossible. > > Drives have been optimally ordering seeks for 35+ years. I''m guessingUnless your drive is able to queue up a request to read every single used part of the drive... Which is larger than the command queue for any reasonable drive in the world... The point is, in order to be "optimal" you have to eliminate all those seeks, and perform sequential reads only. The only seeks you should do are to skip over unused space. If you''re able to sequentially read the whole drive, skipping all the unused space, then you''re guaranteed to complete faster (or equal) than either (a) sequentially reading the whole drive, or (b) seeking all over the drive to read the used parts in random order.
> From: Richard Elling [mailto:richard.elling at gmail.com] > > > Now suppose you have a raidz with 3 disks (disk1, disk2, disk3, wheredisk3> > is resilvering). You find some way of ordering all the used blocks of > > disk1... Which means disk1 will be able to read in optimal order andspeed.> > Sounds like prefetching :-)Ok. Prefetch every used sector in the pool. Problem solved. Let the disks sort all the requests into on-disk order. Unless perhaps the number of requests would exceed the limits of what the drive is able to sort ... Which seems ... more than likely.
On Tue, Dec 21 at 8:24, Edward Ned Harvey wrote:>> From: edmudama at mail.bounceswoosh.org >> [mailto:edmudama at mail.bounceswoosh.org] On Behalf Of Eric D. Mudama >> >> On Mon, Dec 20 at 19:19, Edward Ned Harvey wrote: >> >If there is no correlation between on-disk order of blocks for different >> >disks within the same vdev, then all hope is lost; it''s essentially >> >impossible to optimize the resilver/scrub order unless the on-disk order >of >> >multiple disks is highly correlated or equal by definition. >> >> Very little is impossible. >> >> Drives have been optimally ordering seeks for 35+ years. I''m guessing > >Unless your drive is able to queue up a request to read every single used >part of the drive... Which is larger than the command queue for any >reasonable drive in the world... The point is, in order to be "optimal" you >have to eliminate all those seeks, and perform sequential reads only. The >only seeks you should do are to skip over unused space.I don''t think you read my whole post. I was saying this seek calculation pre-processing would have to be done by the host server, and while not impossible, is not trivial. Present the next 32 seeks to each device while the pre-processor works on the complete list of future seeks, and the drive will do as well as possible.>If you''re able to sequentially read the whole drive, skipping all the unused >space, then you''re guaranteed to complete faster (or equal) than either (a) >sequentially reading the whole drive, or (b) seeking all over the drive to >read the used parts in random order.Yes, I understand how that works. --eric -- Eric D. Mudama edmudama at mail.bounceswoosh.org
> From: edmudama at mail.bounceswoosh.org > [mailto:edmudama at mail.bounceswoosh.org] On Behalf Of Eric D. Mudama > > >Unless your drive is able to queue up a request to read every single used > >part of the drive... Which is larger than the command queue for any > >reasonable drive in the world... The point is, in order to be "optimal"you> >have to eliminate all those seeks, and perform sequential reads only.The> >only seeks you should do are to skip over unused space. > > I don''t think you read my whole post. I was saying this seek > calculation pre-processing would have to be done by the host server, > and while not impossible, is not trivial. Present the next 32 seeks > to each device while the pre-processor works on the complete list of > future seeks, and the drive will do as well as possible.I did read that, but now I think, perhaps I misunderstand it, or you misunderstood me? I am thinking... If you''re just queueing up a few reads at a time (less than infinity, or less than 99% of the pool) ... I would not assume that these 32 seeks are even remotely sequential.... I mean ... 32 blocks in a pool of presumably millions of blocks... I would assume they are essentially random, are they not? In my mind, which is likely wrong or at least oversimplified, I think if you want to order the list of blocks to read according to disk order (which should at least be theoretically possible on mirrors, but perhaps not even physically possible on raidz)... You would have to first generate a list of all the blocks to be read, and then sort it. Rough estimate, for any pool of a reasonable size, that sounds like some GB of ram to me. Maybe there''s a less-than-perfect sort algorithm which has a much lower memory footprint? Like a simple hashing algorithm that will guarantee the next few thousand seeks are in disk order... Although they will skip or jump over many blocks that will have to be done later ... An algorithm which is not a perfect sort, but given some repetition and multiple passes over the disk, might achieve an acceptable level of performance versus memory footprint...
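The "less-than-perfect sort" Ned is reaching for could look something like the sketch below: bucket block pointers by coarse disk region so memory stays bounded, and sweep region by region; blocks discovered late for an already-swept region simply fall into a later pass. Region size, structures, and the rebuild callback are made up for illustration, not taken from any real implementation.

    # Memory-bounded, approximately disk-ordered resilver: bucket by 1 GB region.
    from collections import defaultdict

    def bucketed_resilver(live_blocks, rebuild, region_bytes=1 << 30, max_buckets=64):
        buckets = defaultdict(list)            # region index -> [(offset, length), ...]
        for offset, length in live_blocks:     # streamed from the metadata walk
            buckets[offset // region_bytes].append((offset, length))
            if len(buckets) > max_buckets:
                region = min(buckets)          # sweep the lowest region sequentially
                for off, ln in sorted(buckets.pop(region)):
                    rebuild(off, ln)
        for region in sorted(buckets):         # final passes over whatever remains
            for off, ln in sorted(buckets[region]):
                rebuild(off, ln)

Memory is bounded by the number of in-flight buckets rather than by the size of the pool, at the price of occasionally revisiting a region that was swept early, which is exactly the "multiple passes over the disk" trade-off described above.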
On Dec 21, 2010, at 5:05 AM, Deano wrote:
> The question therefore is, is there room in the software implementation to achieve performance and reliability numbers similar to expensive drives whilst using relatively cheap drives?

For some definition of "similar," yes. But using relatively cheap drives does not mean the overall system cost will be cheap. For example, $250 will buy 8.6K random IOPS @ 4KB in an SSD[1], but to do that with "cheap disks" might require eighty 7,200 rpm SATA disks.

> ZFS is good but IMHO easy to see how it can be improved to better meet this situation, I can't currently say when this line of thinking and code will move from research to production level use (tho I have a pretty good idea ;) ) but I wouldn't bet on the status quo lasting much longer. In some ways the removal of OpenSolaris may actually be a good thing, as it has catalyzed a number of developers from the view that zfs is Oracle led, to thinking "what can we do with zfs code as a base?"

There are more people outside of Oracle developing for ZFS than inside Oracle. This has been true for some time now.

> For example, how about sticking a cheap 80GiB commodity SSD in the storage case. When a resilver or defrag is required, use it as scratch space to give you a block of fast-IOPS storage to accelerate the slow parts. When it's done, secure erase it and power it down, ready for the next time a resilver needs to happen. The hardware is available, it just needs someone to write the software...

In general, SSDs will not speed resilver unless the resilvering disk is an SSD.

[1] http://www.intel.com/cd/channel/reseller/asmo-na/eng/products/nand/feature/index.htm

-- richard
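As a quick sanity check on the eighty-drives figure, assuming roughly 100 random IOPS per 7,200 rpm SATA drive (a ballpark assumption, not a number from the post):

    # 8.6K random 4 KB IOPS from one SSD vs ~100 IOPS per 7,200 rpm SATA drive (assumed).
    print(8600 / 100)    # ~86 drives, i.e. on the order of eighty spindles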
On Sat, Dec 25, 2010 at 11:23 PM, Richard Elling <richard.elling at gmail.com>wrote:> On Dec 21, 2010, at 5:05 AM, Deano wrote: > > > The question therefore is, is there room in the software implementation to > achieve performance and reliability numbers similar to expensive drives > whilst using relative cheap drives? > > > For some definition of "similar," yes. But using relatively cheap drives > does > not mean the overall system cost will be cheap. For example, $250 will buy > 8.6K random IOPS @ 4KB in an SSD[1], but to do that with "cheap disks" > might > require eighty 7,200 rpm SATA disks. > > ZFS is good but IMHO easy to see how it can be improved to better meet this > situation, I can?t currently say when this line of thinking and code will > move from research to production level use (tho I have a pretty good idea ;) > ) but I wouldn?t bet on the status quo lasting much longer. In some ways the > removal of OpenSolaris may actually be a good thing, as its catalyized a > number of developers from the view that zfs is Oracle led, to thinking ?what > can we do with zfs code as a base?? > > > There are more people outside of Oracle developing for ZFS than inside > Oracle. > This has been true for some time now. > > >Pardon my skepticism, but where is the proof of this claim (I''m quite certain you know I mean no disrespect)? Solaris11 Express was a massive leap in functionality and bugfixes to ZFS. I''ve seen exactly nothing out of "outside of Oracle" in the time since it went closed. We used to see updates bi-weekly out of Sun. Nexenta spending hundreds of man-hours on a GUI and userland apps isn''t work on ZFS. --Tim -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20101225/f33f15b3/attachment-0001.html>
Richard Elling
2010-Dec-26 06:21 UTC
[zfs-discuss] MTBF and why we care [was: A few questions]
On Dec 21, 2010, at 3:48 AM, Phil Harman wrote:> On 21/12/2010 05:44, Richard Elling wrote: >> >> On Dec 20, 2010, at 7:31 AM, Phil Harman <phil.harman at gmail.com> wrote: >>> On 20/12/2010 13:59, Richard Elling wrote: >>>> >>>> On Dec 20, 2010, at 2:42 AM, Phil Harman <phil.harman at gmail.com> wrote: >>>>>> Why does resilvering take so long in raidz anyway? >>>>> Because it''s broken. There were some changes a while back that made it more broken. >>>> "broken" is the wrong term here. It functions as designed and correctly >>>> resilvers devices. Disagreeing with the design is quite different than >>>> proving a defect. >>> It might be the wrong term in general, but I think it does apply in the budget home media server context of this thread. >> If you only have a few slow drives, you don''t have performance. >> Like trying to win the Indianapolis 500 with a tricycle... > > The context of this thread is a budget home media server (certainly not the Indy 500, but perhaps not as humble as tricycle touring either). And whilst it is a habit of the hardware advocate to blame the software ... and vice versa ... it''s not much help to those of us trying to build "good enough" systems across the performance and availability spectrum.it is all in how the expectations are set. For the home user, waiting overnight for a resilver might not impact their daily lives (switch night/day around for developers :-)>>> I think we can agree that ZFS currently doesn''t play well on cheap disks. I think we can also agree that the performance of ZFS resilvering is known to be suboptimal under certain conditions. >> ... and those conditions are also a strength. For example, most file >> systems are nowhere near full. With ZFS you only resilver data. For those >> who recall the resilver throttles in SVM or VXVM, you will appreciate not >> having to resilver non-data. > > I''d love to see the data and analysis for the assertion that "most files systems are nowhere near full", discounting, of course, any trivial cases.I wish I still had access to that data, since I left Sun, I''d be pleasantly surprised if anyone keeps up with it any more. But yes, we did track file system utilization on around 300,000 systems, clearly a statistically significant sample, for Sun''s market anyway. Average space utilization is well below 50%.> In my experience, in any cost conscious scenario, in the home or the enterprise, the expectation is that I''ll get to use the majority of the space I''ve paid for (generally "through the nose" from the storage silo team in the enterprise scenario). To borrow your illustration, even Indy 500 teams care about fuel consumption. > > What I don''t appreciate is having to resilver significantly more data than the drive can contain. But when it comes to the crunch, what I''d really appreciate was a bounded resilver time measured in hours not days or weeks.For those following along, changeset 12296:7cf402a7f374 on May 3, 2010 brought a number of changes to scrubs and resilvers.>>> For a long time at Sun, the rule was "correctness is a constraint, performance is a goal". However, in the real world, performance is often also a constraint (just as a quick but erroneous answer is a wrong answer, so also, a slow but correct answer can also be "wrong"). >>> >>> Then one brave soul at Sun once ventured that "if Linux is faster, it''s a Solaris bug!" and to his surprise, the idea caught on. 
I later went on to tell people that ZFS delievered RAID "where I = inexpensive", so I''m a just a little frustrated when that promise becomes less respected over time. First it was USB drives (which I agreed with), now it''s SATA (and I''m not so sure). >> "slow" doesn''t begin with an "i" :-) > > Both ZFS and RAID promised to play in the inexpensive space.And tricycles are less expensive than Indy cars...>>>>> There has been a lot of discussion, anecdotes and some data on this list. >>>> "slow because I use devices with poor random write(!) performance" >>>> is very different than "broken." >>> Again, context is everything. For example, if someone was building a business critical NAS appliance from consumer grade parts, I''d be the first to say "are you nuts?!" >> Unfortunately, the math does not support your position... > > Actually, the math (e.g. raw drive metrics) doesn''t lead me to expect such a disparity. > >>>>> The resilver doesn''t do a single pass of the drives, but uses a "smarter" temporal algorithm based on metadata. >>>> A design that only does a single pass does not handle the temporal >>>> changes. Many RAID implementations use a mix of spatial and temporal >>>> resilvering and suffer with that design decision. >>> Actually, it''s easy to see how a combined spatial and temporal approach could be implemented to an advantage for mirrored vdevs. >>>>> However, the current implentation has difficulty finishing the job if there''s a steady flow of updates to the pool. >>>> Please define current. There are many releases of ZFS, and >>>> many improvements have been made over time. What has not >>>> improved is the random write performance of consumer-grade >>>> HDDs. >>> I was led to believe this was not yet fixed in Solaris 11, and that there are therefore doubts about what Solaris 10 update may see the fix, if any. >>>>> As far as I''m aware, the only way to get bounded resilver times is to stop the workload until resilvering is completed. >>>> I know of no RAID implementation that bounds resilver times >>>> for HDDs. I believe it is not possible. OTOH, whether a resilver >>>> takes 10 seconds or 10 hours makes little difference in data >>>> availability. Indeed, this is why we often throttle resilvering >>>> activity. See previous discussions on this forum regarding the >>>> dueling RFEs. >>> I don''t share your disbelief or "little difference" analysys. If it is true that no current implementation succeeds, isn''t that a great opportunity to change the rules? Wasn''t resilver time vs availability was a major factor in Adam Leventhal''s paper introducing the need for RAIDZ3? >> >> No, it wasn''t. > > Maybe we weren''t reading the same paper? > > From http://dtrace.org/blogs/ahl/2009/12/21/acm_triple_parity_raid (a pointer to Adam''s ACM article) >> The need for triple-parity RAID >> ... >> The time to populate a drive is directly relevant for RAID rebuild. As disks in RAID systems take longer to reconstruct, the reliability of the total system decreases due to increased periods running in a degraded state. Today that can be four hours or longer; that could easily grow to days or weeks. > > From http://queue.acm.org/detail.cfm?id=1670144 (Adam''s ACM article) >> While bit error rates have nearly kept pace with the growth in disk capacity, throughput has not been given its due consideration when determining RAID reliability. 
> > Whilst Adam does discuss the lack of progress in bit error rates, his focus (in the article, and in his pointer to it) seems to be on drive capacity vs data rates, how this impact recovery times, and the consequential need to protect against multiple overlapping failures. > >> There are two failure modes we can model given the data >> provided by disk vendors: >> 1. failures by time (MTBF) >> 2. failures by bits read (UER) >> >> Over time, the MTBF has improved, but the failures by bits read has not >> improved. Just a few years ago enterprise class HDDs had an MTBF >> of around 1 million hours. Today, they are in the range of 1.6 million >> hours. Just looking at the size of the numbers, the probability that a >> drive will fail in one hour is on the order of 10e-6. >> >> By contrast, the failure rate by bits read has not improved much. >> Consumer class HDDs are usually spec''ed at 1 error per 1e14 >> bits read. To put this in perspective, a 2TB disk has around 1.6e13 >> bits. Or, the probability of an unrecoverable read if you read every bit >> on a 2TB is growing well above 10%. Some of the better enterprise class >> HDDs are rated two orders of magnitude better, but the only way to get >> much better is to use more bits for ECC... hence the move towards >> 4KB sectors. >> >> In other words, the probability of losing data by reading data can be >> larger than losing data next year. This is the case for triple parity RAID. > > MTBF as quoted by HDD vendors has become pretty meaningless. [nit: when a disk fails, it is not considered "repairable", so a better metric is MTTF (because there are no repairable failures)]They are the same in this context.> 1.6 million hours equates to about 180 years, so why do HDD vendors guarantee their drives for considerably less (typically 3-5 years)? It''s because they base the figure on a constant failure rate expected during the normal useful life of the drive (typically 5 years).MTBF has units of "hours between failures," but is often shortened to "hours." It is often easier to do the math with Failures in Time (FITs) where Time is a billion hours. There is a direct correlation: FITs = 1,000,000,000 / MTBF To put this in perspective, a modern CPU has an MTBF of around 4 million hours or 250 FITs. A simple PCI card can easily get to 10 million hours, or less than 100 FITs. Or, if you prefer, the annualized failure rate (AFR) gives a more intuitive response. AFR = 8760 hours per year / MTBF AFR is often represented as a percentage, and ranges of 0.6% to 4% are useful for disks. Remember, all failures due to wear out and described by MTBF in disks are mechanical failures.> However, quoting from http://www.asknumbers.com/WhatisReliability.aspx >> Field failures do not generally occur at a uniform rate, but follow a distribution in time commonly described as a "bathtub curve." The life of a device can be divided into three regions: Infant Mortality Period, where the failure rate progressively improves; Useful Life Period, where the failure rate remains constant; and Wearout Period, where failure rates begin to increase. > > Crucially, the vendor''s quoted MTBF figures do not take into account "infant mortality" or early "wearout". Until every HDD is fitted with an environmental tell-tale device for shock, vibration, temperature, pressure, humidity, etc we can''t even come close to predicting either factor.Yes we can, and yes we do. All you need is a large enough sample size. 
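To put rough numbers on the formulas above (plain Python arithmetic; the MTBF and UER figures are the ones already quoted, nothing new):

mtbf = 1.6e6                      # hours, the enterprise HDD MTBF quoted above
print(1e9 / mtbf)                 # FITs -> 625 failures per billion hours
print(8760 / mtbf * 100)          # AFR  -> ~0.55% per year
uer = 1e-14                       # consumer class: 1 error per 10^14 bits read
bits_2tb = 2e12 * 8               # ~1.6e13 bits on a 2TB drive
print(1 - (1 - uer) ** bits_2tb)  # ~0.15, i.e. "well above 10%" for a full-drive read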
In many cases, the changes in failure rates occur because of events not considered in MTBF calculations: factory defects, contamination, environmental conditions, physical damage, firmware bugs, etc.

> And this is just the HDD itself. In a system there are many ways to lose access to an HDD. So I''m exposed when I lose the first drive in a RAIDZ1 (second drive in a RAIDZ2, or third drive in a RAIDZ3). And the longer the resilver takes, the longer I''m exposed.

Indeed. Let''s look at the math. For the simple MTTDL[1] model, which does not consider UER, we calculate the probability that we have a second failure during the repair time:

    single parity: MTTDL[1] = MTBF^2 / (N * (N-1) * MTTR)
    double parity: MTTDL[1] = MTBF^3 / (N * (N-1) * (N-2) * MTTR^2)

Mean Time To Repair (MTTR) includes logistical replacement and resilvering time, so this model can show the advantage of hot spares (by reducing logistical replacement time). The practical use of this model makes sense where MTTR is on the order of 10s or 100s of hours while the MTBF is on the order of 1 million hours. But the more difficult problem arises with the UER spec. A consumer-grade disk typically has a UER rating of 1 error per 10^14 bits read. 10^14 bits is around 8 2TB drives. In other words, the probability of having a UER during reconstruction of an 8+1 raidz using 2TB consumer-grade drives is more like 63%, much higher than the MTTDL[1] model implies. We are just now seeing enterprise-class drives with a UER rating of 1 error per 10^16 bits read. http://www.seagate.com/staticfiles/support/disc/manuals/enterprise/cheetah/NS/Cheetah%20NS%2010K.2/100516228d.pdf

> Add to the mix that Indy 500 drives can degrade to tricycle performance before they fail utterly, and yes, low performing drives can still be an issue, even for the elite.

Yes. I feel this will become the dominant issue with HDDs and one where there is plenty of room for improvement in ZFS.

>>> The appropriateness or otherwise of resilver throttling depends on the context. If I can tolerate further failures without data loss (e.g. RAIDZ2 with one failed device, or RAIDZ3 with two failed devices), or if I can recover business critical data in a timely manner, then great. But there may come a point where I would rather take a short term performance hit to close the window on total data loss. >> >> I agree. Back in the bad old days, we were stuck with silly throttles >> on SVM (10 IOPs, IIRC). The current ZFS throttle (b142, IIRC) is dependent >> on the competing, non-scrub I/O. This works because in ZFS all I/O is not >> created equal, unlike the layered RAID implementations such as SVM or >> RAID arrays. ZFS schedules the regular workload at a higher priority than >> scrubs or resilvers. Add the new throttles and the scheduler is even more >> effective. So you get your interactive performance at the cost of longer >> resilver times. This is probably a good trade-off for most folks. >> >>>>> The problem exists for mirrors too, but is not as marked because mirror reconstruction is inherently simpler. >>>> >>>> Resilver time is bounded by the random write performance of >>>> the resilvering device. Mirroring or raidz make no difference. >>> >>> This only holds in a quiesced system. >> >> The effect will be worse for a mirror because you have direct >> competition for the single, surviving HDD. For raidz*, we clearly >> see the read workload spread out across the surviving disks at >> approximately the 1/N ratio.
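As a worked example of the MTTDL[1] model and the rebuild UER figure above (assumed inputs; the exact percentage depends on how many surviving drives you count as fully read during reconstruction):

mtbf, mttr, n = 1.0e6, 100.0, 9         # hours, hours, disks in an 8+1 raidz1
print(mtbf**2 / (n * (n - 1) * mttr))   # single-parity MTTDL[1] -> ~1.4e8 hours
bits_read = 8 * 2e12 * 8                # a rebuild reads the 8 surviving 2TB drives
print(1 - (1 - 1e-14) ** bits_read)     # ~0.72 -- same ballpark as the ~63% above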
In other words, if you have a 4+1 raidz, >> then a resilver will keep the resilvering disk 100% busy writing, and >> the data disks approximately 25% busy reading. Later releases of >> ZFS will also prefetch the reads and the writes can be coalesced, >> skewing the ratio a little bit, but the general case seems to be a >> reasonable starting point. > > Mirrored systems need more drives to achieve the same capacity, so mirrored volumes are generally striped by some means, so the equivalent of your 4+1 RAIDZ1 is actually a 4+4. In such a configuration resilvering one drive at 100% would also result in a mean hit of 25%.For HDDs, writes take longer than reads, so reality is much more difficult to model. This is further complicated by ZFS''s I/O scheduler, track read buffers, ZFS prefetching, and the async nature of resilvering writes.> Obviously, a drive running at 100% has nothing more to give, so for fun let''s throttle the resilver to 25x1MB sequential reads per second (which is about 25% of a good drive''s capacity). At this rate, a 2TB drive will resilver in under 24 hours, so let''s make that the upper bound.OK. I think this is a fair goal. It is certainly easier to achieve than the 4.5 hours you can expect for sustained writes to the media.> It is highly desirable to throttle the resilver and regular I/O rates according to required performance and availability metrics, so something better than 24 hours should be the norm. > > It should also be possible for the system to report an ETA based on current and historic workload statistics. "You may say I''m a dreamer..."That is what happens today, but the algorithm doesn''t work well for devices with widely varying random performance profiles (eg HDDs). As the resilver throttle kicks in, due to other I/O taking priority, the resilver time is even more unpredictable. An amusing CR is 6973953, where the "solution" is "do not print estimated time if hours_left is more than 30 days" http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6973953> For mirrored vdevs, ZFS could resilver using an efficient block level copy, whilst keeping a record of progress, and considering copied blocks as already mirrored and ready to be read and updated by normal activity. Obviously, it''s much harder to apply this approach for RAIDZ. > > Since slabs are allocated sequentially, it should also be possible to set a high water mark for the bulk copy, so that fresh pools with little or no data could also be resilvered in minutes or seconds.That is the case today. Try it :-) -- richard> I believe such an approach would benefit all ZFS users, not just the elite. > >> -- richard > > Phil > > p.s. just for the record, Nexenta''s Hardware Supported List (HSL) is an excellent resource for those wanting to build NAS appliances that actually work... > > http://www.nexenta.com/corp/supported-hardware/hardware-supported-list > > ... which includes Hitachi Ultrastar A7K2000 SATA 7200rpm HDDs (enterprise class drives at near consumer prices)-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20101225/9aec1a6b/attachment-0001.html>
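For the resilver-time figures above, the arithmetic is straightforward (the transfer rates here are assumptions, not measurements):

size = 2e12                  # bytes on a 2TB drive
print(size / 25e6 / 3600)    # ~22 h at a throttled 25 MB/s -- the "under 24 hours" bound
print(size / 125e6 / 3600)   # ~4.4 h at ~125 MB/s sustained sequential writes to the media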
On 12/26/10 05:40 AM, Tim Cook wrote:> > > On Sat, Dec 25, 2010 at 11:23 PM, Richard Elling > <richard.elling at gmail.com <mailto:richard.elling at gmail.com>> wrote: > > > There are more people outside of Oracle developing for ZFS than > inside Oracle. > This has been true for some time now. > > > > > Pardon my skepticism, but where is the proof of this claim (I''m quite > certain you know I mean no disrespect)? Solaris11 Express was a > massive leap in functionality and bugfixes to ZFS. I''ve seen exactly > nothing out of "outside of Oracle" in the time since it went closed. > We used to see updates bi-weekly out of Sun. Nexenta spending > hundreds of man-hours on a GUI and userland apps isn''t work on ZFS. > >Exactly my observation as well. I haven''t seen any ZFS related development happening at Ilumos or Nexenta, at least not yet. -- Robert Milkowski http://milek.blogspot.com -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20110103/808050da/attachment.html>
On Mon, 3 Jan 2011, Robert Milkowski wrote:

> Exactly my observation as well. I haven''t seen any ZFS related
> development happening at Ilumos or Nexenta, at least not yet.

There seems to be plenty of zfs work on the FreeBSD project, but primarily
with porting the latest available sources to FreeBSD (going very well!)
rather than with developing zfs itself.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
On 01/ 3/11 05:08 AM, Robert Milkowski wrote:> On 12/26/10 05:40 AM, Tim Cook wrote: >> >> >> On Sat, Dec 25, 2010 at 11:23 PM, Richard Elling >> <richard.elling at gmail.com <mailto:richard.elling at gmail.com>> wrote: >> >> >> There are more people outside of Oracle developing for ZFS than >> inside Oracle. >> This has been true for some time now. >> >> >> >> >> Pardon my skepticism, but where is the proof of this claim (I''m quite >> certain you know I mean no disrespect)? Solaris11 Express was a >> massive leap in functionality and bugfixes to ZFS. I''ve seen exactly >> nothing out of "outside of Oracle" in the time since it went closed. >> We used to see updates bi-weekly out of Sun. Nexenta spending >> hundreds of man-hours on a GUI and userland apps isn''t work on ZFS. >> >> > > Exactly my observation as well. I haven''t seen any ZFS related > development happening at Ilumos or Nexenta, at least not yet.Just because you''ve not seen it yet doesn''t imply it isn''t happening. Please be patient. - Garrett> > -- > Robert Milkowski > http://milek.blogspot.com > > > _______________________________________________ > zfs-discuss mailing list > zfs-discuss at opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20110103/fecd0af1/attachment.html>
On Jan 3, 2011, at 5:08 AM, Robert Milkowski wrote:> On 12/26/10 05:40 AM, Tim Cook wrote: >> >> >> >> On Sat, Dec 25, 2010 at 11:23 PM, Richard Elling <richard.elling at gmail.com> wrote: >> >> There are more people outside of Oracle developing for ZFS than inside Oracle. >> This has been true for some time now. >> >>> >> >> >> >> Pardon my skepticism, but where is the proof of this claim (I''m quite certain you know I mean no disrespect)? Solaris11 Express was a massive leap in functionality and bugfixes to ZFS. I''ve seen exactly nothing out of "outside of Oracle" in the time since it went closed. We used to see updates bi-weekly out of Sun. Nexenta spending hundreds of man-hours on a GUI and userland apps isn''t work on ZFS. >> >> > > Exactly my observation as well. I haven''t seen any ZFS related development happening at Ilumos or Nexenta, at least not yet.I am quite sure you understand how pipelines work :-) -- richard -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20110103/3281ed97/attachment.html>
On 1/3/2011 8:28 AM, Richard Elling wrote:> On Jan 3, 2011, at 5:08 AM, Robert Milkowski wrote: >> On 12/26/10 05:40 AM, Tim Cook wrote: >>> On Sat, Dec 25, 2010 at 11:23 PM, Richard Elling >>> <richard.elling at gmail.com <mailto:richard.elling at gmail.com>> wrote: >>> >>> >>> There are more people outside of Oracle developing for ZFS than >>> inside Oracle. >>> This has been true for some time now. >>> >>> >>> Pardon my skepticism, but where is the proof of this claim (I''m >>> quite certain you know I mean no disrespect)? Solaris11 Express was >>> a massive leap in functionality and bugfixes to ZFS. I''ve seen >>> exactly nothing out of "outside of Oracle" in the time since it went >>> closed. We used to see updates bi-weekly out of Sun. Nexenta >>> spending hundreds of man-hours on a GUI and userland apps isn''t work >>> on ZFS. >>> >>> >> >> Exactly my observation as well. I haven''t seen any ZFS related >> development happening at Ilumos or Nexenta, at least not yet. > > I am quite sure you understand how pipelines work :-) > -- richard >I''m getting pretty close to my pain threshold on the BP_rewrite stuff, since not having that feature''s holding up a big chunk of work I''d like to push. If anyone outside of Oracle is working on some sort of change to ZFS that will allow arbitrary movement/placement of pre-written slabs, can they please contact me? I''m pretty much at the point where I''m going to start diving into that chunk of the source to see if there''s something little old me can do, and I''d far rather help on someone else''s implementation than have to do it myself from scratch. I''d prefer a private contact, as I realize that such work may not be ready for public discussion yet. Thanks, folks! Oh, and this is completely just me, not Oracle talking in any way. -- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20110103/d587ef77/attachment-0001.html>
On Jan 3, 2011, at 2:10 PM, Erik Trimble wrote> On 1/3/2011 8:28 AM, Richard Elling wrote: >> >> On Jan 3, 2011, at 5:08 AM, Robert Milkowski wrote: >>> On 12/26/10 05:40 AM, Tim Cook wrote: >>>> On Sat, Dec 25, 2010 at 11:23 PM, Richard Elling <richard.elling at gmail.com> wrote: >>>> >>>> There are more people outside of Oracle developing for ZFS than inside Oracle. >>>> This has been true for some time now. >>>> >>>> >>>> Pardon my skepticism, but where is the proof of this claim (I''m quite certain you know I mean no disrespect)? Solaris11 Express was a massive leap in functionality and bugfixes to ZFS. I''ve seen exactly nothing out of "outside of Oracle" in the time since it went closed. We used to see updates bi-weekly out of Sun. Nexenta spending hundreds of man-hours on a GUI and userland apps isn''t work on ZFS. >>>> >>>> >>> >>> Exactly my observation as well. I haven''t seen any ZFS related development happening at Ilumos or Nexenta, at least not yet. >> >> I am quite sure you understand how pipelines work :-) >> -- richard > > I''m getting pretty close to my pain threshold on the BP_rewrite stuff, since not having that feature''s holding up a big chunk of work I''d like to push. > > If anyone outside of Oracle is working on some sort of change to ZFS that will allow arbitrary movement/placement of pre-written slabs, can they please contact me? I''m pretty much at the point where I''m going to start diving into that chunk of the source to see if there''s something little old me can do, and I''d far rather help on someone else''s implementation than have to do it myself from scratch. > > I''d prefer a private contact, as I realize that such work may not be ready for public discussion yet. > > Thanks, folks! > > Oh, and this is completely just me, not Oracle talking in any way.Oracle doesn''t seem to say much at all :-( But for those interested, Nexenta is actively hiring people to work in this area. -- richard -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20110103/d84cc3fa/attachment.html>
On 01/ 3/11 04:28 PM, Richard Elling wrote:> On Jan 3, 2011, at 5:08 AM, Robert Milkowski wrote: > >> On 12/26/10 05:40 AM, Tim Cook wrote: >>> >>> >>> On Sat, Dec 25, 2010 at 11:23 PM, Richard Elling >>> <richard.elling at gmail.com <mailto:richard.elling at gmail.com>> wrote: >>> >>> >>> There are more people outside of Oracle developing for ZFS than >>> inside Oracle. >>> This has been true for some time now. >>> >>> >>> >>> >>> Pardon my skepticism, but where is the proof of this claim (I''m >>> quite certain you know I mean no disrespect)? Solaris11 Express was >>> a massive leap in functionality and bugfixes to ZFS. I''ve seen >>> exactly nothing out of "outside of Oracle" in the time since it went >>> closed. We used to see updates bi-weekly out of Sun. Nexenta >>> spending hundreds of man-hours on a GUI and userland apps isn''t work >>> on ZFS. >>> >>> >> >> Exactly my observation as well. I haven''t seen any ZFS related >> development happening at Ilumos or Nexenta, at least not yet. > > I am quite sure you understand how pipelines work :-) >Are you suggesting that Nexenta is developing new ZFS features behind closed doors (like Oracle...) and then will share code later-on? Somehow I don''t think so... but I would love to be proved wrong :) -- Robert Milkowski http://milek.blogspot.com -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20110104/e4112981/attachment-0001.html>
On 01/ 4/11 11:35 PM, Robert Milkowski wrote:> On 01/ 3/11 04:28 PM, Richard Elling wrote: >> On Jan 3, 2011, at 5:08 AM, Robert Milkowski wrote: >> >>> On 12/26/10 05:40 AM, Tim Cook wrote: >>>> >>>> >>>> On Sat, Dec 25, 2010 at 11:23 PM, Richard Elling >>>> <richard.elling at gmail.com <mailto:richard.elling at gmail.com>> wrote: >>>> >>>> >>>> There are more people outside of Oracle developing for ZFS than >>>> inside Oracle. >>>> This has been true for some time now. >>>> >>>> >>>> >>>> >>>> Pardon my skepticism, but where is the proof of this claim (I''m >>>> quite certain you know I mean no disrespect)? Solaris11 Express >>>> was a massive leap in functionality and bugfixes to ZFS. I''ve seen >>>> exactly nothing out of "outside of Oracle" in the time since it >>>> went closed. We used to see updates bi-weekly out of Sun. Nexenta >>>> spending hundreds of man-hours on a GUI and userland apps isn''t >>>> work on ZFS. >>>> >>>> >>> >>> Exactly my observation as well. I haven''t seen any ZFS related >>> development happening at Ilumos or Nexenta, at least not yet. >> >> I am quite sure you understand how pipelines work :-) >> > > Are you suggesting that Nexenta is developing new ZFS features behind > closed doors (like Oracle...) and then will share code later-on? > Somehow I don''t think so... but I would love to be proved wrong :)I mean I would love to see Nexenta start delivering real innovation in Solaris/Illumos kernel (zfs, networking, ...), not that I would love to see it happening behind a closed doors :) -- Robert Milkowski http://milek.blogspot.com -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20110104/c8dea63e/attachment.html>
On Mon, Jan 3, 2011 at 5:56 AM, Garrett D''Amore <garrett at nexenta.com> wrote:> On 01/ 3/11 05:08 AM, Robert Milkowski wrote: > > On 12/26/10 05:40 AM, Tim Cook wrote: > > > > On Sat, Dec 25, 2010 at 11:23 PM, Richard Elling <richard.elling at gmail.com > > wrote: > >> >> There are more people outside of Oracle developing for ZFS than inside >> Oracle. >> This has been true for some time now. >> >> >> > > Pardon my skepticism, but where is the proof of this claim (I''m quite > certain you know I mean no disrespect)? Solaris11 Express was a massive > leap in functionality and bugfixes to ZFS. I''ve seen exactly nothing out of > "outside of Oracle" in the time since it went closed. We used to see > updates bi-weekly out of Sun. Nexenta spending hundreds of man-hours on a > GUI and userland apps isn''t work on ZFS. > > > > Exactly my observation as well. I haven''t seen any ZFS related development > happening at Ilumos or Nexenta, at least not yet. > > > Just because you''ve not seen it yet doesn''t imply it isn''t happening. > Please be patient. > > - Garrett >Or, conversely, don''t make claims of all this code contribution prior to having anything to show for your claimed efforts. Duke Nukem Forever was going to be the greatest video game ever created... we were told to "be patient"... we''re still waiting for that too. --Tim -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20110104/bcab0250/attachment-0001.html>
On Tue, Jan 4, 2011 at 8:21 PM, Garrett D''Amore <garrett at nexenta.com> wrote:> On 01/ 4/11 09:15 PM, Tim Cook wrote: > > > > On Mon, Jan 3, 2011 at 5:56 AM, Garrett D''Amore <garrett at nexenta.com>wrote: > >> On 01/ 3/11 05:08 AM, Robert Milkowski wrote: >> >> On 12/26/10 05:40 AM, Tim Cook wrote: >> >> >> >> On Sat, Dec 25, 2010 at 11:23 PM, Richard Elling < >> richard.elling at gmail.com> wrote: >> >>> >>> There are more people outside of Oracle developing for ZFS than inside >>> Oracle. >>> This has been true for some time now. >>> >>> >>> >> >> Pardon my skepticism, but where is the proof of this claim (I''m quite >> certain you know I mean no disrespect)? Solaris11 Express was a massive >> leap in functionality and bugfixes to ZFS. I''ve seen exactly nothing out of >> "outside of Oracle" in the time since it went closed. We used to see >> updates bi-weekly out of Sun. Nexenta spending hundreds of man-hours on a >> GUI and userland apps isn''t work on ZFS. >> >> >> >> Exactly my observation as well. I haven''t seen any ZFS related development >> happening at Ilumos or Nexenta, at least not yet. >> >> >> Just because you''ve not seen it yet doesn''t imply it isn''t happening. >> Please be patient. >> >> - Garrett >> > > > Or, conversely, don''t make claims of all this code contribution prior to > having anything to show for your claimed efforts. Duke Nukem Forever was > going to be the greatest video game ever created... we were told to "be > patient"... we''re still waiting for that too. > > > > Um, have you not been paying attention? I''ve delivered quite a lot of > contribution to illumos already, just not in ZFS. Take a close look -- > there almost certainly wouldn''t *be* an open source version of OS/Net had I > not done the work to enable this in libc, kernel crypto, and other bits. > This work is still higher priority than ZFS innovation for a variety of > reasons -- mostly because we need a viable and supportable illumos upon > which to build those ZFS innovations. > > That said, much of the ZFS work I hope to contribute to illumos needs more > baking, but some of it is already open source in NexentaStor. (You can for > a start look at zfs-monitor, the WORM support, and support for hardware GZIP > acceleration all as things that Nexenta has innovated in ZFS, and which are > open source today if not part of illumos. Check out > http://www.nexenta.org for source code access.) > > So there, money placed where mouth is. You? > > - Garrett > > >The claim was that there are more people contributing code from outside of Oracle than inside to zfs. Your contributions to Illumos do absolutely nothing to backup that claim. ZFS-monitor is not ZFS code (it''s an FMA module), WORM also isn''t ZFS code, it''s an OS level operation, and GZIP hardware acceleration is produced by Indra networks, and has absolutely nothing to do with ZFS. Does it help ZFS? Sure, but that''s hardly a code contribution to ZFS when it''s simply a hardware acceleration card that accelerates ALL gzip code. So, great job picking three projects that are not proof of developers working on ZFS. And great job not providing any proof to the claim there are more developers working on ZFS outside of Oracle than within. You''re going to need a hell of a lot bigger bank account to cash the check than what you''ve got. As for me, I don''t recall making any claims on this list that I can''t back up, so I''m not really sure what you''re getting at. 
I can only assume the defensive tone of your email is because you''ve been called out and can''t backup the claims either. So again: if you''ve got code in the works, great. Talk about it when it''s ready. Stop throwing out baseless claims that you have no proof of and then fall back on "just be patient, it''s coming". We''ve heard that enough from Oracle and Sun already. --Tim -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20110104/c2f67bca/attachment.html>
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Tim Cook
>
> The claim was that there are more people contributing code from outside of
> Oracle than inside to zfs. Your contributions to Illumos do absolutely nothing

Guys, please let''s just say this much: To all those who are contributing to the open-source ZFS code, freebsd, illumos project, and others, thank you very much. :-) We know certain things are stable and production ready; there has not yet been much forward development after zpool 28, but the effort is well appreciated, and for whatever comes next, yes we can all be patient.

Right now, Oracle is not contributing at all to the open source branches of any of these projects. So right now it''s fair to say the non-oracle contributions to the OPEN SOURCE ZFS outweigh the nonexistent oracle contributions. However, Oracle is continuing to develop the closed-source ZFS.

I don''t know if anyone has real numbers, dollars contributed or number of developer hours etc, but I think it''s fair to say that oracle is probably contributing more to the closed source ZFS right now than the rest of the world is contributing to the open source ZFS right now. Also, we know that the closed source ZFS right now is more advanced than the open source ZFS (zpool 31 vs 28). Oracle closed source ZFS is ahead, and probably developing faster too, than the open source ZFS right now.

If anyone has any good way to draw more contributors into the open source tree, that would also be useful and appreciated. Gosh, it would be nice to get major players like Dell, HP, IBM, Apple contributing into that project.
Edward Ned Harvey wrote:

> I don''t know if anyone has real numbers, dollars contributed or number of
> developer hours etc, but I think it''s fair to say that oracle is probably
> contributing more to the closed source ZFS right now, than the rest of the
> world is contributing to the open source ZFS right now. Also, we know that
> the closed source ZFS right now is more advanced than the open source ZFS
> (zpool 31 vs 28). Oracle closed source ZFS is ahead, and probably
> developing faster too, than the open source ZFS right now.
>
> If anyone has any good way to draw more contributors into the open source
> tree, that would also be useful and appreciated. Gosh, it would be nice to
> get major players like Dell, HP, IBM, Apple contributing into that project.

This is something Illumos/open source ZFS needs to decide for itself: what does it want? Effectively we can''t innovate ZFS without breaking compatibility, because our Illumos zpool version 29 (if we innovate) will not be Oracle zpool version 29. If we want open-source ZFS to innovate, we need to make that choice and let everyone know; apart from submitting bug fixes to zpool v28, I''m not sure if other changes would be welcome.

So honestly, do we want to innovate ZFS (I do) or do we just want to follow Oracle?

Bye,
Deano
deano at cloudpixies.com
> From: Deano [mailto:deano at rattie.demon.co.uk] > Sent: Wednesday, January 05, 2011 9:16 AM > > So honestly do we want to innovate ZFS (I do) or do we just want to follow > Oracle?Well, you can''t follow Oracle. Unless you wait till they release something, reverse engineer it, and attempt to reimplement it. I am quite sure you''ll be sued if you do that. If you want forward development in the open source tree, you basically have only one option: Some major contributor must have a financial interest, and commit to a real concerted development effort, with their own roadmap, which is intentionally designed NOT to overlap with the Oracle roadmap. Otherwise, the code will stagnate. I am rooting for the open source projects, but I''m not optimistic personally. I think all major contributors (IBM, Apple, etc) will not participate for various reasons, and as a result, we''ll experience bit rot... As presently evident by lack of zpool advancement beyond 28. So in my mind, Oracle and ZFS are now just like netapp and wafl. Well... I prefer Solaris and ZFS over netapp and wafl... So whenever I would have otherwise bought a netapp, I''ll still buy the solaris server instead... But it''s no longer a competitor against ubuntu or centos. Just the way Larry wants it.
We do have a major commercial interest - Nexenta. It''s been quiet but I do look forward to seeing something come out of that stable this year? :-) --- W. A. Khushil Dep - khushil.dep at gmail.com - 07905374843 Visit my blog at http://www.khushil.com/ On 5 January 2011 14:34, Edward Ned Harvey < opensolarisisdeadlongliveopensolaris at nedharvey.com> wrote:> > From: Deano [mailto:deano at rattie.demon.co.uk] > > Sent: Wednesday, January 05, 2011 9:16 AM > > > > So honestly do we want to innovate ZFS (I do) or do we just want to > follow > > Oracle? > > Well, you can''t follow Oracle. Unless you wait till they release > something, > reverse engineer it, and attempt to reimplement it. I am quite sure you''ll > be sued if you do that. > > If you want forward development in the open source tree, you basically have > only one option: Some major contributor must have a financial interest, > and > commit to a real concerted development effort, with their own roadmap, > which > is intentionally designed NOT to overlap with the Oracle roadmap. > Otherwise, the code will stagnate. > > I am rooting for the open source projects, but I''m not optimistic > personally. I think all major contributors (IBM, Apple, etc) will not > participate for various reasons, and as a result, we''ll experience bit > rot... As presently evident by lack of zpool advancement beyond 28. > > So in my mind, Oracle and ZFS are now just like netapp and wafl. Well... > I > prefer Solaris and ZFS over netapp and wafl... So whenever I would have > otherwise bought a netapp, I''ll still buy the solaris server instead... > But > it''s no longer a competitor against ubuntu or centos. > > Just the way Larry wants it. > > _______________________________________________ > zfs-discuss mailing list > zfs-discuss at opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20110105/a82030a0/attachment.html>
On Wed, Jan 5, 2011 at 15:34, Edward Ned Harvey <opensolarisisdeadlongliveopensolaris at nedharvey.com> wrote:>> From: Deano [mailto:deano at rattie.demon.co.uk] >> Sent: Wednesday, January 05, 2011 9:16 AM >> >> So honestly do we want to innovate ZFS (I do) or do we just want to follow >> Oracle? > > Well, you can''t follow Oracle. ?Unless you wait till they release something, > reverse engineer it, and attempt to reimplement it.that''s not my understanding - while we will have to wait, oracle is supposed to release *some* source code afterwards to satisfy some claim or other. I agree, some would argue that that should have already happened with S11 express... I don''t know it has, but that''s not *the* release of S11, is it? And once the code is released, even if after the fact, it''s not reverse-engineering anymore, is it? Michael PS: just in case: even while at Oracle, I had no insight into any of these plans, much less do I have now. -- regards/mit freundlichen Gr?ssen Michael Schuster
> -----Original Message----- > From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss- > bounces at opensolaris.org] On Behalf Of Michael Schuster > Sent: Wednesday, January 05, 2011 9:42 AM > To: Edward Ned Harvey > Cc: zfs-discuss at opensolaris.org > Subject: Re: [zfs-discuss] A few questions > > On Wed, Jan 5, 2011 at 15:34, Edward Ned Harvey > <opensolarisisdeadlongliveopensolaris at nedharvey.com> wrote: > >> From: Deano [mailto:deano at rattie.demon.co.uk] > >> Sent: Wednesday, January 05, 2011 9:16 AM > >> > >> So honestly do we want to innovate ZFS (I do) or do we just want to > follow > >> Oracle? > > > > Well, you can''t follow Oracle. ?Unless you wait till they release something, > > reverse engineer it, and attempt to reimplement it. > > that''s not my understanding - while we will have to wait, oracle is > supposed to release *some* source code afterwards to satisfy some > claim or other. I agree, some would argue that that should have > already happened with S11 express... I don''t know it has, but that''s > not *the* release of S11, is it? And once the code is released, even > if after the fact, it''s not reverse-engineering anymore, is it?Not exactly. Oracle hasn''t publicly committed to anything like that. There were several news articles last year referencing a leaked internal memo that I believe was more of a proposal than a plan. Even if Oracle did ''commit'' to releasing code, they could easily just decide not to. -Will
> From: Michael Schuster [mailto:michaelsprivate at gmail.com]
>
> > Well, you can''t follow Oracle. Unless you wait till they release something,
> > reverse engineer it, and attempt to reimplement it.
>
> that''s not my understanding - while we will have to wait, oracle is
> supposed to release *some* source code afterwards to satisfy some

Where do you get that from? AFAIK, there is no official word about oracle opening anything moving forward, but there are plenty of unofficial reports that it will not be opened. Nobody in the field is holding any hope for that to change anymore, most importantly illumos and nexenta. (At least with regards to ZFS and all the other projects relevant to solaris.)

I know in the case of SGE/OGE, it''s officially closed source now. As of Dec 31st, sunsource is being decommissioned, and the announcement of officially closing the SGE source and decommissioning the open source community went out on Dec 24th.

So all of this leads me to believe, with very little reservation, that the new developments beyond zpool 28 are closed source moving forward. There''s very little breathing room remaining for hope of that being open sourced again.
> From: Khushil Dep [mailto:khushil.dep at gmail.com] > > We do have a major commercial interest - Nexenta. It''s been quiet but I do > look forward to seeing something come out of that stable this year? :-)I''ll agree to call Nexenta "a major commerical interest," in regards to contribution to the open source ZFS tree, if they become an officially supported OS on Dell, HP, and/or IBM hardware. Otherwise, they''re just simply too small to keep up with the rate of development of the closed source ZFS tree, and destined to be left in the dust. And if Nexenta does become a seriously viable competitor against netapp and oracle... Watch out for lawsuits...
On 01/ 4/11 11:48 PM, Tim Cook wrote:> > > On Tue, Jan 4, 2011 at 8:21 PM, Garrett D''Amore <garrett at nexenta.com > <mailto:garrett at nexenta.com>> wrote: > > On 01/ 4/11 09:15 PM, Tim Cook wrote: >> >> >> On Mon, Jan 3, 2011 at 5:56 AM, Garrett D''Amore >> <garrett at nexenta.com <mailto:garrett at nexenta.com>> wrote: >> >> On 01/ 3/11 05:08 AM, Robert Milkowski wrote: >>> On 12/26/10 05:40 AM, Tim Cook wrote: >>>> >>>> >>>> On Sat, Dec 25, 2010 at 11:23 PM, Richard Elling >>>> <richard.elling at gmail.com >>>> <mailto:richard.elling at gmail.com>> wrote: >>>> >>>> >>>> There are more people outside of Oracle developing for >>>> ZFS than inside Oracle. >>>> This has been true for some time now. >>>> >>>> >>>> >>>> >>>> Pardon my skepticism, but where is the proof of this claim >>>> (I''m quite certain you know I mean no disrespect)? >>>> Solaris11 Express was a massive leap in functionality and >>>> bugfixes to ZFS. I''ve seen exactly nothing out of "outside >>>> of Oracle" in the time since it went closed. We used to >>>> see updates bi-weekly out of Sun. Nexenta spending >>>> hundreds of man-hours on a GUI and userland apps isn''t work >>>> on ZFS. >>>> >>>> >>> >>> Exactly my observation as well. I haven''t seen any ZFS >>> related development happening at Ilumos or Nexenta, at least >>> not yet. >> >> Just because you''ve not seen it yet doesn''t imply it isn''t >> happening. Please be patient. >> >> - Garrett >> >> >> >> Or, conversely, don''t make claims of all this code contribution >> prior to having anything to show for your claimed efforts. Duke >> Nukem Forever was going to be the greatest video game ever >> created... we were told to "be patient"... we''re still waiting >> for that too. > > > Um, have you not been paying attention? I''ve delivered quite a > lot of contribution to illumos already, just not in ZFS. Take a > close look -- there almost certainly wouldn''t *be* an open source > version of OS/Net had I not done the work to enable this in libc, > kernel crypto, and other bits. This work is still higher priority > than ZFS innovation for a variety of reasons -- mostly because we > need a viable and supportable illumos upon which to build those > ZFS innovations. > > That said, much of the ZFS work I hope to contribute to illumos > needs more baking, but some of it is already open source in > NexentaStor. (You can for a start look at zfs-monitor, the WORM > support, and support for hardware GZIP acceleration all as things > that Nexenta has innovated in ZFS, and which are open source today > if not part of illumos. Check out http://www.nexenta.org for > source code access.) > > So there, money placed where mouth is. You? > > - Garrett > > > > The claim was that there are more people contributing code from > outside of Oracle than inside to zfs. Your contributions to Illumos > do absolutely nothing to backup that claim. ZFS-monitor is not ZFS > code (it''s an FMA module), WORM also isn''t ZFS code, it''s an OS level > operation, and GZIP hardware acceleration is produced by Indra > networks, and has absolutely nothing to do with ZFS. Does it help > ZFS? Sure, but that''s hardly a code contribution to ZFS when it''s > simply a hardware acceleration card that accelerates ALL gzip code.Um... you have obviously not looked at the code. Our WORM code is not some basic OS guarantees on top of ZFS, but modifications to the ZFS code itself so that ZFS *itself* honors the WORM property, which is implemented as a property on the ZFS filesystem. 
Likewise, the GZIP hardware acceleration support includes specific modifications to the ZFS kernel filesystem code. Of course, we''ve not done anything major to change the fundamental way that ZFS stores data... is that what you''re talking about? I think you must have a very narrow idea of what constitutes an "innovation" in ZFS.> > So, great job picking three projects that are not proof of developers > working on ZFS. And great job not providing any proof to the claim > there are more developers working on ZFS outside of Oracle than within.Nexenta don''t represent that majority actually. A large number of ZFS folks -- people with names like Leventhal, Ahrens, Wilson, and Gregg, are working on ZFS related work at Delphix and Joyent, or so I''ve been told. I don''t have first hand knowledge of *what* the details are, but I''m looking forward to seeing the results. This ignores the contributions from people working on ZFS on other platforms as well. Of course, since I know longer work there, I don''t really know how many people Oracle still has working on ZFS. They could have tasked 1,000 people with it. Or they could have shut the project down entirely. But of the people who had, up until Oracle shut down the open code, made non-trivial contributions to ZFS, I think the majority of *those* people can be found working outside of Oracle now, and I think most of them are still working on ZFS projects. (There are a few "big names" that I don''t know what they are doing precisely -- e.g. Jeff Bonwick.)> > You''re going to need a hell of a lot bigger bank account to cash the > check than what you''ve got. As for me, I don''t recall making any > claims on this list that I can''t back up, so I''m not really sure what > you''re getting at. I can only assume the defensive tone of your email > is because you''ve been called out and can''t backup the claims either. > > So again: if you''ve got code in the works, great. Talk about it when > it''s ready. Stop throwing out baseless claims that you have no proof > of and then fall back on "just be patient, it''s coming". We''ve heard > that enough from Oracle and Sun already.Ok, I''ll shut up now. But I''m going to completely ignore anything else you have to say on this topic, as I have a lot more knowledge of what we''re doing at Nexenta than you have. - Garrett -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20110105/7e52096e/attachment.html>
Edward Ned Harvey wrote:

> From: Deano [mailto:deano at rattie.demon.co.uk]
> Sent: Wednesday, January 05, 2011 9:16 AM
>
> So honestly do we want to innovate ZFS (I do) or do we just want to follow
> Oracle?

> Well, you can''t follow Oracle. Unless you wait till they release something,
> reverse engineer it, and attempt to reimplement it. I am quite sure you''ll
> be sued if you do that.
>
> If you want forward development in the open source tree, you basically have
> only one option: Some major contributor must have a financial interest, and
> commit to a real concerted development effort, with their own roadmap, which
> is intentionally designed NOT to overlap with the Oracle roadmap.
> Otherwise, the code will stagnate.

Why does it need a big backer? Erm, ZFS isn''t that large or amazingly complex code. It is *good* code, but did it take 100s of developers and a fortune to develop? Erm, nope (which I''d bet it never had at Sun either). Why not overlap Oracle? What has it got to do with Oracle if we have split into ZFS (Oracle) and "OpenZFS" in future? "OpenZFS" will get whatever features developers feel they want or need to develop for it.

This is the fundamental choice of open source ZFS: illumos and OpenIndiana (and other distributions) have to decide what their purpose is. Is it a free compatible (though trailing) version of Oracle Solaris, OR a platform that shared an ancestor with Oracle Solaris via Sun OpenSolaris but now is its own evolutionary species, with no more connection than I have with a 15th cousin removed on my great, great, great grandfather''s side?

This isn''t even a theoretical what-if situation for me. I have a major modification to ZFS (still being developed); it has no basis in Oracle''s or anybody else''s needs, just mine. It is what I felt I needed, and ZFS was the right base for it. Now, will that go into "OpenZFS"? Honestly I don''t know yet, because I''m not sure it would be wanted (it will be incompatible with Oracle ZFS), and personally, commercially, I''m not sure if it''s the right move to open source the feature.

I bet I''m not the only small developer out there in a similar situation. The landscape is very unclear about what actually the community wants to do going forward, and whether we will have or even want "OpenZFS" and Oracle ZFS, or Oracle ZFS and 90% compatibles (always trailing), or Oracle ZFS + DevA ZFS + DevB ZFS + DevC ZFS.

Bye,
Deano
deano at cloudpixies.com
On Jan 5, 2011, at 7:44 AM, Edward Ned Harvey wrote:>> From: Khushil Dep [mailto:khushil.dep at gmail.com] >> >> We do have a major commercial interest - Nexenta. It''s been quiet but I do >> look forward to seeing something come out of that stable this year? :-) > > I''ll agree to call Nexenta "a major commerical interest," in regards to contribution to the open source ZFS tree, if they become an officially supported OS on Dell, HP, and/or IBM hardware.NexentaStor is officially supported on Dell, HP, and IBM hardware. The only question is, "what is your definition of ''support''"? Many NexentaStor customers today appear to be deploying on SuperMicro and Quanta systems, for obvious cost reasons. Nexenta has good working relationships with these major vendors and others. As for investment, Nexenta has been and continues to hire the best engineers and professional services people we can find. We see a lot of demand in the market and have been growing at an astonishing rate. If you''d like to contribute to making software storage solutions rather than whining about what Oracle won''t discuss, check us out and send me your resume :-) -- richard
> From: Richard Elling [mailto:Richard.Elling at Nexenta.com] > > > I''ll agree to call Nexenta "a major commerical interest," in regards to > contribution to the open source ZFS tree, if they become an officially > supported OS on Dell, HP, and/or IBM hardware. > > NexentaStor is officially supported on Dell, HP, and IBM hardware. Theonly> question is, "what is your definition of ''support''"? Many NexentaStorI don''t want to argue about this, but I''ll just try to clarify what I meant: Presently, I have a dell server with officially supported solaris, and it''s as unreliable as pure junk. It''s just the backup server, so I''m free to frequently create & destroy it... And as such, I frequently do recreate and destroy it. It is entirely stable running RHEL (centos) because Dell and RedHat have a partnership with a serious number of human beings and machines looking for and fixing any compatibility issues. For my solaris instability, I blame the fact that solaris developers don''t do significant quality assurance on non-sun hardware. To become "officially" compatible, the whole qualification process is like this: Somebody installs it, doesn''t see any problems, and then calls it "certified." They reformat with something else, and move on. They don''t build their business on that platform, so they don''t detect stability issues like the ones reported... System crashes once per week and so forth. Solaris therefore passes the test, and becomes one of the options available on the drop-down menu for OSes with a new server. (Of course that''s been discontinued by oracle, but that''s how it was in the past.) Developers need to "eat their own food." Smoke your own crack. Hardware engineers at Dell need to actually use your OS on their hardware, for their development efforts. I would be willing to bet Sun hardware engineers use a significant percentage of solaris servers for their work... And guess what solaris engineers don''t use? Non-sun hardware. Pretty safe bet you won''t find any Dell servers in the server room where solaris developers do their thing. If you want to be taken seriously as an alternative storage option, you''ve got to at LEAST be listed as a factory-distributed OS that is an option to ship with the new server, and THEN, when people such as myself buy those things, we''ve got to have a good enough experience that we don''t all bitch and flame about it afterward. Nexenta, you need a real and serious partnership with Dell, HP, IBM. Get their developers to run YOUR OS on the servers which they use for development. Get them to sell your product bundled with their product. And dedicate real and serious engineering into bugfixes working with customers, to truly identify root causes of instability, with a real OS development and engineering and support group. It''s got to be STABLE, that''s the #1 requirement. I previously made the comparison... Even close-source solaris & ZFS is a better alternative to close-source netapp & wafl. So for now, those are the only two enterprise supportable options I''m willing to stake my career on, and I''ll buy Sun hardware with Solaris. But I really wish I could feel confident buying a cheaper Dell server and running ZFS on it. Nexenta, if you make yourself look like a serious competitor against solaris, and really truly form an awesome stable partnership with Dell, I will happily buy your stuff instead of Oracle. Even if you are a little behind in feature offering. But I will not buy your stuff if I can''t feel perfectly confident in its stability. 
Ever heard the phrase "Nobody ever got fired for buying IBM"? You're the little guys. If you want to compete against the big guys, you've got to kick ass. And don't get sued into oblivion.

Even today's feature set is perfectly adequate for at least a couple of years to come. If you put all your effort into stability and bugfixes, serious partnerships with Dell, HP, and IBM, and become extremely professional-looking and stable, with fanatical support... you don't have to worry about new feature development for some while. Stability is #1, and not disappearing is the other big one; it's a pretty huge threat right now.

Based on my experience, I would not recommend buying Dell with Solaris, even if that were still an option. If you want Solaris, buy Sun/Oracle hardware, because then you can actually expect it to work reliably. And if Solaris isn't stable on Dell... then all the Solaris derivatives, including Nexenta, can't be trusted either, no matter how much you claim it's "supported."

Show me the HCL, and show me the partnership between your software engineers and Dell's hardware engineers. Make me believe there is a serious and thorough qualification process. Do a huge volume. Your volume must be large enough to justify dedicating some engineers to serious bugfix efforts in the field. Otherwise... when I need to buy something stable... I'm going to buy Solaris on Sun hardware, because I know that's thoroughly tried, tested, and stable.
On Jan 5, 2011, at 4:14 PM, Edward Ned Harvey wrote:

>> From: Richard Elling [mailto:Richard.Elling at Nexenta.com]
>>
>>> I'll agree to call Nexenta "a major commercial interest," in regards to
>>> contribution to the open source ZFS tree, if they become an officially
>>> supported OS on Dell, HP, and/or IBM hardware.
>>
>> NexentaStor is officially supported on Dell, HP, and IBM hardware. The only
>> question is, "what is your definition of 'support'"? Many NexentaStor
>
> I don't want to argue about this, but I'll just try to clarify what I meant:
>
> Presently, I have a Dell server with officially supported Solaris, and it's
> as unreliable as pure junk. It's just the backup server, so I'm free to
> frequently create & destroy it... and as such, I frequently do recreate and
> destroy it. It is entirely stable running RHEL (CentOS), because Dell and
> Red Hat have a partnership with a serious number of human beings and
> machines looking for and fixing compatibility issues. For my Solaris
> instability, I blame the fact that Solaris developers don't do significant
> quality assurance on non-Sun hardware. To become "officially" compatible,
> the whole qualification process is like this: somebody installs it, doesn't
> see any problems, and calls it "certified." They reformat with something
> else and move on. They don't build their business on that platform, so they
> don't detect stability issues like the ones reported... system crashes once
> per week and so forth. Solaris therefore passes the test and becomes one of
> the options available on the drop-down menu of OSes for a new server. (Of
> course that's been discontinued by Oracle, but that's how it was in the past.)

If I understand correctly, you want Dell, HP, and IBM to run OSes other than Microsoft and RHEL. For the thousands of other OSes out there, this is a significant barrier to entry. One can argue that the most significant innovations in the past 5 years came from none of those companies -- they came from Google, Apple, Amazon, Facebook, and the other innovators who did not spend their efforts trying to beat Microsoft and get onto the manufacturing floors of the big vendors.

> Developers need to "eat their own food."

I agree, but neither Dell, HP, nor IBM develops Windows...

> Smoke your own crack. Hardware engineers at Dell need to actually use your
> OS on their hardware, for their development efforts. I would be willing to
> bet Sun hardware engineers use a significant percentage of Solaris servers
> for their work... and guess what Solaris engineers don't use? Non-Sun hardware.

I'm not sure of the current state, but many of the Solaris engineers develop on laptops, and Sun did not offer a laptop product line.

> Pretty safe bet you won't find any Dell servers in the server room where
> Solaris developers do their thing.

You will find them where Nexenta developers live :-)

> If you want to be taken seriously as an alternative storage option, you've
> got to at LEAST be listed as a factory-distributed OS that is an option to
> ship with a new server, and THEN, when people such as myself buy those
> things, we've got to have a good enough experience that we don't all bitch
> and flame about it afterward.

Wait a minute... this is patently false. The big storage vendors -- NetApp, EMC, Hitachi, Fujitsu, LSI -- none of them run on HP, IBM, or Dell servers.

> Nexenta, you need a real and serious partnership with Dell, HP, and IBM.
> Get their developers to run YOUR OS on the servers they use for development.
> Get them to sell your product bundled with their product. And dedicate real
> and serious engineering to bugfixes, working with customers to truly
> identify root causes of instability, with a real OS development,
> engineering, and support group. It's got to be STABLE; that's the #1
> requirement.

There are many marketing activities in progress towards this end. One of Nexenta's major OEMs (Compellent) is being purchased by Dell. The deal is not done, so there is no public information on future plans, to my knowledge.

> I previously made the comparison... even closed-source Solaris & ZFS is a
> better alternative to closed-source NetApp & WAFL. So for now, those are the
> only two enterprise-supportable options I'm willing to stake my career on,
> and I'll buy Sun hardware with Solaris. But I really wish I could feel
> confident buying a cheaper Dell server and running ZFS on it. Nexenta, if
> you make yourself look like a serious competitor against Solaris, and really
> truly form an awesome, stable partnership with Dell, I will happily buy your
> stuff instead of Oracle's, even if you are a little behind in feature
> offering. But I will not buy your stuff if I can't feel perfectly confident
> in its stability.

I can assure you that we take stability very seriously. And since you seem to think the big-box vendors are infallible, here is a sampling of the things we (Nexenta) have to live with:

http://h20000.www2.hp.com/bizsupport/TechSupport/SoftwareDescription.jsp?lang=en&cc=us&prodTypeId=329290&prodSeriesId=3690351&swItem=MTX-d56eb5d75f03485dbc32680f62&prodNameId=4094976&swEnvOID=4024&swLang=13&taskId=135&mode=4&idx=2
http://www.intel.com/assets/pdf/specupdate/321324.pdf
http://support.citrix.com/article/CTX127395
http://lists.us.dell.com/pipermail/linux-poweredge/2010-May/042280.html
http://support.dell.com/support/topics/global.aspx/support/kcs/document?c=us&l=en&s=gen&docid=DSN_619147E926299297E040AE0AB8E103AE&isLegacy=true

If you look very far, you will find that all vendors have issues, and at the end of the day, vendors who integrate other people's products (HP, Dell, IBM) are subject to the same issues the rest of the industry sees. So when you complain about stability issues, it is incumbent on you to identify the responsible vendor or supplier. There is no one-stop shop in the x86 market, and there hasn't been one for the past 3 decades.

> Ever heard the phrase "Nobody ever got fired for buying IBM"? You're the
> little guys. If you want to compete against the big guys, you've got to
> kick ass. And don't get sued into oblivion.

Yes, and since you're finished, you can return that copy of the Dummies Guide to Business to the library.

> Even today's feature set is perfectly adequate for at least a couple of
> years to come. If you put all your effort into stability and bugfixes,
> serious partnerships with Dell, HP, and IBM, and become extremely
> professional-looking and stable, with fanatical support... you don't have to
> worry about new feature development for some while. Stability is #1, and
> not disappearing is the other big one; it's a pretty huge threat right now.

I think everyone will agree that stability is important. Since I've been at Nexenta, I am pleasantly surprised by the lack of panics or data loss.
The current rate at Nexenta is far lower than the rates I saw at Sun (and yes, I did have access to the data).

> Based on my experience, I would not recommend buying Dell with Solaris, even
> if that were still an option. If you want Solaris, buy Sun/Oracle hardware,
> because then you can actually expect it to work reliably. And if Solaris
> isn't stable on Dell... then all the Solaris derivatives, including Nexenta,
> can't be trusted either, no matter how much you claim it's "supported."

Oracle "solves" this problem by not making the support details public... you have to have an account and a service contract to see the dirty-laundry details.

> Show me the HCL, and show me the partnership between your software engineers
> and Dell's hardware engineers.

Uhm... what makes you think Dell invests in hardware development? Dell is a manufacturer and spends very little on product development.

http://stocks.investopedia.com/stock-analysis/jimcramer/CramersMadMoneyRecapItsAlmost2011onWallStreetsCalendarUpdate3.aspx

NB: much of Dell's innovation is in business systems and manufacturing, a good thing, but they are not known for pure research, software development, or product development beyond hardware integration.

> Make me believe there is a serious and thorough qualification process. Do a
> huge volume. Your volume must be large enough to justify dedicating some
> engineers to serious bugfix efforts in the field. Otherwise... when I need
> to buy something stable... I'm going to buy Solaris on Sun hardware, because
> I know that's thoroughly tried, tested, and stable.

In a former life I worked in the Quality Office at Sun. I'm delighted that you have such a fondness for the products. They are quite good. Of course, NexentaStor works quite nicely on Oracle's Sun x86 systems :-)
 -- richard
On Wed, 5 Jan 2011, Edward Ned Harvey wrote:

> with regards to ZFS and all the other projects relevant to Solaris.)
>
> I know in the case of SGE/OGE, it's officially closed source now. As of Dec
> 31st, sunsource is being decommissioned, and the announcement of officially
> closing the SGE source and decommissioning the open source community went
> out on Dec 24th. So all of this leads me to believe, with very little
> reservation, that the new developments beyond zpool 28 are closed source
> moving forward. There's very little breathing room remaining for hope of
> that being open sourced again.

I have no idea what you are talking about. Best I can tell, SGE/OGE is a reference to Sun Grid Engine, which has nothing to do with ZFS. The only announcement and discussion I can find via Google is written by you. It was pretty clear even a year ago that Sun Grid Engine was going away.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
On 06/01/2011 00:14, Edward Ned Harvey wrote:

> Solaris engineers don't use? Non-Sun hardware. Pretty safe bet you won't
> find any Dell servers in the server room where Solaris developers do their
> thing.

You would lose that bet: not only would you find Dell, you would find many other "big names", as well as white-box hand-built systems too. Solaris developers use a lot of different hardware. Sun never made laptops, so many of us have Apple (running Solaris on the metal and/or under virtualisation), Toshiba, or Fujitsu laptops. There are also many workstations around the company that aren't Sun hardware, as well as servers.

--
Darren J Moffat
I've deployed large SANs on both SuperMicro 825/826/846 and Dell R610/R710s, and I've not found any issues so far. I always make a point of installing Intel chipset NICs on the Dells and disabling the Broadcom ones, but other than that it's always been plain sailing -- hardware-wise, anyway.

I've always found that the real issue is formulating SOPs to match what the organisation is used to with legacy storage systems, educating the admins who will manage it going forward, and doing the technical hand-over to folks who may not know, or want to know, a whole lot of *nix land.

My 2p. YMMV.

---
W. A. Khushil Dep - khushil.dep at gmail.com - 07905374843
Windows - Linux - Solaris - ZFS - Nexenta - Development - Consulting & Contracting
http://www.khushil.com/ - http://www.facebook.com/GlobalOverlord
> From: Richard Elling [mailto:Richard.Elling at Nexenta.com]
>
> If I understand correctly, you want Dell, HP, and IBM to run OSes other than
> Microsoft and RHEL.
>
> I agree, but neither Dell, HP, nor IBM develops Windows...
>
> I'm not sure of the current state, but many of the Solaris engineers develop
> on laptops, and Sun did not offer a laptop product line.
>
> You will find them where Nexenta developers live :-)
>
> Wait a minute... this is patently false. The big storage vendors -- NetApp,
> EMC, Hitachi, Fujitsu, LSI -- none of them run on HP, IBM, or Dell servers.

Like I said, I'm not interested in arguing. This is mostly just a bunch of contradictions to what I said. To each his own.

My conclusion is that I am not willing to stake my career on the underdog alternative when I know I can safely buy the Sun hardware and Solaris. I experimented once by buying Solaris on Dell. It was a proven failure, but that's why I did it on a cheap, noncritical backup system experimentally, before expecting it to work in production. Haven't seen any underdog proven solid enough for me to deploy in enterprise yet.
This is a silly argument, but...

> Haven't seen any underdog proven solid enough for me to deploy in
> enterprise yet.

I haven't seen any "over"dog proven solid enough for me to be able to rely on either. Certainly not Solaris.

Don't get me wrong, I like(d) Solaris. But every so often you'd find a bug and they'd take an age to fix it (or to declare that they wouldn't fix it). In one case we had 18 months between reporting a problem and Sun fixing it. In another case it was around 3 months, and because we happened to have the source code, we even told them where the bug was and what a fix could be.

Solaris (and the other "over"dogs) are worth it when you want someone else to do the grunt work and someone else to point at and blame, but let's not romanticize how good it or any of the others are. What made Solaris (10 at least) worth deploying were its features (DTrace, ZFS, SMF, etc.).

Julian
--
Julian King
Computer Officer, University of Cambridge, Unix Support
> From: Bob Friesenhahn [mailto:bfriesen at simple.dallas.tx.us]
>
> On Wed, 5 Jan 2011, Edward Ned Harvey wrote:
>> with regards to ZFS and all the other projects relevant to Solaris.)
>>
>> I know in the case of SGE/OGE, it's officially closed source now. As of Dec
>> 31st, sunsource is being decommissioned, and the announcement of officially
>> closing the SGE source and decommissioning the open source community went
>> out on Dec 24th. So all of this leads me to believe, with very little
>> reservation, that the new developments beyond zpool 28 are closed source
>> moving forward. There's very little breathing room remaining for hope of
>> that being open sourced again.
>
> I have no idea what you are talking about. Best I can tell, SGE/OGE is a
> reference to Sun Grid Engine, which has nothing to do with ZFS. The only
> announcement and discussion I can find via Google is written by you. It was
> pretty clear even a year ago that Sun Grid Engine was going away.

Agreed, SGE/OGE has nothing to do with ZFS, unless you believe there's an Oracle culture which might apply to both. The only thing written by me, as I recall, included links to the original official announcements. Following those links now, I see the archives have been decommissioned. So there ya go.

Since it's still in my inbox, I just saved a copy for you here... It is long-winded, and the main points are: SGE (now called OGE) is officially closed source, and sunsource.net has been decommissioned. There is an open source fork, which will not share code development with the closed-source product.

http://dl.dropbox.com/u/543241/SGE_officially_closed/GE%20users%20GE%20announce%20Changes%20for%20a%20Bright%20Future%20at%20Oracle.txt
> From: Khushil Dep [mailto:khushil.dep at gmail.com]
>
> I've deployed large SANs on both SuperMicro 825/826/846 and Dell
> R610/R710s, and I've not found any issues so far. I always make a point of
> installing Intel chipset NICs on the Dells and disabling the Broadcom ones,
> but other than that it's always been plain sailing -- hardware-wise, anyway.

"Not found any issues"... except the Broadcom one, which causes the system to crash regularly in the default factory configuration.

How did you learn about the Broadcom issue for the first time? I had to learn the hard way, and with all the involvement of both Dell and Oracle support teams, nobody could tell me what I needed to change. We literally replaced every component of the server twice over a period of a year, and I spent man-days upgrading and downgrading firmwares, randomly trying to find a stable configuration. I scoured the internet to find this little tidbit about replacing the Broadcom NIC, randomly guessed, and replaced my NIC with an Intel card to make the problem go away.

The same system doesn't have a problem running RHEL/CentOS.

What will be the new problem in the next line of servers? Why, during my internet scouring, did I find a lot of other reports of people who needed to disable C-states (didn't work for me), and lots of false leads indicating a firmware downgrade would fix my Broadcom issue?

See my point? Next time I buy a server, I do not have confidence to simply expect Solaris on Dell to work reliably. The same goes for Solaris derivatives, and all non-Sun hardware. There simply is not an adequate qualification and/or support process.
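For anyone chasing the same symptom, a quick way to confirm which driver actually sits behind each port before swapping hardware is the dladm tooling. This is only a minimal sketch, assuming an OpenSolaris/Solaris Express-era system; the igb0 link name and the address below are illustrative assumptions, not details from the server discussed above.

List every physical port with the driver backing it (bnx = Broadcom, e1000g/igb = Intel), then the configured datalinks:

# dladm show-phys
# dladm show-link

Then plumb and address only the Intel port, leaving the onboard Broadcom ports unconfigured (disabling them in the BIOS, as described above, is a separate step):

# ifconfig igb0 plumb
# ifconfig igb0 inet 192.168.1.10 netmask 255.255.255.0 up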
Two-fold, really. Firstly, I remember the headaches I used to have configuring Broadcom cards properly under Debian/Ubuntu, and the sweetness that was using an Intel NIC. The bottom line for me was that I knew Intel drivers had been around longer than Broadcom drivers, so it made sense to ensure we had Intel NICs in the server. Secondly, I asked Andy Bennett from Nexenta, who told me it would make sense - always good to get a second opinion :-)

There were/are reports all over Google about Broadcom issues with Solaris/OpenSolaris, so I didn't want to risk it. For a couple of hundred for a quad-port gig NIC, it's worth it when the entire solution is 90K+. Sometimes (like the issue with bus resets when some brands/firmware revs of SSDs are used) the knowledge comes from people you work with (Nexenta rode to the rescue here again - plug! plug! plug!) :-)

These are deployed in a couple of universities and a very large data capture/marketing company I used to work for; I know it works really well, and (plug! plug! plug!) I know the dedicated support I got from the Nexenta guys.

The difference, as I see it, is that OpenSolaris/ZFS/DTrace/FMA allow you to build your own solution to your own problem. Thinking of storage in a completely new way, instead of as "just a block of storage", it becomes an integrated part of performance engineering - it certainly has been for the last two installs I've been involved in. I know why folks want a "certified" solution from the likes of Dell/HP etc., but from my point of view (and all points of view are valid here), I know I can deliver a cheaper, more focussed (and when I say that I'm not just doing some marketing BS) solution for the requirement at hand.

It's sometimes a struggle to get customers/end-users to think of storage as more than just storage. There's quite a lot of entrenched thinking to get around/over in our field (try getting a Java dev to think clearly about thread handling and massive SMP drawbacks, for example). Anyway - not trying to engage in an argument, but it's always interesting to find out why someone went for certain solutions over others.

My 2p. YMMV. *goes off to collect cheque from Nexenta* ;-)

---
W. A. Khushil Dep - khushil.dep at gmail.com - 07905374843
Windows - Linux - Solaris - ZFS - Nexenta - Development - Consulting & Contracting
http://www.khushil.com/ - http://www.facebook.com/GlobalOverlord
On 01/ 6/11 05:28 AM, Edward Ned Harvey wrote:

> See my point? Next time I buy a server, I do not have confidence to simply
> expect Solaris on Dell to work reliably. The same goes for Solaris
> derivatives, and all non-Sun hardware. There simply is not an adequate
> qualification and/or support process.

When you purchase NexentaStor from a top-tier Nexenta Hardware Partner, you get a product that has been through a rigorous qualification process which includes the hardware and software configuration matched together, tested with an extensive battery. You also can get a higher level of support than is offered to people who build their own systems.

Oracle is *not* the only company capable of performing in-depth testing of Solaris. I also know enough about the problems that Oracle customers (or rather Sun customers) faced with Solaris on Sun hardware -- such as the terrible NVIDIA ethernet problems on the first-generation U20 and U40 systems, or the Marvell SATA problems on Thumper -- to know that your picture of Oracle isn't nearly as rosy as you believe. Of course, I also lived (as a Sun employee) through the UltraSPARC-II ECC fiasco...

- Garrett
> From: Edward Ned Harvey <opensolarisisdeadlongliveopensolaris at nedharvey.com>
> Subject: Re: [zfs-discuss] A few questions
>
> How did you learn about the Broadcom issue for the first time? I had to
> learn the hard way, and with all the involvement of both Dell and Oracle
> support teams, nobody could tell me what I needed to change. We literally
> replaced every component of the server twice over a period of a year, and I
> spent man-days upgrading and downgrading firmwares, randomly trying to find
> a stable configuration. I scoured the internet to find this little tidbit
> about replacing the Broadcom NIC, randomly guessed, and replaced my NIC with
> an Intel card to make the problem go away.

20 years of doing this c*(# has taught me that most things only get learned the hard way. I certainly won't bet my career solely on the ability of the vendor to support the product, because they're hardly omniscient. Testing, testing, and generous return policies (and/or R&D budget)...

> The same system doesn't have a problem running RHEL/CentOS.

Then you're not pushing it hard enough, or your stars are just aligned nicely.

We have massive piles of Dell hardware, all types, running CentOS since at least 4.5. Every single one of those Dells has an Intel NIC in it, and the Broadcoms disabled in the BIOS. Because every time we do something stupid like let ourselves think "oh, we could maybe use those extra Broadcom ports for X", we get burned.

High-volume financial trading system. Blew up on the bcoms. Didn't matter what driver or tweak or fix. Plenty of man-days wasted debugging. Went with net.advice, put in an Intel NIC. No more problems. That was 3 years ago. Thought we could use the bcoms for our fileservers. Nope. Thought we could use the bcoms for the dedicated DRBD links for our Xen cluster. Nope. And we know we're not alone in this evaluation. We could have spent forever chasing support to get someone to "fix" it, I suppose... but we have better things to do.

> See my point? Next time I buy a server, I do not have confidence to simply
> expect Solaris on Dell to work reliably. The same goes for Solaris
> derivatives, and all non-Sun hardware. There simply is not an adequate
> qualification and/or support process.

I'm not convinced ANYONE really has such a thing, or that it's even necessarily possible. In fact, I'm sure they don't. Cuz that's what it says in the fine print on the support contracts and the purchase agreements: "we do not guarantee..." I just prefer not to have any confidence for the most part. It's easier and safer.

-bacon
On Thu, Jan 6, 2011 at 11:36 PM, Garrett D'Amore <garrett at nexenta.com> wrote:

> On 01/ 6/11 05:28 AM, Edward Ned Harvey wrote:
>> See my point? Next time I buy a server, I do not have confidence to simply
>> expect Solaris on Dell to work reliably. The same goes for Solaris
>> derivatives, and all non-Sun hardware. There simply is not an adequate
>> qualification and/or support process.
>
> When you purchase NexentaStor from a top-tier Nexenta Hardware Partner, you

Where is the list? Is it the one at http://www.nexenta.com/corp/technology-partners-overview/certified-technology-partners ?

> get a product that has been through a rigorous qualification process which
> includes the hardware and software configuration matched together, tested
> with an extensive battery. You also can get a higher level of support than
> is offered to people who build their own systems.
>
> Oracle is *not* the only company capable of performing in-depth testing of
> Solaris.

Does this roughly mean I can expect similar (or even better) hardware compatibility support with NexentaStor on SuperMicro as with Solaris on Oracle/Sun hardware?

-- Fajar
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Garrett D'Amore
>
> When you purchase NexentaStor from a top-tier Nexenta Hardware Partner,
> you get a product that has been through a rigorous qualification process

How do I do this, exactly? I am serious. Before too long, I'm going to need another server, and I would very seriously consider reprovisioning my unstable Dell Solaris server to become a Linux or some other stable machine. The role it's currently fulfilling is the "backup" server, which basically does nothing except "zfs receive" from the primary Sun Solaris 10u9 file server. Since the role is just for backups, it's a perfect opportunity for experimentation, hence the Dell hardware with Solaris. I'd be happy to put some other configuration in there experimentally instead... say... Nexenta. Assuming it will be just as good at "zfs receive" from the primary server.

Is there some specific hardware configuration you guys sell? Or recommend? How about a Dell R510/R610/R710? Buy the hardware separately and buy NexentaStor as just a software product? Or buy a somehow more certified hardware & software bundle together?

If I do encounter a bug, where the only known fact is that the system keeps crashing intermittently on an approximately weekly basis, and there is absolutely no clue what's wrong in hardware or software... how do you guys handle it?

If you'd like to follow up off-list, that's fine. Then just email me at the email address: nexenta at nedharvey.com (I use disposable email addresses on mailing lists like this, so at any random unknown time, I'll destroy my present alias and start using a new one.)
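The "zfs receive" role described above is just a periodic send/receive pipeline, so whichever OS ends up on the backup box only has to run something along these lines. A minimal sketch; the tank/backup pool names, the snapshot names, and backuphost are assumptions for illustration, not the actual configuration:

# zfs snapshot -r tank@2011-01-08
# zfs send -R -i tank@2011-01-07 tank@2011-01-08 | ssh backuphost zfs receive -Fd backup

The -R/-i pair produces a recursive incremental stream of everything that changed since the previous snapshot; on the receiving side, -F rolls the target back to the last common snapshot before applying the stream, and -d recreates the source dataset layout under the backup pool.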
On 08.01.11 18:33, Edward Ned Harvey wrote:

> If I do encounter a bug, where the only known fact is that the system keeps
> crashing intermittently on an approximately weekly basis, and there is
> absolutely no clue what's wrong in hardware or software... how do you guys
> handle it?
>
> If you'd like to follow up off-list, that's fine. Then just email me at the
> email address: nexenta at nedharvey.com

Hmm... that'd interest me as well. I do have 4 Dell PE R610s that are running OpenSolaris or Solaris 11 Express. I actually bought a Sun Fire X4170 M2, since I couldn't get my R610s stable, just as Edward points out. So, if you guys think that NexentaStor avoids these issues, then I'd seriously consider jumping ship - so either please don't continue off-list, or please include me in that conversation. ;)

Cheers,
budy
On 01/ 8/11 10:43 AM, Stephan Budach wrote:

> On 08.01.11 18:33, Edward Ned Harvey wrote:
>> If I do encounter a bug, where the only known fact is that the system keeps
>> crashing intermittently on an approximately weekly basis, and there is
>> absolutely no clue what's wrong in hardware or software... how do you guys
>> handle it?

Such problems are handled on a case-by-case basis. Usually we can do some analysis from a crash dump, but not always. My team includes several people who are experienced with such analysis, and when problems like this occur, we are called into action. Ultimately this usually results in a patch, sometimes workaround suggestions, and sometimes even binary relief (which happens faster than a regular patch, but without the deeper QA).

- Garrett
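The crash-dump analysis mentioned here is also something an admin can start locally before opening a case. A minimal sketch using the stock Solaris dump tools; the unix.0/vmcore.0 file names are the usual defaults and are assumed here, not taken from any particular system in this thread.

Confirm that crash dumps are configured, and after a panic write the saved dump out under /var/crash/<hostname>:

# dumpadm
# savecore -v

Then pull the basics (panic string, message buffer, panic stack) out of the dump with mdb:

# cd /var/crash/`hostname`
# printf '::status\n::msgbuf\n::stack\n' | mdb unix.0 vmcore.0

The panic string from ::status alone often points at the driver or subsystem involved, which is what a support engineer will ask for first.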
On Sat, Jan 08, 2011 at 12:33:50PM -0500, Edward Ned Harvey wrote:

> Is there some specific hardware configuration you guys sell? Or recommend?
> How about a Dell R510/R610/R710? Buy the hardware separately and buy
> NexentaStor as just a software product? Or buy a somehow more certified
> hardware & software bundle together?

Hey,

Other OSes have had problems with the Broadcom NICs as well. See for example this RHEL5 bug, where the host crashes, probably due to MSI-X IRQs with the bnx2 NIC:

https://bugzilla.redhat.com/show_bug.cgi?id=520888

And VMware vSphere ESX/ESXi 4.1 crashing with bnx2x:

http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1029368

So I guess there are firmware/driver problems affecting not just Solaris but also other operating systems.

-- Pasi
> From: Pasi Kärkkäinen [mailto:pasik at iki.fi]
>
> Other OSes have had problems with the Broadcom NICs as well.

Yes. The difference is, when I go to support.dell.com and punch in my service tag, I can download updated firmware and drivers for RHEL that (at least supposedly) solve the problem. I haven't tested it, but the Dell support guy told me it has worked for RHEL users. There is nothing available to download for Solaris.

Also, the bcom is not the only problem on that server. After I added an Intel network card and disabled the bcom, the weekly crashes stopped, but now it's... I don't know... once every 3 weeks, with a slightly different mode of failure. This is, yet again, rare enough that the system could very well pass a certification test, but not rare enough for me to feel comfortable putting it into production as a primary mission-critical server.

I really think there are only two ways in the world to engineer a good, solid server:
(a) Smoke your own crack. Systems engineering teams use the same systems that are sold to customers.
or
(b) Sell millions of 'em. So regardless of whether the engineering team uses them, you're still going to have sufficient mass to dedicate engineers to the purpose of post-sales bug solving.

I suppose a third way, which has certainly happened in history but is not very applicable to me, is to simply charge such ridiculously high prices for your servers that you can dedicate engineers to post-sales bug solving even if you only sold a handful of those systems in the whole world. Things like munitions-strength Cray and AlphaServer machines have sometimes fit into this category in the past.

I do feel confident assuming that Solaris kernel engineers use Sun servers primarily for their server infrastructure. So I feel safe buying this configuration. The only thing there is to gain by buying something else is lower prices... or maybe some obscure fringe detail that I can't think of.
On Jan 9, 2011, at 4:19 PM, Edward Ned Harvey <opensolarisisdeadlongliveopensolaris at nedharvey.com> wrote:

>> From: Pasi Kärkkäinen [mailto:pasik at iki.fi]
>>
>> Other OSes have had problems with the Broadcom NICs as well.
>
> Yes. The difference is, when I go to support.dell.com and punch in my
> service tag, I can download updated firmware and drivers for RHEL that (at
> least supposedly) solve the problem. I haven't tested it, but the Dell
> support guy told me it has worked for RHEL users. There is nothing
> available to download for Solaris.

The drivers are written by Broadcom and are, AFAIK, closed source. By going through Dell, you are going through a middle-man. For example:

http://www.broadcom.com/support/ethernet_nic/netxtremeii10.php

where you see the Solaris drivers were released at the same time as the Windows drivers.

> Also, the bcom is not the only problem on that server. After I added an
> Intel network card and disabled the bcom, the weekly crashes stopped, but
> now it's... I don't know... once every 3 weeks, with a slightly different
> mode of failure. This is, yet again, rare enough that the system could very
> well pass a certification test, but not rare enough for me to feel
> comfortable putting it into production as a primary mission-critical server.
>
> I really think there are only two ways in the world to engineer a good,
> solid server:
> (a) Smoke your own crack. Systems engineering teams use the same systems
> that are sold to customers.

This is rarely practical, not to mention that product development is often not in the systems engineering organization.

> or
> (b) Sell millions of 'em. So regardless of whether the engineering team
> uses them, you're still going to have sufficient mass to dedicate engineers
> to the purpose of post-sales bug solving.

Yes, indeed :-)
 -- richard
Just to add a bit to this; I just love sweeping generalizations...

On 9 Jan 2011, at 19:33, Richard Elling wrote:

> On Jan 9, 2011, at 4:19 PM, Edward Ned Harvey wrote:
>> Yes. The difference is, when I go to support.dell.com and punch in my
>> service tag, I can download updated firmware and drivers for RHEL that (at
>> least supposedly) solve the problem. I haven't tested it, but the Dell
>> support guy told me it has worked for RHEL users. There is nothing
>> available to download for Solaris.
>
> The drivers are written by Broadcom and are, AFAIK, closed source.
> By going through Dell, you are going through a middle-man. For example:
>
> http://www.broadcom.com/support/ethernet_nic/netxtremeii10.php
>
> where you see the Solaris drivers were released at the same time as the
> Windows drivers.

What Richard says is true. Broadcom has been a source of contention in the Linux world as well as the *BSD world due to the proprietary nature of their firmware. OpenSolaris/Solaris users are not the only ones who have complained about this; there's been much uproar in the FOSS community about Broadcom and their drivers. As a result, I've seen some pretty nasty hacks, like people using the Windows drivers linked into their kernel - *gack*. I forget all the gory details, but it was rather disgusting as I recall: bubblegum, baling wire, duct tape, and all.

Dell and Red Hat aren't exactly a marriage made in heaven either. I've had problems getting support from both Dell and Red Hat, with them pointing fingers at each other rather than solving the problem. Like most people, I've had to come up with my own workarounds, like others with the Broadcom issue, using a "known quantity" NIC. When dealing with Dell as a corporate buyer, they have always made it quite clear that they are primarily a Windows platform. Linux? Oh yes, we have that too...

>> Also, the bcom is not the only problem on that server. After I added an
>> Intel network card and disabled the bcom, the weekly crashes stopped, but
>> now it's... I don't know... once every 3 weeks, with a slightly different
>> mode of failure. This is, yet again, rare enough that the system could very
>> well pass a certification test, but not rare enough for me to feel
>> comfortable putting it into production as a primary mission-critical server.

I've never been particularly warm and fuzzy with Dell servers. They seem to like to change their chipsets slightly while a model is in production. This can cause all sorts of problems which are difficult to diagnose, since an "identical" Dell system will have no problems while its mate crashes weekly.

>> I really think there are only two ways in the world to engineer a good,
>> solid server:
>> (a) Smoke your own crack. Systems engineering teams use the same systems
>> that are sold to customers.
>
> This is rarely practical, not to mention that product development
> is often not in the systems engineering organization.
>
>> or
>> (b) Sell millions of 'em. So regardless of whether the engineering team
>> uses them, you're still going to have sufficient mass to dedicate engineers
>> to the purpose of post-sales bug solving.
>
> Yes, indeed :-)
> -- richard

As for certified systems, it's my understanding that Nexenta themselves don't "certify" anything. They have systems which are recommended and supported by their network of VARs.
It just so happens that SuperMicro is one of the brands of choice, but even then one must adhere to a fairly tight HCL. The same holds true for Solaris/OpenSolaris on third-party hardware. SATA controllers and multiplexers are another example where the drivers are written by the manufacturer, and Solaris/OpenSolaris are not a priority over Windows and Linux, in that order. Deviating to hardware that isn't somewhat "plain vanilla" and isn't listed on the HCL is just asking for trouble.

Mike

---
Michael Sullivan
michael.p.sullivan at me.com
http://www.kamiogi.net/
Mobile: +1-662-202-7716
US Phone: +1-561-283-2034
JP Phone: +81-50-5806-6242
> As for certified systems, it's my understanding that Nexenta themselves
> don't "certify" anything. They have systems which are recommended and
> supported by their network of VARs.

The certified solutions listed on Nexenta's website were certified by Nexenta.