Hiya,

I have been playing with ZFS for a few days now on a test PC, and I plan to use it for my home media server after being very impressed!

I've got the basics of creating zpools and zfs filesystems with compression and dedup etc, but I'm wondering if there's a better way to handle security. I'm using Windows 7 clients, by the way. I have used this 'guide' to do the permissions - http://www.slepicka.net/?p=37

Also, at present I have 5x 1TB drives to use in my home server, so I plan to create a RAID-Z1 pool which will have my shares on it (Movies, Music, Pictures etc). I then plan to increase this in sets of 5 (so another 5x 1TB drives in Jan and another 5 in Feb/March, so that I can avoid all disks being from the same batch). I did plan on creating separate zpools with each set of 5 drives;

drives 1-5: volume0 zpool
drives 6-10: volume1 zpool
drives 11-15: volume2 zpool

so that I can sustain 3 simultaneous drive failures, as long as it's one drive from each set. However, I think this will mean each zpool will have independent shares, which I don't want. I have used this guide - http://southbrain.com/south/tutorials/zpools.html - which says you can combine zpools into a 'parent' zpool, but can this be done in my scenario (staggered)? It looks like the child zpools have to be created before the parent is. So basically I'd need to be able to:

Create volume0 zpool now
Create volume1 zpool in Jan, then combine volume0 and volume1 into a parent zpool
Create volume2 in Feb/March and add it to the parent zpool

I know I could just add each disk to the volume0 zpool, but I've read it's a bugger to do and that creating separate zpools with new disks is a much better way to go.

I think that's it for now. Sorry for the mammoth first post!

Thanks
--
This message posted from opensolaris.org
> Also, at present I have 5x 1TB drives to use in my home server so I
> plan to create a RAID-Z1 pool which will have my shares on it (Movies,
> Music, Pictures etc). I then plan to increase this in sets of 5 (so
> another 5x 1TB drives in Jan and another 5 in Feb/March so that I can
> avoid all disks being from the same batch). I did plan on creating
> separate zpools with each set of 5 drives;
>
> drives 1-5: volume0 zpool
> drives 6-10: volume1 zpool
> drives 11-15: volume2 zpool

Although this seems a good idea to start with, there are issues with it performance-wise. If you fill up VDEV0 (drives 1-5) and then attach VDEV1 (drives 6-10), new writes will still initially be striped across the two VDEVs, leading to a performance impact on writes. There is currently no way of balancing how full the VDEVs are without manually doing a backup/restore, or copying the data from one place to another within the pool and then removing the original.

> so that I can sustain 3 simultaneous drive failures, as long as it's
> one drive from each set. However I think this will mean each zpool
> will have independent shares which I don't want. I have used this
> guide - http://southbrain.com/south/tutorials/zpools.html - which says
> you can combine zpools into a 'parent' zpool, but can this be done in
> my scenario (staggered)? It looks like the child zpools have to be
> created before the parent is. So basically I'd need to be able to:

For the scheme to work as above, start with something like

# zpool create mypool raidz1 c0t1d0 c0t2d0 c0t3d0 c2t4d0 c2t5d0

Later, you'll add the new vdev

# zpool add mypool raidz1 c0t6d0 c0t7d0 c0t8d0 c2t9d0 c2t10d0

This will work as described above. However, I would do this somewhat differently. Start off with, say, 6x 1TB drives in RAIDz2 and set autoexpand=on on the pool (remember compression=on on the pool's root filesystem too).

# zpool create mypool raidz2 c0t1d0 c0t2d0 c0t3d0 c2t4d0 c2t5d0 c2t6d0
# zpool set autoexpand=on mypool
# zfs set compression=on mypool

Compression is lzjb, and it won't compress much for audio or video, but then, it won't hurt much either. When this starts to get close to full, get new, larger drives and replace the older 1TB drives one by one. Once all have been replaced by larger, say 1.5TB, drives, whoops, your array is larger. This will scale better performance-wise and you won't need that many controllers. Also, with RAIDz2, you can lose any two drives.

Vennlige hilsener / Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 97542685
roy at karlsbakk.net
http://blogg.karlsbakk.net/
--
In all pedagogy it is essential that the curriculum be presented intelligibly. It is an elementary imperative for all pedagogues to avoid excessive use of idioms of foreign origin. In most cases adequate and relevant synonyms exist in Norwegian.
Thanks for the reply.

In that case, wouldn't it be better to, as you say, start with a 6-drive Z2, then just keep adding drives until the case is full, for a single Z2 zpool? Or even Z3, if that's available now?

I have an 11x 5.25" bay case, with 3x 5-in-3 hot swap caddies giving me 15 drive bays. Hence the plan to start with 5, then 10, then all the way to 15.

This seems a more logical (and cheaper) solution than continually replacing with bigger drives as they come to market.
--
This message posted from opensolaris.org
On Thu, Dec 16, 2010 at 12:59 AM, Lanky Doodle <lanky_doodle at hotmail.com> wrote:

> I have been playing with ZFS for a few days now on a test PC, and I plan to use it for my home media server after being very impressed!

Works great for that. Have a similar setup at home, using FreeBSD.

> Also, at present I have 5x 1TB drives to use in my home server so I plan to create a RAID-Z1 pool which will have my shares on it (Movies, Music, Pictures etc). I then plan to increase this in sets of 5 (so another 5x 1TB drives in Jan and another 5 in Feb/March so that I can avoid all disks being from the same batch). I did plan on creating separate zpools with each set of 5 drives;

No no no. Create 1 pool. Create the pool initially with a single 5-drive raidz vdev. Later, add the next five drives to the system, and create a new raidz vdev *in the same pool*. Voila. You now have the equivalent of a RAID50, as ZFS will stripe writes to both vdevs, increasing the overall size *and* speed of the pool.

Later, add the next five drives to the system, and create a new raidz vdev in the same pool. Voila. You now have a pool with 3 vdevs, with reads/writes being striped across all three. You can still lose 3 drives (1 per vdev) before losing the pool.

The commands to do this are along the lines of:

# zpool create mypool raidz disk1 disk2 disk3 disk4 disk5
# zpool add mypool raidz disk6 disk7 disk8 disk9 disk10
# zpool add mypool raidz disk11 disk12 disk13 disk14 disk15

Creating 1 pool gives you the best performance and the most flexibility. Use separate filesystems on top of that pool if you want to tweak all the different properties. Going with 1 pool also increases your chances for dedupe, as dedupe is done at the pool level.

--
Freddie Cash
fjwcash at gmail.com
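A quick way to sanity-check the layout after each expansion (using Freddie's placeholder pool and disk names) is the standard status and list subcommands:

# zpool status mypool
# zpool list mypool

The first shows each raidz vdev and its member disks; the second shows the total pool size growing as vdevs are added.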
Hi Lanky,

Other follow-up posters have given you good advice. I don't see where you are getting the idea that you can combine pools with pools. You can't do this, and I don't see that the southbrain tutorial illustrates this either. All of his examples for creating redundant pools are reasonable.

As others have said, you can create a RAIDZ pool with one vdev of say 5 disks, and then later add another 5 disks, and so on.

Thanks,

Cindy

On 12/16/10 01:59, Lanky Doodle wrote:
> Hiya,
>
> I have been playing with ZFS for a few days now on a test PC, and I plan to use it for my home media server after being very impressed!
>
> I've got the basics of creating zpools and zfs filesystems with compression and dedup etc, but I'm wondering if there's a better way to handle security. I'm using Windows 7 clients by the way.
>
> I have used this 'guide' to do the permissions - http://www.slepicka.net/?p=37
>
> Also, at present I have 5x 1TB drives to use in my home server so I plan to create a RAID-Z1 pool which will have my shares on it (Movies, Music, Pictures etc). I then plan to increase this in sets of 5 (so another 5x 1TB drives in Jan and another 5 in Feb/March so that I can avoid all disks being from the same batch). I did plan on creating separate zpools with each set of 5 drives;
>
> drives 1-5: volume0 zpool
> drives 6-10: volume1 zpool
> drives 11-15: volume2 zpool
>
> so that I can sustain 3 simultaneous drive failures, as long as it's one drive from each set. However I think this will mean each zpool will have independent shares which I don't want. I have used this guide - http://southbrain.com/south/tutorials/zpools.html - which says you can combine zpools into a 'parent' zpool, but can this be done in my scenario (staggered), as it looks like the child zpools have to be created before the parent is done? So basically I'd need to be able to:
>
> Create volume0 zpool now
> Create volume1 zpool in Jan, then combine volume0 and volume1 into a parent zpool
> Create volume2 in Feb/March and add to parent zpool
>
> I know I could just add each disk to volume0 zpool but I've read it's a bugger to do and that creating separate zpools with new disks is a much better way to go.
>
> I think that's it for now. Sorry for the mammoth first post!
>
> Thanks
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Lanky Doodle
>
> In that case, wouldn't it be better to, as you say, start with a 6-drive Z2, then
> just keep adding drives until the case is full, for a single Z2 zpool?

Doesn't work that way. You can create a vdev, and later, you can add more vdevs. So you can create a raidz now, and later you can add another raidz. But you cannot create a raidz now, and later just add onesy-twosy disks to increase the size incrementally.

> Or even Z3, if that's available now?

Raidz3 is available now.

There is only one thing to be aware of. ZFS resilvering is very inefficient for typical usage scenarios. The time to resilver divides by the number of vdevs in the pool (meaning 10 mirrors will resilver 10x faster than an equivalently sized raidzN), and the time to resilver is doubled when you have several disks within the vdev. Due to this inefficiency, we're talking about 12 hours (on my server) to resilver a 1TB disk which is around 70% used. This would have been ~3 weeks if I had one big raidz3. So it matters. Your multiple raidz vdevs of 5-6 disks each are a reasonable compromise.

> I have an 11x 5.25" bay case, with 3x 5-in-3 hot swap caddies giving me 15
> drive bays. Hence the plan to start with 5, then 10, then all the way to 15.
>
> This seems a more logical (and cheaper) solution than keep replacing with
> bigger drives as they come to market.

'Course, you can also replace with bigger drives as they come to market, too. ;-) If you've got 5 disks in a raidz... first scrub it. Then, replace one disk with a larger disk, and wait for the resilver. Replace each disk, one by one, with larger disks. And eventually, when you do the last one... your pool becomes larger. (Depending on your defaults, manual intervention may be required to make the pool autoexpand when the devices have all been upgraded.)
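As a minimal sketch of that replace-one-by-one procedure (the pool name and cXtYdZ device names are placeholders; use the names zpool status reports on your own system):

# zpool scrub mypool
# zpool replace mypool c0t1d0

Wait for the resilver to complete (watch zpool status), then repeat the replace for each remaining disk. With zpool set autoexpand=on mypool, the extra capacity appears once the last small disk has been swapped out; otherwise the expansion needs the manual step Edward mentions.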
Thanks for all the replies.

The bit about combining zpools came from this command on the southbrain tutorial;

zpool create mail \
mirror c6t600D0230006C1C4C0C50BE5BC9D49100d0 c6t600D0230006B66680C50AB7821F0E900d0 \
mirror c6t600D0230006B66680C50AB0187D75000d0 c6t600D0230006C1C4C0C50BE27386C4900d0

I admit I was getting confused between zpools and vdevs, thinking in the above command that each mirror was a zpool and not a vdev.

Just so I'm correct, a normal command would look like

zpool create mypool raidz disk1 disk2 disk3 disk4 disk5

which would result in a zpool called mypool, made up of a single 5-disk raidz vdev? This means that zpools don't actually 'contain' physical devices, which is what I originally thought.
--
This message posted from opensolaris.org
On 12/17/2010 2:12 AM, Lanky Doodle wrote:
> Thanks for all the replies.
>
> The bit about combining zpools came from this command on the southbrain tutorial;
>
> zpool create mail \
> mirror c6t600D0230006C1C4C0C50BE5BC9D49100d0 c6t600D0230006B66680C50AB7821F0E900d0 \
> mirror c6t600D0230006B66680C50AB0187D75000d0 c6t600D0230006C1C4C0C50BE27386C4900d0
>
> I admit I was getting confused between zpools and vdevs, thinking in the above command that each mirror was a zpool and not a vdev.
>
> Just so I'm correct, a normal command would look like
>
> zpool create mypool raidz disk1 disk2 disk3 disk4 disk5
>
> which would result in a zpool called mypool, made up of a single 5-disk raidz vdev? This means that zpools don't actually 'contain' physical devices, which is what I originally thought.

You are correct that the above will have a single vdev of 5 disks.

Here's a shorthand note: a zpool is made of 1 or more vdevs. Each vdev can be a raidz, a mirror, or a single device (either a file or a disk). So you *can* have a zpool which has solely physical drives, e.g.

zpool create tank disk1 disk2 disk3

will create a pool with 3 disks, with data being striped across the devices as desired.

--
Erik Trimble
Java System Support
Mailstop: usca22-123
Phone: x17195
Santa Clara, CA
OK cool. One last question.

Reading the Admin Guide for ZFS, it says:

"A more complex conceptual RAID-Z configuration would look similar to the following:

raidz c1t0d0 c2t0d0 c3t0d0 c4t0d0 c5t0d0 c6t0d0 c7t0d0
raidz c8t0d0 c9t0d0 c10t0d0 c11t0d0 c12t0d0 c13t0d0 c14t0d0

If you are creating a RAID-Z configuration with many disks, as in this example, a RAID-Z configuration with 14 disks is better split into two 7-disk groupings. RAID-Z configurations with single-digit groupings of disks should perform better."

This is relevant as my final setup was planned to be 15 disks, so only one more than the example.

So, do I drop one disk and go with two 7-drive vdevs, or stick to three 5-drive vdevs?

Also, does anyone have anything to add re the security of CIFS when used with Windows clients?

Thanks again guys, and gals...
--
This message posted from opensolaris.org
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Lanky Doodle
>
> This is relevant as my final setup was planned to be 15 disks, so only one
> more than the example.
>
> So, do I drop one disk and go with two 7-drive vdevs, or stick to three 5-drive vdevs?

Both ways are fine. Consider the balance between redundancy and drive space.

Also, in the event of a resilver, the 3x5 raidz will be faster. In rough numbers, suppose you have 1TB drives, 70% full. Then your resilver might be 8 days instead of 12 days. That's important when you consider the fact that during that window, you have degraded redundancy. Another failed disk in the same vdev would destroy the entire pool.

Also, if a 2nd disk fails during the resilver, it's more likely to be in the same vdev if you have only 2 vdevs. Your odds are better with smaller vdevs, both because the resilver completes faster, and because the probability of a 2nd failure in the same vdev is smaller.

For both performance and reliability reasons, I recommend nothing except single-drive mirrors, except in extreme data-is-not-important situations. At least, that's my recommendation until someday, when the resilver efficiency is improved, or "fixed."
Thanks!

By single drive mirrors, I assume, in a 14 disk setup, you mean 7 sets of 2-disk mirrors - I am thinking of traditional RAID1 here.

Or do you mean 1 massive mirror with all 14 disks?

This is always a tough one for me. I too prefer RAID1 where redundancy is king, but the trade off for me would be 5TB of 'wasted' space - a total of 7TB in mirrors versus 12TB in 3x RAIDZ.

Decisions, decisions.....
--
This message posted from opensolaris.org
You should take a look at the ZFS best practices guide for RAIDZ and mirrored configuration recommendations:

http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide

It's easy for me to say because I don't have to buy the storage, but mirrored storage pools are currently more flexible, provide good performance, and replacing/resilvering data on disks is faster.

Thanks,

Cindy

On 12/17/10 09:48, Lanky Doodle wrote:
> Thanks!
>
> By single drive mirrors, I assume, in a 14 disk setup, you mean 7 sets of 2-disk mirrors - I am thinking of traditional RAID1 here.
>
> Or do you mean 1 massive mirror with all 14 disks?
>
> This is always a tough one for me. I too prefer RAID1 where redundancy is king, but the trade off for me would be 5TB of 'wasted' space - a total of 7TB in mirrors versus 12TB in 3x RAIDZ.
>
> Decisions, decisions.....
at December 17 2010, 17:48 <Lanky Doodle> wrote in [1]:

> By single drive mirrors, I assume, in a 14 disk setup, you mean 7
> sets of 2 disk mirrors - I am thinking of traditional RAID1 here.
>
> Or do you mean 1 massive mirror with all 14 disks?

Edward means a set of two-way mirrors. Do you remember what he wrote:

>> Also, in the event of a resilver, the 3x5 raidz will be faster. In rough
>> numbers, suppose you have 1TB drives, 70% full. Then your resilver might be
>> 8 days instead of 12 days. That's important when you consider the fact that
>> during that window, you have degraded redundancy. Another failed disk in
>> the same vdev would destroy the entire pool.
>
>> Also if a 2nd disk fails during resilver, it's more likely to be in the same
>> vdev, if you have only 2 vdevs. Your odds are better with smaller vdevs,
>> both because the resilver completes faster, and the probability of a 2nd
>> failure in the same vdev is smaller.

And that scenario is a horrible notion. While the resilver is running, you have to hope that nothing else fails. In his example that is between 192 and 288 hours - a long, a very long time. And be aware that a disk will break at some point.

> This is always a tough one for me. I too prefer RAID1 where
> redundancy is king, but the trade off for me would be 5TB of
> 'wasted' space - a total of 7TB in mirrors versus 12TB in 3x RAIDZ.

You lose the most space when you make a pool of mirrors, BUT the I/O is much faster, it is more secure, and you have all the features of zfs too.
http://blogs.sun.com/relling/entry/zfs_raid_recommendations_space_performance

> Decisions, decisions.....

My suggestion is to make a two-way mirror of small disks or SSDs for the OS. This is not easy to do after installation; you have to look for a howto. Sorry, I can't find the link at the moment. For Solaris 11 Express, Oracle announced that in the text installer you can set the root pool to a mirror during installation. At the moment I am trying it out in a VM but I haven't found this option. :-(

zpool create lankyserver mirror disk1 disk2 mirror disk3 disk4

When you need more space you can add another pair of disks to your lankyserver pool. The disks within each pair should have the same capacity.

zpool add lankyserver mirror disk5 disk6 mirror disk7 disk8 ...

Consider that it is a good decision to plan for one spare disk. You can use the zpool add command to add a spare disk at a later time.
http://docs.sun.com/app/docs/doc/819-2240/zpool-1m?a=view

When you build a raidz pool, the pool only uses as much of each disk as the smallest disk provides; the rest of any bigger disk is wasted. With a mirrored pool only the disks within a pair must match, so you can use one pair of 1 TB disks and one pair of 2 TB disks in the same pool. In that case your spare disk _must have_ the biggest capacity.

Read this for your decision:
http://constantin.glez.de/blog/2010/01/home-server-raid-greed-and-why-mirroring-still-best

--
Best Regards
Alexander
December, 17 2010
........
[1] mid:382802084.111292604519623.JavaMail.Twebapp at sf-app1
........
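For reference, the post-install howto Alexander mentions boils down to attaching a second disk to the root pool and putting a boot loader on it. A minimal sketch, assuming an x86 box and placeholder device names (the root pool lives on slice 0 of each disk; SPARC would use installboot instead):

# zpool attach rpool c0t0d0s0 c0t1d0s0
# installgrub /boot/grub/stage1 /boot/grub/stage2 /dev/rdsk/c0t1d0s0

Let the resilver finish (zpool status rpool) before counting on being able to boot from the second disk.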
On Fri, 17 Dec 2010, Edward Ned Harvey wrote:
>
> Also if a 2nd disk fails during resilver, it's more likely to be in the same
> vdev, if you have only 2 vdevs. Your odds are better with smaller vdevs,
> both because the resilver completes faster, and the probability of a 2nd
> failure in the same vdev is smaller.

While I agree that smaller vdevs are more reliable, I find your statement about the failure being more likely to be in the same vdev if you have only 2 vdevs to be rather useless. The probability of vdev failure does not have anything to do with the number of vdevs. However, the probability of vdev failure increases tremendously if there is only one vdev and there is a second disk failure.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Alexander Lesle
>
> at December 17 2010, 17:48 <Lanky Doodle> wrote in [1]:
>
> > By single drive mirrors, I assume, in a 14 disk setup, you mean 7
> > sets of 2 disk mirrors - I am thinking of traditional RAID1 here.
>
> > Or do you mean 1 massive mirror with all 14 disks?
>
> Edward means a set of two-way mirrors.

Correct.
mirror disk0 disk1 mirror disk2 disk3 mirror disk4 disk5 ...

You would normally call this a stripe of mirrors. Even though the ZFS concept of striping is more advanced than traditional raid striping... we still call this a ZFS stripe for lack of any other term. A ZFS stripe has all the good characteristics of raid concatenation and striping, without any of the bad characteristics. It can utilize bandwidth on multiple disks when it wants to, or use a single device when it wants to for small blocks. It can dynamically add randomly sized devices, and it can be done one at a time. You gain everything good about a traditional raid stripe or concatenation, without any of the negatives.

> For Solaris 11 Express, Oracle announced that in the text installer you can
> set the root pool to a mirror during installation. At the moment I am trying
> it out in a VM but I haven't found this option. :-(

Actually, even in Solaris 10, I habitually install the root filesystem onto a ZFS mirror. You just select 2 disks, and it's automatically a mirror.

> zpool create lankyserver mirror disk1 disk2 mirror disk3 disk4
>
> When you need more space you can add another pair of disks to your
> lankyserver pool. The disks within each pair should have the same capacity.
>
> zpool add lankyserver mirror disk5 disk6 mirror disk7 disk8 ...

Correct.
> From: Bob Friesenhahn [mailto:bfriesen at simple.dallas.tx.us]
> Sent: Friday, December 17, 2010 9:16 PM
>
> While I agree that smaller vdevs are more reliable, I find your statement
> about the failure being more likely to be in the same vdev if you have
> only 2 vdevs to be rather useless. The probability of vdev failure does
> not have anything to do with the number of vdevs. However, the probability
> of vdev failure increases tremendously if there is only one vdev and there
> is a second disk failure.

I'm not sure you got what I meant. I'll rephrase and see if it's more clear:

Correct, the number of vdevs doesn't affect the probability of a failure in a specific vdev, but the number of disks in a vdev does. Lanky said he was considering 2x 7-disk raidz versus 3x 5-disk raidz. So when I said he's more likely to have a 2nd disk fail in the same vdev if he only has 2 vdevs... that was meant to be taken in context, not as a generalization about pools in general.

Consider a single disk. Let P be the probability of the disk failing within 1 day.

If you have 5 disks in a raidz vdev, and one fails, there are 4 remaining. If the resilver will last 8 days, then the probability of a 2nd disk failing is 4*8*P = 32P

If you have 7 disks in a raidz vdev, and one fails, there are 6 remaining. If the resilver will last 12 days, then the probability of a 2nd disk failing is 6*12*P = 72P
On the subject of where to install ZFS, I was planning to use either Compact Flash or a USB drive (both of which would be mounted internally); using up 2 of the drive bays for a mirrored install is possibly a waste of physical space, considering a) it's a home media server and b) the config can be backed up to a protected ZFS pool - if the CF or USB drive failed I would just replace it and restore the config.

Can you have an equivalent of a global hot spare in ZFS? If I did go down the mirror route (mirror disk0 disk1 mirror disk2 disk3 mirror disk4 disk5 etc) all the way up to 14 disks, that would leave the 15th disk spare.

Now this is getting really complex, but can you have server failover in ZFS, much like DFS-R in Windows - you point clients to a clustered ZFS namespace so if a complete server failed nothing is interrupted?

I am still undecided as to mirror vs RAID-Z. I am going to be ripping uncompressed Blu-Rays so space is vital. I use RAID-DP on NetApp kit at work and I'm guessing RAID-Z2 is the equivalent? I have 5TB of space at the moment, so going to the expense of mirroring for only 2TB extra doesn't seem much of a pay off. Maybe a compromise of 2x 7-disk RAID-Z1 with a global hotspare is the way to go?

Put it this way, I currently use Windows Home Server, which has no true disk failure protection, so any of ZFS's redundancy schemes is going to be a step up; is there an equivalent system in ZFS where if 1 disk fails you only lose that disk's data, like unRAID?

Thanks everyone for your input so far :)
--
This message posted from opensolaris.org
On Dec 18, 2010, at 12:23 PM, Lanky Doodle wrote:

> Now this is getting really complex, but can you have server failover in ZFS, much like DFS-R in Windows - you point clients to a clustered ZFS namespace so if a complete server failed nothing is interrupted?

This is the purpose of an Amber Road dual-head cluster (7310C, 7410C, etc.) -- not only does the storage pool fail over, but the server IP address fails over as well, so that NFS shares etc. remain active when one storage head goes down. Amber Road uses ZFS, but the clustering and failover are not related to the filesystem type.

Mark
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Lanky Doodle
>
> On the subject of where to install ZFS, I was planning to use either Compact
> Flash or a USB drive (both of which would be mounted internally); using up 2 of
> the drive bays for a mirrored install is possibly a waste of physical space,
> considering a) it's a home media server and b) the config can be backed up to
> a protected ZFS pool - if the CF or USB drive failed I would just replace it
> and restore the config.

All of the above is correct. One thing you should keep in mind, however: if your unmirrored rpool (USB fob) fails... although yes, you can restore, assuming you have been sufficiently backing it up... you will suffer an ungraceful halt. Maybe you can live with that.

> Can you have an equivalent of a global hot spare in ZFS? If I did go down the
> mirror route (mirror disk0 disk1 mirror disk2 disk3 mirror disk4 disk5 etc) all
> the way up to 14 disks, that would leave the 15th disk spare.

Check the zpool man page for "spare," but I know you can have spares assigned to a pool, and I'm pretty sure you can assign any given spare to multiple pools, effectively making it a global hotspare. So yes is the answer.

> Now this is getting really complex, but can you have server failover in ZFS,
> much like DFS-R in Windows - you point clients to a clustered ZFS namespace
> so if a complete server failed nothing is interrupted?

If that's somehow possible, it's something I don't know. I don't believe you can do that with ZFS.

> I am still undecided as to mirror vs RAID-Z. I am going to be ripping
> uncompressed Blu-Rays so space is vital.

For both read and write, raidz works extremely well for sequential operations. It sounds like you're probably going to be doing mostly sequential operations, so raidz should perform very well for you.

A lot of people will avoid raidzN because it doesn't perform very well for random reads, so they opt for mirrors instead. But in your case, not so much. In your case, the only reason I can think of to avoid raidz would be if you're worrying about resilver times. That's a valid concern, but you can choose any number of disks you want... you could make raidz vdevs of 3 disks each... it's just a compromise between the mirror and the larger raidz vdev.

> I use RAID-DP on NetApp kit at work
> and I'm guessing RAID-Z2 is the equivalent?

Yup, RAID-DP and raidz2 are conceptually pretty much the same.

> Put it this way, I currently use Windows Home Server, which has no true disk
> failure protection, so any of ZFS's redundancy schemes is going to be a step
> up; is there an equivalent system in ZFS where if 1 disk fails you only lose
> that disk's data, like unRAID?

No. Not unless you make that many separate volumes.
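As a concrete sketch of the spare syntax (pool and device names here are placeholders; see the zpool man page entry Edward refers to):

# zpool add mypool spare c5t0d0
# zpool status mypool

The spare then appears in its own "spares" section of the status output, and the ZFS Administration Guide notes that a spare can be shared across multiple pools on the same system, which is what makes it effectively global.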
Thanks Edward.

I do agree about a mirrored rpool (equivalent to a Windows OS volume); not doing it goes against one of my principles when building enterprise servers.

Is there any argument against using the rpool for all data storage as well as being the install volume? Say for example I chucked 15x 1TB disks in there and created a mirrored rpool during installation, using 2 disks. If I added another 6 mirrors (12 disks) to it, that would give me an rpool of 7TB. The 15th disk being a spare.

Or, say I selected 3 disks during install, does this create a 3-way mirrored rpool or does it give you the option of creating raidz? If so, I could then create a further 4x 3-drive raidzs, giving me a 10TB rpool.

Or, I could use 2 smaller disks (say 80GB) for the rpool, then create 4x 3-drive raidzs, giving me an 8TB data pool. Again this gives me a spare disk.

Either of these 3 should keep resilvering times to a minimum, compared to, say, one big raidz2 of 13 disks.

Why does resilvering take so long in raidz anyway?
--
This message posted from opensolaris.org
Oh, does anyone know if resilvering efficiency is improved or fixed in Solaris 11 Express, as that is what I'm using?
--
This message posted from opensolaris.org
> Why does resilvering take so long in raidz anyway?

Because it's broken. There were some changes a while back that made it more broken. There has been a lot of discussion, anecdotes and some data on this list.

The resilver doesn't do a single pass of the drives, but uses a "smarter" temporal algorithm based on metadata. However, the current implementation has difficulty finishing the job if there's a steady flow of updates to the pool. As far as I'm aware, the only way to get bounded resilver times is to stop the workload until resilvering is completed. The problem exists for mirrors too, but is not as marked because mirror reconstruction is inherently simpler.

I believe Oracle is aware of the problem, but most of the core ZFS team has left. And of course, a fix for Oracle Solaris no longer means a fix for the rest of us.
Hi,

Which brings up an interesting question... IF it were fixed in, for example, illumos or FreeBSD, is there a plan for how to handle possibly incompatible zfs implementations?

Currently the basic version numbering only works because it implies a single stream of development; now, with multiple possible streams, does ZFS need to move to a feature-bit system, or are we going to have forks or multiple incompatible versions?

Thanks,
Deano

-----Original Message-----
From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-bounces at opensolaris.org] On Behalf Of Phil Harman
Sent: 20 December 2010 10:43
To: Lanky Doodle
Cc: zfs-discuss at opensolaris.org
Subject: Re: [zfs-discuss] A few questions

> Why does resilvering take so long in raidz anyway?

Because it's broken. There were some changes a while back that made it more broken. There has been a lot of discussion, anecdotes and some data on this list.

The resilver doesn't do a single pass of the drives, but uses a "smarter" temporal algorithm based on metadata. However, the current implementation has difficulty finishing the job if there's a steady flow of updates to the pool. As far as I'm aware, the only way to get bounded resilver times is to stop the workload until resilvering is completed. The problem exists for mirrors too, but is not as marked because mirror reconstruction is inherently simpler.

I believe Oracle is aware of the problem, but most of the core ZFS team has left. And of course, a fix for Oracle Solaris no longer means a fix for the rest of us.

_______________________________________________
zfs-discuss mailing list
zfs-discuss at opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
> I believe Oracle is aware of the problem, but most of
> the core ZFS team has left. And of course, a fix for
> Oracle Solaris no longer means a fix for the rest of
> us.

OK, that is a bit concerning then. As good as ZFS may be, I'm not sure I want to commit to a file system that is 'broken' and may not be fully fixed, if at all.

Hmmmmmnnn...
--
This message posted from opensolaris.org
On 20/12/2010 11:03, Deano wrote:
> Hi,
>
> Which brings up an interesting question... IF it were fixed in, for example,
> illumos or FreeBSD, is there a plan for how to handle possibly incompatible
> zfs implementations?
>
> Currently the basic version numbering only works because it implies a single
> stream of development; now, with multiple possible streams, does ZFS need to
> move to a feature-bit system, or are we going to have forks or multiple
> incompatible versions?
>
> Thanks,
> Deano

Changes to the resilvering implementation don't necessarily require changes to the on-disk format (although they could). Of course, there might be an issue moving a pool mid-resilver from one implementation to another.

With arguably considerably more ZFS expertise outside Oracle than in, there's a good chance the community will get to a fix first. It would then be interesting to see whether NIH prevails, or perhaps even a new spirit of "share and share alike".

"You may say I'm a dreamer ..."
On 20/12/2010 11:29, Lanky Doodle wrote:
>> I believe Oracle is aware of the problem, but most of
>> the core ZFS team has left. And of course, a fix for
>> Oracle Solaris no longer means a fix for the rest of
>> us.
>
> OK, that is a bit concerning then. As good as ZFS may be, I'm not sure I want to commit to a file system that is 'broken' and may not be fully fixed, if at all.
>
> Hmmmmmnnn...

My home server is still running snv_82, and my iMac is running Apple's last public beta release for Leopard. The way I see it, the on-disk format is sound, and the basic "always consistent on disk" promise seems to be worth something. My files are read-mostly, and performance isn't an issue for me. ZFS has protected my data for several years now in the face of various hardware issues.

I'll upgrade my NAS appliance to OpenSolaris snv_134b sometime soon, but as far as I can tell, I can't use Oracle Solaris 11 Express for licensing reasons (I have backups of business data). I'll be watching Illumos with interest, but snv_82 has served me well for 3 years, so I figure snv_134b probably has quite a lot of useful life left in it, and maybe by then btrfs will be ready for prime time?
Phil Harman <phil.harman at gmail.com> wrote:

> Changes to the resilvering implementation don't necessarily require
> changes to the on-disk format (although they could). Of course, there
> might be an issue moving a pool mid-resilver from one implementation to
> another.

We seem to be coming to a problem similar to the one with UFS 20 years ago. At that time, Sun enhanced the UFS on-disk format, but the *BSDs did not follow this change even though the format change was "documented" in the related include files.

For future ZFS development, there may be a need to allow one implementation to support on-disk versions 1..21 + 24 and another implementation to support on-disk versions 1..23 + 25.

These thoughts are of course void in case Oracle continues the OSS decisions for Solaris and other Solaris variants can import the code related to recent enhancements.

Jörg

--
EMail: joerg at schily.isdn.cs.tu-berlin.de (home) Jörg Schilling D-13353 Berlin
       js at cs.tu-berlin.de (uni)
       joerg.schilling at fokus.fraunhofer.de (work) Blog: http://schily.blogspot.com/
URL: http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily
On Dec 20, 2010, at 2:42 AM, Phil Harman <phil.harman at gmail.com> wrote:

>> Why does resilvering take so long in raidz anyway?
>
> Because it's broken. There were some changes a while back that made it more broken.

"broken" is the wrong term here. It functions as designed and correctly resilvers devices. Disagreeing with the design is quite different than proving a defect.

> There has been a lot of discussion, anecdotes and some data on this list.

"slow because I use devices with poor random write(!) performance" is very different than "broken."

> The resilver doesn't do a single pass of the drives, but uses a "smarter" temporal algorithm based on metadata.

A design that only does a single pass does not handle the temporal changes. Many RAID implementations use a mix of spatial and temporal resilvering and suffer with that design decision.

> However, the current implementation has difficulty finishing the job if there's a steady flow of updates to the pool.

Please define current. There are many releases of ZFS, and many improvements have been made over time. What has not improved is the random write performance of consumer-grade HDDs.

> As far as I'm aware, the only way to get bounded resilver times is to stop the workload until resilvering is completed.

I know of no RAID implementation that bounds resilver times for HDDs. I believe it is not possible. OTOH, whether a resilver takes 10 seconds or 10 hours makes little difference in data availability. Indeed, this is why we often throttle resilvering activity. See previous discussions on this forum regarding the dueling RFEs.

> The problem exists for mirrors too, but is not as marked because mirror reconstruction is inherently simpler.

Resilver time is bounded by the random write performance of the resilvering device. Mirroring or raidz make no difference.

> I believe Oracle is aware of the problem, but most of the core ZFS team has left. And of course, a fix for Oracle Solaris no longer means a fix for the rest of us.

Some "improvements" were made post-b134 and pre-b148.
-- richard
Thanks relling.

I suppose at the end of the day any file system/volume manager has its flaws, so perhaps it's better to look at the positives of each and decide based on them.

So, back to my question above, is there a deciding argument *against* putting data on the install volume (rpool)? Forget about mirroring for a sec;

1) Select 3 disks during install, creating a raidz1. Create a further 4x 3-drive raidz1s, giving me a 10TB rpool with no spare disks
2) Select 5 disks during install, creating a raidz1. Create a further 2x 5-drive raidz1s, giving me a 12TB rpool with no spare disks
3) Select 7 disks during install, creating a raidz1. Create a further 7-drive raidz1, giving me a 12TB rpool with 1 spare disk

As there is no space gain between 2) and 3), there is no point going for 3), other than having a spare disk, and resilver times would be slower. So it comes down to 1) and 2). Neither offers spare disks, but 1) would offer faster resilver times and up to 5 simultaneous disk failures, while 2) would offer 2TB of extra space with up to 3 simultaneous disk failures.

FYI, I am using Samsung SpinPoint F2s, which have variable RPM speeds (http://www.scan.co.uk/products/1tb-samsung-hd103si-ecogreen-f2-sata-3gb-s-32mb-cache-89-ms-ncq)

I may wait at least until I get the next 4 drives in (I actually have 6 at the mo, not 5), taking me to 10, before migrating to ZFS, so plenty of time to think about it and hopefully time for them to fix resilvering! ;-)

Thanks again...
--
This message posted from opensolaris.org
On 20/12/2010 13:59, Richard Elling wrote:
> On Dec 20, 2010, at 2:42 AM, Phil Harman <phil.harman at gmail.com> wrote:
>
>>> Why does resilvering take so long in raidz anyway?
>> Because it's broken. There were some changes a while back that made
>> it more broken.
>
> "broken" is the wrong term here. It functions as designed and correctly
> resilvers devices. Disagreeing with the design is quite different than
> proving a defect.

It might be the wrong term in general, but I think it does apply in the budget home media server context of this thread.

I think we can agree that ZFS currently doesn't play well on cheap disks. I think we can also agree that the performance of ZFS resilvering is known to be suboptimal under certain conditions.

For a long time at Sun, the rule was "correctness is a constraint, performance is a goal". However, in the real world, performance is often also a constraint (just as a quick but erroneous answer is a wrong answer, so also a slow but correct answer can be "wrong"). Then one brave soul at Sun ventured that "if Linux is faster, it's a Solaris bug!" and to his surprise, the idea caught on.

I later went on to tell people that ZFS delivered RAID "where I = inexpensive", so I'm just a little frustrated when that promise becomes less respected over time. First it was USB drives (which I agreed with), now it's SATA (and I'm not so sure).

>> There has been a lot of discussion, anecdotes and some data on this
>> list.
>
> "slow because I use devices with poor random write(!) performance"
> is very different than "broken."

Again, context is everything. For example, if someone was building a business critical NAS appliance from consumer grade parts, I'd be the first to say "are you nuts?!"

>> The resilver doesn't do a single pass of the drives, but uses a
>> "smarter" temporal algorithm based on metadata.
>
> A design that only does a single pass does not handle the temporal
> changes. Many RAID implementations use a mix of spatial and temporal
> resilvering and suffer with that design decision.

Actually, it's easy to see how a combined spatial and temporal approach could be implemented to an advantage for mirrored vdevs.

>> However, the current implementation has difficulty finishing the job if
>> there's a steady flow of updates to the pool.
>
> Please define current. There are many releases of ZFS, and
> many improvements have been made over time. What has not
> improved is the random write performance of consumer-grade
> HDDs.

I was led to believe this was not yet fixed in Solaris 11, and that there are therefore doubts about what Solaris 10 update may see the fix, if any.

>> As far as I'm aware, the only way to get bounded resilver times is to
>> stop the workload until resilvering is completed.
>
> I know of no RAID implementation that bounds resilver times
> for HDDs. I believe it is not possible. OTOH, whether a resilver
> takes 10 seconds or 10 hours makes little difference in data
> availability. Indeed, this is why we often throttle resilvering
> activity. See previous discussions on this forum regarding the
> dueling RFEs.

I don't share your disbelief or your "little difference" analysis. If it is true that no current implementation succeeds, isn't that a great opportunity to change the rules? Wasn't resilver time vs availability a major factor in Adam Leventhal's paper introducing the need for RAIDZ3?

The appropriateness or otherwise of resilver throttling depends on the context. If I can tolerate further failures without data loss (e.g. RAIDZ2 with one failed device, or RAIDZ3 with two failed devices), or if I can recover business critical data in a timely manner, then great. But there may come a point where I would rather take a short term performance hit to close the window on total data loss.

>> The problem exists for mirrors too, but is not as marked because
>> mirror reconstruction is inherently simpler.
>
> Resilver time is bounded by the random write performance of
> the resilvering device. Mirroring or raidz make no difference.

This only holds in a quiesced system.

>> I believe Oracle is aware of the problem, but most of the core ZFS
>> team has left. And of course, a fix for Oracle Solaris no longer
>> means a fix for the rest of us.
>
> Some "improvements" were made post-b134 and pre-b148.

That is, indeed, good news.
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Lanky Doodle
>
> > I believe Oracle is aware of the problem, but most of
> > the core ZFS team has left. And of course, a fix for
> > Oracle Solaris no longer means a fix for the rest of
> > us.
>
> OK, that is a bit concerning then. As good as ZFS may be, I'm not sure I want
> to commit to a file system that is 'broken' and may not be fully fixed, if at all.

ZFS is not "broken." It is, however, a weak spot, that resilver is very inefficient. For example:

On my server, which is made up of 10krpm SATA drives, 1TB each... my drives can each sustain 1Gbit/sec sequential read/write. This means, if I needed to resilver the entire drive (in a mirror) sequentially, it would take 8,000 sec = 133 minutes. About 2 hours. In reality, I have ZFS mirrors, and disks are around 70% full, and resilver takes 12-14 hours.

So although resilver is "broken" by some standards, it is bounded, and you can limit it to something which is survivable, by using mirrors instead of raidz. For most people, even using 5-disk or 7-disk raidzN will still be fine. But you start getting unsustainable if you get up to a 21-disk raidz3, for example.
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Lanky Doodle
>
> Is there any argument against using the rpool for all data storage as well as
> being the install volume?

Generally speaking, you can't do it. The rpool is only supported on mirrors, not raidz. I believe this is because you need the rpool in order to load the kernel, and until the kernel is loaded, there's just no reasonable way to have a fully zfs-aware, supports-every-feature bootloader able to read the rpool in order to fetch the kernel.

Normally, you'll dedicate 2 disks to the OS, and then build additional separate data pools. If you absolutely need all the disk space of the OS disks, then you partition the OS into a smaller section of the OS disks and assign the remaining space to some pool. But doing that partitioning scheme can be complex, and if you're not careful, risky. I don't advise it unless you truly have your back against a wall for more disk space.

> Why does resilvering take so long in raidz anyway?

There are some really long and sometimes complex threads in this mailing list discussing that. Fundamentally...

First of all, it's not always true. It depends on your usage behavior and the type of disks you're using. But "typical" usage includes reading and writing a lot of files, essentially randomly over time, creating and deleting snapshots, and using spindle disks, so the "typical" usage behavior does have a resilver performance problem.

The root cause of the problem is that ZFS does not resilver the whole disk... it only resilvers the used portions of the disk. Sounds like a performance enhancer, right? It would be, if the disks were mostly empty, or if ZFS resilvered the used portions in order according to disk layout. Unfortunately, it resilvers according to the temporal order the blocks were written, and usually a disk is significantly full (say, 50% or more), so the disks have to thrash all around, performing all sorts of random reads, until eventually all the used parts have been read in random order.

It's worse on raidzN than on mirrors, because the number of items which must be read is higher in raidzN, assuming you're using larger vdevs and therefore more items exist scattered about inside that vdev. You therefore have a higher number of things which must be randomly read before you reach completion.
> -----Original Message-----
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Edward Ned Harvey
> Sent: Monday, December 20, 2010 11:46 AM
> To: 'Lanky Doodle'; zfs-discuss at opensolaris.org
> Subject: Re: [zfs-discuss] A few questions
>
> ZFS is not "broken." It is, however, a weak spot, that resilver is very
> inefficient. For example:
>
> On my server, which is made up of 10krpm SATA drives, 1TB each... my drives
> can each sustain 1Gbit/sec sequential read/write. This means, if I needed
> to resilver the entire drive (in a mirror) sequentially, it would take
> 8,000 sec = 133 minutes. About 2 hours. In reality, I have ZFS mirrors,
> and disks are around 70% full, and resilver takes 12-14 hours.
>
> So although resilver is "broken" by some standards, it is bounded, and you
> can limit it to something which is survivable, by using mirrors instead of
> raidz. For most people, even using 5-disk or 7-disk raidzN will still be
> fine. But you start getting unsustainable if you get up to a 21-disk raidz3,
> for example.

This argument keeps coming up on the list, but I don't see where anyone has made a good suggestion about whether this can even be 'fixed' or how it would be done.

As I understand it, you have two basic types of array reconstruction: in a mirror you can make a block-by-block copy and that's easy, but in a parity array you have to perform a calculation on the existing data and/or existing parity to reconstruct the missing piece. This is pretty easy when you can guarantee that all your stripes are the same width, start/end on the same sectors/boundaries/whatever, and thus know a piece of them lives on all drives in the set. I don't think this is possible with ZFS since we have variable stripe width. A failed disk d may or may not contain data from stripe s (or transaction t). This information has to be discovered by looking at the transaction records. Right?

Can someone speculate as to how you could rebuild a variable-stripe-width array without replaying all the available transactions? I am no filesystem engineer, but I can't wrap my head around how this could be handled any better than it already is. I've read that resilvering is throttled - presumably to keep performance degradation to a minimum during the process - maybe this could be a tunable (e.g. priority: low, normal, high)?

Do we know if resilvers on a mirror are actually handled differently from those on a raidz?

Sorry if this has already been explained. I think this is an issue that everyone who uses ZFS should understand completely before jumping in, because the behavior (while not 'wrong') is clearly NOT the same as with more conventional arrays.

-Will
On 12/20/2010 9:20 AM, Saxon, Will wrote:
> This argument keeps coming up on the list, but I don't see where anyone has made a good suggestion about whether this can even be 'fixed' or how it would be done.
>
> As I understand it, you have two basic types of array reconstruction: in a mirror you can make a block-by-block copy and that's easy, but in a parity array you have to perform a calculation on the existing data and/or existing parity to reconstruct the missing piece. This is pretty easy when you can guarantee that all your stripes are the same width, start/end on the same sectors/boundaries/whatever, and thus know a piece of them lives on all drives in the set. I don't think this is possible with ZFS since we have variable stripe width. A failed disk d may or may not contain data from stripe s (or transaction t). This information has to be discovered by looking at the transaction records. Right?
>
> Can someone speculate as to how you could rebuild a variable-stripe-width array without replaying all the available transactions? I am no filesystem engineer, but I can't wrap my head around how this could be handled any better than it already is. I've read that resilvering is throttled - presumably to keep performance degradation to a minimum during the process - maybe this could be a tunable (e.g. priority: low, normal, high)?
>
> Do we know if resilvers on a mirror are actually handled differently from those on a raidz?
>
> Sorry if this has already been explained. I think this is an issue that everyone who uses ZFS should understand completely before jumping in, because the behavior (while not 'wrong') is clearly NOT the same as with more conventional arrays.
>
> -Will

The "problem" is NOT the checksum/error correction overhead. That's relatively trivial. The problem isn't really even variable-width slabs (i.e. the variable number of disks one crosses).

The problem boils down to this: when ZFS does a resilver, it walks the METADATA tree to determine what order to rebuild things in. That means it resilvers the very first slab ever written, then the next oldest, etc. The problem here is that slab "age" has nothing to do with where that data physically resides on the actual disks. If you've used the zpool as a WORM device, then, sure, there should be a strict correlation between increasing slab age and locality on the disk. However, in any reasonable case, files get deleted regularly. This means that for a slab B written immediately after slab A, the chances are it WON'T be physically near slab A.

In the end, the problem is that using metadata order, while reducing the total amount of work to do in the resilver (as you only resilver live data, not every bit on the drive), increases the physical inefficiency for each slab. That is, seek time between cylinders begins to dominate your slab reconstruction time. In RAIDZ, this problem is magnified by both the much larger average vdev size vs mirrors, and the necessity that all drives containing a slab's data return it before the corrected data can be written to the resilvering drive.

Thus, current ZFS resilvering tends to be seek-time limited, NOT throughput limited. This is really the "fault" of the underlying media, not ZFS. For instance, if you have a raidz of SSDs (where seek time is negligible, but throughput isn't), they resilver really, really fast. In fact, they resilver at the maximum write throughput rate. However, HDs are severely seek-limited, so that dominates HD resilver time.

The "answer" isn't simple, as the problem is media-specific.

--
Erik Trimble
Java System Support
Mailstop: usca22-123
Phone: x17195
Santa Clara, CA
Timezone: US/Pacific (GMT-0800)
On 12/20/2010 9:20 AM, Saxon, Will wrote:>> -----Original Message----- >> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss- >> bounces at opensolaris.org] On Behalf Of Edward Ned Harvey >> Sent: Monday, December 20, 2010 11:46 AM >> To: ''Lanky Doodle''; zfs-discuss at opensolaris.org >> Subject: Re: [zfs-discuss] A few questions >> >>> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss- >>> bounces at opensolaris.org] On Behalf Of Lanky Doodle >>> >>>> I believe Oracle is aware of the problem, but most of >>>> the core ZFS team has left. And of course, a fix for >>>> Oracle Solaris no longer means a fix for the rest of >>>> us. >>> OK, that is a bit concerning then. As good as ZFS may be, i''m not sure I >> want >>> to committ to a file system that is ''broken'' and may not be fully fixed, >> if at all. >> >> ZFS is not "broken." It is, however, a weak spot, that resilver is very >> inefficient. For example: >> >> On my server, which is made up of 10krpm SATA drives, 1TB each... My >> drives >> can each sustain 1Gbit/sec sequential read/write. This means, if I needed >> to resilver the entire drive (in a mirror) sequentially, it would take ... >> 8,000 sec = 133 minutes. About 2 hours. In reality, I have ZFS mirrors, >> and disks are around 70% full, and resilver takes 12-14 hours. >> >> So although resilver is "broken" by some standards, it is bounded, and you >> can limit it to something which is survivable, by using mirrors instead of >> raidz. For most people, even using 5-disk, or 7-disk raidzN will still be >> fine. But you start getting unsustainable if you get up to 21-disk radiz3 >> for example. > This argument keeps coming up on the list, but I don''t see where anyone has made a good suggestion about whether this can even be ''fixed'' or how it would be done. > > As I understand it, you have two basic types of array reconstruction: in a mirror you can make a block-by-block copy and that''s easy, but in a parity array you have to perform a calculation on the existing data and/or existing parity to reconstruct the missing piece. This is pretty easy when you can guarantee that all your stripes are the same width, start/end on the same sectors/boundaries/whatever and thus know a piece of them lives on all drives in the set. I don''t think this is possible with ZFS since we have variable stripe width. A failed disk d may or may not contain data from stripe s (or transaction t). This information has to be discovered by looking at the transaction records. Right? > > Can someone speculate as to how you could rebuild a variable stripe width array without replaying all the available transactions? I am no filesystem engineer but I can''t wrap my head around how this could be handled any better than it already is. I''ve read that resilvering is throttled - presumably to keep performance degradation to a minimum during the process - maybe this could be a tunable (e.g. priority: low, normal, high)? > > Do we know if resilvers on a mirror are actually handled differently from those on a raidz? > > Sorry if this has already been explained. I think this is an issue that everyone who uses ZFS should understand completely before jumping in, because the behavior (while not ''wrong'') is clearly NOT the same as with more conventional arrays. > > -Will >As far as a possible fix, here''s what I can see: [Note: I''m not a kernel or FS-level developer. 
I would love to be able to fix this myself, but I have neither the aptitude nor the [extensive] time to learn such a skill.]

We can either (a) change how ZFS does resilvering or (b) repack the zpool layouts to avoid the problem in the first place.

In case (a), my vote would be to seriously increase the number of in-flight resilver slabs, AND allow for out-of-time-order slab resilvering. By that, I mean that ZFS would read several disk-sequential slabs, and then mark them as "done". This would mean a *lot* of scanning the metadata tree (since leaves all over the place could be "done"). Frankly, I can't say how bad that would be; the problem is that for ANY resilver, ZFS would have to scan the entire metadata tree to see if it had work to do, rather than simply look for the latest completed leaf and then assume everything after that needs to be done. There'd also be the matter of determining *if* one should read a disk sector...

In case (b), we need the ability to move slabs around on the physical disk (via the mythical "Block Pointer Re-write" method). If there is that underlying mechanism, then a "defrag" utility can be run to repack the zpool to the point where chronological creation time = physical layout, which then substantially mitigates the seek time problem.

I can't fix (a) - I don't understand the codebase well enough. Neither can I do the BP-rewrite implementation. However, if I can get BP-rewrite, I've got a prototype defragger that seems to work well (under simulation). I'm sure it could use some performance improvement, but it works reasonably well on a simulated fragmented pool.

Please, Santa, can a good little boy get a BP-rewrite code commit in his stocking for Christmas?

-- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA Timezone: US/Pacific (GMT-0800)
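As a thought experiment, option (a) might look something like the sketch below: gather a window of live block pointers from the metadata walk, issue them in physical-offset order, and track completion per block rather than with a single "last finished" watermark. The data structures and the rebuild callback are hypothetical; this is not the actual ZFS code path.

    # Out-of-order resilver sketch: batch the metadata walk, sort each batch by
    # physical offset, and remember completion per block.
    def resilver_out_of_order(live_blocks, rebuild, window=10000):
        # live_blocks: iterable of (birth_txg, phys_offset, length) tuples
        # rebuild(offset, length): read the surviving copies and write the new disk
        done = set()
        pending = []

        def flush():
            # issue the batch in disk order, turning random I/O into near-sequential I/O
            for birth, offset, length in sorted(pending, key=lambda b: b[1]):
                rebuild(offset, length)
                done.add((birth, offset))
            pending.clear()

        for blk in live_blocks:
            pending.append(blk)
            if len(pending) >= window:
                flush()
        flush()
        return done

The cost Erik describes falls straight out of this: after an interruption, "done" is no longer a simple prefix of the birth-time order, so a restarted resilver has to re-walk the metadata tree and test each block against the completion record.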
Erik, just a hypothetical what-if ... In the case of resilvering on a mirrored disk, why not take a snapshot, and then resilver by doing a pure block copy from the snapshot? It would be sequential, so long as the original data was unmodified; and random access in dealing with the modified blocks only, right. After the original snapshot had been replicated, a second pass would be done, in order to update the clone to 100% live data. Not knowing enough about the inner workings of ZFS snapshots, I don''t know why this would not be doable. (I''m biased towards mirrors for busy filesystems.) I''m supposing that a block-level snapshot is not doable -- or is it? Mark On Dec 20, 2010, at 1:27 PM, Erik Trimble wrote:> On 12/20/2010 9:20 AM, Saxon, Will wrote: >>> -----Original Message----- >>> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss- >>> bounces at opensolaris.org] On Behalf Of Edward Ned Harvey >>> Sent: Monday, December 20, 2010 11:46 AM >>> To: ''Lanky Doodle''; zfs-discuss at opensolaris.org >>> Subject: Re: [zfs-discuss] A few questions >>> >>>> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss- >>>> bounces at opensolaris.org] On Behalf Of Lanky Doodle >>>> >>>>> I believe Oracle is aware of the problem, but most of >>>>> the core ZFS team has left. And of course, a fix for >>>>> Oracle Solaris no longer means a fix for the rest of >>>>> us. >>>> OK, that is a bit concerning then. As good as ZFS may be, i''m not sure I >>> want >>>> to committ to a file system that is ''broken'' and may not be fully fixed, >>> if at all. >>> >>> ZFS is not "broken." It is, however, a weak spot, that resilver is very >>> inefficient. For example: >>> >>> On my server, which is made up of 10krpm SATA drives, 1TB each... My >>> drives >>> can each sustain 1Gbit/sec sequential read/write. This means, if I needed >>> to resilver the entire drive (in a mirror) sequentially, it would take ... >>> 8,000 sec = 133 minutes. About 2 hours. In reality, I have ZFS mirrors, >>> and disks are around 70% full, and resilver takes 12-14 hours. >>> >>> So although resilver is "broken" by some standards, it is bounded, and you >>> can limit it to something which is survivable, by using mirrors instead of >>> raidz. For most people, even using 5-disk, or 7-disk raidzN will still be >>> fine. But you start getting unsustainable if you get up to 21-disk radiz3 >>> for example. >> This argument keeps coming up on the list, but I don''t see where anyone has made a good suggestion about whether this can even be ''fixed'' or how it would be done. >> >> As I understand it, you have two basic types of array reconstruction: in a mirror you can make a block-by-block copy and that''s easy, but in a parity array you have to perform a calculation on the existing data and/or existing parity to reconstruct the missing piece. This is pretty easy when you can guarantee that all your stripes are the same width, start/end on the same sectors/boundaries/whatever and thus know a piece of them lives on all drives in the set. I don''t think this is possible with ZFS since we have variable stripe width. A failed disk d may or may not contain data from stripe s (or transaction t). This information has to be discovered by looking at the transaction records. Right? >> >> Can someone speculate as to how you could rebuild a variable stripe width array without replaying all the available transactions? I am no filesystem engineer but I can''t wrap my head around how this could be handled any better than it already is. 
I''ve read that resilvering is throttled - presumably to keep performance degradation to a minimum during the process - maybe this could be a tunable (e.g. priority: low, normal, high)? >> >> Do we know if resilvers on a mirror are actually handled differently from those on a raidz? >> >> Sorry if this has already been explained. I think this is an issue that everyone who uses ZFS should understand completely before jumping in, because the behavior (while not ''wrong'') is clearly NOT the same as with more conventional arrays. >> >> -Will > the "problem" is NOT the checksum/error correction overhead. that''s relatively trivial. The problem isn''t really even variable width (i.e. variable number of disks one crosses) slabs. > > The problem boils down to this: > > When ZFS does a resilver, it walks the METADATA tree to determine what order to rebuild things from. That means, it resilvers the very first slab ever written, then the next oldest, etc. The problem here is that slab "age" has nothing to do with where that data physically resides on the actual disks. If you''ve used the zpool as a WORM device, then, sure, there should be a strict correlation between increasing slab age and locality on the disk. However, in any reasonable case, files get deleted regularly. This means that the probability that for a slab B, written immediately after slab A, it WON''T be physically near slab A. > > In the end, the problem is that using metadata order, while reducing the total amount of work to do in the resilver (as you only resilver live data, not every bit on the drive), increases the physical inefficiency for each slab. That is, seek time between cyclinders begins to dominate your slab reconstruction time. In RAIDZ, this problem is magnified by both the much larger average vdev size vs mirrors, and the necessity that all drives containing a slab information return that data before the corrected data can be written to the resilvering drive. > > Thus, current ZFS resilvering tends to be seek-time limited, NOT throughput limited. This is really the "fault" of the underlying media, not ZFS. For instance, if you have a raidZ of SSDs (where seek time is negligible, but throughput isn''t), they resilver really, really fast. In fact, they resilver at the maximum write throughput rate. However, HDs are severely seek-limited, so that dominates HD resilver time. > > > The "answer" isn''t simple, as the problem is media-specific. > > -- > Erik Trimble > Java System Support > Mailstop: usca22-123 > Phone: x17195 > Santa Clara, CA > Timezone: US/Pacific (GMT-0800) > > _______________________________________________ > zfs-discuss mailing list > zfs-discuss at opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
On 12/20/2010 11:56 AM, Mark Sandrock wrote:
> Erik,
>
> just a hypothetical what-if ...
>
> In the case of resilvering on a mirrored disk, why not take a snapshot, and then resilver by doing a pure block copy from the snapshot? It would be sequential, so long as the original data was unmodified; and random access in dealing with the modified blocks only, right.
>
> After the original snapshot had been replicated, a second pass would be done, in order to update the clone to 100% live data.
>
> Not knowing enough about the inner workings of ZFS snapshots, I don't know why this would not be doable. (I'm biased towards mirrors for busy filesystems.)
>
> I'm supposing that a block-level snapshot is not doable -- or is it?
>
> Mark

Snapshots on ZFS are true snapshots - they take a picture of the current state of the system. They DON'T copy any data around when created. So, a ZFS snapshot would be just as fragmented as the ZFS filesystem was at the time.

The problem is this: Let's say I write blocks A, B, C, and D on a clean zpool (what kind, it doesn't matter). I now delete block C. Later on, I write block E. There is a probability (increasing dramatically as time goes on) that the on-disk layout will now look like:

A, B, E, D

rather than

A, B, [space], D, E

So, in the first case, I can do a sequential read to get A & B, but then must do a seek to get D, and a seek to get E.

The "fragmentation" problem is mainly due to file deletion, NOT to file re-writing. (Though, in ZFS, being a C-O-W filesystem, re-writing generally looks like a delete-then-write process, rather than a modify process.)

-- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA Timezone: US/Pacific (GMT-0800)
On Mon, 20 Dec 2010 11:27:41 PST Erik Trimble <erik.trimble at oracle.com> wrote:
> The problem boils down to this:
>
> When ZFS does a resilver, it walks the METADATA tree to determine what order to rebuild things from. That means, it resilvers the very first slab ever written, then the next oldest, etc. The problem here is that slab "age" has nothing to do with where that data physically resides on the actual disks. If you've used the zpool as a WORM device, then, sure, there should be a strict correlation between increasing slab age and locality on the disk. However, in any reasonable case, files get deleted regularly. This means that the probability that for a slab B, written immediately after slab A, it WON'T be physically near slab A.
>
> In the end, the problem is that using metadata order, while reducing the total amount of work to do in the resilver (as you only resilver live data, not every bit on the drive), increases the physical inefficiency for each slab. That is, seek time between cylinders begins to dominate your slab reconstruction time. In RAIDZ, this problem is magnified by both the much larger average vdev size vs mirrors, and the necessity that all drives containing a slab information return that data before the corrected data can be written to the resilvering drive.
>
> Thus, current ZFS resilvering tends to be seek-time limited, NOT throughput limited. This is really the "fault" of the underlying media, not ZFS. For instance, if you have a raidZ of SSDs (where seek time is negligible, but throughput isn't), they resilver really, really fast. In fact, they resilver at the maximum write throughput rate. However, HDs are severely seek-limited, so that dominates HD resilver time.

You guys may be interested in a solution I used in a totally different situation. There an identical tree data structure had to be maintained on every node of a distributed system. When a new node was added, it needed to be initialized with an identical copy before it could be put in operation. But this had to be done while the rest of the system was operational, and there may even be updates from a central node during the 'mirroring' operation. Some of these updates could completely change the tree! Starting at the root was not going to work, since a subtree that was being copied may stop existing in the middle and its space be reused! In a way this is a similar problem (but worse!). I needed something foolproof and simple.

My algorithm started copying sequentially from the start. If N blocks were already copied when an update comes along, updates of any block with block# > N are ignored (since the sequential copy would get to them eventually). Updates of any block# <= N were queued up (a further update of the same block would overwrite the old update, to reduce work). Periodically they would be flushed out to the new node. This was paced so as to not affect the normal operation much.

I should think a variation would work for active filesystems. You sequentially read some amount of data from all the disks from which data for the new disk is to be prepared, and write it out sequentially. Each time, read enough data so that reading time dominates any seek time. Handle concurrent updates as above. If you dedicate N% of time to resilvering, the total time to complete the resilver will be 100/N times the sequential read time of the whole disk. (For example, 1TB disk, 100MBps IO speed, 25% for resilver => under 12 hours). 
How much worse this gets depends on the amount of updates during resilvering. At the time of resilvering your FS is more likely to be near full than near empty so I wouldn''t worry about optimizing the mostly empty FS case. Bakul
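Bakul's scheme translates into code fairly directly. Here is a minimal sketch with hypothetical interfaces: latest write wins on the pending queue, no pacing, and no locking shown (a real implementation would need both).

    # Sequential copy with concurrent updates, per the description above.
    # read_src(start, n) returns n blocks; write_dst(start, blocks) writes them.
    class SequentialResilver:
        def __init__(self, nblocks):
            self.nblocks = nblocks
            self.copied_upto = 0      # N: blocks [0, N) already copied
            self.pending = {}         # block# -> latest data, only for already-copied blocks

        def on_update(self, blkno, data):
            if blkno < self.copied_upto:
                self.pending[blkno] = data     # must be re-copied; newer update overwrites older
            # else: ignore, the sequential pass will reach it anyway

        def run(self, read_src, write_dst, chunk=1024):
            while self.copied_upto < self.nblocks:
                n = min(chunk, self.nblocks - self.copied_upto)
                write_dst(self.copied_upto, read_src(self.copied_upto, n))
                self.copied_upto += n
                for blkno, data in self.pending.items():   # periodic flush (pacing omitted)
                    write_dst(blkno, [data])
                self.pending.clear()

His arithmetic also checks out: a 1 TB disk at 100 MB/s sequential, with 25% of the time given to the resilver, is 1e12 / (0.25 * 1e8) = 40,000 seconds, or a little over 11 hours.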
On Dec 20, 2010, at 2:05 PM, Erik Trimble wrote:> On 12/20/2010 11:56 AM, Mark Sandrock wrote: >> Erik, >> >> just a hypothetical what-if ... >> >> In the case of resilvering on a mirrored disk, why not take a snapshot, and then >> resilver by doing a pure block copy from the snapshot? It would be sequential, >> so long as the original data was unmodified; and random access in dealing with >> the modified blocks only, right. >> >> After the original snapshot had been replicated, a second pass would be done, >> in order to update the clone to 100% live data. >> >> Not knowing enough about the inner workings of ZFS snapshots, I don''t know why >> this would not be doable. (I''m biased towards mirrors for busy filesystems.) >> >> I''m supposing that a block-level snapshot is not doable -- or is it? >> >> Mark > Snapshots on ZFS are true snapshots - they take a picture of the current state of the system. They DON''T copy any data around when created. So, a ZFS snapshot would be just as fragmented as the ZFS filesystem was at the time.But if one does a raw (block) copy, there isn''t any fragmentation -- except for the COW updates. If there were no updates to the snapshot, then it becomes a 100% sequential block copy operation. But even with COW updates, presumably the large majority of the copy would still be sequential i/o. Maybe for the 2nd pass, the filesystem would have to be locked, so the operation would ever complete, but if this is fairly short in relation to the overall resilvering time, then it could still be a win in many cases. I''m probably not explaining it well, and may be way off, but it seemed an interesting notion. Mark> > > The problem is this: > > Let''s say I write block A, B, C, and D on a clean zpool (what kind, it doesn''t matter). I now delete block C. Later on, I write block E. There is a probability (increasing dramatically as times goes on), that the on-disk layout will now look like: > > A, B, E, D > > rather than > > A, B, [space], D, E > > > So, in the first case, I can do a sequential read to get A & B, but then must do a seek to get D, and a seek to get E. > > The "fragmentation" problem is mainly due to file deletion, NOT to file re-writing. (though, in ZFS, being a C-O-W filesystem, re-writing generally looks like a delete-then-write process, rather than a modify process). > > > -- > Erik Trimble > Java System Support > Mailstop: usca22-123 > Phone: x17195 > Santa Clara, CA > Timezone: US/Pacific (GMT-0800) >
> From: Erik Trimble [mailto:erik.trimble at oracle.com] > > We can either (a) change how ZFS does resilvering or (b) repack the > zpool layouts to avoid the problem in the first place. > > In case (a), my vote would be to seriously increase the number of > in-flight resilver slabs, AND allow for out-of-time-order slab > resilvering.Question for any clueful person: Suppose you have a mirror to resilver, made of disk1 and disk2, where disk2 failed and is resilvering. If you have an algorithm to create a list of all the used blocks of disk1 in disk order, then you''re able to resilver the mirror extremely fast, because all the reads will be sequential in nature, plus you get to skip past all the unused space. Now suppose you have a raidz with 3 disks (disk1, disk2, disk3, where disk3 is resilvering). You find some way of ordering all the used blocks of disk1... Which means disk1 will be able to read in optimal order and speed. Does that necessarily imply disk2 will also work well? Does the on-disk order of blocks of disk1 necessarily match the order of blocks on disk2? If there is no correlation between on-disk order of blocks for different disks within the same vdev, then all hope is lost; it''s essentially impossible to optimize the resilver/scrub order unless the on-disk order of multiple disks is highly correlated or equal by definition.
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss- > bounces at opensolaris.org] On Behalf Of Erik Trimble > > > In the case of resilvering on a mirrored disk, why not take a snapshot,and> then > > resilver by doing a pure block copy from the snapshot? It would be > sequential, > > So, a > ZFS snapshot would be just as fragmented as the ZFS filesystem was at > the time.I think Mark was suggesting something like "dd" copy device 1 onto device 2, in order to guarantee a first-pass sequential resilver. And my response would be: Creative thinking and suggestions are always a good thing. In fact, the above suggestion is already faster than the present-day solution for what I''m calling "typical" usage, but there are an awful lot of use cases where the "dd" solution would be worse... Such as a pool which is largely sequential already, or largely empty, or made of high IOPS devices such as SSD. However, there is a desire to avoid resilvering unused blocks, so I hope a better solution is possible... The fundamental requirement for a better optimized solution would be a way to resilver according to disk ordering... And it''s just a question for somebody that actually knows the answer ... How terrible is the idea of figuring out the on-disk order?
On Mon, Dec 20 at 19:19, Edward Ned Harvey wrote:>If there is no correlation between on-disk order of blocks for different >disks within the same vdev, then all hope is lost; it''s essentially >impossible to optimize the resilver/scrub order unless the on-disk order of >multiple disks is highly correlated or equal by definition.Very little is impossible. Drives have been optimally ordering seeks for 35+ years. I''m guessing that the trick (difficult, but not impossible) is how to solve a "travelling salesman" route pathing problem where you have billions or trillions of transactions, and do it fast enough that it was worth doing any extra computation besides just giving the device 32+ queued commands at a time that align with the elements of each ordered transaction ID. Add to that all the complexity of unwinding the error recovery in the event that you fail checksum validation on transaction N-1 after moving past transaction N, which would be a required capability if you wanted to queue more than a single transaction for verification at a time. Oh, and do all of the above without noticably affecting the throughput of the applications already running on the system. --eric -- Eric D. Mudama edmudama at mail.bounceswoosh.org
It well may be that different methods are optimal for different use cases. Mechanical disk vs. SSD; mirrored vs. raidz[123]; sparse vs. populated; etc. It would be interesting to read more in this area, if papers are available. I''ll have to take a look. ... Or does someone have pointers? Mark On Dec 20, 2010, at 6:28 PM, Edward Ned Harvey wrote:>> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss- >> bounces at opensolaris.org] On Behalf Of Erik Trimble >> >>> In the case of resilvering on a mirrored disk, why not take a snapshot, > and >> then >>> resilver by doing a pure block copy from the snapshot? It would be >> sequential, >> >> So, a >> ZFS snapshot would be just as fragmented as the ZFS filesystem was at >> the time. > > I think Mark was suggesting something like "dd" copy device 1 onto device 2, > in order to guarantee a first-pass sequential resilver. And my response > would be: Creative thinking and suggestions are always a good thing. In > fact, the above suggestion is already faster than the present-day solution > for what I''m calling "typical" usage, but there are an awful lot of use > cases where the "dd" solution would be worse... Such as a pool which is > largely sequential already, or largely empty, or made of high IOPS devices > such as SSD. However, there is a desire to avoid resilvering unused blocks, > so I hope a better solution is possible... > > The fundamental requirement for a better optimized solution would be a way > to resilver according to disk ordering... And it''s just a question for > somebody that actually knows the answer ... How terrible is the idea of > figuring out the on-disk order? >
On Dec 20, 2010, at 7:31 AM, Phil Harman <phil.harman at gmail.com> wrote:> On 20/12/2010 13:59, Richard Elling wrote: >> >> On Dec 20, 2010, at 2:42 AM, Phil Harman <phil.harman at gmail.com> wrote: >> >>> >>>> Why does resilvering take so long in raidz anyway? >>> Because it''s broken. There were some changes a while back that made it more broken. >> >> "broken" is the wrong term here. It functions as designed and correctly >> resilvers devices. Disagreeing with the design is quite different than >> proving a defect. > > It might be the wrong term in general, but I think it does apply in the budget home media server context of this thread.If you only have a few slow drives, you don''t have performance. Like trying to win the Indianapolis 500 with a tricycle...> I think we can agree that ZFS currently doesn''t play well on cheap disks. I think we can also agree that the performance of ZFS resilvering is known to be suboptimal under certain conditions.... and those conditions are also a strength. For example, most file systems are nowhere near full. With ZFS you only resilver data. For those who recall the resilver throttles in SVM or VXVM, you will appreciate not having to resilver non-data.> For a long time at Sun, the rule was "correctness is a constraint, performance is a goal". However, in the real world, performance is often also a constraint (just as a quick but erroneous answer is a wrong answer, so also, a slow but correct answer can also be "wrong"). > > Then one brave soul at Sun once ventured that "if Linux is faster, it''s a Solaris bug!" and to his surprise, the idea caught on. I later went on to tell people that ZFS delievered RAID "where I = inexpensive", so I''m a just a little frustrated when that promise becomes less respected over time. First it was USB drives (which I agreed with), now it''s SATA (and I''m not so sure)."slow" doesn''t begin with an "i" :-)> >> >>> There has been a lot of discussion, anecdotes and some data on this list. >> >> "slow because I use devices with poor random write(!) performance" >> is very different than "broken." > > Again, context is everything. For example, if someone was building a business critical NAS appliance from consumer grade parts, I''d be the first to say "are you nuts?!"Unfortunately, the math does not support your position...> >> >>> The resilver doesn''t do a single pass of the drives, but uses a "smarter" temporal algorithm based on metadata. >> >> A design that only does a single pass does not handle the temporal >> changes. Many RAID implementations use a mix of spatial and temporal >> resilvering and suffer with that design decision. > > Actually, it''s easy to see how a combined spatial and temporal approach could be implemented to an advantage for mirrored vdevs. > >> >>> However, the current implentation has difficulty finishing the job if there''s a steady flow of updates to the pool. >> >> Please define current. There are many releases of ZFS, and >> many improvements have been made over time. What has not >> improved is the random write performance of consumer-grade >> HDDs. > > I was led to believe this was not yet fixed in Solaris 11, and that there are therefore doubts about what Solaris 10 update may see the fix, if any. > >> >>> As far as I''m aware, the only way to get bounded resilver times is to stop the workload until resilvering is completed. >> >> I know of no RAID implementation that bounds resilver times >> for HDDs. I believe it is not possible. 
OTOH, whether a resilver >> takes 10 seconds or 10 hours makes little difference in data >> availability. Indeed, this is why we often throttle resilvering >> activity. See previous discussions on this forum regarding the >> dueling RFEs. > > I don''t share your disbelief or "little difference" analysys. If it is true that no current implementation succeeds, isn''t that a great opportunity to change the rules? Wasn''t resilver time vs availability was a major factor in Adam Leventhal''s paper introducing the need for RAIDZ3?No, it wasn''t. There are two failure modes we can model given the data provided by disk vendors: 1. failures by time (MTBF) 2. failures by bits read (UER) Over time, the MTBF has improved, but the failures by bits read has not improved. Just a few years ago enterprise class HDDs had an MTBF of around 1 million hours. Today, they are in the range of 1.6 million hours. Just looking at the size of the numbers, the probability that a drive will fail in one hour is on the order of 10e-6. By contrast, the failure rate by bits read has not improved much. Consumer class HDDs are usually spec''ed at 1 error per 1e14 bits read. To put this in perspective, a 2TB disk has around 1.6e13 bits. Or, the probability of an unrecoverable read if you read every bit on a 2TB is growing well above 10%. Some of the better enterprise class HDDs are rated two orders of magnitude better, but the only way to get much better is to use more bits for ECC... hence the move towards 4KB sectors. In other words, the probability of losing data by reading data can be larger than losing data next year. This is the case for triple parity RAID.> The appropriateness or otherwise of resilver throttling depends on the context. If I can tolerate further failures without data loss (e.g. RAIDZ2 with one failed device, or RAIDZ3 with two failed devices), or if I can recover business critical data in a timely manner, then great. But there may come a point where I would rather take a short term performance hit to close the window on total data loss.I agree. Back in the bad old days, we were stuck with silly throttles on SVM (10 IOPs, IIRC). The current ZFS throttle (b142, IIRC) is dependent on the competing, non-scrub I/O. This works because in ZFS all I/O is not created equal, unlike the layered RAID implementations such as SVM or RAID arrays. ZFS schedules the regular workload at a higher priority than scrubs or resilvers. Add the new throttles and the scheduler is even more effective. So you get your interactive performance at the cost of longer resilver times. This is probably a good trade-off for most folks.> >> >>> The problem exists for mirrors too, but is not as marked because mirror reconstruction is inherently simpler. >> >> Resilver time is bounded by the random write performance of >> the resilvering device. Mirroring or raidz make no difference. > > This only holds in a quiesced system.The effect will be worse for a mirror because you have direct competition for the single, surviving HDD. For raidz*, we clearly see the read workload spread out across the surving disks at approximatey the 1/N ratio. In other words, if you have a 4+1 raidz, then a resilver will keep the resilvering disk 100% busy writing, and the data disks approximately 25% busy reading. Later releases of ZFS will also prefetch the reads and the writes can be coalesced, skewing the ratio a little bit, but the general case seems to be a reasonable starting point. 
-- richard
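For the curious, the unrecoverable-read-error figure above can be checked in a couple of lines. This assumes errors are independent and occur at exactly the quoted consumer-class rate, which is a simplification.

    # Probability of at least one unrecoverable read while reading a full 2 TB drive,
    # at the commonly quoted consumer-class rate of 1 error per 1e14 bits read.
    bits = 2e12 * 8                       # ~1.6e13 bits on a 2 TB drive
    p_bit = 1.0 / 1e14
    p_at_least_one = 1 - (1 - p_bit) ** bits
    print(p_at_least_one)                 # ~0.15, i.e. "well above 10%"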
On Dec 20, 2010, at 4:19 PM, Edward Ned Harvey <opensolarisisdeadlongliveopensolaris at nedharvey.com> wrote:>> From: Erik Trimble [mailto:erik.trimble at oracle.com] >> >> We can either (a) change how ZFS does resilvering or (b) repack the >> zpool layouts to avoid the problem in the first place. >> >> In case (a), my vote would be to seriously increase the number of >> in-flight resilver slabs, AND allow for out-of-time-order slab >> resilvering. > > Question for any clueful person: > > Suppose you have a mirror to resilver, made of disk1 and disk2, where disk2 > failed and is resilvering. If you have an algorithm to create a list of all > the used blocks of disk1 in disk order, then you''re able to resilver the > mirror extremely fast, because all the reads will be sequential in nature, > plus you get to skip past all the unused space.Sounds like the definition of random access :-)> > Now suppose you have a raidz with 3 disks (disk1, disk2, disk3, where disk3 > is resilvering). You find some way of ordering all the used blocks of > disk1... Which means disk1 will be able to read in optimal order and speed.Sounds like prefetching :-)> Does that necessarily imply disk2 will also work well? Does the on-disk > order of blocks of disk1 necessarily match the order of blocks on disk2?This is an interesting question, that will become more interesting as the physical sector size gets bigger... -- richard>
> It's worse on raidzN than on mirrors, because the number of items which must be read is higher in raidzN, assuming you're using larger vdevs and therefore more items exist scattered about inside that vdev. You therefore have a higher number of things which must be randomly read before you reach completion.

In that case, isn't the answer to have a dedicated parity disk (or 2 or 3 depending on which raidz* is used), a la RAID-DP? Wouldn't this effectively be the 'same' as a mirror when resilvering (the only difference being parity vs actual data), as it's doing so from a single disk? RAID-DP covers the parity disk against failure, so raidz1 probably wouldn't be sensible, as if the parity disk itself fails you're left with no redundancy... -- This message posted from opensolaris.org
On 21/12/2010 05:44, Richard Elling wrote:> On Dec 20, 2010, at 7:31 AM, Phil Harman <phil.harman at gmail.com > <mailto:phil.harman at gmail.com>> wrote: >> On 20/12/2010 13:59, Richard Elling wrote: >>> On Dec 20, 2010, at 2:42 AM, Phil Harman <phil.harman at gmail.com >>> <mailto:phil.harman at gmail.com>> wrote: >>>>> Why does resilvering take so long in raidz anyway? >>>> Because it''s broken. There were some changes a while back that made >>>> it more broken. >>> "broken" is the wrong term here. It functions as designed and correctly >>> resilvers devices. Disagreeing with the design is quite different than >>> proving a defect. >> It might be the wrong term in general, but I think it does apply in >> the budget home media server context of this thread. > If you only have a few slow drives, you don''t have performance. > Like trying to win the Indianapolis 500 with a tricycle...The context of this thread is a budget home media server (certainly not the Indy 500, but perhaps not as humble as tricycle touring either). And whilst it is a habit of the hardware advocate to blame the software ... and vice versa ... it''s not much help to those of us trying to build "good enough" systems across the performance and availability spectrum.>> I think we can agree that ZFS currently doesn''t play well on cheap >> disks. I think we can also agree that the performance of ZFS >> resilvering is known to be suboptimal under certain conditions. > ... and those conditions are also a strength. For example, most file > systems are nowhere near full. With ZFS you only resilver data. For those > who recall the resilver throttles in SVM or VXVM, you will appreciate not > having to resilver non-data.I''d love to see the data and analysis for the assertion that "most files systems are nowhere near full", discounting, of course, any trivial cases. In my experience, in any cost conscious scenario, in the home or the enterprise, the expectation is that I''ll get to use the majority of the space I''ve paid for (generally "through the nose" from the storage silo team in the enterprise scenario). To borrow your illustration, even Indy 500 teams care about fuel consumption. What I don''t appreciate is having to resilver significantly more data than the drive can contain. But when it comes to the crunch, what I''d really appreciate was a bounded resilver time measured in hours not days or weeks.>> For a long time at Sun, the rule was "correctness is a constraint, >> performance is a goal". However, in the real world, performance is >> often also a constraint (just as a quick but erroneous answer is a >> wrong answer, so also, a slow but correct answer can also be "wrong"). >> >> Then one brave soul at Sun once ventured that "if Linux is faster, >> it''s a Solaris bug!" and to his surprise, the idea caught on. I later >> went on to tell people that ZFS delievered RAID "where I = >> inexpensive", so I''m a just a little frustrated when that promise >> becomes less respected over time. First it was USB drives (which I >> agreed with), now it''s SATA (and I''m not so sure). > "slow" doesn''t begin with an "i" :-)Both ZFS and RAID promised to play in the inexpensive space.>>>> There has been a lot of discussion, anecdotes and some data on this >>>> list. >>> "slow because I use devices with poor random write(!) performance" >>> is very different than "broken." >> Again, context is everything. 
For example, if someone was building a >> business critical NAS appliance from consumer grade parts, I''d be the >> first to say "are you nuts?!" > Unfortunately, the math does not support your position...Actually, the math (e.g. raw drive metrics) doesn''t lead me to expect such a disparity.>>>> The resilver doesn''t do a single pass of the drives, but uses a >>>> "smarter" temporal algorithm based on metadata. >>> A design that only does a single pass does not handle the temporal >>> changes. Many RAID implementations use a mix of spatial and temporal >>> resilvering and suffer with that design decision. >> Actually, it''s easy to see how a combined spatial and temporal >> approach could be implemented to an advantage for mirrored vdevs. >>>> However, the current implentation has difficulty finishing the job >>>> if there''s a steady flow of updates to the pool. >>> Please define current. There are many releases of ZFS, and >>> many improvements have been made over time. What has not >>> improved is the random write performance of consumer-grade >>> HDDs. >> I was led to believe this was not yet fixed in Solaris 11, and that >> there are therefore doubts about what Solaris 10 update may see the >> fix, if any. >>>> As far as I''m aware, the only way to get bounded resilver times is >>>> to stop the workload until resilvering is completed. >>> I know of no RAID implementation that bounds resilver times >>> for HDDs. I believe it is not possible. OTOH, whether a resilver >>> takes 10 seconds or 10 hours makes little difference in data >>> availability. Indeed, this is why we often throttle resilvering >>> activity. See previous discussions on this forum regarding the >>> dueling RFEs. >> I don''t share your disbelief or "little difference" analysys. If it >> is true that no current implementation succeeds, isn''t that a great >> opportunity to change the rules? Wasn''t resilver time vs availability >> was a major factor in Adam Leventhal''s paper introducing the need for >> RAIDZ3? > > No, it wasn''t.Maybe we weren''t reading the same paper? From http://dtrace.org/blogs/ahl/2009/12/21/acm_triple_parity_raid (a pointer to Adam''s ACM article)> The need for triple-parity RAID > ... > The time to populate a drive is directly relevant for RAID rebuild. As > disks in RAID systems take longer to reconstruct, the reliability of > the total system decreases due to increased periods running in a > degraded state. Today that can be four hours or longer; that could > easily grow to days or weeks.From http://queue.acm.org/detail.cfm?id=1670144 (Adam''s ACM article)> While bit error rates have nearly kept pace with the growth in disk > capacity, throughput has not been given its due consideration when > determining RAID reliability.Whilst Adam does discuss the lack of progress in bit error rates, his focus (in the article, and in his pointer to it) seems to be on drive capacity vs data rates, how this impact recovery times, and the consequential need to protect against multiple overlapping failures.> There are two failure modes we can model given the data > provided by disk vendors: > 1. failures by time (MTBF) > 2. failures by bits read (UER) > > Over time, the MTBF has improved, but the failures by bits read has not > improved. Just a few years ago enterprise class HDDs had an MTBF > of around 1 million hours. Today, they are in the range of 1.6 million > hours. Just looking at the size of the numbers, the probability that a > drive will fail in one hour is on the order of 10e-6. 
> > By contrast, the failure rate by bits read has not improved much. > Consumer class HDDs are usually spec''ed at 1 error per 1e14 > bits read. To put this in perspective, a 2TB disk has around 1.6e13 > bits. Or, the probability of an unrecoverable read if you read every bit > on a 2TB is growing well above 10%. Some of the better enterprise class > HDDs are rated two orders of magnitude better, but the only way to get > much better is to use more bits for ECC... hence the move towards > 4KB sectors. > > In other words, the probability of losing data by reading data can be > larger than losing data next year. This is the case for triple parity > RAID.MTBF as quoted by HDD vendors has become pretty meaningless. [nit: when a disk fails, it is not considered "repairable", so a better metric is MTTF (because there are no repairable failures)] 1.6 million hours equates to about 180 years, so why do HDD vendors guarantee their drives for considerably less (typically 3-5 years)? It''s because they base the figure on a constant failure rate expected during the normal useful life of the drive (typically 5 years). However, quoting from http://www.asknumbers.com/WhatisReliability.aspx> Field failures do not generally occur at a uniform rate, but follow a > distribution in time commonly described as a "bathtub curve." The life > of a device can be divided into three regions: Infant Mortality > Period, where the failure rate progressively improves; Useful Life > Period, where the failure rate remains constant; and Wearout Period, > where failure rates begin to increase.Crucially, the vendor''s quoted MTBF figures do not take into account "infant mortality" or early "wearout". Until every HDD is fitted with an environmental tell-tale device for shock, vibration, temperature, pressure, humidity, etc we can''t even come close to predicting either factor. And this is just the HDD itself. In a system there are many ways to lose access to an HDD. So I''m exposed when I lose the first drive in a RAIDZ1 (second drive in a RAIDZ2, or third drive in a RAIDZ3). And the longer the resilver takes, the longer I''m exposed. Add to the mix that Indy 500 drives can degrade to tricyle performance before they fail utterly, and yes, low performing drives can still be an issue, even for the elite.>> The appropriateness or otherwise of resilver throttling depends on >> the context. If I can tolerate further failures without data loss >> (e.g. RAIDZ2 with one failed device, or RAIDZ3 with two failed >> devices), or if I can recover business critical data in a timely >> manner, then great. But there may come a point where I would rather >> take a short term performance hit to close the window on total data loss. > > I agree. Back in the bad old days, we were stuck with silly throttles > on SVM (10 IOPs, IIRC). The current ZFS throttle (b142, IIRC) is dependent > on the competing, non-scrub I/O. This works because in ZFS all I/O is not > created equal, unlike the layered RAID implementations such as SVM or > RAID arrays. ZFS schedules the regular workload at a higher priority than > scrubs or resilvers. Add the new throttles and the scheduler is even more > effective. So you get your interactive performance at the cost of longer > resilver times. This is probably a good trade-off for most folks. > >>>> The problem exists for mirrors too, but is not as marked because >>>> mirror reconstruction is inherently simpler. >>> >>> Resilver time is bounded by the random write performance of >>> the resilvering device. 
Mirroring or raidz make no difference.
>>
>> This only holds in a quiesced system.
>
> The effect will be worse for a mirror because you have direct competition for the single, surviving HDD. For raidz*, we clearly see the read workload spread out across the surviving disks at approximately the 1/N ratio. In other words, if you have a 4+1 raidz, then a resilver will keep the resilvering disk 100% busy writing, and the data disks approximately 25% busy reading. Later releases of ZFS will also prefetch the reads and the writes can be coalesced, skewing the ratio a little bit, but the general case seems to be a reasonable starting point.

Mirrored systems need more drives to achieve the same capacity, so mirrored volumes are generally striped by some means, so the equivalent of your 4+1 RAIDZ1 is actually a 4+4. In such a configuration resilvering one drive at 100% would also result in a mean hit of 25%.

Obviously, a drive running at 100% has nothing more to give, so for fun let's throttle the resilver to 25 x 1MB sequential reads per second (which is about 25% of a good drive's sequential throughput). At this rate, a 2TB drive will resilver in under 24 hours, so let's make that the upper bound. It is highly desirable to throttle the resilver and regular I/O rates according to required performance and availability metrics, so something better than 24 hours should be the norm. It should also be possible for the system to report an ETA based on current and historic workload statistics. "You may say I'm a dreamer..."

For mirrored vdevs, ZFS could resilver using an efficient block level copy, whilst keeping a record of progress, and considering copied blocks as already mirrored and ready to be read and updated by normal activity. Obviously, it's much harder to apply this approach for RAIDZ. Since slabs are allocated sequentially, it should also be possible to set a high water mark for the bulk copy, so that fresh pools with little or no data could also be resilvered in minutes or seconds.

I believe such an approach would benefit all ZFS users, not just the elite.

> -- richard

Phil

p.s. just for the record, Nexenta's Hardware Supported List (HSL) is an excellent resource for those wanting to build NAS appliances that actually work... http://www.nexenta.com/corp/supported-hardware/hardware-supported-list ... which includes Hitachi Ultrastar A7K2000 SATA 7200rpm HDDs (enterprise class drives at near consumer prices)
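Phil's 24-hour upper bound is easy to verify; the figures below are just his example numbers (a 2 TB mirror member, 25 x 1 MB/s of sequential reads devoted to the resilver).

    # Throttled-resilver bound from the example above.
    drive_bytes = 2e12
    rate = 25 * 1e6                    # bytes per second devoted to resilver
    print(drive_bytes / rate / 3600)   # ~22 hours, under the proposed 24-hour bound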
On Dec 20, 2010, at 7:31 AM, Phil Harman <phil.harman at gmail.com> wrote:
> If you only have a few slow drives, you don't have performance.
> Like trying to win the Indianapolis 500 with a tricycle...

Well, you can put a jet engine on a tricycle and perhaps win it? Or you can change the race course to only allow a tricycle space to move. In the context of storage we have two factors, hardware and software; having faster and more reliable spindles is no reason to suggest that better software can't be used to beat it. The simple example is the ZIL SSD, where some software plus even a cheap commodity SSD will outperform any amount of expensive spindle drives for sync writes. Before the ZIL software, it was easy to argue that the only way of speeding up writes was more, faster spindles.

The question therefore is, is there room in the software implementation to achieve performance and reliability numbers similar to expensive drives whilst using relatively cheap drives?

ZFS is good, but IMHO it's easy to see how it can be improved to better meet this situation. I can't currently say when this line of thinking and code will move from research to production level use (tho I have a pretty good idea ;) ) but I wouldn't bet on the status quo lasting much longer. In some ways the removal of OpenSolaris may actually be a good thing, as it has catalyzed a number of developers from the view that zfs is Oracle led, to thinking "what can we do with zfs code as a base?"

For example, how about sticking a cheap 80GiB commodity SSD in the storage case. When a resilver or defrag is required, use it as scratch space to give you a block of fast-IOPS storage to accelerate the slow parts. When it's done, secure erase it and power it down, ready for the next time a resilver needs to happen. The hardware is available, it just needs someone to write the software...

Bye, Deano
On 21/12/2010 13:05, Deano wrote:
> On Dec 20, 2010, at 7:31 AM, Phil Harman <phil.harman at gmail.com> wrote:
>> If you only have a few slow drives, you don't have performance.
>> Like trying to win the Indianapolis 500 with a tricycle...

Actually, I didn't say that, Richard did :)
Doh sorry about that, the threading got very confused on my mail reader!

Bye, Deano
> From: edmudama at mail.bounceswoosh.org > [mailto:edmudama at mail.bounceswoosh.org] On Behalf Of Eric D. Mudama > > On Mon, Dec 20 at 19:19, Edward Ned Harvey wrote: > >If there is no correlation between on-disk order of blocks for different > >disks within the same vdev, then all hope is lost; it''s essentially > >impossible to optimize the resilver/scrub order unless the on-disk orderof> >multiple disks is highly correlated or equal by definition. > > Very little is impossible. > > Drives have been optimally ordering seeks for 35+ years. I''m guessingUnless your drive is able to queue up a request to read every single used part of the drive... Which is larger than the command queue for any reasonable drive in the world... The point is, in order to be "optimal" you have to eliminate all those seeks, and perform sequential reads only. The only seeks you should do are to skip over unused space. If you''re able to sequentially read the whole drive, skipping all the unused space, then you''re guaranteed to complete faster (or equal) than either (a) sequentially reading the whole drive, or (b) seeking all over the drive to read the used parts in random order.
> From: Richard Elling [mailto:richard.elling at gmail.com] > > > Now suppose you have a raidz with 3 disks (disk1, disk2, disk3, wheredisk3> > is resilvering). You find some way of ordering all the used blocks of > > disk1... Which means disk1 will be able to read in optimal order andspeed.> > Sounds like prefetching :-)Ok. Prefetch every used sector in the pool. Problem solved. Let the disks sort all the requests into on-disk order. Unless perhaps the number of requests would exceed the limits of what the drive is able to sort ... Which seems ... more than likely.
On Tue, Dec 21 at 8:24, Edward Ned Harvey wrote:>> From: edmudama at mail.bounceswoosh.org >> [mailto:edmudama at mail.bounceswoosh.org] On Behalf Of Eric D. Mudama >> >> On Mon, Dec 20 at 19:19, Edward Ned Harvey wrote: >> >If there is no correlation between on-disk order of blocks for different >> >disks within the same vdev, then all hope is lost; it''s essentially >> >impossible to optimize the resilver/scrub order unless the on-disk order >of >> >multiple disks is highly correlated or equal by definition. >> >> Very little is impossible. >> >> Drives have been optimally ordering seeks for 35+ years. I''m guessing > >Unless your drive is able to queue up a request to read every single used >part of the drive... Which is larger than the command queue for any >reasonable drive in the world... The point is, in order to be "optimal" you >have to eliminate all those seeks, and perform sequential reads only. The >only seeks you should do are to skip over unused space.I don''t think you read my whole post. I was saying this seek calculation pre-processing would have to be done by the host server, and while not impossible, is not trivial. Present the next 32 seeks to each device while the pre-processor works on the complete list of future seeks, and the drive will do as well as possible.>If you''re able to sequentially read the whole drive, skipping all the unused >space, then you''re guaranteed to complete faster (or equal) than either (a) >sequentially reading the whole drive, or (b) seeking all over the drive to >read the used parts in random order.Yes, I understand how that works. --eric -- Eric D. Mudama edmudama at mail.bounceswoosh.org
> From: edmudama at mail.bounceswoosh.org > [mailto:edmudama at mail.bounceswoosh.org] On Behalf Of Eric D. Mudama > > >Unless your drive is able to queue up a request to read every single used > >part of the drive... Which is larger than the command queue for any > >reasonable drive in the world... The point is, in order to be "optimal"you> >have to eliminate all those seeks, and perform sequential reads only.The> >only seeks you should do are to skip over unused space. > > I don''t think you read my whole post. I was saying this seek > calculation pre-processing would have to be done by the host server, > and while not impossible, is not trivial. Present the next 32 seeks > to each device while the pre-processor works on the complete list of > future seeks, and the drive will do as well as possible.I did read that, but now I think, perhaps I misunderstand it, or you misunderstood me? I am thinking... If you''re just queueing up a few reads at a time (less than infinity, or less than 99% of the pool) ... I would not assume that these 32 seeks are even remotely sequential.... I mean ... 32 blocks in a pool of presumably millions of blocks... I would assume they are essentially random, are they not? In my mind, which is likely wrong or at least oversimplified, I think if you want to order the list of blocks to read according to disk order (which should at least be theoretically possible on mirrors, but perhaps not even physically possible on raidz)... You would have to first generate a list of all the blocks to be read, and then sort it. Rough estimate, for any pool of a reasonable size, that sounds like some GB of ram to me. Maybe there''s a less-than-perfect sort algorithm which has a much lower memory footprint? Like a simple hashing algorithm that will guarantee the next few thousand seeks are in disk order... Although they will skip or jump over many blocks that will have to be done later ... An algorithm which is not a perfect sort, but given some repetition and multiple passes over the disk, might achieve an acceptable level of performance versus memory footprint...
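The "less-than-perfect sort" Ned is reaching for could look something like the sketch below: bucket block pointers by coarse disk region so memory stays bounded, and sweep region by region; blocks discovered late for an already-swept region simply fall into a later pass. Region size, structures, and the rebuild callback are made up for illustration, not taken from any real implementation.

    # Memory-bounded, approximately disk-ordered resilver: bucket by 1 GB region.
    from collections import defaultdict

    def bucketed_resilver(live_blocks, rebuild, region_bytes=1 << 30, max_buckets=64):
        buckets = defaultdict(list)            # region index -> [(offset, length), ...]
        for offset, length in live_blocks:     # streamed from the metadata walk
            buckets[offset // region_bytes].append((offset, length))
            if len(buckets) > max_buckets:
                region = min(buckets)          # sweep the lowest region sequentially
                for off, ln in sorted(buckets.pop(region)):
                    rebuild(off, ln)
        for region in sorted(buckets):         # final passes over whatever remains
            for off, ln in sorted(buckets[region]):
                rebuild(off, ln)

Memory is bounded by the number of in-flight buckets rather than by the size of the pool, at the price of occasionally revisiting a region that was swept early, which is exactly the "multiple passes over the disk" trade-off described above.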
On Dec 21, 2010, at 5:05 AM, Deano wrote:
> The question therefore is, is there room in the software implementation to achieve performance and reliability numbers similar to expensive drives whilst using relatively cheap drives?

For some definition of "similar," yes. But using relatively cheap drives does not mean the overall system cost will be cheap. For example, $250 will buy 8.6K random IOPS @ 4KB in an SSD[1], but to do that with "cheap disks" might require eighty 7,200 rpm SATA disks.

> ZFS is good but IMHO easy to see how it can be improved to better meet this situation, I can't currently say when this line of thinking and code will move from research to production level use (tho I have a pretty good idea ;) ) but I wouldn't bet on the status quo lasting much longer. In some ways the removal of OpenSolaris may actually be a good thing, as it has catalyzed a number of developers from the view that zfs is Oracle led, to thinking "what can we do with zfs code as a base?"

There are more people outside of Oracle developing for ZFS than inside Oracle. This has been true for some time now.

> For example, how about sticking a cheap 80GiB commodity SSD in the storage case. When a resilver or defrag is required, use it as scratch space to give you a block of fast-IOPS storage to accelerate the slow parts. When it's done, secure erase it and power it down, ready for the next time a resilver needs to happen. The hardware is available, it just needs someone to write the software...

In general, SSDs will not speed resilver unless the resilvering disk is an SSD.

[1] http://www.intel.com/cd/channel/reseller/asmo-na/eng/products/nand/feature/index.htm

-- richard
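As a quick sanity check on the eighty-drives figure, assuming roughly 100 random IOPS per 7,200 rpm SATA drive (a ballpark assumption, not a number from the post):

    # 8.6K random 4 KB IOPS from one SSD vs ~100 IOPS per 7,200 rpm SATA drive (assumed).
    print(8600 / 100)    # ~86 drives, i.e. on the order of eighty spindles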
On Sat, Dec 25, 2010 at 11:23 PM, Richard Elling <richard.elling at gmail.com>wrote:> On Dec 21, 2010, at 5:05 AM, Deano wrote: > > > The question therefore is, is there room in the software implementation to > achieve performance and reliability numbers similar to expensive drives > whilst using relative cheap drives? > > > For some definition of "similar," yes. But using relatively cheap drives > does > not mean the overall system cost will be cheap. For example, $250 will buy > 8.6K random IOPS @ 4KB in an SSD[1], but to do that with "cheap disks" > might > require eighty 7,200 rpm SATA disks. > > ZFS is good but IMHO easy to see how it can be improved to better meet this > situation, I can?t currently say when this line of thinking and code will > move from research to production level use (tho I have a pretty good idea ;) > ) but I wouldn?t bet on the status quo lasting much longer. In some ways the > removal of OpenSolaris may actually be a good thing, as its catalyized a > number of developers from the view that zfs is Oracle led, to thinking ?what > can we do with zfs code as a base?? > > > There are more people outside of Oracle developing for ZFS than inside > Oracle. > This has been true for some time now. > > >Pardon my skepticism, but where is the proof of this claim (I''m quite certain you know I mean no disrespect)? Solaris11 Express was a massive leap in functionality and bugfixes to ZFS. I''ve seen exactly nothing out of "outside of Oracle" in the time since it went closed. We used to see updates bi-weekly out of Sun. Nexenta spending hundreds of man-hours on a GUI and userland apps isn''t work on ZFS. --Tim -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20101225/f33f15b3/attachment-0001.html>
Richard Elling
2010-Dec-26 06:21 UTC
[zfs-discuss] MTBF and why we care [was: A few questions]
On Dec 21, 2010, at 3:48 AM, Phil Harman wrote:> On 21/12/2010 05:44, Richard Elling wrote: >> >> On Dec 20, 2010, at 7:31 AM, Phil Harman <phil.harman at gmail.com> wrote: >>> On 20/12/2010 13:59, Richard Elling wrote: >>>> >>>> On Dec 20, 2010, at 2:42 AM, Phil Harman <phil.harman at gmail.com> wrote: >>>>>> Why does resilvering take so long in raidz anyway? >>>>> Because it''s broken. There were some changes a while back that made it more broken. >>>> "broken" is the wrong term here. It functions as designed and correctly >>>> resilvers devices. Disagreeing with the design is quite different than >>>> proving a defect. >>> It might be the wrong term in general, but I think it does apply in the budget home media server context of this thread. >> If you only have a few slow drives, you don''t have performance. >> Like trying to win the Indianapolis 500 with a tricycle... > > The context of this thread is a budget home media server (certainly not the Indy 500, but perhaps not as humble as tricycle touring either). And whilst it is a habit of the hardware advocate to blame the software ... and vice versa ... it''s not much help to those of us trying to build "good enough" systems across the performance and availability spectrum.it is all in how the expectations are set. For the home user, waiting overnight for a resilver might not impact their daily lives (switch night/day around for developers :-)>>> I think we can agree that ZFS currently doesn''t play well on cheap disks. I think we can also agree that the performance of ZFS resilvering is known to be suboptimal under certain conditions. >> ... and those conditions are also a strength. For example, most file >> systems are nowhere near full. With ZFS you only resilver data. For those >> who recall the resilver throttles in SVM or VXVM, you will appreciate not >> having to resilver non-data. > > I''d love to see the data and analysis for the assertion that "most files systems are nowhere near full", discounting, of course, any trivial cases.I wish I still had access to that data, since I left Sun, I''d be pleasantly surprised if anyone keeps up with it any more. But yes, we did track file system utilization on around 300,000 systems, clearly a statistically significant sample, for Sun''s market anyway. Average space utilization is well below 50%.> In my experience, in any cost conscious scenario, in the home or the enterprise, the expectation is that I''ll get to use the majority of the space I''ve paid for (generally "through the nose" from the storage silo team in the enterprise scenario). To borrow your illustration, even Indy 500 teams care about fuel consumption. > > What I don''t appreciate is having to resilver significantly more data than the drive can contain. But when it comes to the crunch, what I''d really appreciate was a bounded resilver time measured in hours not days or weeks.For those following along, changeset 12296:7cf402a7f374 on May 3, 2010 brought a number of changes to scrubs and resilvers.>>> For a long time at Sun, the rule was "correctness is a constraint, performance is a goal". However, in the real world, performance is often also a constraint (just as a quick but erroneous answer is a wrong answer, so also, a slow but correct answer can also be "wrong"). >>> >>> Then one brave soul at Sun once ventured that "if Linux is faster, it''s a Solaris bug!" and to his surprise, the idea caught on. 
I later went on to tell people that ZFS delievered RAID "where I = inexpensive", so I''m a just a little frustrated when that promise becomes less respected over time. First it was USB drives (which I agreed with), now it''s SATA (and I''m not so sure). >> "slow" doesn''t begin with an "i" :-) > > Both ZFS and RAID promised to play in the inexpensive space.And tricycles are less expensive than Indy cars...>>>>> There has been a lot of discussion, anecdotes and some data on this list. >>>> "slow because I use devices with poor random write(!) performance" >>>> is very different than "broken." >>> Again, context is everything. For example, if someone was building a business critical NAS appliance from consumer grade parts, I''d be the first to say "are you nuts?!" >> Unfortunately, the math does not support your position... > > Actually, the math (e.g. raw drive metrics) doesn''t lead me to expect such a disparity. > >>>>> The resilver doesn''t do a single pass of the drives, but uses a "smarter" temporal algorithm based on metadata. >>>> A design that only does a single pass does not handle the temporal >>>> changes. Many RAID implementations use a mix of spatial and temporal >>>> resilvering and suffer with that design decision. >>> Actually, it''s easy to see how a combined spatial and temporal approach could be implemented to an advantage for mirrored vdevs. >>>>> However, the current implentation has difficulty finishing the job if there''s a steady flow of updates to the pool. >>>> Please define current. There are many releases of ZFS, and >>>> many improvements have been made over time. What has not >>>> improved is the random write performance of consumer-grade >>>> HDDs. >>> I was led to believe this was not yet fixed in Solaris 11, and that there are therefore doubts about what Solaris 10 update may see the fix, if any. >>>>> As far as I''m aware, the only way to get bounded resilver times is to stop the workload until resilvering is completed. >>>> I know of no RAID implementation that bounds resilver times >>>> for HDDs. I believe it is not possible. OTOH, whether a resilver >>>> takes 10 seconds or 10 hours makes little difference in data >>>> availability. Indeed, this is why we often throttle resilvering >>>> activity. See previous discussions on this forum regarding the >>>> dueling RFEs. >>> I don''t share your disbelief or "little difference" analysys. If it is true that no current implementation succeeds, isn''t that a great opportunity to change the rules? Wasn''t resilver time vs availability was a major factor in Adam Leventhal''s paper introducing the need for RAIDZ3? >> >> No, it wasn''t. > > Maybe we weren''t reading the same paper? > > From http://dtrace.org/blogs/ahl/2009/12/21/acm_triple_parity_raid (a pointer to Adam''s ACM article) >> The need for triple-parity RAID >> ... >> The time to populate a drive is directly relevant for RAID rebuild. As disks in RAID systems take longer to reconstruct, the reliability of the total system decreases due to increased periods running in a degraded state. Today that can be four hours or longer; that could easily grow to days or weeks. > > From http://queue.acm.org/detail.cfm?id=1670144 (Adam''s ACM article) >> While bit error rates have nearly kept pace with the growth in disk capacity, throughput has not been given its due consideration when determining RAID reliability. 
> > Whilst Adam does discuss the lack of progress in bit error rates, his focus (in the article, and in his pointer to it) seems to be on drive capacity vs data rates, how this impact recovery times, and the consequential need to protect against multiple overlapping failures. > >> There are two failure modes we can model given the data >> provided by disk vendors: >> 1. failures by time (MTBF) >> 2. failures by bits read (UER) >> >> Over time, the MTBF has improved, but the failures by bits read has not >> improved. Just a few years ago enterprise class HDDs had an MTBF >> of around 1 million hours. Today, they are in the range of 1.6 million >> hours. Just looking at the size of the numbers, the probability that a >> drive will fail in one hour is on the order of 10e-6. >> >> By contrast, the failure rate by bits read has not improved much. >> Consumer class HDDs are usually spec''ed at 1 error per 1e14 >> bits read. To put this in perspective, a 2TB disk has around 1.6e13 >> bits. Or, the probability of an unrecoverable read if you read every bit >> on a 2TB is growing well above 10%. Some of the better enterprise class >> HDDs are rated two orders of magnitude better, but the only way to get >> much better is to use more bits for ECC... hence the move towards >> 4KB sectors. >> >> In other words, the probability of losing data by reading data can be >> larger than losing data next year. This is the case for triple parity RAID. > > MTBF as quoted by HDD vendors has become pretty meaningless. [nit: when a disk fails, it is not considered "repairable", so a better metric is MTTF (because there are no repairable failures)]They are the same in this context.> 1.6 million hours equates to about 180 years, so why do HDD vendors guarantee their drives for considerably less (typically 3-5 years)? It''s because they base the figure on a constant failure rate expected during the normal useful life of the drive (typically 5 years).MTBF has units of "hours between failures," but is often shortened to "hours." It is often easier to do the math with Failures in Time (FITs) where Time is a billion hours. There is a direct correlation: FITs = 1,000,000,000 / MTBF To put this in perspective, a modern CPU has an MTBF of around 4 million hours or 250 FITs. A simple PCI card can easily get to 10 million hours, or less than 100 FITs. Or, if you prefer, the annualized failure rate (AFR) gives a more intuitive response. AFR = 8760 hours per year / MTBF AFR is often represented as a percentage, and ranges of 0.6% to 4% are useful for disks. Remember, all failures due to wear out and described by MTBF in disks are mechanical failures.> However, quoting from http://www.asknumbers.com/WhatisReliability.aspx >> Field failures do not generally occur at a uniform rate, but follow a distribution in time commonly described as a "bathtub curve." The life of a device can be divided into three regions: Infant Mortality Period, where the failure rate progressively improves; Useful Life Period, where the failure rate remains constant; and Wearout Period, where failure rates begin to increase. > > Crucially, the vendor''s quoted MTBF figures do not take into account "infant mortality" or early "wearout". Until every HDD is fitted with an environmental tell-tale device for shock, vibration, temperature, pressure, humidity, etc we can''t even come close to predicting either factor.Yes we can, and yes we do. All you need is a large enough sample size. 
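To put rough numbers on the formulas above (plain Python arithmetic; the MTBF and UER figures are the ones already quoted, nothing new):

mtbf = 1.6e6                      # hours, the enterprise HDD MTBF quoted above
print(1e9 / mtbf)                 # FITs -> 625 failures per billion hours
print(8760 / mtbf * 100)          # AFR  -> ~0.55% per year
uer = 1e-14                       # consumer class: 1 error per 10^14 bits read
bits_2tb = 2e12 * 8               # ~1.6e13 bits on a 2TB drive
print(1 - (1 - uer) ** bits_2tb)  # ~0.15, i.e. "well above 10%" for a full-drive read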
In many cases, the changes in failure rates occur because of events not considered in MTBF calculations: factory defects, contamination, environmental conditions, physical damage, firmware bugs, etc.

> And this is just the HDD itself. In a system there are many ways to lose access to an HDD. So I''m exposed when I lose the first drive in a RAIDZ1 (second drive in a RAIDZ2, or third drive in a RAIDZ3). And the longer the resilver takes, the longer I''m exposed.

Indeed. Let''s look at the math. For the simple MTTDL[1] model, which does not consider UER, we calculate the probability that we have a second failure during the repair time:

    single parity: MTTDL[1] = MTBF^2 / (N * (N-1) * MTTR)
    double parity: MTTDL[1] = MTBF^3 / (N * (N-1) * (N-2) * MTTR^2)

Mean Time To Repair (MTTR) includes logistical replacement and resilvering time, so this model can show the advantage of hot spares (by reducing logistical replacement time). The practical use of this model makes sense where MTTR is on the order of 10s or 100s of hours while the MTBF is on the order of 1 million hours. But the more difficult problem arises with the UER spec. A consumer-grade disk typically has a UER rating of 1 error per 10^14 bits read. 10^14 bits is around 8 2TB drives. In other words, the probability of having a UER during reconstruction of an 8+1 raidz using 2TB consumer-grade drives is more like 63%, much higher than the MTTDL[1] model implies. We are just now seeing enterprise-class drives with a UER rating of 1 error per 10^16 bits read. http://www.seagate.com/staticfiles/support/disc/manuals/enterprise/cheetah/NS/Cheetah%20NS%2010K.2/100516228d.pdf

> Add to the mix that Indy 500 drives can degrade to tricycle performance before they fail utterly, and yes, low performing drives can still be an issue, even for the elite.

Yes. I feel this will become the dominant issue with HDDs and one where there is plenty of room for improvement in ZFS.

>>> The appropriateness or otherwise of resilver throttling depends on the context. If I can tolerate further failures without data loss (e.g. RAIDZ2 with one failed device, or RAIDZ3 with two failed devices), or if I can recover business critical data in a timely manner, then great. But there may come a point where I would rather take a short term performance hit to close the window on total data loss. >> >> I agree. Back in the bad old days, we were stuck with silly throttles >> on SVM (10 IOPs, IIRC). The current ZFS throttle (b142, IIRC) is dependent >> on the competing, non-scrub I/O. This works because in ZFS all I/O is not >> created equal, unlike the layered RAID implementations such as SVM or >> RAID arrays. ZFS schedules the regular workload at a higher priority than >> scrubs or resilvers. Add the new throttles and the scheduler is even more >> effective. So you get your interactive performance at the cost of longer >> resilver times. This is probably a good trade-off for most folks. >> >>>>> The problem exists for mirrors too, but is not as marked because mirror reconstruction is inherently simpler. >>>> >>>> Resilver time is bounded by the random write performance of >>>> the resilvering device. Mirroring or raidz make no difference. >>> >>> This only holds in a quiesced system. >> >> The effect will be worse for a mirror because you have direct >> competition for the single, surviving HDD. For raidz*, we clearly >> see the read workload spread out across the surviving disks at >> approximately the 1/N ratio.
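As a worked example of the MTTDL[1] model and the rebuild UER figure above (assumed inputs; the exact percentage depends on how many surviving drives you count as fully read during reconstruction):

mtbf, mttr, n = 1.0e6, 100.0, 9         # hours, hours, disks in an 8+1 raidz1
print(mtbf**2 / (n * (n - 1) * mttr))   # single-parity MTTDL[1] -> ~1.4e8 hours
bits_read = 8 * 2e12 * 8                # a rebuild reads the 8 surviving 2TB drives
print(1 - (1 - 1e-14) ** bits_read)     # ~0.72 -- same ballpark as the ~63% above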
In other words, if you have a 4+1 raidz, >> then a resilver will keep the resilvering disk 100% busy writing, and >> the data disks approximately 25% busy reading. Later releases of >> ZFS will also prefetch the reads and the writes can be coalesced, >> skewing the ratio a little bit, but the general case seems to be a >> reasonable starting point. > > Mirrored systems need more drives to achieve the same capacity, so mirrored volumes are generally striped by some means, so the equivalent of your 4+1 RAIDZ1 is actually a 4+4. In such a configuration resilvering one drive at 100% would also result in a mean hit of 25%.For HDDs, writes take longer than reads, so reality is much more difficult to model. This is further complicated by ZFS''s I/O scheduler, track read buffers, ZFS prefetching, and the async nature of resilvering writes.> Obviously, a drive running at 100% has nothing more to give, so for fun let''s throttle the resilver to 25x1MB sequential reads per second (which is about 25% of a good drive''s capacity). At this rate, a 2TB drive will resilver in under 24 hours, so let''s make that the upper bound.OK. I think this is a fair goal. It is certainly easier to achieve than the 4.5 hours you can expect for sustained writes to the media.> It is highly desirable to throttle the resilver and regular I/O rates according to required performance and availability metrics, so something better than 24 hours should be the norm. > > It should also be possible for the system to report an ETA based on current and historic workload statistics. "You may say I''m a dreamer..."That is what happens today, but the algorithm doesn''t work well for devices with widely varying random performance profiles (eg HDDs). As the resilver throttle kicks in, due to other I/O taking priority, the resilver time is even more unpredictable. An amusing CR is 6973953, where the "solution" is "do not print estimated time if hours_left is more than 30 days" http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6973953> For mirrored vdevs, ZFS could resilver using an efficient block level copy, whilst keeping a record of progress, and considering copied blocks as already mirrored and ready to be read and updated by normal activity. Obviously, it''s much harder to apply this approach for RAIDZ. > > Since slabs are allocated sequentially, it should also be possible to set a high water mark for the bulk copy, so that fresh pools with little or no data could also be resilvered in minutes or seconds.That is the case today. Try it :-) -- richard> I believe such an approach would benefit all ZFS users, not just the elite. > >> -- richard > > Phil > > p.s. just for the record, Nexenta''s Hardware Supported List (HSL) is an excellent resource for those wanting to build NAS appliances that actually work... > > http://www.nexenta.com/corp/supported-hardware/hardware-supported-list > > ... which includes Hitachi Ultrastar A7K2000 SATA 7200rpm HDDs (enterprise class drives at near consumer prices)-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20101225/9aec1a6b/attachment-0001.html>
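For the resilver-time figures above, the arithmetic is straightforward (the transfer rates here are assumptions, not measurements):

size = 2e12                  # bytes on a 2TB drive
print(size / 25e6 / 3600)    # ~22 h at a throttled 25 MB/s -- the "under 24 hours" bound
print(size / 125e6 / 3600)   # ~4.4 h at ~125 MB/s sustained sequential writes to the media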
On 12/26/10 05:40 AM, Tim Cook wrote:> > > On Sat, Dec 25, 2010 at 11:23 PM, Richard Elling > <richard.elling at gmail.com <mailto:richard.elling at gmail.com>> wrote: > > > There are more people outside of Oracle developing for ZFS than > inside Oracle. > This has been true for some time now. > > > > > Pardon my skepticism, but where is the proof of this claim (I''m quite > certain you know I mean no disrespect)? Solaris11 Express was a > massive leap in functionality and bugfixes to ZFS. I''ve seen exactly > nothing out of "outside of Oracle" in the time since it went closed. > We used to see updates bi-weekly out of Sun. Nexenta spending > hundreds of man-hours on a GUI and userland apps isn''t work on ZFS. > >Exactly my observation as well. I haven''t seen any ZFS related development happening at Ilumos or Nexenta, at least not yet. -- Robert Milkowski http://milek.blogspot.com -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20110103/808050da/attachment.html>
On Mon, 3 Jan 2011, Robert Milkowski wrote:

> Exactly my observation as well. I haven''t seen any ZFS related
> development happening at Ilumos or Nexenta, at least not yet.

There seems to be plenty of zfs work on the FreeBSD project, but primarily
with porting the latest available sources to FreeBSD (going very well!)
rather than with developing zfs itself.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
On 01/ 3/11 05:08 AM, Robert Milkowski wrote:> On 12/26/10 05:40 AM, Tim Cook wrote: >> >> >> On Sat, Dec 25, 2010 at 11:23 PM, Richard Elling >> <richard.elling at gmail.com <mailto:richard.elling at gmail.com>> wrote: >> >> >> There are more people outside of Oracle developing for ZFS than >> inside Oracle. >> This has been true for some time now. >> >> >> >> >> Pardon my skepticism, but where is the proof of this claim (I''m quite >> certain you know I mean no disrespect)? Solaris11 Express was a >> massive leap in functionality and bugfixes to ZFS. I''ve seen exactly >> nothing out of "outside of Oracle" in the time since it went closed. >> We used to see updates bi-weekly out of Sun. Nexenta spending >> hundreds of man-hours on a GUI and userland apps isn''t work on ZFS. >> >> > > Exactly my observation as well. I haven''t seen any ZFS related > development happening at Ilumos or Nexenta, at least not yet.Just because you''ve not seen it yet doesn''t imply it isn''t happening. Please be patient. - Garrett> > -- > Robert Milkowski > http://milek.blogspot.com > > > _______________________________________________ > zfs-discuss mailing list > zfs-discuss at opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20110103/fecd0af1/attachment.html>
On Jan 3, 2011, at 5:08 AM, Robert Milkowski wrote:> On 12/26/10 05:40 AM, Tim Cook wrote: >> >> >> >> On Sat, Dec 25, 2010 at 11:23 PM, Richard Elling <richard.elling at gmail.com> wrote: >> >> There are more people outside of Oracle developing for ZFS than inside Oracle. >> This has been true for some time now. >> >>> >> >> >> >> Pardon my skepticism, but where is the proof of this claim (I''m quite certain you know I mean no disrespect)? Solaris11 Express was a massive leap in functionality and bugfixes to ZFS. I''ve seen exactly nothing out of "outside of Oracle" in the time since it went closed. We used to see updates bi-weekly out of Sun. Nexenta spending hundreds of man-hours on a GUI and userland apps isn''t work on ZFS. >> >> > > Exactly my observation as well. I haven''t seen any ZFS related development happening at Ilumos or Nexenta, at least not yet.I am quite sure you understand how pipelines work :-) -- richard -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20110103/3281ed97/attachment.html>
On 1/3/2011 8:28 AM, Richard Elling wrote:> On Jan 3, 2011, at 5:08 AM, Robert Milkowski wrote: >> On 12/26/10 05:40 AM, Tim Cook wrote: >>> On Sat, Dec 25, 2010 at 11:23 PM, Richard Elling >>> <richard.elling at gmail.com <mailto:richard.elling at gmail.com>> wrote: >>> >>> >>> There are more people outside of Oracle developing for ZFS than >>> inside Oracle. >>> This has been true for some time now. >>> >>> >>> Pardon my skepticism, but where is the proof of this claim (I''m >>> quite certain you know I mean no disrespect)? Solaris11 Express was >>> a massive leap in functionality and bugfixes to ZFS. I''ve seen >>> exactly nothing out of "outside of Oracle" in the time since it went >>> closed. We used to see updates bi-weekly out of Sun. Nexenta >>> spending hundreds of man-hours on a GUI and userland apps isn''t work >>> on ZFS. >>> >>> >> >> Exactly my observation as well. I haven''t seen any ZFS related >> development happening at Ilumos or Nexenta, at least not yet. > > I am quite sure you understand how pipelines work :-) > -- richard >I''m getting pretty close to my pain threshold on the BP_rewrite stuff, since not having that feature''s holding up a big chunk of work I''d like to push. If anyone outside of Oracle is working on some sort of change to ZFS that will allow arbitrary movement/placement of pre-written slabs, can they please contact me? I''m pretty much at the point where I''m going to start diving into that chunk of the source to see if there''s something little old me can do, and I''d far rather help on someone else''s implementation than have to do it myself from scratch. I''d prefer a private contact, as I realize that such work may not be ready for public discussion yet. Thanks, folks! Oh, and this is completely just me, not Oracle talking in any way. -- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20110103/d587ef77/attachment-0001.html>
On Jan 3, 2011, at 2:10 PM, Erik Trimble wrote> On 1/3/2011 8:28 AM, Richard Elling wrote: >> >> On Jan 3, 2011, at 5:08 AM, Robert Milkowski wrote: >>> On 12/26/10 05:40 AM, Tim Cook wrote: >>>> On Sat, Dec 25, 2010 at 11:23 PM, Richard Elling <richard.elling at gmail.com> wrote: >>>> >>>> There are more people outside of Oracle developing for ZFS than inside Oracle. >>>> This has been true for some time now. >>>> >>>> >>>> Pardon my skepticism, but where is the proof of this claim (I''m quite certain you know I mean no disrespect)? Solaris11 Express was a massive leap in functionality and bugfixes to ZFS. I''ve seen exactly nothing out of "outside of Oracle" in the time since it went closed. We used to see updates bi-weekly out of Sun. Nexenta spending hundreds of man-hours on a GUI and userland apps isn''t work on ZFS. >>>> >>>> >>> >>> Exactly my observation as well. I haven''t seen any ZFS related development happening at Ilumos or Nexenta, at least not yet. >> >> I am quite sure you understand how pipelines work :-) >> -- richard > > I''m getting pretty close to my pain threshold on the BP_rewrite stuff, since not having that feature''s holding up a big chunk of work I''d like to push. > > If anyone outside of Oracle is working on some sort of change to ZFS that will allow arbitrary movement/placement of pre-written slabs, can they please contact me? I''m pretty much at the point where I''m going to start diving into that chunk of the source to see if there''s something little old me can do, and I''d far rather help on someone else''s implementation than have to do it myself from scratch. > > I''d prefer a private contact, as I realize that such work may not be ready for public discussion yet. > > Thanks, folks! > > Oh, and this is completely just me, not Oracle talking in any way.Oracle doesn''t seem to say much at all :-( But for those interested, Nexenta is actively hiring people to work in this area. -- richard -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20110103/d84cc3fa/attachment.html>
On 01/ 3/11 04:28 PM, Richard Elling wrote:> On Jan 3, 2011, at 5:08 AM, Robert Milkowski wrote: > >> On 12/26/10 05:40 AM, Tim Cook wrote: >>> >>> >>> On Sat, Dec 25, 2010 at 11:23 PM, Richard Elling >>> <richard.elling at gmail.com <mailto:richard.elling at gmail.com>> wrote: >>> >>> >>> There are more people outside of Oracle developing for ZFS than >>> inside Oracle. >>> This has been true for some time now. >>> >>> >>> >>> >>> Pardon my skepticism, but where is the proof of this claim (I''m >>> quite certain you know I mean no disrespect)? Solaris11 Express was >>> a massive leap in functionality and bugfixes to ZFS. I''ve seen >>> exactly nothing out of "outside of Oracle" in the time since it went >>> closed. We used to see updates bi-weekly out of Sun. Nexenta >>> spending hundreds of man-hours on a GUI and userland apps isn''t work >>> on ZFS. >>> >>> >> >> Exactly my observation as well. I haven''t seen any ZFS related >> development happening at Ilumos or Nexenta, at least not yet. > > I am quite sure you understand how pipelines work :-) >Are you suggesting that Nexenta is developing new ZFS features behind closed doors (like Oracle...) and then will share code later-on? Somehow I don''t think so... but I would love to be proved wrong :) -- Robert Milkowski http://milek.blogspot.com -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20110104/e4112981/attachment-0001.html>
On 01/ 4/11 11:35 PM, Robert Milkowski wrote:> On 01/ 3/11 04:28 PM, Richard Elling wrote: >> On Jan 3, 2011, at 5:08 AM, Robert Milkowski wrote: >> >>> On 12/26/10 05:40 AM, Tim Cook wrote: >>>> >>>> >>>> On Sat, Dec 25, 2010 at 11:23 PM, Richard Elling >>>> <richard.elling at gmail.com <mailto:richard.elling at gmail.com>> wrote: >>>> >>>> >>>> There are more people outside of Oracle developing for ZFS than >>>> inside Oracle. >>>> This has been true for some time now. >>>> >>>> >>>> >>>> >>>> Pardon my skepticism, but where is the proof of this claim (I''m >>>> quite certain you know I mean no disrespect)? Solaris11 Express >>>> was a massive leap in functionality and bugfixes to ZFS. I''ve seen >>>> exactly nothing out of "outside of Oracle" in the time since it >>>> went closed. We used to see updates bi-weekly out of Sun. Nexenta >>>> spending hundreds of man-hours on a GUI and userland apps isn''t >>>> work on ZFS. >>>> >>>> >>> >>> Exactly my observation as well. I haven''t seen any ZFS related >>> development happening at Ilumos or Nexenta, at least not yet. >> >> I am quite sure you understand how pipelines work :-) >> > > Are you suggesting that Nexenta is developing new ZFS features behind > closed doors (like Oracle...) and then will share code later-on? > Somehow I don''t think so... but I would love to be proved wrong :)I mean I would love to see Nexenta start delivering real innovation in Solaris/Illumos kernel (zfs, networking, ...), not that I would love to see it happening behind a closed doors :) -- Robert Milkowski http://milek.blogspot.com -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20110104/c8dea63e/attachment.html>
On Mon, Jan 3, 2011 at 5:56 AM, Garrett D''Amore <garrett at nexenta.com> wrote:> On 01/ 3/11 05:08 AM, Robert Milkowski wrote: > > On 12/26/10 05:40 AM, Tim Cook wrote: > > > > On Sat, Dec 25, 2010 at 11:23 PM, Richard Elling <richard.elling at gmail.com > > wrote: > >> >> There are more people outside of Oracle developing for ZFS than inside >> Oracle. >> This has been true for some time now. >> >> >> > > Pardon my skepticism, but where is the proof of this claim (I''m quite > certain you know I mean no disrespect)? Solaris11 Express was a massive > leap in functionality and bugfixes to ZFS. I''ve seen exactly nothing out of > "outside of Oracle" in the time since it went closed. We used to see > updates bi-weekly out of Sun. Nexenta spending hundreds of man-hours on a > GUI and userland apps isn''t work on ZFS. > > > > Exactly my observation as well. I haven''t seen any ZFS related development > happening at Ilumos or Nexenta, at least not yet. > > > Just because you''ve not seen it yet doesn''t imply it isn''t happening. > Please be patient. > > - Garrett >Or, conversely, don''t make claims of all this code contribution prior to having anything to show for your claimed efforts. Duke Nukem Forever was going to be the greatest video game ever created... we were told to "be patient"... we''re still waiting for that too. --Tim -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20110104/bcab0250/attachment-0001.html>
On Tue, Jan 4, 2011 at 8:21 PM, Garrett D''Amore <garrett at nexenta.com> wrote:> On 01/ 4/11 09:15 PM, Tim Cook wrote: > > > > On Mon, Jan 3, 2011 at 5:56 AM, Garrett D''Amore <garrett at nexenta.com>wrote: > >> On 01/ 3/11 05:08 AM, Robert Milkowski wrote: >> >> On 12/26/10 05:40 AM, Tim Cook wrote: >> >> >> >> On Sat, Dec 25, 2010 at 11:23 PM, Richard Elling < >> richard.elling at gmail.com> wrote: >> >>> >>> There are more people outside of Oracle developing for ZFS than inside >>> Oracle. >>> This has been true for some time now. >>> >>> >>> >> >> Pardon my skepticism, but where is the proof of this claim (I''m quite >> certain you know I mean no disrespect)? Solaris11 Express was a massive >> leap in functionality and bugfixes to ZFS. I''ve seen exactly nothing out of >> "outside of Oracle" in the time since it went closed. We used to see >> updates bi-weekly out of Sun. Nexenta spending hundreds of man-hours on a >> GUI and userland apps isn''t work on ZFS. >> >> >> >> Exactly my observation as well. I haven''t seen any ZFS related development >> happening at Ilumos or Nexenta, at least not yet. >> >> >> Just because you''ve not seen it yet doesn''t imply it isn''t happening. >> Please be patient. >> >> - Garrett >> > > > Or, conversely, don''t make claims of all this code contribution prior to > having anything to show for your claimed efforts. Duke Nukem Forever was > going to be the greatest video game ever created... we were told to "be > patient"... we''re still waiting for that too. > > > > Um, have you not been paying attention? I''ve delivered quite a lot of > contribution to illumos already, just not in ZFS. Take a close look -- > there almost certainly wouldn''t *be* an open source version of OS/Net had I > not done the work to enable this in libc, kernel crypto, and other bits. > This work is still higher priority than ZFS innovation for a variety of > reasons -- mostly because we need a viable and supportable illumos upon > which to build those ZFS innovations. > > That said, much of the ZFS work I hope to contribute to illumos needs more > baking, but some of it is already open source in NexentaStor. (You can for > a start look at zfs-monitor, the WORM support, and support for hardware GZIP > acceleration all as things that Nexenta has innovated in ZFS, and which are > open source today if not part of illumos. Check out > http://www.nexenta.org for source code access.) > > So there, money placed where mouth is. You? > > - Garrett > > >The claim was that there are more people contributing code from outside of Oracle than inside to zfs. Your contributions to Illumos do absolutely nothing to backup that claim. ZFS-monitor is not ZFS code (it''s an FMA module), WORM also isn''t ZFS code, it''s an OS level operation, and GZIP hardware acceleration is produced by Indra networks, and has absolutely nothing to do with ZFS. Does it help ZFS? Sure, but that''s hardly a code contribution to ZFS when it''s simply a hardware acceleration card that accelerates ALL gzip code. So, great job picking three projects that are not proof of developers working on ZFS. And great job not providing any proof to the claim there are more developers working on ZFS outside of Oracle than within. You''re going to need a hell of a lot bigger bank account to cash the check than what you''ve got. As for me, I don''t recall making any claims on this list that I can''t back up, so I''m not really sure what you''re getting at. 
I can only assume the defensive tone of your email is because you''ve been called out and can''t backup the claims either. So again: if you''ve got code in the works, great. Talk about it when it''s ready. Stop throwing out baseless claims that you have no proof of and then fall back on "just be patient, it''s coming". We''ve heard that enough from Oracle and Sun already. --Tim -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20110104/c2f67bca/attachment.html>
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Tim Cook
>
> The claim was that there are more people contributing code from outside of
> Oracle than inside to zfs. Your contributions to Illumos do absolutely nothing

Guys, please let''s just say this much: To all those who are contributing to the open-source ZFS code, freebsd, illumos project, and others, thank you very much. :-) We know certain things are stable and production ready; there has not yet been much forward development after zpool 28, but the effort is well appreciated, and for whatever comes next, yes we can all be patient.

Right now, Oracle is not contributing at all to the open source branches of any of these projects. So right now it''s fair to say the non-oracle contributions to the OPEN SOURCE ZFS outweigh the nonexistent oracle contributions. However, Oracle is continuing to develop the closed-source ZFS.

I don''t know if anyone has real numbers, dollars contributed or number of developer hours etc, but I think it''s fair to say that oracle is probably contributing more to the closed source ZFS right now than the rest of the world is contributing to the open source ZFS right now. Also, we know that the closed source ZFS right now is more advanced than the open source ZFS (zpool 31 vs 28). Oracle closed source ZFS is ahead, and probably developing faster too, than the open source ZFS right now.

If anyone has any good way to draw more contributors into the open source tree, that would also be useful and appreciated. Gosh, it would be nice to get major players like Dell, HP, IBM, Apple contributing into that project.
Edward Ned Harvey wrote:

> I don''t know if anyone has real numbers, dollars contributed or number of
> developer hours etc, but I think it''s fair to say that oracle is probably
> contributing more to the closed source ZFS right now, than the rest of the
> world is contributing to the open source ZFS right now. Also, we know that
> the closed source ZFS right now is more advanced than the open source ZFS
> (zpool 31 vs 28). Oracle closed source ZFS is ahead, and probably
> developing faster too, than the open source ZFS right now.
>
> If anyone has any good way to draw more contributors into the open source
> tree, that would also be useful and appreciated. Gosh, it would be nice to
> get major players like Dell, HP, IBM, Apple contributing into that project.

This is something Illumos/open source ZFS needs to decide for itself: what does it want? Effectively we can''t innovate ZFS without breaking compatibility, because our Illumos zpool version 29 (if we innovate) will not be Oracle zpool version 29. If we want open-source ZFS to innovate, we need to make that choice and let everyone know; apart from submitting bug fixes to zpool v28, I''m not sure if other changes would be welcome.

So honestly, do we want to innovate ZFS (I do) or do we just want to follow Oracle?

Bye,
Deano
deano at cloudpixies.com
> From: Deano [mailto:deano at rattie.demon.co.uk] > Sent: Wednesday, January 05, 2011 9:16 AM > > So honestly do we want to innovate ZFS (I do) or do we just want to follow > Oracle?Well, you can''t follow Oracle. Unless you wait till they release something, reverse engineer it, and attempt to reimplement it. I am quite sure you''ll be sued if you do that. If you want forward development in the open source tree, you basically have only one option: Some major contributor must have a financial interest, and commit to a real concerted development effort, with their own roadmap, which is intentionally designed NOT to overlap with the Oracle roadmap. Otherwise, the code will stagnate. I am rooting for the open source projects, but I''m not optimistic personally. I think all major contributors (IBM, Apple, etc) will not participate for various reasons, and as a result, we''ll experience bit rot... As presently evident by lack of zpool advancement beyond 28. So in my mind, Oracle and ZFS are now just like netapp and wafl. Well... I prefer Solaris and ZFS over netapp and wafl... So whenever I would have otherwise bought a netapp, I''ll still buy the solaris server instead... But it''s no longer a competitor against ubuntu or centos. Just the way Larry wants it.
We do have a major commercial interest - Nexenta. It''s been quiet but I do look forward to seeing something come out of that stable this year? :-) --- W. A. Khushil Dep - khushil.dep at gmail.com - 07905374843 Visit my blog at http://www.khushil.com/ On 5 January 2011 14:34, Edward Ned Harvey < opensolarisisdeadlongliveopensolaris at nedharvey.com> wrote:> > From: Deano [mailto:deano at rattie.demon.co.uk] > > Sent: Wednesday, January 05, 2011 9:16 AM > > > > So honestly do we want to innovate ZFS (I do) or do we just want to > follow > > Oracle? > > Well, you can''t follow Oracle. Unless you wait till they release > something, > reverse engineer it, and attempt to reimplement it. I am quite sure you''ll > be sued if you do that. > > If you want forward development in the open source tree, you basically have > only one option: Some major contributor must have a financial interest, > and > commit to a real concerted development effort, with their own roadmap, > which > is intentionally designed NOT to overlap with the Oracle roadmap. > Otherwise, the code will stagnate. > > I am rooting for the open source projects, but I''m not optimistic > personally. I think all major contributors (IBM, Apple, etc) will not > participate for various reasons, and as a result, we''ll experience bit > rot... As presently evident by lack of zpool advancement beyond 28. > > So in my mind, Oracle and ZFS are now just like netapp and wafl. Well... > I > prefer Solaris and ZFS over netapp and wafl... So whenever I would have > otherwise bought a netapp, I''ll still buy the solaris server instead... > But > it''s no longer a competitor against ubuntu or centos. > > Just the way Larry wants it. > > _______________________________________________ > zfs-discuss mailing list > zfs-discuss at opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20110105/a82030a0/attachment.html>
On Wed, Jan 5, 2011 at 15:34, Edward Ned Harvey <opensolarisisdeadlongliveopensolaris at nedharvey.com> wrote:>> From: Deano [mailto:deano at rattie.demon.co.uk] >> Sent: Wednesday, January 05, 2011 9:16 AM >> >> So honestly do we want to innovate ZFS (I do) or do we just want to follow >> Oracle? > > Well, you can''t follow Oracle. ?Unless you wait till they release something, > reverse engineer it, and attempt to reimplement it.that''s not my understanding - while we will have to wait, oracle is supposed to release *some* source code afterwards to satisfy some claim or other. I agree, some would argue that that should have already happened with S11 express... I don''t know it has, but that''s not *the* release of S11, is it? And once the code is released, even if after the fact, it''s not reverse-engineering anymore, is it? Michael PS: just in case: even while at Oracle, I had no insight into any of these plans, much less do I have now. -- regards/mit freundlichen Gr?ssen Michael Schuster
> -----Original Message----- > From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss- > bounces at opensolaris.org] On Behalf Of Michael Schuster > Sent: Wednesday, January 05, 2011 9:42 AM > To: Edward Ned Harvey > Cc: zfs-discuss at opensolaris.org > Subject: Re: [zfs-discuss] A few questions > > On Wed, Jan 5, 2011 at 15:34, Edward Ned Harvey > <opensolarisisdeadlongliveopensolaris at nedharvey.com> wrote: > >> From: Deano [mailto:deano at rattie.demon.co.uk] > >> Sent: Wednesday, January 05, 2011 9:16 AM > >> > >> So honestly do we want to innovate ZFS (I do) or do we just want to > follow > >> Oracle? > > > > Well, you can''t follow Oracle. ?Unless you wait till they release something, > > reverse engineer it, and attempt to reimplement it. > > that''s not my understanding - while we will have to wait, oracle is > supposed to release *some* source code afterwards to satisfy some > claim or other. I agree, some would argue that that should have > already happened with S11 express... I don''t know it has, but that''s > not *the* release of S11, is it? And once the code is released, even > if after the fact, it''s not reverse-engineering anymore, is it?Not exactly. Oracle hasn''t publicly committed to anything like that. There were several news articles last year referencing a leaked internal memo that I believe was more of a proposal than a plan. Even if Oracle did ''commit'' to releasing code, they could easily just decide not to. -Will
> From: Michael Schuster [mailto:michaelsprivate at gmail.com]
>
> > Well, you can''t follow Oracle. Unless you wait till they release something,
> > reverse engineer it, and attempt to reimplement it.
>
> that''s not my understanding - while we will have to wait, oracle is
> supposed to release *some* source code afterwards to satisfy some

Where do you get that from? AFAIK, there is no official word about oracle opening anything moving forward, but there are plenty of unofficial reports that it will not be opened. Nobody in the field is holding any hope for that to change anymore, most importantly illumos and nexenta. (At least with regards to ZFS and all the other projects relevant to solaris.)

I know in the case of SGE/OGE, it''s officially closed source now. As of Dec 31st, sunsource is being decommissioned, and the announcement of officially closing the SGE source and decommissioning the open source community went out on Dec 24th.

So all of this leads me to believe, with very little reservation, that the new developments beyond zpool 28 are closed source moving forward. There''s very little breathing room remaining for hope of that being open sourced again.
> From: Khushil Dep [mailto:khushil.dep at gmail.com] > > We do have a major commercial interest - Nexenta. It''s been quiet but I do > look forward to seeing something come out of that stable this year? :-)I''ll agree to call Nexenta "a major commerical interest," in regards to contribution to the open source ZFS tree, if they become an officially supported OS on Dell, HP, and/or IBM hardware. Otherwise, they''re just simply too small to keep up with the rate of development of the closed source ZFS tree, and destined to be left in the dust. And if Nexenta does become a seriously viable competitor against netapp and oracle... Watch out for lawsuits...
On 01/ 4/11 11:48 PM, Tim Cook wrote:> > > On Tue, Jan 4, 2011 at 8:21 PM, Garrett D''Amore <garrett at nexenta.com > <mailto:garrett at nexenta.com>> wrote: > > On 01/ 4/11 09:15 PM, Tim Cook wrote: >> >> >> On Mon, Jan 3, 2011 at 5:56 AM, Garrett D''Amore >> <garrett at nexenta.com <mailto:garrett at nexenta.com>> wrote: >> >> On 01/ 3/11 05:08 AM, Robert Milkowski wrote: >>> On 12/26/10 05:40 AM, Tim Cook wrote: >>>> >>>> >>>> On Sat, Dec 25, 2010 at 11:23 PM, Richard Elling >>>> <richard.elling at gmail.com >>>> <mailto:richard.elling at gmail.com>> wrote: >>>> >>>> >>>> There are more people outside of Oracle developing for >>>> ZFS than inside Oracle. >>>> This has been true for some time now. >>>> >>>> >>>> >>>> >>>> Pardon my skepticism, but where is the proof of this claim >>>> (I''m quite certain you know I mean no disrespect)? >>>> Solaris11 Express was a massive leap in functionality and >>>> bugfixes to ZFS. I''ve seen exactly nothing out of "outside >>>> of Oracle" in the time since it went closed. We used to >>>> see updates bi-weekly out of Sun. Nexenta spending >>>> hundreds of man-hours on a GUI and userland apps isn''t work >>>> on ZFS. >>>> >>>> >>> >>> Exactly my observation as well. I haven''t seen any ZFS >>> related development happening at Ilumos or Nexenta, at least >>> not yet. >> >> Just because you''ve not seen it yet doesn''t imply it isn''t >> happening. Please be patient. >> >> - Garrett >> >> >> >> Or, conversely, don''t make claims of all this code contribution >> prior to having anything to show for your claimed efforts. Duke >> Nukem Forever was going to be the greatest video game ever >> created... we were told to "be patient"... we''re still waiting >> for that too. > > > Um, have you not been paying attention? I''ve delivered quite a > lot of contribution to illumos already, just not in ZFS. Take a > close look -- there almost certainly wouldn''t *be* an open source > version of OS/Net had I not done the work to enable this in libc, > kernel crypto, and other bits. This work is still higher priority > than ZFS innovation for a variety of reasons -- mostly because we > need a viable and supportable illumos upon which to build those > ZFS innovations. > > That said, much of the ZFS work I hope to contribute to illumos > needs more baking, but some of it is already open source in > NexentaStor. (You can for a start look at zfs-monitor, the WORM > support, and support for hardware GZIP acceleration all as things > that Nexenta has innovated in ZFS, and which are open source today > if not part of illumos. Check out http://www.nexenta.org for > source code access.) > > So there, money placed where mouth is. You? > > - Garrett > > > > The claim was that there are more people contributing code from > outside of Oracle than inside to zfs. Your contributions to Illumos > do absolutely nothing to backup that claim. ZFS-monitor is not ZFS > code (it''s an FMA module), WORM also isn''t ZFS code, it''s an OS level > operation, and GZIP hardware acceleration is produced by Indra > networks, and has absolutely nothing to do with ZFS. Does it help > ZFS? Sure, but that''s hardly a code contribution to ZFS when it''s > simply a hardware acceleration card that accelerates ALL gzip code.Um... you have obviously not looked at the code. Our WORM code is not some basic OS guarantees on top of ZFS, but modifications to the ZFS code itself so that ZFS *itself* honors the WORM property, which is implemented as a property on the ZFS filesystem. 
Likewise, the GZIP hardware acceleration support includes specific modifications to the ZFS kernel filesystem code. Of course, we''ve not done anything major to change the fundamental way that ZFS stores data... is that what you''re talking about? I think you must have a very narrow idea of what constitutes an "innovation" in ZFS.> > So, great job picking three projects that are not proof of developers > working on ZFS. And great job not providing any proof to the claim > there are more developers working on ZFS outside of Oracle than within.Nexenta don''t represent that majority actually. A large number of ZFS folks -- people with names like Leventhal, Ahrens, Wilson, and Gregg, are working on ZFS related work at Delphix and Joyent, or so I''ve been told. I don''t have first hand knowledge of *what* the details are, but I''m looking forward to seeing the results. This ignores the contributions from people working on ZFS on other platforms as well. Of course, since I know longer work there, I don''t really know how many people Oracle still has working on ZFS. They could have tasked 1,000 people with it. Or they could have shut the project down entirely. But of the people who had, up until Oracle shut down the open code, made non-trivial contributions to ZFS, I think the majority of *those* people can be found working outside of Oracle now, and I think most of them are still working on ZFS projects. (There are a few "big names" that I don''t know what they are doing precisely -- e.g. Jeff Bonwick.)> > You''re going to need a hell of a lot bigger bank account to cash the > check than what you''ve got. As for me, I don''t recall making any > claims on this list that I can''t back up, so I''m not really sure what > you''re getting at. I can only assume the defensive tone of your email > is because you''ve been called out and can''t backup the claims either. > > So again: if you''ve got code in the works, great. Talk about it when > it''s ready. Stop throwing out baseless claims that you have no proof > of and then fall back on "just be patient, it''s coming". We''ve heard > that enough from Oracle and Sun already.Ok, I''ll shut up now. But I''m going to completely ignore anything else you have to say on this topic, as I have a lot more knowledge of what we''re doing at Nexenta than you have. - Garrett -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20110105/7e52096e/attachment.html>
Edward Ned Harvey wrote:

> From: Deano [mailto:deano at rattie.demon.co.uk]
> Sent: Wednesday, January 05, 2011 9:16 AM
>
> So honestly do we want to innovate ZFS (I do) or do we just want to follow
> Oracle?

> Well, you can''t follow Oracle. Unless you wait till they release something,
> reverse engineer it, and attempt to reimplement it. I am quite sure you''ll
> be sued if you do that.
>
> If you want forward development in the open source tree, you basically have
> only one option: Some major contributor must have a financial interest, and
> commit to a real concerted development effort, with their own roadmap, which
> is intentionally designed NOT to overlap with the Oracle roadmap.
> Otherwise, the code will stagnate.

Why does it need a big backer? Erm, ZFS isn''t that large or amazingly complex code. It is *good* code, but did it take 100s of developers and a fortune to develop? Erm, nope (which I''d bet it never had at Sun either). Why not overlap Oracle? What has it got to do with Oracle if we have split into ZFS (Oracle) and "OpenZFS" in future? "OpenZFS" will get whatever features developers feel they want or need to develop for it.

This is the fundamental choice of open source ZFS: illumos and OpenIndiana (and other distributions) have to decide what their purpose is. Is it a free compatible (though trailing) version of Oracle Solaris, OR a platform that shared an ancestor with Oracle Solaris via Sun OpenSolaris but now is its own evolutionary species, with no more connection than I have with a 15th cousin removed on my great, great, great grandfather''s side?

This isn''t even a theoretical what-if situation for me. I have a major modification to ZFS (still being developed); it has no basis in Oracle''s or anybody else''s needs, just mine. It is what I felt I needed, and ZFS was the right base for it. Now, will that go into "OpenZFS"? Honestly I don''t know yet, because I''m not sure it would be wanted (it will be incompatible with Oracle ZFS), and personally, commercially, I''m not sure if it''s the right move to open source the feature.

I bet I''m not the only small developer out there in a similar situation. The landscape is very unclear about what actually the community wants to do going forward, and whether we will have or even want "OpenZFS" and Oracle ZFS, or Oracle ZFS and 90% compatibles (always trailing), or Oracle ZFS + DevA ZFS + DevB ZFS + DevC ZFS.

Bye,
Deano
deano at cloudpixies.com
On Jan 5, 2011, at 7:44 AM, Edward Ned Harvey wrote:>> From: Khushil Dep [mailto:khushil.dep at gmail.com] >> >> We do have a major commercial interest - Nexenta. It''s been quiet but I do >> look forward to seeing something come out of that stable this year? :-) > > I''ll agree to call Nexenta "a major commerical interest," in regards to contribution to the open source ZFS tree, if they become an officially supported OS on Dell, HP, and/or IBM hardware.NexentaStor is officially supported on Dell, HP, and IBM hardware. The only question is, "what is your definition of ''support''"? Many NexentaStor customers today appear to be deploying on SuperMicro and Quanta systems, for obvious cost reasons. Nexenta has good working relationships with these major vendors and others. As for investment, Nexenta has been and continues to hire the best engineers and professional services people we can find. We see a lot of demand in the market and have been growing at an astonishing rate. If you''d like to contribute to making software storage solutions rather than whining about what Oracle won''t discuss, check us out and send me your resume :-) -- richard
> From: Richard Elling [mailto:Richard.Elling at Nexenta.com] > > > I''ll agree to call Nexenta "a major commerical interest," in regards to > contribution to the open source ZFS tree, if they become an officially > supported OS on Dell, HP, and/or IBM hardware. > > NexentaStor is officially supported on Dell, HP, and IBM hardware. Theonly> question is, "what is your definition of ''support''"? Many NexentaStorI don''t want to argue about this, but I''ll just try to clarify what I meant: Presently, I have a dell server with officially supported solaris, and it''s as unreliable as pure junk. It''s just the backup server, so I''m free to frequently create & destroy it... And as such, I frequently do recreate and destroy it. It is entirely stable running RHEL (centos) because Dell and RedHat have a partnership with a serious number of human beings and machines looking for and fixing any compatibility issues. For my solaris instability, I blame the fact that solaris developers don''t do significant quality assurance on non-sun hardware. To become "officially" compatible, the whole qualification process is like this: Somebody installs it, doesn''t see any problems, and then calls it "certified." They reformat with something else, and move on. They don''t build their business on that platform, so they don''t detect stability issues like the ones reported... System crashes once per week and so forth. Solaris therefore passes the test, and becomes one of the options available on the drop-down menu for OSes with a new server. (Of course that''s been discontinued by oracle, but that''s how it was in the past.) Developers need to "eat their own food." Smoke your own crack. Hardware engineers at Dell need to actually use your OS on their hardware, for their development efforts. I would be willing to bet Sun hardware engineers use a significant percentage of solaris servers for their work... And guess what solaris engineers don''t use? Non-sun hardware. Pretty safe bet you won''t find any Dell servers in the server room where solaris developers do their thing. If you want to be taken seriously as an alternative storage option, you''ve got to at LEAST be listed as a factory-distributed OS that is an option to ship with the new server, and THEN, when people such as myself buy those things, we''ve got to have a good enough experience that we don''t all bitch and flame about it afterward. Nexenta, you need a real and serious partnership with Dell, HP, IBM. Get their developers to run YOUR OS on the servers which they use for development. Get them to sell your product bundled with their product. And dedicate real and serious engineering into bugfixes working with customers, to truly identify root causes of instability, with a real OS development and engineering and support group. It''s got to be STABLE, that''s the #1 requirement. I previously made the comparison... Even close-source solaris & ZFS is a better alternative to close-source netapp & wafl. So for now, those are the only two enterprise supportable options I''m willing to stake my career on, and I''ll buy Sun hardware with Solaris. But I really wish I could feel confident buying a cheaper Dell server and running ZFS on it. Nexenta, if you make yourself look like a serious competitor against solaris, and really truly form an awesome stable partnership with Dell, I will happily buy your stuff instead of Oracle. Even if you are a little behind in feature offering. But I will not buy your stuff if I can''t feel perfectly confident in its stability. 
Ever heard the phrase "Nobody ever got fired for buying IBM"? You're the little guys. If you want to compete against the big guys, you've got to kick ass. And don't get sued into oblivion.

Even today's feature set is perfectly adequate for at least a couple of years to come. If you put all your effort into stability and bugfixes, serious partnerships with Dell, HP, and IBM, and become extremely professional-looking and stable, with fanatical support... you don't have to worry about new feature development for some while. Stability is #1, and not disappearing is the other big one; it's a pretty huge threat right now.

Based on my experience, I would not recommend buying Dell with Solaris, even if that were still an option. If you want Solaris, buy Sun/Oracle hardware, because then you can actually expect it to work reliably. And if Solaris isn't stable on Dell... then all the Solaris derivatives, including Nexenta, can't be trusted either, no matter how much you claim it's "supported."

Show me the HCL, and show me the partnership between your software engineers and Dell's hardware engineers. Make me believe there is a serious and thorough qualification process. Do a huge volume. Your volume must be large enough to justify dedicating some engineers to serious bugfix efforts in the field. Otherwise... when I need to buy something stable... I'm going to buy Solaris on Sun hardware, because I know that's thoroughly tried, tested, and stable.
On Jan 5, 2011, at 4:14 PM, Edward Ned Harvey wrote:

>> From: Richard Elling [mailto:Richard.Elling at Nexenta.com]
>>
>>> I'll agree to call Nexenta "a major commercial interest," in regards to
>>> contribution to the open source ZFS tree, if they become an officially
>>> supported OS on Dell, HP, and/or IBM hardware.
>>
>> NexentaStor is officially supported on Dell, HP, and IBM hardware. The only
>> question is, "what is your definition of 'support'"? Many NexentaStor
>
> I don't want to argue about this, but I'll just try to clarify what I meant:
>
> Presently, I have a Dell server with officially supported Solaris, and it's
> as unreliable as pure junk. It's just the backup server, so I'm free to
> frequently create & destroy it... and as such, I frequently do recreate and
> destroy it. It is entirely stable running RHEL (CentOS), because Dell and
> Red Hat have a partnership with a serious number of human beings and
> machines looking for and fixing compatibility issues. For my Solaris
> instability, I blame the fact that Solaris developers don't do significant
> quality assurance on non-Sun hardware. To become "officially" compatible,
> the whole qualification process is like this: somebody installs it, doesn't
> see any problems, and calls it "certified." They reformat with something
> else and move on. They don't build their business on that platform, so they
> don't detect stability issues like the ones reported... system crashes once
> per week and so forth. Solaris therefore passes the test and becomes one of
> the options available on the drop-down menu of OSes for a new server. (Of
> course that's been discontinued by Oracle, but that's how it was in the past.)

If I understand correctly, you want Dell, HP, and IBM to run OSes other than Microsoft and RHEL. For the thousands of other OSes out there, this is a significant barrier to entry. One can argue that the most significant innovations in the past 5 years came from none of those companies -- they came from Google, Apple, Amazon, Facebook, and the other innovators who did not spend their efforts trying to beat Microsoft and get onto the manufacturing floors of the big vendors.

> Developers need to "eat their own food."

I agree, but neither Dell, HP, nor IBM develops Windows...

> Smoke your own crack. Hardware engineers at Dell need to actually use your
> OS on their hardware, for their development efforts. I would be willing to
> bet Sun hardware engineers use a significant percentage of Solaris servers
> for their work... and guess what Solaris engineers don't use? Non-Sun hardware.

I'm not sure of the current state, but many of the Solaris engineers develop on laptops, and Sun did not offer a laptop product line.

> Pretty safe bet you won't find any Dell servers in the server room where
> Solaris developers do their thing.

You will find them where Nexenta developers live :-)

> If you want to be taken seriously as an alternative storage option, you've
> got to at LEAST be listed as a factory-distributed OS that is an option to
> ship with a new server, and THEN, when people such as myself buy those
> things, we've got to have a good enough experience that we don't all bitch
> and flame about it afterward.

Wait a minute... this is patently false. The big storage vendors -- NetApp, EMC, Hitachi, Fujitsu, LSI -- none of them run on HP, IBM, or Dell servers.

> Nexenta, you need a real and serious partnership with Dell, HP, and IBM.
> Get their developers to run YOUR OS on the servers they use for development.
> Get them to sell your product bundled with their product. And dedicate real
> and serious engineering to bugfixes, working with customers to truly
> identify root causes of instability, with a real OS development,
> engineering, and support group. It's got to be STABLE; that's the #1
> requirement.

There are many marketing activities in progress towards this end. One of Nexenta's major OEMs (Compellent) is being purchased by Dell. The deal is not done, so there is no public information on future plans, to my knowledge.

> I previously made the comparison... even closed-source Solaris & ZFS is a
> better alternative to closed-source NetApp & WAFL. So for now, those are the
> only two enterprise-supportable options I'm willing to stake my career on,
> and I'll buy Sun hardware with Solaris. But I really wish I could feel
> confident buying a cheaper Dell server and running ZFS on it. Nexenta, if
> you make yourself look like a serious competitor against Solaris, and really
> truly form an awesome, stable partnership with Dell, I will happily buy your
> stuff instead of Oracle's, even if you are a little behind in feature
> offering. But I will not buy your stuff if I can't feel perfectly confident
> in its stability.

I can assure you that we take stability very seriously. And since you seem to think the big-box vendors are infallible, here is a sampling of the things we (Nexenta) have to live with:

http://h20000.www2.hp.com/bizsupport/TechSupport/SoftwareDescription.jsp?lang=en&cc=us&prodTypeId=329290&prodSeriesId=3690351&swItem=MTX-d56eb5d75f03485dbc32680f62&prodNameId=4094976&swEnvOID=4024&swLang=13&taskId=135&mode=4&idx=2
http://www.intel.com/assets/pdf/specupdate/321324.pdf
http://support.citrix.com/article/CTX127395
http://lists.us.dell.com/pipermail/linux-poweredge/2010-May/042280.html
http://support.dell.com/support/topics/global.aspx/support/kcs/document?c=us&l=en&s=gen&docid=DSN_619147E926299297E040AE0AB8E103AE&isLegacy=true

If you look very far, you will find that all vendors have issues, and at the end of the day, vendors who integrate other people's products (HP, Dell, IBM) are subject to the same issues the rest of the industry sees. So when you complain about stability issues, it is incumbent on you to identify the responsible vendor or supplier. There is no one-stop shop in the x86 market, and there hasn't been one for the past 3 decades.

> Ever heard the phrase "Nobody ever got fired for buying IBM"? You're the
> little guys. If you want to compete against the big guys, you've got to
> kick ass. And don't get sued into oblivion.

Yes, and since you're finished, you can return that copy of the Dummies Guide to Business to the library.

> Even today's feature set is perfectly adequate for at least a couple of
> years to come. If you put all your effort into stability and bugfixes,
> serious partnerships with Dell, HP, and IBM, and become extremely
> professional-looking and stable, with fanatical support... you don't have to
> worry about new feature development for some while. Stability is #1, and
> not disappearing is the other big one; it's a pretty huge threat right now.

I think everyone will agree that stability is important. Since I've been at Nexenta, I am pleasantly surprised by the lack of panics or data loss.
The current rate at Nexenta is far lower than the rates I saw at Sun (and yes, I did have access to the data).

> Based on my experience, I would not recommend buying Dell with Solaris, even
> if that were still an option. If you want Solaris, buy Sun/Oracle hardware,
> because then you can actually expect it to work reliably. And if Solaris
> isn't stable on Dell... then all the Solaris derivatives, including Nexenta,
> can't be trusted either, no matter how much you claim it's "supported."

Oracle "solves" this problem by not making the support details public... you have to have an account and a service contract to see the dirty-laundry details.

> Show me the HCL, and show me the partnership between your software engineers
> and Dell's hardware engineers.

Uhm... what makes you think Dell invests in hardware development? Dell is a manufacturer and spends very little on product development.

http://stocks.investopedia.com/stock-analysis/jimcramer/CramersMadMoneyRecapItsAlmost2011onWallStreetsCalendarUpdate3.aspx

NB: much of Dell's innovation is in business systems and manufacturing, a good thing, but they are not known for pure research, software development, or product development beyond hardware integration.

> Make me believe there is a serious and thorough qualification process. Do a
> huge volume. Your volume must be large enough to justify dedicating some
> engineers to serious bugfix efforts in the field. Otherwise... when I need
> to buy something stable... I'm going to buy Solaris on Sun hardware, because
> I know that's thoroughly tried, tested, and stable.

In a former life I worked in the Quality Office at Sun. I'm delighted that you have such a fondness for the products. They are quite good. Of course, NexentaStor works quite nicely on Oracle's Sun x86 systems :-)
 -- richard
On Wed, 5 Jan 2011, Edward Ned Harvey wrote:

> with regards to ZFS and all the other projects relevant to Solaris.)
>
> I know in the case of SGE/OGE, it's officially closed source now. As of Dec
> 31st, sunsource is being decommissioned, and the announcement of officially
> closing the SGE source and decommissioning the open source community went
> out on Dec 24th. So all of this leads me to believe, with very little
> reservation, that the new developments beyond zpool 28 are closed source
> moving forward. There's very little breathing room remaining for hope of
> that being open sourced again.

I have no idea what you are talking about. Best I can tell, SGE/OGE is a reference to Sun Grid Engine, which has nothing to do with ZFS. The only announcement and discussion I can find via Google is written by you. It was pretty clear even a year ago that Sun Grid Engine was going away.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
On 06/01/2011 00:14, Edward Ned Harvey wrote:

> Solaris engineers don't use? Non-Sun hardware. Pretty safe bet you won't
> find any Dell servers in the server room where Solaris developers do their
> thing.

You would lose that bet: not only would you find Dell, you would find many other "big names", as well as white-box hand-built systems too. Solaris developers use a lot of different hardware. Sun never made laptops, so many of us have Apple (running Solaris on the metal and/or under virtualisation), Toshiba, or Fujitsu laptops. There are also many workstations around the company that aren't Sun hardware, as well as servers.

--
Darren J Moffat
I've deployed large SANs on both SuperMicro 825/826/846 and Dell R610/R710s, and I've not found any issues so far. I always make a point of installing Intel chipset NICs on the Dells and disabling the Broadcom ones, but other than that it's always been plain sailing -- hardware-wise, anyway.

I've always found that the real issue is formulating SOPs to match what the organisation is used to with legacy storage systems, educating the admins who will manage it going forward, and doing the technical hand-over to folks who may not know, or want to know, a whole lot of *nix land.

My 2p. YMMV.

---
W. A. Khushil Dep - khushil.dep at gmail.com - 07905374843
Windows - Linux - Solaris - ZFS - Nexenta - Development - Consulting & Contracting
http://www.khushil.com/ - http://www.facebook.com/GlobalOverlord
> From: Richard Elling [mailto:Richard.Elling at Nexenta.com]
>
> If I understand correctly, you want Dell, HP, and IBM to run OSes other than
> Microsoft and RHEL.
>
> I agree, but neither Dell, HP, nor IBM develops Windows...
>
> I'm not sure of the current state, but many of the Solaris engineers develop
> on laptops, and Sun did not offer a laptop product line.
>
> You will find them where Nexenta developers live :-)
>
> Wait a minute... this is patently false. The big storage vendors -- NetApp,
> EMC, Hitachi, Fujitsu, LSI -- none of them run on HP, IBM, or Dell servers.

Like I said, I'm not interested in arguing. This is mostly just a bunch of contradictions to what I said. To each his own.

My conclusion is that I am not willing to stake my career on the underdog alternative when I know I can safely buy the Sun hardware and Solaris. I experimented once by buying Solaris on Dell. It was a proven failure, but that's why I did it on a cheap, noncritical backup system experimentally, before expecting it to work in production. Haven't seen any underdog proven solid enough for me to deploy in enterprise yet.
This is a silly argument, but...

> Haven't seen any underdog proven solid enough for me to deploy in
> enterprise yet.

I haven't seen any "over"dog proven solid enough for me to be able to rely on either. Certainly not Solaris.

Don't get me wrong, I like(d) Solaris. But every so often you'd find a bug and they'd take an age to fix it (or to declare that they wouldn't fix it). In one case we had 18 months between reporting a problem and Sun fixing it. In another case it was around 3 months, and because we happened to have the source code, we even told them where the bug was and what a fix could be.

Solaris (and the other "over"dogs) are worth it when you want someone else to do the grunt work and someone else to point at and blame, but let's not romanticize how good it or any of the others are. What made Solaris (10 at least) worth deploying were its features (DTrace, ZFS, SMF, etc.).

Julian
--
Julian King
Computer Officer, University of Cambridge, Unix Support
> From: Bob Friesenhahn [mailto:bfriesen at simple.dallas.tx.us]
>
> On Wed, 5 Jan 2011, Edward Ned Harvey wrote:
>> with regards to ZFS and all the other projects relevant to Solaris.)
>>
>> I know in the case of SGE/OGE, it's officially closed source now. As of Dec
>> 31st, sunsource is being decommissioned, and the announcement of officially
>> closing the SGE source and decommissioning the open source community went
>> out on Dec 24th. So all of this leads me to believe, with very little
>> reservation, that the new developments beyond zpool 28 are closed source
>> moving forward. There's very little breathing room remaining for hope of
>> that being open sourced again.
>
> I have no idea what you are talking about. Best I can tell, SGE/OGE is a
> reference to Sun Grid Engine, which has nothing to do with ZFS. The only
> announcement and discussion I can find via Google is written by you. It was
> pretty clear even a year ago that Sun Grid Engine was going away.

Agreed, SGE/OGE has nothing to do with ZFS, unless you believe there's an Oracle culture which might apply to both. The only thing written by me, as I recall, included links to the original official announcements. Following those links now, I see the archives have been decommissioned. So there ya go.

Since it's still in my inbox, I just saved a copy for you here... It is long-winded, and the main points are: SGE (now called OGE) is officially closed source, and sunsource.net has been decommissioned. There is an open source fork, which will not share code development with the closed-source product.

http://dl.dropbox.com/u/543241/SGE_officially_closed/GE%20users%20GE%20announce%20Changes%20for%20a%20Bright%20Future%20at%20Oracle.txt
> From: Khushil Dep [mailto:khushil.dep at gmail.com]
>
> I've deployed large SANs on both SuperMicro 825/826/846 and Dell
> R610/R710s, and I've not found any issues so far. I always make a point of
> installing Intel chipset NICs on the Dells and disabling the Broadcom ones,
> but other than that it's always been plain sailing -- hardware-wise, anyway.

"Not found any issues"... except the Broadcom one, which causes the system to crash regularly in the default factory configuration.

How did you learn about the Broadcom issue for the first time? I had to learn the hard way, and with all the involvement of both Dell and Oracle support teams, nobody could tell me what I needed to change. We literally replaced every component of the server twice over a period of a year, and I spent man-days upgrading and downgrading firmwares, randomly trying to find a stable configuration. I scoured the internet to find this little tidbit about replacing the Broadcom NIC, randomly guessed, and replaced my NIC with an Intel card to make the problem go away.

The same system doesn't have a problem running RHEL/CentOS.

What will be the new problem in the next line of servers? Why, during my internet scouring, did I find a lot of other reports of people who needed to disable C-states (didn't work for me), and lots of false leads indicating a firmware downgrade would fix my Broadcom issue?

See my point? Next time I buy a server, I do not have confidence to simply expect Solaris on Dell to work reliably. The same goes for Solaris derivatives, and all non-Sun hardware. There simply is not an adequate qualification and/or support process.
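For anyone chasing the same symptom, a quick way to confirm which driver actually sits behind each port before swapping hardware is the dladm tooling. This is only a minimal sketch, assuming an OpenSolaris/Solaris Express-era system; the igb0 link name and the address below are illustrative assumptions, not details from the server discussed above.

List every physical port with the driver backing it (bnx = Broadcom, e1000g/igb = Intel), then the configured datalinks:

# dladm show-phys
# dladm show-link

Then plumb and address only the Intel port, leaving the onboard Broadcom ports unconfigured (disabling them in the BIOS, as described above, is a separate step):

# ifconfig igb0 plumb
# ifconfig igb0 inet 192.168.1.10 netmask 255.255.255.0 up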
Two-fold, really. Firstly, I remember the headaches I used to have configuring Broadcom cards properly under Debian/Ubuntu, and the sweetness that was using an Intel NIC. The bottom line for me was that I knew Intel drivers had been around longer than Broadcom drivers, so it made sense to ensure we had Intel NICs in the server. Secondly, I asked Andy Bennett from Nexenta, who told me it would make sense - always good to get a second opinion :-)

There were/are reports all over Google about Broadcom issues with Solaris/OpenSolaris, so I didn't want to risk it. For a couple of hundred for a quad-port gig NIC, it's worth it when the entire solution is 90K+. Sometimes (like the issue with bus resets when some brands/firmware revs of SSDs are used) the knowledge comes from people you work with (Nexenta rode to the rescue here again - plug! plug! plug!) :-)

These are deployed in a couple of universities and a very large data capture/marketing company I used to work for; I know it works really well, and (plug! plug! plug!) I know the dedicated support I got from the Nexenta guys.

The difference, as I see it, is that OpenSolaris/ZFS/DTrace/FMA allow you to build your own solution to your own problem. Thinking of storage in a completely new way, instead of as "just a block of storage", it becomes an integrated part of performance engineering - it certainly has been for the last two installs I've been involved in. I know why folks want a "certified" solution from the likes of Dell/HP etc., but from my point of view (and all points of view are valid here), I know I can deliver a cheaper, more focussed (and when I say that I'm not just doing some marketing BS) solution for the requirement at hand.

It's sometimes a struggle to get customers/end-users to think of storage as more than just storage. There's quite a lot of entrenched thinking to get around/over in our field (try getting a Java dev to think clearly about thread handling and massive SMP drawbacks, for example). Anyway - not trying to engage in an argument, but it's always interesting to find out why someone went for certain solutions over others.

My 2p. YMMV. *goes off to collect cheque from Nexenta* ;-)

---
W. A. Khushil Dep - khushil.dep at gmail.com - 07905374843
Windows - Linux - Solaris - ZFS - Nexenta - Development - Consulting & Contracting
http://www.khushil.com/ - http://www.facebook.com/GlobalOverlord
On 01/ 6/11 05:28 AM, Edward Ned Harvey wrote:

> See my point? Next time I buy a server, I do not have confidence to simply
> expect Solaris on Dell to work reliably. The same goes for Solaris
> derivatives, and all non-Sun hardware. There simply is not an adequate
> qualification and/or support process.

When you purchase NexentaStor from a top-tier Nexenta Hardware Partner, you get a product that has been through a rigorous qualification process which includes the hardware and software configuration matched together, tested with an extensive battery. You also can get a higher level of support than is offered to people who build their own systems.

Oracle is *not* the only company capable of performing in-depth testing of Solaris. I also know enough about the problems that Oracle customers (or rather Sun customers) faced with Solaris on Sun hardware -- such as the terrible NVIDIA ethernet problems on the first-generation U20 and U40 systems, or the Marvell SATA problems on Thumper -- to know that your picture of Oracle isn't nearly as rosy as you believe. Of course, I also lived (as a Sun employee) through the UltraSPARC-II ECC fiasco...

- Garrett
> From: Edward Ned Harvey <opensolarisisdeadlongliveopensolaris at nedharvey.com>
> Subject: Re: [zfs-discuss] A few questions
>
> How did you learn about the Broadcom issue for the first time? I had to
> learn the hard way, and with all the involvement of both Dell and Oracle
> support teams, nobody could tell me what I needed to change. We literally
> replaced every component of the server twice over a period of a year, and I
> spent man-days upgrading and downgrading firmwares, randomly trying to find
> a stable configuration. I scoured the internet to find this little tidbit
> about replacing the Broadcom NIC, randomly guessed, and replaced my NIC with
> an Intel card to make the problem go away.

20 years of doing this c*(# has taught me that most things only get learned the hard way. I certainly won't bet my career solely on the ability of the vendor to support the product, because they're hardly omniscient. Testing, testing, and generous return policies (and/or R&D budget)...

> The same system doesn't have a problem running RHEL/CentOS.

Then you're not pushing it hard enough, or your stars are just aligned nicely.

We have massive piles of Dell hardware, all types, running CentOS since at least 4.5. Every single one of those Dells has an Intel NIC in it, and the Broadcoms disabled in the BIOS. Because every time we do something stupid like let ourselves think "oh, we could maybe use those extra Broadcom ports for X", we get burned.

High-volume financial trading system. Blew up on the bcoms. Didn't matter what driver or tweak or fix. Plenty of man-days wasted debugging. Went with net.advice, put in an Intel NIC. No more problems. That was 3 years ago. Thought we could use the bcoms for our fileservers. Nope. Thought we could use the bcoms for the dedicated DRBD links for our Xen cluster. Nope. And we know we're not alone in this evaluation. We could have spent forever chasing support to get someone to "fix" it, I suppose... but we have better things to do.

> See my point? Next time I buy a server, I do not have confidence to simply
> expect Solaris on Dell to work reliably. The same goes for Solaris
> derivatives, and all non-Sun hardware. There simply is not an adequate
> qualification and/or support process.

I'm not convinced ANYONE really has such a thing, or that it's even necessarily possible. In fact, I'm sure they don't. Cuz that's what it says in the fine print on the support contracts and the purchase agreements: "we do not guarantee..." I just prefer not to have any confidence for the most part. It's easier and safer.

-bacon
On Thu, Jan 6, 2011 at 11:36 PM, Garrett D'Amore <garrett at nexenta.com> wrote:

> On 01/ 6/11 05:28 AM, Edward Ned Harvey wrote:
>> See my point? Next time I buy a server, I do not have confidence to simply
>> expect Solaris on Dell to work reliably. The same goes for Solaris
>> derivatives, and all non-Sun hardware. There simply is not an adequate
>> qualification and/or support process.
>
> When you purchase NexentaStor from a top-tier Nexenta Hardware Partner, you

Where is the list? Is it the one at http://www.nexenta.com/corp/technology-partners-overview/certified-technology-partners ?

> get a product that has been through a rigorous qualification process which
> includes the hardware and software configuration matched together, tested
> with an extensive battery. You also can get a higher level of support than
> is offered to people who build their own systems.
>
> Oracle is *not* the only company capable of performing in-depth testing of
> Solaris.

Does this roughly mean I can expect similar (or even better) hardware compatibility support with NexentaStor on SuperMicro as with Solaris on Oracle/Sun hardware?

-- Fajar
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Garrett D'Amore
>
> When you purchase NexentaStor from a top-tier Nexenta Hardware Partner,
> you get a product that has been through a rigorous qualification process

How do I do this, exactly? I am serious. Before too long, I'm going to need another server, and I would very seriously consider reprovisioning my unstable Dell Solaris server to become a Linux or some other stable machine. The role it's currently fulfilling is the "backup" server, which basically does nothing except "zfs receive" from the primary Sun Solaris 10u9 file server. Since the role is just for backups, it's a perfect opportunity for experimentation, hence the Dell hardware with Solaris. I'd be happy to put some other configuration in there experimentally instead... say... Nexenta. Assuming it will be just as good at "zfs receive" from the primary server.

Is there some specific hardware configuration you guys sell? Or recommend? How about a Dell R510/R610/R710? Buy the hardware separately and buy NexentaStor as just a software product? Or buy a somehow more certified hardware & software bundle together?

If I do encounter a bug, where the only known fact is that the system keeps crashing intermittently on an approximately weekly basis, and there is absolutely no clue what's wrong in hardware or software... how do you guys handle it?

If you'd like to follow up off-list, that's fine. Then just email me at the email address: nexenta at nedharvey.com (I use disposable email addresses on mailing lists like this, so at any random unknown time, I'll destroy my present alias and start using a new one.)
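The "zfs receive" role described above is just a periodic send/receive pipeline, so whichever OS ends up on the backup box only has to run something along these lines. A minimal sketch; the tank/backup pool names, the snapshot names, and backuphost are assumptions for illustration, not the actual configuration:

# zfs snapshot -r tank@2011-01-08
# zfs send -R -i tank@2011-01-07 tank@2011-01-08 | ssh backuphost zfs receive -Fd backup

The -R/-i pair produces a recursive incremental stream of everything that changed since the previous snapshot; on the receiving side, -F rolls the target back to the last common snapshot before applying the stream, and -d recreates the source dataset layout under the backup pool.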
On 08.01.11 18:33, Edward Ned Harvey wrote:

> If I do encounter a bug, where the only known fact is that the system keeps
> crashing intermittently on an approximately weekly basis, and there is
> absolutely no clue what's wrong in hardware or software... how do you guys
> handle it?
>
> If you'd like to follow up off-list, that's fine. Then just email me at the
> email address: nexenta at nedharvey.com

Hmm... that'd interest me as well. I do have 4 Dell PE R610s that are running OpenSolaris or Solaris 11 Express. I actually bought a Sun Fire X4170 M2, since I couldn't get my R610s stable, just as Edward points out. So, if you guys think that NexentaStor avoids these issues, then I'd seriously consider jumping ship - so either please don't continue off-list, or please include me in that conversation. ;)

Cheers,
budy
On 01/ 8/11 10:43 AM, Stephan Budach wrote:

> On 08.01.11 18:33, Edward Ned Harvey wrote:
>> If I do encounter a bug, where the only known fact is that the system keeps
>> crashing intermittently on an approximately weekly basis, and there is
>> absolutely no clue what's wrong in hardware or software... how do you guys
>> handle it?

Such problems are handled on a case-by-case basis. Usually we can do some analysis from a crash dump, but not always. My team includes several people who are experienced with such analysis, and when problems like this occur, we are called into action. Ultimately this usually results in a patch, sometimes workaround suggestions, and sometimes even binary relief (which happens faster than a regular patch, but without the deeper QA).

- Garrett
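The crash-dump analysis mentioned here is also something an admin can start locally before opening a case. A minimal sketch using the stock Solaris dump tools; the unix.0/vmcore.0 file names are the usual defaults and are assumed here, not taken from any particular system in this thread.

Confirm that crash dumps are configured, and after a panic write the saved dump out under /var/crash/<hostname>:

# dumpadm
# savecore -v

Then pull the basics (panic string, message buffer, panic stack) out of the dump with mdb:

# cd /var/crash/`hostname`
# printf '::status\n::msgbuf\n::stack\n' | mdb unix.0 vmcore.0

The panic string from ::status alone often points at the driver or subsystem involved, which is what a support engineer will ask for first.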
On Sat, Jan 08, 2011 at 12:33:50PM -0500, Edward Ned Harvey wrote:

> Is there some specific hardware configuration you guys sell? Or recommend?
> How about a Dell R510/R610/R710? Buy the hardware separately and buy
> NexentaStor as just a software product? Or buy a somehow more certified
> hardware & software bundle together?

Hey,

Other OSes have had problems with the Broadcom NICs as well. See for example this RHEL5 bug, where the host crashes, probably due to MSI-X IRQs with the bnx2 NIC:

https://bugzilla.redhat.com/show_bug.cgi?id=520888

And VMware vSphere ESX/ESXi 4.1 crashing with bnx2x:

http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1029368

So I guess there are firmware/driver problems affecting not just Solaris but also other operating systems.

-- Pasi
> From: Pasi Kärkkäinen [mailto:pasik at iki.fi]
>
> Other OSes have had problems with the Broadcom NICs as well.

Yes. The difference is, when I go to support.dell.com and punch in my service tag, I can download updated firmware and drivers for RHEL that (at least supposedly) solve the problem. I haven't tested it, but the Dell support guy told me it has worked for RHEL users. There is nothing available to download for Solaris.

Also, the bcom is not the only problem on that server. After I added an Intel network card and disabled the bcom, the weekly crashes stopped, but now it's... I don't know... once every 3 weeks, with a slightly different mode of failure. This is, yet again, rare enough that the system could very well pass a certification test, but not rare enough for me to feel comfortable putting it into production as a primary mission-critical server.

I really think there are only two ways in the world to engineer a good, solid server:
(a) Smoke your own crack. Systems engineering teams use the same systems that are sold to customers.
or
(b) Sell millions of 'em. So regardless of whether the engineering team uses them, you're still going to have sufficient mass to dedicate engineers to the purpose of post-sales bug solving.

I suppose a third way, which has certainly happened in history but is not very applicable to me, is to simply charge such ridiculously high prices for your servers that you can dedicate engineers to post-sales bug solving even if you only sold a handful of those systems in the whole world. Things like munitions-strength Cray and AlphaServer machines have sometimes fit into this category in the past.

I do feel confident assuming that Solaris kernel engineers use Sun servers primarily for their server infrastructure. So I feel safe buying this configuration. The only thing there is to gain by buying something else is lower prices... or maybe some obscure fringe detail that I can't think of.
On Jan 9, 2011, at 4:19 PM, Edward Ned Harvey <opensolarisisdeadlongliveopensolaris at nedharvey.com> wrote:

>> From: Pasi Kärkkäinen [mailto:pasik at iki.fi]
>>
>> Other OSes have had problems with the Broadcom NICs as well.
>
> Yes. The difference is, when I go to support.dell.com and punch in my
> service tag, I can download updated firmware and drivers for RHEL that (at
> least supposedly) solve the problem. I haven't tested it, but the Dell
> support guy told me it has worked for RHEL users. There is nothing
> available to download for Solaris.

The drivers are written by Broadcom and are, AFAIK, closed source. By going through Dell, you are going through a middle-man. For example:

http://www.broadcom.com/support/ethernet_nic/netxtremeii10.php

where you see the Solaris drivers were released at the same time as the Windows drivers.

> Also, the bcom is not the only problem on that server. After I added an
> Intel network card and disabled the bcom, the weekly crashes stopped, but
> now it's... I don't know... once every 3 weeks, with a slightly different
> mode of failure. This is, yet again, rare enough that the system could very
> well pass a certification test, but not rare enough for me to feel
> comfortable putting it into production as a primary mission-critical server.
>
> I really think there are only two ways in the world to engineer a good,
> solid server:
> (a) Smoke your own crack. Systems engineering teams use the same systems
> that are sold to customers.

This is rarely practical, not to mention that product development is often not in the systems engineering organization.

> or
> (b) Sell millions of 'em. So regardless of whether the engineering team
> uses them, you're still going to have sufficient mass to dedicate engineers
> to the purpose of post-sales bug solving.

Yes, indeed :-)
 -- richard
Just to add a bit to this; I just love sweeping generalizations...

On 9 Jan 2011, at 19:33, Richard Elling wrote:

> On Jan 9, 2011, at 4:19 PM, Edward Ned Harvey wrote:
>> Yes. The difference is, when I go to support.dell.com and punch in my
>> service tag, I can download updated firmware and drivers for RHEL that (at
>> least supposedly) solve the problem. I haven't tested it, but the Dell
>> support guy told me it has worked for RHEL users. There is nothing
>> available to download for Solaris.
>
> The drivers are written by Broadcom and are, AFAIK, closed source.
> By going through Dell, you are going through a middle-man. For example:
>
> http://www.broadcom.com/support/ethernet_nic/netxtremeii10.php
>
> where you see the Solaris drivers were released at the same time as the
> Windows drivers.

What Richard says is true. Broadcom has been a source of contention in the Linux world as well as the *BSD world due to the proprietary nature of their firmware. OpenSolaris/Solaris users are not the only ones who have complained about this; there's been much uproar in the FOSS community about Broadcom and their drivers. As a result, I've seen some pretty nasty hacks, like people using the Windows drivers linked into their kernel - *gack*. I forget all the gory details, but it was rather disgusting as I recall: bubblegum, baling wire, duct tape, and all.

Dell and Red Hat aren't exactly a marriage made in heaven either. I've had problems getting support from both Dell and Red Hat, with them pointing fingers at each other rather than solving the problem. Like most people, I've had to come up with my own workarounds, like others with the Broadcom issue, using a "known quantity" NIC. When dealing with Dell as a corporate buyer, they have always made it quite clear that they are primarily a Windows platform. Linux? Oh yes, we have that too...

>> Also, the bcom is not the only problem on that server. After I added an
>> Intel network card and disabled the bcom, the weekly crashes stopped, but
>> now it's... I don't know... once every 3 weeks, with a slightly different
>> mode of failure. This is, yet again, rare enough that the system could very
>> well pass a certification test, but not rare enough for me to feel
>> comfortable putting it into production as a primary mission-critical server.

I've never been particularly warm and fuzzy with Dell servers. They seem to like to change their chipsets slightly while a model is in production. This can cause all sorts of problems which are difficult to diagnose, since an "identical" Dell system will have no problems while its mate crashes weekly.

>> I really think there are only two ways in the world to engineer a good,
>> solid server:
>> (a) Smoke your own crack. Systems engineering teams use the same systems
>> that are sold to customers.
>
> This is rarely practical, not to mention that product development
> is often not in the systems engineering organization.
>
>> or
>> (b) Sell millions of 'em. So regardless of whether the engineering team
>> uses them, you're still going to have sufficient mass to dedicate engineers
>> to the purpose of post-sales bug solving.
>
> Yes, indeed :-)
> -- richard

As for certified systems, it's my understanding that Nexenta themselves don't "certify" anything. They have systems which are recommended and supported by their network of VARs.
It just so happens that SuperMicro is one of the brands of choice, but even then one must adhere to a fairly tight HCL. The same holds true for Solaris/OpenSolaris on third-party hardware. SATA controllers and multiplexers are another example where the drivers are written by the manufacturer, and Solaris/OpenSolaris are not a priority over Windows and Linux, in that order. Deviating to hardware that isn't somewhat "plain vanilla" and isn't listed on the HCL is just asking for trouble.

Mike

---
Michael Sullivan
michael.p.sullivan at me.com
http://www.kamiogi.net/
Mobile: +1-662-202-7716
US Phone: +1-561-283-2034
JP Phone: +81-50-5806-6242
> As for certified systems, it's my understanding that Nexenta themselves
> don't "certify" anything. They have systems which are recommended and
> supported by their network of VARs.

The certified solutions listed on Nexenta's website were certified by Nexenta.