Hi everyone,

We're a small Linux shop (20 users). I am currently using a Linux server to host our 2 TB of data, and I am considering better options for our storage needs. I mostly need instant snapshots and better data protection. I have been looking at EMC NS20 filers and ZFS-based solutions; for ZFS, I am considering the NexentaStor product installed on a Pogo Linux StorageDirector box. The box will mostly be sharing 2 TB over NFS, nothing fancy.

Now, my question: I need to assess ZFS reliability today (Q4 2008) in comparison to an EMC solution. EMC is pretty mature and used at the most demanding sites; ZFS is fairly new, and from time to time I have heard it has had some pretty bad bugs. However, the EMC solution is about 4x more expensive. I need to somehow "quantify" the relative quality level in order to judge whether I should be paying all that much to EMC. The only reliability measure that really matters to me is not losing data!

Is there any real measure, like "probability of total corruption of a pool", that can capture this, so you could tell me ZFS has a pool failure rate of 1 in 10^6 while EMC has a rate of 1 in 10^7? If not, would you rate such a ZFS solution as some percentage of the reliability of an EMC solution?

I know it's a pretty difficult question to answer, but it's the one I need to answer and weigh against the cost. Thanks a million, I really appreciate your help.
Ahmed Kamal wrote:
> I need to somehow "quantify" the relative quality level in order to
> judge whether I should be paying all that much to EMC. The only
> reliability measure that really matters to me is not losing data!

EMC does not, and cannot, provide end-to-end data validation. So how would you measure its data reliability? If you search the zfs-discuss archives, you will find instances where people using high-end storage also had data errors detected by ZFS. So you should consider the two complementary rather than adversaries.
 -- richard
On Mon, Sep 29, 2008 at 09:28:53PM -0700, Richard Elling wrote:
> EMC does not, and cannot, provide end-to-end data validation. So how
> would you measure its data reliability? If you search the zfs-discuss
> archives, you will find instances where people using high-end storage
> also had data errors detected by ZFS.

Mmm ... got any keywords to search for?

-aW
On Mon, Sep 29, 2008 at 9:28 PM, Richard Elling <Richard.Elling at sun.com> wrote:
> EMC does not, and cannot, provide end-to-end data validation. So how
> would you measure its data reliability? If you search the zfs-discuss
> archives, you will find instances where people using high-end storage
> also had data errors detected by ZFS. So you should consider them
> complementary rather than adversaries.

Key word: detected :)

I recall a few of those threads (I'll cite them later when I'm not tired) where people used <insert SAN vendor here> as the iSCSI target, with ZFS on top. ZFS was quite capable of detecting errors, but since they did not let ZFS handle the RAID, and instead relied on another layer, ZFS was not able to correct the errors.

--
Brent Jones
brent at servuhome.net
The good news is that even though the answer to your question is "no", it doesn't matter, because it sounds like what you are doing is a piece of cake :)

Given how cheap hardware is, and how modest your requirements sound, I expect you could build multiple custom systems for the cost of an EMC system. Even that Pogo Linux box is overshooting the mark compared to what a custom system might be. The pricing is typical too, considering they're trying to sell 1 TB drives for $260 when similar drives are less than $150 for regular folks.

The manageability of the NexentaStor software might be worth it to you over a Solaris terminal, but for a small shop with one machine and one guy who knows it well, you might just do the hardware from scratch :) Especially given what there is to know about ZFS and your use case, such as being able to use slower disks with more RAM and an SSD ZIL cache to produce deceptively fast results.

If cost continues to be a concern over performance, also consider that these pre-made systems are not designed for power conservation at all. They're still shipping old, inefficient processors and other such parts, hoping to take advantage of IT people who don't care or know any better. A custom system could potentially cut the total power cost in half...

--
This message posted from opensolaris.org
Thanks for all the answers. Please find more questions below :)

- Good to know EMC filers do not have end-to-end checksums! What about NetApp?

- Any other limitations of the big two NAS vendors as compared to ZFS?

- I still don't have my original question answered: I want to somehow assess the reliability of the ZFS storage stack. If there's no hard data on that, then if any storage expert who works with lots of systems can give his "impression" of the reliability compared to the big two, that would be great!

- Regarding building my own hardware, I don't really want to do that (I am scared enough putting our small but very important data on ZFS). If you know of any Dell box (we usually deal with Dell) that can host at least 10 drives (for expandability) and that is *known* to work very well with NexentaStor, please let me know about it. I am not confident about the hardware quality of the Pogo Linux solution, but forced to go with it for Nexenta. The Sun Thumper is too expensive for me; I am looking for a solution around $10k, and I don't need all those disks or RAM in a Thumper!

- Assuming I plan to host a maximum of 8 TB usable data on the Pogo box as seen in http://www.pogolinux.com/quotes/editsys?sys_id=8498 :
  * Would I need one or two of those quad-core Xeon CPUs?
  * How much RAM is needed?
  * I'm planning on using Seagate 1 TB SATA 7200 rpm disks. Is that crazy? The EMC guy insisted we use 10k Fibre/SAS drives at least. We're currently on three 1 TB SATA disks on my current Linux box, and it's fine for me, at least when it's not rsnapshotting. The workload is NFS for 20 users' homes and some software shares.
  * Assuming the Pogo SATA controller dies, do you suppose I could plug the disks into any other machine and work with them? I wonder why the Pogo box does not come with two controllers; doesn't Solaris support that?

Thanks a lot for your replies.

On Tue, Sep 30, 2008 at 10:31 AM, MC <rac at eastlink.ca> wrote:
> Given how cheap hardware is, and how modest your requirements sound, I
> expect you could build multiple custom systems for the cost of an EMC
> system. [...]
On Sep 30, 2008, at 06:58, Ahmed Kamal wrote:
> I still don't have my original question answered: I want to somehow
> assess the reliability of the ZFS storage stack. If there's no hard
> data on that, then if any storage expert who works with lots of
> systems can give his "impression" of the reliability compared to the
> big two, that would be great!

What would you consider "hard data"? Can you give examples of "hard data" for EMC and NetApp (or anyone else)? Then perhaps similar things can be found for ZFS.
Ahmed Kamal wrote:
> - Good to know EMC filers do not have end-to-end checksums! What about
> NetApp?

If they are not at the end, they can't do end-to-end data validation. Ideally, application writers would do this, but it is a lot of work. ZFS does this on behalf of applications which use ZFS. Hence my comment about ZFS being complementary to your storage device decision.
 -- richard
On Tue, 30 Sep 2008, Ahmed Kamal wrote:
> I still don't have my original question answered: I want to somehow
> assess the reliability of the ZFS storage stack. If there's no hard
> data on that, then if any storage expert who works with lots of
> systems can give his "impression" of the reliability compared to the
> big two, that would be great.

The reliability of that ZFS storage stack primarily depends on the reliability of the hardware it runs on. Note that there is a huge difference between 'reliability' and 'mean time to data loss' (MTTDL). There is also the concern of 'availability', which is a function of how often the system fails and the time to correct a failure.

Bob
=====================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
I guess I am mostly interested in MTTDL for a ZFS system on whitebox hardware (like the Pogo), vs. Data ONTAP on NetApp hardware. Any numbers?

On Tue, Sep 30, 2008 at 4:36 PM, Bob Friesenhahn <bfriesen at simple.dallas.tx.us> wrote:
> The reliability of that ZFS storage stack primarily depends on the
> reliability of the hardware it runs on. Note that there is a huge
> difference between 'reliability' and 'mean time to data loss' (MTTDL).
> There is also the concern of 'availability', which is a function of how
> often the system fails and the time to correct a failure.
On Tue, 30 Sep 2008, Ahmed Kamal wrote:
> I guess I am mostly interested in MTTDL for a ZFS system on whitebox
> hardware (like the Pogo), vs. Data ONTAP on NetApp hardware. Any numbers?

Barring kernel bugs or memory errors, Richard Elling's blog entry seems to be the best guide:

http://blogs.sun.com/relling/entry/raid_recommendations_space_vs_mttdl

It is pretty easy to build a ZFS pool with data-loss probabilities (on paper) about as low as your odds of winning the state jumbo lottery jackpot with a ticket you found on the ground. However, if you want to compete with an EMC system, then you will want to purchase hardware of similar grade. If you purchase a cheapo system from Dell without ECC memory, the actual data reliability will suffer. ZFS protects you against corruption in the data storage path. It does not protect you against main memory errors or random memory overwrites due to a horrific kernel bug. ZFS also does not protect against data loss due to user error, which remains the primary cause of data loss.

Bob
Ahmed Kamal wrote:
> I guess I am mostly interested in MTTDL for a ZFS system on whitebox
> hardware (like the Pogo), vs. Data ONTAP on NetApp hardware. Any numbers?

It depends to a large degree on the disks chosen. NetApp uses enterprise-class disks, and you can expect better reliability from such disks. I've blogged about a few different MTTDL models and posted some model results:

http://blogs.sun.com/relling/tags/mttdl
 -- richard
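For anyone who wants to plug their own numbers into this kind of model, here is a minimal sketch of the classic closed-form MTTDL approximations (not Richard's exact models). It assumes independent drive failures, constant failure and repair rates, and ignores unrecoverable read errors; the MTBF and MTTR figures are placeholders, not vendor data.

    # Simplified MTTDL approximations (hours).  Assumes independent failures
    # and constant failure/repair rates; real-world numbers will differ, and
    # the relative ordering depends heavily on the assumed resilver time.

    def mttdl_mirror(mtbf, mttr, n_disks):
        """n_disks arranged as 2-way mirrors: data is lost when a disk's
        partner fails while the first failure is being resilvered."""
        return mtbf ** 2 / (n_disks * mttr)

    def mttdl_raidz2(mtbf, mttr, n_disks):
        """Single double-parity (raidz2) vdev of n_disks: data is lost on a
        third failure while two resilvers are outstanding."""
        return mtbf ** 3 / (n_disks * (n_disks - 1) * (n_disks - 2) * mttr ** 2)

    if __name__ == "__main__":
        mtbf = 1.2e6            # hours; placeholder "enterprise SATA" figure
        mttr = 24.0             # hours to replace and resilver (assumption)
        hours_per_year = 24 * 365
        print("8-disk mirror MTTDL: %.2e years" %
              (mttdl_mirror(mtbf, mttr, 8) / hours_per_year))
        print("8-disk raidz2 MTTDL: %.2e years" %
              (mttdl_raidz2(mtbf, mttr, 8) / hours_per_year))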
>>>>> "ak" == Ahmed Kamal <email.ahmedkamal at googlemail.com> writes:

    ak> I need to answer and weigh against the cost.

I suggest translating the reliability problems into a cost for mitigating them: price the ZFS alternative as two systems, and keep the second system offline except for nightly backup. Since you care mostly about data loss, not availability, this should work okay. You can lose one day of data, right?

I think you need two zpools, or zpool + LVM2/XFS, some kind of two-filesystem setup, because of the ZFS corruption and panic/freeze-on-import problems. Having two zpools helps with other things, too, like if you need to destroy and recreate the pool to remove a slog or a vdev, or change from mirroring to raidz2, or something like that.

I don't think it's realistic to give a quantitative MTTDL for loss caused by software bugs, from NetApp or from ZFS.

    ak> The EMC guy insisted we use 10k Fibre/SAS drives at least.

I'm still not experienced at dealing with these guys without wasting huge amounts of time. I guess one strategy is to call a bunch of them, so they are all wasting your time in parallel. Last time I tried, the EMC guy wanted to meet _in person_ in the financial district, and then he just stopped calling, so I had to guesstimate his quote from some low-end iSCSI/FC box that Dell was reselling. Have you called NetApp, Hitachi, StorageTek? The IBM NAS is NetApp, so you could call IBM if NetApp ignores you, but you probably want the StoreVault, which is sold differently. The HP NAS looks weird because it runs your choice of Linux or Windows instead of WeirdNASplatform; maybe read some more about that one.

Of course you don't get source, but it surprised me that these guys are MUCH worse than ordinary proprietary software. At least with NetApp stuff, you may as well consider it leased. They leverage the ``appliance'' aspect, and then have sneaky licenses that attempt to obliterate any potential market for used filers. When you're cut off from support you can't even download manuals. If you're accustomed to the ``first sale doctrine'', then ZFS with source has a huge advantage over NetApp, beyond even ZFS's advantage over proprietary software. The idea of dumping all my data into some opaque DRM canister lorded over by asshole CEOs who threaten to sic their corporate lawyers on users on the mailing list offends me just a bit, but I guess we have to follow the ``market forces.''
Thanks guys. It seems the problem is even more difficult than I thought, and there is no real measure for the software quality of the ZFS stack vs. the others, neutralizing the hardware used under both.

I will be using ECC RAM, since you mentioned it, and I will shift to using "enterprise" disks (I had initially thought ZFS always recovers from cheapo SATA disks, making other disks only faster but not also safer), so now I am shifting to 10k rpm SAS disks.

So, I am changing my question to: "Do you see any obvious problems with the following setup I am considering?"

- CPU: 1x Xeon quad-core E5410, 2.33 GHz, 12 MB cache, 1333 MHz
- 16 GB ECC FB-DIMM 667 MHz (8 x 2 GB)
- 10x Seagate 400 GB 10k rpm 16 MB SAS HDD

The 10 disks will be: 2 spare + 2 parity for raidz2 + 6 data => 2.4 TB usable space.

* Do I need more CPU power? How do I measure that? What about RAM?
* Now that I'm using ECC RAM and enterprisey disks, does this put this solution on par with a low-end NetApp 2020, for example?

I will be replicating the important data daily to a Linux box, just in case I hit a wonderful zpool bug. Any final advice before I take the blue pill ;)

Thanks a lot

On Tue, Sep 30, 2008 at 8:40 PM, Miles Nordin <carton at ivy.net> wrote:
> I suggest translating the reliability problems into a cost for
> mitigating them: price the ZFS alternative as two systems, and keep
> the second system offline except for nightly backup. Since you care
> mostly about data loss, not availability, this should work okay. [...]
On Tue, Sep 30, 2008 at 2:10 PM, Ahmed Kamal <email.ahmedkamal at googlemail.com> wrote:
> Now that I'm using ECC RAM and enterprisey disks, does this put this
> solution on par with a low-end NetApp 2020, for example?

*Sort of*. What are you going to be using it for? Half the beauty of NetApp is all the add-on applications you run server-side, like the SnapManager products. If you're just using it for basic single-head file serving, I'd say you're pretty much on par.

IMO, NetApp's clustering is still far superior (yes folks, from a fileserver perspective, not an application-clustering perspective) to anything Solaris has to offer right now, and also much, much, MUCH easier to configure/manage. Let me know when I can plug an InfiniBand cable between two Solaris boxes and type "cf enable" and we'll talk :)

--Tim
Just to confuse you more, I mean, give you another point of view:

> - CPU: 1x Xeon quad-core E5410, 2.33 GHz, 12 MB cache, 1333 MHz

The reason the Xeon line is good is that it lets you squeeze maximum performance out of a given processor technology from Intel, possibly getting the highest performance density. The reason it is bad is that it isn't that much better for a lot more money. A mainstream processor is 80% of the performance for 20% of the price, so unless you need the highest possible performance density, you can save money going mainstream. Not that you should: Intel's mainstream stuff (and indeed many tech companies') is purposely stratified from the enterprise stuff by cutting out features like ECC and higher memory capacity and by using different interface form factors.

> - 10x Seagate 400 GB 10k rpm 16 MB SAS HDD

There is nothing magical about SAS drives. Hard drives are for the most part all built with the same technology. The MTBF on that is 1.4M hours vs. 1.2M hours for the enterprise 1 TB SATA disk, which isn't a big difference. And for comparison, the WD3000BLFS is a consumer drive with a 1.4M-hour MTBF. We know that enterprise SATA drives are the same as the consumer drives, just with different firmware optimized for server workloads and longer testing designed to detect infant mortality, which affects MTBF just as much as old-age failure. The MTBF difference from this extra testing at the start is huge. So you can tell right there that the perceived extra-reliability scam they're running is bunk. The SAS interface is a psychological tool to help disguise the fact that we're all using roughly the same stuff :) Do your own 24-hour or 7-day stress-testing before deployment to weed out bad drives. Apparently old humans don't live that much longer than they did in years gone by; instead, many fewer of our babies die, which makes the average lifespan of everyone go up :)

You know that 1 TB SATA works for you now. Don't let some big greedy company convince you otherwise. That extra money should be spent on your payroll, not on filling EMC's coffers. ZFS provides a new landscape for storage. It is entirely possible that a server built with mainstream hardware can be cheaper, faster, and at least as reliable as an EMC system. Manageability and interoperability and all those things are another issue, however.

--
This message posted from opensolaris.org
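To put those MTBF figures in perspective, converting them to an approximate annualized failure rate (AFR) is simple arithmetic. The sketch below uses the 1.2M-hour and 1.4M-hour numbers quoted above; vendor MTBF is a population statistic, so treat the output as a rough comparison only.

    # Convert vendor MTBF figures to an approximate annualized failure rate.
    # The MTBF values are the ones quoted above, not independent data.
    import math

    hours_per_year = 24 * 365

    for label, mtbf in [("enterprise 1 TB SATA (1.2M h)", 1.2e6),
                        ("10k SAS / WD3000BLFS (1.4M h)", 1.4e6)]:
        afr = 1 - math.exp(-hours_per_year / mtbf)   # fraction failing per year
        print("%-30s AFR ~ %.2f%%" % (label, afr * 100))

Run with these inputs, the two come out around 0.7% and 0.6% per year, which is the "isn't a big difference" point made above.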
On 30-Sep-08, at 6:58 AM, Ahmed Kamal wrote:
> - Good to know EMC filers do not have end-to-end checksums! What about
> NetApp?

Bluntly: no remote storage can have it, by definition. The checksum needs to be computed as close as possible to the application. That's why ZFS can do this and hardware solutions can't (being several unreliable subsystems away from the data).

--Toby
Toby Thain wrote:
> Bluntly: no remote storage can have it, by definition. The checksum
> needs to be computed as close as possible to the application. That's
> why ZFS can do this and hardware solutions can't (being several
> unreliable subsystems away from the data).

Well... that's not _strictly_ true. ZFS can still munge things up as a result of faulty memory. And it's entirely possible to build a hardware end-to-end system which is at least as reliable as ZFS (i.e. is only faultable due to host memory failures). It's just neither easy, nor currently available from anyone I know of. Doing such checking is far easier at the filesystem level than anywhere else, which is a big strength of ZFS over other hardware solutions.

Several of the storage vendors (EMC and NetApp included, I believe) do support hardware checksumming over on the SAN/NAS device, but that still leaves them vulnerable to HBA and transport-medium (e.g. FibreChannel/SCSI/Ethernet) errors, which they don't currently have a solution for.

I'd be interested in seeing if anyone has statistics about where errors occur in the data stream. My gut tells me (from most common to least):

(1) hard drives
(2) transport medium (particularly if it's Ethernet)
(3) SAN/NAS controller cache
(4) host HBA
(5) SAN/NAS controller
(6) host RAM
(7) host bus issues

--
Erik Trimble
Java System Support
Mailstop: usca22-123
Phone: x17195
Santa Clara, CA
Timezone: US/Pacific (GMT-0800)
On Tue, Sep 30, 2008 at 4:26 PM, Toby Thain <toby at telegraphics.com.au> wrote:
> Bluntly: no remote storage can have it, by definition. The checksum
> needs to be computed as close as possible to the application. That's
> why ZFS can do this and hardware solutions can't (being several
> unreliable subsystems away from the data).

So how is a server running Solaris with a QLogic HBA connected to an FC JBOD any different than a NetApp filer running ONTAP with a QLogic HBA directly connected to an FC JBOD? How is it "several unreliable subsystems away from the data"?

That's a great talking point, but it's far from accurate.

--Tim
On Tue, Sep 30, 2008 at 21:48, Tim <tim at tcsac.net> wrote:
> So how is a server running Solaris with a QLogic HBA connected to an FC
> JBOD any different than a NetApp filer running ONTAP with a QLogic HBA
> directly connected to an FC JBOD? How is it "several unreliable
> subsystems away from the data"?
>
> That's a great talking point, but it's far from accurate.

Do your applications run on the NetApp filer? The idea of ZFS as I see it is to checksum the data from when the application puts the data into memory until it reads it out of memory again. Separate filers can checksum from when data is written into their buffers until they receive the request for that data, but to get from the filer to the machine running the application, the data must be sent across an unreliable medium. If data is corrupted between the filer and the host, the corruption cannot be detected. Perhaps the filer could use a special protocol and include the checksum for each block, but then the host must verify the checksum for it to be useful.

Contrast this with ZFS. It takes the application data, checksums it, and writes the data and the checksum out across the (unreliable) wire to the (unreliable) disk. Then when a read request comes, it reads the data and checksum across the (unreliable) wire, and verifies the checksum on the *host* side of the wire. If the data is corrupted any time between the checksum being calculated on the host and checked on the host, it can be detected. This adds a couple more layers of verifiability than filer-based checksums.

Will
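To make the "where is the checksum verified" point concrete, here is a minimal sketch of the idea (not ZFS code, just the principle): the checksum is computed before the data leaves host memory and re-verified after it comes back, so corruption introduced anywhere in between, whether on the wire, in an HBA, in controller cache, or on the platter, is detectable. The dictionary stands in for whatever unreliable storage path sits below the host.

    # Host-side (end-to-end) checksum verification, illustrated with a dict
    # standing in for the storage path below the host.
    import hashlib

    def write_block(store, key, data):
        """Compute the checksum on the host before the data leaves memory."""
        store[key] = (hashlib.sha256(data).hexdigest(), data)

    def read_block(store, key):
        """Re-verify the checksum on the host after the data comes back."""
        checksum, data = store[key]
        if hashlib.sha256(data).hexdigest() != checksum:
            raise IOError("checksum mismatch: corruption between write and read")
        return data

    store = {}
    write_block(store, "block0", b"important payroll data")

    # Simulate silent corruption somewhere below the host.
    checksum, data = store["block0"]
    store["block0"] = (checksum, b"important puyroll data")

    try:
        read_block(store, "block0")
    except IOError as e:
        print(e)   # detected on the host, unlike a device-side-only check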
Will Murnane wrote:
> Contrast this with ZFS. It takes the application data, checksums it,
> and writes the data and the checksum out across the (unreliable) wire
> to the (unreliable) disk. Then when a read request comes, it reads the
> data and checksum across the (unreliable) wire, and verifies the
> checksum on the *host* side of the wire.

To make Will's argument more succinct (<wink>): with a NetApp, errors undetectable by the NetApp can be introduced at the HBA and transport layers (FC switch, slightly damaged cable). ZFS will detect such errors, and fix them (if properly configured). NetApp has no such ability.

Also, I'm not sure that a NetApp (or EMC) has the ability to find bit-rot. That is, they can determine whether a block is written correctly, but I don't know if they keep the block checksum around permanently, nor how redundant that stored block checksum is. If they don't permanently write the block checksum somewhere, then the NetApp has no way to determine whether a read block is OK and hasn't suffered from bit-rot (i.e. disk block failure). And if the checksum isn't stored redundantly, they have the potential to lose the ability to do read verification. Neither is a problem for ZFS.

In many of my production environments, I've got at least two different FC switches between my hosts and disks. And with longer cables comes more of a chance that something gets bent a bit too much. Finally, HBAs are not the most reliable things I've seen (sadly).

--
Erik Trimble
Java System Support
On Tue, Sep 30, 2008 at 03:19:40PM -0700, Erik Trimble wrote:
> To make Will's argument more succinct (<wink>): with a NetApp, errors
> undetectable by the NetApp can be introduced at the HBA and transport
> layers (FC switch, slightly damaged cable). ZFS will detect such
> errors, and fix them (if properly configured). NetApp has no such
> ability.

It sounds like you mean the NetApp can't detect silent errors in its own storage. It can (in a manner similar, but not identical, to ZFS). The difference is that the NetApp is always remote from the application, and cannot detect corruption introduced before the data arrives at the filer.

> Also, I'm not sure that a NetApp (or EMC) has the ability to find
> bit-rot. That is, they can determine whether a block is written
> correctly, but I don't know if they keep the block checksum around
> permanently, nor how redundant that stored block checksum is.

A NetApp filer does have a permanent block checksum that can verify reads. To my knowledge, it is not redundant. But then, if it fails, you can just declare that block bad and fall back on the RAID/mirror redundancy to supply the data.

-- Darren
On Tue, Sep 30, 2008 at 5:19 PM, Erik Trimble <Erik.Trimble at sun.com> wrote:
> Also, I'm not sure that a NetApp (or EMC) has the ability to find
> bit-rot. That is, they can determine whether a block is written
> correctly, but I don't know if they keep the block checksum around
> permanently, nor how redundant that stored block checksum is.

NetApp's block-appended checksum approach appears similar but is in fact much stronger. Like many arrays, NetApp formats its drives with 520-byte sectors. It then groups them into 8-sector blocks: 4K of data (the WAFL filesystem blocksize) and 64 bytes of checksum. When WAFL reads a block it compares the checksum to the data just like an array would, but there's a key difference: it does this comparison after the data has made it through the I/O path, so it validates that the block made the journey from platter to memory without damage in transit.
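As a back-of-the-envelope check, the layout described above (8 sectors of 520 bytes carrying one 4K WAFL block plus its checksum) works out as follows; the constants are taken from the description in this thread, not from vendor documentation.

    # Arithmetic behind the block-appended checksum layout described above.
    sector_size = 520            # bytes per formatted sector
    sectors_per_block = 8
    wafl_block = 4096            # bytes of filesystem data per block

    raw = sector_size * sectors_per_block    # bytes on the platter per block
    checksum_area = raw - wafl_block         # bytes left over for the checksum
    overhead = checksum_area / float(raw)

    print("raw bytes per block: %d" % raw)            # 4160
    print("checksum area:       %d bytes" % checksum_area)  # 64
    print("space overhead:      %.1f%%" % (overhead * 100)) # ~1.5%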
> Intel's mainstream stuff (and indeed many tech companies') is purposely
> stratified from the enterprise stuff by cutting out features like ECC
> and higher memory capacity and by using different interface form
> factors.

Well, I guess I am getting a Xeon anyway.

> There is nothing magical about SAS drives. Hard drives are for the most
> part all built with the same technology. The MTBF on that is 1.4M hours
> vs. 1.2M hours for the enterprise 1 TB SATA disk, which isn't a big
> difference. And for comparison, the WD3000BLFS is a consumer drive with
> a 1.4M-hour MTBF.

Hmm ... well, there is a considerable price difference, so unless someone says I'm horribly mistaken, I now want to go back to Barracuda ES 1 TB 7200 rpm drives. By the way, how many of those would saturate a single (non-trunked) gigabit Ethernet link? The workload is NFS sharing of software and homes. I think 4 disks should be about enough to saturate it?

BTW, for everyone saying ZFS is more reliable because it's closer to the application than a NetApp: well, at least in my case it isn't. The Solaris box will be NFS sharing, and the apps will be running on remote Linux boxes. So I guess this makes them equal. How about a new "reliable NFS" protocol that computes the hashes on the client side and sends them over the wire to be written remotely on the ZFS storage node?!
On Tue, Sep 30, 2008 at 6:03 PM, Ahmed Kamal <email.ahmedkamal at googlemail.com> wrote:
> Hmm ... well, there is a considerable price difference, so unless
> someone says I'm horribly mistaken, I now want to go back to Barracuda
> ES 1 TB 7200 rpm drives. By the way, how many of those would saturate a
> single (non-trunked) gigabit Ethernet link?

SAS has far greater performance, and if your workload is extremely random, will have a longer MTBF. SATA drives suffer badly on random workloads.

> BTW, for everyone saying ZFS is more reliable because it's closer to
> the application than a NetApp: well, at least in my case it isn't. The
> Solaris box will be NFS sharing, and the apps will be running on remote
> Linux boxes. So I guess this makes them equal. How about a new
> "reliable NFS" protocol that computes the hashes on the client side and
> sends them over the wire to be written remotely on the ZFS storage
> node?!

Won't be happening anytime soon.

--Tim
On Tue, Sep 30, 2008 at 06:09:30PM -0500, Tim wrote:
> On Tue, Sep 30, 2008 at 6:03 PM, Ahmed Kamal <email.ahmedkamal at googlemail.com> wrote:
> > How about a new "reliable NFS" protocol that computes the hashes on
> > the client side and sends them over the wire to be written remotely
> > on the ZFS storage node?!
>
> Won't be happening anytime soon.

If you use RPCSEC_GSS with integrity protection then you've got it already.
Ahmed Kamal wrote:
> BTW, for everyone saying ZFS is more reliable because it's closer to
> the application than a NetApp: well, at least in my case it isn't. The
> Solaris box will be NFS sharing, and the apps will be running on remote
> Linux boxes. So I guess this makes them equal. How about a new
> "reliable NFS" protocol that computes the hashes on the client side and
> sends them over the wire to be written remotely on the ZFS storage
> node?!

We've actually prototyped an NFS protocol extension that does this, but the challenges are integrating it with ZFS to form a single protection domain, and getting the protocol extension to become a standard. For now, an option you have is Kerberos with data integrity (krb5i): the sender computes a CRC of the data and the receiver can verify it, to rule out over-the-wire corruption. This is, of course, not end-to-end from platter to memory, but it introduces a separate protection domain for the NFS link.

Rob T
On Tue, 30 Sep 2008, Miles Nordin wrote:
> I think you need two zpools, or zpool + LVM2/XFS, some kind of
> two-filesystem setup, because of the ZFS corruption and
> panic/freeze-on-import problems.

If ZFS provides such a terrible experience for you, can I be brave enough to suggest that perhaps you are on the wrong mailing list and should be watching the pinwheels with HFS+? ;-)

While we surely do hear all the horror stories on this list, I don't think that ZFS is as wildly unstable as you make it out to be.

Bob
On Wed, 1 Oct 2008, Ahmed Kamal wrote:
> So I guess this makes them equal. How about a new "reliable NFS"
> protocol that computes the hashes on the client side and sends them
> over the wire to be written remotely on the ZFS storage node?!

Modern NFS runs over a TCP connection, which includes its own data validation. This surely helps.

Bob
>>>>> "rt" == Robert Thurlow <robert.thurlow at sun.com> writes:

    rt> introduces a separate protection domain for the NFS link.

There are checksums in the Ethernet FCS, checksums in IP headers, checksums in UDP headers (which are sometimes ignored), and checksums in TCP (which are not ignored). There might be an RPC-layer checksum too, I'm not sure.

Different arguments can be made against each, I suppose, but did you have a particular argument in mind? Have you experienced corruption with NFS that you can blame on the network, not the CPU/memory/busses of the server and client? I've seen enough to make me buy stories of corruption in disks, disk interfaces, and memory, but not yet with TCP, so I'd like to hear the story as well as the hypothetical argument, if there is one.
On Sep 30, 2008, at 19:44, Miles Nordin wrote:
> There are checksums in the Ethernet FCS, checksums in IP headers,
> checksums in UDP headers (which are sometimes ignored), and checksums
> in TCP (which are not ignored). There might be an RPC-layer checksum
> too, I'm not sure.

None of which helped Amazon when their S3 service went down due to a flipped bit:

> More specifically, we found that there were a handful of messages on
> Sunday morning that had a single bit corrupted such that the message
> was still intelligible, but the system state information was
> incorrect. We use MD5 checksums throughout the system, for example, to
> prevent, detect, and recover from corruption that can occur during
> receipt, storage, and retrieval of customers' objects. However, we
> didn't have the same protection in place to detect whether this
> particular internal state information had been corrupted.

http://status.aws.amazon.com/s3-20080720.html
On Sep 30, 2008, at 19:09, Tim wrote:
> SAS has far greater performance, and if your workload is extremely
> random, will have a longer MTBF. SATA drives suffer badly on random
> workloads.

Well, if you can afford more SATA drives for the purchase price, you can put them in a striped-mirror setup, and that may help things. If your disks are cheap you can afford to buy more of them (space, heat, and power notwithstanding).
On Tue, Sep 30, 2008 at 7:15 PM, David Magda <dmagda at ee.ryerson.ca> wrote:
> Well, if you can afford more SATA drives for the purchase price, you
> can put them in a striped-mirror setup, and that may help things. If
> your disks are cheap you can afford to buy more of them (space, heat,
> and power notwithstanding).

More disks will not solve SATA's problem. I run into this on a daily basis working on enterprise storage. If it's just for archive/storage, or even sequential streaming, it shouldn't be a big deal. If it's a random workload, there's pretty much nothing you can do to get around it short of more front-end cache and intelligence, which is simply a band-aid, not a fix.

--Tim
> Well, if you can afford more SATA drives for the purchase price, you
> can put them in a striped-mirror setup, and that may help things. If
> your disks are cheap you can afford to buy more of them (space, heat,
> and power notwithstanding).

Hmm, that's actually cool! If I configure the system with:

10 x 400 GB 10k rpm SAS  == cost ==> $13k
10 x 1 TB SATA 7200 rpm  == cost ==> $9k

Always assuming 2 spare disks, and using the SATA disks, I would configure them as RAID-1 mirrors (raidz2 for the 400 GB option). Besides being cheaper, I would get more usable space (4 TB vs. 2.4 TB), better performance from mirroring (right?), and better data reliability?? (I don't really know about that last one.)

Is this a recommended setup? It looks too good to be true?
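For reference, the usable-space figures in that comparison are just spares, mirroring, and parity overhead; a small sketch (ignoring filesystem overhead and any right-sizing of the drives):

    # Usable-capacity comparison for the two layouts discussed above.
    # Only spare, mirror, and parity overhead is counted.

    def usable_mirror(disks, size_tb, spares):
        """Remaining disks arranged as 2-way mirrors."""
        return (disks - spares) // 2 * size_tb

    def usable_raidz2(disks, size_tb, spares):
        """Remaining disks in a single double-parity (raidz2) vdev."""
        return (disks - spares - 2) * size_tb

    print("10 x 1 TB SATA, 2 spares, mirrored: %.1f TB" % usable_mirror(10, 1.0, 2))
    print("10 x 0.4 TB SAS, 2 spares, raidz2:  %.1f TB" % usable_raidz2(10, 0.4, 2))

Which prints the 4.0 TB and 2.4 TB figures quoted above.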
Bob Friesenhahn wrote:
> On Wed, 1 Oct 2008, Ahmed Kamal wrote:
>> So I guess this makes them equal. How about a new "reliable NFS"
>> protocol that computes the hashes on the client side and sends them
>> over the wire to be written remotely on the ZFS storage node?!
>
> Modern NFS runs over a TCP connection, which includes its own data
> validation. This surely helps.

Less than we'd sometimes like :-) The TCP checksum isn't very strong, and we've seen corruption tied to a broken router, where the Ethernet checksum was recomputed on bad data and the TCP checksum didn't help. It sucked.

Rob T
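A small illustration of why the TCP checksum is considered weak: it is a 16-bit ones'-complement sum (RFC 1071), so any reordering of 16-bit words, or any pair of errors that cancel out, passes unnoticed. The sketch below shows only the core checksum algorithm, ignoring the TCP pseudo-header and options.

    # The Internet checksum (RFC 1071) is just a folded 16-bit sum, so
    # reordering 16-bit words in transit produces the same checksum.
    import struct

    def internet_checksum(data):
        if len(data) % 2:
            data += b"\x00"
        total = sum(struct.unpack("!%dH" % (len(data) // 2), data))
        while total >> 16:                       # fold carries back in
            total = (total & 0xFFFF) + (total >> 16)
        return ~total & 0xFFFF

    good = bytes.fromhex("1111222233334444")
    bad  = bytes.fromhex("3333444411112222")     # same words, reordered

    print(hex(internet_checksum(good)))
    print(hex(internet_checksum(bad)))           # identical: undetected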
On Tue, Sep 30, 2008 at 7:30 PM, Ahmed Kamal <email.ahmedkamal at googlemail.com> wrote:
> Always assuming 2 spare disks, and using the SATA disks, I would
> configure them as RAID-1 mirrors (raidz2 for the 400 GB option).
> Besides being cheaper, I would get more usable space (4 TB vs. 2.4 TB),
> better performance from mirroring (right?), and better data
> reliability?? (I don't really know about that last one.)
>
> Is this a recommended setup? It looks too good to be true?

I *HIGHLY* doubt you'll see better performance out of the SATA, but it is possible. You don't need 2 spares with SAS; 1 is more than enough with that few disks. I'd also suggest doing RAID-Z (RAID-5) if you've only got 9 data disks. 8+1 is more than acceptable with SAS drives.

--Tim
Miles Nordin wrote:
> There are checksums in the Ethernet FCS, checksums in IP headers,
> checksums in UDP headers (which are sometimes ignored), and checksums
> in TCP (which are not ignored).
>
> Different arguments can be made against each, I suppose, but did you
> have a particular argument in mind?
>
> Have you experienced corruption with NFS that you can blame on the
> network, not the CPU/memory/busses of the server and client?

Absolutely. See my recent post in this thread. The TCP checksum is not that strong, and a router broken the right way can regenerate a correct-looking Ethernet checksum on bad data. krb5i fixed it nicely.

Rob T
Tim wrote:
> More disks will not solve SATA's problem. I run into this on a daily
> basis working on enterprise storage. If it's just for archive/storage,
> or even sequential streaming, it shouldn't be a big deal. If it's a
> random workload, there's pretty much nothing you can do to get around
> it short of more front-end cache and intelligence, which is simply a
> band-aid, not a fix.

I observe that there are no disk vendors supplying SATA disks with speed > 7,200 rpm. It is no wonder that a 10k rpm disk outperforms a 7,200 rpm disk for random workloads. I'll attribute this to intentional market segmentation by the industry rather than a deficiency in the transfer protocol (SATA).
 -- richard
> I observe that there are no disk vendors supplying SATA disks with
> speed > 7,200 rpm. It is no wonder that a 10k rpm disk outperforms a
> 7,200 rpm disk for random workloads. I'll attribute this to intentional
> market segmentation by the industry rather than a deficiency in the
> transfer protocol (SATA).

I don't really need more performance than what's needed to saturate a gig link (4 SATA disks?). So, performance aside, does SAS have other benefits? Data integrity? How would 8 SATA disks in RAID-1 compare vs. another 8 smaller SAS disks in raidz(2)?
Ahmed Kamal wrote:
> I don't really need more performance than what's needed to saturate a
> gig link (4 SATA disks?).

It depends on the disk. A Seagate Barracuda 500 GByte SATA disk is rated at a media speed of 105 MBytes/s, which is near the limit of a GbE link. In theory, one disk would be close; two should do it.

> So, performance aside, does SAS have other benefits? Data integrity?
> How would 8 SATA disks in RAID-1 compare vs. another 8 smaller SAS
> disks in raidz(2)?

Like apples and pomegranates. Both should be able to saturate a GbE link.
 -- richard
>> So, performance aside, does SAS have other benefits? Data integrity?
>> How would 8 sata disks in raid1 compare vs another 8 smaller SAS
>> disks in raidz(2)?
>
> Like apples and pomegranates. Both should be able to saturate a GbE link.

You're the expert, but isn't the 100 MB/s figure for streaming, not
random read/write? For random I/O, I suppose the disk drops to around
25 MB/s, which is why I was mentioning 4 sata disks.

When I was asking to compare the two raids, it was aside from
performance. Basically sata is obviously cheaper and it will saturate
the gig link, so performance is fine too; the question becomes which
has better data protection (8 sata raid1 or 8 sas raidz2).
On Wed, 1 Oct 2008, Ahmed Kamal wrote:
> Always assuming 2 spare disks, and using the sata disks, I would
> configure them in a raid1 mirror (raid6 for the 400G). Besides being
> cheaper, I would get more useable space (4TB vs 2.4TB), better
> performance from raid1 (right?), and better data reliability?? (don't
> really know about that one)?
>
> Is this a recommended setup? It looks too good to be true?

Using mirrors will surely make up quite a lot for disks with slow seek
times. Reliability is acceptable for most purposes. Resilver should be
pretty fast.

Bob
=====================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
On Tue, Sep 30, 2008 at 8:13 PM, Ahmed Kamal
<email.ahmedkamal at googlemail.com> wrote:
>> So, performance aside, does SAS have other benefits? Data integrity?
>> How would 8 sata disks in raid1 compare vs another 8 smaller SAS
>> disks in raidz(2)?
>> Like apples and pomegranates. Both should be able to saturate a GbE link.
>
> You're the expert, but isn't the 100 MB/s figure for streaming, not
> random read/write? For that, I suppose the disk drops to around
> 25 MB/s, which is why I was mentioning 4 sata disks.
>
> When I was asking to compare the two raids, it was aside from
> performance: basically sata is obviously cheaper, it will saturate the
> gig link, so performance is fine too, so the question becomes which has
> better data protection (8 sata raid1 or 8 sas raidz2)

SAS's main benefits are seek time and max IOPS.

--Tim
On Tue, 30 Sep 2008, Robert Thurlow wrote:
>> Modern NFS runs over a TCP connection, which includes its own data
>> validation. This surely helps.
>
> Less than we'd sometimes like :-) The TCP checksum isn't
> very strong, and we've seen corruption tied to a broken
> router, where the Ethernet checksum was recomputed on
> bad data, and the TCP checksum didn't help. It sucked.

TCP does not see the router. The TCP and ethernet checksums are at
completely different levels. Routers do not pass ethernet packets.
They pass IP packets. Your statement does not make technical sense.

Bob
Hm, Richard's excellent graphs here http://blogs.sun.com/relling/tags/mttdl
as well as his words say he prefers mirroring over raidz/raidz2 almost
always. It's better for performance and MTTDL.

Since 8 sata disks in raid1 are cheaper and probably more reliable than
8 SAS disks in raidz2 (and I don't need the extra SAS performance), and
offer better performance and MTTDL than 8 sata disks in raidz2, I guess
I will go with 8-sata-raid1 then! Hope I'm not horribly mistaken :)
On 30-Sep-08, at 6:31 PM, Tim wrote:
>
> On Tue, Sep 30, 2008 at 5:19 PM, Erik Trimble <Erik.Trimble at sun.com> wrote:
>
>> To make Will's argument more succinct (<wink>), with a NetApp,
>> undetectable (by the NetApp) errors can be introduced at the HBA
>> and transport layer (FC switch, slightly damaged cable) levels.
>> ZFS will detect such errors, and fix them (if properly configured).
>> NetApp has no such ability.
>>
>> Also, I'm not sure that a NetApp (or EMC) has the ability to find
>> bit-rot. ...
>
> NetApp's block-appended checksum approach appears similar but is in
> fact much stronger. Like many arrays, NetApp formats its drives
> with 520-byte sectors. It then groups them into 8-sector blocks: 4K
> of data (the WAFL filesystem blocksize) and 64 bytes of checksum.
> When WAFL reads a block it compares the checksum to the data just
> like an array would, but there's a key difference: it does this
> comparison after the data has made it through the I/O path, so it
> validates that the block made the journey from platter to memory
> without damage in transit.

This is not end to end protection; they are merely saying the data
arrived in the storage subsystem's memory verifiably intact. The data
still has a long way to go before it reaches the application.

--Toby
>>>>> "rt" == Robert Thurlow <robert.thurlow at sun.com> writes: >>>>> "dm" == David Magda <dmagda at ee.ryerson.ca> writes:dm> Not of which helped Amazon when their S3 service went down due dm> to a flipped bit: ok, I get that S3 went down due to corruption, and that the network checksums I mentioned failed to prevent the corruption. The missing piece is: belief that the corruption occurred on the network rather than somewhere else. Their post-mortem sounds to me as though a bit flipped inside the memory of one server could be spread via this ``gossip'''' protocol to infect the entire cluster. The replication and spreadability of the data makes their cluster into a many-terabyte gamma ray detector. I wonder if they even use a meaningful VPN. > Modern NFS runs over a TCP connection, which includes its own > data validation. This surely helps. Yeah fine, but IP and UDP and Ethernet also have checksums. The one in TCP isn''t much fancier. rt> The TCP checksum isn''t very strong, and we''ve seen corruption rt> tied to a broken router, where the Ethernet checksum was rt> recomputed on bad data, and the TCP checksum didn''t help. It rt> sucked. That''s more like what I was looking for. The other concept from your first post of ``protection domains'''' is interesting, too (of one domain including ZFS and NFS). Of course, what do you do when you get an error on an NFS client, throw ``stale NFS file handle?'''' Even speaking hypothetically, it depends on good exception handling for its value, which has been a big trouble spot for ZFS so far. This ``protection domain'''' concept is already enshrined in IEEE 802.1d---bridges are not supposed to recalculate the FCS, and if they need to mangle the packet they''re supposed to update the FCS algorithmically based on fancy math and only the bits they changed, not just recalculate it over the whole packet. They state this is to protect against bad RAM inside the bridge. I don''t know if anyone DOES that, but it''s written into the spec. But if the network is L3, then FCS and IP checksums (ttl decrement) will have to be recalculated, so the ``protection domain'''' is partly split leaving only the UDP/TCP checksum contiguous. -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 304 bytes Desc: not available URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20080930/b4399e2b/attachment.bin>
On Tue, Sep 30, 2008 at 8:50 PM, Toby Thain <toby at telegraphics.com.au> wrote:
>> NetApp's block-appended checksum approach appears similar but is in
>> fact much stronger. Like many arrays, NetApp formats its drives with
>> 520-byte sectors. [...] it does this comparison after the data has
>> made it through the I/O path, so it validates that the block made the
>> journey from platter to memory without damage in transit.
>
> This is not end to end protection; they are merely saying the data
> arrived in the storage subsystem's memory verifiably intact. The data
> still has a long way to go before it reaches the application.
>
> --Toby

As it does in ANY fileserver scenario, INCLUDING zfs. He is building a
FILESERVER. This is not an APPLICATION server. You seem to be stuck on
this idea that everyone is using ZFS on the server they're running the
application. That does a GREAT job of creating disparate storage
islands, something EVERY enterprise is trying to get rid of. Not create
more of.

--Tim
Ahmed Kamal wrote:
>>> So, performance aside, does SAS have other benefits? Data
>>> integrity? How would 8 sata disks in raid1 compare vs another 8
>>> smaller SAS disks in raidz(2)?
>> Like apples and pomegranates. Both should be able to saturate a
>> GbE link.
>
> You're the expert, but isn't the 100 MB/s figure for streaming, not
> random read/write? For that, I suppose the disk drops to around
> 25 MB/s, which is why I was mentioning 4 sata disks.
>
> When I was asking to compare the two raids, it was aside from
> performance: basically sata is obviously cheaper, it will saturate the
> gig link, so performance is fine too, so the question becomes which has
> better data protection (8 sata raid1 or 8 sas raidz2)

Good question. Since you are talking about different disks, the vendor
specs are different. The 500 GByte Seagate Barracuda 7200.11 I described
above is rated with an MTBF of 750,000 hours, even though it comes in
either a SATA or SAS interface -- but that isn't so interesting. A
450 GByte Seagate Cheetah 15k.6 (SAS) has a rated MTBF of 1.6M hours.
Putting that into RAIDoptimizer we see:

    Disk        RAID    MTTDL[1](yrs)    MTTDL[2](yrs)
    ----------------------------------------------------
    Barracuda   1+0           284,966            5,351
                z2        180,663,117        6,784,904
    Cheetah     1+0         1,316,385          126,839
                z2      1,807,134,968      348,249,968

    For ZFS, 50% space used, logistical MTTR = 24 hours,
    mirror resync time = 60 GBytes/hr

In general, (2-way) mirrors are single parity, raidz2 is double parity.
If you use a triple mirror, then the numbers will be closer to the
raidz2 numbers.

For explanations of these models, see my blog,
http://blogs.sun.com/relling/entry/a_story_of_two_mttdl
 -- richard
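For anyone who wants to reproduce numbers of this sort without
RAIDoptimizer, here is a minimal sketch of the classic single- and
double-parity MTTDL models. It ignores unrecoverable read errors, so it
corresponds roughly to the MTTDL[1] column only; the 8-disk pool size is
an assumption taken from the configurations discussed in this thread,
and this is a reconstruction, not necessarily the exact model used by
the tool.

# Classic MTTDL estimates in years, ignoring unrecoverable read errors
# (roughly the MTTDL[1] column above). MTTR is the 24 h logistical delay
# plus resilvering 50% of the disk at 60 GBytes/hr, as stated in the table.
HOURS_PER_YEAR = 8766.0

def mttr_hours(capacity_gb, resync_gb_per_hr=60.0, logistical_hr=24.0):
    return logistical_hr + 0.5 * capacity_gb / resync_gb_per_hr

def mttdl_mirror(n_disks, mtbf_hr, mttr_hr):
    # striped 2-way mirrors: data loss needs a second failure on the
    # partner disk inside the repair window
    return mtbf_hr ** 2 / (n_disks * mttr_hr) / HOURS_PER_YEAR

def mttdl_raidz2(n_disks, mtbf_hr, mttr_hr):
    # double parity: three overlapping failures are required
    return (mtbf_hr ** 3 /
            (n_disks * (n_disks - 1) * (n_disks - 2) * mttr_hr ** 2) /
            HOURS_PER_YEAR)

for name, mtbf, gbytes in (("Barracuda", 750000, 500), ("Cheetah", 1600000, 450)):
    mttr = mttr_hours(gbytes)
    print("%-9s  1+0 %15.0f yrs   z2 %18.0f yrs"
          % (name, mttdl_mirror(8, mtbf, mttr), mttdl_raidz2(8, mtbf, mttr)))

With these inputs the sketch lands within a fraction of a percent of the
MTTDL[1] column above; the MTTDL[2] column additionally models
unrecoverable read errors during resilver, which is why it is so much
lower.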
On 30-Sep-08, at 9:54 PM, Tim wrote:
> On Tue, Sep 30, 2008 at 8:50 PM, Toby Thain <toby at telegraphics.com.au> wrote:
>
>>> NetApp's block-appended checksum approach appears similar but is in
>>> fact much stronger. [...] it validates that the block made the
>>> journey from platter to memory without damage in transit.
>>
>> This is not end to end protection; they are merely saying the data
>> arrived in the storage subsystem's memory verifiably intact. The
>> data still has a long way to go before it reaches the application.
>>
>> --Toby
>
> As it does in ANY fileserver scenario, INCLUDING zfs. He is building
> a FILESERVER. This is not an APPLICATION server. You seem to be
> stuck on this idea that everyone is using ZFS on the server they're
> running the application.

ZFS allows the architectural option of separate storage without losing
end to end protection, so the distinction is still important. Of course
this means ZFS itself runs on the application server, but so what?

--Toby

> That does a GREAT job of creating disparate storage islands,
> something EVERY enterprise is trying to get rid of. Not create
> more of.
>
> --Tim
On Tue, Sep 30, 2008 at 08:54:50PM -0500, Tim wrote:
> As it does in ANY fileserver scenario, INCLUDING zfs. He is building a
> FILESERVER. This is not an APPLICATION server. You seem to be stuck on
> this idea that everyone is using ZFS on the server they're running the
> application. That does a GREAT job of creating disparate storage islands,
> something EVERY enterprise is trying to get rid of. Not create more of.

First off there's an issue of design. Wherever possible end-to-end
protection is better (and easier to implement and deploy) than
hop-by-hop protection.

Hop-by-hop protection implies a lot of trust. Yes, in a NAS you're
going to have at least one hop: from the client to the server. But how
does the necessity of one hop mean that N hops is fine? One hop is
manageable. N hops is a disaster waiting to happen.

Second, NAS is not the only way to access remote storage. There's also
SAN (e.g., iSCSI). So you might host a DB on a ZFS pool backed by iSCSI
targets. If you do that with a random iSCSI target implementation then
you get end-to-end integrity protection regardless of what else the
vendor does for you in terms of hop-by-hop integrity protection. And
you can even host the target on a ZFS pool, in which case there's two
layers of integrity protection, and so some waste of disk space, but you
get the benefit of very flexible volume management on both the
initiator and the target.

Third, who's to say that end-to-end integrity protection can't possibly
be had in a NAS environment? Sure, with today's protocols you can't
have it -- you can get hop-by-hop protection with at least one hop (see
above) -- but having end-to-end integrity protection built in to the
filesystem may enable new NAS protocols that do provide end-to-end
protection. (This is a variant of the first point above: good design
decisions pay off.)

Nico
Tim wrote:
>
> As it does in ANY fileserver scenario, INCLUDING zfs. He is building
> a FILESERVER. This is not an APPLICATION server. You seem to be
> stuck on this idea that everyone is using ZFS on the server they're
> running the application. That does a GREAT job of creating disparate
> storage islands, something EVERY enterprise is trying to get rid of.
> Not create more of.

I think you'd be surprised how large an organisation can migrate most,
if not all, of their application servers to zones on one or two Thumpers.

Isn't that the reason for buying in "server appliances"?

Ian
On Tue, Sep 30, 2008 at 10:44 PM, Toby Thain <toby at telegraphics.com.au> wrote:
> ZFS allows the architectural option of separate storage without losing
> end to end protection, so the distinction is still important. Of course
> this means ZFS itself runs on the application server, but so what?
>
> --Toby

The "so what" would be that the application has to run on Solaris, and
requires a LUN to function.
On Wed, Oct 1, 2008 at 12:24 AM, Ian Collins <ian at ianshome.com> wrote:
> Tim wrote:
>>
>> As it does in ANY fileserver scenario, INCLUDING zfs. He is building
>> a FILESERVER. This is not an APPLICATION server. You seem to be
>> stuck on this idea that everyone is using ZFS on the server they're
>> running the application. That does a GREAT job of creating disparate
>> storage islands, something EVERY enterprise is trying to get rid of.
>> Not create more of.
>
> I think you'd be surprised how large an organisation can migrate most,
> if not all, of their application servers to zones on one or two Thumpers.
>
> Isn't that the reason for buying in "server appliances"?
>
> Ian

I think you'd be surprised how quickly they'd be fired for putting that
much risk into their enterprise.

--Tim
On Tue, Sep 30, 2008 at 11:58 PM, Nicolas Williams
<Nicolas.Williams at sun.com> wrote:
> On Tue, Sep 30, 2008 at 08:54:50PM -0500, Tim wrote:
>> As it does in ANY fileserver scenario, INCLUDING zfs. He is building a
>> FILESERVER. This is not an APPLICATION server. [...]
>
> First off there's an issue of design. Wherever possible end-to-end
> protection is better (and easier to implement and deploy) than
> hop-by-hop protection.
>
> Hop-by-hop protection implies a lot of trust. Yes, in a NAS you're
> going to have at least one hop: from the client to the server. But how
> does the necessity of one hop mean that N hops is fine? One hop is
> manageable. N hops is a disaster waiting to happen.

Who's talking about N hops? WAFL gives you the exact same amount of
hops as ZFS.

> Second, NAS is not the only way to access remote storage. There's also
> SAN (e.g., iSCSI). So you might host a DB on a ZFS pool backed by iSCSI
> targets. [...]

I don't recall saying it was. The original poster is talking about a
FILESERVER, not iSCSI targets. As off topic as it is, the current iSCSI
target is hardly fully baked or production ready.

> Third, who's to say that end-to-end integrity protection can't possibly
> be had in a NAS environment? [...]

Which would apply to WAFL as well as ZFS.
>On Tue, 30 Sep 2008, Robert Thurlow wrote:
>
>>> Modern NFS runs over a TCP connection, which includes its own data
>>> validation. This surely helps.
>>
>> Less than we'd sometimes like :-) The TCP checksum isn't
>> very strong, and we've seen corruption tied to a broken
>> router, where the Ethernet checksum was recomputed on
>> bad data, and the TCP checksum didn't help. It sucked.
>
>TCP does not see the router. The TCP and ethernet checksums are at
>completely different levels. Routers do not pass ethernet packets.
>They pass IP packets. Your statement does not make technical sense.

I think he was referring to a broken VLAN switch.

But even then, any active component will take bits from the wire, check
the MAC, change what is needed and redo the MAC and other checksums
which need changes. The whole packet lives in the memory of the
switch/router, and if that memory is broken the packet will be sent
damaged.

Casper
Casper.Dik at Sun.COM wrote:
>> On Tue, 30 Sep 2008, Robert Thurlow wrote:
>>
>>>> Modern NFS runs over a TCP connection, which includes its own data
>>>> validation. This surely helps.
>>> Less than we'd sometimes like :-) The TCP checksum isn't
>>> very strong, and we've seen corruption tied to a broken
>>> router, where the Ethernet checksum was recomputed on
>>> bad data, and the TCP checksum didn't help. It sucked.
>> TCP does not see the router. The TCP and ethernet checksums are at
>> completely different levels. Routers do not pass ethernet packets.
>> They pass IP packets. Your statement does not make technical sense.
>
> I think he was referring to a broken VLAN switch.
>
> But even then, any active component will take bits from the wire,
> check the MAC, change what is needed and redo the MAC and other
> checksums which need changes. The whole packet lives in the memory of
> the switch/router, and if that memory is broken the packet will be
> sent damaged.

Which is why you need a network end-to-end strong checksum for iSCSI. I
recommend that IPsec AH (at least, but in many cases ESP) be deployed.

If you care enough about your data to set checksum=sha256 for the ZFS
datasets, then make sure you care enough to set up IPsec and use
HMAC-SHA256 for on-the-wire integrity protection too.

--
Darren J Moffat
Tim <tim at tcsac.net> wrote:
>> Hmm ... well, there is a considerable price difference, so unless someone
>> says I'm horribly mistaken, I now want to go back to Barracuda ES 1TB 7200
>> drives. By the way, how many of those would saturate a single (non trunked)
>> Gig ethernet link? Workload NFS sharing of software and homes. I think 4
>> disks should be about enough to saturate it?
>
> SAS has far greater performance, and if your workload is extremely random,
> will have a longer MTBF. SATA drives suffer badly on random workloads.

The SATA Barracuda ST310003 I recently bought has an MTBF of 136 years.

If you believe you can meaningfully compare MTBF values in the range of
> 100 years, you are probably doing something wrong.

Jörg
--
EMail: joerg at schily.isdn.cs.tu-berlin.de (home)  Jörg Schilling  D-13353 Berlin
       js at cs.tu-berlin.de (uni)  schilling at fokus.fraunhofer.de (work)
Blog:  http://schily.blogspot.com/
URL:   http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily
On Wed, Oct 01, 2008 at 01:03:28AM +0200, Ahmed Kamal wrote:
> Hmm ... well, there is a considerable price difference, so unless someone
> says I'm horribly mistaken, I now want to go back to Barracuda ES 1TB 7200
> drives. By the way, how many of those would saturate a single (non trunked)
> Gig ethernet link? Workload NFS sharing of software and homes. I think 4
> disks should be about enough to saturate it?

You keep mentioning that you plan on using NFS, and everyone seems to keep
ignoring the fact that in order to make NFS performance reasonable you're
really going to want a couple of very fast slog devices. Since I don't have
the correct amount of money to afford a very fast slog device, I can't
speak to which one is the best price/performance ratio, but there are tons
of options out there.

-brian
--
"Coding in C is like sending a 3 year old to do groceries. You gotta
tell them exactly what you want or you'll end up with a cupboard full of
pop tarts and pancake mix." -- IRC User (http://www.bash.org/?841435)
David Magda <dmagda at ee.ryerson.ca> wrote:
> On Sep 30, 2008, at 19:09, Tim wrote:
>
>> SAS has far greater performance, and if your workload is extremely
>> random, will have a longer MTBF. SATA drives suffer badly on random
>> workloads.
>
> Well, if you can probably afford more SATA drives for the purchase
> price, you can put them in a striped-mirror set up, and that may help
> things. If your disks are cheap you can afford to buy more of them
> (space, heat, and power notwithstanding).

SATA and SAS disks are usually based on the same drive mechanism. The
seek times are most likely identical.

Some SATA disks support tagged command queueing and others do not.
I would assume that there is no speed difference between SATA with
command queueing and SAS.

Jörg
Toby Thain wrote:
> ZFS allows the architectural option of separate storage without losing
> end to end protection, so the distinction is still important. Of course
> this means ZFS itself runs on the application server, but so what?

The OP in question is not running his network clients on Solaris or
OpenSolaris or FreeBSD or MacOSX, but rather a collection of Linux
workstations. Unless there's been a recent port of ZFS to Linux, that
makes a big What.

Given the fact that NFS, as implemented in his client systems, provides
no end-to-end reliability, the only data protection that ZFS has any
control over is after the write() is issued by the NFS server process.

--Joe
Ian Collins wrote:
> I think you'd be surprised how large an organisation can migrate most,
> if not all, of their application servers to zones on one or two Thumpers.
>
> Isn't that the reason for buying in "server appliances"?

Assuming that the application servers can coexist in the "only" 16GB
available on a Thumper, and the "only" 8GHz of CPU core speed, and the
fact that the system controller is a massive single point of failure
for both the applications and the storage.

You may have a difference of opinion as to what a large organization
is, but the reality is that the Thumper series is good for some things
in a large enterprise, and not good for some things.

--Joe
On Wed, Oct 1, 2008 at 8:52 AM, Brian Hechinger <wonko at 4amlunch.net> wrote:
> On Wed, Oct 01, 2008 at 01:03:28AM +0200, Ahmed Kamal wrote:
>>
>> Hmm ... well, there is a considerable price difference, so unless someone
>> says I'm horribly mistaken, I now want to go back to Barracuda ES 1TB 7200
>> drives. By the way, how many of those would saturate a single (non trunked)
>> Gig ethernet link? Workload NFS sharing of software and homes. I think 4
>> disks should be about enough to saturate it?
>
> You keep mentioning that you plan on using NFS, and everyone seems to keep
> ignoring the fact that in order to make NFS performance reasonable you're
> really going to want a couple of very fast slog devices. Since I don't have
> the correct amount of money to afford a very fast slog device, I can't
> speak to which one is the best price/performance ratio, but there are tons
> of options out there.

+1 for the slog devices - make them 15k RPM SAS.

Also the OP has not stated how his Linux clients intend to use this
fileserver. In particular, we need to understand how many IOPS (I/O
ops/sec) are required and whether the typical workload is sequential
(large or small file) or random, and the percentage of read to write
operations. Often a mix of different ZFS configs is required to provide
a complete and flexible solution. Here is a rough generalization:

- for large file sequential I/O with high reliability, go raidz2 with
  6 disks minimum and use SATA disks.
- for workloads with random I/O patterns where you need lots of IOPS,
  use a ZFS multi-way mirror and 15k RPM SAS disks. For example, a
  3-way mirror will distribute the reads across 3 drives - so you'll
  see 3 * (single disk) IOPS for reads and 1 * IOPS for writes (see the
  sketch after this message). Consider 4 or more way mirrors for heavy
  (random) read workloads.

Usually it makes sense to configure more than one ZFS pool config and
then use the zpool that is appropriate for each specific workload. Also
this config (diversity) future-proofs your fileserver - because it's
very difficult to predict how your usage patterns will change a year
down the road [1].

Also, bear in mind that, in the future, you may wish to replace disks
with SSDs (or add SSDs) to this fileserver - when the pricing is more
reasonable. So only spend what you absolutely need to spend to meet
today's requirements. You can always "push in" newer/bigger/better/faster
*devices* down the road and this will provide you with a more flexible
fileserver as your needs evolve. This is a huge strength for ZFS.

Feel free to email me off list if you want more specific recommendations.

[1] on a 10 disk system we have:
a) a 5 disk RAIDZ pool
b) a 3-way mirror (pool)
c) a 2-way mirror (pool)

If I was to do it again, I'd make a) a 6-disk RAIDZ2 config to take
advantage of the higher reliability provided by this config.

Regards,
-- Al Hopper  Logical Approach Inc, Plano, TX
   al at logical-approach.com  Voice: 972.379.2133  Timezone: US CDT
OpenSolaris Governing Board (OGB) Member - Apr 2005 to Mar 2007
http://www.opensolaris.org/os/community/ogb/ogb_2005-2007/
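To make the read/write scaling rule of thumb above concrete, here is a
small sketch; the per-disk IOPS figures are assumptions for
illustration, not measured values.

# Rough small-random-I/O scaling for ZFS vdev types, using the usual
# rules of thumb: an n-way mirror can service reads from n disks but
# must write to all of them; a raidz/raidz2 vdev delivers roughly
# single-disk IOPS for small random I/O because every disk in the vdev
# participates in each stripe.
ASSUMED_DISK_IOPS = {"7200 rpm SATA": 80, "15k rpm SAS": 180}   # assumed

def mirror_vdev_iops(per_disk, ways):
    return {"read": ways * per_disk, "write": per_disk}

def raidz_vdev_iops(per_disk):
    return {"read": per_disk, "write": per_disk}

for disk, iops in ASSUMED_DISK_IOPS.items():
    print(disk, "3-way mirror:", mirror_vdev_iops(iops, 3),
          "| raidz2 vdev:", raidz_vdev_iops(iops))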
Moore, Joe wrote:
> Toby Thain wrote:
>> ZFS allows the architectural option of separate storage without losing
>> end to end protection, so the distinction is still important. Of course
>> this means ZFS itself runs on the application server, but so what?
>
> The OP in question is not running his network clients on Solaris or
> OpenSolaris or FreeBSD or MacOSX, but rather a collection of Linux
> workstations. Unless there's been a recent port of ZFS to Linux, that
> makes a big What.
>
> Given the fact that NFS, as implemented in his client systems, provides
> no end-to-end reliability, the only data protection that ZFS has any
> control over is after the write() is issued by the NFS server process.

NFS can provide on-the-wire protection if you enable Kerberos support.
There are usually 3 options for Kerberos: krb5 (sometimes called krb5a),
which is auth only; krb5i, which is auth plus integrity provided by the
RPCSEC_GSS layer; and krb5p, which is auth + integrity + encrypted data.

I have personally seen krb5i NFS mounts catch problems when there was a
router causing failures that the TCP checksums don't catch.

--
Darren J Moffat
On Wed, Oct 1, 2008 at 9:34 AM, Moore, Joe <joe.moore at siemens.com> wrote:
>
> Ian Collins wrote:
>> I think you'd be surprised how large an organisation can migrate most,
>> if not all, of their application servers to zones on one or two Thumpers.
>>
>> Isn't that the reason for buying in "server appliances"?
>
> Assuming that the application servers can coexist in the "only" 16GB
> available on a Thumper, and the "only" 8GHz of CPU core speed, and the
> fact that the system controller is a massive single point of failure
> for both the applications and the storage.
>
> You may have a difference of opinion as to what a large organization
> is, but the reality is that the Thumper series is good for some things
> in a large enterprise, and not good for some things.

Agreed. My biggest issue with the Thumper is that all the disks are
7,200 RPM SATA and have limited IOPS. I'd like to see the Thumper
configurations offered allowing a user-chosen mixture of SAS and SATA
drives with 7,200 and 15k RPM spindle speeds. And yes - I agree - you
need as much RAM in the box as you can afford; ZFS loves lots and lots
of RAM and your users will love the performance that large memory ZFS
boxes provide.

Didn't they just offer a Thumper with more RAM recently???

-- Al Hopper
Darren J Moffat wrote:
> Moore, Joe wrote:
>> Given the fact that NFS, as implemented in his client systems,
>> provides no end-to-end reliability, the only data protection that ZFS
>> has any control over is after the write() is issued by the NFS server
>> process.
>
> NFS can provide on-the-wire protection if you enable Kerberos support.
> There are usually 3 options for Kerberos: krb5 (sometimes called
> krb5a), which is auth only; krb5i, which is auth plus integrity
> provided by the RPCSEC_GSS layer; and krb5p, which is auth + integrity
> + encrypted data.
>
> I have personally seen krb5i NFS mounts catch problems when there was
> a router causing failures that the TCP checksums don't catch.

No doubt, additional layers of data protection are available. I don't
know the state of RPCSEC on Linux, so I can't comment on this; certainly
your experience brings valuable insight into this discussion.

It is also recommended (when iSCSI is an appropriate transport) to run
over IPSEC in ESP mode to also ensure data-packet-content consistency.
Certainly NFS over IPSEC/ESP would be more resistant to on-the-wire
corruption.

Either of these would give better data reliability than pure NFS, just
like ZFS on the backend gives better data reliability than, for example,
UFS or EXT3.

--Joe
On 10/01/08 10:46, Al Hopper wrote:
> Agreed. My biggest issue with the Thumper is that all the disks are
> 7,200 RPM SATA and have limited IOPS. I'd like to see the Thumper
> configurations offered allowing a user-chosen mixture of SAS and SATA
> drives with 7,200 and 15k RPM spindle speeds. And yes - I agree - you
> need as much RAM in the box as you can afford; ZFS loves lots and lots
> of RAM and your users will love the performance that large memory ZFS
> boxes provide.
>
> Didn't they just offer a Thumper with more RAM recently???

The X4540 has twice the DIMM slots and number of cores. It also uses an
LSI disk controller. Still 48 SATA disks @ 7200 rpm.

You can "build a thumper" using any rack mount server you like and the
J4200/J4400 JBOD arrays. Then you can mix and match drive types (SATA
and SAS). The server portion could have as many as 16/32 cores and
32/64 DIMM slots (the X4450/X4640). You'll use up a little more rack
space, but the drives will be serviceable without shutting down the
system.

I think Thumper/Thor fills a specific role (maximum disk density in a
minimum chassis). I doubt that it will change much.

-- Matt Sweeney, Systems Engineer, Sun Microsystems
On Wed, 1 Oct 2008, Tim wrote:
>>
>> I think you'd be surprised how large an organisation can migrate most,
>> if not all, of their application servers to zones on one or two Thumpers.
>>
>> Isn't that the reason for buying in "server appliances"?
>
> I think you'd be surprised how quickly they'd be fired for putting that
> much risk into their enterprise.

There is the old saying that "no one gets fired for buying IBM". If one
buys an IBM system which runs 30 isolated instances of Linux, all of
which are used for mission critical applications, is this a similar risk
to consolidating storage on a Thumper, since we are really talking about
just one big system?

In what way is consolidating on Sun/Thumper more or less risky to an
enterprise than consolidating on a big IBM server with many subordinate
OS instances?

Bob
On Wed, Oct 1, 2008 at 9:18 AM, Joerg Schilling
<Joerg.Schilling at fokus.fraunhofer.de> wrote:
> David Magda <dmagda at ee.ryerson.ca> wrote:
>> On Sep 30, 2008, at 19:09, Tim wrote:
>>
>>> SAS has far greater performance, and if your workload is extremely
>>> random, will have a longer MTBF. SATA drives suffer badly on random
>>> workloads.
>>
>> Well, if you can probably afford more SATA drives for the purchase
>> price, you can put them in a striped-mirror set up, and that may help
>> things. If your disks are cheap you can afford to buy more of them
>> (space, heat, and power notwithstanding).
>
> SATA and SAS disks are usually based on the same drive mechanism. The
> seek times are most likely identical.
>
> Some SATA disks support tagged command queueing and others do not.
> I would assume that there is no speed difference between SATA with
> command queueing and SAS.

Ummm, no. SATA and SAS seek times are not even in the same universe.
They most definitely do not use the same mechanics inside. Whoever told
you that rubbish is an outright liar.

--Tim
On Wed, Oct 1, 2008 at 10:28 AM, Bob Friesenhahn
<bfriesen at simple.dallas.tx.us> wrote:
> There is the old saying that "no one gets fired for buying IBM". If one
> buys an IBM system which runs 30 isolated instances of Linux, all of
> which are used for mission critical applications, is this a similar risk
> to consolidating storage on a Thumper, since we are really talking about
> just one big system?
>
> In what way is consolidating on Sun/Thumper more or less risky to an
> enterprise than consolidating on a big IBM server with many subordinate
> OS instances?

Are you honestly trying to compare a Thumper's reliability to an IBM
mainframe? Please tell me that's a joke... We can start at redundant,
hot-swappable components and go from there. The Thumper can't even hold
a candle to Sun's own older SPARC platforms. It's not even in the same
game as the IBM mainframes.

--Tim
>Ummm, no. SATA and SAS seek times are not even in the same universe.
>They most definitely do not use the same mechanics inside. Whoever told
>you that rubbish is an outright liar.

Which particular disks are you guys talking about?

I'm thinking you guys are talking about the same 3.5" form factor with
the same RPM, right? We're not comparing 10K/2.5" SAS drives against
7.2K/3.5" SATA devices, are we?

Casper
On Wed, Oct 1, 2008 at 11:20 AM, <Casper.Dik at sun.com> wrote:
>> Ummm, no. SATA and SAS seek times are not even in the same universe.
>> They most definitely do not use the same mechanics inside. Whoever told
>> you that rubbish is an outright liar.
>
> Which particular disks are you guys talking about?
>
> I'm thinking you guys are talking about the same 3.5" form factor with
> the same RPM, right? We're not comparing 10K/2.5" SAS drives against
> 7.2K/3.5" SATA devices, are we?

I'm talking about 10k and 15k SAS drives, which is what the OP was
talking about from the get-go. Apparently this is yet another case of
subsequent posters completely ignoring the topic and taking us off on
tangents that have nothing to do with the OP's problem.

--Tim
On Wed, October 1, 2008 10:18, Joerg Schilling wrote:
> SATA and SAS disks are usually based on the same drive mechanism. The
> seek times are most likely identical.
>
> Some SATA disks support tagged command queueing and others do not.
> I would assume that there is no speed difference between SATA with
> command queueing and SAS.

I guess the meaning in my e-mail wasn't clear: because SAS drives are
generally more expensive on a per-unit basis, for a given budget you can
buy fewer of them than SATA drives.

To get the same usable capacity with SAS drives as with SATA drives,
you'd probably have to put the SAS drives in a RAID-5/6/Z configuration
to be more space efficient. However, by doing this you'd be losing
spindles, and therefore IOPS. With SATA drives, since you can buy more
for the same budget, you could put them in a RAID-10 configuration.
While the individual disk may be slower, you'd have more spindles in the
zpool, so that should help with the IOPS.
On Wed, 1 Oct 2008, Joerg Schilling wrote:
>
> SATA and SAS disks are usually based on the same drive mechanism. The
> seek times are most likely identical.

This must be some sort of "urban legend". While the media composition
and drive chassis are similar, the rest of the product clearly differs.
The seek times for typical SAS drives are clearly much better, and the
typical drive rotates much faster.

Bob
Thanks for all the opinions everyone, my current impression is:

- I do need as much RAM as I can afford (16GB looks good enough for me).
- SAS disks offer better IOPS and better MTBF than SATA. But sata offers
  enough performance for me (to saturate a gig link), and its MTBF is
  around 100 years, which is I guess good enough for me too. If I wrap 5
  or 6 SATA disks in a raidz2, that should give me "enough" protection
  and performance. It seems I will go with sata then for now. I hope for
  all practical purposes the raidz2 array of say 6 sata drives is "very
  well protected" for say the next 10 years! (If not, please tell me.)
- This will mainly be used for NFS sharing. Everyone is saying it will
  have "bad" performance. My question is, how "bad" is bad? Is it worse
  than a plain Linux server sharing NFS over 4 sata disks, using a
  crappy 3ware raid card with caching disabled? Coz that's what I
  currently have. Is it, say, worse than a Linux box sharing over soft
  raid?
- If I will be using 6 sata disks in raidz2, I understand that to
  improve performance I can add a 15k SAS drive as a ZIL device, is this
  correct? Is the ZIL device per pool? Do I lose any flexibility by
  using it? Does it become a SPOF, say? Typically what percentage
  improvement should I expect to get from such a ZIL device?

Thanks
On Tue, Sep 30, 2008 at 09:54:04PM -0400, Miles Nordin wrote:
> ok, I get that S3 went down due to corruption, and that the network
> checksums I mentioned failed to prevent the corruption. The missing
> piece is: belief that the corruption occurred on the network rather
> than somewhere else.
>
> Their post-mortem sounds to me as though a bit flipped inside the
> memory of one server could be spread via this ``gossip'' protocol to
> infect the entire cluster. The replication and spreadability of the
> data makes their cluster into a many-terabyte gamma ray detector.

A bit flipped inside an end of an end-to-end system will not be detected
by that system. So the CPU, memory and memory bus of an end have to be
trusted and so require their own corruption detection mechanisms (e.g.,
ECC memory). In the S3 case it sounds like there's a lot of networking
involved, and that they weren't providing integrity protection for the
gossip protocol.

Given a two-bit-flip-that-passed-all-Ethernet-and-TCP-CRCs event that we
had within Sun a few years ago (much alluded to elsewhere in this
thread), and which happened in one faulty switch, I would suspect the
switch. Also, years ago when 100Mbps Ethernet first came on the market
I saw lots of bad cat-5 wiring issues, where a wire would go bad and
start introducing errors just a few months into its useful life.

I don't trust the networking equipment -- I prefer end-to-end
protection. Just because you have to trust that the ends behave
correctly doesn't mean that you should have to trust everything in the
middle too.

Nico
>>>>> "t" == Tim <tim at tcsac.net> writes:t> So what would be that the application has to run on Solaris. t> And requires a LUN to function. ITYM requires two LUN''s, or else when your filesystem becomes corrupt after a crash the sysadmin will get blamed for it. Maybe you can deduplicate the ZFS mirror LUNs on the storage back-end or something. -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 304 bytes Desc: not available URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20081001/21695811/attachment.bin>
On Wed, Oct 1, 2008 at 11:53 AM, Ahmed Kamal
<email.ahmedkamal at googlemail.com> wrote:
> Thanks for all the opinions everyone, my current impression is:
> - I do need as much RAM as I can afford (16GB looks good enough for me).

Depends on both the workload and the amount of storage behind it. From
your descriptions, though, I think you'll be ok.

> - SAS disks offer better IOPS and better MTBF than SATA. But sata offers
>   enough performance for me (to saturate a gig link), and its MTBF is
>   around 100 years, which is I guess good enough for me too. If I wrap 5
>   or 6 SATA disks in a raidz2, that should give me "enough" protection
>   and performance. It seems I will go with sata then for now. I hope for
>   all practical purposes the raidz2 array of say 6 sata drives is "very
>   well protected" for say the next 10 years! (If not, please tell me.)

***If you have a sequential workload. It's not a blanket "SATA is fast
enough".

> - This will mainly be used for NFS sharing. Everyone is saying it will
>   have "bad" performance. My question is, how "bad" is bad? Is it worse
>   than a plain Linux server sharing NFS over 4 sata disks, using a
>   crappy 3ware raid card with caching disabled? Coz that's what I
>   currently have. Is it, say, worse than a Linux box sharing over soft
>   raid?

Whoever is saying that is being dishonest. NFS is plenty fast for most
workloads. There are very, VERY few workloads in the enterprise that
are I/O bound; they are almost all IOPS bound.

> - If I will be using 6 sata disks in raidz2, I understand that to
>   improve performance I can add a 15k SAS drive as a ZIL device, is this
>   correct? Is the ZIL device per pool? Do I lose any flexibility by
>   using it? Does it become a SPOF, say? Typically what percentage
>   improvement should I expect to get from such a ZIL device?

ZILs come with their own fun. Isn't there still the issue of losing the
entire pool if you lose the ZIL? And you can't get it back without
extensive, ugly work?
On Wed, 1 Oct 2008, dmagda at ee.ryerson.ca wrote:
>
> To get the same usable capacity with SAS drives as with SATA drives,
> you'd probably have to put the SAS drives in a RAID-5/6/Z configuration
> to be more space efficient. However, by doing this you'd be losing
> spindles, and therefore IOPS. With SATA drives, since you can buy more
> for the same budget, you could put them in a RAID-10 configuration.
> While the individual disk may be slower, you'd have more spindles in
> the zpool, so that should help with the IOPS.

I will agree with that, except to point out that there are many
applications which require performance but not a huge amount of storage.
For many critical applications, even 10s of gigabytes is a lot of
storage. Based on this, I would say that most applications where SAS is
desirable are the ones which desire the most reliability and
performance, whereas the applications where SATA is desirable are the
ones which place a priority on bulk storage capacity.

If you are concerned about total storage capacity and you are also
specifying SAS for performance/reliability for critical data, then it is
likely that there is something wrong with your plan for storage and how
the data is distributed. There is a reason why when you go to the store
you see tack hammers, construction hammers, and sledge hammers.

Bob
On Wed, Oct 01, 2008 at 12:22:56PM -0500, Tim wrote:
>> - This will mainly be used for NFS sharing. Everyone is saying it will
>>   have "bad" performance. My question is, how "bad" is bad? Is it worse
>>   than a plain Linux server sharing NFS over 4 sata disks, using a
>>   crappy 3ware raid card with caching disabled? Coz that's what I
>>   currently have. Is it, say, worse than a Linux box sharing over soft
>>   raid?
>
> Whoever is saying that is being dishonest. NFS is plenty fast for most
> workloads. There are very, VERY few workloads in the enterprise that
> are I/O bound; they are almost all IOPS bound.

NFS is bad for workloads that involve lots of operations that NFS
requires to be synchronous and which the application doesn't
parallelize. Things like open(2) and close(2), for example, which means
applications like tar(1). The solution is to get a fast slog device.
(Or to use an NFS server that violates the synchrony requirement.)

Nico
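A back-of-envelope illustration of why this bites tar-style workloads,
and why a fast slog helps; the file count, the ops-per-file figure and
both latency numbers below are assumptions, not measurements.

# Rough model: extracting many small files over NFS is bound by the
# latency of the synchronous operations (create/write/close), not by
# bandwidth, because the client waits for each one to be committed.
FILES = 10000
SYNC_OPS_PER_FILE = 3      # assumed: roughly create, write commit, close

def extract_seconds(sync_op_latency_ms):
    return FILES * SYNC_OPS_PER_FILE * sync_op_latency_ms / 1000.0

for label, latency_ms in (("sync ops land on spinning disk (~10 ms)", 10.0),
                          ("sync ops land on a fast slog (~0.5 ms)", 0.5)):
    print("%-42s ~%4.0f s" % (label, extract_seconds(latency_ms)))

Under these assumptions the same extract drops from minutes to seconds,
which is the whole argument for a fast slog in front of slow disks.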
Tim <tim at tcsac.net> wrote:
> Ummm, no. SATA and SAS seek times are not even in the same universe.
> They most definitely do not use the same mechanics inside. Whoever told
> you that rubbish is an outright liar.

It is extremely unlikely that two drives from the same manufacturer and
with the same RPM differ in seek times if you compare a SAS variant with
a SATA variant.

Jörg
>>>>> "cd" == Casper Dik <Casper.Dik at Sun.COM> writes:cd> The whole packet lives in the memory of the switch/router and cd> if that memory is broken the packet will be send damaged. that''s true, but by algorithmically modifying the checksum to match your ttl decrementing and MAC address label-swapping rather than recomputing it from scratch, it''s possible for an L2 or even L3 switch to avoid ``splitting the protection domain''''. It''ll still send the damaged packet, but with a wrong FCS, so it''ll just get dropped by the next input port and eventually retransmitted. This is what 802.1d suggests. I suspect one reason the IP/UDP/TCP checksums were specified as simple checksums rather than CRC''s like the Ethernet L2 FCS, is that it''s really easy and obvious how to algorithmically modify them. sounds like they are not good enough though, because unless this broken router that Robert and Darren saw was doing NAT, yeah, it should not have touch the TCP/UDP checksum. BTW which router was it, or you can''t say because you''re in the US? :) I would expect any cost-conscious router or switch manufacturer to use the same Ethernet MAC ASIC''s as desktops, so the checksums would likely be computed right before transmission using the ``offload'''' feature of the Ethernet chip, but of course we can''t tell because they''re all proprietary. Eventually I bet it will become commonplace for Ethernet MAC''s to do IPsec offload, so we''ll have to remember the ``avoid splitting the protection domain'''' idea when that starts happening. -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 304 bytes Desc: not available URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20081001/05e69fd7/attachment.bin>
On Wed, 1 Oct 2008, Joerg Schilling wrote:
>> Ummm, no. SATA and SAS seek times are not even in the same universe.
>> They most definitely do not use the same mechanics inside. Whoever told
>> you that rubbish is an outright liar.
>
> It is extremely unlikely that two drives from the same manufacturer and
> with the same RPM differ in seek times if you compare a SAS variant
> with a SATA variant.

I did find a manufacturer (Seagate) which does offer a SAS variant of
what is normally a SATA drive. Is this the specific product you are
talking about?

The interface itself is perhaps not all that important, but drive
vendors have traditionally decided that "SCSI" based products are based
on high performance hardware with a focus on reliability, while "ATA"
based products are based on low or medium performance hardware with a
focus on cost. There is very little overlap between these distinct
product lines. It is rare to find similarity between the specification
sheets. It is quite rare to find similar rotation rates or seek times.

Bob
Ahmed Kamal wrote:
> Thanks for all the opinions everyone, my current impression is:
>
> - I do need as much RAM as I can afford (16GB looks good enough for me).
> - SAS disks offer better IOPS and better MTBF than SATA. But sata
>   offers enough performance for me (to saturate a gig link), and its
>   MTBF is around 100 years, which is I guess good enough for me too.
>   If I wrap 5 or 6 SATA disks in a raidz2, that should give me "enough"
>   protection and performance. It seems I will go with sata then for
>   now. I hope for all practical purposes the raidz2 array of say 6 sata
>   drives is "very well protected" for say the next 10 years! (If not,
>   please tell me.)

OK, so what the specs don't tell you is how MTBF changes over time. It
is very common to see an MTBF quoted, but you will almost never see it
described as a function of age. Rather, you will see something in the
specs about expected service lifetime, and how the environment can
decrease the service lifetime (read: decrease the MTBF over time more
rapidly). I've not seen a consumer grade disk spec with 10 years of
expected service life -- some are 5 years. In other words, as time goes
by, you should plan to replace them. For a more lengthy discussion of
this, and why we measure field reliability in other ways, see:
http://blogs.sun.com/relling/entry/using_mtbf_and_time_dependent

> - This will mainly be used for NFS sharing. Everyone is saying it will
>   have "bad" performance. My question is, how "bad" is bad? Is it worse
>   than a plain Linux server sharing NFS over 4 sata disks, using a
>   crappy 3ware raid card with caching disabled? Coz that's what I
>   currently have. Is it, say, worse than a Linux box sharing over soft
>   raid?
> - If I will be using 6 sata disks in raidz2, I understand that to
>   improve performance I can add a 15k SAS drive as a ZIL device, is
>   this correct? Is the ZIL device per pool? Do I lose any flexibility
>   by using it? Does it become a SPOF, say? Typically what percentage
>   improvement should I expect to get from such a ZIL device?

See the best practices guide:
http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide
 -- richard
Bob Friesenhahn <bfriesen at simple.dallas.tx.us> wrote:

> On Wed, 1 Oct 2008, Joerg Schilling wrote:
> >
> > SATA and SAS disks usually base on the same drive mechanism. The seek
> > times are most likely identical.
>
> This must be some sort of "urban legend". While the media composition
> and drive chassis is similar, the rest of the product clearly differs.
> The seek times for typical SAS drives are clearly much better, and the
> typical drive rotates much faster.

Did you recently look at spec sheets from drive manufacturers?

If you look at drives in the same category, the difference between a SATA and a SAS disk is only the firmware and the way the drive mechanism has been selected. Another difference is that SAS drives may have two SAS interfaces instead of the single SATA interface found in the SATA drives.

IOPS depends on seek times, latency times, and probably on disk cache size. If you have a drive with 1 ms seek time, the seek time is not really important. What's important is the latency time, which is 4 ms for a 7200 rpm drive and only 2 ms for a 15000 rpm drive.

People who talk about SAS usually forget that they are comparing 15000 rpm SAS drives with 7200 rpm SATA drives. There are faster SATA drives, but these drives consume more power.

Jörg

--
EMail: joerg at schily.isdn.cs.tu-berlin.de (home) Jörg Schilling D-13353 Berlin
       js at cs.tu-berlin.de (uni)
       schilling at fokus.fraunhofer.de (work) Blog: http://schily.blogspot.com/
URL:   http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily
Miles Nordin wrote:

> sounds like they are not good enough though, because unless this broken
> router that Robert and Darren saw was doing NAT, yeah, it should not
> have touched the TCP/UDP checksum.

I believe we proved that the problem bit flips were such that the TCP checksum was the same, so the original checksum still appeared correct.

> BTW which router was it, or you can't say because you're in the US? :)

I can't remember; it was aging at the time.

Rob T
On Wed, Oct 1, 2008 at 12:51 PM, Joerg Schilling <Joerg.Schilling at fokus.fraunhofer.de> wrote:

> Did you recently look at spec sheets from drive manufacturers?
>
> If you look at drives in the same category, the difference between a SATA
> and a SAS disk is only the firmware and the way the drive mechanism has
> been selected. Another difference is that SAS drives may have two SAS
> interfaces instead of the single SATA interface found in the SATA drives.
>
> IOPS depends on seek times, latency times, and probably on disk cache size.
>
> If you have a drive with 1 ms seek time, the seek time is not really
> important. What's important is the latency time, which is 4 ms for a
> 7200 rpm drive and only 2 ms for a 15000 rpm drive.
>
> People who talk about SAS usually forget that they are comparing 15000 rpm
> SAS drives with 7200 rpm SATA drives. There are faster SATA drives, but
> these drives consume more power.

That's because the faster SATA drives cost just as much money as their SAS counterparts for less performance and none of the advantages SAS brings, such as dual ports. Not to mention that none of them can be dual-sourced, which makes them a non-starter in the enterprise.

--Tim
On Wed, Oct 01, 2008 at 11:54:55AM -0600, Robert Thurlow wrote:
> Miles Nordin wrote:
>
> > sounds like they are not good enough though, because unless this broken
> > router that Robert and Darren saw was doing NAT, yeah, it should not
> > have touched the TCP/UDP checksum.
>
> I believe we proved that the problem bit flips were such
> that the TCP checksum was the same, so the original checksum
> still appeared correct.

The bit flips came in pairs, IIRC. I forget the details, but it's probably buried somewhere in my (and many others') e-mail.

> > BTW which router was it, or you can't say because you're in the US? :)
>
> I can't remember; it was aging at the time.

I can't remember either -- it was a few years ago.
On Wed, 1 Oct 2008, Joerg Schilling wrote:
>
> Did you recently look at spec sheets from drive manufacturers?

Yes.

> If you look at drives in the same category, the difference between a
> SATA and a

The problem is that these drives (SAS / SATA) are generally not in the same category, so your comparison does not make sense. There is very little overlap between the "exotic sports car" class and the "family minivan" class. In some very few cases we see transition vehicles such as station wagons in a sporty form factor. Most drive vendors try to make sure that the drives are in truly distinct classes in order to preserve the profit margins on the more expensive drives. In some cases we see SAS interfaces fitted to drives which are fundamentally SATA-class drives, but such products are rare.

Bob
======================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
On Wed, 2008-10-01 at 11:54 -0600, Robert Thurlow wrote:
> > like they are not good enough though, because unless this broken
> > router that Robert and Darren saw was doing NAT, yeah, it should not
> > have touched the TCP/UDP checksum.

NAT was not involved.

> I believe we proved that the problem bit flips were such
> that the TCP checksum was the same, so the original checksum
> still appeared correct.

That's correct. The pattern we found in corrupted data was that there would be two offsetting bit flips: a 0->1 was followed 256 or 512 or 1024 bytes later by a 1->0, or vice versa. (It was always the same bit; in the cases I analyzed, the corrupted files contained C source code and the bit flips were obvious.) Under the 16-bit one's-complement checksum used by TCP, these two changes cancel each other out and the resulting packet has the same checksum.

> > BTW which router was it, or you can't say because you're in the US? :)
>
> I can't remember; it was aging at the time.

To use excruciatingly precise terminology, I believe the switch in question is marketed as a combo L2 bridge/L3 router, but in this case it may have been acting as a bridge rather than a router.

After we noticed the data corruption we looked at TCP counters on hosts on that subnet and noticed a high rate of failed checksums, so clearly the TCP checksum was catching *most* of the corrupted packets; we just didn't look at the counters until after we saw data corruption.

- Bill
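To make the cancellation concrete, a small sketch; the payload, byte offsets, and bit position are invented for the example, and only the arithmetic matters: flipping the same bit 0->1 in one place and 1->0 an even number of bytes (here 512) later leaves the 16-bit one's-complement sum used by TCP unchanged.

def internet_checksum(data):
    data = bytes(data)
    if len(data) % 2:
        data += b"\x00"
    s = 0
    for i in range(0, len(data), 2):
        s += (data[i] << 8) | data[i + 1]
        s = (s & 0xFFFF) + (s >> 16)      # end-around carry
    return ~s & 0xFFFF

payload = bytearray(b"int main(void) { return 0; }\n" * 40)   # stand-in for C source
good = internet_checksum(payload)

corrupt = bytearray(payload)
corrupt[116] |= 0x10          # 'i' (0x69) -> 'y' (0x79): bit 4 flips 0 -> 1
corrupt[628] &= 0xEF          # 't' (0x74) -> 'd' (0x64): bit 4 flips 1 -> 0, 512 bytes later

print(corrupt == payload)                          # False: the data changed
print(internet_checksum(corrupt) == good)          # True: TCP's checksum cannot tell

The two flips change two different 16-bit words by +0x1000 and -0x1000, so the one's-complement sum, and therefore the checksum, does not change.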
Tim <tim <at> tcsac.net> writes:
>
> That's because the faster SATA drives cost just as much money as
> their SAS counterparts for less performance and none of the
> advantages SAS brings such as dual ports.

SAS drives are far from always being the best choice, because absolute IOPS or throughput numbers do not matter. What matters in the end is (TB, throughput, or IOPS) per (dollar, Watt, or rack unit).

7200rpm (SATA) drives clearly provide the best TB/$, throughput/$, and IOPS/$. You can't argue against that. To paraphrase what was said earlier in this thread, to get the best IOPS out of $1000, spend your money on 10 7200rpm (SATA) drives instead of 3 or 4 15000rpm (SAS) drives. Similarly, for the best IOPS/RU, 15000rpm drives have the advantage. Etc.

-marc
Marc Bevand wrote:
> Tim <tim <at> tcsac.net> writes:
>
>> That's because the faster SATA drives cost just as much money as
>> their SAS counterparts for less performance and none of the
>> advantages SAS brings such as dual ports.
>
> SAS drives are far from always being the best choice, because absolute IOPS
> or throughput numbers do not matter. What matters in the end is (TB,
> throughput, or IOPS) per (dollar, Watt, or rack unit).
>
> 7200rpm (SATA) drives clearly provide the best TB/$, throughput/$, and
> IOPS/$. You can't argue against that. To paraphrase what was said earlier
> in this thread, to get the best IOPS out of $1000, spend your money on 10
> 7200rpm (SATA) drives instead of 3 or 4 15000rpm (SAS) drives. Similarly,
> for the best IOPS/RU, 15000rpm drives have the advantage. Etc.
>
> -marc

Be very careful about that. 73GB SAS drives aren't that expensive, so you can get 6 x 73GB 15k SAS drives for the same amount as 11 x 250GB SATA drives (per Sun list pricing for J4200 drives). SATA doesn't always win the IOPS/$. Remember, a SAS drive can provide more than 2x the number of IOPS a SATA drive can. Likewise, throughput on a 15k drive can be roughly 2x that of a 7.2k drive, depending on I/O load.

--
Erik Trimble
Java System Support
Mailstop: usca22-123
Phone: x17195
Santa Clara, CA
Timezone: US/Pacific (GMT-0800)
Erik Trimble <Erik.Trimble <at> Sun.COM> writes:
> Marc Bevand wrote:
> > 7200rpm (SATA) drives clearly provide the best TB/$, throughput/$, IOPS/$.
> > You can't argue against that. To paraphrase what was said earlier in this
> > thread, to get the best IOPS out of $1000, spend your money on 10 7200rpm
> > (SATA) drives instead of 3 or 4 15000rpm (SAS) drives. Similarly, for the
> > best IOPS/RU, 15000rpm drives have the advantage. Etc.
>
> Be very careful about that. 73GB SAS drives aren't that expensive, so
> you can get 6 x 73GB 15k SAS drives for the same amount as 11 x 250GB
> SATA drives (per Sun list pricing for J4200 drives). SATA doesn't
> always win the IOPS/$. Remember, a SAS drive can provide more than 2x
> the number of IOPS a SATA drive can.

Well, let's look at a concrete example:
- cheapest 15k SAS drive (73GB): $180 [1]
- cheapest 7.2k SATA drive (160GB): $40 [2] (not counting an 80GB at $37)
The SAS drive most likely offers 2x-3x the IOPS/$. Certainly not 180/40=4.5x.

[1] http://www.newegg.com/Product/Product.aspx?Item=N82E16822116057
[2] http://www.newegg.com/Product/Product.aspx?Item=N82E16822136075

-marc
Marc Bevand <m.bevand at gmail.com> wrote:

> > Be very careful about that. 73GB SAS drives aren't that expensive, so
> > you can get 6 x 73GB 15k SAS drives for the same amount as 11 x 250GB
> > SATA drives (per Sun list pricing for J4200 drives). SATA doesn't
> > always win the IOPS/$. Remember, a SAS drive can provide more than 2x
> > the number of IOPS a SATA drive can.
>
> Well, let's look at a concrete example:
> - cheapest 15k SAS drive (73GB): $180 [1]
> - cheapest 7.2k SATA drive (160GB): $40 [2] (not counting an 80GB at $37)
> The SAS drive most likely offers 2x-3x the IOPS/$. Certainly not 180/40=4.5x.

I am not going to accept blanket numbers... If you claim that a 15k drive offers more than 2x the IOPS of a 7.2k drive you would need to show your computation. SAS and SATA use the same cable, and in case you buy server-grade SATA disks, you also get tagged command queuing.

Let's use a concrete example:

For 10 TB you need ~ 12 1TB SATA drives (server grade):
  -> ~ 2160 Euro
  -> 96 Watt standby
  -> ~ 1 GByte/s sustained read
  -> Max IOPS: 1200

For 10 TB you need ~ 165 73GB SAS drives:
  -> ~ 22000 Euro
  -> 2475 Watt standby
  -> ~ 4 GByte/s sustained read, if you are lucky and have a good controller
  -> Max IOPS: 34000 (not realistic)

More important in this case: the MTBF of the SATA-based system is much higher than the MTBF of the SAS-based system. In theory, you really get 2x-3x the IOPS/$, but do you like to pay that many $, and do you like to pay for the space and the energy?

Jörg

--
EMail: joerg at schily.isdn.cs.tu-berlin.de (home) Jörg Schilling D-13353 Berlin
       js at cs.tu-berlin.de (uni)
       schilling at fokus.fraunhofer.de (work) Blog: http://schily.blogspot.com/
URL:   http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily
Marc Bevand <m.bevand <at> gmail.com> writes:
>
> Well, let's look at a concrete example:
> - cheapest 15k SAS drive (73GB): $180 [1]
> - cheapest 7.2k SATA drive (160GB): $40 [2] (not counting an 80GB at $37)
> The SAS drive most likely offers 2x-3x the IOPS/$. Certainly not 180/40=4.5x.

Doh! I said the opposite of what I meant. Let me rephrase: "The SAS drive offers at most 2x-3x the IOPS (optimistic), but at 180/40=4.5x the price. Therefore the SATA drive has better IOPS/$."

(Joerg: I am on your side of the debate!)

-marc
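Spelled out with the prices quoted above and assumed per-drive IOPS figures (rough rule-of-thumb numbers for small random reads, not measurements), the corrected arithmetic looks like this:

drives = {
    "15k rpm SAS, 73 GB":     (180, 175),   # (price in $, assumed random IOPS)
    "7.2k rpm SATA, 160 GB":  ( 40,  75),
}

for name, (price, iops) in drives.items():
    print(f"{name:22s}  ~{iops:3d} IOPS  ${price:3d}  ->  {iops / price:.2f} IOPS per $")

# With these assumptions the SAS drive delivers roughly 2.3x the IOPS at 4.5x
# the price, so the SATA drive wins on IOPS/$ -- which is the point above.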
On Thu, 2 Oct 2008, Joerg Schilling wrote:
>
> I am not going to accept blanket numbers... If you claim that a 15k drive
> offers more than 2x the IOPS of a 7.2k drive you would need to show your
> computation. SAS and SATA use the same cable, and in case you buy
> server-grade SATA disks, you also get tagged command queuing.

I sense some confusion here on your part related to basic physics. Even with identical seek times, if the drive spins 2X as fast then it is able to return the data in 1/2 the time already. But the best SAS drives have seek times 2-3X faster than SATA drives. In order to return data, the drive has to seek the head, and then wait for the data to start to arrive, which is entirely dependent on rotation rate. These are reasons why enterprise SAS drives offer far more IOPS than SATA drives.

> For 10 TB you need ~ 165 73GB SAS drives:
>   -> ~ 22000 Euro
>   -> 2475 Watt standby
>   -> ~ 4 GByte/s sustained read, if you are lucky and have a good controller
>   -> Max IOPS: 34000 (not realistic)

I am not sure where you get the idea that 34000 is not realistic. It is perfectly in line with the performance I have measured with my own 12-drive array.

As I mentioned yesterday, proper design isolates storage which requires maximum IOPS from storage which requires maximum storage capacity or lowest cost. Therefore, it is entirely pointless to discuss how many 73GB SAS drives are required to achieve 10TB of storage, since this thinking almost always represents poor design.

Bob
======================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
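A rough version of the seek-plus-rotation arithmetic being described here (a sketch only; the average seek times below are typical published figures, assumed for illustration): random-read IOPS for a single drive is roughly 1 / (average seek time + average rotational latency).

def random_read_iops(avg_seek_ms, rpm):
    rotational_latency_ms = 0.5 * 60_000.0 / rpm   # half a revolution on average
    return 1000.0 / (avg_seek_ms + rotational_latency_ms)

for label, seek_ms, rpm in [
    ("7,200 rpm SATA, ~8.5 ms seek", 8.5, 7200),
    ("15,000 rpm SAS, ~3.5 ms seek", 3.5, 15000),
]:
    print(f"{label}: ~{random_read_iops(seek_ms, rpm):.0f} IOPS")

# Roughly 80 vs. 180 IOPS with these inputs, i.e. a bit over 2x per drive.
# Command queueing, caching, and short-stroking move both numbers around,
# which is part of why the thread keeps disagreeing.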
On Thu, 2 Oct 2008, Marc Bevand wrote:
>
> Doh! I said the opposite of what I meant. Let me rephrase: "The SAS drive
> offers at most 2x-3x the IOPS (optimistic), but at 180/40=4.5x the price.
> Therefore the SATA drive has better IOPS/$."

I doubt that anyone will successfully argue that SAS drives offer the best IOPS/$ value as long as space, power, and reliability factors may be ignored. However, these sorts of enterprise devices exist in order to allow enterprises to meet critical business demands which otherwise could not be met. There are situations where space, power, and cooling are limited, and the cost of the equipment is less of an issue since the space, power, and cooling cost more than the equipment.

Bob
======================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
Thanks for the info. I am not really after big performance; I am already on SATA and it's good enough for me. What I really, really can't afford is data loss. The CAD designs our engineers are working on can sometimes be worth a lot. But we're still a small company and would rather save money and buy SATA drives if that is "safe".

I now understand MTBF is next to useless (at least directly), and the RAID optimizer tables don't take into account how failure rates go up over the years, so they're not really accurate. My question now is: if I use high-quality Barracuda nearline 1TB SATA 7200 rpm disks, configured as 8 disks in a raidz2, what is the "real/practical" possibility that I will face data loss during the next 5 years, for example? As storage experts, please help me interpret whatever numbers you're going to throw at me: is it a "really, really small chance", or would you be worried about it?

Thanks

On Thu, Oct 2, 2008 at 12:24 PM, Marc Bevand <m.bevand at gmail.com> wrote:
> Marc Bevand <m.bevand <at> gmail.com> writes:
> >
> > Well, let's look at a concrete example:
> > - cheapest 15k SAS drive (73GB): $180 [1]
> > - cheapest 7.2k SATA drive (160GB): $40 [2] (not counting an 80GB at $37)
> > The SAS drive most likely offers 2x-3x the IOPS/$. Certainly not 180/40=4.5x.
>
> Doh! I said the opposite of what I meant. Let me rephrase: "The SAS drive
> offers at most 2x-3x the IOPS (optimistic), but at 180/40=4.5x the price.
> Therefore the SATA drive has better IOPS/$."
>
> (Joerg: I am on your side of the debate!)
>
> -marc
Bob Friesenhahn <bfriesen at simple.dallas.tx.us> wrote:

> On Thu, 2 Oct 2008, Joerg Schilling wrote:
> >
> > I am not going to accept blanket numbers... If you claim that a 15k drive
> > offers more than 2x the IOPS of a 7.2k drive you would need to show your
> > computation. SAS and SATA use the same cable, and in case you buy
> > server-grade SATA disks, you also get tagged command queuing.
>
> I sense some confusion here on your part related to basic physics.
> Even with identical seek times, if the drive spins 2X as fast then it
> is able to return the data in 1/2 the time already. But the best SAS
> drives have seek times 2-3X faster than SATA drives. In order to
> return data, the drive has to seek the head, and then wait for the
> data to start to arrive, which is entirely dependent on rotation rate.
> These are reasons why enterprise SAS drives offer far more IOPS than
> SATA drives.

You seem to misunderstand drive physics. With modern drives, seek times are not a dominating factor. It is the latency time that is rather important, and this is indeed proportional to 1/rotational-speed.

On the other hand, you misunderstand another important fact of drive physics: the sustained transfer speed of a drive is proportional to the linear data density on the medium.

The third mistake you make is to confuse the effects of the drive interface type with the effects of different drive geometry. The only coincidence here is that the drive geometry is typically updated more frequently for SATA drives than it is for SAS drives. This way, you benefit from the higher data density of a recent SATA drive and get a higher sustained data rate.

BTW: I am not saying it makes no sense to buy SAS drives, but it makes sense to look at _all_ important parameters. Power consumption is a really important issue here, and the reduced MTBF from using more disks is another one.

Jörg

--
EMail: joerg at schily.isdn.cs.tu-berlin.de (home) Jörg Schilling D-13353 Berlin
       js at cs.tu-berlin.de (uni)
       schilling at fokus.fraunhofer.de (work) Blog: http://schily.blogspot.com/
URL:   http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily
On Thu, 2 Oct 2008, Ahmed Kamal wrote:
>
> What is the "real/practical" possibility that I will face data loss during
> the next 5 years, for example? As storage experts, please help me interpret
> whatever numbers you're going to throw at me: is it a "really, really small
> chance", or would you be worried about it?

No one can tell you the answer to this. Certainly math can be formulated which shows that the pyramids in Egypt will be completely flattened before you experience data loss. Obviously that math has little value once it exceeds the lifetime of the computer by many orders of magnitude.

Raidz2 is about as good as it gets on paper. Bad things can still happen which are unrelated to what raidz2 protects against, or a calamity can impact all the hardware in the system. If you care about your data, make sure that you have a working backup system.

Bob
======================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
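The sort of "on paper" math alluded to here is a simplified MTTDL model for a single double-parity group. The sketch below ignores unrecoverable read errors, correlated failures (bad batches), firmware bugs, operator error, fire, and so on, which is exactly why it should not be read as a real-world data-loss probability; all input figures are assumptions, not vendor data.

import math

n      = 8            # disks in the raidz2 group
mtbf_h = 1_000_000.0  # assumed per-disk MTBF, hours
mttr_h = 48.0         # assumed time to replace and resilver a disk, hours

# Double parity: data loss needs three overlapping failures.
# MTTDL ~ MTBF^3 / (N * (N-1) * (N-2) * MTTR^2)
mttdl_h = mtbf_h**3 / (n * (n - 1) * (n - 2) * mttr_h**2)

years  = 5
p_loss = 1 - math.exp(-(years * 365 * 24) / mttdl_h)

print(f"MTTDL ~ {mttdl_h:.2e} hours (~{mttdl_h / 8760:.2e} years)")
print(f"P(pool loss from disk failures alone, {years} years) ~ {p_loss:.1e}")

# Astronomically small on paper -- hence the pyramids -- but every omitted
# failure mode above is more likely than this number, so: keep backups.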
On Thu, Oct 02, 2008 at 12:46:59PM -0500, Bob Friesenhahn wrote:
> On Thu, 2 Oct 2008, Ahmed Kamal wrote:
> > What is the "real/practical" possibility that I will face data loss during
> > the next 5 years, for example? As storage experts, please help me interpret
> > whatever numbers you're going to throw at me: is it a "really, really small
> > chance", or would you be worried about it?
>
> No one can tell you the answer to this. Certainly math can be formulated
> which shows that the pyramids in Egypt will be completely flattened before
> you experience data loss. Obviously that math has little value once it
> exceeds the lifetime of the computer by many orders of magnitude.
>
> Raidz2 is about as good as it gets on paper. Bad things can still happen
> which are unrelated to what raidz2 protects against, or a calamity can
> impact all the hardware in the system. If you care about your data, make
> sure that you have a working backup system.

Bad things like getting all your drives from a bad batch, so they all fail nearly simultaneously and much sooner than you'd expect.

As always, do back up your data.
On Thu, Oct 2, 2008 at 11:43 AM, Joerg Schilling <Joerg.Schilling at fokus.fraunhofer.de> wrote:

> You seem to misunderstand drive physics. With modern drives, seek times are
> not a dominating factor. It is the latency time that is rather important,
> and this is indeed proportional to 1/rotational-speed.
>
> On the other hand, you misunderstand another important fact of drive
> physics: the sustained transfer speed of a drive is proportional to the
> linear data density on the medium.
>
> The third mistake you make is to confuse the effects of the drive interface
> type with the effects of different drive geometry. The only coincidence
> here is that the drive geometry is typically updated more frequently for
> SATA drives than it is for SAS drives. This way, you benefit from the
> higher data density of a recent SATA drive and get a higher sustained
> data rate.
>
> BTW: I am not saying it makes no sense to buy SAS drives, but it makes
> sense to look at _all_ important parameters. Power consumption is a really
> important issue here, and the reduced MTBF from using more disks is
> another one.
>
> Jörg

Please, give me a list of enterprises currently using SATA drives for their database workloads, VMware workloads... hell, any workload besides email archiving, lightly used CIFS shares, or streaming sequential transfers of large files. I'm glad you can sit there with a spec sheet and tell me how you think things are going to work. I can tell you from real-life experience that you're not even remotely correct in your assumptions.

--Tim
On Thu, Oct 2, 2008 at 10:56 AM, Bob Friesenhahn <bfriesen at simple.dallas.tx.us> wrote:

> I doubt that anyone will successfully argue that SAS drives offer the
> best IOPS/$ value as long as space, power, and reliability factors may
> be ignored. However, these sorts of enterprise devices exist in order
> to allow enterprises to meet critical business demands which otherwise
> could not be met.
>
> There are situations where space, power, and cooling are limited, and
> the cost of the equipment is less of an issue since the space, power,
> and cooling cost more than the equipment.
>
> Bob

It's called USABLE IOPS/$. You can throw 500 drives at a workload; if you're attempting to access lots of small files in random ways, it won't make a lick of difference.

--Tim
On Thu, Oct 2, 2008 at 3:09 AM, Marc Bevand <m.bevand at gmail.com> wrote:
> Erik Trimble <Erik.Trimble <at> Sun.COM> writes:
>> Marc Bevand wrote:
>> > 7200rpm (SATA) drives clearly provide the best TB/$, throughput/$, IOPS/$.
>> > You can't argue against that. To paraphrase what was said earlier in this
>> > thread, to get the best IOPS out of $1000, spend your money on 10 7200rpm
>> > (SATA) drives instead of 3 or 4 15000rpm (SAS) drives. Similarly, for the
>> > best IOPS/RU, 15000rpm drives have the advantage. Etc.
>>
>> Be very careful about that. 73GB SAS drives aren't that expensive, so
>> you can get 6 x 73GB 15k SAS drives for the same amount as 11 x 250GB
>> SATA drives (per Sun list pricing for J4200 drives). SATA doesn't
>> always win the IOPS/$. Remember, a SAS drive can provide more than 2x
>> the number of IOPS a SATA drive can.
>
> Well, let's look at a concrete example:
> - cheapest 15k SAS drive (73GB): $180 [1]
> - cheapest 7.2k SATA drive (160GB): $40 [2] (not counting an 80GB at $37)
> The SAS drive most likely offers 2x-3x the IOPS/$. Certainly not 180/40=4.5x.
>
> [1] http://www.newegg.com/Product/Product.aspx?Item=N82E16822116057
> [2] http://www.newegg.com/Product/Product.aspx?Item=N82E16822136075

But Marc - that scenario is not realistic. Your budget drives won't come with a 5-year warranty and the MTBF to go along with that warranty. To level the playing field, let's narrow it down to:

a) drives with a 5-year warranty
b) drives that are designed for continuous 7x24 operation

In the case of SATA drives, the Western Digital "RE" series and Seagate "ES" immediately spring to mind. Another very interesting SATA product is the WD Velociraptor, which blurs the line between 7k2 SATA and 15k RPM SAS drives.

For the application the OP talked about, I doubt that he is going to run out and purchase low-cost drives that are more at home in a low-end desktop than they are in a ZFS-based filesystem.

Regards,

-- Al Hopper Logical Approach Inc, Plano, TX al at logical-approach.com
   Voice: 972.379.2133 Timezone: US CDT
   OpenSolaris Governing Board (OGB) Member - Apr 2005 to Mar 2007
   http://www.opensolaris.org/os/community/ogb/ogb_2005-2007/
On Thu, Oct 2, 2008 at 12:46 PM, Bob Friesenhahn <bfriesen at simple.dallas.tx.us> wrote:
> On Thu, 2 Oct 2008, Ahmed Kamal wrote:
> >
> > What is the "real/practical" possibility that I will face data loss during
> > the next 5 years, for example? As storage experts, please help me interpret
> > whatever numbers you're going to throw at me: is it a "really, really small
> > chance", or would you be worried about it?
>
> No one can tell you the answer to this. Certainly math can be formulated
> which shows that the pyramids in Egypt will be completely flattened before
> you experience data loss. Obviously that math has little value once it
> exceeds the lifetime of the computer by many orders of magnitude.
>
> Raidz2 is about as good as it gets on paper. Bad things can still happen
> which are unrelated to what raidz2 protects against, or a calamity can
> impact all the hardware in the system. If you care about your data, make
> sure that you have a working backup system.

Agreed. But make that an offsite backup (Amazon S3 comes to mind). We are already at the point in the evolution of computer technology where errors caused by people are far more likely to cause data loss. Or a water leak over the weekend, or... etc.

Regards,

-- Al Hopper Logical Approach Inc, Plano, TX al at logical-approach.com
   Voice: 972.379.2133 Timezone: US CDT
   OpenSolaris Governing Board (OGB) Member - Apr 2005 to Mar 2007
   http://www.opensolaris.org/os/community/ogb/ogb_2005-2007/
OK, let's cut off this thread now.

Bottom line here is that when it comes to making statements about SATA vs SAS, there are ONLY two statements which are currently absolute:

(1) a SATA drive has better GB/$ than a SAS drive
(2) a SAS drive has better throughput and IOPS than a SATA drive

This is comparing apples to apples (that is, drives in the same generation, available at the same time). ANY OTHER RANKING depends on your prioritization of cost, serviceability (i.e. vendor support), throughput, IOPS, space, power, and redundancy.

When we have this discussion, PLEASE, folks, ASK SPECIFICALLY ABOUT YOUR CONFIGURATION - that is, to those asking the initial question, SPECIFY your ranking of the above criteria. Otherwise, to all those on this list, DON'T ANSWER - we constantly devolve into tit-for-tat otherwise.

That's it, folks.

-Erik

On Thu, 2008-10-02 at 13:39 -0500, Tim wrote:
> On Thu, Oct 2, 2008 at 11:43 AM, Joerg Schilling
> <Joerg.Schilling at fokus.fraunhofer.de> wrote:
>
>> You seem to misunderstand drive physics. With modern drives, seek times
>> are not a dominating factor. It is the latency time that is rather
>> important, and this is indeed proportional to 1/rotational-speed.
>>
>> On the other hand, you misunderstand another important fact of drive
>> physics: the sustained transfer speed of a drive is proportional to the
>> linear data density on the medium.
>>
>> The third mistake you make is to confuse the effects of the drive
>> interface type with the effects of different drive geometry. The only
>> coincidence here is that the drive geometry is typically updated more
>> frequently for SATA drives than it is for SAS drives. This way, you
>> benefit from the higher data density of a recent SATA drive and get a
>> higher sustained data rate.
>>
>> BTW: I am not saying it makes no sense to buy SAS drives, but it makes
>> sense to look at _all_ important parameters. Power consumption is a
>> really important issue here, and the reduced MTBF from using more disks
>> is another one.
>>
>> Jörg
>
> Please, give me a list of enterprises currently using SATA drives for
> their database workloads, VMware workloads... hell, any workload besides
> email archiving, lightly used CIFS shares, or streaming sequential
> transfers of large files. I'm glad you can sit there with a spec sheet
> and tell me how you think things are going to work. I can tell you from
> real-life experience that you're not even remotely correct in your
> assumptions.
>
> --Tim

--
Erik Trimble
Java System Support
Mailstop: usca22-123
Phone: x17195
Santa Clara, CA
Timezone: US/Pacific (GMT-0800)
Erik Trimble <Erik.Trimble <at> Sun.COM> writes:
>
> Bottom line here is that when it comes to making statements about SATA
> vs SAS, there are ONLY two statements which are currently absolute:
>
> (1) a SATA drive has better GB/$ than a SAS drive
> (2) a SAS drive has better throughput and IOPS than a SATA drive

Yes, and to represent statements (1) and (2) in a more exhaustive table:

 Best X per Y  |  Dollar     Watt     Rack Unit (or "per drive")
---------------+------------------------------------------------
 Capacity      |  SATA(1)    SATA     SATA
 Throughput    |  SATA       SAS      SAS(2)
 IOPS          |  SATA       SAS      SAS(2)

If (a) people understood that each of these 9 performance numbers can be measured independently from the others, and (b) knew which of these numbers matter for a given workload (very often several of them do, so a compromise has to be made), then there would be no more circular SATA vs. SAS debates.

-marc
I hate to drag this thread on, but...

Erik Trimble wrote:
> OK, let's cut off this thread now.
>
> Bottom line here is that when it comes to making statements about SATA
> vs SAS, there are ONLY two statements which are currently absolute:
>
> (1) a SATA drive has better GB/$ than a SAS drive

In general, yes.

> (2) a SAS drive has better throughput and IOPS than a SATA drive

Disagree. We proved that the transport layer protocol has no bearing on throughput or IOPS. Several vendors offer drives which are identical in all respects except for transport layer protocol: SAS or SATA. You can choose either transport layer protocol and the performance remains the same.

> This is comparing apples to apples (that is, drives in the same
> generation, available at the same time).

Add same size. One common mis-correlation in this thread is comparing 3.5" SATA disks against 2.5" SAS disks -- there is a big difference in the capacity, power consumption, and seek time as the disk diameter changes.
 -- richard
Anton B. Rang | 2008-Oct-05 19:31 UTC | [zfs-discuss] SATA/SAS (Re: Quantifying ZFS reliability)
Erik:
> > (2) a SAS drive has better throughput and IOPS than a SATA drive

Richard:
> Disagree. We proved that the transport layer protocol has no bearing
> on throughput or IOPS. Several vendors offer drives which are
> identical in all respects except for transport layer protocol: SAS or
> SATA. You can choose either transport layer protocol and the
> performance remains the same.

Reference, please?

I draw your attention to Seagate's SPC-2 benchmark of the Barracuda ES.2 with SATA & SAS:

http://www.seagate.com/docs/pdf/whitepaper/tp_sas_benefits_to_tier_2_storage.pdf

Certainly there are a wide variety of workloads and you won't see a benefit everywhere, but there are cases where the SAS protocol provides a significant improvement over SATA.

Anton
--
This message posted from opensolaris.org
Richard Elling | 2008-Oct-06 14:52 UTC | [zfs-discuss] SATA/SAS (Re: Quantifying ZFS reliability)
Anton B. Rang wrote:
> Erik:
> > > (2) a SAS drive has better throughput and IOPS than a SATA drive
>
> Richard:
> > Disagree. We proved that the transport layer protocol has no bearing
> > on throughput or IOPS. Several vendors offer drives which are
> > identical in all respects except for transport layer protocol: SAS or
> > SATA. You can choose either transport layer protocol and the
> > performance remains the same.
>
> Reference, please?
>
> I draw your attention to Seagate's SPC-2 benchmark of the Barracuda ES.2
> with SATA & SAS:
>
> http://www.seagate.com/docs/pdf/whitepaper/tp_sas_benefits_to_tier_2_storage.pdf
>
> Certainly there are a wide variety of workloads and you won't see a benefit
> everywhere, but there are cases where the SAS protocol provides a
> significant improvement over SATA.

This really is a moot point. Comparing a single-channel SATA disk to a dual-channel SAS disk is disingenuous -- perhaps one could argue that Seagate should sell a dual-port SATA disk. Also, SATA drives which are faster than 15k rpm SAS disks are available or imminent from Intel, Samsung, and others (Super Talent, STEC, Crucial, et al.).

I think we can all agree that most of the vendors, to date, have been trying to differentiate their high-margin products to maintain the high margins (FC is a better example of this than SAS). But it is not a good habit to claim one transport is always superior to another when the real comparison must occur at the device.
 -- richard