I'm surprised no-one else has posted about this - part of the Sun Oracle
Exadata v2 is the Sun Flash Accelerator F20 PCIe card, with 48 or 96 GB of
SLC, a built-in SAS controller and a super-capacitor for cache protection.
http://www.sun.com/storage/disk_systems/sss/f20/specs.xml

There's no pricing on the webpage though - does anyone know how it compares
in price to a Logzilla?

-- James Andrewartha
On Sep 24, 2009, at 12:20 AM, James Andrewartha wrote:
> I'm surprised no-one else has posted about this - part of the Sun
> Oracle Exadata v2 is the Sun Flash Accelerator F20 PCIe card, with
> 48 or 96 GB of SLC, a built-in SAS controller and a super-capacitor
> for cache protection. http://www.sun.com/storage/disk_systems/sss/f20/specs.xml

At the Exadata-2 announcement, Larry kept saying that it wasn't a disk.
But there was little else of a technical nature said, though John did
have one to show.

RAC doesn't work with ZFS directly, so the details of the configuration
should prove interesting.
 -- richard
On Thu, Sep 24, 2009 at 12:10 PM, Richard Elling <richard.elling at gmail.com> wrote:
> On Sep 24, 2009, at 12:20 AM, James Andrewartha wrote:
>> I'm surprised no-one else has posted about this - part of the Sun Oracle
>> Exadata v2 is the Sun Flash Accelerator F20 PCIe card, with 48 or 96 GB of
>> SLC, a built-in SAS controller and a super-capacitor for cache protection.
>> http://www.sun.com/storage/disk_systems/sss/f20/specs.xml
>
> At the Exadata-2 announcement, Larry kept saying that it wasn't a disk.
> But there was little else of a technical nature said, though John did
> have one to show.
>
> RAC doesn't work with ZFS directly, so the details of the configuration
> should prove interesting.
> -- richard

Exadata 2 is built on Linux from what I read, so I'm not entirely sure how
it would leverage ZFS, period. I hope I heard wrong, or the whole
announcement feels like a bit of a joke to me.

--Tim
Richard Elling wrote:
> On Sep 24, 2009, at 12:20 AM, James Andrewartha wrote:
>> I'm surprised no-one else has posted about this - part of the Sun
>> Oracle Exadata v2 is the Sun Flash Accelerator F20 PCIe card, with 48
>> or 96 GB of SLC, a built-in SAS controller and a super-capacitor for
>> cache protection.
>> http://www.sun.com/storage/disk_systems/sss/f20/specs.xml
>
> At the Exadata-2 announcement, Larry kept saying that it wasn't a disk.
> But there was little else of a technical nature said, though John did
> have one to show.
>
> RAC doesn't work with ZFS directly, so the details of the configuration
> should prove interesting.
> -- richard

Isn't Exadata based on Linux? So it's not clear where ZFS comes into play,
but I didn't see any of this Oracle preso, so I could be confused by all
this.

Enda
On Sep 24, 2009, at 10:17 AM, Tim Cook wrote:
> On Thu, Sep 24, 2009 at 12:10 PM, Richard Elling <richard.elling at gmail.com> wrote:
> [...]
>
> Exadata 2 is built on Linux from what I read, so I'm not entirely sure
> how it would leverage ZFS, period. I hope I heard wrong or the whole
> announcement feels like a bit of a joke to me.

It is not clear to me. They speak of "storage servers" which would be
needed to implement the shared storage. These are described as Sun Fire
X4275 loaded with the FlashFire cards. I am not aware of a
production-ready Linux file system which implements a hybrid storage
pool. I could easily envision these as being OpenStorage appliances.
 -- richard
Richard, Tim,

yes, one might envision the X4275 as OpenStorage appliances, but they are
not. Exadata 2 is
 - *all* Sun hardware
 - *all* Oracle software (*)
and that combination is now an Oracle product: a database appliance.

All nodes run Oracle's Linux; as far as I understand - and that is not
sooo much - Oracle has offloaded certain database functionality into the
storage nodes. I would not assume that there is a hybrid storage pool with
a file system - it is a distributed database that knows how to utilize
flash storage. I see it as a first quick step.

hth
 -- Roland

PS: (*) disregarding firmware-like software components like Service
Processor code or IB subnet managers in the IB switches, which are
provided by Sun

Richard Elling wrote:
> [...]
> I could easily envision these as being OpenStorage appliances.
> -- richard
Richard Elling wrote:
> [...]
> It is not clear to me. They speak of "storage servers" which would be
> needed to implement the shared storage. These are described as Sun Fire
> X4275 loaded with the FlashFire cards. I am not aware of a
> production-ready Linux file system which implements a hybrid storage
> pool. I could easily envision these as being OpenStorage appliances.
> -- richard

Well, I'm not an expert on this at all, but what was said, IIRC, is that
it is using ASM, with the whole lot running on OEL. These aren't just
plain storage servers either. The storage servers are provided with enough
details of the DB search being performed to do an initial filtering of the
data, so the data returned to the DB servers for them to work on is
typically only 10% of the raw data they would conventionally have to
process (and that's before taking compression into account).

I haven't seen anything which says exactly how the flash cache is used (as
in, is it ASM or the database which decides what goes in flash?). ASM
certainly has the smarts to do this level of tuning for conventional disk
layout, and just like ZFS, it puts hot data on the outer edge of a disk
and uses slower parts of disks for less performance-critical data (things
like backups), so it certainly could decide what goes into flash.

-- Andrew
Roland Rambau wrote:
> Richard, Tim,
>
> yes, one might envision the X4275 as OpenStorage appliances, but
> they are not. Exadata 2 is
> - *all* Sun hardware
> - *all* Oracle software (*)
> and that combination is now an Oracle product: a database appliance.

Is there any reason the X4275 couldn't be an OpenStorage appliance? It
seems like it would be a good fit. It doesn't seem specific to Exadata 2.
The F20 accelerator card isn't something specific to Exadata 2 either, is
it? It looks like something that would benefit any kind of storage server.

When I saw the F20 on the Sun site the other day, my first thought was
"Oh cool, they reinvented Prestoserve!"

-Brian
Oracle use Linux :-(

But on the positive note, have a look at this:
http://www.youtube.com/watch?v=rmrxN3GWHpM

It's Ed Zander talking to Larry and asking some great questions.

29:45 Ed asks what parts of Sun are you going to keep ----- all of it!
45:00 Larry's rant on Cloud Computing ---- "the cloud is water vapour!"
20:00 Talks about Russell Coutts (a good kiwi bloke) and the America's Cup,
if you don't care about anything else. Although they seem confused about
who should own it, Team New Zealand are only letting the Swiss borrow it
for a while until they lose all our top sailors, like Russell, and we win
it back, once the trimaran side show is over :-)

Oh, and back on topic..... Anybody found any info on the F20? I've a
customer who wants to buy one, and on the partner portal I can't find any
real details (Just the Facts, or SunIntro, or a onestop-for-partners page
would be nice).

Trevor

Enda O'Connor wrote:
> isn't Exadata based on Linux, so not clear where ZFS comes into play,
> but I didn't see any of this Oracle preso, so could be confused by all
> this.
> Enda
Hi James,

The product will be launched in a very short time. You can learn pricing
from Sun. Please keep in mind that Logzilla and the F20 are designed with
slightly different tasks in mind. Logzilla is an extremely fast and
reliable write device, while the F20 can be used for many different loads
(read or write cache, or both at the same time).

Mertol

Mertol Ozyoney
Storage Practice - Sales Manager
Sun Microsystems, TR
Email mertol.ozyoney at sun.com
Hi Richard,

You are right, ZFS is not a shared FS, so it can not be used for RAC
unless you have a 7000 series disk system. In Exadata, ASM is used for
storage management, where the F20 can perform as a cache.

Best regards,
Mertol

Mertol Ozyoney
Storage Practice - Sales Manager
Sun Microsystems, TR
A word of caution: be sure not to read a lot into the fact that the F20 is
included in the Exadata machine.

From what I've heard, the flash_cache feature of 11.2.0 Oracle that was
enabled in beta is not working in the production release for anyone except
the Exadata 2.

The question is, why did they need to give this machine an unfair software
advantage? Is it because of the poor performance they found with the F20?

Oracle bought Sun; they have reason to make such moves.

I have been talking to a Sun rep for weeks now, trying to get the latency
specs on this F20 card, with no luck in getting that revealed so far.

However, you can look at Sun's other products like the F5100, which are
very unimpressive and high latency.

I would not assume this Sun tech is in the same league as a Fusion-io
ioDrive or a RamSan-10. They would not confirm whether it's a native PCIe
solution, or if the reason it comes on a SAS card is because it requires
SAS.

So, test, test, test, and don't assume this card is competitive because it
came out this year; I am not sure it's even competitive with last year's
ioDrive.

I told my Sun reseller that I merely needed it to be faster than the Intel
X25-E in terms of latency, and they weren't able to demonstrate that, at
least so far... lots of feet dragging, and I can only assume they want to
sell as much as they can before the card's metrics become widely known.
On Tue, Oct 20, 2009 at 10:23 AM, Robert Dupuy <rdupuy at umpublishing.org> wrote:
> [...]
> However, you can look at Sun's other products like the F5100, which are
> very unimpressive and high latency.
> [...]
> I told my Sun reseller that I merely needed it to be faster than the
> Intel X25-E in terms of latency, and they weren't able to demonstrate
> that, at least so far...

That's an awful lot of assumptions with no factual basis for any of your
claims. As for your bagging on the F5100... what exactly is your problem
with its latency? Assuming you aren't using absurdly large block sizes, it
would appear to fly. 0.15ms is bad?

http://blogs.sun.com/BestPerf/entry/1_6_million_4k_iops

--Tim
My post is a caution to test the performance and get your own results.

http://www.storagesearch.com/ssd.html

Please see the entry for October 12th.

The results page you linked to shows that you can use an arbitrarily high
number of threads, spread evenly across a large number of SAS channels,
and get the results to scale. These are Sun's ideal conditions, designed
to sell the F5100. Real-world performance is unimpressive.

The results need to be compared to other flash systems, not to traditional
hard drives. Most flash, including Sun's, trumps traditional hard drives.
I'm issuing a caution because I think it's a benefit. Look at Sun's
numbers for latency:

http://www.sun.com/storage/disk_systems/sss/f5100/specs.xml

.41ms

Fast compared to hard drives, but quite slow compared to competing SSDs.

I've done testing with the X25-E (Intel 32GB 2.5" SATA form factor
drives). I'm cautious about the F20 precisely because I would think Sun
would be anxious to prove it's faster than this competitor. I have not
said it's slower, only that it's unconfirmed, and so my recommendation is
to confirm the performance of this card; do not assume. Good advice.
On Tue, 20 Oct 2009, Robert Dupuy wrote:

> My post is a caution to test the performance, and get your own results.
>
> http://www.storagesearch.com/ssd.html
>
> Please see the entry for October 12th.

I see an editorial based on no experience and little data.

> The results page you linked to shows that you can use an arbitrarily
> high number of threads, spread evenly across a large number of SAS
> channels, and get the results to scale.
>
> These are Sun's ideal conditions, designed to sell the F5100.
> Real-world performance is unimpressive.

It is not clear to me that "real world performance is unimpressive", since
then it is necessary to define what is meant by "real world". Many real
world environments are naturally heavily threaded.

> I'm issuing a caution because I think it's a benefit. Look at Sun's
> numbers for latency:
>
> http://www.sun.com/storage/disk_systems/sss/f5100/specs.xml
>
> .41ms
>
> Fast compared to hard drives, but quite slow compared to competing SSDs.

You are assuming that the competing SSD vendors measured latency the same
way that Sun did. Sun has always been conservative with their benchmarks
and their specifications.

> I've done testing with the X25-E (Intel 32GB 2.5" SATA form factor
> drives).
>
> I'm cautious about the F20 precisely because I would think Sun would be
> anxious to prove it's faster than this competitor.

Is the X25-E a competitor? It never even crossed my mind that a device
like the X25-E would be a competitor. They don't satisfy the same
requirements: 1K non-volatile write IOPS vs 84k non-volatile write IOPS
seems like night and day to me (and I am sure that Sun prices
accordingly).

The only thing I agree with is the need to perform real world testing for
the intended application.

Bob
-- 
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
I agree that assuming the F20 works well for your application because it's
included in the Exadata 2 probably isn't logical. Equally, assuming it
doesn't work isn't logical.

Yes, the X25-E is clearly a competitor. It was once part of the Pillar
Data Systems setup and was disqualified based on reliability issues... in
that sense it doesn't seem like a good competitor, but it is a competitor.

I'm not here to promote the X25-E; however, Sun does sell a rebadged X25-E
in their own servers, and my particular salesman spec'd both an X25-E
based system and an F20 based system... so they were clearly pitched
against each other.

As far as my assuming things about Sun's .41ms benchmark methodology -
really? Am I, sir? Hardly. I start with that as the basis of a discussion
because Sun published that number. Seemed logical.

But I think we mostly agree: good idea to test. My intention was just to
save anyone from disappointment, should they purchase without testing.

I admit I haven't posted here before; I registered precisely because
Google was showing this page as a forum to discuss this card, and I just
wanted to discuss it some. My apologies if I seemed too enthusiastic in my
points from the get-go.
On Tue, 20 Oct 2009, Robert Dupuy wrote:

> I'm not here to promote the X25-E; however, Sun does sell a rebadged
> X25-E in their own servers, and my particular salesman spec'd both an
> X25-E based system and an F20 based system... so they were clearly
> pitched against each other.

Sun salesmen don't always know what they are doing. The F20 likely costs
more than the rest of the system. To date, no one here has posted any
experience with the F20, so it must be assumed that to date it has only
been used in the lab or under NDA. People here dream of using it for the
ZFS intent log, but it is clear that this was not Sun's initial focus for
the product.

> As far as my assuming things about Sun's .41ms benchmark methodology -
> really? Am I, sir? Hardly.

Measuring and specifying access times for disk drives is much more
straightforward than for solid state devices.

> I admit I haven't posted here before; I registered precisely because
> Google was showing this page as a forum to discuss this card, and I
> just wanted to discuss it some.

This is a good place. We just have to wait until some "real world" users
get their hands on some F20s without any NDA in place so that we can hear
about their experiences.

Bob
-- 
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
> People here dream of using it for the ZFS intent log but it is clear
> that this was not Sun's initial focus for the product.

At the moment I'm considering using a Gigabyte iRAM as a ZIL device (see
http://cgi.ebay.com/Gigabyte-IRAM-I-Ram-GC-RAMDISK-SSD-4GB-PCI-card-SATA_W0QQitemZ120481489540QQcmdZViewItemQQptZPCC_Drives_Storage_Internal?hash=item1c0d41a284).

As I am using 2x Gbit Ethernet and 4 GB of RAM, 4 GB for the iRAM should be
more than sufficient (0.5 times RAM and 10 s worth of IO).

I am aware that this RAM is non-ECC, so I plan to mirror the ZIL device.

Any considerations for this setup... will it work as I expect (speed up
sync IO, especially for NFS)?
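For what it's worth, attaching a mirrored slog to an existing pool is a
single command. A minimal sketch, assuming a pool named "tank"; the device
names are placeholders standing in for the two ramdisk devices:

    # Add the two devices as a mirrored log (slog) vdev; names are examples.
    zpool add tank log mirror c2t0d0 c3t0d0

    # Confirm the pool now shows a "logs" section containing the mirror.
    zpool status tank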
On Oct 20, 2009, at 8:23 AM, Robert Dupuy wrote:

> A word of caution: be sure not to read a lot into the fact that the F20
> is included in the Exadata machine.
>
> From what I've heard, the flash_cache feature of 11.2.0 Oracle that was
> enabled in beta is not working in the production release for anyone
> except the Exadata 2.
>
> The question is, why did they need to give this machine an unfair
> software advantage? Is it because of the poor performance they found
> with the F20?
>
> Oracle bought Sun; they have reason to make such moves.
>
> I have been talking to a Sun rep for weeks now, trying to get the
> latency specs on this F20 card, with no luck in getting that revealed
> so far.

AFAICT, there is no consistent latency measurement in the industry, yet.
With magnetic disks, you can usually get some sort of average values,
which can be useful to the first order. We do know that for most flash
devices read latency is relatively easy to measure, but write latency can
vary by an order of magnitude, depending on the SSD design and IOP size.
OK, this is a fancy way of saying YMMV, but in real life, YMMV.

> However, you can look at Sun's other products like the F5100, which are
> very unimpressive and high latency.
>
> I would not assume this Sun tech is in the same league as a Fusion-io
> ioDrive or a RamSan-10. They would not confirm whether it's a native
> PCIe solution, or if the reason it comes on a SAS card is because it
> requires SAS.
>
> So, test, test, test, and don't assume this card is competitive because
> it came out this year; I am not sure it's even competitive with last
> year's ioDrive.

+1

> I told my Sun reseller that I merely needed it to be faster than the
> Intel X25-E in terms of latency, and they weren't able to demonstrate
> that, at least so far... lots of feet dragging, and I can only assume
> they want to sell as much as they can before the card's metrics become
> widely known.

I'd be surprised if anyone could answer such a question while
simultaneously being credible. How many angels can dance on the tip of a
pin? Square dance or ballet? :-)

FWIW, Brendan recently blogged about measuring this at the NFS layer.
http://blogs.sun.com/brendan/entry/hybrid_storage_pool_top_speeds

I think where we stand today, the higher-level systems questions of
redundancy tend to work against builtin cards like the F20. These sorts of
cards have been available in one form or another for more than 20 years,
and yet they still have limited market share -- not because they aren't
fast, but because the other limitations carry more weight. If the stars
align and redundancy above the block layer gets more popular, then we
might see this sort of functionality implemented directly on the mobo...
at which point we can revisit the notion of file system. Previous efforts
to do this (e.g. Virident) haven't demonstrated stellar market movement.
 -- richard
Richard Elling wrote:
> I think where we stand today, the higher-level systems questions of
> redundancy tend to work against builtin cards like the F20. These sorts
> of cards have been available in one form or another for more than 20
> years, and yet they still have limited market share -- not because they
> aren't fast, but because the other limitations carry more weight. If
> the stars align and redundancy above the block layer gets more popular,
> then we might see this sort of functionality implemented directly on
> the mobo... at which point we can revisit the notion of file system.
> Previous efforts to do this (e.g. Virident) haven't demonstrated
> stellar market movement.
> -- richard

Richard,

You mean Prestoserve :-)

Putting data on a local NVRAM in the server layer was a bad idea 20 years
ago for a lot of applications. The reasons haven't changed in all those
years!

For those who may not have been around in the "good old days" when 1 to 16
MB of NVRAM on an S-Bus card was a good idea - or not:
http://docs.sun.com/app/docs/doc/801-7289/6i1jv4t2s?a=view

Trevor

"there is no consistent latency measurement in the industry"

You bring up an important point, as did another poster earlier in the
thread, and certainly it's an issue that needs to be addressed.

"I'd be surprised if anyone could answer such a question while
simultaneously being credible."

http://download.intel.com/design/flash/nand/extreme/extreme-sata-ssd-product-brief.pdf
Intel: X25-E read latency 75 microseconds

http://www.sun.com/storage/disk_systems/sss/f5100/specs.xml
Sun: F5100 read latency 410 microseconds

http://www.fusionio.com/PDFs/Data_Sheet_ioDrive_2.pdf
Fusion-io: read latency less than 50 microseconds (0.05 ms)

I find the latency measures to be useful. I know they aren't perfect, and
I agree benchmarks can be deceiving - heck, I criticized one vendor's
benchmarks in this thread already :)

But I did find that, for me, a very simple, single-threaded,
read-as-fast-as-you-can approach, measuring the number of random accesses
per second, is one type of measurement that gives you some data on the raw
access ability of the drive.

No doubt in some cases you want to test multithreaded IO too, but my
application is very latency sensitive, so this initial test was telling.

As I got into the actual performance of my app, the lower latency drives
performed better than the higher latency drives... all of this was on SSD.
(I did not test the F5100 personally; I'm talking about the SSD drives
that I did test.)

So, yes, SSD and HDD are different, but latency is still important.
> So, yes, SSD and HDD are different, but latency is still important.

But on SSD, write performance is much more unpredictable than on HDD.

If you want to write to an SSD, you will have to erase the used blocks
(assuming this is not a brand-new SSD) before you are able to write to
them. This takes much time, assuming the drive's firmware doesn't do this
by itself... but who can tell.

I replaced my notebook's internal HDD with a "cheap" SSD. At first I was
impressed, but in the meantime writes are unpredictable (copy times for a
single file vary from 1 h to 60 seconds).

If there is a difference to "enterprise-grade" SSDs, please let me know!
Matthias Appel wrote:
> But on SSD, write performance is much more unpredictable than on HDD.
>
> If you want to write to an SSD, you will have to erase the used blocks
> (assuming this is not a brand-new SSD) before you are able to write to
> them. This takes much time, assuming the drive's firmware doesn't do
> this by itself... but who can tell.
>
> I replaced my notebook's internal HDD with a "cheap" SSD. At first I
> was impressed, but in the meantime writes are unpredictable (copy times
> for a single file vary from 1 h to 60 seconds).
>
> If there is a difference to "enterprise-grade" SSDs, please let me know!

I haven't seen that on the X25-E disks I hammer as part of the demos on
the "Turbocharge Your Apps" discovery days I run.

-- Andrew
On Oct 20, 2009, at 1:58 PM, Robert Dupuy wrote:

> "there is no consistent latency measurement in the industry"
>
> You bring up an important point, as did another poster earlier in the
> thread, and certainly it's an issue that needs to be addressed.
>
> "I'd be surprised if anyone could answer such a question while
> simultaneously being credible."
>
> http://download.intel.com/design/flash/nand/extreme/extreme-sata-ssd-product-brief.pdf
>
> Intel: X25-E read latency 75 microseconds

... but they don't say where it was measured or how big it was...

> http://www.sun.com/storage/disk_systems/sss/f5100/specs.xml
>
> Sun: F5100 read latency 410 microseconds

... for 1M transfers... I have no idea what the units are, though... bytes?

> http://www.fusionio.com/PDFs/Data_Sheet_ioDrive_2.pdf
>
> Fusion-io: read latency less than 50 microseconds
>
> Fusion-io lists theirs as .05ms

... at the same time they quote 119,790 IOPS @ 4KB. By my calculator, that
is 8.3 microseconds per IOP, so clearly the latency itself doesn't have a
direct impact on IOPS.

> I find the latency measures to be useful.

Yes, but since we are seeing benchmarks showing 1.6 MIOPS (mega-IOPS :-)
on a system which claims 410 microseconds of latency, it really isn't
clear to me how to apply the numbers to capacity planning. To wit, there
is some limit to the number of concurrent IOPS that can be processed per
device, so do I need more devices, faster devices, or devices which can
handle more concurrent IOPS?

> I know they aren't perfect, and I agree benchmarks can be deceiving -
> heck, I criticized one vendor's benchmarks in this thread already :)
>
> But I did find that, for me, a very simple, single-threaded,
> read-as-fast-as-you-can approach, measuring the number of random
> accesses per second, is one type of measurement that gives you some
> data on the raw access ability of the drive.
>
> No doubt in some cases you want to test multithreaded IO too, but my
> application is very latency sensitive, so this initial test was
> telling.

cool.

> As I got into the actual performance of my app, the lower latency
> drives performed better than the higher latency drives... all of this
> was on SSD.

Note: the F5100 has SAS expanders which add latency.
 -- richard
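One way to square a per-command latency spec with a much higher IOPS
figure is Little's Law (average commands in flight = throughput x
latency). A rough sketch using the vendor numbers quoted above, taking the
Fusion-io "less than 50 microseconds" as the per-command latency and
assuming the IOPS were measured with multiple commands outstanding:

    # Little's Law: average commands in flight = IOPS * latency (seconds).
    echo '119790 * 0.000050' | bc -l    # ~6 commands outstanding at once
    # So a high IOPS number implies concurrency; it is not simply 1/latency.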
On Tue, Oct 20, 2009 at 3:58 PM, Robert Dupuy <rdupuy at umpublishing.org> wrote:
> [...]
> As I got into the actual performance of my app, the lower latency
> drives performed better than the higher latency drives... all of this
> was on SSD.
>
> So, yes, SSD and HDD are different, but latency is still important.

Timeout, rewind, etc. What workload do you have where 410 microsecond
latency is detrimental? More to the point, what workload do you have where
you'd rather have 5 microsecond latency with 1/100000th the IOPS? Whatever
it is, I've never run across such a workload in the real world. It sounds
like you're comparing paper numbers for the sake of comparison, rather
than to solve a real-world problem...

BTW, latency does not give you "# of random accesses per second". 5
microsecond latency for one access != # of random accesses per second,
sorry.

--Tim
On Tue, 20 Oct 2009, Richard Elling wrote:

>> Intel: X25-E read latency 75 microseconds
>
> ... but they don't say where it was measured or how big it was...

Probably measured using a logic analyzer, measuring the time from the last
bit of the request going in to the first bit of the response coming out.
It is not clear if this latency is a minimum, maximum, median, or average.
It is not clear if this latency is while the device is under some level of
load, or if it is in a quiescent state.

This is one of the skimpiest specification sheets that I have ever seen
for an enterprise product.

>> Sun: F5100 read latency 410 microseconds
>
> ... for 1M transfers... I have no idea what the units are, though... bytes?

Sun's testing is likely done while attached to a system and done with some
standard loading factor, rather than while in a quiescent state.

> ... at the same time they quote 119,790 IOPS @ 4KB. By my calculator,
> that is 8.3 microseconds per IOP, so clearly the latency itself doesn't
> have a direct impact on IOPS.

I would be interested to know how many IOPS an OS like Solaris is able to
push through a single device interface. The normal driver stack is likely
limited as to how many IOPS it can sustain for a given LUN, since the
driver stack is optimized for high latency devices like disk drives. If
you are creating a driver stack, the design decisions you make when
requests will be satisfied in about 12 ms would be much different than if
requests are satisfied in 50 us. Limitations of existing software stacks
are likely reasons why Sun is designing hardware with more device
interfaces and more independent devices.

Bob
-- 
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
My take on the responses I've received over the last days is that they
aren't genuine.

________________________________
From: Tim Cook [mailto:tim at cook.ms]
Sent: 2009-10-20 20:57
Subject: Re: [zfs-discuss] Sun Flash Accelerator F20

> Timeout, rewind, etc. What workload do you have where 410 microsecond
> latency is detrimental? [...]
I've already explained how you can scale up IOPS numbers, and unless that
is your real workload, you won't see that in practice (see: running a high
number of parallel jobs spread evenly across the SAS channels).

I don't find the conversation genuine, so I'm not going to continue it.

> This is one of the skimpiest specification sheets that I have ever seen
> for an enterprise product.

At least it shows the latency.

This is some kind of technology cult I've wandered into. I won't respond
further.
There is a debate tactic known as complex argument, where so many false
and misleading statements are made at once that it overwhelms the
respondent.

I'm just going to respond this way.

I am very disappointed in this discussion group. The response is not
genuine.

The idea that latency is not important is patently absurd. I am not going
into the details of my private application just so that you can pick at
it.

If you want to say latency has no relevance, you can defend that absurdity
at the risk of your own reputation.

The responses are not, in my opinion, genuine.

Attacking Intel's spec sheet, while simultaneously defending another
vendor releasing no latency numbers at all?

Well, I sent 3 emails in response earlier this morning, but I wasn't
logged in, so I don't know if the mod will post them or not. Moderator,
you don't have to; this will suffice as my last email.

Guys, I don't have time to waste with you, and I feel that it is very
wasteful to sit here and argue with people who either a) don't understand
technology or, more likely, b) are simply being argumentative because they
have a vested interest.

Either way, I don't see genuine help.

I am going to remove my account. Goodbye, and best of luck to everyone.
Please don't feed the troll. :)

-brian

On Wed, Oct 21, 2009 at 06:32:42AM -0700, Robert Dupuy wrote:
> There is a debate tactic known as complex argument, where so many false
> and misleading statements are made at once that it overwhelms the
> respondent.
> [...]
> I am going to remove my account. Goodbye, and best of luck to everyone.

-- 
"Coding in C is like sending a 3 year old to do groceries. You gotta tell
them exactly what you want or you'll end up with a cupboard full of pop
tarts and pancake mix." -- IRC User (http://www.bash.org/?841435)
On Oct 21, 2009, at 6:14 AM, Dupuy, Robert wrote:

>> This is one of the skimpiest specification sheets that I have ever
>> seen for an enterprise product.
>
> At least it shows the latency.

STORAGEsearch has been trying to wade through the spec muck for years.
http://www.storagesearch.com/ssd-fastest.html
 -- richard
Clearly a lot of people don't understand latency, so I'll talk about
latency, breaking it down into simpler components.

Sometimes it helps to use made-up numbers to simplify a point. Imagine a
non-real system that had these 'ridiculous' performance characteristics:

The system has a 60 second (1 minute) read latency.
The system can scale dramatically: it can do 60 billion IOs per minute.

Now, some here are arguing about the term latency, but it's rather a
simple term. It simply means the amount of time it takes for data to move
from one point to another. And some here have argued there is no good
measurement of latency, but that is also very simple: it is measured in
time units.

OK, so we have a latency of 1 minute in this 'explanatory' system. That
means I issue a read request, and the flash takes 1 minute to return the
requested data to the program.

But remember, this example system has massive parallel scalability.

I issue 2 read requests; both read requests return after 1 minute.
I issue 3 read requests; all 3 return after 1 minute.

I defined this made-up system as one such that if you issue 60 billion
read requests, they all return, simultaneously, after 1 minute.

Let's do some math. 60,000,000,000 divided by 60 seconds - well, this
system does 1 billion IOPS!

Wow, what wouldn't run fast with 1 billion IOPS?

The answer is, most programs would not - not with such a high latency as
waiting 1 minute for data to return. Most apps wouldn't run acceptably,
not at all. Imagine you are in Windows, or Solaris, or Linux, and every
time you need to go to disk there is a 1 minute wait. It would be totally
unacceptable; despite the IOPS, latency matters.

Certain types of apps wouldn't be latency sensitive - some people would
love to have this 1 billion IOPS system :)

The good news is that the F20's latency, even if we don't know what it is,
is certainly not 1 minute, and we can speculate that it is much better
than traditional rotating disks. But let's blue-sky this and make up a
number, say .41 ms (410 microseconds). And let's say you have a competitor
at .041 ms (41 microseconds). When would the competitor have a real
advantage? Well, if it was an app that issued a read, waited for the
results, issued a read, waited for the results, and, say, did this 100
million times or so, then yes, that low latency card is going to help
accelerate that app.

Computers are fast; they deal with a lot of data - and surprisingly, a lot
of real-world work doesn't scale. I've seen sales and financial apps do
100 million IOs and more. Even a Sun blogger, I read recently, did an
article about the F20 in terms of how, compared to traditional disks, it
speeds up PeopleSoft jobs.

Flash has lower latency than traditional disks; that's part of what makes
it competitive... and by the same token, flash with lower latency than
other flash has a competitive advantage.

Some here say latency (the wait time) doesn't matter with flash - that
latency (waiting) only matters with traditional hard drives. Uhm, who told
you that? I've never heard someone make that case before, anywhere, ever.

And let's give you credit and say you had some minor point to make about
HDD and flash differences... still, you are using it in such a way that
someone could draw the wrong conclusion, so clarify this point: you are
certainly not suggesting that higher wait times speed up an application,
correct? Or that the F20's latency cannot impact performance, right?
C'mon, some common sense? Anyone?
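To put rough numbers on the serial case using the read-latency figures
quoted earlier in the thread: with only one request outstanding,
throughput is capped at 1/latency no matter how well the device scales
under parallel load. A quick sketch:

    # Single-threaded ceiling = 1 / latency (seconds per request).
    echo '1 / 0.000410' | bc -l    # ~2,400 requests/s at 410 us
    echo '1 / 0.000075' | bc -l    # ~13,300 requests/s at 75 us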
Now let's talk about the 'latency deniers'.

First of all, they say there is no standard measurement of latency. That
isn't complicated: Sun includes the transfer time in its latency figures;
other companies do not.

Then the latency deniers say there is no way to compare the numbers.
That's what I'm getting from reading the thread. Well, take a 4k block
size - the transfer time isn't significant. When someone publishes their
'best' latency times, transfer time is not significant. You CAN compare
latency specs. If a latency denier says you can't, ask yourself why. Is it
because they wouldn't come out favorably in a latency comparison?

Some say latency is not important in flash. Well, why would someone say
that? Is it because they don't have good latency numbers?

The F20 card has some features to its merit, but it's not a high
performance card... it's not in the class of a Fusion-io or TMS RamSan-20
in terms of IOPS. However, it has some features you may like. I understand
it's bootable. I'm told it has a supercapacitor. It's not
rotating-hard-drive slow. Furthermore, if you need high sequential
transfers, it seems to have that.

High IOPS card? Well, I do say, watch out for scaled results that look
good on benchmarks. Your real-world application had better run like a
benchmark.
On Wed, Oct 21, 2009 at 9:15 PM, Jake Caferilla <jake at tanooshka.com> wrote:> Clearly a lot of people don''t understand latency, so I''ll talk about > latency, breaking it down in simpler components. > > Sometimes it helps to use made up numbers, to simplify a point. > > Imagine a non-real system that had these ''ridiculous'' performance > characteristics: > > The system has a 60 second (1 minute) read latency. > The system can scale dramatically, it can do 60 billion IO''s per minute. > > Now some here are arguing about the term latency, but its rather a simple > term. > It simply means the amount of time it takes, for data to move from one > point to another. > > And some here have argued there is no good measurement of latency, but also > it very simple. > It is measured in time units. > > OK, so we have a latency of 1 minute, in this ''explanatory'' system. > > That means, I issued a read request, the Flash takes 1 minute to return the > data requested to the program. > > But remember, this example system, has massive parallel scalability. > > I issue 2 read requests, both read requests return after 1 minute. > I issue 3 read requests, all 3 return after 1 minute. > > I defined this made up system, as one, such that if you issue 60 billion > read requests, they all return, simultaneously, after 1 minute. > > Let''s do some math. > > 60,000,000,000 divided by 60 seconds, well this system does 1 billion IOPS! > > Wow, what wouldn''t run fast with 1 billion IOPS? > > The answer, is, most programs would not, not with such a high latency as > waiting 1 minute for data to return. Most apps wouldn''t run acceptably, no > not at all. > > Imagine you are in Windows, or Solaris, or Linux, and every time you needed > to go to disk, a 1 minute wait. Wow, it would be totally unacceptable, > despite the IOPS, latency matters. > > Certain types of apps wouldn''t be latency sensitive, some people would love > to have this 1 billion IOPs system :) > > The good news is, the F20 latency, even if we donflash has lower latency than traditional disks, that''s part of what makes it> competitive...and by the same token, flash with lower latency than other > flash, has a competitive advantage. > > Some here say latency (that wait times) doesn''t matter with flash. That > latency (waiting) only matters with traditional hard drives. > > Uhm, who told you that? I''ve never heard someone make that case before, > anywhere, ever. > > And lets give you credit and say you had some minor point to make about hdd > and flash differences...still you are using it in such a way, that someone > could draw the wrong conclusion, so..... clarify this point, you are > certainly not suggesting that higher wait times speeds up an application, > correct? > > Or that the F20''s latency cannot impact performace, right? C''mon, some > common sense? anyone? > >Yet again, you''re making up situations on paper. We''re dealing with the real world, not theory. So please, describe the electronics that have been invented that can somehow take in 1billion IO requests, process them, have a memory back end that can return them, but does absolutely nothing with them for a full minute. Even if you scale those numbers down, your theory is absolutely ridiculous. Of course, you also failed to address the other issue. How exactly does a drive have .05ms response time, yet only provide 500 IOPS. It''s IMPOSSIBLE for those numbers to work out. But hey, lets ignore reality and just go with vendor numbers. 
--Tim
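The arithmetic both sides are invoking here is just Little's law: throughput (IOPS) equals outstanding requests divided by per-request latency. A minimal sketch of that arithmetic in Python; the 60-second/60-billion values are the thought-experiment numbers from above, the 75 microsecond figure is the Intel read latency quoted elsewhere in this thread used purely as an illustration, and none of these are measured F20 numbers:

# littles_law.py - IOPS = outstanding requests / per-request latency

def iops(outstanding_requests, latency_seconds):
    """Little's law: throughput implied by a given concurrency and latency."""
    return outstanding_requests / latency_seconds

print(iops(60e9, 60.0))    # 1e9 "IOPS", yet every caller still waits a full minute
print(iops(1, 75e-6))      # ~13,300 IOPS from a 75 us device at queue depth 1
print(iops(32, 75e-6))     # ~427,000 IOPS if 32 outstanding requests scaled perfectly
print(1 / 500.0)           # conversely, 500 IOPS at queue depth 1 implies 2 ms per
                           # request, whatever the best-case latency spec says

The point of the sketch is simply that a huge aggregate IOPS figure can coexist with terrible per-request latency, and that a quoted latency only pins down an IOPS figure once you also know the queue depth.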
On Tue, Oct 20 at 21:54, Bob Friesenhahn wrote:>On Tue, 20 Oct 2009, Richard Elling wrote: >>> >>>Intel: X-25E read latency 75 microseconds >> >>... but they don''t say where it was measured or how big it was... > >Probably measured using a logic analyzer and measuring the time from >the last bit of the request going in, to the first bit of the >response coming out. It is not clear if this latency is a minimum, >maximum, median, or average. It is not clear if this latency is >while the device is under some level of load, or if it is in a >quiescent state. > >This is one of the skimpiest specification sheets that I have ever >seen for an enterprise product.It may be "skimpy" compared to what you''re used to, but it seems to answer most of your questions, looking at the public X25-E data sheet on intel.com: The latency numbers clearly indicate "typical" which I equate to average (perhaps incorrectly) and regarding the system load, they''re measured doing 4KB reads or writes with a queue depth of 1 which is traditionally considered very light loading. Correct, it doesn''t explicitly state whether the data transfer phase via 3Gbit/s SATA is included or not. At 300MB/s the bus transfer is relatively small, even compared to the quoted numbers. The read and write performance IOPS numbers clearly indicate a SATA queue depth of 32, with write cacheing enabled, and that every LBA on the device has been written to (device 100% full) prior to measurement, which answers some (granted not all) of the questions about device preparations before measurement. The full pack IO cases should be worst case, since the device has the minimum available spare area for managing wear leveling, garbage collection, and other features at that point. The inverse of latency under load in the IOPS testing should give you a number that you can multiply by queue depth to get typical individual command latency under load. (Assuming commands are done in order or mostly in order) It also gives you the typical bandwidth under load as well, which is about 13MB/s in full-pack random 4KB writes and about 140MB/s in full-pack random 4KB reads. --eric -- Eric D. Mudama edmudama at mail.bounceswoosh.org
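To make the "inverse of latency under load" point concrete, here is a small worked example in Python. The 35,000 read and 3,300 write IOPS inputs are the commonly quoted X25-E data-sheet figures (they are consistent with the 140 MB/s and 13 MB/s bandwidths mentioned above), and the 300 MB/s usable SATA bandwidth is a rough assumption, so treat the outputs as estimates rather than measurements:

# x25e_estimates.py - back-of-envelope numbers from data-sheet IOPS figures

QD = 32             # queue depth used for the IOPS measurements
READ_IOPS = 35000   # 4 KB random read IOPS (data-sheet figure, assumed)
WRITE_IOPS = 3300   # 4 KB random write IOPS (data-sheet figure, assumed)
BLOCK = 4096        # bytes per I/O
SATA_BW = 300e6     # ~300 MB/s usable on 3 Gbit/s SATA (assumption)

# Typical per-command latency under load ~= queue depth / IOPS (Little's law).
print("read latency under load:  %.2f ms" % (QD / READ_IOPS * 1e3))   # ~0.91 ms
print("write latency under load: %.2f ms" % (QD / WRITE_IOPS * 1e3))  # ~9.70 ms

# Bandwidth implied by the IOPS figures.
print("read bandwidth:  %.0f MB/s" % (READ_IOPS * BLOCK / 1e6))       # ~143 MB/s
print("write bandwidth: %.0f MB/s" % (WRITE_IOPS * BLOCK / 1e6))      # ~14 MB/s

# Bus transfer time for one 4 KB block, for comparison with the 75 us quoted latency.
print("4 KB transfer time: %.1f us" % (BLOCK / SATA_BW * 1e6))        # ~13.7 us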
>>>>> "ma" == Matthias Appel <matthias.appel at lanlabor.com> writes:ma> At the moment I''m considering using a Gigabyte iRAM as ZIL acard ans-9010 backs up to CF on power loss. also has a sneaky ``ECC emulation'''' mode that rearranges non-ECC memory into blocks and then dedicates some bits to ECC, so you don''t have to pay unfair prices for the 9th bit out of every eight. See the ``quick install'''' pdf on acard''s site which is written as a FAQ. it is a little goofy because it goes into this sort of ``crisis mode'''' when you yank the power, where it starts using the battery and writing to the CF, which it doesn''t do under normal operation. This exposes it to UPS-like silly problems if the battery is bad, or the CF is bad, or the CF isn''t shoved in all the way, whatever, while something like stec that uses the dram and commits to flash continuously can report its failure before you have to count on it, which is particularly well-adapted to something like a slog that''s not evenv used unless there''s a power failure. but...the acard thing gives stec performance for intel prices! let us know how your .tw ramdisk scheme goes. ma> (speed up sync. IO especially for NFS)? IIUC it will never give you more improvement than disabling the ZIL, so you should try disabling the ZIL and running tests to get an upper bound on the improvement you can expect. -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 304 bytes Desc: not available URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20091027/83ffbe90/attachment.bin>
>>>>> "jc" == Jake Caferilla <jake at tanooshka.com> writes:jc> But remember, this example system, has massive parallel jc> scalability. jc> I issue 2 read requests, both read requests return after 1 jc> minute. yeah, it''s interesting to consider overall software stacks that might have unnecessary serialization points. For example, Postfix will do one or two fsync()''s per mail it receives from the Interent into its internal queue, but I think it can be working on receiving many mails at once, all waiting on fsync''s together. If the filesystem stack pushes these fsync''s all the way down to the disk and serializes them, you''ll only end up with a few parallel I/O transactions open, because almost everything postfix wants to do is ``write, sync, let me know when done.'''' OTOH, if you ran three instances of Postfix on the same single box with three separate queue diectories on three separate filesystems/zpools, inside zones for example, and split the mail load among them using MX records or a load balancer, you might through this ridiculous arrangement increase the parallelism of I/O reaching the disk even on a system that serializes fsync''s. but if ZFS is smart enough to block several threads on fsync at once, batch up their work to a single ZIL write-and-sync, then the three-instance scheme will have no benefit. anyway, though, I do agree it makes sense to quote ``latency'''' as the write-and-sync latency, equivalent to a hard disk with the write cache off, and if you want to tell a story about how much further you can stretch performance through parallelism, quote ``sustained write io/s''''. I don''t think it''s as difficult to quantify write performance as you make out, and obviuosly hard disk vendors are already specifying in this honest way otherwise their write latency numbers would be stupidly low, but it sounds like SSD vendors are sometimes waving their hands around ``oh it''s all chips anyway we didn''t know what you MEANT it''s AMBIGUOUS is that a rabbit over there?'''' giving dishonest latency numbers. Latency numbers for writes that will not survive power loss are *MEANINGLESS*. period. And that is worth complaining about. -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 304 bytes Desc: not available URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20091027/df826e47/attachment.bin>
Miles Nordin wrote:> For example, Postfix will do one or two fsync()''s per mail it receives > from the Interent into its internal queue, but I think it can be > working on receiving many mails at once, all waiting on fsync''s > together. If the filesystem stack pushes these fsync''s all the way > down to the disk and serializes them, you''ll only end up with a few > parallel I/O transactions open, because almost everything postfix > wants to do is ``write, sync, let me know when done.'''' > > OTOH, if you ran three instances of Postfix on the same single box > with three separate queue diectories on three separate > filesystems/zpools, inside zones for example, and split the mail load > among them using MX records or a load balancer, you might through this > ridiculous arrangement increase the parallelism of I/O reaching the > disk even on a system that serializes fsync''s. > > but if ZFS is smart enough to block several threads on fsync at once, > batch up their work to a single ZIL write-and-sync, then the > three-instance scheme will have no benefit. >ZFS does exactly this. I demonstrate it on the SSD Discovery Days I run periodically in the UK. -- Andrew Gabriel
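For anyone who wants to check the batching behaviour described above on their own pool, a rough test harness might look like the sketch below. It is only an illustration: the /tank/test path, thread count, and 4 KB record size are arbitrary placeholders, and the interesting result is comparing the syncs/s reported for 1 thread against 8 threads - if the aggregate rate rises well beyond the single-threaded rate, concurrent fsync()s are being coalesced into shared log commits rather than serialized.

# fsync_batch_test.py - rough, hypothetical check of concurrent fsync() throughput
import os, sys, time, threading

DIR = sys.argv[1] if len(sys.argv) > 1 else "/tank/test"   # placeholder: a dir on the pool under test
THREADS = int(sys.argv[2]) if len(sys.argv) > 2 else 8
OPS_PER_THREAD = 1000

def worker(n, results):
    path = os.path.join(DIR, "sync-%d" % n)
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
    for _ in range(OPS_PER_THREAD):
        os.write(fd, b"x" * 4096)   # small write, like a mail-queue entry
        os.fsync(fd)                # force it to stable storage (a ZIL commit on ZFS)
    os.close(fd)
    results[n] = OPS_PER_THREAD

results = {}
threads = [threading.Thread(target=worker, args=(i, results)) for i in range(THREADS)]
start = time.time()
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.time() - start
total = sum(results.values())
print("%d threads: %d fsyncs in %.1f s -> %.0f syncs/s" % (THREADS, total, elapsed, total / elapsed))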
Matthias Appel wrote:> I am using 2x Gbit Ethernet an 4 Gig of RAM, > 4 Gig of RAM for the iRAM should be more than sufficient (0.5 times RAM and > 10s worth of IO) > > I am aware that this RAM is non-ECC so I plan to mirror the ZIL device. > > Any considerations for this setup....Will it work as I expect it (speed up > sync. IO especially for NFS)?We looked at the iRAM a while back and it had decent performance, but I was wary of deploying it in remote datacenters since there was no way to monitor the battery status. IIRC the only indicator of a bad battery is an LED on the board face, not even on the bracket, so I''d have to go crack open the machine to make sure the battery was holding a charge. That didn''t fit our model of maintainability, so we didn''t deploy it. Regards, Eric
On 21/10/2009 03:54, Bob Friesenhahn wrote:
> I would be interested to know how many IOPS an OS like Solaris is able to push through a single device interface. The normal driver stack is likely limited as to how many IOPS it can sustain for a given LUN since the driver stack is optimized for high latency devices like disk drives. If you are creating a driver stack, the design decisions you make when requests will be satisfied in about 12ms would be much different than if requests are satisfied in 50us. Limitations of existing software stacks are likely reasons why Sun is designing hardware with more device interfaces and more independent devices.

OpenSolaris 2009.06, 1KB READ I/O:

# dd of=/dev/null bs=1k if=/dev/rdsk/c0t0d0p0&
# iostat -xnzCM 1|egrep "device|c[0123]$"
[...]
    r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
17497.3    0.0   17.1    0.0  0.0  0.8    0.0    0.0   0  82 c0
                    extended device statistics
    r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
17498.8    0.0   17.1    0.0  0.0  0.8    0.0    0.0   0  82 c0
                    extended device statistics
    r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
17277.6    0.0   16.9    0.0  0.0  0.8    0.0    0.0   0  82 c0
                    extended device statistics
    r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
17441.3    0.0   17.0    0.0  0.0  0.8    0.0    0.0   0  82 c0
                    extended device statistics
    r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
17333.9    0.0   16.9    0.0  0.0  0.8    0.0    0.0   0  82 c0

Now let's see how it looks for a single SAS connection, but with dd running against 11x SSDs:

# dd of=/dev/null bs=1k if=/dev/rdsk/c0t0d0p0&
# dd of=/dev/null bs=1k if=/dev/rdsk/c0t1d0p0&
# dd of=/dev/null bs=1k if=/dev/rdsk/c0t2d0p0&
# dd of=/dev/null bs=1k if=/dev/rdsk/c0t4d0p0&
# dd of=/dev/null bs=1k if=/dev/rdsk/c0t5d0p0&
# dd of=/dev/null bs=1k if=/dev/rdsk/c0t6d0p0&
# dd of=/dev/null bs=1k if=/dev/rdsk/c0t7d0p0&
# dd of=/dev/null bs=1k if=/dev/rdsk/c0t8d0p0&
# dd of=/dev/null bs=1k if=/dev/rdsk/c0t9d0p0&
# dd of=/dev/null bs=1k if=/dev/rdsk/c0t10d0p0&
# dd of=/dev/null bs=1k if=/dev/rdsk/c0t11d0p0&

# iostat -xnzCM 1|egrep "device|c[0123]$"
[...]
     r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
104243.3    0.0  101.8    0.0  0.2  9.7    0.0    0.1   0 968 c0
                    extended device statistics
     r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
104249.2    0.0  101.8    0.0  0.2  9.7    0.0    0.1   0 968 c0
                    extended device statistics
     r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
104208.1    0.0  101.8    0.0  0.2  9.7    0.0    0.1   0 967 c0
                    extended device statistics
     r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
104245.8    0.0  101.8    0.0  0.2  9.7    0.0    0.1   0 966 c0
                    extended device statistics
     r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
104221.9    0.0  101.8    0.0  0.2  9.7    0.0    0.1   0 968 c0
                    extended device statistics
     r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
104212.2    0.0  101.8    0.0  0.2  9.7    0.0    0.1   0 967 c0

It looks like a single CPU core still hasn't been saturated and the bottleneck is in the device rather than the OS/CPU. So the MPT driver in Solaris 2009.06 can do at least 100,000 IOPS to a single SAS port.

It also scales well - I ran the above dd's over 4x SAS ports at the same time and it scaled linearly, achieving well over 400k IOPS.

hw used: x4270, 2x Intel X5570 2.93GHz, 4x SAS SG-PCIE8SAS-E-Z (fw. 1.27.3.0), connected to F5100.

--
Robert Milkowski
http://milek.blogspot.com
On Thu, Jun 10, 2010 at 6:06 PM, Robert Milkowski <milek at task.gda.pl> wrote:> On 21/10/2009 03:54, Bob Friesenhahn wrote: > >> >> I would be interested to know how many IOPS an OS like Solaris is able to >> push through a single device interface. The normal driver stack is likely >> limited as to how many IOPS it can sustain for a given LUN since the driver >> stack is optimized for high latency devices like disk drives. If you are >> creating a driver stack, the design decisions you make when requests will be >> satisfied in about 12ms would be much different than if requests are >> satisfied in 50us. Limitations of existing software stacks are likely >> reasons why Sun is designing hardware with more device interfaces and more >> independent devices. >> > > > Open Solaris 2009.06, 1KB READ I/O: > > # dd of=/dev/null bs=1k if=/dev/rdsk/c0t0d0p0& >/dev/null is usually a poor choice for a test lie this. Just to be on the safe side, I''d rerun it with /dev/random. Regards, Andrey> # iostat -xnzCM 1|egrep "device|c[0123]$" > [...] > r/s w/s Mr/s Mw/s wait actv wsvc_t asvc_t %w %b device > 17497.3 0.0 17.1 0.0 0.0 0.8 0.0 0.0 0 82 c0 > extended device statistics > r/s w/s Mr/s Mw/s wait actv wsvc_t asvc_t %w %b device > 17498.8 0.0 17.1 0.0 0.0 0.8 0.0 0.0 0 82 c0 > extended device statistics > r/s w/s Mr/s Mw/s wait actv wsvc_t asvc_t %w %b device > 17277.6 0.0 16.9 0.0 0.0 0.8 0.0 0.0 0 82 c0 > extended device statistics > r/s w/s Mr/s Mw/s wait actv wsvc_t asvc_t %w %b device > 17441.3 0.0 17.0 0.0 0.0 0.8 0.0 0.0 0 82 c0 > extended device statistics > r/s w/s Mr/s Mw/s wait actv wsvc_t asvc_t %w %b device > 17333.9 0.0 16.9 0.0 0.0 0.8 0.0 0.0 0 82 c0 > > > Now lets see how it looks like for a single SAS connection but dd to 11x > SSDs: > > # dd of=/dev/null bs=1k if=/dev/rdsk/c0t0d0p0& > # dd of=/dev/null bs=1k if=/dev/rdsk/c0t1d0p0& > # dd of=/dev/null bs=1k if=/dev/rdsk/c0t2d0p0& > # dd of=/dev/null bs=1k if=/dev/rdsk/c0t4d0p0& > # dd of=/dev/null bs=1k if=/dev/rdsk/c0t5d0p0& > # dd of=/dev/null bs=1k if=/dev/rdsk/c0t6d0p0& > # dd of=/dev/null bs=1k if=/dev/rdsk/c0t7d0p0& > # dd of=/dev/null bs=1k if=/dev/rdsk/c0t8d0p0& > # dd of=/dev/null bs=1k if=/dev/rdsk/c0t9d0p0& > # dd of=/dev/null bs=1k if=/dev/rdsk/c0t10d0p0& > # dd of=/dev/null bs=1k if=/dev/rdsk/c0t11d0p0& > > # iostat -xnzCM 1|egrep "device|c[0123]$" > [...] > r/s w/s Mr/s Mw/s wait actv wsvc_t asvc_t %w %b device > 104243.3 0.0 101.8 0.0 0.2 9.7 0.0 0.1 0 968 c0 > extended device statistics > r/s w/s Mr/s Mw/s wait actv wsvc_t asvc_t %w %b device > 104249.2 0.0 101.8 0.0 0.2 9.7 0.0 0.1 0 968 c0 > extended device statistics > r/s w/s Mr/s Mw/s wait actv wsvc_t asvc_t %w %b device > 104208.1 0.0 101.8 0.0 0.2 9.7 0.0 0.1 0 967 c0 > extended device statistics > r/s w/s Mr/s Mw/s wait actv wsvc_t asvc_t %w %b device > 104245.8 0.0 101.8 0.0 0.2 9.7 0.0 0.1 0 966 c0 > extended device statistics > r/s w/s Mr/s Mw/s wait actv wsvc_t asvc_t %w %b device > 104221.9 0.0 101.8 0.0 0.2 9.7 0.0 0.1 0 968 c0 > extended device statistics > r/s w/s Mr/s Mw/s wait actv wsvc_t asvc_t %w %b device > 104212.2 0.0 101.8 0.0 0.2 9.7 0.0 0.1 0 967 c0 > > > It looks like a single CPU core still hasn''t been saturated and the > bottleneck is in the device rather then OS/CPU. So the MPT driver in Solaris > 2009.06 can do at least 100,000 IOPS to a single SAS port. > > It also scales well - I did run above dd''s over 4x SAS ports at the same > time and it scaled linearly by achieving well over 400k IOPS. 
> hw used: x4270, 2x Intel X5570 2.93GHz, 4x SAS SG-PCIE8SAS-E-Z (fw. 1.27.3.0), connected to F5100.
>
> --
> Robert Milkowski
> http://milek.blogspot.com
On 10/06/2010 15:39, Andrey Kuzmin wrote:> On Thu, Jun 10, 2010 at 6:06 PM, Robert Milkowski <milek at task.gda.pl > <mailto:milek at task.gda.pl>> wrote: > > On 21/10/2009 03:54, Bob Friesenhahn wrote: > > > I would be interested to know how many IOPS an OS like Solaris > is able to push through a single device interface. The normal > driver stack is likely limited as to how many IOPS it can > sustain for a given LUN since the driver stack is optimized > for high latency devices like disk drives. If you are > creating a driver stack, the design decisions you make when > requests will be satisfied in about 12ms would be much > different than if requests are satisfied in 50us. Limitations > of existing software stacks are likely reasons why Sun is > designing hardware with more device interfaces and more > independent devices. > > > > Open Solaris 2009.06, 1KB READ I/O: > > # dd of=/dev/null bs=1k if=/dev/rdsk/c0t0d0p0& > > > /dev/null is usually a poor choice for a test lie this. Just to be on > the safe side, I''d rerun it with /dev/random. >That wouldn''t work, would it? Please notice that I''m reading *from* an ssd and writing *to* /dev/null -- Robert Milkowski http://milek.blogspot.com -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20100610/e218edff/attachment.html>
Sorry, my bad. _Reading_ from /dev/null may be an issue, but not writing to it, of course. Regards, Andrey On Thu, Jun 10, 2010 at 6:46 PM, Robert Milkowski <milek at task.gda.pl> wrote:> On 10/06/2010 15:39, Andrey Kuzmin wrote: > > On Thu, Jun 10, 2010 at 6:06 PM, Robert Milkowski <milek at task.gda.pl>wrote: > >> On 21/10/2009 03:54, Bob Friesenhahn wrote: >> >>> >>> I would be interested to know how many IOPS an OS like Solaris is able to >>> push through a single device interface. The normal driver stack is likely >>> limited as to how many IOPS it can sustain for a given LUN since the driver >>> stack is optimized for high latency devices like disk drives. If you are >>> creating a driver stack, the design decisions you make when requests will be >>> satisfied in about 12ms would be much different than if requests are >>> satisfied in 50us. Limitations of existing software stacks are likely >>> reasons why Sun is designing hardware with more device interfaces and more >>> independent devices. >>> >> >> >> Open Solaris 2009.06, 1KB READ I/O: >> >> # dd of=/dev/null bs=1k if=/dev/rdsk/c0t0d0p0& >> > > /dev/null is usually a poor choice for a test lie this. Just to be on the > safe side, I''d rerun it with /dev/random. > > > That wouldn''t work, would it? > Please notice that I''m reading *from* an ssd and writing *to* /dev/null > > > -- > Robert Milkowski > http://milek.blogspot.com > >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20100610/41b0ad7d/attachment.html>
On Thu, Jun 10, 2010 at 9:39 AM, Andrey Kuzmin <andrey.v.kuzmin at gmail.com> wrote:
> On Thu, Jun 10, 2010 at 6:06 PM, Robert Milkowski <milek at task.gda.pl> wrote:
>> On 21/10/2009 03:54, Bob Friesenhahn wrote:
>>> I would be interested to know how many IOPS an OS like Solaris is able to push through a single device interface. The normal driver stack is likely limited as to how many IOPS it can sustain for a given LUN since the driver stack is optimized for high latency devices like disk drives. If you are creating a driver stack, the design decisions you make when requests will be satisfied in about 12ms would be much different than if requests are satisfied in 50us. Limitations of existing software stacks are likely reasons why Sun is designing hardware with more device interfaces and more independent devices.
>>
>> Open Solaris 2009.06, 1KB READ I/O:
>>
>> # dd of=/dev/null bs=1k if=/dev/rdsk/c0t0d0p0&
>
> /dev/null is usually a poor choice for a test lie this. Just to be on the safe side, I'd rerun it with /dev/random.
> Regards,
> Andrey

(aside from other replies about read vs. write and /dev/random...)

Testing performance of disk by reading from /dev/random and writing to disk is misguided. From random(7d):

    Applications retrieve random bytes by reading /dev/random or
    /dev/urandom. The /dev/random interface returns random bytes only
    when sufficient amount of entropy has been collected.

In other words, when the kernel doesn't think that it can give high quality random numbers, it stops providing them until it has gathered enough entropy. It will pause your reads.

If instead you use /dev/urandom, the above problem doesn't exist, but the generation of random numbers is CPU-intensive. There is a reasonable chance (particularly with slow CPUs and fast disk) that you will be testing the speed of /dev/urandom rather than the speed of the disk or other I/O components.

If your goal is to provide data that is not all 0's, to prevent ZFS compression from making the file sparse or to be sure that compression doesn't otherwise make the actual writes smaller, you could try something like:

# create a file just over 100 MB
dd if=/dev/random of=/tmp/randomdata bs=513 count=204401
# repeatedly feed that file to dd
while true ; do cat /tmp/randomdata ; done | dd of=/my/test/file bs=... count=...

The above should make it so that it will take a while before there are two blocks that are identical, thus confounding deduplication as well.

--
Mike Gerdts
http://mgerdts.blogspot.com/
As to your results, it sounds almost too good to be true. As Bob has pointed out, h/w design targeted hundreds IOPS, and it was hard to believe it can scale 100x. Fantastic. Regards, Andrey On Thu, Jun 10, 2010 at 6:06 PM, Robert Milkowski <milek at task.gda.pl> wrote:> On 21/10/2009 03:54, Bob Friesenhahn wrote: > >> >> I would be interested to know how many IOPS an OS like Solaris is able to >> push through a single device interface. The normal driver stack is likely >> limited as to how many IOPS it can sustain for a given LUN since the driver >> stack is optimized for high latency devices like disk drives. If you are >> creating a driver stack, the design decisions you make when requests will be >> satisfied in about 12ms would be much different than if requests are >> satisfied in 50us. Limitations of existing software stacks are likely >> reasons why Sun is designing hardware with more device interfaces and more >> independent devices. >> > > > Open Solaris 2009.06, 1KB READ I/O: > > # dd of=/dev/null bs=1k if=/dev/rdsk/c0t0d0p0& > # iostat -xnzCM 1|egrep "device|c[0123]$" > [...] > r/s w/s Mr/s Mw/s wait actv wsvc_t asvc_t %w %b device > 17497.3 0.0 17.1 0.0 0.0 0.8 0.0 0.0 0 82 c0 > extended device statistics > r/s w/s Mr/s Mw/s wait actv wsvc_t asvc_t %w %b device > 17498.8 0.0 17.1 0.0 0.0 0.8 0.0 0.0 0 82 c0 > extended device statistics > r/s w/s Mr/s Mw/s wait actv wsvc_t asvc_t %w %b device > 17277.6 0.0 16.9 0.0 0.0 0.8 0.0 0.0 0 82 c0 > extended device statistics > r/s w/s Mr/s Mw/s wait actv wsvc_t asvc_t %w %b device > 17441.3 0.0 17.0 0.0 0.0 0.8 0.0 0.0 0 82 c0 > extended device statistics > r/s w/s Mr/s Mw/s wait actv wsvc_t asvc_t %w %b device > 17333.9 0.0 16.9 0.0 0.0 0.8 0.0 0.0 0 82 c0 > > > Now lets see how it looks like for a single SAS connection but dd to 11x > SSDs: > > # dd of=/dev/null bs=1k if=/dev/rdsk/c0t0d0p0& > # dd of=/dev/null bs=1k if=/dev/rdsk/c0t1d0p0& > # dd of=/dev/null bs=1k if=/dev/rdsk/c0t2d0p0& > # dd of=/dev/null bs=1k if=/dev/rdsk/c0t4d0p0& > # dd of=/dev/null bs=1k if=/dev/rdsk/c0t5d0p0& > # dd of=/dev/null bs=1k if=/dev/rdsk/c0t6d0p0& > # dd of=/dev/null bs=1k if=/dev/rdsk/c0t7d0p0& > # dd of=/dev/null bs=1k if=/dev/rdsk/c0t8d0p0& > # dd of=/dev/null bs=1k if=/dev/rdsk/c0t9d0p0& > # dd of=/dev/null bs=1k if=/dev/rdsk/c0t10d0p0& > # dd of=/dev/null bs=1k if=/dev/rdsk/c0t11d0p0& > > # iostat -xnzCM 1|egrep "device|c[0123]$" > [...] > r/s w/s Mr/s Mw/s wait actv wsvc_t asvc_t %w %b device > 104243.3 0.0 101.8 0.0 0.2 9.7 0.0 0.1 0 968 c0 > extended device statistics > r/s w/s Mr/s Mw/s wait actv wsvc_t asvc_t %w %b device > 104249.2 0.0 101.8 0.0 0.2 9.7 0.0 0.1 0 968 c0 > extended device statistics > r/s w/s Mr/s Mw/s wait actv wsvc_t asvc_t %w %b device > 104208.1 0.0 101.8 0.0 0.2 9.7 0.0 0.1 0 967 c0 > extended device statistics > r/s w/s Mr/s Mw/s wait actv wsvc_t asvc_t %w %b device > 104245.8 0.0 101.8 0.0 0.2 9.7 0.0 0.1 0 966 c0 > extended device statistics > r/s w/s Mr/s Mw/s wait actv wsvc_t asvc_t %w %b device > 104221.9 0.0 101.8 0.0 0.2 9.7 0.0 0.1 0 968 c0 > extended device statistics > r/s w/s Mr/s Mw/s wait actv wsvc_t asvc_t %w %b device > 104212.2 0.0 101.8 0.0 0.2 9.7 0.0 0.1 0 967 c0 > > > It looks like a single CPU core still hasn''t been saturated and the > bottleneck is in the device rather then OS/CPU. So the MPT driver in Solaris > 2009.06 can do at least 100,000 IOPS to a single SAS port. 
> It also scales well - I did run above dd's over 4x SAS ports at the same time and it scaled linearly by achieving well over 400k IOPS.
>
> hw used: x4270, 2x Intel X5570 2.93GHz, 4x SAS SG-PCIE8SAS-E-Z (fw. 1.27.3.0), connected to F5100.
>
> --
> Robert Milkowski
> http://milek.blogspot.com
Andrey Kuzmin wrote:> As to your results, it sounds almost too good to be true. As Bob has > pointed out, h/w design targeted hundreds IOPS, and it was hard to > believe it can scale 100x. Fantastic.Hundreds IOPS is not quite true, even with hard drives. I just tested a Hitachi 15k drive and it handles 67000 512 byte linear write/s, cache enabled. --Arne> > Regards, > Andrey > > > > On Thu, Jun 10, 2010 at 6:06 PM, Robert Milkowski <milek at task.gda.pl > <mailto:milek at task.gda.pl>> wrote: > > On 21/10/2009 03:54, Bob Friesenhahn wrote: > > > I would be interested to know how many IOPS an OS like Solaris > is able to push through a single device interface. The normal > driver stack is likely limited as to how many IOPS it can > sustain for a given LUN since the driver stack is optimized for > high latency devices like disk drives. If you are creating a > driver stack, the design decisions you make when requests will > be satisfied in about 12ms would be much different than if > requests are satisfied in 50us. Limitations of existing > software stacks are likely reasons why Sun is designing hardware > with more device interfaces and more independent devices. > >
On Thu, Jun 10, 2010 at 11:51 PM, Arne Jansen <sensille at gmx.net> wrote:> Andrey Kuzmin wrote: > >> As to your results, it sounds almost too good to be true. As Bob has >> pointed out, h/w design targeted hundreds IOPS, and it was hard to believe >> it can scale 100x. Fantastic. >> > > Hundreds IOPS is not quite true, even with hard drives. I just tested > a Hitachi 15k drive and it handles 67000 512 byte linear write/s, cache >Linear? May be sequential? Regards, Andrey> enabled. > > --Arne > > >> Regards, >> Andrey >> >> >> >> >> On Thu, Jun 10, 2010 at 6:06 PM, Robert Milkowski <milek at task.gda.pl<mailto: >> milek at task.gda.pl>> wrote: >> >> On 21/10/2009 03:54, Bob Friesenhahn wrote: >> >> >> I would be interested to know how many IOPS an OS like Solaris >> is able to push through a single device interface. The normal >> driver stack is likely limited as to how many IOPS it can >> sustain for a given LUN since the driver stack is optimized for >> high latency devices like disk drives. If you are creating a >> driver stack, the design decisions you make when requests will >> be satisfied in about 12ms would be much different than if >> requests are satisfied in 50us. Limitations of existing >> software stacks are likely reasons why Sun is designing hardware >> with more device interfaces and more independent devices. >> >> >>-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20100610/cea54ff2/attachment.html>
Andrey Kuzmin wrote:> On Thu, Jun 10, 2010 at 11:51 PM, Arne Jansen <sensille at gmx.net > <mailto:sensille at gmx.net>> wrote: > > Andrey Kuzmin wrote: > > As to your results, it sounds almost too good to be true. As Bob > has pointed out, h/w design targeted hundreds IOPS, and it was > hard to believe it can scale 100x. Fantastic. > > > Hundreds IOPS is not quite true, even with hard drives. I just tested > a Hitachi 15k drive and it handles 67000 512 byte linear write/s, cache > > > Linear? May be sequential?Aren''t these synonyms? linear as opposed to random.
Well, I''m more accustomed to "sequential vs. random", but YMMW. As to 67000 512 byte writes (this sounds suspiciously close to 32Mb fitting into cache), did you have write-back enabled? Regards, Andrey On Fri, Jun 11, 2010 at 12:03 AM, Arne Jansen <sensille at gmx.net> wrote:> Andrey Kuzmin wrote: > > On Thu, Jun 10, 2010 at 11:51 PM, Arne Jansen <sensille at gmx.net <mailto: >> sensille at gmx.net>> wrote: >> >> Andrey Kuzmin wrote: >> >> As to your results, it sounds almost too good to be true. As Bob >> has pointed out, h/w design targeted hundreds IOPS, and it was >> hard to believe it can scale 100x. Fantastic. >> >> >> Hundreds IOPS is not quite true, even with hard drives. I just tested >> a Hitachi 15k drive and it handles 67000 512 byte linear write/s, cache >> >> >> Linear? May be sequential? >> > > Aren''t these synonyms? linear as opposed to random. > > >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20100611/01a27a98/attachment.html>
Andrey Kuzmin wrote:> Well, I''m more accustomed to "sequential vs. random", but YMMW. > > As to 67000 512 byte writes (this sounds suspiciously close to 32Mb > fitting into cache), did you have write-back enabled? >It''s a sustained number, so it shouldn''t matter.> Regards, > Andrey > > > > On Fri, Jun 11, 2010 at 12:03 AM, Arne Jansen <sensille at gmx.net > <mailto:sensille at gmx.net>> wrote: > > Andrey Kuzmin wrote: > > On Thu, Jun 10, 2010 at 11:51 PM, Arne Jansen <sensille at gmx.net > <mailto:sensille at gmx.net> <mailto:sensille at gmx.net > <mailto:sensille at gmx.net>>> wrote: > > Andrey Kuzmin wrote: > > As to your results, it sounds almost too good to be true. > As Bob > has pointed out, h/w design targeted hundreds IOPS, and > it was > hard to believe it can scale 100x. Fantastic. > > > Hundreds IOPS is not quite true, even with hard drives. I > just tested > a Hitachi 15k drive and it handles 67000 512 byte linear > write/s, cache > > > Linear? May be sequential? > > > Aren''t these synonyms? linear as opposed to random. > > >
On Jun 10, 2010, at 1:24 PM, Arne Jansen wrote:> Andrey Kuzmin wrote: >> Well, I''m more accustomed to "sequential vs. random", but YMMW. >> As to 67000 512 byte writes (this sounds suspiciously close to 32Mb fitting into cache), did you have write-back enabled? > > It''s a sustained number, so it shouldn''t matter.That is only 34 MB/sec. The disk can do better for sequential writes. Note: in ZFS, such writes will be coalesced into 128KB chunks. -- richard -- ZFS and NexentaStor training, Rotterdam, July 13-15, 2010 http://nexenta-rotterdam.eventbrite.com/
For the record, with my driver (which is not the same as the one shipped by the vendor), I was getting over 150K IOPS with a single DDRdrive X1. It is possible to get very high IOPS with Solaris. However, it might be difficult to get such high numbers with systems based on SCSI/SCSA. (SCSA does have assumptions which make it "overweight" for typical simple flash based devices.) My solution was based around the "blkdev" device driver that I integrated into ON a couple of builds ago. -- Garrett On 06/10/10 12:57, Andrey Kuzmin wrote:> On Thu, Jun 10, 2010 at 11:51 PM, Arne Jansen <sensille at gmx.net > <mailto:sensille at gmx.net>> wrote: > > Andrey Kuzmin wrote: > > As to your results, it sounds almost too good to be true. As > Bob has pointed out, h/w design targeted hundreds IOPS, and it > was hard to believe it can scale 100x. Fantastic. > > > Hundreds IOPS is not quite true, even with hard drives. I just tested > a Hitachi 15k drive and it handles 67000 512 byte linear write/s, > cache > > > Linear? May be sequential? > > Regards, > Andrey > > enabled. > > --Arne > > > Regards, > Andrey > > > > > On Thu, Jun 10, 2010 at 6:06 PM, Robert Milkowski > <milek at task.gda.pl <mailto:milek at task.gda.pl> > <mailto:milek at task.gda.pl <mailto:milek at task.gda.pl>>> wrote: > > On 21/10/2009 03:54, Bob Friesenhahn wrote: > > > I would be interested to know how many IOPS an OS like > Solaris > is able to push through a single device interface. The > normal > driver stack is likely limited as to how many IOPS it can > sustain for a given LUN since the driver stack is > optimized for > high latency devices like disk drives. If you are > creating a > driver stack, the design decisions you make when > requests will > be satisfied in about 12ms would be much different than if > requests are satisfied in 50us. Limitations of existing > software stacks are likely reasons why Sun is designing > hardware > with more device interfaces and more independent devices. > > > > > _______________________________________________ > zfs-discuss mailing list > zfs-discuss at opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20100610/01c4e84a/attachment.html>
On Jun 10, 2010, at 5:54 PM, Richard Elling <richard.elling at gmail.com> wrote:> On Jun 10, 2010, at 1:24 PM, Arne Jansen wrote: > >> Andrey Kuzmin wrote: >>> Well, I''m more accustomed to "sequential vs. random", but YMMW. >>> As to 67000 512 byte writes (this sounds suspiciously close to >>> 32Mb fitting into cache), did you have write-back enabled? >> >> It''s a sustained number, so it shouldn''t matter. > > That is only 34 MB/sec. The disk can do better for sequential writes.Not doing sector sized IO. Besides this was a max IOPS number not max throughput number. If it were the OP might have used a 1M bs or better instead. -Ross
On Fri, Jun 11, 2010 at 1:54 AM, Richard Elling <richard.elling at gmail.com>wrote:> On Jun 10, 2010, at 1:24 PM, Arne Jansen wrote: > > > Andrey Kuzmin wrote: > >> Well, I''m more accustomed to "sequential vs. random", but YMMW. > >> As to 67000 512 byte writes (this sounds suspiciously close to 32Mb > fitting into cache), did you have write-back enabled? > > > > It''s a sustained number, so it shouldn''t matter. > > That is only 34 MB/sec. The disk can do better for sequential writes. > > Note: in ZFS, such writes will be coalesced into 128KB chunks. >So this is just 256 IOPS in the controller, not 64K. Regards, Andrey> -- richard > > -- > ZFS and NexentaStor training, Rotterdam, July 13-15, 2010 > http://nexenta-rotterdam.eventbrite.com/ > > > > > > >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20100611/85e55f51/attachment.html>
Andrey Kuzmin wrote:> On Fri, Jun 11, 2010 at 1:54 AM, Richard Elling > <richard.elling at gmail.com <mailto:richard.elling at gmail.com>> wrote: > > On Jun 10, 2010, at 1:24 PM, Arne Jansen wrote: > > > Andrey Kuzmin wrote: > >> Well, I''m more accustomed to "sequential vs. random", but YMMW. > >> As to 67000 512 byte writes (this sounds suspiciously close to > 32Mb fitting into cache), did you have write-back enabled? > > > > It''s a sustained number, so it shouldn''t matter. > > That is only 34 MB/sec. The disk can do better for sequential writes. > > Note: in ZFS, such writes will be coalesced into 128KB chunks. > > > So this is just 256 IOPS in the controller, not 64K.No, it''s 67k ops, it was a completely ZFS-free test setup. iostat also confirmed the numbers. --Arne> > Regards, > Andrey > > > -- richard > > -- > ZFS and NexentaStor training, Rotterdam, July 13-15, 2010 > http://nexenta-rotterdam.eventbrite.com/ > > > > > > >
On 10/06/2010 20:43, Andrey Kuzmin wrote:> As to your results, it sounds almost too good to be true. As Bob has > pointed out, h/w design targeted hundreds IOPS, and it was hard to > believe it can scale 100x. Fantastic.But it actually can do over 100k. Also several thousand IOPS on a single FC port is nothing unusual and has been the case for at least several years. -- Robert Milkowski http://milek.blogspot.com
On 11/06/2010 09:22, sensille wrote:
> Andrey Kuzmin wrote:
>
>> On Fri, Jun 11, 2010 at 1:54 AM, Richard Elling <richard.elling at gmail.com> wrote:
>>
>> On Jun 10, 2010, at 1:24 PM, Arne Jansen wrote:
>>
>> > Andrey Kuzmin wrote:
>> >> Well, I'm more accustomed to "sequential vs. random", but YMMW.
>> >> As to 67000 512 byte writes (this sounds suspiciously close to 32Mb fitting into cache), did you have write-back enabled?
>> >
>> > It's a sustained number, so it shouldn't matter.
>>
>> That is only 34 MB/sec. The disk can do better for sequential writes.
>>
>> Note: in ZFS, such writes will be coalesced into 128KB chunks.
>>
>> So this is just 256 IOPS in the controller, not 64K.
>>
> No, it's 67k ops, it was a completely ZFS-free test setup. iostat also confirmed the numbers.

It's a really simple test that anyone can do:

# dd if=/dev/zero of=/dev/rdsk/cXtYdZs0 bs=512

I did a test on my workstation a moment ago and got about 21k IOPS from my SATA drive (iostat). The trick here, of course, is that this is a sequential write with no other workload going on, and the drive should be able to nicely coalesce these IOs and do sequential writes with large blocks.

--
Robert Milkowski
http://milek.blogspot.com
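A quick sanity check on why tiny sequential writes can post such large IOPS numbers is to express them as bandwidth. The 21,000 and 67,000 figures below are the ones reported in this thread; the comparison against a drive's sequential media rate (very roughly 60-100 MB/s for drives of that era) is an assumption, not a measurement:

# coalesce_check.py - sequential 512 B write rates expressed as bandwidth
for iops in (21000, 67000):
    print("%6d IOPS x 512 B = %.1f MB/s" % (iops, iops * 512 / 1e6))
# 21,000 IOPS is only ~10.8 MB/s and 67,000 is ~34.3 MB/s, both well below a
# typical sequential media rate, so the on-disk write cache can merge the tiny
# commands into large sequential writes without ever falling behind.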
On Fri, Jun 11, 2010 at 1:26 PM, Robert Milkowski <milek at task.gda.pl> wrote:> On 11/06/2010 09:22, sensille wrote: > >> Andrey Kuzmin wrote: >> >> >>> On Fri, Jun 11, 2010 at 1:54 AM, Richard Elling >>> <richard.elling at gmail.com<mailto:richard.elling at gmail.com>> wrote: >>> >>> On Jun 10, 2010, at 1:24 PM, Arne Jansen wrote: >>> >>> > Andrey Kuzmin wrote: >>> >> Well, I''m more accustomed to "sequential vs. random", but YMMW. >>> >> As to 67000 512 byte writes (this sounds suspiciously close to >>> 32Mb fitting into cache), did you have write-back enabled? >>> > >>> > It''s a sustained number, so it shouldn''t matter. >>> >>> That is only 34 MB/sec. The disk can do better for sequential >>> writes. >>> >>> Note: in ZFS, such writes will be coalesced into 128KB chunks. >>> >>> >>> So this is just 256 IOPS in the controller, not 64K. >>> >>> >> No, it''s 67k ops, it was a completely ZFS-free test setup. iostat also >> confirmed >> the numbers. >> > > It''s a really simple test everyone can do it. > > # dd if=/dev/zero of=/dev/rdsk/cXtYdZs0 bs=512 > > I did a test on my workstation a moment ago and got about 21k IOPS from my > sata drive (iostat). > The trick here of course is that this is sequentail write with no other > workload going on and a drive should be able to nicely coalesce these IOs > and do a sequential writes with large blocks.Exactly, though one might still wonder where the coalescing actually happens, in the respective OS layer or in the controller. Nonetheless, this is hardly a common use-case one would design h/w for. Regards, Andrey> > > > -- > Robert Milkowski > http://milek.blogspot.com > > _______________________________________________ > zfs-discuss mailing list > zfs-discuss at opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20100611/0b6b5cb8/attachment.html>
On 11/06/2010 10:58, Andrey Kuzmin wrote:> On Fri, Jun 11, 2010 at 1:26 PM, Robert Milkowski <milek at task.gda.pl > <mailto:milek at task.gda.pl>> wrote: > > On 11/06/2010 09:22, sensille wrote: > > Andrey Kuzmin wrote: > > On Fri, Jun 11, 2010 at 1:54 AM, Richard Elling > <richard.elling at gmail.com > <mailto:richard.elling at gmail.com><mailto:richard.elling at gmail.com > <mailto:richard.elling at gmail.com>>> wrote: > > On Jun 10, 2010, at 1:24 PM, Arne Jansen wrote: > > > Andrey Kuzmin wrote: > >> Well, I''m more accustomed to "sequential vs. random", > but YMMW. > >> As to 67000 512 byte writes (this sounds suspiciously > close to > 32Mb fitting into cache), did you have write-back enabled? > > > > It''s a sustained number, so it shouldn''t matter. > > That is only 34 MB/sec. The disk can do better for > sequential writes. > > Note: in ZFS, such writes will be coalesced into 128KB > chunks. > > > So this is just 256 IOPS in the controller, not 64K. > > No, it''s 67k ops, it was a completely ZFS-free test setup. > iostat also confirmed > the numbers. > > > It''s a really simple test everyone can do it. > > # dd if=/dev/zero of=/dev/rdsk/cXtYdZs0 bs=512 > > I did a test on my workstation a moment ago and got about 21k IOPS > from my sata drive (iostat). > The trick here of course is that this is sequentail write with no > other workload going on and a drive should be able to nicely > coalesce these IOs and do a sequential writes with large blocks. > > > Exactly, though one might still wonder where the coalescing actually > happens, in the respective OS layer or in the controller. Nonetheless, > this is hardly a common use-case one would design h/w for. > >in the above example it happens inside a disk drive. -- Robert Milkowski http://milek.blogspot.com -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20100611/d110bb87/attachment.html>
On Fri, 2010-06-11 at 13:58 +0400, Andrey Kuzmin wrote:> # dd if=/dev/zero of=/dev/rdsk/cXtYdZs0 bs=512 > > I did a test on my workstation a moment ago and got about 21k > IOPS from my sata drive (iostat). > The trick here of course is that this is sequentail write with > no other workload going on and a drive should be able to > nicely coalesce these IOs and do a sequential writes with > large blocks. > > > Exactly, though one might still wonder where the coalescing actually > happens, in the respective OS layer or in the controller. Nonetheless, > this is hardly a common use-case one would design h/w for.No OS layer coalescing happens. The most an OS will ever do is "sort" the IOs to make them advantageous (e.g. avoid extra seeking), but the I/Os are still delivered as individual requests to the HBA. I''m not aware of any logic in an HBA to coalesce either, and I would think such a thing would be highly risky. That said, caching firmware on the drive itself may effectively (probably!) cause these transfers to happen as a single transfer if they are naturally contiguous, and if they are arriving at the drive firmware faster than the firmware can flush them to media. - Garrett