Hi,

as I have learned from the discussion about which SSD to use as ZIL drives, I stumbled across this article, which discusses short stroking for increasing IOPS on SAS and SATA drives:

http://www.tomshardware.com/reviews/short-stroking-hdd,2157.html

Now, I am wondering if using a mirror of such 15k SAS drives would be a good-enough fit for a ZIL on a zpool that is mainly used for file services via AFP and SMB.
I'd particularly like to know whether someone has already used such a solution and how it has worked out.

Cheers,
budy

--
Stephan Budach
Jung von Matt/it-services GmbH
Glashüttenstraße 79
20357 Hamburg

Tel: +49 40-4321-1353
Fax: +49 40-4321-1114
E-Mail: stephan.budach at jvm.de
Internet: http://www.jvm.com

Geschäftsführer: Ulrich Pallas, Frank Wilhelm
AG HH HRB 98380
Great question. In "good enough" computing, beauty is in the eye of the beholder. My home NAS appliance uses IDE and SATA drives without a dedicated ZIL

http://dtrace.org/blogs/ahl/2010/11/15/zil-analysis-from-chris-george/

"if HDDs and commodity SSDs continue to be target ZIL devices, ZFS could and should do more to ensure that writes are sequential."

On 23 Dec 2010, at 10:25, Stephan Budach <stephan.budach at jvm.de> wrote:

> Hi,
>
> as I have learned from the discussion about which SSD to use as ZIL drives, I stumbled across this article, which discusses short stroking for increasing IOPS on SAS and SATA drives:
>
> http://www.tomshardware.com/reviews/short-stroking-hdd,2157.html
>
> Now, I am wondering if using a mirror of such 15k SAS drives would be a good-enough fit for a ZIL on a zpool that is mainly used for file services via AFP and SMB.
> I'd particularly like to know whether someone has already used such a solution and how it has worked out.
>
> Cheers,
> budy
Sent from my iPhone (which has a lousy user interface that makes it all too easy for a clumsy oaf like me to touch "Send" before I'm done)...

On 23 Dec 2010, at 11:07, Phil Harman <phil.harman at gmail.com> wrote:

> Great question. In "good enough" computing, beauty is in the eye of the beholder. My home NAS appliance uses mirrored IDE and SATA drives without a dedicated ZIL

device. And for my home SMB and NFS, that's good enough.

I'm sure that even a 7200rpm SATA ZIL would improve things in my case.

The random I/O requirement for the ZIL is discussed by Adam (and Chris) here ...

> http://dtrace.org/blogs/ahl/2010/11/15/zil-analysis-from-chris-george/

What I find most encouraging is this statement:

> "if HDDs and commodity SSDs continue to be target ZIL devices, ZFS could and should do more to ensure that writes are sequential."

It's not broken, but it is suboptimal, and fixable (apparently) ;)
On 23.12.10 12:18, Phil Harman wrote:
> Sent from my iPhone (which has a lousy user interface that makes it
> all too easy for a clumsy oaf like me to touch "Send" before I'm done)...
>
> On 23 Dec 2010, at 11:07, Phil Harman <phil.harman at gmail.com> wrote:
>
>> Great question. In "good enough" computing, beauty is in the eye of
>> the beholder. My home NAS appliance uses mirrored IDE and SATA
>> drives without a dedicated ZIL
>
> device. And for my home SMB and NFS, that's good enough.
>
> I'm sure that even a 7200rpm SATA ZIL would improve things in my case.
>
> The random I/O requirement for the ZIL is discussed by Adam (and
> Chris) here ...
>
>> http://dtrace.org/blogs/ahl/2010/11/15/zil-analysis-from-chris-george/
>
> What I find most encouraging is this statement:
>
>> "if HDDs and commodity SSDs continue to be target ZIL devices, ZFS
>> could and should do more to ensure that writes are sequential."
>
> It's not broken, but it is suboptimal, and fixable (apparently) ;)

Yeah - I read through Christopher's article already, and it clearly shows the shortcomings of current flash SSDs as ZIL devices. On the other hand, if you were using a DDRdrive as a ZIL device, you'd pretty much lock that zpool to that particular host, since you can't easily move the zpool onto another host without moving the DDRdrive as well, or without detaching the ZIL device(s) from the zpool - which I find a little bit odd.

I am not actually running a SOHO scenario with my ZFS file server, since it has to serve up to 200 users on up to 200 ZFS volumes in one zpool, but the actual data traffic is not that high either. The traffic is more like small peaks when someone writes back to a file.

Cheers,
budy
On 23 Dec 2010, at 11:53, Stephan Budach <stephan.budach at jvm.de> wrote:

> Yeah - I read through Christopher's article already, and it clearly shows the shortcomings of current flash SSDs as ZIL devices. On the other hand, if you were using a DDRdrive as a ZIL device, you'd pretty much lock that zpool to that particular host, since you can't easily move the zpool onto another host without moving the DDRdrive as well, or without detaching the ZIL device(s) from the zpool - which I find a little bit odd.
>
> I am not actually running a SOHO scenario with my ZFS file server, since it has to serve up to 200 users on up to 200 ZFS volumes in one zpool, but the actual data traffic is not that high either. The traffic is more like small peaks when someone writes back to a file.
>
> Cheers,
> budy

Well, your proposed config will improve what each user sees during their own private burst, and short stroking can only improve things in the worst-case scenario (although it may not be measurable). So why not give it a spin and report back to the list in the new year?

All the best,
Phil
On 23.12.10 13:09, Phil Harman wrote:
> Well, your proposed config will improve what each user sees during
> their own private burst, and short stroking can only improve things in
> the worst-case scenario (although it may not be measurable). So why
> not give it a spin and report back to the list in the new year?

Ha ha - if no one else has some more input on this, I will definitely give it a try in January.

Cheers,
budy
On Thu, Dec 23 at 11:25, Stephan Budach wrote:
> Hi,
>
> as I have learned from the discussion about which SSD to use as ZIL
> drives, I stumbled across this article, which discusses short stroking for
> increasing IOPS on SAS and SATA drives:
>
> http://www.tomshardware.com/reviews/short-stroking-hdd,2157.html
>
> Now, I am wondering if using a mirror of such 15k SAS drives would be a
> good-enough fit for a ZIL on a zpool that is mainly used for file services
> via AFP and SMB.
> I'd particularly like to know whether someone has already used such a
> solution and how it has worked out.

Haven't personally used it, but the worst-case steady-state IOPS of
the Vertex 2 EX, from the DDRdrive presentation, is 6k IOPS assuming a
full-pack random workload.

To achieve that through SAS disks in the same workload, you'll
probably spend significantly more money, and it will consume a LOT more
space and power.

According to that Tom's article, a typical 15k SAS enterprise drive is
in the 600 IOPS ballpark when short-stroked and consumes about 15W
active. Thus you're going to need ten of these devices to equal the
"degraded" steady-state IOPS of an SSD. I just don't think the math
works out. At that point, you're probably better off not having a
dedicated ZIL, instead of burning 10 slots and 150W.

--eric

--
Eric D. Mudama
edmudama at mail.bounceswoosh.org
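[A quick back-of-the-envelope check of the numbers quoted above - a sketch only; the 6,000 IOPS, 600 IOPS and 15W figures are the ones cited in this thread, not independent measurements:]

    # short-stroked 15k SAS vs. one SSD as a dedicated log, per the figures above
    ssd_iops=6000        # worst-case steady-state IOPS quoted for the Vertex 2 EX
    hdd_iops=600         # short-stroked 15k SAS ballpark from the Tom's article
    hdd_watts=15         # active power per 15k SAS drive

    drives=$(( (ssd_iops + hdd_iops - 1) / hdd_iops ))   # round up
    echo "drives needed: $drives, power: $(( drives * hdd_watts )) W"
    # -> drives needed: 10, power: 150 W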
On 23.12.10 19:05, Eric D. Mudama wrote:
> Haven't personally used it, but the worst-case steady-state IOPS of
> the Vertex 2 EX, from the DDRdrive presentation, is 6k IOPS assuming a
> full-pack random workload.
>
> To achieve that through SAS disks in the same workload, you'll
> probably spend significantly more money, and it will consume a LOT more
> space and power.
>
> According to that Tom's article, a typical 15k SAS enterprise drive is
> in the 600 IOPS ballpark when short-stroked and consumes about 15W
> active. Thus you're going to need ten of these devices to equal the
> "degraded" steady-state IOPS of an SSD. I just don't think the math
> works out. At that point, you're probably better off not having a
> dedicated ZIL, instead of burning 10 slots and 150W.

Good - that was actually the information I had been missing. So I will go with the Vertex 2 EX instead and save myself the hassle of short stroking entirely.

Thanks, and Merry Christmas to all on this list.

Cheers,
budy
On Thu, Dec 23, 2010 at 11:25:43AM +0100, Stephan Budach wrote:
> as I have learned from the discussion about which SSD to use as ZIL
> drives, I stumbled across this article, which discusses short
> stroking for increasing IOPS on SAS and SATA drives:

There was a thread on this a while back. I forget when, or the subject.

But yes, you could even use 7200 rpm drives to make a fast ZIL device. The trick is the on-disk format, and the pseudo-device driver that you would have to layer on top of the actual device(s) to get such performance. The key is that sustained sequential I/O rates for disks can be quite large, so if you organize the disk in a log form and use the outer tracks only, then you can pretend to have awesome write IOPS for a disk (but NOT read IOPS).

But it's not necessarily as cheap as you might think. You'd be making very inefficient use of an expensive disk (in the case of a SAS 15k rpm disk), or disks, and if plural then you are also using more ports (oops). Disks used this way probably also consume more power than SSDs (OK, this part of my analysis is very iffy), and you still need to do something about ensuring syncs to disk on power failure (such as just disabling the cache on the disk, but this would lower performance, increasing the cost).

When you factor all the costs in, I suspect you'll find that SSDs are priced reasonably well. That's not to say that one could not put together a disk-based log device that could eat SSDs' lunch, but SSD prices would then just come down to match that -- and you can expect SSD prices to come down anyway, as with any new technology.

I don't mean to discourage you, just to point out that there's plenty of work to do to make "short-stroked disks as ZILs" a workable reality, while the economics of doing that work versus waiting for SSD prices to come down don't seem appealing.

Caveat emptor: my analysis is off-the-cuff; I could be wrong.

Nico
--
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Stephan Budach
>
> Now, I am wondering if using a mirror of such 15k SAS drives would be a
> good-enough fit for a ZIL on a zpool that is mainly used for file services
> via AFP and SMB.

For supporting AFP and SMB, most likely you would be perfectly happy simply disabling the ZIL. You will get maximum performance... even higher than with the world's fastest SSD or DDRdrive or any other type of storage device as a dedicated log. To decide whether this is OK for you, be aware of the argument *against* disabling the ZIL:

In the event of an ungraceful crash, with ZIL enabled, you lose up to 30 sec of async data, but you do not lose any sync data.

In the event of an ungraceful crash, with ZIL disabled, you lose up to 30 sec of async and sync data.

In neither case do you have data corruption or a corrupt filesystem. The only question is about 30 seconds of sync data. You must protect this type of data if you're running a database, an iSCSI target for virtual hosts, or some other types of data services... but if you're doing just AFP and SMB, it's pretty likely you don't need to worry about it.
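[For reference, the usual ways to disable the ZIL - a sketch only; whether the per-dataset sync property is available depends on how recent the ZFS build is, and the pool/filesystem names below are placeholders:]

    # newer builds: per-dataset property, takes effect immediately
    zfs set sync=disabled tank/shares
    zfs get sync tank/shares

    # older builds: system-wide tunable in /etc/system, needs a reboot
    echo "set zfs:zil_disable = 1" >> /etc/system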
2010/12/24 Edward Ned Harvey <opensolarisisdeadlongliveopensolaris at nedharvey.com>:
> For supporting AFP and SMB, most likely you would be perfectly happy simply
> disabling the ZIL. You will get maximum performance... even higher than with
> the world's fastest SSD or DDRdrive or any other type of storage device as a
> dedicated log. To decide whether this is OK for you, be aware of the argument
> *against* disabling the ZIL:
>
> In the event of an ungraceful crash, with ZIL enabled, you lose up to 30 sec
> of async data, but you do not lose any sync data.
>
> In the event of an ungraceful crash, with ZIL disabled, you lose up to 30
> sec of async and sync data.
>
> In neither case do you have data corruption or a corrupt filesystem. The
> only question is about 30 seconds of sync data. You must protect this type
> of data if you're running a database, ...

With Netatalk for AFP he _is_ running a database: any AFP server needs to maintain a consistent mapping between _not reused_ catalog node IDs (CNIDs) and filesystem objects. Luckily for Apple, HFS[+] and their Cocoa/Carbon APIs provide such a mapping, making direct use of HFS+ CNIDs. Unfortunately, most UNIX filesystems reuse inodes and have no API for mapping inodes to filesystem objects. Therefore all AFP servers running on non-Apple OSen maintain a database providing this mapping; in the case of Netatalk it's `cnid_dbd`, using a BerkeleyDB database.

-f
> From: Frank Lahm [mailto:franklahm at googlemail.com]
>
> With Netatalk for AFP he _is_ running a database: any AFP server needs
> to maintain a consistent mapping between _not reused_ catalog node IDs
> (CNIDs) and filesystem objects. Luckily for Apple, HFS[+] and their
> Cocoa/Carbon APIs provide such a mapping, making direct use of HFS+
> CNIDs. Unfortunately, most UNIX filesystems reuse inodes and have no API
> for mapping inodes to filesystem objects. Therefore all AFP servers
> running on non-Apple OSen maintain a database providing this mapping;
> in the case of Netatalk it's `cnid_dbd`, using a BerkeleyDB database.

Don't all of those concerns disappear in the event of a reboot?

If you stop AFP, you could completely obliterate the BDB database, and restart AFP, and functionally continue from where you left off. Right?
On Dec 23, 2010, at 2:25 AM, Stephan Budach wrote:
> as I have learned from the discussion about which SSD to use as ZIL drives, I stumbled across this article, which discusses short stroking for increasing IOPS on SAS and SATA drives:
>
> http://www.tomshardware.com/reviews/short-stroking-hdd,2157.html
>
> Now, I am wondering if using a mirror of such 15k SAS drives would be a good-enough fit for a ZIL on a zpool that is mainly used for file services via AFP and SMB.

SMB does not create much of a synchronous load. I haven't explored AFP directly, but if they do use Berkeley DB, then we do have a lot of experience tuning ZFS for Berkeley DB performance.

> I'd particularly like to know whether someone has already used such a solution and how it has worked out.

Latency is what matters most. While there is a loose relationship between IOPS and latency, you really want low latency. For 15krpm drives, the average rotational latency is 2ms for zero seeks. A decent SSD will beat that by an order of magnitude.
 -- richard
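[The 2ms figure is just the average rotational latency of a 15k rpm spindle, i.e. half a revolution - a quick sketch of the arithmetic:]

    # average rotational latency = time for half a revolution
    rpm=15000
    awk -v rpm=$rpm 'BEGIN { printf "avg rotational latency: %.1f ms\n", 0.5 * 60000 / rpm }'
    # -> avg rotational latency: 2.0 ms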
On 24/12/2010 18:21, Richard Elling wrote:
> Latency is what matters most. While there is a loose relationship
> between IOPS and latency, you really want low latency. For 15krpm
> drives, the average rotational latency is 2ms for zero seeks. A decent
> SSD will beat that by an order of magnitude.

And the closer you get to the CPU, the lower the latency. For example, the DDRdrive X1 is yet another order of magnitude faster because it sits directly on the PCI bus, without the overhead of the SAS protocol.

Yet the humble old 15k drive with 2ms sequential latency is still an order of magnitude faster than a busy drive delivering 20ms latencies under a random workload.
On Dec 24, 2010, at 1:21 PM, Richard Elling <richard.elling at gmail.com> wrote:

> Latency is what matters most. While there is a loose relationship between IOPS
> and latency, you really want low latency. For 15krpm drives, the average latency
> is 2ms for zero seeks. A decent SSD will beat that by an order of magnitude.

Actually, I'd say that latency has a direct relationship to IOPS, because it's the time it takes to perform an IO that determines how many IOs Per Second can be performed.

Ever notice how storage vendors list their max IOPS for 512-byte sequential IO workloads and sustained throughput for 1MB+ sequential IO workloads? Only SSD makers list their random IOPS numbers and their 4K IO workload numbers.

-Ross
On Dec 25, 2010, at 5:37 PM, Ross Walker wrote:
> Actually, I'd say that latency has a direct relationship to IOPS, because it's the time it takes to perform an IO that determines how many IOs Per Second can be performed.

That is only true when there is one queue and one server (in the queueing context). This is not the case where there are multiple concurrent I/Os that can be completed out of order by multiple servers working in parallel (eg. disk subsystems).

For an extreme example, the Sun Storage F5100 Array specifications show 1.6 million random read IOPS @ 4KB. But instead of an average latency of 625 nanoseconds, it shows an average latency of 0.378 milliseconds. The analogy we've used in parallel computing for many years is "nine women cannot make a baby in one month."

> Ever notice how storage vendors list their max IOPS for 512-byte sequential IO workloads and sustained throughput for 1MB+ sequential IO workloads? Only SSD makers list their random IOPS numbers and their 4K IO workload numbers.

The vendor will present the number that makes them look best, often without regard for practical application... the "curse of marketing" :-)
 -- richard
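[Little's Law makes the point concrete: implied concurrency = throughput x average latency. A sketch using the F5100 figures quoted above:]

    # outstanding I/Os = IOPS * average latency (in seconds)
    awk 'BEGIN { iops = 1600000; lat_s = 0.000378;
                 printf "implied outstanding I/Os: %.0f\n", iops * lat_s }'
    # -> about 605 I/Os in flight, i.e. many servers working in parallel, not one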
On Sat, Dec 25, 2010 at 08:37:42PM -0500, Ross Walker wrote:
> Actually, I'd say that latency has a direct relationship to IOPS, because it's the time it takes to perform an IO that determines how many IOs Per Second can be performed.

Assuming you have enough synchronous writes, and that you can organize them so as to keep the drive at max sustained sequential write bandwidth, then IOPS == bandwidth / logical I/O size. Latency doesn't enter into that formula.

Latency does remain, though, and will be noticeable to apps doing synchronous operations. Thus with, say, 100MB/s sustained sequential write bandwidth and, say, 2KB average ZIL entries, you'd get 51200 logical sync write operations per second. The latency for each such operation would still be 2ms (or whatever it is for the given disk). Since you'd likely have to batch many ZIL writes, you'd end up making the latency for some ops longer than 2ms and others shorter, but if you can keep the drive at max sustained sequential write bandwidth then the average latency will be 2ms.

SSDs are clearly a better choice.

BTW, a parallelized tar would greatly help reduce the impact of high-latency open()/close() (over NFS) operations...

Nico
--
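[The 51200/s figure is just that formula evaluated - a sketch, treating 1 MB as 1024 KB:]

    # IOPS == bandwidth / logical I/O size (latency does not appear)
    awk 'BEGIN { bw_kb_s = 100 * 1024; io_kb = 2;
                 printf "logical sync writes/s: %d\n", bw_kb_s / io_kb }'
    # -> logical sync writes/s: 51200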
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Nicolas Williams
>
> > Actually, I'd say that latency has a direct relationship to IOPS, because it's the
> > time it takes to perform an IO that determines how many IOs Per Second
> > can be performed.
>
> Assuming you have enough synchronous writes, and that you can organize
> them so as to keep the drive at max sustained sequential write
> bandwidth, then IOPS == bandwidth / logical I/O size. Latency doesn't

Ok, what we've hit here is two people using the same word to talk about different things. Apples to oranges, as it were. Both meanings of "IOPS" are OK, but context is everything.

There are drive random IOPS, which depend on latency and seek time, and there are also measured random IOPS above the filesystem layer, which are not always related to latency or seek time, as described above.
On Dec 27, 2010, at 6:06 PM, Edward Ned Harvey wrote:
> Ok, what we've hit here is two people using the same word to talk about
> different things. Apples to oranges, as it were. Both meanings of "IOPS"
> are OK, but context is everything.
>
> There are drive random IOPS, which depend on latency and seek time,
> and there are also measured random IOPS above the filesystem layer,
> which are not always related to latency or seek time, as described above.

The small, random read model can assume no cache hits. Adding caches makes the model too complicated for simple analysis, and arguably too complicated for modeling at all. For such systems, empirical measurements are possible, but can be overly optimistic. For example, it is relatively trivial to demonstrate 500,000 small, random read IOPS at the application using a file system that caches to RAM. Achieving that performance level for the general case is much less common.
 -- richard
On Mon, Dec 27, 2010 at 09:06:45PM -0500, Edward Ned Harvey wrote:
> Ok, what we've hit here is two people using the same word to talk about
> different things. Apples to oranges, as it were. Both meanings of "IOPS"
> are OK, but context is everything.
>
> There are drive random IOPS, which depend on latency and seek time,
> and there are also measured random IOPS above the filesystem layer,
> which are not always related to latency or seek time, as described above.

Clearly the application cares about _synchronous_ operations that are meaningful to it. In the case of an NFS application, that would be open() with O_CREAT (and particularly O_EXCL), close(), fsync() and so on. For a POSIX (but not NFS) application the number of synchronous operations is smaller. The rate of asynchronous operations is less important to the application because those are subject to caching, thus less predictable.

But to the filesystem, IOPS are not just about synchronous I/O but about how many distinct I/O operations can be completed per unit of time.

I tried to keep this clear; sorry for any confusion.

Nico
--
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Edward Ned Harvey
>
> Ok, what we've hit here is two people using the same word to talk about
> different things. Apples to oranges, as it were. Both meanings of "IOPS"
> are OK, but context is everything.
>
> There are drive random IOPS, which depend on latency and seek time,
> and there are also measured random IOPS above the filesystem layer,
> which are not always related to latency or seek time, as described above.

In any event, the relevant points are:

The question of IOPS here is relevant to the conversation because of the ZIL dedicated log. If you had advanced short-stroking to get the write latency of a log device down to zero, then it could compete against an SSD for purposes of a log device, but nobody seems to believe such technology currently exists, and it certainly couldn't compete against an SSD for random reads. (The ZIL log is the only situation I know of where the write performance of a drive matters and read performance does not matter.)

If using ZFS for AFP (and consequently BDB)... If you disable the ZIL you will have maximum performance, but maybe you're not comfortable with that because you're not convinced of stability with the ZIL disabled, or for other reasons.

* If you put your BDB or ZIL on a dedicated spindle device, it will perform better than having no dedicated device, but the difference might be anything from 1x to 10x, depending on where your bottlenecks are. That is, no improvement is guaranteed, but probably you get at least a little bit.

* If you put your BDB or ZIL on an SSD dedicated log device, it will perform still better, and again, the difference could be anywhere from 1x to 10x depending on your bottlenecks.

* If you disable your ZIL, it will perform still better, and again, the difference could be anywhere from 1x to 10x. Realistically, at some point you'll hit a network bottleneck, and you won't notice the improved performance.

If you're just doing small numbers of large files, none of the above will probably be noticeable, because in that case latency is pretty much irrelevant. But assuming you have at least a bunch of reasonably small files, IMHO that threshold is at the SSD, because the latency of the SSD is insignificant compared to the latency of the network. But even with short-stroking getting the latency down to 2ms, that's still significant compared to network latency, so there's probably still room for improvement over the short-stroking techniques. At least, until somebody creates a more advanced short-stroking which gets latency down to near-zero, if that will ever happen.
On Tue, 28 Dec 2010, Edward Ned Harvey wrote:
>
> In any event, the relevant points are:
>
> The question of IOPS here is relevant to the conversation because of the ZIL
> dedicated log. If you had advanced short-stroking to get the write latency
> of a log device down to zero, then it could compete against an SSD for purposes
> of a log device, but nobody seems to believe such technology currently
> exists, and it certainly couldn't compete against an SSD for random reads.
> (The ZIL log is the only situation I know of where the write performance of a
> drive matters and read performance does not matter.)

It seems that you may be confused. For the ZIL, the drive's rotational latency (based on RPM) is the dominating factor, and not the lateral head seek time on the media. In this case, the "short-stroking" you are talking about does not help any. The ZIL is already effectively "short-stroking" since it writes in order.

The (possibly) worthy optimizations I have heard about are writing the log data in a different pattern on disk (via a special device driver), with the goal that when the drive sync request comes in, the drive is quite likely to be able to write immediately. Since such optimizations are quite device- and write-load-dependent, it is not worthwhile for a large company to develop the feature (but it would make for an interesting project).

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
> From: Bob Friesenhahn [mailto:bfriesen at simple.dallas.tx.us]
> Sent: Tuesday, December 28, 2010 9:23 PM
>
> It seems that you may be confused. For the ZIL, the drive's rotational
> latency (based on RPM) is the dominating factor, and not the lateral
> head seek time on the media. In this case, the "short-stroking" you
> are talking about does not help any. The ZIL is already effectively
> "short-stroking" since it writes in order.

Nope. I'm not confused at all. I'm making a distinction between "short stroking" and "advanced short stroking", where simple "short stroking" does as you said - it eliminates the head seek time but remains susceptible to rotational latency. As you said, the ZIL already effectively accomplishes that end result, provided a dedicated spindle disk for the log device, but it does not do that if your ZIL is on the pool storage. What I'm calling "advanced short stroking" are techniques that effectively eliminate, or minimize, both seek and latency, down to zero or near-zero. What I'm calling "advanced short stroking" doesn't exist as far as I know, but is theoretically possible through either special disk hardware or special drivers.
You do seem to misunderstand ZIL.

ZIL is quite simply write cache, and using a short-stroked rotating drive is never going to provide a performance increase that is worth talking about; more importantly, ZIL was designed to be used with a RAM/Solid State Disk.

We use SATA2 HyperDrive5 RAM disks in mirrors and they work well and are far cheaper than STEC or other enterprise SSDs, and have none of the issues related to TRIM...

Highly recommended... ;-)

http://www.hyperossystems.co.uk/

Kevin

On 29 December 2010 13:40, Edward Ned Harvey <opensolarisisdeadlongliveopensolaris at nedharvey.com> wrote:

> Nope. I'm not confused at all. I'm making a distinction between "short
> stroking" and "advanced short stroking", where simple "short stroking" does
> as you said - it eliminates the head seek time but remains susceptible to
> rotational latency. As you said, the ZIL already effectively accomplishes
> that end result, provided a dedicated spindle disk for the log device, but it
> does not do that if your ZIL is on the pool storage. What I'm calling
> "advanced short stroking" are techniques that effectively eliminate, or
> minimize, both seek and latency, down to zero or near-zero. What I'm calling
> "advanced short stroking" doesn't exist as far as I know, but is
> theoretically possible through either special disk hardware or special
> drivers.
HyperDrive5 = ACard ANS9010

I have personally been wanting to try one of these for some time as a ZIL device.

On 12/29/2010 06:35 PM, Kevin Walker wrote:
> You do seem to misunderstand ZIL.
>
> ZIL is quite simply write cache, and using a short-stroked rotating
> drive is never going to provide a performance increase that is worth
> talking about; more importantly, ZIL was designed to be used with a
> RAM/Solid State Disk.
>
> We use SATA2 HyperDrive5 RAM disks in mirrors and they work well
> and are far cheaper than STEC or other enterprise SSDs, and have none
> of the issues related to TRIM...
>
> Highly recommended... ;-)
>
> http://www.hyperossystems.co.uk/
>
> Kevin
I do the same with ACARD. Works well enough.

Fred

From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-bounces at opensolaris.org] On Behalf Of Jason Warr
Sent: Thursday, December 30, 2010 8:56
To: zfs-discuss at opensolaris.org
Subject: Re: [zfs-discuss] SAS/short stroking vs. SSDs for ZIL

HyperDrive5 = ACard ANS9010

I have personally been wanting to try one of these for some time as a ZIL device.
On 12/29/2010 4:55 PM, Jason Warr wrote:
> HyperDrive5 = ACard ANS9010
>
> I have personally been wanting to try one of these for some time as a
> ZIL device.

Yes, but do remember these require a half-height 5.25" drive bay, and you really, really should buy the extra CF card for backup.

Also, stay away from the ANS-9010S with LVD SCSI interface. As (I think) Bob pointed out a long time ago, parallel SCSI isn't good for a high-IOPS interface. It (the LVD interface) will throttle long before the drive does...

I've been waiting for them to come out with a 3.5" version, one which I can plug directly into a standard 3.5" SAS/SATA hotswap bay...

And, of course, the ANS9010 is limited to the SATA2 interface speed, so it is cheaper and lower-performing (but still better than an SSD) than the DDRdrive.

--
Erik Trimble
Java System Support
Mailstop: usca22-123
Phone: x17195
Santa Clara, CA
Timezone: US/Pacific (GMT-0800)
Had not even noticed the LVD version. The biggest issue for me is not the form factor but how hard it would be to get the client I work for to accept them in the environment, given support issues.

----- Reply message -----
From: "Erik Trimble" <erik.trimble at oracle.com>
Date: Wed, Dec 29, 2010 19:52
Subject: [zfs-discuss] SAS/short stroking vs. SSDs for ZIL
To: "Jason Warr" <jason at warr.net>
Cc: <zfs-discuss at opensolaris.org>
> From: Kevin Walker [mailto:indigoskywalker at gmail.com]
>
> You do seem to misunderstand ZIL.

Wrong.

> ZIL is quite simply write cache

The ZIL is not simply write cache, but it enables certain types of operations to use write cache which otherwise would have been ineligible. The Intent Log is where ZFS immediately writes sync-write requests, so it can unblock the process which called write(). Once the data has been committed to nonvolatile ZIL storage, the process can continue processing, and ZFS can treat the write requests as async writes. Which means, after ZFS has written the ZIL, the data is able to stay a while in the RAM write buffer along with all the async writes. Which means ZFS is able to aggregate and optimize all the writes for best performance. This means the ZIL is highly sensitive to access times (seek + latency).

> using a short stroked rotating drive is
> never going to provide a performance increase that is worth talking about

If you don't add a dedicated log device, then the ZIL utilizes blocks from the main storage pool, and all sync writes suddenly get higher priority than all the queued reads and async writes. If you have a busy storage pool, your sync writes might see something like 20ms access times (seek + latency) before they can hit nonvolatile storage, and every time this happens, some other operation gets delayed.

If you add a spindle-drive dedicated log device, then that drive is always idle except when writing ZIL for sync writes, and also the head will barely move over the platter, because all the ZIL blocks will be clustered tightly together. So the ZIL might require typically 2ms or 3ms access times (negligible seek, or 1ms seek + 2ms latency), which is an order of magnitude better than before. Plus the sync writes in this case don't take away performance from the main pool reads and writes.

If you replace your spindle drive with an SSD, then you get another order of magnitude smaller access time. (Tens of thousands of IOPS effectively compares to <<1ms access time per op.)

If you disable your ZIL completely, then you get another order of magnitude smaller access time. (Some ns to think about putting the data directly into the RAM write buffer and entirely bypassing the ZIL.)

> and more importantly ZIL was designed to be used with a RAM/Solid State
> Disk.

I hope you mean NVRAM or battery-backed RAM of some kind. Because if you use volatile RAM for the ZIL, then you have disabled the ZIL from being able to function correctly. The ZFS Best Practices Guide specifically mentions "Better performance might be possible by using [...], or even a dedicated spindle disk."
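[For anyone following along, the dedicated log configurations discussed above are set up with zpool(1M) - a minimal sketch, where the pool and device names are placeholders for your own:]

    # add a mirrored dedicated log (slog) from two devices
    zpool add tank log mirror c4t0d0 c4t1d0

    # verify the log vdev and watch its latency/utilization
    zpool status tank
    zpool iostat -v tank 5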
2010/12/24 Edward Ned Harvey <opensolarisisdeadlongliveopensolaris at nedharvey.com>:
> Don't all of those concerns disappear in the event of a reboot?
>
> If you stop AFP, you could completely obliterate the BDB database, and restart AFP, and functionally continue from where you left off. Right?

No. Apple's APIs provide semantics by which you can reference filesystem objects by their parent directory CNID + object name. More importantly in this context: these references can be stored, retrieved and reused, e.g. Finder aliases, Adobe InDesign and many more applications use these semantics to store references to files.

If you nuke the CNID database, then upon re-enumeration of the volumes all filesystem objects are likely to be assigned new and different CNIDs, and thus all stored references are broken.

-f
> From: Frank Lahm [mailto:franklahm at googlemail.com]
>
> No. Apple's APIs provide semantics by which you can reference
> filesystem objects by their parent directory CNID + object name. More
> importantly in this context: these references can be stored, retrieved
> and reused, e.g. Finder aliases, Adobe InDesign and many more
> applications use these semantics to store references to files.
> If you nuke the CNID database, then upon re-enumeration of the volumes
> all filesystem objects are likely to be assigned new and different CNIDs,
> and thus all stored references are broken.

Just like... if you shut down your Apple OS X AFP file server, move all the files to a new, upgraded file server, reassign the old IP address and DNS name to the new server, and enable AFP file services on the new file server.

How do people handle the broken-links issue when they upgrade their Apple server? If they don't bother doing anything about it, I would conclude it's no big deal. If there is instead some process you're supposed to follow when you upgrade/replace your Apple AFP file server, I wonder if that process is applicable to the present thread of discussion as well.
On 02.01.11 16:52, Edward Ned Harvey wrote:
> Just like... if you shut down your Apple OS X AFP file server, move all the files to a new, upgraded file server, reassign the old IP address and DNS name to the new server, and enable AFP file services on the new file server.
>
> How do people handle the broken-links issue when they upgrade their Apple server? If they don't bother doing anything about it, I would conclude it's no big deal. If there is instead some process you're supposed to follow when you upgrade/replace your Apple AFP file server, I wonder if that process is applicable to the present thread of discussion as well.

Well, on the Apple platform HFS+ (the Mac's default fs) takes care of that, so you'd never have to worry about this issue there. On the *nix side of things, when running Netatalk, you have to store this information in some kind of extra database, which is BDB in this case.

Initially, I only wanted to check what hardware to get for my ZIL, and by now I have already decided - and ordered - two Vertex 2 EX 50GB SSDs to handle the ZIL for my zpool, since I am already serving 50 AFP sharepoints which are accessed by 120 clients. The number of sharepoints will eventually rise to 250 and the number of clients to 450, and that would cause some real random workload on the zpool and the ZIL, I guess.

The technical discussion about short stroking is nevertheless very interesting. ;)