There have been many threads in the past asking about ZIL devices. Most of
them end up recommending the Intel X25-E as an adequate device.
Nevertheless, there is always the warning about it not heeding cache
flushes. But what use is a ZIL that ignores cache flushes? If I'm willing
to tolerate that (I'm not), I can just as well take a mechanical drive and
force ZFS not to issue cache flushes to it. In this case it can easily
compete with an SSD in regard to IOPS and bandwidth. In case of a power
failure I will likely lose about as many writes as I do with SSDs, a few
milliseconds' worth. So why buy an SSD for ZIL at all?
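For reference, the "force ZFS not to issue cache flushes" scenario above
corresponds to the zfs_nocacheflush tunable on (Open)Solaris of that era.
A minimal sketch; note it is system-wide, not per-device, and it trades
integrity for speed exactly as described above:

    # /etc/system: tell ZFS to stop issuing cache-flush requests.
    # Unsafe unless every device's write cache is power-protected.
    set zfs:zfs_nocacheflush = 1

    # Or live via mdb (takes effect immediately, not persistent):
    echo "zfs_nocacheflush/W 1" | mdb -kw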
On Tue, 15 Jun 2010, Arne Jansen wrote:

> In case of a power failure I will likely lose about as many writes
> as I do with SSDs, a few milliseconds' worth.

I agree with your concerns, but the data loss may span as much as 30
seconds rather than just a few milliseconds.

Using an SSD as the ZIL allows ZFS to turn a synchronous write into a
normal batched async write which is scheduled for the next TXG. ZFS
intentionally postpones writes.

Without the SSD, ZFS needs to write to an intent log in the main pool
(consuming precious IOPS) or write directly to the main pool (consuming
precious response latency). Battery-backed RAM in the adapter card or
storage array can do almost as well as the SSD as long as the amount of
data does not overrun the limited write cache.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,     http://www.GraphicsMagick.org/
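The mechanism described here, sync writes landing on a dedicated log
device instead of in the main pool, is set up like this (a sketch; pool
and device names are placeholders):

    # Add a separate intent log (slog) device to an existing pool:
    zpool add tank log c4t0d0

    # Better: mirror the slog, so a single log-device failure cannot
    # lose sync writes that have not yet reached the main pool.
    zpool add tank log mirror c4t0d0 c4t1d0

    # Confirm the log vdev shows up:
    zpool status tank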
> So why buy SSD for ZIL at all?

For the record, not all SSDs "ignore cache flushes". There are at least
two SSDs sold today that guarantee synchronous write semantics: the
Sun/Oracle LogZilla and the DDRdrive X1. Also, I believe it is more
accurate to describe the root cause as not power-protecting on-board
volatile caches. The X25-E does implement the ATA FLUSH CACHE command,
but does not have the power protection required to avoid transaction
(data) loss.

Best regards,

Christopher George
Founder/CTO
www.ddrdrive.com
On 15/06/2010 23:46, "Christopher George" <cgeorge at ddrdrive.com> wrote:

>> So why buy SSD for ZIL at all?
>
> For the record, not all SSDs "ignore cache flushes". There are at least
> two SSDs sold today that guarantee synchronous write semantics: the
> Sun/Oracle LogZilla and the DDRdrive X1.

Often forgotten (most probably due to the price) are the latest Pliant
SSDs.

--
Przem
Bob Friesenhahn wrote:

> On Tue, 15 Jun 2010, Arne Jansen wrote:
>
>> In case of a power failure I will likely lose about as many writes as
>> I do with SSDs, a few milliseconds' worth.
>
> I agree with your concerns, but the data loss may span as much as 30
> seconds rather than just a few milliseconds.

Wait, I'm talking about using an SSD for ZIL vs. using a dedicated hard
drive for ZIL which is configured to ignore cache flushes. Are you saying
I can also lose 30 seconds if I use a badly behaving SSD?

> Using an SSD as the ZIL allows ZFS to turn a synchronous write into a
> normal batched async write which is scheduled for the next TXG. ZFS
> intentionally postpones writes.
>
> Without the SSD, ZFS needs to write to an intent log in the main pool
> (consuming precious IOPS) or write directly to the main pool (consuming
> precious response latency). Battery-backed RAM in the adapter card or
> storage array can do almost as well as the SSD as long as the amount of
> data does not overrun the limited write cache.
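For context on the 30-second figure: it matches the transaction-group
sync interval of 2010-era builds. A quick way to inspect it (a sketch,
assuming the zfs_txg_timeout kernel tunable of that period):

    # Print the current txg sync interval in seconds (default was 30
    # at the time); async data buffered within one interval is what
    # is at risk on power loss.
    echo "zfs_txg_timeout/D" | mdb -k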
Not to forget the Deneva Reliability disks from OCZ that just got
released. See
http://www.oczenterprise.com/details/ocz-deneva-reliability-2-5-emlc-ssd.html

"The Deneva Reliability family features built-in supercapacitor (SF-1500
models) that acts as a temporary power backup in the event of sudden power
loss, and enables the drive to complete its task ensuring no data loss."

-Arve

> -----Original Message-----
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Christopher George
> Sent: 16 June 2010 00:47
> To: zfs-discuss at opensolaris.org
> Subject: Re: [zfs-discuss] SSDs adequate ZIL devices?
>
> > So why buy SSD for ZIL at all?
>
> For the record, not all SSDs "ignore cache flushes". There are at least
> two SSDs sold today that guarantee synchronous write semantics: the
> Sun/Oracle LogZilla and the DDRdrive X1.
Christopher George wrote:

>> So why buy SSD for ZIL at all?
>
> For the record, not all SSDs "ignore cache flushes". There are at least
> two SSDs sold today that guarantee synchronous write semantics: the
> Sun/Oracle LogZilla and the DDRdrive X1. Also, I believe it is more

LogZilla? Are these those STEC thingies? For the price of those I can buy
a battery-backed RAID controller and a few conventional drives. For ZIL
this will probably do better at a lower price than STEC. The DDRdrive I
wouldn't call a flash drive but rather an NVRAM card. NVRAM cards are the
proper way to go for ZIL. Someone should build one for < $600; PCIe x1
would be sufficient. Xilinx has some nice Spartans :)

> accurate to describe the root cause as not power-protecting on-board
> volatile caches. The X25-E does implement the ATA FLUSH CACHE
> command, but does not have the power protection required to avoid
> transaction (data) loss.

You could say the same about hard drives. They also just need proper
protection for their volatile cache...

--Arne
Arve Paalsrud wrote:

> Not to forget the Deneva Reliability disks from OCZ that just got
> released. See
> http://www.oczenterprise.com/details/ocz-deneva-reliability-2-5-emlc-ssd.html
>
> "The Deneva Reliability family features built-in supercapacitor (SF-1500
> models) that acts as a temporary power backup in the event of sudden
> power loss, and enables the drive to complete its task ensuring no data
> loss."

This one looks really interesting. No price to be found though, and no
detail about how many write cycles they can stand.

--Arne
On Wed, June 16, 2010 03:03, Arne Jansen wrote:

> Christopher George wrote:
>
>> For the record, not all SSDs "ignore cache flushes". There are at least
>> two SSDs sold today that guarantee synchronous write semantics: the
>> Sun/Oracle LogZilla and the DDRdrive X1. Also, I believe it is more
>
> LogZilla? Are these those STEC thingies? For the price of those I can
> buy a battery-backed RAID controller and a few conventional drives.
> For ZIL this will probably do better at a lower price than STEC.

I'm not sure you'd get the same latency and IOPS with disk that you can
with a good SSD:

http://blogs.sun.com/brendan/entry/slog_screenshots

You're also talking about using more power (and cooling), and more moving
parts, which can affect reliability numbers. TANSTAAFL.

Towards the bottom of that post Brendan Gregg configures eight Logzillas
(which I'm sure cost as much as a small car), and got 114,000 synchronous
write ops/sec over NFS, 85% of which were done in under 1.21 ms. I'm not
sure how many spindles you'd need to purchase to get numbers like that in
a more "traditional" configuration.

Whether it's worth the cash is another matter entirely.
I've very infrequently seen the RAMSAN devices mentioned here, probably
due to price. However, a long time ago I think I remember someone
suggesting a build-it-yourself RAMSAN.

Where is the downside of one or two OpenSolaris boxes with a whole lot of
RAM (and/or SSDs) exporting either RAM disks or zvols out over iSCSI,
FCoE, or direct FC (can OpenSolaris do that?). If the RAM and/or SSDs (or
even HDs) were large enough, this box might be able to serve several other
ZFS servers. A dedicated network, or direct connections if there are
enough ports, should keep the net from being a bottleneck. A sub-$100 UPS
(or two) could protect the whole thing.

I'm sure I'm missing something, but I'm not seeing it at the moment.
Anyone else have any ideas?

-Kyle
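A rough sketch of that DIY approach with OpenSolaris-era tools
(ramdiskadm plus COMSTAR); sizes, names, and addresses are placeholders,
and the ramdisk itself is volatile, so the UPS is what stands in for a
battery:

    # On the "RAMSAN" box: create a RAM-backed disk and export it.
    ramdiskadm -a slog0 4g                # creates /dev/ramdisk/slog0
    svcadm enable stmf
    sbdadm create-lu /dev/ramdisk/slog0   # prints the LU GUID
    stmfadm add-view <GUID>               # expose the LU (here: to all hosts)
    svcadm enable -r svc:/network/iscsi/target:default
    itadm create-target

    # On each ZFS server: discover the target and use it as a slog.
    iscsiadm modify discovery --sendtargets enable
    iscsiadm add discovery-address <ramsan-ip>
    zpool add tank log <new-iscsi-device>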
David Magda wrote:

> On Wed, June 16, 2010 03:03, Arne Jansen wrote:
>> Christopher George wrote:
>>
>>> For the record, not all SSDs "ignore cache flushes". There are at
>>> least two SSDs sold today that guarantee synchronous write semantics:
>>> the Sun/Oracle LogZilla and the DDRdrive X1. Also, I believe it is
>>> more
>>
>> LogZilla? Are these those STEC thingies? For the price of those I can
>> buy a battery-backed RAID controller and a few conventional drives.
>> For ZIL this will probably do better at a lower price than STEC.
>
> I'm not sure you'd get the same latency and IOPS with disk that you can
> with a good SSD:
>
> http://blogs.sun.com/brendan/entry/slog_screenshots
>
> You're also talking about using more power (and cooling), and more
> moving parts, which can affect reliability numbers. TANSTAAFL.
>
> Towards the bottom of that post Brendan Gregg configures eight Logzillas
> (which I'm sure cost as much as a small car), and got 114,000
> synchronous write ops/sec over NFS, 85% of which were done in under
> 1.21 ms. I'm not sure how many spindles you'd need to purchase to get
> numbers like that in a more "traditional" configuration.

Please keep in mind I'm talking about usage as ZIL, not as L2ARC or main
pool. Because the ZIL issues nearly sequential writes, and thanks to the
NVRAM protection of the RAID controller, the disk can leave its write
cache enabled. This means the disk can write at essentially full speed,
meaning 150 MB/s for a 15k drive. 114,000 4k writes/s is 456 MB/s, so 3
spindles should do.

--Arne
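The arithmetic behind that claim, spelled out (decimal megabytes, 4 kB
taken as 4,000 bytes as in the post):

    # 114,000 writes/s x 4 kB per write:
    echo '114000 * 4 / 1000' | bc     # 456 (MB/s)
    # 456 MB/s / 150 MB/s per 15k drive = 3.04, i.e. ~3 spindles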
On Wed, June 16, 2010 10:44, Arne Jansen wrote:

> David Magda wrote:
>
>> I'm not sure you'd get the same latency and IOPS with disk that you can
>> with a good SSD:
>>
>> http://blogs.sun.com/brendan/entry/slog_screenshots
[...]
> Please keep in mind I'm talking about usage as ZIL, not as L2ARC or main
> pool. Because the ZIL issues nearly sequential writes, and thanks to the
> NVRAM protection of the RAID controller, the disk can leave its write
> cache enabled. This means the disk can write at essentially full speed,
> meaning 150 MB/s for a 15k drive. 114,000 4k writes/s is 456 MB/s, so 3
> spindles should do.

Yes, I understood it as suck, and that link is for ZIL. For L2ARC SSD
numbers see:

http://blogs.sun.com/brendan/entry/l2arc_screenshots
On Wed, June 16, 2010 11:02, David Magda wrote:
[...]
> Yes, I understood it as suck, and that link is for ZIL. For L2ARC SSD
> numbers see:

s/suck/such/

:)
David Magda wrote:

> On Wed, June 16, 2010 11:02, David Magda wrote:
> [...]
>> Yes, I understood it as suck, and that link is for ZIL. For L2ARC SSD
>> numbers see:
>
> s/suck/such/

Ah, I tried to make sense of 'suck' in the sense of 'just writing
sequentially' or something like that ;)

> :)
David Magda wrote:

> On Wed, June 16, 2010 10:44, Arne Jansen wrote:
>> David Magda wrote:
>>
>>> I'm not sure you'd get the same latency and IOPS with disk that you
>>> can with a good SSD:
>>>
>>> http://blogs.sun.com/brendan/entry/slog_screenshots
> [...]
>> Please keep in mind I'm talking about usage as ZIL, not as L2ARC or
>> main pool. [...] 114,000 4k writes/s is 456 MB/s, so 3 spindles should
>> do.
>
> Yes, I understood it as suck, and that link is for ZIL. For L2ARC SSD
> numbers see:
>
> http://blogs.sun.com/brendan/entry/l2arc_screenshots

Oops, sorry, I should at least have scrolled down a bit on your link...
Nevertheless, I don't find it improbable to reach numbers like that with
a proper RAID setup. Of course it will take more space and power. Maybe
someone has done some testing on this.
Arne Jansen wrote:

> David Magda wrote:
>> [...]
>> Yes, I understood it as suck, and that link is for ZIL. For L2ARC SSD
>> numbers see:
>>
>> http://blogs.sun.com/brendan/entry/l2arc_screenshots
>
> Oops, sorry, I should at least have scrolled down a bit on your link...
> Nevertheless, I don't find it improbable to reach numbers like that with
> a proper RAID setup. Of course it will take more space and power. Maybe
> someone has done some testing on this.

You don't need a fast disk. It just needs to be at least as large as the
write cache on your RAID controller, and that needs to be large enough to
handle your SLOG needs. For example, you can get an Areca RAID controller
with 4 GB of cache for about USD $1k. Hook any >4 GB disk to it, and you
have a _very_ fast 4 GB SLOG device with battery backup. Of course, this
is less attractive now that other, less astronomically expensive options
are becoming available.

I'm not sure how that compares in performance to an Acard ANS-9010, which
you can populate with 16 GB of RAM + flash backup for about the same
price.

-- Carson
On Wed, 16 Jun 2010, Arne Jansen wrote:

> Please keep in mind I'm talking about usage as ZIL, not as L2ARC or main
> pool. Because the ZIL issues nearly sequential writes, and thanks to the
> NVRAM protection of the RAID controller, the disk can leave its write
> cache enabled. This means the disk can write at essentially full speed,
> meaning 150 MB/s for a 15k drive. 114,000 4k writes/s is 456 MB/s, so 3
> spindles should do.

Huh? What does the battery-backed memory of a RAID controller have to do
with the unprotected memory of a hard drive? This does not compute.

The flushes that the RAID controller acks need to be ultimately delivered
to the disk or else there WILL be data loss. The RAID controller should
not purge its own record until the disk reports that it has flushed its
cache. Once the RAID controller's cache is full, then it should start
stalling writes.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,     http://www.GraphicsMagick.org/
Bob Friesenhahn wrote:

> On Wed, 16 Jun 2010, Arne Jansen wrote:
>> [...] Because the ZIL issues nearly sequential writes, and thanks to
>> the NVRAM protection of the RAID controller, the disk can leave its
>> write cache enabled. [...]
>
> Huh? What does the battery-backed memory of a RAID controller have to
> do with the unprotected memory of a hard drive? This does not compute.

You're right, I took a wrong turn there. Of course the RAID controller
disables the write cache of the disks. But because the controller ACKs
each write immediately (as long as it has buffer left), the requests can
be queued in the disk. This enables the disk to write continuously.

I double-checked before posting: I can nearly saturate a 15k disk if I
make full use of the 32 queue slots, giving 137 MB/s or 34k IOPS. Times 3
nearly matches the above-mentioned 114k IOPS :)

Thanks,
Arne

> The flushes that the RAID controller acks need to be ultimately
> delivered to the disk or else there WILL be data loss. The RAID
> controller should not purge its own record until the disk reports that
> it has flushed its cache. Once the RAID controller's cache is full,
> then it should start stalling writes.
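A way to watch this kind of saturation test with standard Solaris tools
(a sketch; the load generator must keep ~32 writes outstanding on the
device):

    # Per-device statistics at 1-second intervals: 'actv' is the number
    # of commands outstanding at the device (should sit near 32 here),
    # 'kw/s' is write throughput, and '%b' shows how busy the device is.
    iostat -xn 1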
On Wed, June 16, 2010 15:15, Arne Jansen wrote:

> I double-checked before posting: I can nearly saturate a 15k disk if I
> make full use of the 32 queue slots, giving 137 MB/s or 34k IOPS. Times
> 3 nearly matches the above-mentioned 114k IOPS :)

34k * 3 = 102k. 12k isn't anything to sneeze at :)

So you'll need six disks to do what one SSD does: three spindles' worth of
throughput, with each spindle a two-disk mirror for redundancy (drives are
riskier than SSDs).
David Magda wrote:

> On Wed, June 16, 2010 15:15, Arne Jansen wrote:
>
>> I double-checked before posting: I can nearly saturate a 15k disk if I
>> make full use of the 32 queue slots, giving 137 MB/s or 34k IOPS.
>> Times 3 nearly matches the above-mentioned 114k IOPS :)
>
> 34k * 3 = 102k. 12k isn't anything to sneeze at :)
>
> So you'll need six disks to do what one SSD does: three spindles' worth
> of throughput, with each spindle a two-disk mirror for redundancy
> (drives are riskier than SSDs).

OK, 4 spindles then; we already have a RAID controller available :)
But personally I trust drives more than SSDs. Were the 114k measured with
mirrored or striped Logzillas? In any case there are two of them, so I'd
double that RAID-controller setup as well, still being cheaper than the
STEC devices.
On Wed, Jun 16, 2010 at 04:44:07PM +0200, Arne Jansen wrote:

> Please keep in mind I'm talking about usage as ZIL, not as L2ARC or main
> pool. Because the ZIL issues nearly sequential writes, and thanks to the
> NVRAM protection of the RAID controller, the disk can leave its write
> cache enabled. This means the disk can write at essentially full speed,
> meaning 150 MB/s for a 15k drive. 114,000 4k writes/s is 456 MB/s, so 3
> spindles should do.

You'd still have to flush those caches at the end of each transaction,
which would tend to come every few seconds, so you'd need to factor that
in.

You can definitely do with disk what you can do with SSDs, but not
necessarily with the same SWAP (space, wattage and price), and you'd have
a more complex system no matter what.

Nico
--