James
2011-Jan-26 15:14 UTC
[zfs-discuss] Lower latency ZIL Option?: SSD behind Controller BB Write Cache
I'm wondering if any of the ZIL gurus could examine the following and point out anywhere my logic is going wrong.

For small backend systems (e.g. 24x 10k SAS RAID 10) I'm expecting an absolute maximum backend write throughput of 10,000 sequential IOPS** and more realistically 2,000-5,000. With small (4 kB) block sizes*, 10,000 IOPS is roughly 40 MB/s, or about 400 MB over 10 s, so we don't need much ZIL space or throughput. What we do need is the ability to absorb the IOPS at low latency and keep absorbing them at least as fast as the backend storage can commit them.

ZIL OPTIONS: Obviously a DDRdrive is the ideal (36k 4k random IOPS***), but for the same budget I can get 2x Vertex 2 EX 50GB drives and put each behind its own P410 512MB BBWC controller. Assuming the SSDs can do 6,300 4k random IOPS*** and that the controller cache acknowledges those writes with the same latency as the DDRdrive (both PCIe-attached RAM?****), then we should have DDRdrive-type latency up to 6,300 sustained IOPS. Also, in bursting traffic, we should be able to absorb up to 512MB of data (3.5 s of 36,000 4k IOPS) at much higher IOPS / lower latency, as long as the average stays at or below 6,300 (i.e. the SSD can empty the cache before it fills).

So what are the issues with using this approach for low-budget builds looking for mirrored ZILs that don't require >6,300 sustained write IOPS (due to backend disk limitations)? Obviously there are a lot of assumptions here, but I wanted to get my theory straight before I start ordering things to test. Thanks all.

James

* For NTFS 4 kB clusters on VMware / NFS, I believe a 4 kB ZFS recordsize will provide the best performance (avoiding partial writes). Thoughts welcome on that too.

** Assumes each 10k SAS disk can do a maximum of 900 sequential write IOPS, striped across 12 mirrors and rounded down (900 based on the Tom's Hardware HDD streaming write benchmark). Also assumes ZFS can take completely random writes and turn them into completely sequential write IOPS on the underlying disks, and that no reads, >32k writes, etc. are hitting the disks at the same time. Realistically 2,000-5,000 is probably the more likely maximum.

*** Figures from the excellent DDRdrive presentation. NB: if the BBWC can sequentialise writes to the SSD, it may get closer to 10,000 IOPS.

**** I'm assuming that the P410 BBWC and the DDRdrive have a similar IOPS/latency profile; the DDRdrive may do something fancy with striping across RAM to improve IO?

Similar posts:
http://opensolaris.org/jive/thread.jspa?messageID=460871 - except normal disks instead of SSDs behind the cache (so the cache would fill).
http://www.mail-archive.com/zfs-discuss at opensolaris.org/msg39729.html - same again?
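[Editorial note: to sanity-check the sizing arithmetic above, a small Python sketch using only the figures assumed in this post (10k IOPS, 4 kB blocks, a 10 s window, the DDRdrive's quoted 36k IOPS and a 512 MB BBWC); none of these are measurements.]

```python
# Back-of-the-envelope ZIL sizing using the assumptions in this post.
IOPS = 10_000          # assumed maximum backend sequential write IOPS
BLOCK = 4 * 1024       # 4 kB sync writes (NTFS clusters over NFS)
WINDOW_S = 10          # seconds of outstanding transaction data to absorb

ingest = IOPS * BLOCK                 # bytes/s the slog must accept
capacity = ingest * WINDOW_S          # bytes of slog actually needed

print(f"slog ingest rate : {ingest / 1e6:.0f} MB/s")
print(f"slog capacity    : {capacity / 1e6:.0f} MB over {WINDOW_S} s")

# Burst absorption by a 512 MB BBWC at the DDRdrive's quoted 36k 4k IOPS:
BBWC = 512 * 10**6
print(f"BBWC soaks a full-rate burst for ~{BBWC / (36_000 * BLOCK):.1f} s "
      f"before the SSD has to keep up on its own")
```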
Christopher George
2011-Jan-26 16:29 UTC
[zfs-discuss] Lower latency ZIL Option?: SSD behind Controller BB Write Cache
> ZIL OPTIONS: Obviously a DDRdrive is the ideal (36k 4k random
> IOPS***) but for the same budget I can get 2x Vertex 2 EX 50GB
> drives and put each behind its own P410 512MB BBWC controller.

The Vertex 2 EX goes for approximately $900 each online, while the P410/512 BBWC is listed at HP for $449 each. Cost-wise you should contact us for a quote, as we are price competitive with just a single SSD/HBA combination, especially as one obtains 4GB instead of 512MB of ZIL accelerator capacity.

> Assuming the SSDs can do 6300 4k random IOPS*** and that the
> controller cache confirms those writes in the same latency as the

For 4KB random writes you need to look closely at slides 47/48 of the referenced presentation (http://www.ddrdrive.com/zil_accelerator). The 6443 IOPS figure is obtained after testing for *only* 2 hours post unpackaging or secure erase. The slope of both curves gives a hint, as the Vertex 2 EX does not level off and will continue to decrease. I am working on a new presentation focusing on this very fact: random write IOPS performance over time (the life of the device). Suffice to say, 6443 IOPS is *not* worst-case performance for random writes on the Vertex 2 EX.

> DDRdrive (both PCIe attached RAM?****) then we should have
> DDRdrive type latency up to 6300 sustained IOPS.

All tests used a QD (Queue Depth) of 32, which will hide the device latency of a single IO. This is very meaningful, as real-life workloads can be bound by even a single outstanding IO. Let's trace the latency to determine which has the advantage: for the SSD/HBA combination, an IO has to run the gauntlet through two controllers (HBA and SSD) and propagate over a SATA cable. The DDRdrive X1 has a single unified controller and no extraneous SATA cable; see slides 15-17.

Best regards,

Christopher George
Founder/CTO
www.ddrdrive.com
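[Editorial note: the queue-depth point can be made concrete with Little's Law (in-flight IOs = throughput x latency). A small Python sketch using the 6443 IOPS figure from the slides plus an assumed, purely illustrative single-IO service time; the 0.25 ms value is not from the presentation.]

```python
# Little's Law: concurrency = throughput * latency.
# At QD=32 the benchmark keeps 32 IOs in flight, so the measured IOPS
# says little about how long any single synchronous write takes.
QD = 32
MEASURED_IOPS = 6443               # Vertex 2 EX figure at QD=32, from the slides

avg_latency_s = QD / MEASURED_IOPS
print(f"average time each IO spends in flight at QD=32: {avg_latency_s * 1000:.1f} ms")

# A single-threaded sync workload (QD=1) is bound by per-IO latency instead.
# Illustrative assumption only: 0.25 ms per acknowledged write.
assumed_qd1_latency_s = 0.25e-3
print(f"QD=1 IOPS at an assumed {assumed_qd1_latency_s * 1e3:.2f} ms/IO: "
      f"{1 / assumed_qd1_latency_s:.0f}")
```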
Eff Norwood
2011-Jan-27 12:41 UTC
[zfs-discuss] Lower latency ZIL Option?: SSD behind Controller BB Write Cache
We tried all combinations of OCZ SSDs, including their PCI-based SSDs, and they do NOT work as a ZIL. After a very short time performance degrades horribly, and the OCZ drives eventually fail completely. We also tried Intel, which performed a little better and didn't flat-out fail over time, but these still did not work out as a ZIL.

We use the DDRdrive X1 now for all of our ZIL applications and could not be happier. The cards are great, support is great and performance is incredible. We use them to provide NFS storage to 50K VMware VDI users. As you stated, the DDRdrive is ideal. Go with that and you'll be very happy you did!
James
2011-Jan-27 14:57 UTC
[zfs-discuss] Lower latency ZIL Option?: SSD behind Controller BB Write Cache
Chris & Eff,

Thanks for your expertise on this and other posts. Greatly appreciated. I've just been re-reading some of the great SSD-as-ZIL discussions.

Chris:
Cost: Our case is a bit non-representative, as we have spare P410/512s that came with our ESXi hosts (USB boot), so I've budgeted them at zero cost. I will be in touch for a quote; I just want to get all my theory straight on the options first.
Benchmarks: Good point on the direction of the graphs, and I look forward to seeing any further papers.
Latency: Yes, the 9.9ms average latency (pg 49) was what initially got me thinking about adding the BBWC in front. Thanks for reviewing that theory. Good to know it's an option.

Eff:
Thanks for the Vertex review. Very helpful. Do you use mirrored DDRdrives (or do you have so much confidence in them that you risk single devices)?
Eff Norwood
2011-Jan-27 18:09 UTC
[zfs-discuss] Lower latency ZIL Option?: SSD behind Controller BB Write Cache
They have been incredibly reliable, with zero downtime or issues. As a result, we use two in every system, striped. For one application outside of VDI we use a pair of them mirrored, but that is very unusual and driven by the customer, not us.
Edward Ned Harvey
2011-Jan-28 13:25 UTC
[zfs-discuss] Lower latency ZIL Option?: SSD behind Controller BB Write Cache
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Eff Norwood
>
> We tried all combinations of OCZ SSDs including their PCI based SSDs and
> they do NOT work as a ZIL. After a very short time performance degrades
> horribly and for the OCZ drives they eventually fail completely.

This was something interesting I found recently. Apparently, for flash manufacturers, flash hard drives are like the pimple on the butt of the elephant. The vast majority of the flash production in the world goes into devices like smartphones, cameras, tablets, etc. Only a slim minority goes into hard drives. As a result, they optimize for those other devices, and one of the important side effects is that standard flash chips use an 8K page size, while hard drives use either 4K or 512B sectors.

The SSD controller secretly remaps blocks internally and aggregates small writes into a single 8K write, so there's really no way for the OS to know if it's writing to a 4K block which happens to share an 8K page with another 4K block. So it's unavoidable, and whenever it happens, the drive can't simply write; it must read-modify-write, which is obviously much slower.

Also, if you look up the specs of an SSD, for IOPS and/or sustainable throughput... they lie. Well, technically they're not lying, because technically it is *possible* to reach whatever they say: optimize your usage patterns and only use blank drives which are new from the box or have been fully TRIM'd. Pfffft... In my experience, reality is about 50% of whatever they say.

Presently, the only way to deal with all this is via the TRIM command, which cannot eliminate the read-modify-writes but can reduce their occurrence. Make sure your OS supports TRIM. I'm not sure at what point ZFS added TRIM, or to what extent... I can't really measure the effectiveness myself.

Long story short, in the real world you can expect the DDRdrive to crush and shame the performance of any SSD you can find. It's mostly a question of PCIe slot versus SAS/SATA slot, and other characteristics you might care about, like external power, etc.
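[Editorial note: a rough Python model of the read-modify-write penalty described above, assuming the 8 KiB internal page size Edward claims; the page size and the RMW fraction are illustrative assumptions, not vendor data.]

```python
# Back-of-the-envelope model of the read-modify-write (RMW) penalty when
# 4 KiB host writes land on an (assumed) 8 KiB internal flash page.
# Real controllers vary by vendor; numbers here are illustrative only.

HOST_WRITE = 4 * 1024     # bytes per host write
FLASH_PAGE = 8 * 1024     # assumed internal page size

def write_amplification(rmw_fraction: float) -> float:
    """Bytes programmed to flash per byte of host data.

    rmw_fraction is the share of host writes that hit a page already
    holding live neighbouring data, forcing the controller to read the
    page, merge the new 4 KiB and re-program the whole 8 KiB page
    (the extra page read adds latency on top of this).
    """
    clean = (1.0 - rmw_fraction) * HOST_WRITE   # combined in pairs, pages filled exactly
    rmw = rmw_fraction * FLASH_PAGE             # whole page re-programmed for 4 KiB of new data
    return (clean + rmw) / HOST_WRITE

for f in (0.0, 0.5, 1.0):
    print(f"{f:.0%} of writes need RMW -> {write_amplification(f):.1f}x bytes programmed")
```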
Deano
2011-Jan-28 14:01 UTC
[zfs-discuss] Lower latency ZIL Option?: SSD behind Controller BB Write Cache
Hi Edward,

Do you have a source for the 8KiB block size data? Whilst we can't avoid the SSD controller, in theory we can change the smallest size we present to the SSD to 8KiB fairly easily... I wonder if that would help the controller do a better job (especially with TRIM).

I might have to do some tests. So far the assumption (even inside Sun's sd driver) is that SSDs are really 4KiB even when they claim 512B; perhaps we should have an 8KiB option...

Thanks,
Deano
deano at cloudpixies.com
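[Editorial note: the "smallest size we present" knob in ZFS is the vdev ashift. A minimal Python sketch of the arithmetic; note the explicit "-o ashift=" property is only exposed by later OpenZFS builds (e.g. ZFS on Linux and illumos derivatives), not by 2011-era OpenSolaris, which derives ashift from the sector size the device reports. The pool and device names in the comment are hypothetical.]

```python
# Sketch: pick the ZFS ashift matching an assumed flash page size.
# ashift is log2 of the smallest write ZFS will issue to the vdev.
import math

def ashift_for(block_size_bytes: int) -> int:
    """Return the ashift corresponding to a physical block size (power of two)."""
    if block_size_bytes & (block_size_bytes - 1):
        raise ValueError("block size must be a power of two")
    return int(math.log2(block_size_bytes))

for size in (512, 4096, 8192):
    print(f"{size:5d} B sectors -> ashift={ashift_for(size)}")

# On a build that exposes the property, forcing 8 KiB allocations would
# look like this (hypothetical pool/device names):
#   zpool create -o ashift=13 slogpool mirror c1t0d0 c1t1d0
```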
taemun
2011-Jan-28 15:33 UTC
[zfs-discuss] Lower latency ZIL Option?: SSD behind Controller BB Write Cache
Comments below.

On 29 January 2011 00:25, Edward Ned Harvey <opensolarisisdeadlongliveopensolaris at nedharvey.com> wrote:

> This was something interesting I found recently. Apparently for flash
> manufacturers, flash hard drives are like the pimple on the butt of the
> elephant. A vast majority of the flash production in the world goes into
> devices like smartphones, cameras, tablets, etc. Only a slim minority goes
> into hard drives.

http://www.eetimes.com/electronics-news/4206361/SSDs--Still-not-a--solid-state--business
~6.1 percent for 2010, from that estimate (the first thing Google turned up). Not denying what you said; I just like real figures rather than hearsay.

> As a result, they optimize for these other devices, and
> one of the important side effects is that standard flash chips use an 8K
> page size. But hard drives use either 4K or 512B.

http://www.anandtech.com/Show/Index/2738?cPage=19&all=False&sort=0&page=5
Terms: "page" means the smallest data size that can be read or programmed (written); "block" means the smallest data size that can be erased. SSDs commonly have a page size of 4KiB and a block size of 512KiB. I'd take Anandtech's word on it. There is probably some variance across the market, but for the vast majority this is true. Wikipedia's http://en.wikipedia.org/wiki/Flash_memory#NAND_memories says that common page sizes are 512B, 2KiB, and 4KiB.

> The SSD controller secretly remaps blocks internally, and aggregates small
> writes into a single 8K write, so there's really no way for the OS to know
> if it's writing to a 4K block which happens to be shared with another 4K
> block in the 8K page. So it's unavoidable, and whenever it happens, the
> drive can't simply write. It must read modify write, which is obviously
> much slower.

This is true, but for 512B-to-4KiB aggregation, as the 8KiB page doesn't exist. As for writing when everything is full and you need to do an erase... well, this is where TRIM is helpful.

> Also if you look up the specs of a SSD, both for IOPS and/or sustainable
> throughput... They lie. Well, technically they're not lying because
> technically it is *possible* to reach whatever they say. Optimize your
> usage patterns and only use blank drives which are new from box, or have
> been fully TRIM'd. Pfffft... But in my experience, reality is about 50% of
> whatever they say.
>
> Presently, the only way to deal with all this is via the TRIM command, which
> cannot eliminate the read/modify/write, but can reduce their occurrence.
> Make sure your OS supports TRIM. I'm not sure at what point ZFS added TRIM,
> or to what extent... Can't really measure the effectiveness myself.

http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6957655

> Long story short, in the real world, you can expect the DDRDrive to crush
> and shame the performance of any SSD you can find. It's mostly a question
> of PCIe slot versus SAS/SATA slot, and other characteristics you might care
> about, like external power, etc.

Sure, DDR RAM will have a much quicker sync write time. This isn't really a surprising result.
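[Editorial note: to make the page-vs-block distinction and the TRIM point concrete, a small Python sketch of garbage-collection cost for one erase block, using the commonly quoted geometry cited above (4 KiB pages, 512 KiB erase blocks); illustrative only.]

```python
# Garbage-collection cost for reclaiming one NAND erase block.
PAGE = 4 * 1024                    # smallest programmable unit
BLOCK = 512 * 1024                 # smallest erasable unit
PAGES_PER_BLOCK = BLOCK // PAGE    # 128 pages

def reclaim_cost(live_pages: int) -> int:
    """Bytes the controller must copy elsewhere before it can erase the block."""
    return live_pages * PAGE

# Without TRIM the controller must treat deleted-but-not-overwritten data
# as live; with TRIM those pages are known dead and need not be copied.
for live in (128, 64, 0):
    print(f"{live:3d}/{PAGES_PER_BLOCK} live pages -> "
          f"copy {reclaim_cost(live) // 1024} KiB before erase")
```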
Eric D. Mudama
2011-Jan-28 20:04 UTC
[zfs-discuss] Lower latency ZIL Option?: SSD behind Controller BB Write Cache
On Fri, Jan 28 at 8:25, Edward Ned Harvey wrote:

> Apparently for flash manufacturers, flash hard drives are like the pimple
> on the butt of the elephant. A vast majority of the flash production in the
> world goes into devices like smartphones, cameras, tablets, etc. Only a slim
> minority goes into hard drives. As a result, they optimize for these other
> devices, and one of the important side effects is that standard flash chips
> use an 8K page size. But hard drives use either 4K or 512B.
>
> The SSD controller secretly remaps blocks internally, and aggregates small
> writes into a single 8K write, so there's really no way for the OS to know
> if it's writing to a 4K block which happens to be shared with another 4K
> block in the 8K page. So it's unavoidable, and whenever it happens, the
> drive can't simply write. It must read modify write, which is obviously
> much slower.

The reality is way more complicated, and statements like the above may or may not be true on a vendor-by-vendor basis. As time passes, the underlying NAND geometries are designed for certain sets of advantages, continually subject to re-evaluation and modification, and good SSD controllers sitting on top of NAND or other solid-state storage will map those advantages effectively into our problem domains as users. Testing methodologies are improving over time as well, and eventually it will become clearer which devices are suited to which tasks.

The suitability of a specific solution to a problem space will always be a balance between cost, performance, reliability and time to market. No single solution (RAM SAN, RAM SSD, NAND SSD, BBU controllers, rotating HDD, etc.) wins in every single area, or else we wouldn't be having this discussion.

--eric

--
Eric D. Mudama
edmudama at mail.bounceswoosh.org
Edward Ned Harvey
2011-Jan-29 16:22 UTC
[zfs-discuss] Lower latency ZIL Option?: SSD behind Controller BB Write Cache
> From: Deano [mailto:deano at rattie.demon.co.uk]
>
> Hi Edward,
> Do you have a source for the 8KiB block size data? whilst we can't avoid the
> SSD controller in theory we can change the smallest size we present to the
> SSD to 8KiB fairly easily... I wonder if that would help the controller do a
> better job (especially with TRIM)
>
> I might have to do some test, so far the assumption (even inside sun's sd
> driver) is that SSD are really 4KiB even when the claim 512B, perhaps we
> should have an 8KiB option...

It's hard to say precisely where the truth lies, so I'll just tell a story and take from it what you will.

For me, it started when I began deploying new laptops with SSDs. There was a problem with the backup software, so I kept reimaging machines using "dd" and then backing up and restoring with Acronis, and when that failed, I would restore again via dd, etc. So I kept overwriting the drive repeatedly. After only 2-3 iterations, performance degraded to around 50% of its original speed.

At work we have a team of engineers who know flash intimately, so I asked them about flash performance degrading with usage. The first response was that each time a cell is erased and rewritten, the data isn't written as cleanly as before: like erasing pencil or chalkboard and rewriting over and over, it becomes "smudgy." With repetition and age, the device becomes slower and consumes more power, because there is a higher incidence of errors and a greater need for error correction and for repeating operations with varying operating parameters on the chips. All of this is invisible to the OS but affects performance internally. But when I said I was getting a 50% loss after only 2-3 iterations, this life degradation was clearly not the issue; it only becomes significant after tens of thousands of iterations or more. They suggested the cause of the problem must be something in the controller, not in the flash itself.

So I kept working on it. I found this:
http://www.pcper.com/article.php?aid=669&type=expert
(see the section on Write Combining)

Rather than reading that whole article... the most valuable thing to come out of it is a set of useful search terms:
  ssd "write combining"
  ssd internal fragmentation
  ssd sector remapping

This is very similar to ZFS write aggregation. The drives combine small writes into larger blocks and take advantage of block remapping to keep track of it all. You gain performance during lots of small writes. It does not hurt you for lots of small random reads, but it does hurt you for sequential reads/writes that happen after the remapping. Also, unlike ZFS, the drive can't fully recover after the fact when data gets deleted, moved or overwritten; the drive has no way to straighten itself out except TRIM.

After discovering this, I went back to the flash guys at work and explained the internal fragmentation idea. One of the head engineers was there at the time, and he's the one who told me flash is made in 8k pages. "To flash manufacturers, SSDs are the pimple on the butt of the elephant" was his statement.

Unfortunately, hard disks and OSes historically both used 512B sectors. Then hard drives started using 4k sectors, but to maintain compatibility with OSes they still emulate 512B on the interface. The OS assumes the disk is doing this, so it aligns 512B writes to multiples of 4k in order to avoid the read-modify-write. Unfortunately, now the SSDs are using an 8k physical page size and emulating who knows what (4k or 512B) on the interface, so the RMW is once again necessary until OSes become aware and start aligning on 8k pages instead... And even that doesn't really matter any more: thanks to sector remapping and write combining, even if your OS is intelligent enough, you're still going to end up with fragmentation, unless the OS pads every write to make up a full 8k page.

But getting back to the point. The question I think you're asking is how to verify the existence of the 8k physical page inside the SSD. There are two ways to prove it that I can think of: (a) rip apart your SSD, hope you can read the chip numbers, and hope you can find specs for those chips to confirm or deny the 8k pages; or (b) TRIM your entire drive and see if it returns to its original performance afterward. The latter can be done via HDDErase, but that requires temporarily switching into ATA mode, booting from a DOS disk, and then putting the controller back into AHCI mode afterward... I went as far as switching into ATA mode, but then I found that creating the DOS disk was going to be a rathole, so I decided to call it quits and assume I had the right answer with a high enough degree of confidence. Since performance is only degraded for sequential operations, I will see degradation for OS rebuilds, but users probably won't notice.
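[Editorial note: for option (b), "returns to original performance" can be quantified by timing the same large sequential write before and after the secure erase / full-device TRIM. A minimal Python sketch; the target path and sizes are hypothetical, and synchronous writes are used so the page cache doesn't mask the device.]

```python
# Rough sequential-write timing to compare a drive before and after a
# full-device TRIM / secure erase. Run against a scratch file on the SSD
# under test; TARGET is a hypothetical mount point.
import os
import time

TARGET = "/ssd_under_test/benchfile"
CHUNK = 1 * 1024 * 1024                # 1 MiB sequential writes
TOTAL = 1 * 1024 * 1024 * 1024         # 1 GiB total

def seq_write_mb_per_s(path: str) -> float:
    buf = os.urandom(CHUNK)            # incompressible data
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC | os.O_SYNC)
    start = time.time()
    written = 0
    while written < TOTAL:
        written += os.write(fd, buf)   # O_SYNC forces each write to the device
    os.fsync(fd)
    os.close(fd)
    return (written / (1024 * 1024)) / (time.time() - start)

print(f"sequential write: {seq_write_mb_per_s(TARGET):.1f} MB/s")
```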