James
2011-Jan-26 15:14 UTC
[zfs-discuss] Lower latency ZIL Option?: SSD behind Controller BB Write Cache
I'm wondering if any of the ZIL gurus could examine the following and point out anywhere my logic is going wrong.

For small backend systems (e.g. 24x 10k SAS RAID 10) I'm expecting an absolute maximum backend write throughput of 10,000 sequential IOPS** and more realistically 2,000-5,000. With small (4 kB) block sizes*, 10,000 IOPS is roughly 40 MB/s, or about 400 MB over 10 s, so we don't need much ZIL space or throughput. What we do need is the ability to absorb the IOPS at low latency and keep absorbing them at least as fast as the backend storage can commit them.

ZIL OPTIONS: Obviously a DDRdrive is the ideal (36k 4k random IOPS***), but for the same budget I can get 2x Vertex 2 EX 50GB drives and put each behind its own P410 512MB BBWC controller. Assuming the SSDs can do 6,300 4k random IOPS*** and that the controller cache acknowledges those writes with the same latency as the DDRdrive (both PCIe-attached RAM?****), then we should have DDRdrive-type latency up to 6,300 sustained IOPS. Also, in bursting traffic, we should be able to absorb up to 512MB of data (3.5 s of 36,000 4k IOPS) at much higher IOPS / lower latency, as long as the average stays at or below 6,300 (i.e. the SSD can empty the cache before it fills).

So what are the issues with using this approach for low-budget builds looking for mirrored ZILs that don't require >6,300 sustained write IOPS (due to backend disk limitations)? Obviously there are a lot of assumptions here, but I wanted to get my theory straight before I start ordering things to test. Thanks all.

James

* For NTFS 4 kB clusters on VMware / NFS, I believe a 4 kB ZFS recordsize will provide the best performance (avoiding partial writes). Thoughts welcome on that too.

** Assumes each 10k SAS disk can do a maximum of 900 sequential write IOPS, striped across 12 mirrors and rounded down (900 based on the Tom's Hardware HDD streaming write benchmark). Also assumes ZFS can take completely random writes and turn them into completely sequential write IOPS on the underlying disks, and that no reads, >32k writes, etc. are hitting the disks at the same time. Realistically 2,000-5,000 is probably the more likely maximum.

*** Figures from the excellent DDRdrive presentation. NB: if the BBWC can sequentialise writes to the SSD, it may get closer to 10,000 IOPS.

**** I'm assuming that the P410 BBWC and the DDRdrive have a similar IOPS/latency profile; the DDRdrive may do something fancy with striping across RAM to improve IO?

Similar posts:
http://opensolaris.org/jive/thread.jspa?messageID=460871 - except normal disks instead of SSDs behind the cache (so the cache would fill).
http://www.mail-archive.com/zfs-discuss at opensolaris.org/msg39729.html - same again?
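[Editorial note: to sanity-check the sizing arithmetic above, a small Python sketch using only the figures assumed in this post (10k IOPS, 4 kB blocks, a 10 s window, the DDRdrive's quoted 36k IOPS and a 512 MB BBWC); none of these are measurements.]

```python
# Back-of-the-envelope ZIL sizing using the assumptions in this post.
IOPS = 10_000          # assumed maximum backend sequential write IOPS
BLOCK = 4 * 1024       # 4 kB sync writes (NTFS clusters over NFS)
WINDOW_S = 10          # seconds of outstanding transaction data to absorb

ingest = IOPS * BLOCK                 # bytes/s the slog must accept
capacity = ingest * WINDOW_S          # bytes of slog actually needed

print(f"slog ingest rate : {ingest / 1e6:.0f} MB/s")
print(f"slog capacity    : {capacity / 1e6:.0f} MB over {WINDOW_S} s")

# Burst absorption by a 512 MB BBWC at the DDRdrive's quoted 36k 4k IOPS:
BBWC = 512 * 10**6
print(f"BBWC soaks a full-rate burst for ~{BBWC / (36_000 * BLOCK):.1f} s "
      f"before the SSD has to keep up on its own")
```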
Christopher George
2011-Jan-26 16:29 UTC
[zfs-discuss] Lower latency ZIL Option?: SSD behind Controller BB Write Cache
> ZIL OPTIONS: Obviously a DDRdrive is the ideal (36k 4k random
> IOPS***) but for the same budget I can get 2x Vertex 2 EX 50GB
> drives and put each behind its own P410 512MB BBWC controller.

The Vertex 2 EX goes for approximately $900 each online, while the P410/512 BBWC is listed at HP for $449 each. Cost-wise you should contact us for a quote, as we are price competitive with just a single SSD/HBA combination, especially as one obtains 4GB instead of 512MB of ZIL accelerator capacity.

> Assuming the SSDs can do 6300 4k random IOPS*** and that the
> controller cache confirms those writes in the same latency as the

For 4KB random writes you need to look closely at slides 47/48 of the referenced presentation (http://www.ddrdrive.com/zil_accelerator). The 6443 IOPS figure is obtained after testing for *only* 2 hours post unpackaging or secure erase. The slope of both curves gives a hint, as the Vertex 2 EX does not level off and will continue to decrease. I am working on a new presentation focusing on this very fact: random write IOPS performance over time (the life of the device). Suffice to say, 6443 IOPS is *not* worst-case performance for random writes on the Vertex 2 EX.

> DDRdrive (both PCIe attached RAM?****) then we should have
> DDRdrive type latency up to 6300 sustained IOPS.

All tests used a QD (Queue Depth) of 32, which will hide the device latency of a single IO. This is very meaningful, as real-life workloads can be bound by even a single outstanding IO. Let's trace the latency to determine which has the advantage: for the SSD/HBA combination, an IO has to run the gauntlet through two controllers (HBA and SSD) and propagate over a SATA cable. The DDRdrive X1 has a single unified controller and no extraneous SATA cable; see slides 15-17.

Best regards,

Christopher George
Founder/CTO
www.ddrdrive.com
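[Editorial note: the queue-depth point can be made concrete with Little's Law (in-flight IOs = throughput x latency). A small Python sketch using the 6443 IOPS figure from the slides plus an assumed, purely illustrative single-IO service time; the 0.25 ms value is not from the presentation.]

```python
# Little's Law: concurrency = throughput * latency.
# At QD=32 the benchmark keeps 32 IOs in flight, so the measured IOPS
# says little about how long any single synchronous write takes.
QD = 32
MEASURED_IOPS = 6443               # Vertex 2 EX figure at QD=32, from the slides

avg_latency_s = QD / MEASURED_IOPS
print(f"average time each IO spends in flight at QD=32: {avg_latency_s * 1000:.1f} ms")

# A single-threaded sync workload (QD=1) is bound by per-IO latency instead.
# Illustrative assumption only: 0.25 ms per acknowledged write.
assumed_qd1_latency_s = 0.25e-3
print(f"QD=1 IOPS at an assumed {assumed_qd1_latency_s * 1e3:.2f} ms/IO: "
      f"{1 / assumed_qd1_latency_s:.0f}")
```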
Eff Norwood
2011-Jan-27 12:41 UTC
[zfs-discuss] Lower latency ZIL Option?: SSD behind Controller BB Write Cache
We tried all combinations of OCZ SSDs, including their PCI-based SSDs, and they do NOT work as a ZIL. After a very short time performance degrades horribly, and the OCZ drives eventually fail completely. We also tried Intel, which performed a little better and didn't flat-out fail over time, but these still did not work out as a ZIL.

We use the DDRdrive X1 now for all of our ZIL applications and could not be happier. The cards are great, support is great and performance is incredible. We use them to provide NFS storage to 50K VMware VDI users. As you stated, the DDRdrive is ideal. Go with that and you'll be very happy you did!
James
2011-Jan-27 14:57 UTC
[zfs-discuss] Lower latency ZIL Option?: SSD behind Controller BB Write Cache
Chris & Eff,

Thanks for your expertise on this and other posts. Greatly appreciated. I've just been re-reading some of the great SSD-as-ZIL discussions.

Chris:
Cost: Our case is a bit non-representative, as we have spare P410/512s that came with our ESXi hosts (USB boot), so I've budgeted them at zero cost. I will be in touch for a quote; I just want to get all my theory straight on the options first.
Benchmarks: Good point on the direction of the graphs, and I look forward to seeing any further papers.
Latency: Yes, the 9.9ms average latency (pg 49) was what initially got me thinking about adding the BBWC in front. Thanks for reviewing that theory. Good to know it's an option.

Eff:
Thanks for the Vertex review. Very helpful. Do you use mirrored DDRdrives (or do you have so much confidence in them that you risk single devices)?
Eff Norwood
2011-Jan-27 18:09 UTC
[zfs-discuss] Lower latency ZIL Option?: SSD behind Controller BB Write Cache
They have been incredibly reliable, with zero downtime or issues. As a result, we use two in every system, striped. For one application outside of VDI we use a pair of them mirrored, but that is very unusual and driven by the customer, not us.
Edward Ned Harvey
2011-Jan-28 13:25 UTC
[zfs-discuss] Lower latency ZIL Option?: SSD behind Controller BB Write Cache
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Eff Norwood
>
> We tried all combinations of OCZ SSDs including their PCI based SSDs and
> they do NOT work as a ZIL. After a very short time performance degrades
> horribly and for the OCZ drives they eventually fail completely.

This was something interesting I found recently. Apparently, for flash manufacturers, flash hard drives are like the pimple on the butt of the elephant. The vast majority of the flash production in the world goes into devices like smartphones, cameras, tablets, etc. Only a slim minority goes into hard drives. As a result, they optimize for those other devices, and one of the important side effects is that standard flash chips use an 8K page size, while hard drives use either 4K or 512B sectors.

The SSD controller secretly remaps blocks internally and aggregates small writes into a single 8K write, so there's really no way for the OS to know if it's writing to a 4K block which happens to share an 8K page with another 4K block. So it's unavoidable, and whenever it happens, the drive can't simply write; it must read-modify-write, which is obviously much slower.

Also, if you look up the specs of an SSD, for IOPS and/or sustainable throughput... they lie. Well, technically they're not lying, because technically it is *possible* to reach whatever they say: optimize your usage patterns and only use blank drives which are new from the box or have been fully TRIM'd. Pfffft... In my experience, reality is about 50% of whatever they say.

Presently, the only way to deal with all this is via the TRIM command, which cannot eliminate the read-modify-writes but can reduce their occurrence. Make sure your OS supports TRIM. I'm not sure at what point ZFS added TRIM, or to what extent... I can't really measure the effectiveness myself.

Long story short, in the real world you can expect the DDRdrive to crush and shame the performance of any SSD you can find. It's mostly a question of PCIe slot versus SAS/SATA slot, and other characteristics you might care about, like external power, etc.
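[Editorial note: a rough Python model of the read-modify-write penalty described above, assuming the 8 KiB internal page size Edward claims; the page size and the RMW fraction are illustrative assumptions, not vendor data.]

```python
# Back-of-the-envelope model of the read-modify-write (RMW) penalty when
# 4 KiB host writes land on an (assumed) 8 KiB internal flash page.
# Real controllers vary by vendor; numbers here are illustrative only.

HOST_WRITE = 4 * 1024     # bytes per host write
FLASH_PAGE = 8 * 1024     # assumed internal page size

def write_amplification(rmw_fraction: float) -> float:
    """Bytes programmed to flash per byte of host data.

    rmw_fraction is the share of host writes that hit a page already
    holding live neighbouring data, forcing the controller to read the
    page, merge the new 4 KiB and re-program the whole 8 KiB page
    (the extra page read adds latency on top of this).
    """
    clean = (1.0 - rmw_fraction) * HOST_WRITE   # combined in pairs, pages filled exactly
    rmw = rmw_fraction * FLASH_PAGE             # whole page re-programmed for 4 KiB of new data
    return (clean + rmw) / HOST_WRITE

for f in (0.0, 0.5, 1.0):
    print(f"{f:.0%} of writes need RMW -> {write_amplification(f):.1f}x bytes programmed")
```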
Deano
2011-Jan-28 14:01 UTC
[zfs-discuss] Lower latency ZIL Option?: SSD behind Controller BB Write Cache
Hi Edward,

Do you have a source for the 8KiB block size data? Whilst we can't avoid the SSD controller, in theory we can change the smallest size we present to the SSD to 8KiB fairly easily... I wonder if that would help the controller do a better job (especially with TRIM).

I might have to do some tests. So far the assumption (even inside Sun's sd driver) is that SSDs are really 4KiB even when they claim 512B; perhaps we should have an 8KiB option...

Thanks,
Deano
deano at cloudpixies.com
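[Editorial note: the "smallest size we present" knob in ZFS is the vdev ashift. A minimal Python sketch of the arithmetic; note the explicit "-o ashift=" property is only exposed by later OpenZFS builds (e.g. ZFS on Linux and illumos derivatives), not by 2011-era OpenSolaris, which derives ashift from the sector size the device reports. The pool and device names in the comment are hypothetical.]

```python
# Sketch: pick the ZFS ashift matching an assumed flash page size.
# ashift is log2 of the smallest write ZFS will issue to the vdev.
import math

def ashift_for(block_size_bytes: int) -> int:
    """Return the ashift corresponding to a physical block size (power of two)."""
    if block_size_bytes & (block_size_bytes - 1):
        raise ValueError("block size must be a power of two")
    return int(math.log2(block_size_bytes))

for size in (512, 4096, 8192):
    print(f"{size:5d} B sectors -> ashift={ashift_for(size)}")

# On a build that exposes the property, forcing 8 KiB allocations would
# look like this (hypothetical pool/device names):
#   zpool create -o ashift=13 slogpool mirror c1t0d0 c1t1d0
```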
taemun
2011-Jan-28 15:33 UTC
[zfs-discuss] Lower latency ZIL Option?: SSD behind Controller BB Write Cache
Comments below.

On 29 January 2011 00:25, Edward Ned Harvey <opensolarisisdeadlongliveopensolaris at nedharvey.com> wrote:

> This was something interesting I found recently. Apparently for flash
> manufacturers, flash hard drives are like the pimple on the butt of the
> elephant. A vast majority of the flash production in the world goes into
> devices like smartphones, cameras, tablets, etc. Only a slim minority goes
> into hard drives.

http://www.eetimes.com/electronics-news/4206361/SSDs--Still-not-a--solid-state--business
~6.1 percent for 2010, from that estimate (the first thing Google turned up). Not denying what you said; I just like real figures rather than hearsay.

> As a result, they optimize for these other devices, and
> one of the important side effects is that standard flash chips use an 8K
> page size. But hard drives use either 4K or 512B.

http://www.anandtech.com/Show/Index/2738?cPage=19&all=False&sort=0&page=5
Terms: "page" means the smallest data size that can be read or programmed (written); "block" means the smallest data size that can be erased. SSDs commonly have a page size of 4KiB and a block size of 512KiB. I'd take Anandtech's word on it. There is probably some variance across the market, but for the vast majority this is true. Wikipedia's http://en.wikipedia.org/wiki/Flash_memory#NAND_memories says that common page sizes are 512B, 2KiB, and 4KiB.

> The SSD controller secretly remaps blocks internally, and aggregates small
> writes into a single 8K write, so there's really no way for the OS to know
> if it's writing to a 4K block which happens to be shared with another 4K
> block in the 8K page. So it's unavoidable, and whenever it happens, the
> drive can't simply write. It must read modify write, which is obviously
> much slower.

This is true, but for 512B-to-4KiB aggregation, as the 8KiB page doesn't exist. As for writing when everything is full and you need to do an erase... well, this is where TRIM is helpful.

> Also if you look up the specs of a SSD, both for IOPS and/or sustainable
> throughput... They lie. Well, technically they're not lying because
> technically it is *possible* to reach whatever they say. Optimize your
> usage patterns and only use blank drives which are new from box, or have
> been fully TRIM'd. Pfffft... But in my experience, reality is about 50% of
> whatever they say.
>
> Presently, the only way to deal with all this is via the TRIM command, which
> cannot eliminate the read/modify/write, but can reduce their occurrence.
> Make sure your OS supports TRIM. I'm not sure at what point ZFS added TRIM,
> or to what extent... Can't really measure the effectiveness myself.

http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6957655

> Long story short, in the real world, you can expect the DDRDrive to crush
> and shame the performance of any SSD you can find. It's mostly a question
> of PCIe slot versus SAS/SATA slot, and other characteristics you might care
> about, like external power, etc.

Sure, DDR RAM will have a much quicker sync write time. This isn't really a surprising result.
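[Editorial note: to make the page-vs-block distinction and the TRIM point concrete, a small Python sketch of garbage-collection cost for one erase block, using the commonly quoted geometry cited above (4 KiB pages, 512 KiB erase blocks); illustrative only.]

```python
# Garbage-collection cost for reclaiming one NAND erase block.
PAGE = 4 * 1024                    # smallest programmable unit
BLOCK = 512 * 1024                 # smallest erasable unit
PAGES_PER_BLOCK = BLOCK // PAGE    # 128 pages

def reclaim_cost(live_pages: int) -> int:
    """Bytes the controller must copy elsewhere before it can erase the block."""
    return live_pages * PAGE

# Without TRIM the controller must treat deleted-but-not-overwritten data
# as live; with TRIM those pages are known dead and need not be copied.
for live in (128, 64, 0):
    print(f"{live:3d}/{PAGES_PER_BLOCK} live pages -> "
          f"copy {reclaim_cost(live) // 1024} KiB before erase")
```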
Eric D. Mudama
2011-Jan-28 20:04 UTC
[zfs-discuss] Lower latency ZIL Option?: SSD behind Controller BB Write Cache
On Fri, Jan 28 at 8:25, Edward Ned Harvey wrote:

> Apparently for flash manufacturers, flash hard drives are like the pimple
> on the butt of the elephant. A vast majority of the flash production in the
> world goes into devices like smartphones, cameras, tablets, etc. Only a slim
> minority goes into hard drives. As a result, they optimize for these other
> devices, and one of the important side effects is that standard flash chips
> use an 8K page size. But hard drives use either 4K or 512B.
>
> The SSD controller secretly remaps blocks internally, and aggregates small
> writes into a single 8K write, so there's really no way for the OS to know
> if it's writing to a 4K block which happens to be shared with another 4K
> block in the 8K page. So it's unavoidable, and whenever it happens, the
> drive can't simply write. It must read modify write, which is obviously
> much slower.

The reality is way more complicated, and statements like the above may or may not be true on a vendor-by-vendor basis. As time passes, the underlying NAND geometries are designed for certain sets of advantages, continually subject to re-evaluation and modification, and good SSD controllers sitting on top of NAND or other solid-state storage will map those advantages effectively into our problem domains as users. Testing methodologies are improving over time as well, and eventually it will become clearer which devices are suited to which tasks.

The suitability of a specific solution to a problem space will always be a balance between cost, performance, reliability and time to market. No single solution (RAM SAN, RAM SSD, NAND SSD, BBU controllers, rotating HDD, etc.) wins in every single area, or else we wouldn't be having this discussion.

--eric

--
Eric D. Mudama
edmudama at mail.bounceswoosh.org
Edward Ned Harvey
2011-Jan-29 16:22 UTC
[zfs-discuss] Lower latency ZIL Option?: SSD behind Controller BB Write Cache
> From: Deano [mailto:deano at rattie.demon.co.uk]
>
> Hi Edward,
> Do you have a source for the 8KiB block size data? whilst we can't avoid the
> SSD controller in theory we can change the smallest size we present to the
> SSD to 8KiB fairly easily... I wonder if that would help the controller do a
> better job (especially with TRIM)
>
> I might have to do some test, so far the assumption (even inside sun's sd
> driver) is that SSD are really 4KiB even when the claim 512B, perhaps we
> should have an 8KiB option...

It's hard to say precisely where the truth lies, so I'll just tell a story and take from it what you will.

For me, it started when I began deploying new laptops with SSDs. There was a problem with the backup software, so I kept reimaging machines using "dd" and then backing up and restoring with Acronis, and when that failed, I would restore again via dd, etc. So I kept overwriting the drive repeatedly. After only 2-3 iterations, performance degraded to around 50% of its original speed.

At work we have a team of engineers who know flash intimately, so I asked them about flash performance degrading with usage. The first response was that each time a cell is erased and rewritten, the data isn't written as cleanly as before: like erasing pencil or chalkboard and rewriting over and over, it becomes "smudgy." With repetition and age, the device becomes slower and consumes more power, because there is a higher incidence of errors and a greater need for error correction and for repeating operations with varying operating parameters on the chips. All of this is invisible to the OS but affects performance internally. But when I said I was getting a 50% loss after only 2-3 iterations, this life degradation was clearly not the issue; it only becomes significant after tens of thousands of iterations or more. They suggested the cause of the problem must be something in the controller, not in the flash itself.

So I kept working on it. I found this:
http://www.pcper.com/article.php?aid=669&type=expert
(see the section on Write Combining)

Rather than reading that whole article... the most valuable thing to come out of it is a set of useful search terms:
  ssd "write combining"
  ssd internal fragmentation
  ssd sector remapping

This is very similar to ZFS write aggregation. The drives combine small writes into larger blocks and take advantage of block remapping to keep track of it all. You gain performance during lots of small writes. It does not hurt you for lots of small random reads, but it does hurt you for sequential reads/writes that happen after the remapping. Also, unlike ZFS, the drive can't fully recover after the fact when data gets deleted, moved or overwritten; the drive has no way to straighten itself out except TRIM.

After discovering this, I went back to the flash guys at work and explained the internal fragmentation idea. One of the head engineers was there at the time, and he's the one who told me flash is made in 8k pages. "To flash manufacturers, SSDs are the pimple on the butt of the elephant" was his statement.

Unfortunately, hard disks and OSes historically both used 512B sectors. Then hard drives started using 4k sectors, but to maintain compatibility with OSes they still emulate 512B on the interface. The OS assumes the disk is doing this, so it aligns 512B writes to multiples of 4k in order to avoid the read-modify-write. Unfortunately, now the SSDs are using an 8k physical page size and emulating who knows what (4k or 512B) on the interface, so the RMW is once again necessary until OSes become aware and start aligning on 8k pages instead... And even that doesn't really matter any more: thanks to sector remapping and write combining, even if your OS is intelligent enough, you're still going to end up with fragmentation, unless the OS pads every write to make up a full 8k page.

But getting back to the point. The question I think you're asking is how to verify the existence of the 8k physical page inside the SSD. There are two ways to prove it that I can think of: (a) rip apart your SSD, hope you can read the chip numbers, and hope you can find specs for those chips to confirm or deny the 8k pages; or (b) TRIM your entire drive and see if it returns to its original performance afterward. The latter can be done via HDDErase, but that requires temporarily switching into ATA mode, booting from a DOS disk, and then putting the controller back into AHCI mode afterward... I went as far as switching into ATA mode, but then I found that creating the DOS disk was going to be a rathole, so I decided to call it quits and assume I had the right answer with a high enough degree of confidence. Since performance is only degraded for sequential operations, I will see degradation for OS rebuilds, but users probably won't notice.
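[Editorial note: for option (b), "returns to original performance" can be quantified by timing the same large sequential write before and after the secure erase / full-device TRIM. A minimal Python sketch; the target path and sizes are hypothetical, and synchronous writes are used so the page cache doesn't mask the device.]

```python
# Rough sequential-write timing to compare a drive before and after a
# full-device TRIM / secure erase. Run against a scratch file on the SSD
# under test; TARGET is a hypothetical mount point.
import os
import time

TARGET = "/ssd_under_test/benchfile"
CHUNK = 1 * 1024 * 1024                # 1 MiB sequential writes
TOTAL = 1 * 1024 * 1024 * 1024         # 1 GiB total

def seq_write_mb_per_s(path: str) -> float:
    buf = os.urandom(CHUNK)            # incompressible data
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC | os.O_SYNC)
    start = time.time()
    written = 0
    while written < TOTAL:
        written += os.write(fd, buf)   # O_SYNC forces each write to the device
    os.fsync(fd)
    os.close(fd)
    return (written / (1024 * 1024)) / (time.time() - start)

print(f"sequential write: {seq_write_mb_per_s(TARGET):.1f} MB/s")
```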