Hi,

I'm currently testing a Mtron Pro 7500 16GB SLC SSD as a ZIL device and seeing very poor performance for small file writes via NFS.

Copying a source code directory with around 4000 small files to the ZFS pool over NFS without the SSD log device yields around 1000 IOPS (pool of 8 SATA shared mirrors). When adding the SSD as ZIL, performance drops to 50 IOPS! I can see similarly poor performance when creating a ZFS pool on the SSD and sharing it via NFS. However, copying the files locally on the server from the SATA pool to the SSD pool only takes a few seconds.

The SSD's specs reveal:

  sequential r/w 512B: 83,000/51,000
  sequential r/w 4KB:  21,000/13,000
  random r/w 512B:     19,000/130
  random r/w 4KB:      12,000/130

So it is apparent that the SSD has really poor random writes. But I was under the impression that the ZIL is mostly sequential writes, or was I misinformed here? Maybe the cache syncs bring the device to its knees?

Best Regards,
Felix Buenemann
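[Editor's sketch, not part of the original message: the NFS small-file copy above boils down to many small synchronous writes, one fsync per file. A rough way to measure a candidate log device's synchronous-write IOPS locally, outside NFS, is below; the target path and file count are placeholders.]

```python
# Rough microbenchmark for small synchronous writes -- approximates the
# one-fsync-per-file pattern an NFS small-file copy pushes through the ZIL.
import os
import time

def sync_write_iops(directory, n_files=500, size=4096):
    """Create n_files small files, fsync'ing each, and return writes/sec."""
    payload = b"x" * size
    os.makedirs(directory, exist_ok=True)
    start = time.time()
    for i in range(n_files):
        path = os.path.join(directory, "f%05d" % i)
        fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
        os.write(fd, payload)
        os.fsync(fd)  # forces the flush-to-stable-storage path, like NFS COMMIT
        os.close(fd)
    elapsed = time.time() - start
    return n_files / elapsed

if __name__ == "__main__":
    # Point this at a filesystem on the device under test.
    print("%.0f synchronous 4KB writes/s" % sync_write_iops("/tmp/zil-test"))
```

A device that honours cache flushes slowly (like the Mtron here) will score far below its datasheet random-write number on this kind of test.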
On Fri, 19 Feb 2010, Felix Buenemann wrote:
> So it is apparent that the SSD has really poor random writes.
>
> But I was under the impression that the ZIL is mostly sequential writes
> or was I misinformed here?
>
> Maybe the cache syncs bring the device to its knees?

That's what it seems like. This particular device must actually be obeying the cache sync request rather than just pretending to, like many SSDs do.

Most SSDs are very good at seeking, and very good at random reads, but most are rather poor at small synchronous writes. The ones which are good at small synchronous writes cost more.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
Am 19.02.10 19:30, schrieb Bob Friesenhahn:
> That's what it seems like. This particular device must actually be
> obeying the cache sync request rather than just pretending to, like
> many SSDs do.
>
> Most SSDs are very good at seeking, and very good at random reads, but
> most are rather poor at small synchronous writes. The ones which are
> good at small synchronous writes cost more.

Too bad. I'm getting ~1000 IOPS with an Intel X25-M G2 MLC and around 300 with a regular USB stick, so 50 IOPS is really poor for an SLC SSD.

- Felix
On Fri, February 19, 2010 12:50, Felix Buenemann wrote:
> Too bad. I'm getting ~1000 IOPS with an Intel X25-M G2 MLC and around
> 300 with a regular USB stick, so 50 IOPS is really poor for an SLC SSD.

Well, but the Intel X25-M is the drive that really first cracked the problem (earlier high-performance drives were hideously expensive and rather brute force). Which was relatively recently. The industry is still evolving rapidly.

--
David Dyer-Bennet, dd-b at dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info
On Fri, 19 Feb 2010, David Dyer-Bennet wrote:
> Well, but the Intel X25-M is the drive that really first cracked the
> problem (earlier high-performance drives were hideously expensive and
> rather brute force). Which was relatively recently. The industry is
> still evolving rapidly.

What is the problem that the X25-M cracked? The X25-M is demonstrated to ignore cache sync and toss transactions. As such, it is useless for a ZIL.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
Am 19.02.10 20:50, schrieb Bob Friesenhahn:
> What is the problem that the X25-M cracked? The X25-M is demonstrated
> to ignore cache sync and toss transactions. As such, it is useless for
> a ZIL.

Yes, I see no difference with the X25-M between zfs_nocacheflush=0 and zfs_nocacheflush=1. After setting zfs_nocacheflush=1, the Mtron SSD also performed at around 1000 IOPS, which is still useless, because the array performs at the same IOPS without a dedicated ZIL.

Looking at the X25-E (SLC) benchmarks, it should be able to do about 3000 IOPS, which would improve array performance.

I think I'll try one of those inexpensive battery-backed PCI RAM drives from Gigabyte and see how many IOPS they can pull.

- Felix
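[Editor's note, not part of the original message: for readers wanting to repeat Felix's comparison, the cache-flush behaviour on OpenSolaris of that era was controlled by the zfs_nocacheflush kernel tunable. A sketch of the usual mechanisms, for testing only, since disabling flushes is unsafe with volatile write caches:]

```
* Persistent, applied at boot -- append to /etc/system:
set zfs:zfs_nocacheflush = 1

* Or toggle live on a running kernel with mdb (whole-host effect):
*   echo zfs_nocacheflush/W0t1 | mdb -kw    (disable cache flushes)
*   echo zfs_nocacheflush/W0t0 | mdb -kw    (restore the default)
```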
On Fri, February 19, 2010 13:50, Bob Friesenhahn wrote:
> What is the problem that the X25-M cracked? The X25-M is demonstrated
> to ignore cache sync and toss transactions. As such, it is useless for
> a ZIL.

But it's finally useful as, for example, a notebook boot drive. No previous vaguely affordable design was.

--
David Dyer-Bennet, dd-b at dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info
Felix.Buenemann at googlemail.com said:
> I think I'll try one of those inexpensive battery-backed PCI RAM drives
> from Gigabyte and see how many IOPS they can pull.

Another poster, Tracy Bernath, got decent ZIL IOPS from an OCZ Vertex unit. Dunno if that's sufficient for your purposes, but it looked pretty good for the money.

Marion
Am 19.02.10 21:29, schrieb Marion Hakanson:
> Another poster, Tracy Bernath, got decent ZIL IOPS from an OCZ Vertex
> unit. Dunno if that's sufficient for your purposes, but it looked
> pretty good for the money.

I found the Hyperdrive 5/5M, which is a half-height drive-bay SATA ramdisk with battery backup and auto-backup to compact flash on power failure. It promises 65,000 IOPS and thus should be great for ZIL. It's pretty reasonably priced (~230 EUR), and stacked with 4GB or 8GB DDR2-ECC it should be more than sufficient.

http://www.hyperossystems.co.uk/07042003/hardware.htm

- Felix
On Fri, Feb 19, 2010 at 11:17:29PM +0100, Felix Buenemann wrote:
> I found the Hyperdrive 5/5M, which is a half-height drive-bay SATA
> ramdisk with battery backup and auto-backup to compact flash on power
> failure. It promises 65,000 IOPS and thus should be great for ZIL.
> It's pretty reasonably priced (~230 EUR), and stacked with 4GB or 8GB
> DDR2-ECC it should be more than sufficient.

Wouldn't it be better investing these 300-350 EUR into 16 GByte or more of system memory, and a cheap UPS?

> http://www.hyperossystems.co.uk/07042003/hardware.htm

--
Eugen* Leitl <a href="http://leitl.org">leitl</a> http://leitl.org
______________________________________________________________
ICBM: 48.07100, 11.36820 http://www.ativel.com http://postbiota.org
8B29F6BE: 099D 78BA 2FD3 B014 B08A 7779 75B0 2443 8B29 F6BE
On 19 feb 2010, at 23.40, Eugen Leitl wrote:
> Wouldn't it be better investing these 300-350 EUR into 16 GByte or more
> of system memory, and a cheap UPS?

System memory can't replace a slog, since a slog is supposed to be non-volatile.

A UPS plus disabling the ZIL, or disabling synchronization, could possibly achieve the same result (or maybe better) IOPS-wise. This would probably work given that your computer never crashes in an uncontrolled manner. If it does, some data may be lost (and possibly the entire pool lost, if you are unlucky).

/ragge
On Fri, Feb 19, 2010 at 11:51:29PM +0100, Ragnar Sundblad wrote:
> On 19 feb 2010, at 23.40, Eugen Leitl wrote:
>> On Fri, Feb 19, 2010 at 11:17:29PM +0100, Felix Buenemann wrote:
>>> I found the Hyperdrive 5/5M, which is a half-height drive-bay SATA
>>> ramdisk with battery backup and auto-backup to compact flash on
>>> power failure. It promises 65,000 IOPS and thus should be great for
>>> ZIL. It's pretty reasonably priced (~230 EUR), and stacked with 4GB
>>> or 8GB DDR2-ECC it should be more than sufficient.

These are the same as the acard devices we've discussed here previously; earlier hyperdrive models were their own design. Very interesting, and my personal favourite, but I don't know of anyone actually reporting results yet with them as ZIL.

If you have more memory in them than is needed for ZIL, with some partitioning you could make a small fast pool on them for swap space and other purposes. I was originally looking at these for Postgres WAL logfiles, before there was slog and on a different platform.

Also, if you have enough non-ECC memory there's a mode where the device adds its own redundancy for reduced space, which could allow reusing existing kit - replace non-ECC system memory with ECC and feed the old sticks to the ramdisk.

>> Wouldn't it be better investing these 300-350 EUR into 16 GByte or
>> more of system memory, and a cheap UPS?
>
> System memory can't replace a slog, since a slog is supposed to be
> non-volatile.

System memory might already be maxed out, too.

--
Dan.
On 19-Feb-10, at 5:40 PM, Eugen Leitl wrote:
> On Fri, Feb 19, 2010 at 11:17:29PM +0100, Felix Buenemann wrote:
>> I found the Hyperdrive 5/5M, which is a half-height drive-bay SATA
>> ramdisk with battery backup and auto-backup to compact flash on power
>> failure. It promises 65,000 IOPS and thus should be great for ZIL.
>> It's pretty reasonably priced (~230 EUR), and stacked with 4GB or 8GB
>> DDR2-ECC it should be more than sufficient.
>
> Wouldn't it be better investing these 300-350 EUR into 16 GByte or more
> of system memory, and a cheap UPS?

That would depend on the read/write mix, I think?

--Toby
> A UPS plus disabling the ZIL, or disabling synchronization, could
> possibly achieve the same result (or maybe better) IOPS-wise.

Even with the fastest slog, disabling the ZIL will always be faster... (fewer bytes to move)

> This would probably work given that your computer never crashes
> in an uncontrolled manner. If it does, some data may be lost
> (and possibly the entire pool lost, if you are unlucky).

The pool would never be at risk, but when your server reboots, its clients will be confused that things they sent, and the server promised it had saved, are gone. For some clients, this small loss might be the loss of their entire dataset.

Rob
> These are the same as the acard devices we've discussed here
> previously; earlier hyperdrive models were their own design. Very
> interesting, and my personal favourite, but I don't know of anyone
> actually reporting results yet with them as ZIL.

Here's one report:
http://www.mail-archive.com/zfs-discuss at opensolaris.org/msg27739.html
On 20 feb 2010, at 02.34, Rob Logan wrote:
>> This would probably work given that your computer never crashes
>> in an uncontrolled manner. If it does, some data may be lost
>> (and possibly the entire pool lost, if you are unlucky).
>
> The pool would never be at risk, but when your server reboots, its
> clients will be confused that things they sent, and the server promised
> it had saved, are gone. For some clients, this small loss might be the
> loss of their entire dataset.

No, the entire pool shouldn't be at risk, you are right of course; I don't know what I was thinking. Sorry!

/ragge
Am 20.02.10 01:33, schrieb Toby Thain:
> On 19-Feb-10, at 5:40 PM, Eugen Leitl wrote:
>> Wouldn't it be better investing these 300-350 EUR into 16 GByte or
>> more of system memory, and a cheap UPS?
>
> That would depend on the read/write mix, I think?

Well, the workload will include MaxDB (SAP), Exchange and file services (SMB), with the OpenSolaris box acting as a VMFS iSCSI target for VMware vSphere. Due to the mixed workload it's hard to predict exactly what the I/O distribution will look like, so I'm trying to build a system that can hold up in various usage scenarios. I've been testing with NFS because it loads the ZIL heavily.

Btw., in my testing I didn't really see a performance improvement with the ZIL disabled over the on-disk ZIL, but I've only been testing with a single NFS client. Or do I need multiple concurrent clients to benefit from an external ZIL?

Also, is there a guideline on sizing the ZIL? I think in most cases even 1GB would be enough, but I haven't done any heavy testing.

- Felix
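[Editor's sketch, not an answer given in the thread: a common back-of-envelope for slog sizing is that it only has to buffer the synchronous writes accumulated over a couple of transaction-group commit intervals, since older ZIL blocks are freed once their txg commits. The 30-second txg interval and the GbE ingest rate below are illustrative assumptions.]

```python
# Back-of-envelope slog sizing: the ZIL only needs to hold the sync writes
# accumulated between transaction-group commits, so size it to ingest rate
# times a couple of txg intervals.  All figures are assumptions, not specs.
def slog_size_bytes(write_rate_mb_s, txg_interval_s=30, txg_groups=2):
    """Upper bound on useful slog size for a given sync-write ingest rate."""
    return write_rate_mb_s * 1024 * 1024 * txg_interval_s * txg_groups

# Even a host ingesting a full gigabit link (~120 MB/s) of pure sync writes:
need = slog_size_bytes(120)
print("%.1f GiB" % (need / 2**30))  # prints "7.0 GiB" under these assumptions
```

On that arithmetic, Felix's guess that a few GB suffices looks right for a single GbE client; the bound scales with link speed, not pool size.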
On 20/02/2010 01:34, Rob Logan wrote:
> The pool would never be at risk, but when your server reboots, its
> clients will be confused that things they sent, and the server promised
> it had saved, are gone. For some clients, this small loss might be the
> loss of their entire dataset.

To be precise, he wrote about disabling the ZIL *or* cache flushes. While disabling the ZIL won't have any impact on pool consistency, disabling cache flushes might.

--
Robert Milkowski
http://milek.blogspot.com
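[Editor's note, not part of the original message: the two knobs being conflated were distinct tunables on OpenSolaris of that era. An illustrative /etc/system fragment (comment lines start with '*'):]

```
* Disable the ZIL: sync writes are acknowledged before reaching stable
* storage. The pool stays consistent on crash, but clients can lose
* acknowledged data.
set zfs:zil_disable = 1

* Disable cache flushes: risks on-disk consistency on power loss unless
* every device has a nonvolatile write cache. Not the same thing at all.
set zfs:zfs_nocacheflush = 1
```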
>>>>> "el" == Eugen Leitl <eugen at leitl.org> writes:

    el> Wouldn't it be better investing these 300-350 EUR into 16
    el> GByte or more of system memory, and a cheap UPS?

If you think the UPS is good enough that you never have to worry about your machine rebooting, then the extra memory isn't needed to match the ACARD/hyperdrive performance: just disable ZIL writing. AIUI there is zero point to making a ramdisk slog because you are just copying from one area of memory into another.

IMHO disabling the ZIL outright is not dumb, but it should be obvious that a UPS is totally unequivalent to a slog because it doesn't protect against kernel panics, which are quite common on a ZFS NAS if you include cases where the system didn't actually panic but became unresponsive and had to be hard-rebooted (ex., for something as simple as ``a drive went bad by becoming slow''), or cases where you tried to shut down but the box got hung somewhere in SMF shutdown land because one of the hundred little scriptlets decided never to finish. Also, in the real world people trip over cables, NYC mains is more reliable than UPS power because of equipment failures and botched battery upgrades and nonserious sites without bypass switches, blah blah blah.

The interesting middle ground someone brought up months ago was a cluster with two RAMdisks, each host iscsi-exporting to the other, where whichever node is active runs a mirrored slog of the two ramdisks. The actual slog does not really need to be mirrored so much as `one ramdisk stored on the inactive member', just 1 copy: just get the data _off_ the node we're worried about panicking. Then if something goes wrong, the inactive member ought to push a copy of the slog out somewhere else before trying to force-import the pool, in case one of the main pool devices or its contents is poisonous; but once imported, the newly-active member can discard its local slog ramdisk and keep just the remote one attached.
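[Editor's sketch, not part of the original message: from ZFS's point of view the cross-exported ramdisk pair above is just an ordinary mirrored log vdev. The device names below are hypothetical.]

```
# Hypothetical device names: a local ramdisk plus the partner node's
# iSCSI-exported ramdisk, attached as a mirrored slog.
zpool add tank log mirror /dev/ramdisk/slog0 c4t2d0

# If the partner becomes unreachable, the survivor can drop that half:
zpool detach tank c4t2d0
```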
IMHO this general idea's got long-term potential, the idea being: ``treat data stored in RAM on <n> nodes as somewhat nonvolatile (given there exists a procedure to follow if it's lost anyway, such as `restart all the NFS clients' or `live with a little bit of lost email')''.
On Fri, Feb 19, 2010 at 2:17 PM, Felix Buenemann <Felix.Buenemann at googlemail.com> wrote:
> Yes, I see no difference with the X25-M between zfs_nocacheflush=0 and
> zfs_nocacheflush=1. After setting zfs_nocacheflush=1, the Mtron SSD
> also performed at around 1000 IOPS, which is still useless, because the
> array performs at the same IOPS without a dedicated ZIL.
>
> I think I'll try one of those inexpensive battery-backed PCI RAM drives
> from Gigabyte and see how many IOPS they can pull.

I've given up on the Gigabyte card - it's basically unstable. I've tested it as a disk drive under ZFS and "another" operating system. Again - it glitches out - sometimes after only a couple of minutes if you tar up /usr (as a relative tar file), gzip it, and then untar it onto the Gigabyte "drive".

I've got two of them - both using the recommended Kingston RAM. Both are unstable/flaky. I've tried removing some of the RAM to see if that makes a difference - it does not.

Conclusion: Run Away from this product.

Regards,

--
Al Hopper  Logical Approach Inc, Plano, TX.  al at logical-approach.com
           Voice: 972.379.2133 Timezone: US CDT
OpenSolaris Governing Board (OGB) Member - Apr 2005 to Mar 2007
http://www.opensolaris.org/os/community/ogb/ogb_2005-2007/