I know there's been much discussion on the list lately about getting HW arrays to use (or not use) their caches in a way that helps ZFS the most.

Just yesterday I started seeing articles on NAND flash drives, and I know other solid state drive technologies have been around for a while and are often used for transaction logs or other ways of accelerating filesystems.

If these devices become more prevalent and/or cheaper, I'm curious what ways ZFS could be made to best take advantage of them?

One idea I had was to let me designate, for each pool, a mirror or RAID-Z of these devices just for the transaction logs. Since they're faster than normal disks, my uneducated guess is that they could boost performance.

I suppose it doesn't eliminate the problems with the real drive (or array) caches though. You still need to know that the data is on the real drives before you can wipe that transaction from the transaction log, right?

Well... I'd still like to hear the experts' ideas on how this could (or won't ever?) help ZFS out. Would changes to ZFS be required?

-Kyle

I'm currently working on putting the ZFS intent log on separate devices, which could include separate disks and NVRAM/solid state devices. This would help any application using fsync/O_DSYNC - in particular DB and NFS. From prototyping, considerable performance improvements have been seen.

Neil.

Kyle McDonald wrote On 01/05/07 08:10:
> If these devices become more prevalent and/or cheaper, I'm curious what
> ways ZFS could be made to best take advantage of them?
>
> One idea I had was to let me designate, for each pool, a mirror or RAID-Z
> of these devices just for the transaction logs.

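The workloads Neil mentions share one trait: the application issues synchronous writes, so it does not regain control until the data is on stable storage, and the latency of whatever device backs the intent log is the latency the application sees. A minimal sketch of that pattern (Python; the file names and record size are made up for illustration, and os.O_DSYNC is only available where the platform provides it):

    import os

    # Open with O_DSYNC so every write() returns only after the data has
    # reached stable storage; this is the guarantee a database redo log
    # or an NFS server needs.
    fd = os.open("/tank/db/redo.log",
                 os.O_WRONLY | os.O_CREAT | os.O_DSYNC, 0o600)

    record = b"x" * 8192              # one hypothetical 8 KB log record
    for _ in range(1000):
        os.write(fd, record)          # each call waits on the log device,
                                      # not on the (slower) main pool disks
    os.close(fd)

    # The other common style: buffered writes followed by an explicit
    # fsync(), which also blocks until the data is stable. ZFS satisfies
    # both cases by committing the intent log.
    fd = os.open("/tank/db/table.dat", os.O_WRONLY | os.O_CREAT, 0o600)
    os.write(fd, record)
    os.fsync(fd)
    os.close(fd)

With the intent log on NVRAM or an SSD, each of these calls completes as soon as the log write does; the data still migrates to the main pool disks when the transaction group commits.
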
> If [SSD or Flash] devices become more prevalent and/or cheaper, I'm curious what
> ways ZFS could be made to best take advantage of them?

The intent log is a possibility, but this would work better with SSD than Flash; Flash writes can actually be slower than sequential writes to a real disk.

Metadata accesses tend to be more random than data accesses, so separating metadata from data can be useful with most file systems. Whether this is also true for ZFS is unclear, since even sequential data access in ZFS results in random reads.

> You still need to know that the data is on the real drives
> before you can wipe that transaction from the
> transaction log, right?

Yes; but if you allow the transaction log to grow to a large size, you could eliminate the need to flush data to the real drives so frequently. Since random access doesn't cost any more than sequential access on an SSD, a large log doesn't hurt performance much, and I suspect it would be possible to reduce the frequency at which the real cache is flushed.

Anton

Hello Neil,

Friday, January 5, 2007, 4:36:05 PM, you wrote:

NP> I'm currently working on putting the ZFS intent log on separate devices
NP> which could include separate disks and NVRAM/solid state devices.
NP> This would help any application using fsync/O_DSYNC - in particular
NP> DB and NFS. From prototyping, considerable performance improvements have
NP> been seen.

Can you share any results from prototype testing?

--
Best regards,
Robert
mailto:rmilkowski at task.gda.pl
http://milek.blogspot.com

On Fri, 5 Jan 2007, Anton B. Rang wrote:

>> If [SSD or Flash] devices become more prevalent and/or cheaper, I'm curious what
>> ways ZFS could be made to best take advantage of them?
>
> The intent log is a possibility, but this would work better with SSD than
> Flash; Flash writes can actually be slower than sequential writes to a real
> disk.

The original poster is probably referring to yesterday's announcement from SanDisk. Here's the nearest thing (publicly) available that provides basic specifications:

http://www.sandisk.com/Assets/File/pdf/oem/SanDisk%20SSD%20UATA%205000%201.8.pdf

SanDisk bought MSystems (IIRC) - so they know how to "distribute" writes to prevent premature failure caused by constantly writing to the same flash memory location(s).

Summary (1.8" form factor): write: 35 MB/sec, read: 62 MB/sec, IOPS: 7,000

The usual disclaimers apply - will have to test one to see what it can really do ... but 7,000 IOPS makes it interesting IMHO.

... snip ....

Al Hopper  Logical Approach Inc, Plano, TX.  al at logical-approach.com
           Voice: 972.379.2133  Fax: 972.379.2134  Timezone: US CDT
OpenSolaris.Org Community Advisory Board (CAB) Member - Apr 2005
OpenSolaris Governing Board (OGB) Member - Feb 2006

Robert Milkowski wrote On 01/05/07 11:45:
> Hello Neil,
>
> Friday, January 5, 2007, 4:36:05 PM, you wrote:
>
> NP> I'm currently working on putting the ZFS intent log on separate devices
> NP> which could include separate disks and NVRAM/solid state devices.
> NP> This would help any application using fsync/O_DSYNC - in particular
> NP> DB and NFS. From prototyping, considerable performance improvements have
> NP> been seen.
>
> Can you share any results from prototype testing?

I'd prefer not to just yet, as I don't want to raise expectations unduly. When testing I was using a simple local benchmark, whereas I'd prefer to run something more official such as TPC. I'm also missing a few required features in the prototype which may affect performance.

Hopefully I can provide some results soon, but even those will be unofficial.

Neil.

Could this ability (a separate ZIL device) coupled with an SSD give something like a Thumper the write latency benefit of battery-backed write cache?

Best Regards,
Jason

On 1/5/07, Neil Perrin <Neil.Perrin at sun.com> wrote:
> Hopefully I can provide some results soon, but even those will be
> unofficial.

Al Hopper wrote:
> On Fri, 5 Jan 2007, Anton B. Rang wrote:
>
>>> If [SSD or Flash] devices become more prevalent and/or cheaper, I'm curious what
>>> ways ZFS could be made to best take advantage of them?
>>
>> The intent log is a possibility, but this would work better with SSD than
>> Flash; Flash writes can actually be slower than sequential writes to a real
>> disk.
>
> The original poster is probably referring to yesterday's announcement from
> SanDisk. Here's the nearest thing (publicly) available that provides
> basic specifications:
>
> http://www.sandisk.com/Assets/File/pdf/oem/SanDisk%20SSD%20UATA%205000%201.8.pdf
>
> SanDisk bought MSystems (IIRC) - so they know how to "distribute" writes
> to prevent premature failure caused by constantly writing to the same
> flash memory location(s).
>
> Summary (1.8" form factor): write: 35 MB/sec, read: 62 MB/sec, IOPS: 7,000

That is on par with a 5400 rpm disk, except for the 100x more small, random read IOPS. The biggest issue is the pricing, which will become interestingly competitive for mortals this year.

wrt RAS, NAND flash wear leveling is widely used already. Combine that with ZFS's checksum and RAS features and it makes a very interesting combination. We've already done some MTBF testing of such devices and they do seem to be more reliable than spinning rust.

In case you didn't know, some Sun servers already support CF as a boot device, and 8 GByte CFs are readily available.
 -- richard

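Richard's "100x more small, random read IOPS" follows from the mechanical service time of a 5400 rpm drive. A rough sketch (Python; the 12 ms average seek time is an assumption, not a measured figure for any particular drive):

    rpm = 5400
    avg_seek_ms = 12.0                     # assumed typical 5400 rpm seek time
    avg_rotation_ms = 60000.0 / rpm / 2    # wait half a revolution on average

    service_ms = avg_seek_ms + avg_rotation_ms
    disk_iops = 1000.0 / service_ms        # roughly 55-60 random IOPS
    flash_iops = 7000                      # figure from the SanDisk spec sheet

    print(f"5400 rpm disk: ~{disk_iops:.0f} random IOPS")
    print(f"flash device : {flash_iops} IOPS, ~{flash_iops / disk_iops:.0f}x higher")

So the flash device is on the order of a hundred times better for small random reads, while its 35/62 MB/sec streaming numbers are indeed in the same range as a 5400 rpm disk.
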
>> Summary (1.8" form factor): write: 35 MB/sec, read: 62 MB/sec, IOPS: 7,000
>
> That is on par with a 5400 rpm disk, except for the 100x more small, random
> read IOPS. The biggest issue is the pricing, which will become interestingly
> competitive for mortals this year.

$600+ for a 32 GB device isn't exactly competitive, though the low power and random access are attractive.

From a ZFS perspective, I wouldn't be very excited about 35 MB/sec write speeds (though again, the random access is attractive). Of course, considering it's a 32 GB device, you could stripe ten of them to get a 320 GB flash disk which could write at 350 MB/sec, for $6000....

Could be nice for laptops eventually, but the relatively small capacity is a problem for the consumer space (except maybe at the low end).

Anton

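Anton's stripe numbers are easy to check; a back-of-envelope script (Python, using the per-device figures quoted earlier in the thread, which are vendor claims rather than measurements):

    # Per-device figures from the SanDisk spec sheet discussed above.
    capacity_gb = 32
    write_mb_s  = 35
    read_mb_s   = 62
    iops        = 7000
    price_usd   = 600          # rough price mentioned in the thread

    n = 10                     # devices in a plain stripe

    # A simple stripe scales capacity, bandwidth and IOPS roughly
    # linearly, ignoring controller and bus limits.
    print(f"capacity : {n * capacity_gb} GB")    # 320 GB
    print(f"write    : {n * write_mb_s} MB/s")   # 350 MB/s
    print(f"read     : {n * read_mb_s} MB/s")    # 620 MB/s
    print(f"IOPS     : {n * iops}")              # 70,000
    print(f"price    : ${n * price_usd}")        # $6,000

Mirroring or RAID-Z across the devices would trade some of that capacity and write bandwidth back for redundancy.
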
On Fri, Jan 05, 2007 at 10:39:59PM -0800, Anton B. Rang wrote:
>>> Summary (1.8" form factor): write: 35 MB/sec, read: 62 MB/sec, IOPS: 7,000
>>
>> That is on par with a 5400 rpm disk, except for the 100x more small, random
>> read IOPS. The biggest issue is the pricing, which will become interestingly
>> competitive for mortals this year.
>
> $600+ for a 32 GB device isn't exactly competitive, though the low power and
> random access are attractive.

Look at previous SSD offerings. $600 is a steal. ;)

> From a ZFS perspective, I wouldn't be very excited about 35 MB/sec write
> speeds (though again, the random access is attractive). Of course,
> considering it's a 32 GB device, you could stripe ten of them to get a
> 320 GB flash disk which could write at 350 MB/sec, for $6000....
>
> Could be nice for laptops eventually, but the relatively small capacity is a
> problem for the consumer space (except maybe at the low end).

SSD is not intended for the consumer space, which is sad, as I'd love to run that sort of thing at home. ;)

Prices will come down, but slowly. It's that whole supply and demand thing. People who run larger DBs are willing to shell out several grand for what seems like a small amount of disk space just to tweak DB performance. That will keep prices up.

-brian

>> $600+ for a 32 GB device isn't exactly competitive, though the low power and
>> random access are attractive.
>
> Look at previous SSD offerings. $600 is a steal. ;)

This isn't a performance-oriented SSD, since it's using Flash RAM (limited lifetime, slow writes). It's really meant as a hard drive replacement. So comparing it against a typical commercial SSD (multi-ported, 200K+ IOPS) doesn't quite make sense.... (Note that the 7,000 IOPS quoted implies that random access is much slower than the sequential access speed; you won't see anywhere near 30 MB/sec.)

Incidentally, Ritek's flash disk may be even cheaper; their 16 GB version is $169, according to their press release. I didn't see pricing for the 32 GB. (Though the press release didn't make it entirely clear whether that actually included the memory....)

> SSD is not intended for the consumer space, which is sad, as I'd love to run
> that sort of thing at home. ;)

But Flash memory is. ;-)

> People who run larger DBs are willing to shell out several grand for what
> seems like a small amount of disk space just to tweak DB performance.

SSD is cheaper than buying a fast enough set of disk arrays to get the same performance. ;-)

Anton

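To put that parenthetical about the 7,000 IOPS figure in numbers: random throughput is roughly IOPS times transfer size. A small sketch (Python; the transfer sizes are assumptions, since the spec sheet doesn't say what size the IOPS figure was measured at):

    iops = 7000

    for size_kb in (0.5, 2, 4, 8):
        mb_s = iops * size_kb / 1024       # KB/s to MB/s (1 MB = 1024 KB here)
        print(f"{size_kb:>4} KB transfers: ~{mb_s:.1f} MB/s")

    # ~3.4 MB/s at 512 bytes, ~13.7 MB/s at 2 KB, ~27 MB/s at 4 KB,
    # ~55 MB/s at 8 KB, with the larger sizes capped in practice by the
    # 35/62 MB/sec media bandwidth anyway.

At the small transfer sizes IOPS figures are usually quoted for, random throughput sits well below the 62 MB/sec sequential read number, which is the point Anton is making.
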
Just a thought: would it be theoretically possible to designate some device as a system-wide write cache for all FS writes? Not just ZFS, but for everything...

In a manner similar to the way we currently use extra RAM as a cache for FS reads (and writes, to a certain extent), it would be really nice to be able to say that an NVRAM/Flash/etc. device is the system-wide write cache, so that calls to fsync() and the like - which currently force a flush of the RAM-resident buffers to disk - would return as complete after the data was written to such an SSD (even though it might not all be written to a HD yet).

Thoughts? How difficult would this be? And, problems? (The biggest I can see is for Flash, which, if it is being constantly written to, will wear out relatively quickly...)

-Erik

On 1/11/07, Erik Trimble <Erik.Trimble at sun.com> wrote:
> Just a thought: would it be theoretically possible to designate some
> device as a system-wide write cache for all FS writes? Not just ZFS,
> but for everything... In a manner similar to the way we currently use
> extra RAM as a cache for FS reads (and writes, to a certain extent), it
> would be really nice to be able to say that an NVRAM/Flash/etc. device
> is the system-wide write cache, so that calls to fsync() and the like -
> which currently force a flush of the RAM-resident buffers to disk -
> would return as complete after the data was written to such an SSD (even
> though it might not all be written to a HD yet).

As a first step, we could just mark the physical vdev(s) corresponding to such device(s) and use that as the sole backing store for the ZIL. That would give us a quick win for the NFS/ZFS model.

--
Regards,
Jeremy

Erik Trimble wrote:
> Just a thought: would it be theoretically possible to designate some
> device as a system-wide write cache for all FS writes? Not just ZFS,
> but for everything... In a manner similar to the way we currently use
> extra RAM as a cache for FS reads (and writes, to a certain extent), it
> would be really nice to be able to say that an NVRAM/Flash/etc. device
> is the system-wide write cache, so that calls to fsync() and the like -
> which currently force a flush of the RAM-resident buffers to disk -
> would return as complete after the data was written to such an SSD (even
> though it might not all be written to a HD yet).
>
> Thoughts? How difficult would this be? And, problems? (The biggest I
> can see is for Flash, which, if it is being constantly written to, will
> wear out relatively quickly...)

The product was called Sun PrestoServ. It was successful for benchmarking and such, but unsuccessful in the market because:

 + when there is a failure, your data is spread across multiple
   fault domains

 + it is not clusterable, which is often a requirement for data
   centers

 + it used a battery, so you had to deal with physical battery
   replacement and all of the associated battery problems

 + it had yet another device driver, so integration was a pain

Google for it and you'll see all sorts of historical perspective.
 -- richard

Jeremy Teo wrote On 01/11/07 01:38:
> On 1/11/07, Erik Trimble <Erik.Trimble at sun.com> wrote:
>
>> Just a thought: would it be theoretically possible to designate some
>> device as a system-wide write cache for all FS writes? Not just ZFS,
>> but for everything... In a manner similar to the way we currently use
>> extra RAM as a cache for FS reads (and writes, to a certain extent), it
>> would be really nice to be able to say that an NVRAM/Flash/etc. device
>> is the system-wide write cache, so that calls to fsync() and the like -
>> which currently force a flush of the RAM-resident buffers to disk -
>> would return as complete after the data was written to such an SSD (even
>> though it might not all be written to a HD yet).
>
> As a first step, we could just mark the physical vdev(s) corresponding
> to such device(s) and use that as the sole backing store for the ZIL.
> That would give us a quick win for the NFS/ZFS model.

We are currently working on separate log devices such as disk and nvram. This should help with both NFS and DB performance.

Neil.

On Thu, 2007-01-11 at 10:35 -0800, Richard Elling wrote:
> The product was called Sun PrestoServ. It was successful for benchmarking
> and such, but unsuccessful in the market because:
>
>  + when there is a failure, your data is spread across multiple
>    fault domains
>
>  + it is not clusterable, which is often a requirement for data
>    centers
>
>  + it used a battery, so you had to deal with physical battery
>    replacement and all of the associated battery problems
>
>  + it had yet another device driver, so integration was a pain
>
> Google for it and you'll see all sorts of historical perspective.
>  -- richard

Yes, I remember (and used) PrestoServ. Back in the SPARCcenter 1000 days. :-)

And yes, local caching makes the system non-clusterable. However, all the other issues are common to a typical HW RAID controller, and many people use host-based HW controllers just fine and don't find their problems to be excessive.

And, honestly, I wouldn't think another driver would be needed. Attaching an SSD or similar usually uses an existing driver (it normally appears as a SCSI or FC drive to the OS).

--
Erik Trimble
Java System Support
Mailstop: usca14-102
Phone: x17195
Santa Clara, CA
Timezone: US/Pacific (GMT-0800)

On Jan 11, 2007, at 15:42, Erik Trimble wrote:

> On Thu, 2007-01-11 at 10:35 -0800, Richard Elling wrote:
>> The product was called Sun PrestoServ. It was successful for benchmarking
>> and such, but unsuccessful in the market because:
>>
>>  + when there is a failure, your data is spread across multiple
>>    fault domains
>>
>>  + it is not clusterable, which is often a requirement for data
>>    centers
>>
>>  + it used a battery, so you had to deal with physical battery
>>    replacement and all of the associated battery problems
>>
>>  + it had yet another device driver, so integration was a pain
>>
>> Google for it and you'll see all sorts of historical perspective.
>>  -- richard
>
> Yes, I remember (and used) PrestoServ. Back in the SPARCcenter 1000
> days. :-)

as do i .. (keep your batteries charged!! and don't panic!)

> And yes, local caching makes the system non-clusterable.

not necessarily .. i like the javaspaces approach to coherency, and companies like gigaspaces have done some pretty impressive things with in-memory SBA databases and distributed grid architectures .. intelligent coherency design with a good distribution balance for local, remote, and redundant can go a long way in improving your cache numbers.

> However, all
> the other issues are common to a typical HW RAID controller, and many
> people use host-based HW controllers just fine and don't find their
> problems to be excessive.

True given most workloads, but in general it's the coherency issues that drastically affect throughput on shared controllers, particularly as you add and distribute the same luns or data across different control processors. Add too many and your cache hit rates might fall in the toilet.

.je

Hello all,

Just my two cents on the issue. The Thumper is proving to be a terrific database server in all aspects except latency. While the latency is acceptable, being able to add some degree of battery-backed write cache that ZFS could use would be phenomenal.

Best Regards,
Jason

On 1/11/07, Jonathan Edwards <Jonathan.Edwards at sun.com> wrote:
> True given most workloads, but in general it's the coherency issues
> that drastically affect throughput on shared controllers, particularly
> as you add and distribute the same luns or data across different
> control processors. Add too many and your cache hit rates might fall
> in the toilet.

Neil Perrin wrote:
> Jeremy Teo wrote On 01/11/07 01:38:
>> On 1/11/07, Erik Trimble <Erik.Trimble at sun.com> wrote:
>>
>>> Just a thought: would it be theoretically possible to designate some
>>> device as a system-wide write cache for all FS writes? Not just ZFS,
>>> but for everything... In a manner similar to the way we currently use
>>> extra RAM as a cache for FS reads (and writes, to a certain extent), it
>>> would be really nice to be able to say that an NVRAM/Flash/etc. device
>>> is the system-wide write cache, so that calls to fsync() and the like -
>>> which currently force a flush of the RAM-resident buffers to disk -
>>> would return as complete after the data was written to such an SSD (even
>>> though it might not all be written to a HD yet).
>>
>> As a first step, we could just mark the physical vdev(s) corresponding
>> to such device(s) and use that as the sole backing store for the ZIL.
>> That would give us a quick win for the NFS/ZFS model.
>
> We are currently working on separate log devices such as disk and nvram.
> This should help with both NFS and DB performance.

This is a critical difference -- ZFS will *know* about the log device! This means that we should be able to maintain the advantages of ZFS and low-latency persistent storage without the data loss disadvantage^H^H^H opportunity that plagues a PrestoServ-like system. PrestoServ didn't have any idea what data was stored in it, nor its relationship to other data on the system.
 -- richard

Neil Perrin wrote:
> We are currently working on separate log devices such as disk and nvram.
> This should help with both NFS and DB performance.

It also makes things "interesting" from the zfs-crypto viewpoint. It means it would allow a configuration where we don't do encryption on the ZIL but instead put it on a different type of device.

Obviously not for everyone, and certainly not for the case where you are only using spinning rust (especially if there is only one spindle), but "interesting" anyway.

--
Darren J Moffat