I'm in the planning stages of a rather large ZFS system to house approximately 1 PB of data.

I have only one system with SSDs for L2ARC and ZIL, and the ZIL seems to be the bottleneck for large bursts of data being written. I can't confirm this for sure, but when I throw enough data at my storage pool that write latency starts rising, the ZIL write speed hangs close to the max sustained throughput I've measured on the SSD (~200 MB/s).

The pool, when empty and without L2ARC or ZIL, was tested with Bonnie++ and showed ~1300 MB/s serial read and ~800 MB/s serial write speed.

How can I determine for sure that my ZIL is my bottleneck? If it is the bottleneck, is it possible to keep adding mirrored pairs of SSDs to the ZIL to make it faster? Or should I be looking for a DDR drive, ZeusRAM, etc.?

Thanks for any input,
-Chip
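For a first check, the device-level statistics already on the system can show whether the log device is pegged during a burst. A minimal sketch, assuming an illumos-era box and a pool named tank (the pool name is a placeholder, not from the post):

    # Per-vdev throughput; the slog appears under its own "logs" section, so
    # it is easy to see when it sits near its ~200 MB/s ceiling while the
    # data vdevs stay mostly idle.
    zpool iostat -v tank 5

    # Cross-check service times and %busy on the SSD itself.
    iostat -xn 5

If the log device shows sustained writes near its measured ceiling while pool write latency climbs, that points at the slog rather than the data vdevs.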
I found something similar happening when writing over NFS (at significantly lower throughput than available on the system directly), specifically that effectively all data, even asynchronous writes, were being written to the ZIL, which I eventually traced (with help from Richard Elling and others on this list) at least partially to the Linux NFS client issuing commit requests before ZFS wanted to write the asynchronous data to a txg. I tried fiddling with zfs_write_limit_override to get more data onto normal vdevs faster, but this reduced performance (perhaps setting a tunable to make ZFS not throttle writes while hitting the write limit could fix that), and it didn't cause it to go significantly easier on the ZIL devices. I decided to live with the default behavior, since my main bottleneck is Ethernet anyway, and the projected lifespan of the ZIL devices was fairly large due to our workload.

I did find that setting logbias=throughput on a ZFS filesystem caused it to act as though the ZIL devices weren't there, which actually reduced commit times under continuous streaming writes. This is mostly due to having more throughput for the same amount of data to commit, in large chunks, but the zilstat script also reported less writing to ZIL blocks (which are allocated from normal vdevs when there is no log device, or with logbias=throughput) under this condition, so perhaps there is more to the story. If you have different workloads for different datasets, this could help, since it isn't a pool-wide setting. Obviously, small synchronous writes to that ZFS filesystem will take a large hit from this setting.

It would be nice if there was a feature in ZFS that could direct small commits to ZIL blocks on log devices, but behave like logbias=throughput for large commits. It would probably need manual tuning, but it would treat SSD log devices more gently, and increase performance for large contiguous writes.

If you can't configure ZFS to write less data to the ZIL, I think a RAM-based ZIL device would be a good way to get throughput up higher (with fewer worries about flash endurance, etc.).

Tim

On Wed, Oct 3, 2012 at 1:28 PM, Schweiss, Chip <chip at innovates.com> wrote:
> How can I determine for sure that my ZIL is my bottleneck? If it is the
> bottleneck, is it possible to keep adding mirrored pairs of SSDs to the ZIL
> to make it faster? Or should I be looking for a DDR drive, ZeusRAM, etc.?
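A sketch of how that per-dataset setting might be applied, using tank/streaming and tank/vms as stand-in dataset names (logbias is a standard ZFS property; the dataset names are assumptions):

    # Large streaming commits for this dataset bypass the slog and go to the
    # main pool vdevs.
    zfs set logbias=throughput tank/streaming

    # Latency-sensitive datasets keep the default and continue to use the slog.
    zfs set logbias=latency tank/vms

    # Confirm the settings.
    zfs get logbias tank/streaming tank/vms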
To answer your questions more directly: zilstat is what I used to check what the ZIL was doing: http://www.richardelling.com/Home/scripts-and-programs-1/zilstat

While I have added a mirrored log device, I haven't tried adding multiple sets of mirrored log devices, but I think it should work. I believe that a failed unmirrored log device is only a problem if the pool is ungracefully closed before ZFS notices that the log device failed (i.e., simultaneous power failure and log device failure), so mirroring them may not be required.

Tim

On Wed, Oct 3, 2012 at 2:54 PM, Timothy Coalson <tsc5yc at mst.edu> wrote:
> I found something similar happening when writing over NFS (at
> significantly lower throughput than available on the system directly),
> specifically that effectively all data, even asynchronous writes, were
> being written to the ZIL ...
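A sketch of both suggestions, assuming the zilstat script has been downloaded to the current directory and that c4t2d0/c4t3d0 are spare SSDs (the invocation, pool name, and device names are assumptions, not from the thread):

    # Report ZIL activity (bytes and ops going to ZIL blocks) every second.
    ./zilstat 1

    # Add a second mirrored SSD pair as another log vdev; the ZIL then
    # spreads its block allocations across both log mirrors.
    zpool add tank log mirror c4t2d0 c4t3d0
    zpool status tank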
Edward Ned Harvey (opensolarisisdeadlongliveopensolaris)
2012-Oct-04 01:33 UTC
[zfs-discuss] Making ZIL faster
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Schweiss, Chip
>
> How can I determine for sure that my ZIL is my bottleneck? If it is the
> bottleneck, is it possible to keep adding mirrored pairs of SSDs to the ZIL to
> make it faster? Or should I be looking for a DDR drive, ZeusRAM, etc.

Temporarily set sync=disabled

Or, depending on your application, leave it that way permanently. I know, for the work I do, most systems I support at most locations have sync=disabled. It all depends on the workload.
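A sketch of that test on a single dataset, with tank/nfsshare as a placeholder name (sync is a standard ZFS property; the dataset name is an assumption):

    # Bypass the ZIL entirely for this dataset, rerun the burst workload,
    # and compare throughput.
    zfs set sync=disabled tank/nfsshare

    # Restore normal synchronous-write semantics afterwards.
    zfs set sync=standard tank/nfsshare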
Edward Ned Harvey (opensolarisisdeadlongliveopensolaris) wrote:
>> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
>> bounces at opensolaris.org] On Behalf Of Schweiss, Chip
>>
>> How can I determine for sure that my ZIL is my bottleneck? If it is the
>> bottleneck, is it possible to keep adding mirrored pairs of SSDs to the ZIL to
>> make it faster? Or should I be looking for a DDR drive, ZeusRAM, etc.
>
> Temporarily set sync=disabled
> Or, depending on your application, leave it that way permanently. I know, for
> the work I do, most systems I support at most locations have sync=disabled.
> It all depends on the workload.

Noting of course that this means that in the case of an unexpected system outage or loss of connectivity to the disks, synchronous writes since the last txg commit will be lost, even though the applications will believe they are secured to disk. (The ZFS filesystem won't be corrupted, but it will look like it's been wound back by up to 30 seconds when you reboot.)

This is fine for some workloads, such as those where you would start again with fresh data and those which can look closely at the data to see how far they got before being rudely interrupted, but not for those which rely on the POSIX semantics of synchronous writes/syncs meaning data is secured on non-volatile storage when the function returns.

--
Andrew
Thanks for all the input. It seems information on the performance of the ZIL is sparse and scattered. I've spent significant time researching this the past day. I'll summarize what I've found. Please correct me if I'm wrong.

- The ZIL can have any number of SSDs attached, either mirrored or individually. ZFS will stripe across these in a raid0 or raid10 fashion depending on how you configure them.

- To determine the true maximum streaming performance of the ZIL, setting sync=disabled will only use the in-RAM ZIL. This gives up power protection for synchronous writes.

- Many SSDs do not help protect against power failure because they have their own RAM cache for writes. This effectively makes the SSD useless for this purpose and potentially introduces a false sense of security. (These SSDs are fine for L2ARC.)

- Mirroring SSDs is only helpful if one SSD fails at the time of a power failure. This leaves several unanswered questions: How good is ZFS at detecting that an SSD is no longer a reliable write target? The chance of silent data corruption is well documented for spinning disks; what chance of data corruption does this introduce, with up to 10 seconds of data written on the SSD? Does ZFS read the ZIL during a scrub to determine if our SSD is returning what we write to it?

- Zpool versions 19 and higher should be able to survive a ZIL failure, only losing the uncommitted data. However, I haven't seen good enough information that I would necessarily trust this yet.

- Several threads seem to suggest a ZIL throughput limit of 1 Gb/s with SSDs. I'm not sure if that is current, but I can't find any reports of better performance. I would suspect that a DDR drive or Zeus RAM as ZIL would push past this.

Anyone care to post their performance numbers on current hardware with E5 processors and RAM-based ZIL solutions?

Thanks to everyone who has responded and contacted me directly on this issue.

-Chip

On Thu, Oct 4, 2012 at 3:03 AM, Andrew Gabriel <andrew.gabriel at cucumber.demon.co.uk> wrote:
> Noting of course that this means that in the case of an unexpected system
> outage or loss of connectivity to the disks, synchronous writes since the
> last txg commit will be lost, even though the applications will believe
> they are secured to disk. ...
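One of those bullets (ZIL failure survival at pool version 19 and later) can be checked directly on the running pool. A sketch, again using tank as a placeholder pool name:

    # The claim in the thread applies to pool version 19 and later; check
    # what this pool is running and what each version adds.
    zpool get version tank
    zpool upgrade -v

    # Show how the log vdevs are currently laid out (mirrored or not).
    zpool status tank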
Edward Ned Harvey (opensolarisisdeadlongliveopensolaris)
2012-Oct-04 11:40 UTC
[zfs-discuss] Making ZIL faster
> From: Andrew Gabriel [mailto:andrew.gabriel at cucumber.demon.co.uk]
>
> > Temporarily set sync=disabled
> > Or, depending on your application, leave it that way permanently. I know,
> > for the work I do, most systems I support at most locations have
> > sync=disabled. It all depends on the workload.
>
> Noting of course that this means that in the case of an unexpected system
> outage or loss of connectivity to the disks, synchronous writes since the last
> txg commit will be lost, even though the applications will believe they are
> secured to disk. (ZFS filesystem won't be corrupted, but it will look like it's
> been wound back by up to 30 seconds when you reboot.)
>
> This is fine for some workloads, such as those where you would start again
> with fresh data

It's fine for any load where you don't have clients keeping track of your state.

Examples where it's not fine:

You're processing credit card transactions. You just finished processing a transaction, the system crashes, and you forget about it. Not fine, because systems external to yourself are aware of state that is in the future of your state, and you aren't aware of it.

You're an NFS server. Some clients write some files, you say they're written, you crash, and forget about it. Now you reboot, start serving NFS again, and the client still has a file handle for something it thinks exists ... but according to you in your new state, it doesn't exist.

You're a database server, and your clients are external to yourself. They do transactions against you, you say they're complete, and you forget about it.

But it's ok when:

You're an NFS server, and you're configured to NOT restart NFS automatically upon reboot. In the event of an ungraceful crash, admin intervention is required, and the admin is aware that he needs to reboot the NFS clients before starting the NFS services again.

You're a database server, and your clients are all inside yourself, either as VMs or services of various kinds.

You're a VM server, and all of your VMs are desktop clients, like a Windows 7 machine for example. None of your guests are servers in and of themselves maintaining state with external entities (such as processing credit card transactions, serving a database, or serving files). By mere virtue of the fact that you crash ungracefully, your guests also crash ungracefully. You all reboot, rewind a few seconds, no problem.
Edward Ned Harvey (opensolarisisdeadlongliveopensolaris)
2012-Oct-04 11:57 UTC
[zfs-discuss] Making ZIL faster
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Schweiss, Chip
>
> - The ZIL can have any number of SSDs attached, either mirrored or
> individually. ZFS will stripe across these in a raid0 or raid10 fashion
> depending on how you configure them.

I'm regurgitating something somebody else said - but I don't know where. I believe multiple ZIL devices don't get striped. They get round-robin'd. This means your ZIL can absolutely become a bottleneck, if you're doing sustained high-throughput (not high-IOPS) sync writes. But the way to prevent that bottleneck is by tuning the ... I don't know the names of the parameters. Some parameters that indicate "a sync write larger than X should skip the ZIL and go directly to pool."

> - To determine the true maximum streaming performance of the ZIL, setting
> sync=disabled will only use the in-RAM ZIL. This gives up power protection
> for synchronous writes.

There is no RAM ZIL. The basic idea behind the ZIL is this: some applications simply tell the system to "write," the system buffers these writes in memory, and the application continues processing. But some applications do not want the OS to buffer writes, so they issue writes in "sync" mode. These applications issue the write command and block there until the OS says it's written to nonvolatile storage. In ZFS, this means the transaction gets written to the ZIL, and then it gets put into the memory buffer just like any other write. Upon reboot, when the filesystem is mounting, ZFS will always look in the ZIL to see if there are any transactions that have not yet been played to disk.

So, when you set sync=disabled, you're just bypassing that step. You're lying to the applications: if they say "I want to know when this is written to disk," you just immediately say "Yup, it's done" unconditionally. This is the highest-performance thing you could possibly do - but depending on your system workload, it could put you at risk for data loss.

> - Mirroring SSDs is only helpful if one SSD fails at the time of a power
> failure. This leaves several unanswered questions: How good is ZFS at
> detecting that an SSD is no longer a reliable write target? What chance of
> data corruption does this introduce with up to 10 seconds of data written
> on the SSD? Does ZFS read the ZIL during a scrub to determine if our SSD is
> returning what we write to it?

Not just power loss -- any ungraceful crash.

ZFS doesn't have any way to scrub ZIL devices, so it's not very good at detecting failed ZIL devices. There is definitely the possibility for an SSD to enter a failure mode where you write to it, it doesn't complain, but you wouldn't be able to read it back if you tried. Also, upon an ungraceful crash, even if you try to read that data and fail to get it back, there's no way to know that you should have expected something. So you still don't detect the failure.

If you want to maintain your SSD periodically, you should do something like: remove it as a ZIL device, create a new pool with just this disk in it, write a bunch of random data to the new junk pool, scrub the pool, then destroy the junk pool and return the disk as a ZIL device to the main pool. This does not guarantee anything - but then - nothing anywhere guarantees anything. This is a good practice, and it definitely puts you into a territory of reliability better than the competing alternatives.

> - Zpool versions 19 and higher should be able to survive a ZIL failure, only
> losing the uncommitted data. However, I haven't seen good enough
> information that I would necessarily trust this yet.

That was a very long time ago. (What, 2-3 years?) It's very solid now.

> - Several threads seem to suggest a ZIL throughput limit of 1 Gb/s with
> SSDs. I'm not sure if that is current, but I can't find any reports of better
> performance. I would suspect that a DDR drive or Zeus RAM as ZIL would push
> past this.

Whenever I measure the sustainable throughput of an SSD, HDD, DDRDrive, or anything else, I find very few devices can actually sustain faster than 1Gb/s, for use as a ZIL or anything else. Published specs are often higher, but not realistic. If you are ZIL bandwidth limited, you should consider tuning the size of stuff that goes to ZIL.
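A sketch of that maintenance cycle, assuming a standalone (unmirrored) log device named c4t2d0 and a pool named tank (both placeholder names, not from the thread):

    # Pull the SSD out of the main pool (log devices can be removed).
    zpool remove tank c4t2d0

    # Build a throwaway pool on it, write a few GB of data, then scrub so
    # ZFS reads back and checksums everything that was just written.
    zpool create junk c4t2d0
    dd if=/dev/urandom of=/junk/testfile bs=1024k count=4096
    zpool scrub junk
    zpool status -v junk      # wait for the scrub to finish; checksum errors show here

    # Tear the test pool down and give the SSD back to the main pool as a slog.
    zpool destroy junk
    zpool add tank log c4t2d0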
On 10/04/12 05:30, Schweiss, Chip wrote:
> Thanks for all the input. It seems information on the performance of the ZIL
> is sparse and scattered. I've spent significant time researching this the past
> day. I'll summarize what I've found. Please correct me if I'm wrong.
>
> - The ZIL can have any number of SSDs attached, either mirrored or individually.
> ZFS will stripe across these in a raid0 or raid10 fashion depending on how you
> configure them.

The ZIL code chains blocks together and these are allocated round robin among slogs or, if they don't exist, then the main pool devices.

> - To determine the true maximum streaming performance of the ZIL, setting
> sync=disabled will only use the in-RAM ZIL. This gives up power protection
> for synchronous writes.

There is no RAM ZIL. If sync=disabled then all writes are asynchronous and are written as part of the periodic ZFS transaction group (txg) commit that occurs every 5 seconds.

> - Many SSDs do not help protect against power failure because they have their
> own RAM cache for writes. This effectively makes the SSD useless for this
> purpose and potentially introduces a false sense of security. (These SSDs are
> fine for L2ARC.)

The ZIL code issues a write cache flush to all devices it has written before returning from the system call. I've heard that not all devices obey the flush, but we consider them as broken hardware. I don't have a list to avoid.

> - Mirroring SSDs is only helpful if one SSD fails at the time of a power
> failure. This leaves several unanswered questions: How good is ZFS at detecting
> that an SSD is no longer a reliable write target? What chance of data corruption
> does this introduce with up to 10 seconds of data written on the SSD? Does ZFS
> read the ZIL during a scrub to determine if our SSD is returning what we write
> to it?

If the ZIL code gets a block write failure it will force the txg to commit before returning. It will depend on the drivers and IO subsystem as to how hard it tries to write the block.

> - Zpool versions 19 and higher should be able to survive a ZIL failure, only
> losing the uncommitted data. However, I haven't seen good enough information
> that I would necessarily trust this yet.

This has been available for quite a while and I haven't heard of any bugs in this area.

> - Several threads seem to suggest a ZIL throughput limit of 1 Gb/s with SSDs.
> I'm not sure if that is current, but I can't find any reports of better
> performance. I would suspect that a DDR drive or Zeus RAM as ZIL would push
> past this.

1GB/s seems very high, but I don't have any numbers to share.
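For reference, the 5-second txg interval mentioned above corresponds to the zfs_txg_timeout kernel tunable on illumos/Solaris-derived systems; a quick way to read it from a live kernel (a sketch based on that assumption, not taken from the thread):

    # Print the current txg commit interval, in seconds.
    echo zfs_txg_timeout/D | mdb -k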
Thanks Neil, we always appreciate your comments on ZIL implementation. One additional comment below...

On Oct 4, 2012, at 8:31 AM, Neil Perrin <neil.perrin at oracle.com> wrote:

> On 10/04/12 05:30, Schweiss, Chip wrote:
>> - Several threads seem to suggest a ZIL throughput limit of 1 Gb/s with SSDs.
>> I'm not sure if that is current, but I can't find any reports of better
>> performance. I would suspect that a DDR drive or Zeus RAM as ZIL would push
>> past this.
>
> 1GB/s seems very high, but I don't have any numbers to share.

It is not unusual for workloads to exceed the performance of a single device. For example, if you have a device that can achieve 700 MB/sec, but a workload generated by lots of clients accessing the server via 10GbE (1 GB/sec), then it should be immediately obvious that the slog needs to be striped. Empirically, this is also easy to measure.
-- richard
Again, thanks for the input and clarifications.

I would like to clarify the numbers I was talking about with the ZIL performance specs I was seeing discussed on other forums. Right now I'm getting streaming performance of sync writes at about 1 Gbit/s. My target is closer to 10 Gbit/s. If I get to build this system, it will house a decent-sized VMware NFS store with 200+ VMs, which will be dual connected via 10GbE. This is all medical imaging research. We move data around by the TB and fast streaming is imperative.

The system I've been testing with is 10GbE connected and I have about 50 VMs running very happily, and I haven't yet found my random I/O limit. However, every time I storage vMotion a handful of additional VMs, the ZIL seems to max out its write speed to the SSDs and random I/O also suffers. Without the SSD ZIL, random I/O is very poor. I will be doing some testing with sync=disabled tomorrow and see how things perform.

If anyone can testify to a ZIL device (or devices) that can keep up with 10GbE or more of streaming synchronous writes, please let me know.

-Chip

On Thu, Oct 4, 2012 at 1:33 PM, Richard Elling <richard.elling at gmail.com> wrote:
> It is not unusual for workloads to exceed the performance of a single device.
> For example, if you have a device that can achieve 700 MB/sec, but a workload
> generated by lots of clients accessing the server via 10GbE (1 GB/sec), then it
> should be immediately obvious that the slog needs to be striped. Empirically,
> this is also easy to measure.
> -- richard
Edward Ned Harvey (opensolarisisdeadlongliveopensolaris)
2012-Oct-04 21:57 UTC
[zfs-discuss] Making ZIL faster
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Schweiss, Chip
>
> If I get to build this system, it will house a decent-sized VMware
> NFS store with 200+ VMs, which will be dual connected via 10GbE. This is all
> medical imaging research. We move data around by the TB and fast
> streaming is imperative.

This might not carry over to vmware, iscsi vs nfs. But with virtualbox, using a local file versus using a local zvol, I have found the zvol is much faster for the guest OS. Also, by default the zvol will have smarter reservations (refreservation), which seems to be a good thing.
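A sketch of the zvol approach, with tank/guest1 as a placeholder name (zfs create -V and the refreservation property are standard; the size and name are assumptions):

    # A 40 GB zvol for a guest; a non-sparse zvol automatically gets a
    # refreservation sized to cover the whole volume.
    zfs create -V 40G tank/guest1
    zfs get volsize,refreservation tank/guest1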
Edward Ned Harvey (opensolarisisdeadlongliveopensolaris)
2012-Oct-04 21:59 UTC
[zfs-discuss] Making ZIL faster
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Neil Perrin
>
> The ZIL code chains blocks together and these are allocated round robin
> among slogs or, if they don't exist, then the main pool devices.

So, if somebody is doing sync writes as fast as possible, would they gain more bandwidth by adding multiple slog devices?
On Oct 4, 2012, at 1:33 PM, "Schweiss, Chip" <chip at innovates.com> wrote:

> Again, thanks for the input and clarifications.
> ...
> If anyone can testify to a ZIL device (or devices) that can keep up with 10GbE
> or more of streaming synchronous writes, please let me know.

Quick datapoint: with qty 3 ZeusRAMs as a striped slog, we could push 1.3 GBytes/sec of storage vMotion on a relatively modest system. To sustain that sort of thing often requires full system-level tuning and proper systems engineering design. Fortunately, people tend to not do storage vMotion on a continuous basis.
-- richard
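A sketch of what a striped slog like that looks like at the pool level, using placeholder pool and device names (the thread does not give the actual layout):

    # Devices listed after "log" without the mirror keyword become separate
    # log vdevs, and the ZIL round-robins its block allocations across them.
    zpool add tank log c5t0d0 c5t1d0 c5t2d0

    # Striped mirrored pairs are also possible if redundancy is wanted:
    # zpool add tank log mirror c5t0d0 c5t1d0 mirror c5t2d0 c5t3d0

    zpool status tank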
On 10/04/12 15:59, Edward Ned Harvey (opensolarisisdeadlongliveopensolaris) wrote:
>> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
>> bounces at opensolaris.org] On Behalf Of Neil Perrin
>>
>> The ZIL code chains blocks together and these are allocated round robin
>> among slogs or, if they don't exist, then the main pool devices.
>
> So, if somebody is doing sync writes as fast as possible, would they gain more
> bandwidth by adding multiple slog devices?

In general - yes, but it really depends. Multiple synchronous writes of any size across multiple file systems will fan out across the log devices. That is because there is a separate independent log chain for each file system.

Also, large synchronous writes (e.g. 1MB) within a specific file system will be spread out. The ZIL code will try to allocate a block to hold all the records it needs to commit, up to the largest block size - which currently for you should be 128KB. Anything larger will allocate a new block - on a different device if there are multiple devices.

However, lots of small synchronous writes to the same file system might not use more than one 128K block, and so might not benefit from multiple slog devices.

Neil.
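Given the per-filesystem log chains described above, one practical consequence is that splitting a large NFS datastore into several datasets lets sync commits fan out across multiple slogs. A sketch with placeholder dataset names (the NFS sharing step assumes the stock sharenfs property; adjust options for a real VMware datastore):

    # Each file system gets its own independent ZIL chain, so commits from
    # different datastores can land on different log devices in parallel.
    zfs create tank/vmfs01
    zfs create tank/vmfs02
    zfs set sharenfs=on tank/vmfs01
    zfs set sharenfs=on tank/vmfs02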
Edward Ned Harvey (opensolarisisdeadlongliveopensolaris)
2012-Oct-05 13:50 UTC
[zfs-discuss] Making ZIL faster
> From: Neil Perrin [mailto:neil.perrin at oracle.com]
>
> In general - yes, but it really depends. Multiple synchronous writes of any
> size across multiple file systems will fan out across the log devices. That is
> because there is a separate independent log chain for each file system.
>
> Also, large synchronous writes (e.g. 1MB) within a specific file system will be
> spread out. The ZIL code will try to allocate a block to hold all the records it
> needs to commit, up to the largest block size - which currently for you should
> be 128KB. Anything larger will allocate a new block - on a different device if
> there are multiple devices.
>
> However, lots of small synchronous writes to the same file system might not
> use more than one 128K block, and so might not benefit from multiple slog
> devices.

That is an awesome explanation. Thank you.