Hi,

I know it's been discussed here more than once, and I read the Evil Tuning Guide, but I didn't find a definitive statement:

There is absolutely no sense in having slog devices larger than main memory, because the extra space will never be used, right? ZFS will rather flush the txg to disk than read back from the ZIL?

So there is a guideline to have enough slog to hold about 10 seconds of ZIL traffic, but the absolute maximum value is the size of main memory. Is this correct?

Thanks,
Arne
On Mon, Jun 14, 2010 at 4:41 AM, Arne Jansen <sensille at gmx.net> wrote:
> There is absolutely no sense in having slog devices larger than
> main memory, because it will never be used, right?
> ZFS will rather flush the txg to disk than reading back from
> zil?
> So there is a guideline to have enough slog to hold about 10
> seconds of zil, but the absolute maximum value is the size of
> main memory. Is this correct?

I thought it was half the size of memory.
> There is absolutely no sense in having slog devices larger than
> main memory, because it will never be used, right?
> ZFS will rather flush the txg to disk than reading back from
> zil? So there is a guideline to have enough slog to hold about 10
> seconds of zil, but the absolute maximum value is the size of
> main memory. Is this correct?

ZFS uses at most RAM/2 for the ZIL.

Best regards,
roy
--
Roy Sigurd Karlsbakk
(+47) 97542685
roy at karlsbakk.net
http://blogg.karlsbakk.net/
--
In all pedagogy it is essential that the curriculum be presented intelligibly. It is an elementary imperative for all pedagogues to avoid excessive use of idioms of foreign origin. In most cases, adequate and relevant synonyms exist in Norwegian.
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-bounces at opensolaris.org] On Behalf Of Arne Jansen
>
> There is absolutely no sense in having slog devices larger than
> main memory, because it will never be used, right?

Also: a TXG is guaranteed to flush within 30 seconds. Suppose you have a super-fast device that can log 8 Gbit/s (which is unrealistic). That's 1 GByte/s, the theoretical best case. You do the math. ;-)

That being said, it's difficult to buy an SSD smaller than 32 GB. So what are you going to do? Slice it and use the remaining space for cache? Some people do, and some may even get a performance benefit from it. But if you do, you now have a cache and a log competing for I/O on the same device, so the performance benefit certainly degrades.

My advice is to simply accept the wasted space on your log device, forget about it, and move on. It's the same thing you did with all the wasted space on your mirrored OS boot device, which can't (or shouldn't) be used by your data pool.
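Spelling out the "do the math" above: a minimal sketch, assuming the log fills at a constant rate for one full flush interval. The function name is made up for illustration, and the optional RAM/2 cap is only the figure claimed elsewhere in this thread, not something verified here.

```python
from typing import Optional


def slog_upper_bound_gb(log_rate_gbit_s: float,
                        flush_interval_s: float,
                        ram_gb: Optional[float] = None) -> float:
    """Rough ceiling on log data outstanding between txg flushes, in GB.

    Assumption: the log device is written at a constant rate for the whole
    flush interval. If ram_gb is given, the result is also capped at RAM/2,
    which is the limit claimed in this thread (hypothetical, unverified).
    """
    gbyte_per_s = log_rate_gbit_s / 8.0        # 8 bits per byte
    bound = gbyte_per_s * flush_interval_s     # rate x time
    if ram_gb is not None:
        bound = min(bound, ram_gb / 2.0)
    return bound


if __name__ == "__main__":
    print(slog_upper_bound_gb(8, 30))             # 30.0 (8 Gbit/s for 30 s)
    print(slog_upper_bound_gb(1, 30))             # 3.75 (a 1 Gbit/s link)
    print(slog_upper_bound_gb(8, 30, ram_gb=16))  # 8.0  (capped at RAM/2)
```

Even the unrealistic 8 Gbit/s case stays under the 32 GB of the smallest SSD mentioned above, and a 1 Gbit/s feed needs under 4 GB.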
Edward Ned Harvey wrote:
> Also: A TXG is guaranteed to flush within 30 sec. Let's suppose you have a
> super fast device, which is able to log 8Gbit/sec (which is unrealistic).
> That's 1Gbyte/sec, unrealistically theoretically possible, at best. You do
> the math. ;-)
>
> That being said, it's difficult to buy an SSD smaller than 32G. So what are
> you going to do?

I'm still building my driver for eliminating rotational write delay, and I'm trying to figure out how much space I can waste on the underlying device without ever running into problems. I need half the physical memory or, assuming the limit might be tunable, at most the full size of physical memory. It's good to know a hard upper limit. The more I can waste, the faster the device will be.

Also, to stay with your line of argument, this super-fast slog is most probably a DRAM-based, battery-backed solution. In that case it does make a difference whether you buy 8 or 32 GB ;)

--Arne
Roy Sigurd Karlsbakk wrote:
> ZFS uses at most RAM/2 for ZIL

Thanks!
On Mon, 14 Jun 2010, Roy Sigurd Karlsbakk wrote:
> ZFS uses at most RAM/2 for ZIL

It is good to keep in mind that only small writes go to the dedicated slog. Large writes go to the main store. A succession of enough small writes to fill RAM/2 is highly unlikely. Also keep in mind that the ZIL is not read back unless the system is improperly shut down.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
----- Original Message -----
> On Mon, 14 Jun 2010, Roy Sigurd Karlsbakk wrote:
> > ZFS uses at most RAM/2 for ZIL
>
> It is good to keep in mind that only small writes go to the dedicated
> slog. Large writes go to main store. A succession of that many small
> writes (to fill RAM/2) is highly unlikely. Also, that the zil is not
> read back unless the system is improperly shut down.

I thought all sync writes, meaning everything over NFS and iSCSI, went into the slog. IIRC the docs say so.

Best regards,
roy
--
Roy Sigurd Karlsbakk
roy at karlsbakk.net
On Mon, 14 Jun 2010, Roy Sigurd Karlsbakk wrote:
> I thought all sync writes, meaning everything NFS and iSCSI, went
> into the slog - IIRC the docs say so.

Check a month or two back in the archives for a post by Matt Ahrens. It seems that larger writes (>32k?) are written directly to the main store. This is probably a change from the original ZFS design.

Bob
On 06/14/10 12:29, Bob Friesenhahn wrote:
> On Mon, 14 Jun 2010, Roy Sigurd Karlsbakk wrote:
>> I thought all sync writes, meaning everything NFS and iSCSI, went
>> into the slog - IIRC the docs say so.
>
> Check a month or two back in the archives for a post by Matt Ahrens.
> It seems that larger writes (>32k?) are written directly to main
> store. This is probably a change from the original zfs design.

If there's a slog, then the data, regardless of size, gets written to the slog.

If there's no slog and the data size is greater than zfs_immediate_write_sz/zvol_immediate_write_sz (both default to 32K), then the data is written as a block into the pool and the block pointer is written into the log record. This is the WR_INDIRECT write type.

So Matt and Roy are both correct.

But wait, there's more complexity!

If logbias=throughput is set, we always use WR_INDIRECT.

If we just wrote more than 1MB for a single ZIL commit and there's more than 2MB waiting, then we start using the main pool.

Clear as mud? This is likely to change again...

Neil.
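For anyone trying to keep these rules straight, here is a rough sketch of them as a decision function. It is not the actual ZFS code: the function name, argument names, return labels, and the order in which the rules are checked are all assumptions made for illustration. Only the thresholds (32K, 1MB, 2MB) and the WR_INDIRECT name come from the post above.

```python
KIB = 1024
MIB = 1024 * KIB

# Default of zfs_immediate_write_sz / zvol_immediate_write_sz per the post.
IMMEDIATE_WRITE_SZ = 32 * KIB


def classify_sync_write(size: int,
                        has_slog: bool,
                        logbias: str = "latency",
                        commit_bytes: int = 0,
                        waiting_bytes: int = 0) -> str:
    """Return where a sync write's data lands, following the rules as posted.

    Labels (hypothetical, chosen for this sketch):
      "indirect" : WR_INDIRECT, data block in the pool, pointer in the log
      "slog"     : data written to the separate log device
      "pool_log" : data carried in a log block allocated from the main pool
    Assumption: rule precedence is logbias first, then slog, then size.
    """
    if logbias == "throughput":
        return "indirect"                  # always WR_INDIRECT
    if has_slog:
        # Heavy commit load: start using the main pool instead of the slog.
        if commit_bytes > 1 * MIB and waiting_bytes > 2 * MIB:
            return "pool_log"
        return "slog"                      # regardless of size
    if size > IMMEDIATE_WRITE_SZ:
        return "indirect"                  # no slog, large write
    return "pool_log"                      # no slog, small write


if __name__ == "__main__":
    print(classify_sync_write(128 * KIB, has_slog=True))    # slog
    print(classify_sync_write(128 * KIB, has_slog=False))   # indirect
    print(classify_sync_write(4 * KIB, has_slog=False))     # pool_log
```

The first two calls show why both readings in this thread hold: the same 128K write goes to the slog when one exists and straight to the pool when it does not.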
On 6/14/2010 12:10 PM, Neil Perrin wrote:
> If there's a slog then the data, regardless of size, gets written to
> the slog.
> [...]
> If we just wrote more than 1MB for a single zil commit and there's
> more than 2MB waiting then we start using the main pool.
>
> Clear as mud? This is likely to change again...

How do I monitor the amount of live (i.e. non-committed) data in the slog? I'd like to spend some time with my setup, seeing exactly how much I tend to use.

I'd suspect that very few use cases call for more than a couple (2-4) GB of slog...

I'm trying to get hard numbers because I'm working on building a DRAM/battery/flash slog device in one of my friend's electronics prototyping shops. It would be really nice if I could cover 99% of the need with one or two 2GB SODIMMs and the chips from a cheap 4GB USB thumb drive...
-- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA
On Jun 14, 2010, at 6:35 PM, Erik Trimble wrote:
> How do I monitor the amount of live (i.e. non-committed) data in the slog?
> I'd like to spend some time with my setup, seeing exactly how much I tend to use.

zilstat
http://www.richardelling.com/Home/scripts-and-programs-1/zilstat

> I'd suspect that very few use cases call for more than a couple (2-4) GB of slog...

I'd suspect few real cases need more than 1GB.
-- richard -- Richard Elling richard at nexenta.com +1-760-896-4422 ZFS and NexentaStor training, Rotterdam, July 13-15, 2010 http://nexenta-rotterdam.eventbrite.com/
On 06/14/10 19:35, Erik Trimble wrote:
> How do I monitor the amount of live (i.e. non-committed) data in the
> slog? I'd like to spend some time with my setup, seeing exactly how
> much I tend to use.

I think monitoring the capacity while running "zpool iostat -v <pool> 1" should be fairly accurate. A simple D script can be written to determine how often the ZIL code fails to get a slog block and has to fall back to allocating from the main pool.

One recent change reduced the amount of data written and possibly the slog block fragmentation. This is zpool version 23: "Slim ZIL". So be sure to experiment with that.

> I'd suspect that very few use cases call for more than a couple (2-4)
> GB of slog...

I agree this is typically true. Of course it depends on your workload. The amount of slog data will reflect the uncommitted synchronous txg data, and the size of each txg will depend on memory size. This area is also undergoing tuning.

> I'm trying to get hard numbers as I'm working on building a
> DRAM/battery/flash slog device in one of my friend's electronics
> prototyping shops. It would be really nice if I could solve 99% of
> the need with 1 or 2 2GB SODIMMs and the chips from a cheap 4GB USB
> thumb drive...

Sounds like fun. Good luck.

Neil.
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-bounces at opensolaris.org] On Behalf Of Bob Friesenhahn
>
> It is good to keep in mind that only small writes go to the dedicated
> slog. Large writes go to main store. A succession of that many small
> writes (to fill RAM/2) is highly unlikely. Also, that the zil is not
> read back unless the system is improperly shut down.

Can anyone verify this? I thought the decision for small vs. large sync writes to go to the log vs. the main store was determined by zfs_immediate_write_sz and logbias.

logbias was introduced in snv_122, which is zpool version 18 or 19. zfs_immediate_write_sz seems to have been around forever (I see comments about it as early as 2006).

Then again, I can't seem to find my zfs_immediate_write_sz via either zpool or zfs. Can anybody say which zpool version introduced zfs_immediate_write_sz, or am I perhaps using the wrong commands to try to see mine?

    zpool get all rpool | grep zfs_immediate_write_sz
    zfs get all rpool | grep zfs_immediate_write_sz

I thought that, if you didn't explicitly tune these, all sync writes go to the ZIL before the main store. I can't seem to find any way to verify this.
On Jun 15, 2010, at 8:13 PM, Edward Ned Harvey wrote:
> Then again, I can't seem to find my zfs_immediate_write_sz, via either zpool
> or zfs. Can anybody say what version zpool introduced
> zfs_immediate_write_sz, or perhaps I'm using the wrong commands to try and
> see mine? zpool get all rpool | grep zfs_immediate_write_sz ; zfs get all
> rpool | grep zfs_immediate_write_sz

It is an int, as in C, not a parameter tunable by zpool or zfs commands. For NFS service, it can be tuned by the client via wsize.

> I thought, if you didn't explicitly tune these, all sync writes go to ZIL
> before the main store. Can't seem to find any way to verify this.

Cake. All sync writes go to the ZIL. The ZIL may be in the pool or in the separate log device :-)

 -- richard
On Jun 15, 2010, at 8:51 PM, Richard Elling wrote:
> Cake. All sync writes go to the ZIL. The ZIL may be in the pool or in
> the separate log device :-)

"go to" may be too confusing. s/go to/are handled by/

 -- richard