I need some help with clarification.

My understanding is that there are 2 instances in which ZFS will write to disk:
1) TXG Sync
2) ZIL

Post-snv_87 a TXG should sync out when the TXG is either overfilled or hits the timeout of 30 seconds.

First question is... is there some place I can see what this max TXG size is? If I recall it's 1/8th of system memory... but there has to be a counter somewhere, right?

I'm unclear on ZIL writes. I think that they happen independently of the normal TXG rotation, but I'm not sure.

So the second question is: do they happen with a TXG sync (expedited) or independent of the normal TXG sync flow?

Finally, I'm unclear on exactly what constitutes a TXG stall. I had assumed that it indicated TXGs that exceeded the allotted time, but after some dtracing I'm uncertain.

Any help is appreciated.

benr.

--
Ben Rockwood
cuddletech.com
Joyent Inc.
PGP: 0xC823A182 @ pgp.mit.edu
"...even at night his mind does not rest. This too is meaningless." - Ecclesiastes 2:23
I'll answer the ZIL question:

On 02/05/09 03:38, Ben Rockwood wrote:
> I need some help with clarification.
>
> My understanding is that there are 2 instances in which ZFS will write
> to disk:
> 1) TXG Sync
> 2) ZIL
>
> Post-snv_87 a TXG should sync out when the TXG is either overfilled or
> hits the timeout of 30 seconds.
>
> First question is... is there some place I can see what this max TXG
> size is? If I recall it's 1/8th of system memory... but there has to be
> a counter somewhere, right?
>
> I'm unclear on ZIL writes. I think that they happen independently of
> the normal TXG rotation, but I'm not sure.
>
> So the second question is: do they happen with a TXG sync (expedited) or
> independent of the normal TXG sync flow?

They are independent of the normal TXG flow. It would be too slow to force a txg commit for every synchronous write. In rare error situations (like a failure to allocate an intent log block) the ZIL will wait for the txg to commit. It has no choice, as it cannot return to the application until the data is stable.

> Finally, I'm unclear on exactly what constitutes a TXG stall. I had
> assumed that it indicated TXGs that exceeded the allotted time, but
> after some dtracing I'm uncertain.
>
> Any help is appreciated.
>
> benr.
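For anyone who wants to see the two write paths side by side, a minimal DTrace sketch along these lines can help. This is only an illustration, and it assumes fbt probes are available for zil_commit() and spa_sync() on your build; probe points can shift between releases.

#!/usr/sbin/dtrace -s
/*
 * Count ZIL commits (the synchronous write path) and spa_sync() calls
 * (the normal txg commit path) once per second, to show that the two
 * fire independently of each other.
 */
fbt::zil_commit:entry
{
        @writes["zil_commit (sync write path)"] = count();
}

fbt::spa_sync:entry
{
        @writes["spa_sync (txg commit)"] = count();
}

tick-1sec
{
        printa(@writes);
        trunc(@writes);
}

Running that while pushing fsync-heavy traffic should show zil_commit() counts climbing in between txg commits, which is the independence described above.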
Ben Rockwood wrote:
> I need some help with clarification.
>
> My understanding is that there are 2 instances in which ZFS will write
> to disk:
> 1) TXG Sync
> 2) ZIL
>
> Post-snv_87 a TXG should sync out when the TXG is either overfilled or
> hits the timeout of 30 seconds.
>
> First question is... is there some place I can see what this max TXG
> size is? If I recall it's 1/8th of system memory... but there has to be
> a counter somewhere, right?

There is both a memory throttle limit (enforced in arc_memory_throttle()) and a write throughput throttle limit (calculated in dsl_pool_sync(), enforced in dsl_pool_tempreserve_space()). The write limit is stored as the 'dp_write_limit' for each pool.

> I'm unclear on ZIL writes. I think that they happen independently of
> the normal TXG rotation, but I'm not sure.
>
> So the second question is: do they happen with a TXG sync (expedited) or
> independent of the normal TXG sync flow?
>
> Finally, I'm unclear on exactly what constitutes a TXG stall. I had
> assumed that it indicated TXGs that exceeded the allotted time, but
> after some dtracing I'm uncertain.

I'm not certain what you mean by "TXG stall".

> Any help is appreciated.
>
> benr.
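As a rough way to see that write throttle in action, something like the following might work. This is a sketch only: it assumes dsl_pool_tempreserve_space() reports throttling through a nonzero return value (ERESTART), which is how it appears to behave in recent builds; verify against the source for your release.

# dtrace -n '
fbt::dsl_pool_tempreserve_space:return
/arg1 != 0/
{
        /* nonzero return here means a tx reservation was pushed back */
        @throttle["tempreserve rejections (write throttle)"] = count();
}
tick-10sec
{
        printa(@throttle);
        trunc(@throttle);
}'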
Mark Maybee wrote:
> Ben Rockwood wrote:
>> I need some help with clarification.
>>
>> My understanding is that there are 2 instances in which ZFS will write
>> to disk:
>> 1) TXG Sync
>> 2) ZIL
>>
>> Post-snv_87 a TXG should sync out when the TXG is either overfilled or
>> hits the timeout of 30 seconds.
>>
>> First question is... is there some place I can see what this max TXG
>> size is? If I recall it's 1/8th of system memory... but there has to be
>> a counter somewhere, right?
>
> There is both a memory throttle limit (enforced in arc_memory_throttle())
> and a write throughput throttle limit (calculated in dsl_pool_sync(),
> enforced in dsl_pool_tempreserve_space()). The write limit is stored as
> the 'dp_write_limit' for each pool.

I cooked up the following:

$ dtrace -qn fbt::dsl_pool_sync:entry'{ printf("Throughput is \t %d\n write limit is\t %d\n\n", args[0]->dp_throughput, args[0]->dp_write_limit); }'
Throughput is    883975129
write limit is   3211748352

I'm confused with regard to the units and interpretation.

For instance, the write limit here is almost 3GB on a system with 4GB of RAM. However, if I read the code right the value here is already inflated 6x... so the real write limit is actually about 510MB, right?

As for the throughput, I need verification... I think the unit here is bytes per second?

>> I'm unclear on ZIL writes. I think that they happen independently of
>> the normal TXG rotation, but I'm not sure.
>>
>> So the second question is: do they happen with a TXG sync (expedited) or
>> independent of the normal TXG sync flow?
>>
>> Finally, I'm unclear on exactly what constitutes a TXG stall. I had
>> assumed that it indicated TXGs that exceeded the allotted time, but
>> after some dtracing I'm uncertain.
>
> I'm not certain what you mean by "TXG stall".

I refer to the following code, which I'm having some trouble properly understanding:

boolean_t
txg_stalled(dsl_pool_t *dp)
{
        tx_state_t *tx = &dp->dp_tx;
        return (tx->tx_quiesce_txg_waiting > tx->tx_open_txg);
}

Ultimately, what this all comes down to is finding a reliable way to determine when ZFS is struggling. I'm currently watching (on pre-snv_87 systems) txg sync times, and if they exceed roughly 4 seconds per sync I know there is trouble brewing. I'm considering whether watching either txg_stalled or txg_delay may be a better way to flag trouble.

dp_throughput looks like it also might be a good candidate, although it was only added in snv_98, unfortunately, so it doesn't help a lot of my existing installs. Nevertheless, graphing this value could be very telling and would be nice to have available as a kstat.

The intended result is to have a reliable means of monitoring ZFS health via a standard monitoring framework (Nagios, Zabbix, or something similar)... and from my studies, simply watching traditional values via iostat isn't the best method. If the ZIL is either disabled or pushing to a SLOG, then watching the breathing of TXG syncs should be all that's really important to me, at least on the write side... that's my theory anyway. Feel free to flog me. :)

Thank you very much for your help Mark!

benr.
Ben Rockwood wrote:
> Mark Maybee wrote:
>> Ben Rockwood wrote:
>>> I need some help with clarification.
>>>
>>> My understanding is that there are 2 instances in which ZFS will write
>>> to disk:
>>> 1) TXG Sync
>>> 2) ZIL
>>>
>>> Post-snv_87 a TXG should sync out when the TXG is either overfilled or
>>> hits the timeout of 30 seconds.
>>>
>>> First question is... is there some place I can see what this max TXG
>>> size is? If I recall it's 1/8th of system memory... but there has to be
>>> a counter somewhere, right?
>>
>> There is both a memory throttle limit (enforced in arc_memory_throttle())
>> and a write throughput throttle limit (calculated in dsl_pool_sync(),
>> enforced in dsl_pool_tempreserve_space()). The write limit is stored as
>> the 'dp_write_limit' for each pool.
>
> I cooked up the following:
>
> $ dtrace -qn fbt::dsl_pool_sync:entry'{ printf("Throughput is \t %d\n write limit is\t %d\n\n", args[0]->dp_throughput, args[0]->dp_write_limit); }'
> Throughput is    883975129
> write limit is   3211748352
>
> I'm confused with regard to the units and interpretation.
>
> For instance, the write limit here is almost 3GB on a system with 4GB of
> RAM. However, if I read the code right the value here is already
> inflated 6x... so the real write limit is actually about 510MB, right?

The write_limit is independent of the memory size. It's based purely on the IO bandwidth available to the pool. So a write_limit of 3GB implies that we think we can push 3GB of (inflated) data in 5 seconds to the drives. If we take out the inflation, this means we think we can push 100MB/s to the pool's drives.

> As for the throughput, I need verification... I think the unit here is
> bytes per second?

Correct.

>>> I'm unclear on ZIL writes. I think that they happen independently of
>>> the normal TXG rotation, but I'm not sure.
>>>
>>> So the second question is: do they happen with a TXG sync (expedited) or
>>> independent of the normal TXG sync flow?
>>>
>>> Finally, I'm unclear on exactly what constitutes a TXG stall. I had
>>> assumed that it indicated TXGs that exceeded the allotted time, but
>>> after some dtracing I'm uncertain.
>>
>> I'm not certain what you mean by "TXG stall".
>
> I refer to the following code, which I'm having some trouble properly
> understanding:
>
> boolean_t
> txg_stalled(dsl_pool_t *dp)
> {
>         tx_state_t *tx = &dp->dp_tx;
>         return (tx->tx_quiesce_txg_waiting > tx->tx_open_txg);
> }

Ah. A "stall" in this context means that the sync phase is idle, waiting for the next txg to quiesce... so the current train is "stalled" until the quiesce finishes.

> Ultimately, what this all comes down to is finding a reliable way to
> determine when ZFS is struggling. I'm currently watching (on pre-snv_87
> systems) txg sync times, and if they exceed roughly 4 seconds per sync I
> know there is trouble brewing. I'm considering whether watching either
> txg_stalled or txg_delay may be a better way to flag trouble.

Stalled tends to mean that there is something happening "up top" preventing things from moving (i.e., a tx not closing). Delay is used when we are trying to push more data than the pool can handle, so that may be what you want to look at.

> dp_throughput looks like it also might be a good candidate, although it
> was only added in snv_98, unfortunately, so it doesn't help a lot of my
> existing installs. Nevertheless, graphing this value could be very
> telling and would be nice to have available as a kstat.
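If it is useful, here is a rough sketch for watching both of the conditions distinguished above. It assumes fbt probes exist for txg_delay() and txg_stalled() on your build; small functions like these are occasionally inlined, in which case the probes will not be present.

#!/usr/sbin/dtrace -s
/*
 * txg_delay() firing means writers are being held back because the pool
 * cannot absorb the incoming data; txg_stalled() returning nonzero means
 * the sync thread is idle, waiting for the open txg to quiesce.
 */
fbt::txg_delay:entry
{
        @events["txg_delay (writers throttled)"] = count();
}

fbt::txg_stalled:return
/arg1 != 0/
{
        @events["txg_stalled (sync waiting on quiesce)"] = count();
}

tick-10sec
{
        printa(@events);
        trunc(@events);
}

Per the comment above, a steady stream of txg_delay hits is probably the better "trouble brewing" signal for a monitoring check.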
Agreed.

> The intended result is to have a reliable means of monitoring ZFS health
> via a standard monitoring framework (Nagios, Zabbix, or something
> similar)... and from my studies, simply watching traditional values via
> iostat isn't the best method. If the ZIL is either disabled or pushing to
> a SLOG, then watching the breathing of TXG syncs should be all that's
> really important to me, at least on the write side... that's my theory
> anyway. Feel free to flog me. :)

Nope, that makes sense to me. Ideally, we should either be chugging along at 5-to-30 second intervals between syncs with no delays (i.e., a light IO load), or we should be doing consistent 5s syncs with a few delays seen (max capacity). If you start seeing lots of delays, you are probably trying to push too much data.

Note that we are still tuning this code. Recently we discovered that we may want to change the throughput calculation (we currently don't include the dsl_dataset_sync() calls in the calc... we may want to change that) to include more of the IO "setup" time.

> Thank you very much for your help Mark!
>
> benr.
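To watch the sync cadence described above, a sketch like this times each spa_sync() call per pool. It assumes spa_sync() is fbt-traceable and that spa_name is the embedded pool-name field of spa_t on your build.

#!/usr/sbin/dtrace -s
/*
 * Distribution of txg sync times, in milliseconds, per pool.
 * Healthy: syncs well under the 5s target, arriving every 5-30 seconds.
 * Suspect: syncs that routinely run to several seconds or more.
 */
fbt::spa_sync:entry
{
        self->spa = args[0];
        self->start = timestamp;
}

fbt::spa_sync:return
/self->start/
{
        @sync_ms[stringof(self->spa->spa_name)] =
            quantize((timestamp - self->start) / 1000000);
        self->spa = 0;
        self->start = 0;
}

tick-60sec
{
        printa(@sync_ms);
        trunc(@sync_ms);
}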
Mark Maybee wrote:
> The write_limit is independent of the memory size. It's based purely
> on the IO bandwidth available to the pool. So a write_limit of 3GB
> implies that we think we can push 3GB of (inflated) data in 5
> seconds to the drives. If we take out the inflation, this means
> we think we can push 100MB/s to the pool's drives.

Ah, I see... 3GB / 5 secs = 600MB/s, de-inflate by dividing by 6 to get 100MB/s. Thank you.

>> The intended result is to have a reliable means of monitoring ZFS health
>> via a standard monitoring framework (Nagios, Zabbix, or something
>> similar)... and from my studies, simply watching traditional values via
>> iostat isn't the best method. If the ZIL is either disabled or pushing to
>> a SLOG, then watching the breathing of TXG syncs should be all that's
>> really important to me, at least on the write side... that's my theory
>> anyway. Feel free to flog me. :)
>
> Nope, that makes sense to me. Ideally, we should either be chugging
> along at 5-to-30 second intervals between syncs with no delays (i.e.,
> a light IO load), or we should be doing consistent 5s syncs with a
> few delays seen (max capacity). If you start seeing lots of delays,
> you are probably trying to push too much data.
>
> Note that we are still tuning this code. Recently we discovered that
> we may want to change the throughput calculation (we currently don't
> include the dsl_dataset_sync() calls in the calc... we may want to
> change that) to include more of the IO "setup" time.

I'm certain that there is more than enough on everyone's plate at the moment, but I would be very happy to see purpose-built kstats for monitoring in the future. The more time I spend in the ZFS code, the more I realize that iostat and even VFS-level calculations are misleading at best. Standardizing on the "right numbers"(tm) to watch would be a great benefit to the user base. Comforting and consoling our customers who are reacting to numbers they don't understand is becoming ever more time consuming. :)

Thank you for your help Mark, this clears up several things for me!

benr.
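For completeness, the same arithmetic can be printed directly from the earlier dsl_pool_sync probe. Same caveats as above: the 6x inflation factor and the 5-second sync target are implementation details that may change between builds.

# dtrace -qn 'fbt::dsl_pool_sync:entry
{
        /* de-inflate the write limit and spread it over the 5s target */
        printf("write limit: %d MB inflated, ~%d MB/s de-inflated\n",
            args[0]->dp_write_limit / 1048576,
            args[0]->dp_write_limit / 6 / 5 / 1048576);
}'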