I need some help with clarification.

My understanding is that there are 2 instances in which ZFS will write to disk:
1) TXG Sync
2) ZIL

Post-snv_87 a TXG should sync out when the TXG is either overfilled or hits the timeout of 30 seconds.

First question is... is there some place I can see what this max TXG size is? If I recall it's 1/8th of system memory... but there has to be a counter somewhere, right?

I'm unclear on ZIL writes. I think that they happen independently of the normal TXG rotation, but I'm not sure.

So the second question is: do they happen with a TXG sync (expedited) or independent of the normal TXG sync flow?

Finally, I'm unclear on exactly what constitutes a TXG stall. I had assumed that it indicated TXGs that exceeded the allotted time, but after some dtracing I'm uncertain.

Any help is appreciated.

benr.

--
Ben Rockwood
cuddletech.com
Joyent Inc.
PGP: 0xC823A182 @ pgp.mit.edu
"...even at night his mind does not rest. This too is meaningless." - Ecclesiastes 2:23
I'll answer the ZIL question:

On 02/05/09 03:38, Ben Rockwood wrote:
> I need some help with clarification.
>
> My understanding is that there are 2 instances in which ZFS will write
> to disk:
> 1) TXG Sync
> 2) ZIL
>
> Post-snv_87 a TXG should sync out when the TXG is either overfilled or
> hits the timeout of 30 seconds.
>
> First question is... is there some place I can see what this max TXG
> size is? If I recall it's 1/8th of system memory... but there has to be
> a counter somewhere, right?
>
> I'm unclear on ZIL writes. I think that they happen independently of
> the normal TXG rotation, but I'm not sure.
>
> So the second question is: do they happen with a TXG sync (expedited) or
> independent of the normal TXG sync flow?

They are independent of the normal TXG flow. It would be too slow to force a txg commit for every synchronous write. In rare error situations (like a failure to allocate an intent log block) the ZIL will wait for the txg to commit. It has no choice, as it cannot return to the application until the data is stable.

> Finally, I'm unclear on exactly what constitutes a TXG stall. I had
> assumed that it indicated TXGs that exceeded the allotted time, but
> after some dtracing I'm uncertain.
>
> Any help is appreciated.
>
> benr.
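For anyone who wants to see the two write paths side by side, a minimal DTrace sketch along these lines can help. This is only an illustration, and it assumes fbt probes are available for zil_commit() and spa_sync() on your build; probe points can shift between releases.

#!/usr/sbin/dtrace -s
/*
 * Count ZIL commits (the synchronous write path) and spa_sync() calls
 * (the normal txg commit path) once per second, to show that the two
 * fire independently of each other.
 */
fbt::zil_commit:entry
{
        @writes["zil_commit (sync write path)"] = count();
}

fbt::spa_sync:entry
{
        @writes["spa_sync (txg commit)"] = count();
}

tick-1sec
{
        printa(@writes);
        trunc(@writes);
}

Running that while pushing fsync-heavy traffic should show zil_commit() counts climbing in between txg commits, which is the independence described above.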
Ben Rockwood wrote:
> I need some help with clarification.
>
> My understanding is that there are 2 instances in which ZFS will write
> to disk:
> 1) TXG Sync
> 2) ZIL
>
> Post-snv_87 a TXG should sync out when the TXG is either overfilled or
> hits the timeout of 30 seconds.
>
> First question is... is there some place I can see what this max TXG
> size is? If I recall it's 1/8th of system memory... but there has to be
> a counter somewhere, right?

There is both a memory throttle limit (enforced in arc_memory_throttle()) and a write throughput throttle limit (calculated in dsl_pool_sync(), enforced in dsl_pool_tempreserve_space()). The write limit is stored as the 'dp_write_limit' for each pool.

> I'm unclear on ZIL writes. I think that they happen independently of
> the normal TXG rotation, but I'm not sure.
>
> So the second question is: do they happen with a TXG sync (expedited) or
> independent of the normal TXG sync flow?
>
> Finally, I'm unclear on exactly what constitutes a TXG stall. I had
> assumed that it indicated TXGs that exceeded the allotted time, but
> after some dtracing I'm uncertain.

I'm not certain what you mean by "TXG stall".

> Any help is appreciated.
>
> benr.
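As a rough way to see that write throttle in action, something like the following might work. This is a sketch only: it assumes dsl_pool_tempreserve_space() reports throttling through a nonzero return value (ERESTART), which is how it appears to behave in recent builds; verify against the source for your release.

# dtrace -n '
fbt::dsl_pool_tempreserve_space:return
/arg1 != 0/
{
        /* nonzero return here means a tx reservation was pushed back */
        @throttle["tempreserve rejections (write throttle)"] = count();
}
tick-10sec
{
        printa(@throttle);
        trunc(@throttle);
}'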
Mark Maybee wrote:
> Ben Rockwood wrote:
>> I need some help with clarification.
>>
>> My understanding is that there are 2 instances in which ZFS will write
>> to disk:
>> 1) TXG Sync
>> 2) ZIL
>>
>> Post-snv_87 a TXG should sync out when the TXG is either overfilled or
>> hits the timeout of 30 seconds.
>>
>> First question is... is there some place I can see what this max TXG
>> size is? If I recall it's 1/8th of system memory... but there has to be
>> a counter somewhere, right?
>
> There is both a memory throttle limit (enforced in arc_memory_throttle())
> and a write throughput throttle limit (calculated in dsl_pool_sync(),
> enforced in dsl_pool_tempreserve_space()). The write limit is stored as
> the 'dp_write_limit' for each pool.

I cooked up the following:

$ dtrace -qn fbt::dsl_pool_sync:entry'{ printf("Throughput is \t %d\n write limit is\t %d\n\n", args[0]->dp_throughput, args[0]->dp_write_limit); }'
Throughput is    883975129
write limit is   3211748352

I'm confused with regard to the units and interpretation.

For instance, the write limit here is almost 3GB on a system with 4GB of RAM. However, if I read the code right the value here is already inflated 6x... so the real write limit is actually about 510MB, right?

As for the throughput, I need verification... I think the unit here is bytes per second?

>> I'm unclear on ZIL writes. I think that they happen independently of
>> the normal TXG rotation, but I'm not sure.
>>
>> So the second question is: do they happen with a TXG sync (expedited) or
>> independent of the normal TXG sync flow?
>>
>> Finally, I'm unclear on exactly what constitutes a TXG stall. I had
>> assumed that it indicated TXGs that exceeded the allotted time, but
>> after some dtracing I'm uncertain.
>
> I'm not certain what you mean by "TXG stall".

I refer to the following code, which I'm having some trouble properly understanding:

boolean_t
txg_stalled(dsl_pool_t *dp)
{
        tx_state_t *tx = &dp->dp_tx;
        return (tx->tx_quiesce_txg_waiting > tx->tx_open_txg);
}

Ultimately, what this all comes down to is finding a reliable way to determine when ZFS is struggling. I'm currently watching (on pre-snv_87 systems) txg sync times, and if they exceed roughly 4 seconds per sync I know there is trouble brewing. I'm considering whether watching either txg_stalled or txg_delay may be a better way to flag trouble.

dp_throughput looks like it also might be a good candidate, although it was only added in snv_98, unfortunately, so it doesn't help a lot of my existing installs. Nevertheless, graphing this value could be very telling and would be nice to have available as a kstat.

The intended result is to have a reliable means of monitoring ZFS health via a standard monitoring framework (Nagios, Zabbix, or something similar)... and from my studies, simply watching traditional values via iostat isn't the best method. If the ZIL is either disabled or pushing to a SLOG, then watching the breathing of TXG syncs should be all that's really important to me, at least on the write side... that's my theory anyway. Feel free to flog me. :)

Thank you very much for your help Mark!

benr.
Ben Rockwood wrote:
> Mark Maybee wrote:
>> Ben Rockwood wrote:
>>> I need some help with clarification.
>>>
>>> My understanding is that there are 2 instances in which ZFS will write
>>> to disk:
>>> 1) TXG Sync
>>> 2) ZIL
>>>
>>> Post-snv_87 a TXG should sync out when the TXG is either overfilled or
>>> hits the timeout of 30 seconds.
>>>
>>> First question is... is there some place I can see what this max TXG
>>> size is? If I recall it's 1/8th of system memory... but there has to be
>>> a counter somewhere, right?
>>
>> There is both a memory throttle limit (enforced in arc_memory_throttle())
>> and a write throughput throttle limit (calculated in dsl_pool_sync(),
>> enforced in dsl_pool_tempreserve_space()). The write limit is stored as
>> the 'dp_write_limit' for each pool.
>
> I cooked up the following:
>
> $ dtrace -qn fbt::dsl_pool_sync:entry'{ printf("Throughput is \t %d\n write limit is\t %d\n\n", args[0]->dp_throughput, args[0]->dp_write_limit); }'
> Throughput is    883975129
> write limit is   3211748352
>
> I'm confused with regard to the units and interpretation.
>
> For instance, the write limit here is almost 3GB on a system with 4GB of
> RAM. However, if I read the code right the value here is already
> inflated 6x... so the real write limit is actually about 510MB, right?

The write_limit is independent of the memory size. It's based purely on the IO bandwidth available to the pool. So a write_limit of 3GB implies that we think we can push 3GB of (inflated) data in 5 seconds to the drives. If we take out the inflation, this means we think we can push 100MB/s to the pool's drives.

> As for the throughput, I need verification... I think the unit here is
> bytes per second?

Correct.

>>> I'm unclear on ZIL writes. I think that they happen independently of
>>> the normal TXG rotation, but I'm not sure.
>>>
>>> So the second question is: do they happen with a TXG sync (expedited) or
>>> independent of the normal TXG sync flow?
>>>
>>> Finally, I'm unclear on exactly what constitutes a TXG stall. I had
>>> assumed that it indicated TXGs that exceeded the allotted time, but
>>> after some dtracing I'm uncertain.
>>
>> I'm not certain what you mean by "TXG stall".
>
> I refer to the following code, which I'm having some trouble properly
> understanding:
>
> boolean_t
> txg_stalled(dsl_pool_t *dp)
> {
>         tx_state_t *tx = &dp->dp_tx;
>         return (tx->tx_quiesce_txg_waiting > tx->tx_open_txg);
> }

Ah. A "stall" in this context means that the sync phase is idle, waiting for the next txg to quiesce... so the current train is "stalled" until the quiesce finishes.

> Ultimately, what this all comes down to is finding a reliable way to
> determine when ZFS is struggling. I'm currently watching (on pre-snv_87
> systems) txg sync times, and if they exceed roughly 4 seconds per sync I
> know there is trouble brewing. I'm considering whether watching either
> txg_stalled or txg_delay may be a better way to flag trouble.

Stalled tends to mean that there is something happening "up top" preventing things from moving (i.e., a tx not closing). Delay is used when we are trying to push more data than the pool can handle, so that may be what you want to look at.

> dp_throughput looks like it also might be a good candidate, although it
> was only added in snv_98, unfortunately, so it doesn't help a lot of my
> existing installs. Nevertheless, graphing this value could be very
> telling and would be nice to have available as a kstat.
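If it is useful, here is a rough sketch for watching both of the conditions distinguished above. It assumes fbt probes exist for txg_delay() and txg_stalled() on your build; small functions like these are occasionally inlined, in which case the probes will not be present.

#!/usr/sbin/dtrace -s
/*
 * txg_delay() firing means writers are being held back because the pool
 * cannot absorb the incoming data; txg_stalled() returning nonzero means
 * the sync thread is idle, waiting for the open txg to quiesce.
 */
fbt::txg_delay:entry
{
        @events["txg_delay (writers throttled)"] = count();
}

fbt::txg_stalled:return
/arg1 != 0/
{
        @events["txg_stalled (sync waiting on quiesce)"] = count();
}

tick-10sec
{
        printa(@events);
        trunc(@events);
}

Per the comment above, a steady stream of txg_delay hits is probably the better "trouble brewing" signal for a monitoring check.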
Agreed.

> The intended result is to have a reliable means of monitoring ZFS health
> via a standard monitoring framework (Nagios, Zabbix, or something
> similar)... and from my studies, simply watching traditional values via
> iostat isn't the best method. If the ZIL is either disabled or pushing to
> a SLOG, then watching the breathing of TXG syncs should be all that's
> really important to me, at least on the write side... that's my theory
> anyway. Feel free to flog me. :)

Nope, that makes sense to me. Ideally, we should either be chugging along at 5-to-30 second intervals between syncs with no delays (i.e., a light IO load), or we should be doing consistent 5s syncs with a few delays seen (max capacity). If you start seeing lots of delays, you are probably trying to push too much data.

Note that we are still tuning this code. Recently we discovered that we may want to change the throughput calculation (we currently don't include the dsl_dataset_sync() calls in the calc... we may want to change that) to include more of the IO "setup" time.

> Thank you very much for your help Mark!
>
> benr.
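To watch the sync cadence described above, a sketch like this times each spa_sync() call per pool. It assumes spa_sync() is fbt-traceable and that spa_name is the embedded pool-name field of spa_t on your build.

#!/usr/sbin/dtrace -s
/*
 * Distribution of txg sync times, in milliseconds, per pool.
 * Healthy: syncs well under the 5s target, arriving every 5-30 seconds.
 * Suspect: syncs that routinely run to several seconds or more.
 */
fbt::spa_sync:entry
{
        self->spa = args[0];
        self->start = timestamp;
}

fbt::spa_sync:return
/self->start/
{
        @sync_ms[stringof(self->spa->spa_name)] =
            quantize((timestamp - self->start) / 1000000);
        self->spa = 0;
        self->start = 0;
}

tick-60sec
{
        printa(@sync_ms);
        trunc(@sync_ms);
}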
Mark Maybee wrote:
> The write_limit is independent of the memory size. It's based purely
> on the IO bandwidth available to the pool. So a write_limit of 3GB
> implies that we think we can push 3GB of (inflated) data in 5
> seconds to the drives. If we take out the inflation, this means
> we think we can push 100MB/s to the pool's drives.

Ah, I see... 3GB / 5 secs = 600MB/s, de-inflate by dividing by 6 to get 100MB/s. Thank you.

>> The intended result is to have a reliable means of monitoring ZFS health
>> via a standard monitoring framework (Nagios, Zabbix, or something
>> similar)... and from my studies, simply watching traditional values via
>> iostat isn't the best method. If the ZIL is either disabled or pushing to
>> a SLOG, then watching the breathing of TXG syncs should be all that's
>> really important to me, at least on the write side... that's my theory
>> anyway. Feel free to flog me. :)
>
> Nope, that makes sense to me. Ideally, we should either be chugging
> along at 5-to-30 second intervals between syncs with no delays (i.e.,
> a light IO load), or we should be doing consistent 5s syncs with a
> few delays seen (max capacity). If you start seeing lots of delays,
> you are probably trying to push too much data.
>
> Note that we are still tuning this code. Recently we discovered that
> we may want to change the throughput calculation (we currently don't
> include the dsl_dataset_sync() calls in the calc... we may want to
> change that) to include more of the IO "setup" time.

I'm certain that there is more than enough on everyone's plate at the moment, but I would be very happy to see purpose-built kstats for monitoring in the future. The more time I spend in the ZFS code, the more I realize that iostat and even VFS-level calculations are misleading at best. Standardizing on the "right numbers"(tm) to watch would be a great benefit to the user base. Comforting and consoling our customers who are reacting to numbers they don't understand is becoming ever more time consuming. :)

Thank you for your help Mark, this clears up several things for me!

benr.
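For completeness, the same arithmetic can be printed directly from the earlier dsl_pool_sync probe. Same caveats as above: the 6x inflation factor and the 5-second sync target are implementation details that may change between builds.

# dtrace -qn 'fbt::dsl_pool_sync:entry
{
        /* de-inflate the write limit and spread it over the 5s target */
        printf("write limit: %d MB inflated, ~%d MB/s de-inflated\n",
            args[0]->dp_write_limit / 1048576,
            args[0]->dp_write_limit / 6 / 5 / 1048576);
}'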