Hi

After a clean database load, a database would (should?) look like this, if a random stab at the data is taken...

[8KB-m][8KB-n][8KB-o][8KB-p]...

The data should be fairly (100%) sequential in layout... after some days, though, that same spot (using ZFS) would probably look like:

[8KB-m][ ][8KB-o][ ]

Is this "pseudo logical-physical" view correct (if blocks n and p were updated and with COW relocated somewhere else)?

Could a utility be constructed to show the level of "fragmentation"? (50% in the above example)

IF the above theory is flawed... how would fragmentation "look/be observed/calculated" under ZFS with large Oracle tablespaces?

Does it even matter what the "fragmentation" is from a performance perspective?
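A minimal sketch of the fragmentation metric described above, assuming you could obtain a file's logical-to-physical block map (ZFS does not export one through a public interface; in practice it would mean parsing zdb output, which is version-specific and not shown here). The layout list below is the hypothetical example from the post: blocks m and o still in place, n and p relocated by COW.

    def fragmentation_pct(phys_offsets, block_size=8192):
        # Fraction of blocks that no longer sit where a fully sequential
        # layout (anchored at the first block) would have put them.
        start = phys_offsets[0]
        moved = sum(1 for i, off in enumerate(phys_offsets)
                    if off != start + i * block_size)
        return 100.0 * moved / len(phys_offsets)

    # m in place, n relocated, o in place, p relocated (offsets in bytes)
    layout = [0, 10_000_000, 16_384, 20_000_000]
    print(fragmentation_pct(layout))   # 50.0, matching the example above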
Louwtjie Burger writes:

> After a clean database load, a database would (should?) look like this,
> if a random stab at the data is taken...
>
> [8KB-m][8KB-n][8KB-o][8KB-p]...
>
> The data should be fairly (100%) sequential in layout... after some
> days, though, that same spot (using ZFS) would probably look like:
>
> [8KB-m][ ][8KB-o][ ]
>
> Is this "pseudo logical-physical" view correct (if blocks n and p were
> updated and with COW relocated somewhere else)?

That's the proper view if the ZFS recordsize is tuned to be 8KB. That's a best practice that might need to be qualified in the future.

> Could a utility be constructed to show the level of "fragmentation"?
> (50% in the above example)

That would need to dive into the internals of ZFS, but anything is possible. It's been done for UFS before.

> IF the above theory is flawed... how would fragmentation "look/be
> observed/calculated" under ZFS with large Oracle tablespaces?
>
> Does it even matter what the "fragmentation" is from a performance perspective?

It matters to table scans and how those scans will impact OLTP workloads. Good blog topic. Stay tuned.
This question triggered some silly questions in my mind:

Lots of folks are determined that the whole COW-to-different-locations thing is a Bad Thing(tm), and in some cases, I guess it might actually be...

What if ZFS had a pool/filesystem property that caused ZFS to do a journaled, but non-COW, update so the data's relative location for databases is always the same?

Or - what if it did a double update: one to a staged area, and another immediately after that to the 'old' data blocks? You still always have on-disk consistency etc., at the cost of double the I/Os...

Of course, both of these would require non-sparse file creation for the DB etc., but would it be plausible?

For very read-intensive and position-sensitive applications, I guess this sort of capability might make a difference?

Just some stabs in the dark...

Cheers!

Nathan.
Nathan Kroenert wrote:

> This question triggered some silly questions in my mind:
>
> Lots of folks are determined that the whole COW-to-different-locations
> thing is a Bad Thing(tm), and in some cases, I guess it might actually be...

There is a lot of speculation about this, but no real data. I've done some experiments on long seeks and didn't see much of a performance difference, but I wasn't using a database workload. Note that the many caches and optimizations in the path between the database and physical medium will make this very difficult to characterize for a general case. Needless to say, you'll get better performance on a device which can handle multiple outstanding I/Os -- avoid PATA disks.

> What if ZFS had a pool/filesystem property that caused ZFS to do a
> journaled, but non-COW, update so the data's relative location for
> databases is always the same?
>
> Or - what if it did a double update: one to a staged area, and another
> immediately after that to the 'old' data blocks? You still always have
> on-disk consistency etc., at the cost of double the I/Os...

This is a non-starter. Two I/Os is worse than one.

> Of course, both of these would require non-sparse file creation for the
> DB etc., but would it be plausible?
>
> For very read-intensive and position-sensitive applications, I guess
> this sort of capability might make a difference?

We are all anxiously awaiting data...
 -- richard
> This question triggered some silly questions in my mind:

Actually, they're not silly at all.

> Lots of folks are determined that the whole COW-to-different-locations
> thing is a Bad Thing(tm), and in some cases, I guess it might actually be...
>
> What if ZFS had a pool/filesystem property that caused ZFS to do a
> journaled, but non-COW, update so the data's relative location for
> databases is always the same?

That's just what a conventional file system (no need even for a journal, when you're updating in place) does when it's not guaranteeing write atomicity (you address the latter below).

> Or - what if it did a double update: one to a staged area, and another
> immediately after that to the 'old' data blocks? You still always have
> on-disk consistency etc., at the cost of double the I/Os...

It only requires an extra disk access if the new data is too large to dump right into the journal itself (which guarantees that the subsequent in-place update can complete). Whether the new data is dumped into the log or into a temporary location the pointer to which is logged, the subsequent in-place update can be deferred until it's convenient (e.g., until after any additional updates to the same data have also been accumulated, activity has cooled off, and the modified blocks are getting ready to be evicted from the system cache - and, optionally, until the target disks are idle or have their heads positioned conveniently near the target location).

ZFS's small-synchronous-write log can do something similar as long as the writes aren't too large to place in it. However, data that's only persisted in the journal isn't accessible via the normal snapshot mechanisms (well, if an entire file block was dumped into the journal I guess it could be, at the cost of some additional complexity in journal space reuse), so I'm guessing that ZFS writes back any dirty data that's in the small-update journal whenever a snapshot is created.

And if you start actually updating in place as described above, then you can't use ZFS-style snapshotting at all: instead of capturing the current state as the snapshot with the knowledge that any subsequent updates will not disturb it, you have to capture the old state that you're about to overwrite and stuff it somewhere else - and then figure out how to maintain appropriate access to it while the rest of the system moves on.

Snapshots make life a lot more complex for file systems than it used to be, and COW techniques make snapshotting easy at the expense of normal run-time performance - not just because they make update-in-place infeasible for preserving on-disk contiguity but because of the significant increase in disk bandwidth (and snapshot storage space) required to write back changes all the way up to whatever root structure is applicable: I suspect that ZFS does this on every synchronous update save for those that it can leave temporarily in its small-update journal, and it *has* to do it whenever a snapshot is created.

> Of course, both of these would require non-sparse file creation for the
> DB etc., but would it be plausible?

Update-in-place files can still be sparse: it's only data that already exists that must be present (and updated in place to preserve sequential access performance to it).

> For very read-intensive and position-sensitive applications, I guess
> this sort of capability might make a difference?

No question about it.
And sequential table scans in databases are among the most significant examples, because (unlike things like streaming video files which just get laid down initially and non-synchronously in a manner that at least potentially allows ZFS to accumulate them in large, contiguous chunks - though ISTR some discussion about just how well ZFS managed this when it was accommodating multiple such write streams in parallel) the tables are also subject to fine-grained, often-random update activity.

Background defragmentation can help, though it generates a boatload of additional space overhead in any applicable snapshot.

- bill
can you guess? wrote:

>> For very read-intensive and position-sensitive applications, I guess
>> this sort of capability might make a difference?
>
> No question about it. And sequential table scans in databases
> are among the most significant examples, because (unlike things
> like streaming video files which just get laid down initially
> and non-synchronously in a manner that at least potentially
> allows ZFS to accumulate them in large, contiguous chunks -
> though ISTR some discussion about just how well ZFS managed
> this when it was accommodating multiple such write streams in
> parallel) the tables are also subject to fine-grained,
> often-random update activity.
>
> Background defragmentation can help, though it generates a
> boatload of additional space overhead in any applicable snapshot.

The reason that this is hard to characterize is that there are really two very different configurations used to address different performance requirements: cheap and fast. It seems that when most people first consider this problem, they do so from the cheap perspective: the single-disk view. Anyone who strives for database performance will choose the fast perspective: stripes. Note: data redundancy isn't really an issue for this analysis, but consider it done in real life.

When you have a striped storage device under a file system, then the database or file system's view of contiguous data is not contiguous on the media. There are many different ways to place the data on the media, and we would typically strive for a diverse stochastic spread. Hmm... one could theorize that COW will also result in a diverse stochastic spread.

The complexity of the characterization is then caused by the large number of variables which the systems use to spread the data (interlace size, block size, prefetch, caches, cache policies, etc.) and the feasibility of understanding the interdependent relationships these will have on performance. Real data would be greatly appreciated.
 -- richard
> can you guess? wrote:
>>> For very read-intensive and position-sensitive applications, I guess
>>> this sort of capability might make a difference?
>>
>> No question about it. And sequential table scans in databases
>> are among the most significant examples, because (unlike things
>> like streaming video files which just get laid down initially
>> and non-synchronously in a manner that at least potentially
>> allows ZFS to accumulate them in large, contiguous chunks -
>> though ISTR some discussion about just how well ZFS managed
>> this when it was accommodating multiple such write streams in
>> parallel) the tables are also subject to fine-grained,
>> often-random update activity.
>>
>> Background defragmentation can help, though it generates a
>> boatload of additional space overhead in any applicable snapshot.
>
> The reason that this is hard to characterize is that there are
> really two very different configurations used to address different
> performance requirements: cheap and fast. It seems that when most
> people first consider this problem, they do so from the cheap
> perspective: the single-disk view. Anyone who strives for database
> performance will choose the fast perspective: stripes.

And anyone who *really* understands the situation will do both.

> Note: data redundancy isn't really an issue for this analysis,
> but consider it done in real life. When you have a striped storage
> device under a file system, then the database or file system's view
> of contiguous data is not contiguous on the media.

The best solution is to make the data piece-wise contiguous on the media at the appropriate granularity - which is largely determined by disk access characteristics (the following assumes that the database table is large enough to be spread across a lot of disks at moderately coarse granularity, since otherwise it's often small enough to cache in the generous amounts of RAM that are inexpensively available today).

A single chunk on an (S)ATA disk today (the analysis is similar for high-performance SCSI/FC/SAS disks) needn't exceed about 4 MB in size to yield over 80% of the disk's maximum possible (fully-contiguous layout) sequential streaming performance (after the overhead of an 'average' - 1/3-stroke - initial seek and partial rotation are figured in: the latter could be avoided by using a chunk size that's an integral multiple of the track size, but on today's zoned disks that's a bit awkward). A 1 MB chunk yields around 50% of the maximum streaming performance. ZFS's maximum 128 KB 'chunk size', if effectively used as the disk chunk size as you seem to be suggesting, yields only about 15% of the disk's maximum streaming performance (leaving aside an additional degradation to a small fraction of even that should you use RAID-Z). And if you match the ZFS block size to a 16 KB database block size and use that as the effective unit of distribution across the set of disks, you'll obtain a mighty 2% of the potential streaming performance (again, we'll be charitable and ignore the further degradation if RAID-Z is used).
Now, if your system is doing nothing else but sequentially scanning this one database table, this may not be so bad: you get truly awful disk utilization (2% of its potential in the last case, ignoring RAID-Z), but you can still read ahead through the entire disk set and obtain decent sequential scanning performance by reading from all the disks in parallel. But if your database table scan is only one small part of a workload which is (perhaps the worst case) performing many other such scans in parallel, your overall system throughput will be only around 4% of what it could be had you used 1 MB chunks (and the individual scan performances will also suck commensurately, of course).

Using 1 MB chunks still spreads out your database admirably for parallel random-access throughput: even if the table is only 1 GB in size (eminently cachable in RAM, should that be preferable), that'll spread it out across 1,000 disks (2,000, if you mirror it and load-balance to spread out the accesses), and for much smaller database tables, if they're accessed sufficiently heavily for throughput to be an issue, they'll be wholly cache-resident. Or another way to look at it is in terms of how many disks you have in your system: if it's less than the number of MB in your table size, then the table will be spread across all of them regardless of what chunk size is used, so you might as well use one that's large enough to give you decent sequential scanning performance (and if your table is too small to spread across all the disks, then it may well all wind up in cache anyway).

ZFS's problem (well, the one specific to this issue, anyway) is that it tries to use its 'block size' to cover two different needs: performance for moderately fine-grained updates (though its need to propagate those updates upward to the root of the applicable tree significantly compromises this effort), and decent disk utilization (I'm using that term to describe throughput as a fraction of potential streaming throughput: just 'keeping the disks saturated' only describes where the system hits its throughput wall, not how well its design does in pushing that wall back as far as possible). The two requirements conflict, and in ZFS's case the latter one loses - badly.

Which is why background defragmentation could help, as I previously noted: it could rearrange the table such that multiple virtually-sequential ZFS blocks were placed contiguously on each disk (to reach 1 MB total, in the current example) without affecting the ZFS block size per se. But every block so rearranged (and every tree ancestor of each such block) would then leave an equal-sized residue in the most recent snapshot if one existed, which gets expensive fast in terms of snapshot space overhead (which then is proportional to the amount of reorganization performed as well as to the amount of actual data updating).

- bill
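As a rough cross-check of the percentages quoted above, a back-of-the-envelope model; the disk parameters are assumptions (a generic 7200 rpm SATA drive: ~8.5 ms average 1/3-stroke seek, ~4.2 ms average rotational latency, ~80 MB/s media rate), not measurements, so the exact figures will differ from drive to drive:

    SEEK_MS = 8.5          # assumed average (1/3-stroke) seek
    ROTATE_MS = 4.2        # assumed half a revolution at 7200 rpm
    MEDIA_MB_PER_S = 80.0  # assumed media transfer rate

    def streaming_efficiency(chunk_bytes):
        # Fraction of time spent transferring rather than positioning,
        # for one chunk read per seek.
        transfer_ms = chunk_bytes / (MEDIA_MB_PER_S * 1e6) * 1000.0
        return transfer_ms / (SEEK_MS + ROTATE_MS + transfer_ms)

    for label, size in [("16 KB", 16 << 10), ("128 KB", 128 << 10),
                        ("1 MB", 1 << 20), ("4 MB", 4 << 20)]:
        print(f"{label:>7}: {streaming_efficiency(size):.0%} of streaming bandwidth")

    # Prints roughly 2%, 11%, 51% and 80% respectively -- in the same
    # ballpark as the figures above.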
can you guess? wrote:

>> The reason that this is hard to characterize is that there are
>> really two very different configurations used to address different
>> performance requirements: cheap and fast. It seems that when most
>> people first consider this problem, they do so from the cheap
>> perspective: the single-disk view. Anyone who strives for database
>> performance will choose the fast perspective: stripes.
>
> And anyone who *really* understands the situation will do both.

I'm not sure I follow. Many people who do high-performance databases use hardware RAID arrays which often do not expose single disks.

>> Note: data redundancy isn't really an issue for this analysis,
>> but consider it done in real life. When you have a striped storage
>> device under a file system, then the database or file system's view
>> of contiguous data is not contiguous on the media.
>
> The best solution is to make the data piece-wise contiguous on the media at the appropriate granularity - which is largely determined by disk access characteristics (the following assumes that the database table is large enough to be spread across a lot of disks at moderately coarse granularity, since otherwise it's often small enough to cache in the generous amounts of RAM that are inexpensively available today).
>
> A single chunk on an (S)ATA disk today (the analysis is similar for high-performance SCSI/FC/SAS disks) needn't exceed about 4 MB in size to yield over 80% of the disk's maximum possible (fully-contiguous layout) sequential streaming performance (after the overhead of an 'average' - 1/3-stroke - initial seek and partial rotation are figured in: the latter could be avoided by using a chunk size that's an integral multiple of the track size, but on today's zoned disks that's a bit awkward). A 1 MB chunk yields around 50% of the maximum streaming performance. ZFS's maximum 128 KB 'chunk size', if effectively used as the disk chunk size as you seem to be suggesting, yields only about 15% of the disk's maximum streaming performance (leaving aside an additional degradation to a small fraction of even that should you use RAID-Z). And if you match the ZFS block size to a 16 KB database block size and use that as the effective unit of distribution across the set of disks, you'll obtain a mighty 2% of the potential streaming performance (again, we'll be charitable and ignore the further degradation if RAID-Z is used).

You do not seem to be considering the track cache, which for modern disks is 16-32 MBytes.
If those disks are in a RAID array, then there are often larger read caches as well. Expecting a seek and read for each iop is a bad assumption.

> Now, if your system is doing nothing else but sequentially scanning this one database table, this may not be so bad: you get truly awful disk utilization (2% of its potential in the last case, ignoring RAID-Z), but you can still read ahead through the entire disk set and obtain decent sequential scanning performance by reading from all the disks in parallel. But if your database table scan is only one small part of a workload which is (perhaps the worst case) performing many other such scans in parallel, your overall system throughput will be only around 4% of what it could be had you used 1 MB chunks (and the individual scan performances will also suck commensurately, of course).
>
> Using 1 MB chunks still spreads out your database admirably for parallel random-access throughput: even if the table is only 1 GB in size (eminently cachable in RAM, should that be preferable), that'll spread it out across 1,000 disks (2,000, if you mirror it and load-balance to spread out the accesses), and for much smaller database tables, if they're accessed sufficiently heavily for throughput to be an issue, they'll be wholly cache-resident. Or another way to look at it is in terms of how many disks you have in your system: if it's less than the number of MB in your table size, then the table will be spread across all of them regardless of what chunk size is used, so you might as well use one that's large enough to give you decent sequential scanning performance (and if your table is too small to spread across all the disks, then it may well all wind up in cache anyway).
>
> ZFS's problem (well, the one specific to this issue, anyway) is that it tries to use its 'block size' to cover two different needs: performance for moderately fine-grained updates (though its need to propagate those updates upward to the root of the applicable tree significantly compromises this effort), and decent disk utilization (I'm using that term to describe throughput as a fraction of potential streaming throughput: just 'keeping the disks saturated' only describes where the system hits its throughput wall, not how well its design does in pushing that wall back as far as possible). The two requirements conflict, and in ZFS's case the latter one loses - badly.

Real data would be greatly appreciated. In my tests, I see reasonable media bandwidth speeds for reads.

> Which is why background defragmentation could help, as I previously noted: it could rearrange the table such that multiple virtually-sequential ZFS blocks were placed contiguously on each disk (to reach 1 MB total, in the current example) without affecting the ZFS block size per se. But every block so rearranged (and every tree ancestor of each such block) would then leave an equal-sized residue in the most recent snapshot if one existed, which gets expensive fast in terms of snapshot space overhead (which then is proportional to the amount of reorganization performed as well as to the amount of actual data updating).

This comes up often from people who want > 128 kByte block sizes for ZFS. And yet we can demonstrate media bandwidth limits relatively easily. How would you reconcile the differences?
 -- richard
>> Nathan Kroenert wrote:
>> ... What if it did a double update: one to a staged area, and another
>> immediately after that to the 'old' data blocks? You still always have
>> on-disk consistency etc., at the cost of double the I/Os...
>
> This is a non-starter. Two I/Os is worse than one.

Well, that attitude may be supportable for a write-only workload, but then so is the position that you really don't even need *one* I/O (since no one will ever need to read the data and you might as well just drop it on the floor). In the real world, data (especially database data) does usually get read after being written, and the entire reason the original poster raised the question was because sometimes it's well worth taking on some additional write overhead to reduce read overhead.

In such a situation, if you need to protect the database from partial-block updates as well as to keep it reasonably laid out for sequential table access, then performing the two writes described is about as good a solution as one can get (especially if the first of them can be logged - even better, logged in NVRAM - such that its overhead can be amortized across multiple such updates by otherwise independent processes, and even more especially if, as is often the case, the same data gets updated multiple times in sufficiently close succession that instead of 2N writes you wind up only needing to perform N+1 writes, the last being the only one that updates the data in place after the activity has cooled down).

>> Of course, both of these would require non-sparse file creation for the
>> DB etc., but would it be plausible?
>>
>> For very read-intensive and position-sensitive applications, I guess
>> this sort of capability might make a difference?
>
> We are all anxiously awaiting data...

Then you might find it instructive to learn more about the evolution of file systems on Unix:

In The Beginning there was the block, and the block was small, and it was isolated from its brethren, and darkness was upon the face of the deep because any kind of sequential performance well and truly sucked.

Then (after an inexcusably lengthy period of such abject suckage lasting into the '80s) there came into the world FFS, and while there was still only the block the block was at least a bit larger, and it was at least somewhat less isolated from its brethren, and once in a while it actually lived right next to them, and while sequential performance still usually sucked at least it sucked somewhat less.

And then the disciples Kleiman and McVoy looked upon FFS and decided that mere proximity was still insufficient, and they arranged that blocks should (at least when convenient) be aggregated into small groups (56 KB actually not being all that small at the time, given the disk characteristics back then), and the Great Sucking Sound of Unix sequential-access performance was finally reduced to something at least somewhat quieter than a dull roar.

But other disciples had (finally) taken a look at commercial file systems that had been out in the real world for decades and that had had sequential performance down pretty well pat for nearly that long.
And so it came to pass that corporations like Veritas (VxFS), and SGI (EFS & XFS), and IBM (JFS) imported the concept of extents into the Unix pantheon, and the Gods of Throughput looked upon it, and it was good, and (at least in those systems) Unix sequential performance no longer sucked at all, and even non-corporate developers whose faith was strong nearly to the point of being blind could not help but see the virtues revealed there, and began incorporating extents into their own work, yea, even unto ext4.

And the disciple Hitz (for it was he, with a few others) took a somewhat different tack, and came up with a 'write anywhere file layout' but had the foresight to recognize that it needed some mechanism to address sequential performance (not to mention parity-RAID performance). So he abandoned general-purpose approaches in favor of the Appliance, and gave it most uncommodity-like but yet virtuous NVRAM to allow many consecutive updates to be aggregated into not only stripes but adjacent stripes before being dumped to disk, and the Gods of Throughput smiled upon his efforts, and they became known throughout the land.

Now comes back Sun with ZFS, apparently ignorant of the last decade-plus of Unix file system development (let alone development in other systems dating back to the '60s). Blocks, while larger (though not necessarily proportionally larger, due to dramatic increases in disk bandwidth), are once again often isolated from their brethren. True, this makes the COW approach a lot easier to implement, but (leaving aside the debate about whether COW as implemented in ZFS is a good idea at all) there is *no question whatsoever* that it returns a significant degree of suckage to sequential performance - especially for data subject to small, random updates.

Here ends our lesson for today.

- bill
> When you have a striped storage device under a
> file system, then the database or file system's view
> of contiguous data is not contiguous on the media.

Right. That's a good reason to use fairly large stripes. (The primary limiting factor for stripe size is efficient parallel access; using a 100 MB stripe size means that an average 100 MB file gets less than two disks' worth of throughput.)

ZFS, of course, doesn't have this problem, since it's handling the layout on the media; it can store things as contiguously as it wants.

> There are many different ways to place the data on the media and we would
> typically strive for a diverse stochastic spread.

Err ... why?

A random distribution makes reasonable sense if you assume that future read requests are independent, or that they are dependent in unpredictable ways. Now, if you've got sufficient I/O streams, you could argue that requests *are* independent, but in many other cases they are not, and they're usually predictable (particularly after a startup period). Optimizing for the predicted access cases makes sense. (Optimizing for observed access may make sense in some cases as well.)

-- Anton
> ... But every block so rearranged
> (and every tree ancestor of each such block) would
> then leave an equal-sized residue in the most recent
> snapshot if one existed, which gets expensive fast in
> terms of snapshot space overhead (which then is
> proportional to the amount of reorganization
> performed as well as to the amount of actual data
> updating).

Actually, it's not *quite* as bad as that, since the common parent block of multiple children should appear only once in the snapshot, not once for each child moved. Still, it does drive up snapshot overhead, and if you start trying to use snapshots to simulate 'continuous data protection' rather than more sparingly, the problem becomes more significant (because each snapshot will catch any background defragmentation activity at a different point, such that common parent blocks may appear in more than one snapshot even if no child data has actually been updated).

Once you introduce CDP into the process (and it's tempting to, since the file system is in a better position to handle it efficiently than some add-on product), rethinking how one approaches snapshots (and COW in general) starts to make more sense.

- bill
> We are all anxiously awaiting data...
> -- richard

Would it be worthwhile to build a test case?

- Build a PostgreSQL database and import 1 000 000 (or more) lines of data.
- Run single and multiple large table scan queries ... and watch the system, then
- Update a column of each row in the database, run the same queries and watch the system.

Continue updating more columns (to get more "defrag") until you notice something.

I personally believe that since most people will have hardware LUNs (with underlying RAID) and cache, it will be difficult to notice anything, given that those hardware LUNs might be busy with their own wizardry ;) You will also have to minimize the effect of the database cache...

It will be a tough assignment ... maybe someone has already done this?

Thinking about this (very abstract) ... does it really matter?

[8KB-a][8KB-b][8KB-c]

So what if 8KB-b gets updated and moved somewhere else? If the DB gets a request to read 8KB-a, it needs to do an I/O (eliminate all caching). If it gets a request to read 8KB-b, it needs to do an I/O. Does it matter that b is somewhere else ... it still needs to go get it ... only in a very abstract world with read-ahead (both hardware or db) would 8KB-b be in cache after 8KB-a was read.

Hmmm... the only way is to get some data :) *hehe*
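Not the PostgreSQL experiment itself, but a minimal file-level stand-in for it (the path, sizes and update fraction are arbitrary choices): lay a file down sequentially on a small-recordsize ZFS filesystem, time a full sequential read, randomly overwrite 8 KB records synchronously as a crude stand-in for OLTP updates, then time the sequential read again. Caches need to be defeated between timings (e.g. export/import the pool) for the comparison to mean anything.

    import os, random, time

    PATH = "/tank/db/testfile"   # hypothetical dataset, e.g. with recordsize=8k
    SIZE = 1 << 30               # 1 GiB test file
    REC = 8192

    def seq_read_seconds(path):
        t0 = time.time()
        with open(path, "rb", buffering=0) as f:
            while f.read(1 << 20):           # read in 1 MB slices
                pass
        return time.time() - t0

    # 1. Lay the file down sequentially.
    buf = os.urandom(1 << 20)
    with open(PATH, "wb") as f:
        for _ in range(SIZE // len(buf)):
            f.write(buf)
        os.fsync(f.fileno())

    print("fresh layout :", seq_read_seconds(PATH))

    # 2. Randomly overwrite individual 8 KB records, synchronously,
    #    the way a database would.
    with open(PATH, "r+b", buffering=0) as f:
        for _ in range(SIZE // REC // 4):    # rewrite ~25% of the records
            f.seek(random.randrange(SIZE // REC) * REC)
            f.write(os.urandom(REC))
            os.fsync(f.fileno())

    print("after updates:", seq_read_seconds(PATH))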
Anton B. Rang wrote:

>> There are many different ways to place the data on the media and we would
>> typically strive for a diverse stochastic spread.
>
> Err ... why?
>
> A random distribution makes reasonable sense if you assume that future read requests are independent, or that they are dependent in unpredictable ways. Now, if you've got sufficient I/O streams, you could argue that requests *are* independent, but in many other cases they are not, and they're usually predictable (particularly after a startup period). Optimizing for the predicted access cases makes sense. (Optimizing for observed access may make sense in some cases as well.)

For modern disks, media bandwidths are now getting to be > 100 MBytes/s. If you need 500 MBytes/s of sequential read, you'll never get it from one disk. You can get it from multiple disks, so the questions are:

1. How to avoid other bottlenecks, such as a shared fibre channel path? Diversity.
2. How to predict the data layout such that you can guarantee a wide spread? This is much more difficult. But you can use random distribution to reduce the probability (stochastic) that you'll be reading all blocks from one disk.

There are pathological cases, especially for block-aligned data, but those tend to be rather easy to identify when you look at the performance data.
 -- richard
...

> I personally believe that since most people will have hardware LUNs
> (with underlying RAID) and cache, it will be difficult to notice
> anything, given that those hardware LUNs might be busy with their own
> wizardry ;) You will also have to minimize the effect of the database
> cache ...

By definition, once you've got the entire database in cache, none of this matters (though filling up the cache itself takes some added time if the table is fragmented). Most real-world databases don't manage to fit all or even mostly in cache, because people aren't willing to dedicate that much RAM to running them. Instead, they either use a lot less RAM than the database size or share the system with other activity that shares use of the RAM. In other words, they use a cost-effective rather than a money-is-no-object configuration, but then would still like to get the best performance they can from it.

> It will be a tough assignment ... maybe someone has already done this?
>
> Thinking about this (very abstract) ... does it really matter?
>
> [8KB-a][8KB-b][8KB-c]
>
> So what if 8KB-b gets updated and moved somewhere else? If the DB gets
> a request to read 8KB-a, it needs to do an I/O (eliminate all
> caching). If it gets a request to read 8KB-b, it needs to do an I/O.
>
> Does it matter that b is somewhere else ...

Yes, with any competently-designed database.

> it still needs to go get it ... only in a very abstract world with
> read-ahead (both hardware or db) would 8KB-b be in cache after 8KB-a
> was read.

1. If there's no other activity on the disk, then the disk's track cache will acquire the following data when the first block is read, because it has nothing better to do. But if all the disks are just sitting around waiting for this table scan to get to them, then if ZFS has a sufficiently intelligent read-ahead mechanism it could help out a lot here as well: the differences become greater when the system is busier.

2. Even a moderately smart disk will detect a sequential access pattern if one exists and may read ahead at least modestly after having detected that pattern, even if it *does* have other requests pending.

3. But in any event any competent database will explicitly issue prefetches when it knows (and it *does* know) that it is scanning a table sequentially - and will also have taken pains to try to ensure that the table data is laid out such that it can be scanned efficiently. If it's using disks that support tagged command queuing it may just issue a bunch of single-database-block requests at once, and the disk will organize them such that they can all be satisfied by a single streaming access; with disks that don't support queuing, the database can elect to issue a single large I/O request covering many database blocks, accomplishing the same thing as long as the table is in fact laid out contiguously on the medium (the database knows this if it's handling the layout directly, but when it's using a file system as an intermediary it usually can only hope that the file system has minimized file fragmentation).

> Hmmm... the only way is to get some data :) *hehe*

Data is good, as long as you successfully analyze what it actually means: it either tends to confirm one's understanding or to refine it.

- bill
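To make point 3 above concrete, a small sketch of database-style explicit prefetching at the file level, using posix_fadvise(WILLNEED) to ask the OS for the next ~1 MB run of a table file up front so it can be fetched in large sequential I/Os rather than one 16 KB read at a time. The path and block size are illustrative; a real database would use its own scatter/gather or asynchronous I/O machinery instead.

    import os

    TABLE = "/tank/db/orders.tbl"   # illustrative path
    DB_BLOCK = 16384                # 16 KB database blocks
    RUN = 64                        # prefetch ~1 MB at a time

    def table_scan(path, process_block):
        fd = os.open(path, os.O_RDONLY)
        try:
            size = os.fstat(fd).st_size
            offset = 0
            while offset < size:
                # Hint the whole next run to the OS (where posix_fadvise
                # exists), so it can be read ahead in large sequential I/Os.
                if hasattr(os, "posix_fadvise"):
                    os.posix_fadvise(fd, offset, RUN * DB_BLOCK,
                                     os.POSIX_FADV_WILLNEED)
                for _ in range(RUN):
                    block = os.pread(fd, DB_BLOCK, offset)
                    if not block:
                        return
                    process_block(block)
                    offset += DB_BLOCK
        finally:
            os.close(fd)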
Richard Elling wrote:

>>> The reason that this is hard to characterize is that there are
>>> really two very different configurations used to address different
>>> performance requirements: cheap and fast. It seems that when most
>>> people first consider this problem, they do so from the cheap
>>> perspective: the single-disk view. Anyone who strives for database
>>> performance will choose the fast perspective: stripes.
>>
>> And anyone who *really* understands the situation will do both.
>
> I'm not sure I follow. Many people who do high-performance
> databases use hardware RAID arrays which often do not
> expose single disks.

They don't have to expose single disks: they just have to use reasonable chunk sizes on each disk, as I explained later.

Only very early (or very low-end) RAID used very small per-disk chunks (up to 64 KB max). Before the mid-'90s, chunk sizes had grown to 128 - 256 KB per disk on mid-range arrays in order to improve disk utilization in the array. From talking with one of its architects years ago, my impression is that HP's (now somewhat aging) EVA series uses 1 MB as its chunk size (the same size I used as an example, though today one could argue for as much as 4 MB and soon perhaps even more).

The array chunk size is not the unit of update, just the unit of distribution across the array: RAID-5 will happily update a single 4 KB file block within a given array chunk and the associated 4 KB of parity within the parity chunk. But the larger chunk size does allow files to retain the option of using logical contiguity to attain better streaming sequential performance, rather than splintering that logical contiguity at fine grain across multiple disks.

>> A single chunk on an (S)ATA disk today (the analysis is similar for
>> high-performance SCSI/FC/SAS disks) needn't exceed about 4 MB in size
>> to yield over 80% of the disk's maximum possible (fully-contiguous
>> layout) sequential streaming performance (after the overhead of an
>> 'average' - 1/3-stroke - initial seek and partial rotation are figured
>> in: the latter could be avoided by using a chunk size that's an
>> integral multiple of the track size, but on today's zoned disks that's
>> a bit awkward). A 1 MB chunk yields around 50% of the maximum
>> streaming performance. ZFS's maximum 128 KB 'chunk size', if
>> effectively used as the disk chunk size as you seem to be suggesting,
>> yields only about 15% of the disk's maximum streaming performance
>> (leaving aside an additional degradation to a small fraction of even
>> that should you use RAID-Z). And if you match the ZFS block size to a
>> 16 KB database block size and use that as the effective unit of
>> distribution across the set of disks, you'll obtain a mighty 2% of the
>> potential streaming performance (again, we'll be charitable and ignore
>> the further degradation if RAID-Z is used).
>
> You do not seem to be considering the track cache, which for
> modern disks is 16-32 MBytes. If those disks are in a RAID array,
> then there are often larger read caches as well.

Are you talking about hardware RAID in that last comment? I thought ZFS was supposed to eliminate the need for that.

> Expecting a seek and read for each iop is a bad assumption.

The bad assumption is that the disks are otherwise idle and therefore have the luxury of filling up their track caches - especially when I explicitly assumed otherwise in the following paragraph in that post.
If the system is heavily loaded, the disks will usually have other requests queued up (even if the next request comes in immediately rather than being queued at the disk itself, an even half-smart disk will abort any current read-ahead activity so that it can satisfy the new request).

Not that it would necessarily do much good for the case currently under discussion even if the disks weren't otherwise busy and they did fill up the track caches: ZFS's COW policies tend to encourage data that's updated randomly at fine grain (as a database table often is) to be splattered across the storage rather than neatly arranged such that the next data requested from a given disk will just happen to reside right after the previous data requested from that disk.

>> Now, if your system is doing nothing else but sequentially scanning
>> this one database table, this may not be so bad: you get truly awful
>> disk utilization (2% of its potential in the last case, ignoring
>> RAID-Z), but you can still read ahead through the entire disk set and
>> obtain decent sequential scanning performance by reading from all the
>> disks in parallel. But if your database table scan is only one small
>> part of a workload which is (perhaps the worst case) performing many
>> other such scans in parallel, your overall system throughput will be
>> only around 4% of what it could be had you used 1 MB chunks (and the
>> individual scan performances will also suck commensurately, of course).

...

> Real data would be greatly appreciated. In my tests, I see
> reasonable media bandwidth speeds for reads.

You already said that you hadn't been studying databases (the source of the kind of random-update/streaming-read access mix specifically under consideration here). But while they may be one of the worst cases in this respect (especially given their tendency to want to perform synchronous rather than lazy writes), the underlying problem is hardly unique: didn't I see a reference recently to streaming read performance issues with data that had been laid down by multiple concurrent sequential write streams?

>> Which is why background defragmentation could help, as I previously
>> noted: it could rearrange the table such that multiple
>> virtually-sequential ZFS blocks were placed contiguously on each disk
>> (to reach 1 MB total, in the current example) without affecting the
>> ZFS block size per se. But every block so rearranged (and every tree
>> ancestor of each such block) would then leave an equal-sized residue
>> in the most recent snapshot if one existed, which gets expensive fast
>> in terms of snapshot space overhead (which then is proportional to the
>> amount of reorganization performed as well as to the amount of actual
>> data updating).
>
> This comes up often from people who want > 128 kByte block sizes
> for ZFS.

Using larger block sizes to solve this problem would just be piling one kludge on top of another. Block size is not the right answer to streaming performance - you achieve it by arranging *multiple* blocks sensibly on the media, so that you can then use a block size that's otherwise appropriate for the application (e.g., 16 KB for a database that uses 16 KB blocks itself).

> And yet we can demonstrate media bandwidth limits
> relatively easily. How would you reconcile the differences?

Perhaps the difference is that you're happier talking about workloads that make ZFS look good rather than actively looking for workloads that give ZFS fits.
Start looking for what ZFS is *not* good at and you'll find it (and then be able to start thinking about how to fix it).

- bill
...

> For modern disks, media bandwidths are now getting to be > 100 MBytes/s.
> If you need 500 MBytes/s of sequential read, you'll never get it from
> one disk.

And no one here even came remotely close to suggesting that you should try to.

> You can get it from multiple disks, so the questions are:
> 1. How to avoid other bottlenecks, such as a shared fibre channel
>    path? Diversity.
> 2. How to predict the data layout such that you can guarantee a wide
>    spread?

You've missed at least one more significant question:

3. How to lay out the data such that this 500 MB/s drain doesn't cripple *other* concurrent activity going on in the system. (That's what increasing the amount laid down on each drive to around 1 MB accomplishes - otherwise, you can easily wind up using all the system's disk resources to satisfy that one application, or even fall short if you have fewer than 50 disks available, since if you spread the data out relatively randomly in 128 KB chunks on a system with disks reasonably well-filled with data you'll only be obtaining around 10 MB/s from each disk, whereas with 1 MB chunks similarly spread about each disk can contribute more like 35 MB/s and you'll need only 14 - 15 disks to meet your requirement.)

Use smaller ZFS block sizes and/or RAID-Z and things get rapidly worse.

- bill
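The disk-count arithmetic behind those figures, using the same assumed drive parameters as the earlier sketch (the exact per-disk rates shift with whatever seek and media-rate numbers you plug in):

    import math

    SEEK_MS, ROTATE_MS = 8.5, 4.2   # assumed positioning overhead per chunk
    MEDIA_MB_PER_S = 80.0           # assumed media transfer rate
    TARGET = 500.0                  # MB/s of sequential read wanted

    def per_disk_rate(chunk_bytes):
        chunk_mb = chunk_bytes / 1e6
        return chunk_mb / ((SEEK_MS + ROTATE_MS) / 1000.0 + chunk_mb / MEDIA_MB_PER_S)

    for label, size in [("128 KB", 128 << 10), ("1 MB", 1 << 20)]:
        rate = per_disk_rate(size)
        print(f"{label}: ~{rate:.0f} MB/s per busy disk, "
              f"~{math.ceil(TARGET / rate)} disks for {TARGET:.0f} MB/s")

    # Roughly 9 MB/s (~55 disks) for 128 KB chunks versus ~41 MB/s
    # (~13 disks) for 1 MB chunks -- in the same ballpark as the 10 MB/s
    # and 35 MB/s figures above.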
Anton B. Rang writes:

>> When you have a striped storage device under a
>> file system, then the database or file system's view
>> of contiguous data is not contiguous on the media.
>
> Right. That's a good reason to use fairly large stripes. (The
> primary limiting factor for stripe size is efficient parallel access;
> using a 100 MB stripe size means that an average 100 MB file gets less
> than two disks' worth of throughput.)
>
> ZFS, of course, doesn't have this problem, since it's handling the
> layout on the media; it can store things as contiguously as it wants.

It can do what it wants, but currently what it does is to maintain files subject to small random writes contiguous to the level of the ZFS recordsize. After a significant run of random writes, the file ends up with a scattered on-disk layout. This should work well for the transactional parts of the workload.

But the implication of using a small recordsize is that large sequential scans of files will make the disk heads very busy fetching or prefetching recordsize-sized chunks. Get more spindles and a good prefetch algorithm and you can reach whatever throughput you need. The problem is that your scanning ops will create heavy competition at the spindle level, thus impacting the transactional response time (once you have 150 IOPS on every spindle just prefetching data for your full table scans, the OLTP will suffer).

Yes, we do need data to characterise this, but the physics are fairly clear. The BP suggesting a small recordsize needs to be updated. We need to strike a balance between random writes and sequential reads, which does imply using records greater than 8K/16K DB blocks.

-r
> ... currently what it does is to maintain files subject to small
> random writes contiguous to the level of the ZFS recordsize. After a
> significant run of random writes, the file ends up with a scattered
> on-disk layout. This should work well for the transactional parts of
> the workload.

Absolutely (save for the fact that every database block write winds up writing all the block's ancestors as well, but that's a different discussion and one where ZFS looks only somewhat sub-optimal rather than completely uncompetitive when compared with different approaches).

> But the implication of using a small recordsize is that large
> sequential scans of files will make the disk heads very busy fetching
> or prefetching recordsize-sized chunks.

Well, no: that's only an implication if you *choose* not to arrange the individual blocks on the disk to support sequential access better - and that choice can have devastating implications for the kind of workload being discussed here (another horrendous example would be a simple array of fixed-size records in a file accessed - and in particular updated - randomly by ordinal record number converted to a file offset but also scanned sequentially for bulk processing).

Yes, choosing to reorganize files does have the kinds of snapshot implications that I've mentioned, but in most installations (save those entirely dedicated to databases) the situation under discussion here will typically involve only a small subset of the total data stored, and thus reorganization shouldn't have severe consequences.

> Get more spindles and a good prefetch algorithm and you can reach
> whatever throughput you need.

At the cost of using about 25x (or 50x - 200x, if using RAID-Z) as many disks as you'd need for the same throughput if the blocks were laid out in 1 MB chunks rather than the 16 KB chunks in the example database.

> The problem is that your scanning ops will create heavy competition
> at the spindle level, thus impacting the transactional response time
> (once you have 150 IOPS on every spindle just prefetching data for
> your full table scans, the OLTP will suffer).

And this effect on OLTP performance would be dramatically reduced as well if you pulled the sequential scan off the disk in large chunks rather than in randomly-distributed individual blocks.

> Yes, we do need data to characterise this, but the physics are fairly
> clear.

Indeed they are - thanks for recognizing this better than some here have managed to.

> The BP suggesting a small recordsize needs to be updated.
>
> We need to strike a balance between random writes and sequential
> reads, which does imply using records greater than 8K/16K DB blocks.

As Jesus Cea observed in the recent "ZFS + DB + default blocksize" discussion, this then requires that every database block update first read in the larger ZFS record before performing the update, rather than allowing the database block to be written directly. The ZFS block *might* still be present in ZFS's cache, but if most RAM is dedicated to the database cache (which for many workloads makes more sense) the chances of this are reduced (not only is ZFS's cache smaller but the larger database cache will hold the small database blocks a *lot* longer than ZFS's cache will hold any associated large ZFS blocks, so a database block update can easily occur long after the associated ZFS block has been evicted).
Even ignoring the deleterious effect on random database update performance, you still can't get *good* performance for sequential scans this way, because maximum-size 128 KB ZFS blocks laid out randomly are still a factor of about 4 less efficient at this than 1 MB chunks would be (i.e., you'd need about 4x as many disks - or 8x - 32x as many disks if using RAID-Z - to achieve comparable performance).

And it just doesn't have to be that way - as other modern Unix file systems recognized long ago. You don't even need to embrace extent-based allocation as they did, but just rearrange your blocks sensibly - and to at least some degree you could do that while they're still cache-resident if you captured updates for a while in the ZIL (what's the largest update that you're willing to stuff in there now?) before batch-writing them back to less temporary locations. RAID-Z could be fixed as well, which would help a much wider range of workloads.

- bill
Hello All,

Here's a possibly-silly proposal from a non-expert.

Summarising the problem:
- there's a conflict between small ZFS record size, for good random update performance, and large ZFS record size for good sequential read performance
- COW probably makes that conflict worse
- re-packing (~= defragmentation) would make it better, but cause problems with the snapshot mechanism

Proposed solution:
- keep COW
- create a new operation that combines snapshots and cloning
- when you're cloning, always write a tidy, re-packed layout of the data
- if you're using the new operation, keep the existing layout as the clone, and give the new layout to the running file-system

Things that have to be done to make this work:
- sort out the semantics, because the clone will be in the existing zpool, and the file-system will move to a new zpool (not sure if I have the terminology right)
- sort out the transactional properties; the changes made since the start of the operation will have to be copied across into the new layout

Regards,
James.
James Cone wrote:

> Hello All,
>
> Here's a possibly-silly proposal from a non-expert.
>
> Summarising the problem:
> - there's a conflict between small ZFS record size, for good random
>   update performance, and large ZFS record size for good sequential
>   read performance

Poor sequential read performance has not been quantified.

> - COW probably makes that conflict worse

This needs to be proven with a reproducible, real-world workload before it makes sense to try to solve it. After all, if we cannot measure where we are, how can we prove that we've improved?

Note: some block devices will not exhibit the phenomenon which people seem to be worried about in this thread. There are more options than just re-architecting ZFS. I'm not saying there aren't situations where there may be a problem; I'm just observing that nobody has brought data to this party.
 -- richard
Regardless of the merit of the rest of your proposal, I think you have put your finger on the core of the problem.

Aside from some apparent reluctance on the part of some of the ZFS developers to believe that any problem exists here at all (and leaving aside the additional monkey wrench that using RAID-Z here would introduce, because one could argue that files used in this manner are poor candidates for RAID-Z anyway, hence that there's no need to consider reorganizing RAID-Z files), the *only* downside (other than a small matter of coding) to defragmenting files in the background in ZFS is the impact that would have on run-time performance (which should be minimal if the defragmentation is performed at lower priority) and the impact it would have on the space consumed by a snapshot that existed while the defragmentation was being done.

One way to eliminate the latter would be simply not to reorganize while any snapshot (or clone) existed: no worse than the situation today, and better whenever no snapshot or clone is present. That would change the perceived 'expense' of a snapshot, though, since you'd know you were potentially giving up some run-time performance whenever one existed - and it's easy to imagine installations which might otherwise like to run things such that a snapshot was *always* present.

Another approach would be just to accept any increased snapshot space overhead. So many sequentially-accessed files are just written once and read-only thereafter that a lot of installations might not see any increased snapshot overhead at all. Some files are never accessed sequentially (or are accessed sequentially only in situations where performance is unimportant), and if they could be marked "Don't reorganize" then they wouldn't contribute any increased snapshot overhead either. One could introduce controls to limit the times when reorganization was done, though my inclination is to suspect that such additional knobs ought to be unnecessary.

One way to eliminate almost completely the overhead of the additional disk accesses consumed by background defragmentation would be to do it as part of the existing background scrubbing activity, but for actively-updated files one might want to defragment more often than one needed to scrub.

In any event, background defragmentation should be a relatively easy feature to introduce and try out if suitable multi-block contiguous allocation mechanisms already exist to support ZFS's existing batch writes. Use of the ZIL to perform opportunistic defragmentation while updated data was still present in the cache might be a bit more complex, but could still be worth investigating.

- bill
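For what it's worth, the effect of such a reorganization pass can be approximated today from userland, at the cost of exactly the snapshot residue described above: rewriting a file over itself in large sequential runs makes ZFS, being copy-on-write, allocate fresh and (space permitting) contiguous blocks for it. A crude sketch only, not the in-kernel background defragmenter being proposed; the path is illustrative, and it is not safe to run while the database has the file open for writing.

    import os

    CHUNK = 1 << 20   # rewrite in 1 MB runs, the granularity discussed earlier

    def repack(path):
        size = os.path.getsize(path)
        with open(path, "r+b", buffering=0) as f:
            offset = 0
            while offset < size:
                f.seek(offset)
                buf = f.read(CHUNK)
                if not buf:
                    break
                f.seek(offset)
                f.write(buf)            # same bytes, newly allocated location
                offset += len(buf)
            os.fsync(f.fileno())

    # repack("/tank/db/orders.tbl")     # illustrative path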
> Poor sequential read performance has not been quantified.
>
>> - COW probably makes that conflict worse
>
> This needs to be proven with a reproducible, real-world workload before
> it makes sense to try to solve it. After all, if we cannot measure where
> we are, how can we prove that we've improved?

I agree, let's first find a reproducible example where "updates" negatively impact large table scans ... one that is rather simple (if there is one) to reproduce and then work from there. I might be able to help with such an example during the month of December/January :)

I do have a question, though (I should probably ask in database-discuss):

Q: Does a full online backup of a DB (let's say Legato Networker with the Oracle plugin) constitute "large sequential table scans" in the way that it reads database data? It probably is not as simple as that...
My initial thought was that this whole thread may be irrelevant - anybody wanting to run such a database is likely to use a specialised filesystem optimised for it. But then I realised that for a database admin the integrity checking and other benefits of ZFS would be very tempting, but only if ZFS can guarantee equivalent performance to other filesystems.

So, let me see if I understand this right:

- Louwtjie is concerned that ZFS will fragment databases, potentially leading to read performance issues for some databases.

- Nathan appears to have suggested a good workaround. Could ZFS be updated to have a 'contiguous' setting where blocks are kept together? This sacrifices write performance for read.

- Richard isn't convinced there's a problem, as he's not seen any data supporting this. I can see his point, but I don't agree that this is a non-starter. For certain situations it could be very useful, and balancing read and write performance is an integral part of the choice of storage configuration.

- Bill seems to understand the issue, and added some useful background (although in an entertaining but rather condescending way).

Richard then went into a little more detail. I think he's pointing out here that while contiguous data is fastest if you consider a single disk, it is not necessarily the fastest approach when your data is spread across multiple disks. Instead he feels a 'diverse stochastic spread' is needed. I guess that means you want the data spread so all the disks can be used in parallel.

I think I'm now seeing why Richard is asking for real data. I think he believes that ZFS may already be faster than or equal to a standard contiguous filesystem in this scenario. Richard seems to be using a random or statistical approach to this: if data is saved randomly, you're likely to be using all disks when reading data.

I do see the point, and yes, data would be useful, but I think I agree with Bill on this. For reading data, while random locations are likely to be fast in terms of using multiple disks, that data is also likely to be spread and so is almost certain to result in more disk seeks. Whereas if you have contiguous data you can guarantee that it will be striped across the maximum possible number of disks, with the minimum number of seeks.

As a database admin I would take guaranteed performance over probable performance any day of the week. Especially if I can be sure that performance will be consistent and will not degrade as the database ages.

One point that I haven't seen raised yet: I believe most databases will have had years of tuning based around the assumption that their data is saved contiguously on disk. They will be optimising their disk access based on that, and this is not something we should ignore.

Yes, until we have data to demonstrate the problem it's just theoretical. However, that may be hard to obtain, and in the meantime I think the theory is sound, and the solution easy enough that it is worth tackling.

I definitely don't think defragmentation is the solution (although that is needed in ZFS for other scenarios). If your database is under enough read strain to need the fix suggested here, your disks definitely do not have the time needed to scan and defrag the entire system.

It would seem to me that Nathan's suggestion right at the start of the thread is the way to go. It guarantees read performance for the database, and would seem to be relatively easy to implement at the zpool level.
Yes, it adds considerable overhead to writes, but that is a decision database administrators can make given the expected load.

If I'm understanding Nathan right, saving a block of data would mean (a rough sketch of this write path follows this message):

- Reading the original block (may be cached if we're lucky)
- Saving that block to a new location
- Saving the new data to the original location

So you've got a 2-3x slowdown in write performance, but you guarantee that read performance will at least match existing filesystems (with ZFS caching, it may exceed it). ZFS then works much better with all the existing optimisations done within the database software, and you still keep all the benefits of ZFS - full data integrity, snapshots, clones, etc...

For many database admins, I think that would be an option they would like to have.

Taking it a stage further, I wonder if this would work well with the prioritized write feature request (caching writes to a solid state disk)?
http://www.genunix.org/wiki/index.php/OpenSolaris_Storage_Developer_Wish_List

That could potentially mean there's very little slowdown:

- Read the original block
- Save that to solid state disk
- Write the new block in the original location
- Periodically stream writes from the solid state disk to the main storage

In theory there's no need for the drive head to move at all between the read and the write, so this should only be fractionally slower than traditional ZFS writes. Yes, the data needs to be flushed from the solid state store from time to time, but those writes can be batched together for improved performance and streamed to contiguous free space on the disk.

That would appear to then give you the best of both worlds.
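To make the I/O cost of that proposed write path concrete, here is a minimal sketch in Python. It is purely illustrative and assumes nothing about ZFS internals: the Disk class, allocate_block() and the block addresses are invented for the example; it just counts the I/Os of a plain COW update versus the preserve-then-overwrite update described above.

# Illustrative model only: NOT ZFS code. It makes the extra I/O cost of
# the "preserve old copy, then overwrite in place" scheme explicit.

class Disk:
    def __init__(self):
        self.blocks = {}       # block address -> data
        self.io_count = 0
        self.next_free = 1000  # pretend addresses >= 1000 are free space

    def read(self, addr):
        self.io_count += 1
        return self.blocks.get(addr)

    def write(self, addr, data):
        self.io_count += 1
        self.blocks[addr] = data

    def allocate_block(self):
        addr = self.next_free
        self.next_free += 1
        return addr

def cow_update(disk, addr, new_data):
    """Plain COW: new data goes to a fresh location (1 write)."""
    new_addr = disk.allocate_block()
    disk.write(new_addr, new_data)
    return new_addr                      # the file now points at new_addr

def overwrite_in_place_update(disk, addr, new_data):
    """Proposed scheme: old data is relocated, new data lands in place
    (1 read + 2 writes), so the file's physical layout never changes."""
    old_data = disk.read(addr)           # free if the block is still cached
    preserved = disk.allocate_block()
    disk.write(preserved, old_data)      # old version kept for snapshots
    disk.write(addr, new_data)           # logical block keeps its address
    return addr

disk = Disk()
disk.write(42, b"original")
disk.io_count = 0
overwrite_in_place_update(disk, 42, b"updated")
print("I/Os for in-place scheme:", disk.io_count)   # 3, versus 1 for plain COW

The extra read disappears whenever the old block is still cached, which is why the slowdown is quoted above as 2-3x rather than a flat 3x.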
...
> - Nathan appears to have suggested a good workaround.
>   Could ZFS be updated to have a 'contiguous' setting
>   where blocks are kept together? This sacrifices
>   write performance for read.

I had originally thought that this would be incompatible with ZFS's snapshot mechanism, but with a minor tweak it may not be.

...
> - Bill seems to understand the issue, and added some
>   useful background (although in an entertaining but
>   rather condescending way).

There is a bit of nearby history that led to that.

...
> One point that I haven't seen raised yet: I believe
> most databases will have had years of tuning based
> around the assumption that their data is saved
> contiguously on disk. They will be optimising their
> disk access based on that and this is not something
> we should ignore.

Ah - nothing like real, experienced user input. I tend to agree with ZFS's general philosophy of attempting to minimize the number of knobs that need tuning, but this can lead to forgetting that higher-level software may have knobs of its own. My original assumption was that databases automatically attempted to leverage on-disk contiguity (which the more evolved ones certainly do when they're controlling the on-disk layout themselves, and one might suspect they try to do even when running on top of files by assuming that the file system is trying to preserve on-disk contiguity), but of course admins play a major role as well (e.g., in determining which indexes need not be created because sequential table scans can get the job done efficiently).

...
> I definitely don't think defragmentation is the
> solution (although that is needed in ZFS for other
> scenarios). If your database is under enough read
> strain to need the fix suggested here, your disks
> definitely do not have the time needed to scan and
> defrag the entire system.

Well, it's only this kind of randomly-updated/sequentially-scanned data that needs much defragmentation in the first place. Data that's written once and then only read at worst needs a single defragmentation pass (if the original writes got interrupted by a lot of other update activity), data that's not read sequentially (e.g., indirect blocks) needn't be defragmented at all, nor need data that's seldom read and/or not very fragmented in the first place.

> It would seem to me that Nathan's suggestion right at
> the start of the thread is the way to go. It
> guarantees read performance for the database, and
> would seem to be relatively easy to implement at the
> zpool level. Yes, it adds considerable overhead to
> writes, but that is a decision database
> administrators can make given the expected load.
>
> If I'm understanding Nathan right, saving a block of
> data would mean:
> - Reading the original block (may be cached if we're lucky)
> - Saving that block to a new location
> - Saving the new data to the original location

1. You'd still need an initial defragmentation pass to ensure that the file was reasonably piece-wise contiguous to begin with.

2. You can't move the old version of the block without updating all its ancestors (since the pointer to it changes). When you update this path to the old version, you need to suppress the normal COW behavior if a snapshot exists, because it would otherwise maintain the old path pointing to the old data location that you're just about to over-write below.
This presumably requires establishing the entire new path and deallocating the entire old path in a single transaction, but this may just be equivalent to a normal data block 'update' (that just doesn't happen to change any data in the block) when no snapshot exists. I don't *think* that there should be any new issues raised with other updates that may be combined in the same 'transaction', even if they may affect some of the same ancestral blocks.

3. You can't just slide in the new version of the block using the old version's existing set of ancestors because a) you just deallocated that path above (introducing additional mechanism to preserve it temporarily almost certainly would not be wise), b) the data block checksum changed, and c) in any event this new path should be *newer* than the path to the old version's new location that you just had to establish (if a snapshot exists, that's the path that should be propagated to it by the COW mechanism). However, this is just the normal situation whenever you update a data block: all the *additional* overhead occurred in the previous steps.

Given that doing the update twice, as described above, only adds to the bandwidth consumed (steps 2 and 3 should be able to be combined in a single transaction), the only additional disk seek would be that required to re-read the original data if it wasn't cached. So you may well be correct that this approach would likely consume fewer resources than background defragmentation would (though, as noted above, you'd still need an initial defrag pass to establish initial contiguity), and while the additional resources would be consumed at normal rather than reduced priority, the file would be kept contiguous all the time rather than just returned to contiguity whenever there was time to do so.

...
> Taking it a stage further, I wonder if this would
> work well with the prioritized write feature request
> (caching writes to a solid state disk)?
> http://www.genunix.org/wiki/index.php/OpenSolaris_Storage_Developer_Wish_List
>
> That could potentially mean there's very little slowdown:
> - Read the original block
> - Save that to solid state disk
> - Write the new block in the original location
> - Periodically stream writes from the solid state disk to the main storage

I don't think this applies (nor would it confer any benefit) if things in fact need to be handled as I described above.

- bill
In that case, this may be a much tougher nut to crack than I thought. I'll be the first to admit that other than having seen a few presentations I don't have a clue about the details of how ZFS works under the hood, however...

You mention that moving the old block means updating all its ancestors. I had naively assumed moving a block would be relatively simple, and would also update all the ancestors.

My understanding of ZFS (in short: an upside-down tree) is that each block is referenced by its parent. So regardless of how many snapshots you take, each block is only ever referenced by one other, and I'm guessing that the pointer and checksum are both stored there.

If that's the case, to move a block it's just a case of:
- read the data
- write to the new location
- update the pointer in the parent block

Please let me know if I'm mis-understanding ZFS here.

The major problem with this is that I don't know if there's any easy way to identify the parent block from the child, or an efficient way to do this move. However, thinking about it, there must be. ZFS intelligently moves data if it detects corruption, so there must already be tools in place to do exactly what we need here. In which case, this is still relatively simple and much of the code already exists.
...
> My understanding of ZFS (in short: an upside-down
> tree) is that each block is referenced by its
> parent. So regardless of how many snapshots you take,
> each block is only ever referenced by one other, and
> I'm guessing that the pointer and checksum are both
> stored there.
>
> If that's the case, to move a block it's just a case of:
> - read the data
> - write to the new location
> - update the pointer in the parent block

Which changes the contents of the parent block (the change in the data checksum changed it as well), and thus requires that this parent also be rewritten (using COW), which changes the pointer to it (and of course its checksum as well) in *its* parent block, which thus also must be re-written... and finally a new copy of the superblock is written to reflect the new underlying tree structure - all this in a single batch-written 'transaction'. The old version of each of these blocks need only be *saved* if a snapshot exists and it hasn't previously been updated since that snapshot was created. But all the blocks need to be COWed even if no snapshot exists (in which case the old versions are simply discarded). (A toy model of this copy-on-write cascade is sketched after this message.)

...
> PS.
>
>> 1. You'd still need an initial defragmentation pass
>> to ensure that the file was reasonably piece-wise
>> contiguous to begin with.
>
> No, not necessarily. If you were using a zpool
> configured like this I'd hope you were planning on
> creating the file as a contiguous block in the first
> place :)

I'm not certain that you could ensure this if other updates in the system were occurring concurrently. Furthermore, the file may be extended dynamically as new data is inserted, and you'd like to have some mechanism that could restore reasonable contiguity to the result (which can be difficult to accomplish in the foreground if, for example, free space doesn't happen to exist on the disk right after the existing portion of the file).

...
> Any zpool with this option would probably be
> dedicated to the database file and nothing else. In
> fact, even with multiple databases I think I'd have a
> single pool per database.

It's nice if you can afford such dedicated resources, but it seems a bit cavalier to ignore users who just want decent performance from a database that has to share its resources with other activity.

Your prompt response is probably what prevented me from editing my previous post after I re-read it and realized I had overlooked the fact that over-writing the old data complicates things. So I'll just post the revised portion here:

3. Now you must make the above transaction persistent, and then randomly over-write the old data block with the new data (since that data must be in place before you update the path to it below, and unfortunately, since its location is not arbitrary, you can't combine this update with either the transaction above or the transaction below).

4. You can't just slide in the new version of the block using the old version's existing set of ancestors because a) you just deallocated that path above (introducing additional mechanism to preserve it temporarily almost certainly would not be wise), b) the data block checksum changed, and c) in any event this new path should be *newer* than the path to the old version's new location that you just had to establish (if a snapshot exists, that's the path that should be propagated to it by the COW mechanism).
However, this is just the normal situation whenever you update a data block (save for the fact that the block itself was already written above): all the *additional* overhead occurred in the previous steps.

So instead of a single full-path update that fragments the file, you have two full-path updates, a random write, and possibly a random read initially to fetch the old data. And you still need an initial defrag pass to establish initial contiguity. Furthermore, these additional resources are consumed at normal rather than the reduced priority at which a background reorg can operate. On the plus side, though, the file would be kept contiguous all the time rather than just returned to contiguity whenever there was time to do so.

...
> Taking it a stage further, I wonder if this would
> work well with the prioritized write feature request
> (caching writes to a solid state disk)?
> http://www.genunix.org/wiki/index.php/OpenSolaris_Storage_Developer_Wish_List
>
> That could potentially mean there's very little slowdown:
> - Read the original block
> - Save that to solid state disk
> - Write the new block in the original location
> - Periodically stream writes from the solid state disk to the main storage

I'm not sure this would confer much benefit if things in fact need to be handled as I described above. In particular, if a snapshot exists you almost certainly must establish the old version in its new location in the snapshot rather than just capture it in the log; if no snapshot exists you could capture the old version in the log and then discard it as soon as the new version becomes persistent, but I'm not sure how easily that (and especially recovering should a crash occur before the new version becomes persistent) could be integrated with the existing COW facilities.

- bill
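Here is a toy model in Python of why a single block change cascades all the way to the root. It is not ZFS's actual on-disk structure (the Block class, put() and the address counter are invented for the example); it only illustrates that because each parent stores its child's address and checksum, copy-on-writing a leaf forces a fresh copy of every ancestor, while the old copies remain available for a snapshot.

# Toy COW tree: changing one leaf rewrites the whole ancestor path.

import hashlib

def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()[:8]

class Block:
    def __init__(self, data=b"", children=None):
        self.data = data
        self.children = children or []   # list of (address, checksum)

    def contents(self) -> bytes:
        return self.data + b"".join(
            f"{a}:{c}".encode() for a, c in self.children)

storage = {}          # address -> Block; never overwritten (COW)
next_addr = [0]

def put(block: Block):
    addr = next_addr[0]; next_addr[0] += 1
    storage[addr] = block
    return addr, checksum(block.contents())

def cow_write_leaf(path_of_parents, new_leaf_data):
    """Write a new leaf plus a new copy of every ancestor; return new root."""
    addr, csum = put(Block(new_leaf_data))
    # Walk back up: each parent gets a fresh copy pointing at its new child.
    for parent, child_index in reversed(path_of_parents):
        kids = list(parent.children)
        kids[child_index] = (addr, csum)
        addr, csum = put(Block(parent.data, kids))
    return addr

# Build leaf -> parent -> root, then update the leaf once.
leaf_addr, leaf_sum = put(Block(b"old data"))
parent = Block(children=[(leaf_addr, leaf_sum)])
p_addr, p_sum = put(parent)
root = Block(children=[(p_addr, p_sum)])
r_addr, _ = put(root)

new_root = cow_write_leaf([(root, 0), (parent, 0)], b"new data")
# Three new blocks (leaf, parent, root) were written; the old three still
# sit untouched in 'storage', which is all a snapshot needs to keep them.
print("old root:", r_addr, "new root:", new_root)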
Hmm... that's a pain if updating the parent also means updating the parent's checksum too. I guess the functionality is there for moving bad blocks, but since that's likely to be a rare occurrence, it wasn't something that would need to be particularly efficient.

With regards to sharing the disk resources with other programs, obviously it's down to the individual admins how they would configure this, but I would suggest that if you have a database with heavy enough requirements to be suffering noticeable read performance issues due to fragmentation, then that database really should have its own dedicated drives and shouldn't be competing with other programs.

I'm not saying defrag is bad (it may be the better solution here), just that if you're looking at performance in this kind of depth, you're probably experienced enough to have created the database in a contiguous chunk in the first place :-)

I do agree that doing these writes now sounds like a lot of work. I'm guessing that needing two full-path updates to achieve this means you're talking about a much greater write penalty. And that means you can probably expect a significant read penalty if you have any significant volume of writes at all, which would rather defeat the point. After all, if you have a low enough amount of writes to not suffer from this penalty, your database isn't going to be particularly fragmented.

However, I'm now in over my depth. This needs somebody who knows the internal architecture of ZFS to decide whether it's feasible or desirable, and whether defrag is a good enough workaround. It may be that ZFS is not a good fit for this kind of use, and that if you're really concerned about this kind of performance you should be looking at other file systems.
...
> With regards to sharing the disk resources with other
> programs, obviously it's down to the individual
> admins how they would configure this,

Only if they have an unconstrained budget.

> but I would
> suggest that if you have a database with heavy enough
> requirements to be suffering noticeable read
> performance issues due to fragmentation, then that
> database really should have its own dedicated drives
> and shouldn't be competing with other programs.

You're not looking at it from a whole-system viewpoint (which, if you're accustomed to having your own dedicated storage devices, is understandable). Even if your database performance is acceptable, if it's performing 50x as many disk seeks as it would otherwise need to when scanning a table, that's affecting the performance of *other* applications.

> Also, I'm not saying defrag is bad (it may be the
> better solution here), just that if you're looking at
> performance in this kind of depth, you're probably
> experienced enough to have created the database in a
> contiguous chunk in the first place :-)

As I noted, ZFS may not allow you to ensure that, and in any event if the database grows that contiguity may need to be reestablished. You could grow the db in separate files, each of which was preallocated in full (though again ZFS may not allow you to ensure that each is created contiguously on disk), but while databases may include such facilities as a matter of course, it would still (all other things being equal) be easier to manage everything if it could just extend a single existing file (or one file per table, if they needed to be kept separate) as it needed additional space.

> I do agree that doing these writes now sounds like a
> lot of work. I'm guessing that needing two full-path
> updates to achieve this means you're talking about a
> much greater write penalty.

Not all that much. Each full-path update is still only a single write request to the disk, since all the path blocks (again, possibly excepting the superblock) are batch-written together, thus mostly increasing only streaming bandwidth consumption.

...
> It may be that ZFS is not a good fit for this kind of
> use, and that if you're really concerned about this
> kind of performance you should be looking at other
> file systems.

I suspect that while it may not be a great fit now, with relatively minor changes it could be at least an acceptable one.

- bill
Louwtjie Burger wrote:> Richard Elling wrote: > > > > > - COW probably makes that conflict worse > > > > > > > > > > This needs to be proven with a reproducible, real-world > workload before it > > makes sense to try to solve it. After all, if we cannot > measure where > > we are, > > how can we prove that we''ve improved? > > I agree, let''s first find a reproducible example where "updates" > negatively impacts large table scans ... one that is rather simple (if > there is one) to reproduce and then work from there.I''d say it would be possible to define a reproducible workload that demonstrates this using the Filebench tool... I haven''t worked with it much (maybe over the holidays I''ll be able to do this), but I think a workload like: 1) create a large file (bigger than main memory) on an empty ZFS pool. 2) time a sequential scan of the file 3) random write i/o over say, 50% of the file (either with or without matching blocksize) 4) time a sequential scan of the file The difference between times 2 and 4 are the "penalty" that COW block reordering (which may introduce seemingly-random seeks between "sequential" blocks) imposes on the system. It would be interesting to watch seeksize.d''s output during this run too. --Joe
>> doing these writes now sounds like a
>> lot of work. I'm guessing that needing two full-path
>> updates to achieve this means you're talking about a
>> much greater write penalty.
>
> Not all that much. Each full-path update is still
> only a single write request to the disk, since all
> the path blocks (again, possibly excepting the
> superblock) are batch-written together, thus mostly
> increasing only streaming bandwidth consumption.

Ok, that took some thinking about. I'm pretty new to ZFS, so I've only just gotten my head around how CoW works, and I'm not used to thinking about files at this kind of level. I'd not considered that path blocks would be batch-written close together, but of course that makes sense.

What I'd been thinking was that ordinarily files would get fragmented as they age, which would make these updates slower as blocks would be scattered over the disk, so a full-path update would take some time. I'd forgotten that the whole point of doing this is to prevent fragmentation...

So a nice side effect of this approach is that if you use it, it makes itself more efficient :D
Rats - I was right the first time: there's a messy problem with snapshots.

The problem is that the parent of the child that you're about to update in place may *already* be in one or more snapshots, because one or more of its *other* children was updated since each snapshot was created. If so, then each snapshot copy of the parent is pointing to the location of the existing copy of the child you now want to update in place, and unless you change the snapshot copy of the parent (as well as the current copy of the parent) the snapshot will point to the *new* copy of the child you are now about to update (with an incorrect checksum to boot). With enough snapshots, enough children, and bad enough luck, you might have to change the parent (and of course all its ancestors...) in every snapshot.

In other words, Nathan's approach is pretty much infeasible in the presence of snapshots. Background defragmentation works as long as you move the entire region (which often has a single common parent) to a new location, which if the source region isn't excessively fragmented may not be all that expensive; it's probably not something you'd want to try at normal priority *during* an update to make Nathan's approach work, though, especially since you'd then wind up moving the entire region on every such update rather than in one batch in the background.

- bill
On Nov 19, 2007 10:08 PM, Richard Elling <Richard.Elling at sun.com> wrote:
> James Cone wrote:
>> Hello All,
>>
>> Here's a possibly-silly proposal from a non-expert.
>>
>> Summarising the problem:
>> - there's a conflict between small ZFS record size, for good random
>>   update performance, and large ZFS record size for good sequential
>>   read performance
>
> Poor sequential read performance has not been quantified.

I think this is a good point. A lot of solutions are being thrown around, and the problems are only theoretical at the moment. Conventional solutions may not even be appropriate for something like ZFS.

The point that makes me skeptical is this: blocks do not need to be logically contiguous to be (nearly) physically contiguous. As long as you reallocate the blocks close to the originals, chances are that a scan of the file will end up being mostly physically contiguous reads anyway (a toy illustration of this follows this message). ZFS's intelligent prefetching along with the disk's track cache should allow for good performance even in this case. ZFS may or may not already do this, I haven't checked. Obviously, you won't want to keep a year's worth of snapshots, or run the pool near capacity. With a few minor tweaks, though, it should work quite well.

Talking about fundamental ZFS design flaws at this point seems unnecessary, to say the least.

Chris
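The following toy Python sketch illustrates that idea; it is not how ZFS's metaslab allocator actually works, and the NearbyAllocator class and addresses are invented for the example. It just shows that a COW reallocation policy preferring free space near a block's previous address keeps a logically sequential file physically local, so a sequential scan stays seek-friendly.

# Toy "allocate near the old address" policy; purely illustrative.

import bisect

class NearbyAllocator:
    def __init__(self, free_blocks):
        self.free = sorted(free_blocks)        # sorted free block addresses

    def alloc_near(self, old_addr):
        """Return the free block closest to old_addr."""
        i = bisect.bisect_left(self.free, old_addr)
        candidates = []
        if i < len(self.free):
            candidates.append(self.free[i])
        if i > 0:
            candidates.append(self.free[i - 1])
        if not candidates:
            raise RuntimeError("no free space")
        best = min(candidates, key=lambda a: abs(a - old_addr))
        self.free.remove(best)
        return best

# A file occupying blocks 0..99; the nearest free space starts at 100.
alloc = NearbyAllocator(range(100, 200))
layout = list(range(100))                      # logical block -> physical addr

for logical in (10, 11, 12, 13):               # COW-update four adjacent blocks
    layout[logical] = alloc.alloc_near(layout[logical])

# The relocated blocks stay clustered together, as close to the original
# run as the free space allows, so the scan needs only one extra seek.
print(layout[8:16])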
But the whole point of snapshots is that they don't take up extra space on the disk. If a file (and hence a block) is in every snapshot, it doesn't mean you've got multiple copies of it. You only have one copy of that block; it's just referenced by many snapshots.

The thing is, the location of that block isn't saved separately in every snapshot either - the location is just stored in its parent. So moving a block is just a case of updating one parent. So regardless of how many snapshots the parent is in, you only have to update one parent to point it at the new location for the *old* data. Then you save the new data to the old location and ensure the current tree points to that.

If you think about it, that has to work for the old data since, as I said before, ZFS already has this functionality. If ZFS detects a bad block, it moves it to a new location on disk. If it can already do that without affecting any of the existing snapshots, there's no reason to think we couldn't use the same code for a different purpose.

Ultimately, your old snapshots get fragmented, but the live data stays contiguous.
> But the whole point of snapshots is that they don't
> take up extra space on the disk. If a file (and
> hence a block) is in every snapshot, it doesn't mean
> you've got multiple copies of it. You only have one
> copy of that block; it's just referenced by many
> snapshots.

I used the wording "copies of a parent" loosely to mean "previous states of the parent that also contain pointers to the current state of the child about to be updated in place".

> The thing is, the location of that block isn't saved
> separately in every snapshot either - the location is
> just stored in its parent.

And in every earlier version of the parent that was updated for some *other* reason and still contains a pointer to the current child that someone using that snapshot must be able to follow correctly.

> So moving a block is
> just a case of updating one parent.

No: every version of the parent that points to the current version of the child must be updated.

...
> If you think about it, that has to work for the old
> data since, as I said before, ZFS already has this
> functionality. If ZFS detects a bad block, it moves
> it to a new location on disk. If it can already do
> that without affecting any of the existing snapshots,
> there's no reason to think we couldn't use the same
> code for a different purpose.

Only if it works the way you think it works, rather than, say, by using a look-aside list of moved blocks (there shouldn't be that many of them), or by just leaving the bad block in the snapshot (if it's mirrored or parity-protected, it'll still be usable there unless a second failure occurs; if not, then it was lost anyway).

- bill
On Nov 20, 2007 5:33 PM, can you guess? <billtodd at metrocast.net> wrote:
>> But the whole point of snapshots is that they don't
>> take up extra space on the disk. If a file (and
>> hence a block) is in every snapshot, it doesn't mean
>> you've got multiple copies of it. You only have one
>> copy of that block; it's just referenced by many
>> snapshots.
>
> I used the wording "copies of a parent" loosely to mean "previous
> states of the parent that also contain pointers to the current state of
> the child about to be updated in place".

But children are never updated in place. When a new block is written to a leaf, new blocks are used for all the ancestors back to the superblock, and then the old ones are either freed or held on to by the snapshot.

> And in every earlier version of the parent that was updated for some
> *other* reason and still contains a pointer to the current child that
> someone using that snapshot must be able to follow correctly.

The snapshot doesn't get the 'current' child - it gets the one that was there when the snapshot was taken.

> No: every version of the parent that points to the current version of
> the child must be updated.

Even with clones, the 'parent' and the 'clone' are allowed to diverge - they contain different data.

Perhaps I'm missing something. Excluding ditto blocks, when in ZFS would two parents point to the same child and need to both be updated when the child is updated?

Will
On Tue, 20 Nov 2007, Ross wrote:
>>> doing these writes now sounds like a
>>> lot of work. I'm guessing that needing two full-path
>>> updates to achieve this means you're talking about a
>>> much greater write penalty.
>>
>> Not all that much. Each full-path update is still
>> only a single write request to the disk, since all
>> the path blocks (again, possibly excepting the
>> superblock) are batch-written together, thus mostly
>> increasing only streaming bandwidth consumption.
>
... reformatted ...
> Ok, that took some thinking about. I'm pretty new to ZFS, so I've
> only just gotten my head around how CoW works, and I'm not used to
> thinking about files at this kind of level. I'd not considered that
> path blocks would be batch-written close together, but of course
> that makes sense.
>
> What I'd been thinking was that ordinarily files would get
> fragmented as they age, which would make these updates slower as
> blocks would be scattered over the disk, so a full-path update would
> take some time. I'd forgotten that the whole point of doing this is
> to prevent fragmentation...
>
> So a nice side effect of this approach is that if you use it, it
> makes itself more efficient :D

Here's a couple of resources that'll help you get up to speed with ZFS internals:

a) From the London OpenSolaris User Group (LOSUG) session, presented by Jarod Nash, TSC Systems Engineer, entitled "ZFS: Under The Hood":

   ZFS-UTH_3_v1.1_LOSUG.pdf
   zfs_data_structures_for_single_file.pdf

   also referred to as "ZFS Internals Lite".

and

b) the ZFS on-disk Specification:

   ondiskformat0822.pdf

Regards,

Al Hopper
Logical Approach Inc, Plano, TX.
...
> just rearrange your blocks sensibly - and to at least some degree
> you could do that while they're still cache-resident

Lots of discussion has passed under the bridge since that observation above, but it may have contained the core of a virtually free solution: let your table become fragmented, but each time that a sequential scan is performed on it, determine whether the region that you're currently scanning is *sufficiently* fragmented that you should retain the sequential blocks that you've just had to access anyway in cache until you've built up around 1 MB of them, and then (in a background thread) flush the result contiguously back to a new location in a single bulk 'update' that changes only their location rather than their contents. (A sketch of this heuristic follows this message.)

1. You don't incur any extra reads, since you were reading sequentially anyway and already have the relevant blocks in cache. Yes, if you had reorganized earlier in the background the current scan would have gone faster, but if scans occur sufficiently frequently for their performance to be a significant issue then the *previous* scan will probably not have left things *all* that fragmented. This is why you choose a fragmentation threshold to trigger reorg rather than just do it whenever there's any fragmentation at all, since the latter would probably not be cost-effective in some circumstances; conversely, if you only perform sequential scans once in a blue moon, every one may be completely fragmented, but it probably wouldn't have been worth defragmenting constantly in the background to avoid this, and the occasional reorg triggered by the rare scan won't constitute enough additional overhead to justify heroic efforts to avoid it.

Such a 'threshold' is a crude but possibly adequate metric; a better but more complex one would perhaps nudge up the threshold value every time a sequential scan took place without an intervening update, such that rarely-updated but frequently-scanned files would eventually approach full contiguity, and an even finer-grained metric would maintain such information about each individual *region* in a file, but absent evidence that the single, crude, unchanging threshold (probably set to defragment moderately aggressively - e.g., whenever it takes more than 3 or 5 disk seeks to inhale a 1 MB region) is inadequate, these sound a bit like over-kill.

2. You don't defragment data that's never sequentially scanned, avoiding unnecessary system activity and snapshot space consumption.

3. You still incur additional snapshot overhead for data that you do decide to defragment, for each block that hadn't already been modified since the most recent snapshot, but performing the local reorg as a batch operation means that only a single copy of all affected ancestor blocks will wind up in the snapshot due to the reorg (rather than potentially multiple copies in multiple snapshots if snapshots were frequent and movement was performed one block at a time).

- bill
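Here is a hedged sketch, in Python, of the scan-triggered reorg heuristic just described. The names, the 1 MB region size and the seek-count threshold are illustrative only, and a real implementation would of course live inside the filesystem rather than in user code; the sketch just shows the decision being made per region during a sequential scan.

# Sketch: queue a 1 MB region for background contiguous rewrite whenever
# reading it during a sequential scan needed more seeks than the threshold.

SEEK_THRESHOLD    = 4                 # reorg if > 4 seeks per 1 MB region
REGION_BYTES      = 1 * 1024 * 1024
BLOCK_BYTES       = 8 * 1024
BLOCKS_PER_REGION = REGION_BYTES // BLOCK_BYTES

def count_seeks(physical_addrs):
    """A 'seek' here is any jump between physically non-adjacent blocks."""
    seeks = 1                          # the first block always costs one
    for prev, cur in zip(physical_addrs, physical_addrs[1:]):
        if cur != prev + 1:
            seeks += 1
    return seeks

def sequential_scan(file_map, reorg_queue):
    """file_map: logical block -> physical address, as seen during the scan.
    Regions worse than the threshold are queued for a background rewrite,
    reusing the copies of the blocks the scan just pulled into cache."""
    for start in range(0, len(file_map), BLOCKS_PER_REGION):
        region = file_map[start:start + BLOCKS_PER_REGION]
        if count_seeks(region) > SEEK_THRESHOLD:
            reorg_queue.append((start, len(region)))   # defer to background

# Example: a file whose third region became fragmented by random updates.
import random
file_map = list(range(256)) + random.sample(range(1000, 2000), 128)
queue = []
sequential_scan(file_map, queue)
print("regions queued for contiguous rewrite:", queue)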
Moore, Joe writes:
> Louwtjie Burger wrote:
>> Richard Elling wrote:
>>>> - COW probably makes that conflict worse
>>>
>>> This needs to be proven with a reproducible, real-world workload
>>> before it makes sense to try to solve it. After all, if we cannot
>>> measure where we are, how can we prove that we've improved?
>>
>> I agree, let's first find a reproducible example where "updates"
>> negatively impact large table scans ... one that is rather simple (if
>> there is one) to reproduce and then work from there.
>
> I'd say it would be possible to define a reproducible workload that
> demonstrates this using the Filebench tool... I haven't worked with it
> much (maybe over the holidays I'll be able to do this), but I think a
> workload like:
>
> 1) create a large file (bigger than main memory) on an empty ZFS pool.
> 2) time a sequential scan of the file
> 3) random write i/o over, say, 50% of the file (either with or without
>    a matching blocksize)
> 4) time a sequential scan of the file
>
> The difference between times 2 and 4 is the "penalty" that COW block
> reordering (which may introduce seemingly-random seeks between
> "sequential" blocks) imposes on the system.

But it's not the only thing. The difference between 2 and 4 is the COW penalty that one can hide under prefetching and many spindles. The other thing is to see what the impact (throughput and response time) of the file scan operation is on the ongoing random write load. Third is the impact on CPU cycles required to do the filescans.

-r

> It would be interesting to watch seeksize.d's output during this run
> too.
>
> --Joe
In order to be reasonably representative of a real-world situation, I'd suggest the following additions:

> 1) create a large file (bigger than main memory) on an empty ZFS pool.

1a. The pool should include entire disks, not small partitions (else seeks will be artificially short).

1b. The file needs to be a *lot* bigger than the cache available to it, else caching effects on the reads will be non-negligible.

1c. Unless the file fills up a large percentage of the pool, the rest of the pool needs to be fairly full (else the seeks that updating the file generates will, again, be artificially short ones).

> 2) time a sequential scan of the file
> 3) random write i/o over, say, 50% of the file (either with or without
>    a matching blocksize)

3a. Unless the file itself fills up a large percentage of the pool, do this while significant other updating activity is also occurring in the pool, so that the local holes in the original file layout created by some of its updates don't get favored for use by subsequent updates to the same file (again, artificially shortening seeks).

- bill
BillTodd wrote:
> In order to be reasonably representative of a real-world
> situation, I'd suggest the following additions:

Your suggestions (make the benchmark big enough so seek times are really noticed) are good. I'm hoping that over the holidays I'll get to play with an extra server... If I'm lucky, I'll have 2x36GB drives (in a 1-2GB memory server) that I can dedicate to their own mirrored zfs pool. I figure a 30GB test file should make the seek times interesting.

There's also a needed:

5) Run the same microbenchmark against a UFS filesystem, to compare the step2/step4 ratio with what a non-COW filesystem offers. In theory, the UFS ratio "should" be 1:1; that is, sequential read performance should not be affected by the intervening random writes. (In the case of my test server, I'll make it an SVM mirror of the same 2 drives.)

--Joe
...
> This needs to be proven with a reproducible, real-world workload before
> it makes sense to try to solve it. After all, if we cannot measure where
> we are, how can we prove that we've improved?

Ah - Tests & Measurements types: you've just gotta love 'em.

Wife: "Darling, is there really supposed to be that much water in the bottom of our boat?"

T&M: "There's almost always a little water in the bottom of a boat, Love."

Wife: "But I think it's getting deeper!"

T&M: "I suppose you *could* be right: I'll just put this mark where the water is now, and then after a few minutes we can see if it really has gotten deeper and, if so, just how much we really may need to worry about it."

Wife: "I think I'll use this bucket to get rid of some of it, just in case."

T&M: "No, don't do that: then we won't be able to see how bad the problem is!"

Wife: "But -"

T&M: "And try not to rock the boat: it changes the level of the water at the mark that I just made."

Wife: "I'm really not a very good swimmer, dear: let's just head for shore."

T&M: "That would be silly if there turns out not to be any problem, wouldn't it?"

(Wife hits T&M over head with bucket, grabs oars, and starts rowing.)

- bill
OK, I'll bite; it's not like I'm getting an answer to my other question.

Bill, please explain why deciding what to do about sequential scan performance in ZFS is urgent?

i.e., why it's urgent rather than important (I agree that if it's bad then it's going to be important eventually).

i.e., why it's too urgent to work out, first, how to measure whether we're succeeding.

Regards,
James.

can you guess? wrote:
<snip>
> Ah - Tests & Measurements types: you've just gotta love 'em.
>
> Wife: "Darling, is there really supposed to be that much water in the
> bottom of our boat?"
<snip>
> Wife: "I'm really not a very good swimmer, dear: let's just head for shore."
>
> T&M: "That would be silly if there turns out not to be any problem,
> wouldn't it?"
>
> (Wife hits T&M over head with bucket, grabs oars, and starts rowing.)
>
> - bill
> OK, I'll bite; it's not like I'm getting an answer to
> my other question.

Did I miss one somewhere?

> Bill, please explain why deciding what to do about
> sequential scan performance in ZFS is urgent?

It's not so much that it's 'urgent' (anyone affected by it simply won't use ZFS) as that it's a no-brainer.

> i.e., why it's urgent rather than important (I agree
> that if it's bad then it's going to be important eventually).

It's bad, and it's important now for anyone who cares whether ZFS is viable for such workloads.

> i.e., why it's too urgent to work out, first, how to
> measure whether we're succeeding.

You don't have to measure the *rate* at which the depth of the water in the boat is rising in order to know that you've got a problem that needs addressing. You don't have to measure *just how bad* sequential performance in a badly-fragmented file is to know that you've got a problem that needs addressing (see both Anton's and Roch's comments if you don't find mine convincing). *After* you've tried to fix things, *then* it makes sense to measure just how close you got to ideal streaming-sequential disk bandwidth in order to see whether you need to work some more.

Right now, the only reason to measure precisely how awful sequential scanning performance can get after severely fragmenting a file by updating it randomly in small chunks is to be able to hand out "Attaboy!"s for how much changing it improved things - even though this by itself *still* won't say anything about whether the result attained offered reasonable performance in comparison with what's attainable (which is what should *really* be the basis for handing out any "Attaboy!"s).

Rather than make a politically-incorrect comment about the Special Olympics here, I'll just ask whether common sense is no longer considered an essential attribute in an engineer: given the nature of the discussions about this and about RAID-Z, I've really got to wonder.

- bill