One question that has come up a number of times when I've been speaking with people (read: evangelizing :) ) about ZFS is about database storage. In conventional use, storage has separated redo logs from table space, on a spindle basis.

I'm not a database expert but I believe the reasons boil down to a combination of:
- Separation for redundancy
- Separation for reduction of bottlenecks (most write ops touch both the logs and the table)
- Separation of usage patterns (logs are mostly sequential writes, tables are random).

The question then comes up about whether in a ZFS world this separation is still needed. It seems to me that each of the above reasons is to some extent ameliorated by ZFS:
- Redundancy is performed at the filesystem level, probably on all disks in the pool.
- Dynamic striping and copy-on-write mean that all write ops can be striped across vdevs and the log writes can go right next to the table writes.
- Copy-on-write also turns almost all writes into sequential writes anyway.

So it seems that the old reasoning may no longer apply. Is my thinking correct here? Have I missed something? Do we have any information to support either the use of a single pool or of separate pools for database usage?

Boyd
Melbourne, Australia
Hi Boyd,

Boyd Adamson wrote:
> One question that has come up a number of times when I've been speaking
> with people (read: evangelizing :) ) about ZFS is about database
> storage. In conventional use storage has separated redo logs from table
> space, on a spindle basis.
> I'm not a database expert but I believe the reasons boil down to a
> combination of:
> - Separation for redundancy

correct

> - Separation for reduction of bottlenecks (most write ops touch both the
> logs and the table)

correct

> - Separation of usage patterns (logs are mostly sequential writes,
> tables are random).

correct

> The question then comes up about whether in a ZFS world this separation
> is still needed.

I don't think it is.

> It seems to me that each of the above reasons is to
> some extent ameliorated by ZFS:
> - Redundancy is performed at the filesystem level, probably on all disks
> in the pool.

more at the pool level iirc, but yes, over all the disks where you have
them mirrored or raid/raidZ-ed

> - Dynamic striping and copy-on-write mean that all write ops can be
> striped across vdevs and the log writes can go right next to the table
> writes

Yes. No need to separate metadata (and archive/rollback logs are just that)

> - Copy-on-write also turns almost all writes into sequential writes anyway.

yup.

> So it seems that the old reasoning may no longer apply. Is my thinking
> correct here? Have I missed something? Do we have any information to
> support either the use of a single pool or of separate pools for
> database usage?

To my way of thinking, you can still separate things out if you're not
comfortable with having everything all together in the one pool. My take
on that though is that it stems from an inability to appreciate just how
different zfs is - a lack of paradigm shifting lets you down.

If I was setting up a db server today and could use ZFS, then I'd be
making sure that the DBAs didn't get a say in how the filesystems were
laid out. I'd ask them what they want to see in a directory structure
and provide that. If they want raw ("don't you know that everything is
faster on raw?!?!") then I'd carve a zvol for them. Anything else would
be carefully delineated - they stick to the rdbms and don't tell me how
to do my job, and vice versa.

cheers,
James C. McPherson
--
Solaris Datapath Engineering
Data Management Group
Sun Microsystems
One word of caution about random writes. From my experience, they are not nearly as fast as sequential writes (like 10 to 20 times slower) unless they are carefully aligned on the same boundary as the file system record size. Otherwise, there is a heavy read penalty that you can easily observe by doing a zpool iostat. So, depending on the workload, it's really a stretch to say random writes can be done at sequential speed.

Chuck
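A rough sketch of the kind of aligned random-write pattern Chuck is describing; the file name, file size, and the 8 KB record size are assumptions for illustration and would have to match the dataset's actual recordsize for the alignment to help:

/* Sketch: random writes aligned to an assumed 8 KB record size.
 * File name, file size, and record size are hypothetical. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define RECSIZE   8192                     /* assumed to match the dataset recordsize */
#define FILESIZE  (1024L * 1024 * 1024)    /* 1 GB test file, already populated */
#define NWRITES   1000

int
main(void)
{
    char buf[RECSIZE];
    long nrecs = FILESIZE / RECSIZE;
    int fd = open("/pool/db/testfile", O_WRONLY);
    int i;

    if (fd < 0) {
        perror("open");
        return (1);
    }
    memset(buf, 'x', sizeof (buf));

    for (i = 0; i < NWRITES; i++) {
        /* Aligning the offset to RECSIZE means each write replaces a
         * whole filesystem record, so no read of the old block is needed. */
        off_t off = (off_t)(lrand48() % nrecs) * RECSIZE;
        if (pwrite(fd, buf, RECSIZE, off) != RECSIZE) {
            perror("pwrite");
            break;
        }
    }
    (void) close(fd);
    return (0);
}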
On 11/05/2006, at 9:17 AM, James C. McPherson wrote:
>> - Redundancy is performed at the filesystem level, probably on all
>> disks in the pool.
>
> more at the pool level iirc, but yes, over all the disks where you
> have them mirrored or raid/raidZ-ed

Yes, of course. I meant at the filesystem level (ZFS as a whole) rather than at the sysadmin/application data layout level.

> To my way of thinking, you can still separate things out if you're not
> comfortable with having everything all together in the one pool. My
> take on that though is that it stems from an inability to appreciate
> just how different zfs is - a lack of paradigm shifting lets you down.
>
> If I was setting up a db server today and could use ZFS, then I'd be
> making sure that the DBAs didn't get a say in how the filesystems
> were laid out. I'd ask them what they want to see in a directory
> structure and provide that. If they want raw ("don't you know that
> everything is faster on raw?!?!") then I'd carve a zvol for them.
> Anything else would be carefully delineated - they stick to the rdbms
> and don't tell me how to do my job, and vice versa.

Old dogma dies hard. What we need is some clear blueprints/best practices docs on this, I think.
On 5/10/06, Boyd Adamson <boyd-adamson at usa.net> wrote:
> What we need is some clear blueprints/best practices docs on this, I
> think.

Most definitely. Key things that people I work with (including me...) would like to see are...

- Some success stories of people running large databases (working set much larger than RAM) on ZFS
- Configuration/tuning best practices
- Description of why I don't need directio, quickio, or ODM.
- Performance comparisons of ZFS vs. SVM/UFS, VxVM/VxFS/ODM, and ASM using standard benchmarks for OLTP and DSS workloads
- Same as above but with real workloads.
- How the ZFS feature set improves the lives of system administrators, DBAs, storage, and backup admins.

For general purpose file systems (especially zones and root) I am very eager to put zfs to work. It has great potential to simplify things like live upgrade and answering the question "what changed?" I don't yet have eagerness to propose that I get a cross-functional team together to perform purely exploratory database load tests on zfs.

Mike

--
Mike Gerdts
http://mgerdts.blogspot.com/
On Wed, 2006-05-10 at 20:42 -0500, Mike Gerdts wrote:
> On 5/10/06, Boyd Adamson <boyd-adamson at usa.net> wrote:
> > What we need is some clear blueprints/best practices docs on this, I
> > think.

In due time... it was only recently that some of the performance enhancements were put back. Note: ideally, this info makes the main doc set. BluePrints and Infodocs can fill in the missing bits because they have traditionally had a much shorter time-to-market. Ultimately, we'd like the info rolled into the main doc set. In other words, go for the main docs first.

> Most definitely. Key things that people I work with (including me...)
> would like to see are...
>
> - Some success stories of people running large databases (working set
>   much larger than RAM) on ZFS
> - Configuration/tuning best practices
> - Description of why I don't need directio, quickio, or ODM.
> - Performance comparisons of ZFS vs. SVM/UFS, VxVM/VxFS/ODM, and ASM
>   using standard benchmarks for OLTP and DSS workloads
> - Same as above but with real workloads.
> - How the ZFS feature set improves the lives of system
>   administrators, DBAs, storage, and backup admins.

I'm working on some RAS stuff... I can't talk for the performance guys, but I know some have been working on it. Don't expect an audited TPC-like benchmark as we tend to not use file systems for those.

-- richard
Roch Bourbonnais - Performance Engineering
2006-May-11 07:18 UTC
[zfs-discuss] ZFS and databases
Gehr, Chuck R writes:
> One word of caution about random writes. From my experience, they are
> not nearly as fast as sequential writes (like 10 to 20 times slower)
> unless they are carefully aligned on the same boundary as the file
> system record size. Otherwise, there is a heavy read penalty that you
> can easily observe by doing a zpool iostat. So, depending on the
> workload, it's really a stretch to say random writes can be done at
> sequential speed.
>
> Chuck

Could we agree on saying that partial writes to blocks that are not in cache are much slower than writes to blocks that are? Given that a sequential pattern can benefit from readahead, those will fall into the fast category most of the time. Performance of random writes will depend on the cached ratio. For DB working sets that greatly exceed system memory, which is common, this falls into the slower case, and that stays true for any filesystem.

Or said otherwise: there is no free lunch.

-r
Roch Bourbonnais - Performance Engineering
2006-May-11 10:28 UTC
[zfs-discuss] ZFS and databases
> - Description of why I don't need directio, quickio, or ODM.

The two main benefits that came out of using directio were reducing memory consumption by avoiding the page cache AND bypassing the UFS single-writer behavior.

ZFS does not have the single-writer lock. As for memory, the UFS code path would do I/O straight from the user buffer to disk, overwriting live data; so we won't do this. ZFS will hold the data in memory for the time it takes to ensure data integrity. ZFS concurrent O_DSYNC writes will gang together in the ZIL (ZFS intent log) and be released after the I/Os are done to the log.

Performance characteristics will be different between the filesystems, and certainly dynamic data from real workloads, as you point out, will be enlightening.

-r
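For illustration, a minimal sketch of the O_DSYNC log-style writes mentioned above, the kind of traffic the ZIL absorbs; the path and record size are hypothetical:

/* Sketch: synchronous log-style writes using O_DSYNC.
 * Path and record size are hypothetical. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
    char rec[512];
    int i;
    /* O_DSYNC: each write returns only after the data is on stable storage. */
    int fd = open("/pool/dblog/redo01.log", O_WRONLY | O_APPEND | O_DSYNC);

    if (fd < 0) {
        perror("open");
        return (1);
    }
    memset(rec, 0, sizeof (rec));
    for (i = 0; i < 100; i++) {
        if (write(fd, rec, sizeof (rec)) != sizeof (rec)) {
            perror("write");
            break;
        }
    }
    (void) close(fd);
    return (0);
}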
Absolutely, I have done hot spot tests using a Poisson random distribution. With that pattern (where there are many cache hits), the writes are 3-10 times faster than sequential speed. My comment was regarding purely random I/O across a large (at least much larger than available memory cache) area. A real workload is likely to have a combination of patterns, i.e. some fairly random, some hot spot, and some sequential.

Chuck
A couple of points/additions with regard to Oracle in particular:

When talking about large database installations, copy-on-write may or may not apply. The files are never completely rewritten, only changed internally via mmap(). When you lay down your database, you will generally allocate the storage for the anticipated capacity required. That will result in sparse files in the actual filesystems. This brings up the question: how does ZFS allocate sparse files, and how does the allocation occur as the sparse files have data added?

Regarding the separation of data files, you *really* want your logs to be in a different place (spindles-wise) than your DB. After all, should you have a catastrophic failure (crash, disk hiccup, etc.), your redo and transaction logs are your recovery system.

With this in mind, I'd envisioned using ZFS as such:

- Allocate a number of database filesystems using a 'db' or standard pool. Generally 1 per CPU, as Oracle will use parallel queries if the tables are spread across multiple filesystems.
- Allocate another pool from another storage system (internal, another disk array, etc.) for the log areas. Name the pool something like 'dblog'. That would guarantee that you don't mix your data types.

I'm interested to see how a ZFS pool using multiple LUNs on a large storage array will behave when using a database. I think the performance spreading across multiple LUNs will result in increased db performance.

On a side note, does anybody know of a way to track hot areas on the disk? One concern I've got with the ZFS pool layout is multiple tables being allocated on the same LUN and both tables being used heavily. The need to move the data around within the pool may be required to address a single LUN bottleneck. Has anybody thought of this situation?

-----
Gregory Shaw, IT Architect
Sun Microsystems Inc.
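As a point of reference for the sparse-file question, a minimal sketch of how a "sparse" tablespace file could be created (set the size, write nothing), as opposed to zero-filling every block; the path and size are hypothetical:

/* Sketch: a sparse tablespace file vs. a zero-filled one.
 * Path and size are hypothetical. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define TABLESPACE_SIZE  (1024L * 1024 * 1024)   /* 1 GB for illustration */

int
main(void)
{
    int fd = open("/pool/db01/users01.dbf", O_CREAT | O_WRONLY, 0600);

    if (fd < 0) {
        perror("open");
        return (1);
    }
    /*
     * Setting the size without writing any data leaves a hole: no blocks
     * are allocated until the database actually writes into the file.
     * Zero-filling (what later posts in this thread say Oracle tablespace
     * creation actually does) would instead write every block once,
     * allocating real storage up front.
     */
    if (ftruncate(fd, (off_t)TABLESPACE_SIZE) != 0)
        perror("ftruncate");
    (void) close(fd);
    return (0);
}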
On Thu, 2006-05-11 at 10:31 -0600, Gregory Shaw wrote:
> A couple of points/additions with regard to Oracle in particular:
>
> When talking about large database installations, copy-on-write may
> or may not apply. The files are never completely rewritten, only
> changed internally via mmap(). When you lay down your database, you
> will generally allocate the storage for the anticipated capacity
> required. That will result in sparse files in the actual filesystems.

This is counter to my Oracle experience, which I'll admit is dated. Oracle will zero-fill the tablespace with 128kByte iops -- it is not sparse. I've got a scar. Has this changed in the past few years?

Now, if compression is enabled, then this represents an interesting challenge...

-- richard
Richard Elling wrote:
> This is counter to my Oracle experience, which I'll admit is dated.
> Oracle will zero-fill the tablespace with 128kByte iops -- it is not
> sparse. I've got a scar. Has this changed in the past few years?

I have the same scar (from a different company). We actually wound up calling it the "tablespace create benchmark". And that was valid as of work done just over a year ago. Multiple parallel tablespace creates are usually a big pain point for filesystem / cache interaction, and also fragmentation once in a while. The latter ZFS should take care of; the former, well, I dunno.

> Now, if compression is enabled, then this represents an interesting
> challenge...

Indeed. One would think the intents of these two are rather divergent... How about filling with /dev/random? ;)

- Pete
On Thu, 2006-05-11 at 10:27 -0700, Richard Elling wrote:
> On Thu, 2006-05-11 at 10:31 -0600, Gregory Shaw wrote:
> > When talking about large database installations, copy-on-write may
> > or may not apply. [...]
>
> This is counter to my Oracle experience, which I'll admit is dated.
> Oracle will zero-fill the tablespace with 128kByte iops -- it is not
> sparse. I've got a scar. Has this changed in the past few years?

I hit [send] too soon. Here is a writeup on how I got scarred, and healed :-) I wrote it up so that hopefully someone else would be spared the agony. I guess Peter didn't read Sun BluePrints :-O
http://www.sun.com/blueprints/0400/ram-vxfs.pdf

-- richard
Richard Elling wrote:
> On Thu, 2006-05-11 at 10:27 -0700, Richard Elling wrote:
>> This is counter to my Oracle experience, which I'll admit is dated.
>> Oracle will zero-fill the tablespace with 128kByte iops -- it is not
>> sparse. I've got a scar. Has this changed in the past few years?
>
> I hit [send] too soon. Here is a writeup on how I got scarred, and
> healed :-) I wrote it up so that hopefully someone else would be
> spared the agony. I guess Peter didn't read Sun BluePrints :-O
> http://www.sun.com/blueprints/0400/ram-vxfs.pdf

Don't blame me, I wasn't at Sun at the time. :) It does, however, sound much like what happened to me, other than the fact that the server and database were a couple orders of magnitude bigger in my case. And we weren't using VxFS so discovered_direct_io wasn't an option. Having a multi-terabyte database lock up is a great way to learn how important performance is to some people... ;)

I just hope ZFS doesn't have this problem with multi-terabyte databases.

- Pete
On 5/11/06, Peter Rival <peter.rival at sun.com> wrote:
> Richard Elling wrote:
> > Oracle will zero-fill the tablespace with 128kByte iops -- it is not
> > sparse. I've got a scar. Has this changed in the past few years?
>
> Multiple parallel tablespace creates are usually a big pain point for
> filesystem / cache interaction, and also fragmentation once in a while.
> The latter ZFS should take care of; the former, well, I dunno.

The purpose of a zero-filled tablespace is to prevent fragmentation by future writes, in the case when multiple tablespaces are being updated/filled on the same disk, correct?

This becomes pointless on ZFS, since it never overwrites the same pre-allocated block, i.e. the tablespace becomes fragmented in that case no matter what.

Also, in order to write a partial update to a new block, ZFS needs the rest of the original block, hence the note by Roch: "partial writes to blocks that are not in cache are much slower than writes to blocks that are." Fortunately I think a DB almost always does aligned full-block I/O, or is that right?

Tao
Regarding directio and quickio, is there a way with ZFS to skip the system buffer cache? I've seen big benefits for using directio when the data files have been segregated from the log files.

Having the system compete with the DB for read-ahead results in double work.

-----
Gregory Shaw, IT Architect
Sun Microsystems Inc.
This thread is useless without data. This thread is useless without data. This thread is useless without data. This thread is useless without data. This thread is useless without data. :-P
On 12/05/2006, at 3:59 AM, Richard Elling wrote:
> On Thu, 2006-05-11 at 10:27 -0700, Richard Elling wrote:
>> On Thu, 2006-05-11 at 10:31 -0600, Gregory Shaw wrote:
>>> When talking about large database installations, copy-on-write may
>>> or may not apply. The files are never completely rewritten, only
>>> changed internally via mmap(). When you lay down your database, you
>>> will generally allocate the storage for the anticipated capacity
>>> required. That will result in sparse files in the actual
>>> filesystems.

Sorry, I didn't receive Greg or Richard's original emails. Apologies for messing up threading (such as it is).

Are you saying that copy-on-write doesn't apply for mmap changes, but only file re-writes? I don't think that gels with anything else I know about ZFS.

Boyd
Melbourne, Australia
> Are you saying that copy-on-write doesn't apply for mmap changes, but
> only file re-writes? I don't think that gels with anything else I
> know about ZFS.

No, you're correct -- everything is copy-on-write.

Jeff
Roch Bourbonnais - Performance Engineering
2006-May-12 06:53 UTC
[zfs-discuss] ZFS and databases
Gregory Shaw <greg.shaw at Sun.COM> wrote:
> Regarding directio and quickio, is there a way with ZFS to skip the
> system buffer cache? I've seen big benefits for using directio when
> the data files have been segregated from the log files.

Were the benefits coming from extra concurrency (no single writer lock), or avoiding the extra copy to the page cache, or from too much readahead that is not used before pages need to be recycled?

ZFS already has the concurrency.

The page cache copy is really rather cheap, and I assert somewhat necessary to ensure data integrity.

The extra readahead is somewhat of a bug in UFS (read 2 pages, get a maxcontig chunk (1MB)).

ZFS is new; conventional wisdom may or may not apply.

-r
Roch Bourbonnais - Performance Engineering
2006-May-12 07:24 UTC
[zfs-discuss] ZFS and databases
Jeff Bonwick writes:
> > Are you saying that copy-on-write doesn't apply for mmap changes, but
> > only file re-writes? I don't think that gels with anything else I
> > know about ZFS.
>
> No, you're correct -- everything is copy-on-write.

Maybe the confusion comes from:

  mmap changes : application interaction with memory
  COW          : memory (ZFS) interaction with storage

There is an fsync() or the fsflush daemon in between.

-r
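A small sketch of that distinction, assuming a hypothetical datafile path (and a file at least a few KB long): the application only touches mapped memory, and it is the later msync()/fsync() that hands dirty pages to ZFS, which then writes them copy-on-write:

/* Sketch: mmap changes vs. the later flush that ZFS writes copy-on-write.
 * The path is hypothetical. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int
main(void)
{
    struct stat st;
    char *p;
    int fd = open("/pool/db01/users01.dbf", O_RDWR);

    if (fd < 0 || fstat(fd, &st) != 0) {
        perror("open/fstat");
        return (1);
    }
    p = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) {
        perror("mmap");
        return (1);
    }

    memcpy(p + 4096, "updated", 7);   /* application changes memory only */

    /* Force the dirty page out; ZFS allocates a new block for it rather
     * than overwriting the old one (copy-on-write). */
    if (msync(p, st.st_size, MS_SYNC) != 0)
        perror("msync");

    (void) munmap(p, st.st_size);
    (void) close(fd);
    return (0);
}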
Roch Bourbonnais - Performance Engineering
2006-May-12 07:30 UTC
[zfs-discuss] ZFS and databases
Tao Chen writes:
> The purpose of a zero-filled tablespace is to prevent fragmentation by
> future writes, in the case when multiple tablespaces are being
> updated/filled on the same disk, correct?

That, and also there was a need for block reservation. Thus posix_fallocate was added (recently).

> This becomes pointless on ZFS, since it never overwrites the same
> pre-allocated block, i.e. the tablespace becomes fragmented in that
> case no matter what.

Is "fragmented" the right word here? Anyway: random writes can be turned into sequential.

> Also, in order to write a partial update to a new block, ZFS needs the
> rest of the original block, hence the note by Roch:
> "partial writes to blocks that are not in cache are much slower than
> writes to blocks that are."
> Fortunately I think a DB almost always does aligned full-block I/O, or
> is that right?

That's my understanding also.

-r
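A minimal sketch of the posix_fallocate() reservation mentioned above; the path and size are hypothetical:

/* Sketch: reserve space for a datafile without writing zeros by hand.
 * Path and size are hypothetical. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
    int err;
    int fd = open("/pool/db01/undo01.dbf", O_CREAT | O_RDWR, 0600);

    if (fd < 0) {
        perror("open");
        return (1);
    }
    /* Reserve 1 GB; posix_fallocate() returns an error number, not -1/errno. */
    err = posix_fallocate(fd, 0, 1024L * 1024 * 1024);
    if (err != 0)
        fprintf(stderr, "posix_fallocate: %s\n", strerror(err));
    (void) close(fd);
    return (0);
}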
On 5/12/06, Roch Bourbonnais - Performance Engineering <Roch.Bourbonnais at sun.com> wrote:
> > Regarding directio and quickio, is there a way with ZFS to skip the
> > system buffer cache? I've seen big benefits for using directio when
> > the data files have been segregated from the log files.
>
> Were the benefits coming from extra concurrency (no single writer lock)

Does DIO bypass the "writer lock" on Solaris? Not on AIX, which uses CIO (concurrent I/O) to bypass managing locks at the filesystem level:
http://oracle.ittoolbox.com/white-papers/improving-database-performance-with-aix-concurrent-io-2582

> or avoiding the extra copy to page cache

Certainly. Also to avoid VM overhead (DB does like raw devices).

> or from too much readahead that is not used before pages need to be recycled.

Not sure what you mean (avoid unnecessary readahead?)

> ZFS already has the concurrency.

Interesting, would like to find more on this.

> The page cache copy is really rather cheap

VM as a whole is certainly not cheap.

> and I assert somewhat necessary to ensure data integrity.

Not following you.

> The extra readahead is somewhat of a bug in UFS (read 2
> pages, get a maxcontig chunk (1MB)).

Ouch.

> ZFS is new; conventional wisdom may or may not apply.

This (zfs-discuss) is the place where we can be enlightened :-)

Tao
Roch Bourbonnais - Performance Engineering
2006-May-12 09:20 UTC
[zfs-discuss] ZFS and databases
Tao Chen writes:
> Does DIO bypass the "writer lock" on Solaris?

Yep.

> Not on AIX, which uses CIO (concurrent I/O) to bypass managing locks
> at the filesystem level:
> http://oracle.ittoolbox.com/white-papers/improving-database-performance-with-aix-concurrent-io-2582
>
> > or avoiding the extra copy to page cache
>
> Certainly. Also to avoid VM overhead (DB does like raw devices).

OK, but again, is it to avoid badly configured readahead, or get extra concurrency, or something else? I have a hard time believing that managing the page cache represents a cost when you compare this to a 5ms I/O.

> > or from too much readahead that is not used before pages need to be recycled.
>
> Not sure what you mean (avoid unnecessary readahead?)

There is this thing where a 2K read over UFS, if it crosses a page boundary, can lead UFS to assert sequential access to the file and do a clustered readahead. Since clusters are often set to 1MB, you can get a lot of spurious I/O from this 2K read. If the data readahead turns out never to be used later because of memory pressure, then you have a suboptimal configuration. This is one kind of issue that DIO would not have.

> > ZFS already has the concurrency.
>
> Interesting, would like to find more on this.

I'll have to blog this down one day.

> > The page cache copy is really rather cheap
>
> VM as a whole is certainly not cheap.

In some aspects, certainly. Compared to I/O I'd say it's really cheap, minus bugs. My point is to be cautious of this syllogism: DIO, a VM bypass mechanism, can be much faster than regular I/O; thus the VM is costly. DIO is a VM bypass _AND_ a different UFS codepath.

> > and I assert somewhat necessary to ensure data integrity.
>
> Not following you.

I'm on thin ground here. But I believe that you can't directly write a disk block and its checksum in the referring block in the way ZFS wants to do; or at least you couldn't do this and hold up the application in a way that is acceptable performance-wise. So to ensure the data integrity that ZFS delivers, it has to have the data cached for the time it takes to update the on-disk format properly. I'm willing to be corrected on this (and anything else for that matter, we live in a complex world).

> > The extra readahead is somewhat of a bug in UFS (read 2
> > pages, get a maxcontig chunk (1MB)).
>
> Ouch.

You said it. But people have learned to tune it down when this hits (tunefs -a), which is not that often.
-r
Roch Bourbonnais - Performance Engineering wrote On 05/12/06 09:30:
> Tao Chen writes:
> > This becomes pointless on ZFS, since it never overwrites the same
> > pre-allocated block, i.e. the tablespace becomes fragmented in that
> > case no matter what.
>
> Is "fragmented" the right word here?
> Anyway: random writes can be turned into sequential.

This really optimizes the data on disk for full table scans (sequential reads of the whole tables or portions of them). Random access may be supported from a DBMS perspective using indexes - then read access patterns are random anyway.

In contrast to usual files, DBMS files exhibit different access patterns: they are often loaded sequentially (resulting in a nice layout for later sequential reads), yet updates to the tables will erode this sequential layout over time (e.g. updating accounts as they are credited and debited online), so later full table scans will suffer.

ZFS optimizes random writes versus potential sequential reads. This may hurt if there are many full table scans - but DBMS designers try to avoid unnecessary full table scans anyway. On the other hand, there are use cases in which tables are updated randomly and read sequentially many times (e.g. in batch runs) - here overall performance may suffer.

A way to resolve this could be an option to transparently "reorganize" a file (a tool triggering an internal rewrite - such that the data become sequential on disk again - on-the-fly while the DBMS is running/file is open).

- Franz
Gregory Shaw wrote On 05/11/06 21:15:
> Regarding directio and quickio, is there a way with ZFS to skip the
> system buffer cache? I've seen big benefits for using directio when
> the data files have been segregated from the log files.
>
> Having the system compete with the DB for read-ahead results in double
> work.

Getting rid of the extra copy of data in a filesystem buffer also reduces the memory footprint. As the data are being cached in the DBMS buffer, the extra copy in the filesystem cache is useless (except occasionally as a workaround for deficiencies in the DBMS' caching mechanisms). This is also important if there is non-DBMS related filesystem activity which can benefit from having data in the filesystem buffer, as these compete with the (useless) DBMS data.

Is there a good short description about how the ZFS buffer cache works?

Thanks,
Franz
Roch Bourbonnais - Performance Engineering
2006-May-12 12:49 UTC
[zfs-discuss] ZFS and databases
"ZFS optimizes random writes versus potential sequential reads."

Now I don't think the current readahead code is where we want it to be yet but, in the same way that enough concurrent 128K I/O can saturate a disk (I sure hope that Milkowski's data will confirm this, otherwise I'm dead), enough concurrent read I/O will do the same. So it's a simple matter of programming to detect sequential file access and issue enough I/Os early enough.

With UFS, we had a simple algorithm and one tunable: touch 2 sequential pages, read a cluster ahead. Then, don't do any other I/O until all the data is processed. This is flawed in many respects. And it certainly requires a large cluster size to get good I/O throughput because it had stop-and-go behavior.

With ZFS (again, prefetch code being looked upon), I think we can manage to get good I/O throughput using 128K, through enough concurrency and intelligent coding.

-r
Roch Bourbonnais - Performance Engineering wrote:
> OK, but again, is it to avoid badly configured readahead, or get extra
> concurrency, or something else? I have a hard time believing that
> managing the page cache represents a cost when you compare this to a
> 5ms I/O.

I think you're missing one other thing - handling the memory overload of having orders of magnitude more accessed data than you have memory. Think about how you can handle having a couple hundred GB of dirty data being written by many threads (say either tablespace creates or temp table creation for a large table join) - fsflush and writebehind et al. just can't keep up with it. Of course, I know ZFS is "better", but to be useable in those situations it needs to be probably an order of magnitude better or so, and I haven't seen any data on a decently big rig with a proper storage config that shows that it is. I'm not saying it's not, I'm just saying I haven't seen the data. :)

Like you said, Roch, I've been down this road before and don't want to go down it again. ;)

- Pete
Roch Bourbonnais - Performance Engineering
2006-May-12 12:56 UTC
[zfs-discuss] ZFS and databases
Franz Haberhauer writes:
> Is there a good short description about how the ZFS buffer cache
> works?

You could start with the ARC paper, Megiddo/Modha, FAST '03 conference. ZFS uses a variation of that. It's an interesting read.

-r
> "ZFS optimizes random writes versus potential sequential reads."

This remark focused on the allocation policy during writes, not the readahead that occurs during reads. Data that are rewritten randomly but in place in a sequential, contiguous file (like a preallocated UFS file) are not optimized for these writes, but for later sequential read accesses.

Now with ZFS the writes are fast, but the later sequential reads probably not - readahead may help with this wrt. latency (data may already be available in the file buffer when the DBMS requests them - yet the DBMS does readaheads as well). But it will still be random I/O to the disk (higher utilization compared to a sequential pattern). This is not an issue for a single user, but could be one if there are many.

- Franz
Roch Bourbonnais - Performance Engineering
2006-May-12 13:19 UTC
[zfs-discuss] ZFS and databases
Peter Rival writes:
> I think you're missing one other thing - handling the memory overload of
> having orders of magnitude more accessed data than you have memory.
> Think about how you can handle having a couple hundred GB of dirty data
> being written by many threads (say either tablespace creates or temp
> table creation for a large table join) - fsflush and writebehind et al.
> just can't keep up with it.

When you dirty data enough, ZFS will start to throttle those writers; a bit like ufs_HW but at the system level. So most data in the ARC cache should be evictable on demand. There are issues in the current state of the code that make the amount of dirty data greater than we'd like, but it's limited by design.

> Of course, I know ZFS is "better" but to be useable in those situations
> it needs to be probably an order of magnitude better or so, and I
> haven't seen any data on a decently big rig with a proper storage
> config that shows that it is. I'm not saying it's not, I'm just saying
> I haven't seen the data. :)
> Like you said, Roch, I've been down this road before and don't want to
> go down it again. ;)

Yes, performance wise, ZFS is already fast on lots of tests _and_ a big moving target. That's another great thing about it. But keep those scenarios coming; it's always interesting to make sure they're covered.

-r
Roch Bourbonnais - Performance Engineering
2006-May-12 13:36 UTC
[zfs-discuss] ZFS and databases
Franz Haberhauer writes:
> > "ZFS optimizes random writes versus potential sequential reads."
>
> [...]
> Now with ZFS the writes are fast, but the later sequential reads
> probably not - readahead may help with this wrt. latency (data may
> already be available in the file buffer when the DBMS requests them -
> yet the DBMS does readaheads as well). But it will still be random I/O
> to the disk (higher utilization compared to a sequential pattern).
> This is not an issue for a single user, but could be one if there are
> many.

Last summer, a little experiment took me by surprise. We had a tight loop issuing single synchronous I/Os to raw. Results were:

  size: 2048, count 1000, secs 3.96 : random (same cyl ?)
  size: 2048, count 1000, secs 6.02 : sequential
  size: 2048, count 1000, secs 6.34 : random (random cyl ?)

So it looks like for a 2K write we have, in order:

  write to same cylinder, random offset (fastest)
  write to same cylinder, sequential offset (slower)
  write to random cylinder (slowest)

So it kind of makes sense; if I issue a write just after one completes, then it will take a full rotational latency for it to get going. If it's random on the same cylinder it will be more like half that. Sequential is good _if_ you can keep a pipe of I/Os hitting in stride. But with a pipe of enough concurrent I/Os we can be close to that kind of performance; or at least this has not been proven wrong yet.

-r
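A sketch of the sort of micro-test behind those numbers (not the original code): 1000 synchronous 2 KB writes, sequential or at random offsets; the raw device path and region size are hypothetical:

/* Sketch: time 1000 synchronous 2 KB writes, sequential vs. random.
 * Device path and region size are hypothetical. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#include <unistd.h>

#define IOSIZE  2048
#define COUNT   1000
#define SPAN    (32L * 1024 * 1024)   /* region of the device to hit */

int
main(int argc, char **argv)
{
    int random_mode = (argc > 1 && strcmp(argv[1], "random") == 0);
    char buf[IOSIZE];
    struct timeval t0, t1;
    int fd = open("/dev/rdsk/c1t1d0s0", O_WRONLY | O_DSYNC);
    int i;

    if (fd < 0) {
        perror("open");
        return (1);
    }
    memset(buf, 0, sizeof (buf));
    gettimeofday(&t0, NULL);
    for (i = 0; i < COUNT; i++) {
        /* O_DSYNC makes each write wait for the media before returning. */
        off_t off = random_mode ?
            (off_t)(lrand48() % (SPAN / IOSIZE)) * IOSIZE :
            (off_t)i * IOSIZE;
        if (pwrite(fd, buf, IOSIZE, off) != IOSIZE) {
            perror("pwrite");
            break;
        }
    }
    gettimeofday(&t1, NULL);
    printf("%s: %.2f secs\n", random_mode ? "random" : "sequential",
        (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6);
    (void) close(fd);
    return (0);
}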
> Were the benefits coming from extra concurrency (no
> single writer lock) or avoiding the extra copy to page cache or
> from too much readahead that is not used before pages need to
> be recycled.

With QFS, a major benefit we see for databases and direct I/O is an effective doubling of the memory available to the database for caching. Without direct I/O, every page read winds up in the file system cache and the database cache. For large databases, this is the difference between retaining key indexes in memory, or not. (The block copy into user space is also not cheap.)

Anton
Roch Bourbonnais - Performance Engineering
2006-May-12 15:23 UTC
[zfs-discuss] Re: ZFS and databases
Anton B. Rang writes:
> With QFS, a major benefit we see for databases and direct I/O is an
> effective doubling of the memory available to the database for
> caching. Without direct I/O, every page read winds up in the file
> system cache and the database cache. For large databases, this is the
> difference between retaining key indexes in memory, or not.

For read it is an interesting concept. Since

  reading into cache,
  then copying into user space,
  then keeping data around but never using it

is not optimal. So there are 2 issues: there is the cost of the copy and there is the memory.

Now could we detect the pattern that makes holding on to the cached block not optimal, and do a quick freebehind after the copyout? Something like random access + very large file + poor cache hit ratio?

Now about avoiding the copy; that would mean DMA straight into user space? But if the checksum does not validate the data, what do we do? If storage is not raid-protected and we have to return EIO, I don't think we can do this _and_ corrupt the user buffer also; not sure what POSIX says for this situation.

Now latency wise, the cost of the copy is small compared to the I/O, right? So it now turns into an issue of saving some CPU cycles.

-r
> Now could we detect the pattern that makes holding on to the
> cached block not optimal, and do a quick freebehind after the
> copyout? Something like random access + very large file +
> poor cache hit ratio?

We might detect it ... or we could let the application give us the hint, via the directio ioctl, which for ZFS might mean not "bypass the cache" but "free cache as soon as possible." (The problem with detecting this situation is that we don't know future access patterns, and we don't know whether the application is doing its own caching, in which case any that we do isn't particularly useful ... unless there are subblock writes in the future, in which case our cache can be used to avoid the read-modify-write.)

> Now about avoiding the copy; that would mean DMA straight
> into user space? But if the checksum does not validate the
> data, what do we do? If storage is not raid-protected and we
> have to return EIO, I don't think we can do this _and_
> corrupt the user buffer also; not sure what POSIX says for
> this situation.

Well, direct I/O behaves that way today. Actually, paged I/O does as well -- we move one page at a time into user space, so if we encounter an error while reading a later portion of the request, the earlier portion of the user buffer will already have been overwritten. SUSv3 doesn't specify anything about buffer contents in the event of an error. (It even leaves the file offset undefined.) So I think we're safe here.

> Now latency wise, the cost of the copy is small compared to the
> I/O, right? So it now turns into an issue of saving some
> CPU cycles.

CPU cycles and memory bandwidth (which both can be in short supply on a database server).

Anton
I thought the benefits were from skipping the read-ahead logic. What was seen prior to the implementation of directio was this:

- System running a high(er) load. It was difficult to see why the load was higher, as oracle was the primary process(es).

After the implementation, the load on the server dropped from a load of ~6 (on a 6-way box) to a load of 1.5 to 2. The system 'felt' faster, as well.

It should be noted that traditional filesystem access (creating tables, etc.) dropped in performance by a factor of about 10. Again, I attributed this to the lack of a read-ahead capability. Luckily, creating new database tables is a relatively infrequent event.

Does anybody else have any other points on this?

-----
Gregory Shaw, IT Architect
Sun Microsystems Inc.
On Fri, May 12, 2006 at 05:23:53PM +0200, Roch Bourbonnais - Performance Engineering wrote:
> Now could we detect the pattern that makes holding on to the
> cached block not optimal, and do a quick freebehind after the
> copyout? Something like random access + very large file + poor cache hit
> ratio?

An interface to request no caching on a per-file basis would be good (madvise(2) should do for mmap'ed files; an fcntl(2) or open(2) flag would be better).

> Now about avoiding the copy; that would mean DMA straight
> into user space? But if the checksum does not validate the
> data, what do we do?

Who cares? You DMA into user space, check the checksum and if there's a problem return an error; so there's [corrupted] data in the user space buffer... but the app knows it, so what's the problem (see below)?

> If storage is not raid-protected and we
> have to return EIO, I don't think we can do this _and_
> corrupt the user buffer also; not sure what POSIX says for
> this situation.

If POSIX compliance is an issue just add new interfaces (possibly as simple as an open(2) flag).

> Now latency wise, the cost of the copy is small compared to the
> I/O, right? So it now turns into an issue of saving some
> CPU cycles.

Can you build a system where the cost of the copy adds significantly to the latency numbers? (Think RAM disks.)

Nico
--
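A sketch of the madvise(2) route for mmap'ed files, with a hypothetical path; whether ZFS would honour these hints is exactly the open question here:

/* Sketch: per-mapping cache hints via madvise(). Path is hypothetical. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int
main(void)
{
    struct stat st;
    char *p;
    long sum = 0;
    size_t i;
    int fd = open("/pool/db01/big_table.dbf", O_RDONLY);

    if (fd < 0 || fstat(fd, &st) != 0) {
        perror("open/fstat");
        return (1);
    }
    p = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) {
        perror("mmap");
        return (1);
    }
    /* Hint that the access pattern is random, so no aggressive readahead. */
    (void) madvise(p, st.st_size, MADV_RANDOM);

    for (i = 0; i < (size_t)st.st_size; i += 8192)
        sum += p[i];                       /* touch each block once */

    /* Done with this range: tell the VM the pages need not be kept. */
    (void) madvise(p, st.st_size, MADV_DONTNEED);
    printf("touched value: %ld\n", sum);
    (void) munmap(p, st.st_size);
    (void) close(fd);
    return (0);
}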
Roch Bourbonnais - Performance Engineering
2006-May-12 16:33 UTC
[zfs-discuss] Re: ZFS and databases
Nicolas Williams writes:
> An interface to request no caching on a per-file basis would be good
> (madvise(2) should do for mmap'ed files; an fcntl(2) or open(2) flag
> would be better).
> [...]
> If POSIX compliance is an issue just add new interfaces (possibly as
> simple as an open(2) flag).

Finally I can agree with somebody today.

Directio is non-POSIX anyway, and given that people have been trained to inform the system that the cache won't be useful, and that it's a hard problem to detect automatically, let's avoid the copy and save memory all at once for the read path.

We could use the directio() call for that ...

-r
On Fri, May 12, 2006 at 06:33:00PM +0200, Roch Bourbonnais - Performance Engineering wrote:
> Directio is non-posix anyway, and given that people have been trained
> to inform the system that the cache won't be useful, and that it's a
> hard problem to detect automatically, let's avoid the copy and save
> memory all at once for the read path.
>
> We could use the directio() call for that ...

I had no idea about directio(3C)! We might want an interface for the app to know what the natural block size of the file is, so it can read at proper file offsets. Of course, if that block size is smaller than the ZFS filesystem record size then ZFS may yet grow it. How to deal with this? (One option: don't grow it as long as an app has turned direct I/O on for a fildes and the fildes remains open.)

Nico --
On Fri, 2006-05-12 at 10:42 -0500, Anton Rang wrote:> > Now latency wise, the cost of copy is small compared to the > > I/O; right ? So it now turns into an issue of saving some > > CPU cycles. > > CPU cycles and memory bandwidth (which both can be in short > supply on a database server).We can throw hardware at that :-) Imagine a machine with lots of extra CPU cycles and lots of parallel access to multiple memory banks. This is the strategy behind CMT. In the future, you will have many more CPU cycles and even better memory bandwidth than you do now, perhaps by an order of magnitude in the next few years. -- richard
On Fri, May 12, 2006 at 09:59:56AM -0700, Richard Elling wrote:> On Fri, 2006-05-12 at 10:42 -0500, Anton Rang wrote: > > > Now latency wise, the cost of copy is small compared to the > > > I/O; right ? So it now turns into an issue of saving some > > > CPU cycles. > > > > CPU cycles and memory bandwidth (which both can be in short > > supply on a database server). > > We can throw hardware at that :-) Imagine a machine with lots > of extra CPU cycles and lots of parallel access to multiple > memory banks. This is the strategy behind CMT. In the future, > you will have many more CPU cycles and even better memory > bandwidth than you do now, perhaps by an order of magnitude > in the next few years.Well, yes, of course, but I think the arguments for direct I/O are excellent. Another thing that I see an argument for is limiting the size of various caches, to avoid paging (even having no swap isn''t enough as you don''t want memory pressure evicting hot text pages). Nico --
> We might want an interface for the app to know what the natural block > size of the file is, so it can read at proper file offsets.Seems that stat(2) could be used for this ... long st_blksize; /* Preferred I/O block size */ This isn''t particularly useful for databases if they already have a fixed page size, though. There isn''t a comparable way for the application to indicate a record size to the file system, and there probably ought to be, particularly since the size of writes used to initially create a file (if any) may be quite different than the size of writes used for updating the file. (In the long term, it might be interesting to study dynamically splitting blocks which are written using small record sizes.) -- Anton
On May 12, 2006, at 11:59 AM, Richard Elling wrote:>> CPU cycles and memory bandwidth (which both can be in short >> supply on a database server). > > We can throw hardware at that :-) Imagine a machine with lots > of extra CPU cycles [ ... ]Yes, I''ve heard this story before, and I won''t believe it this time. ;-) Seriously, I believe a database can perform very well on a CMT system, but there won''t be any "extra" CPU cycles or memory bandwidth, because the demand for transaction rates will always exceed what we can supply. Anton
I really like the below idea: - the ability to defragment a file ''live''. I can see instances where that could be very useful. For instance, if you have multiple LUNs (or spindles, whatever) using ZFS, you could re-optimize large files to spread the chunks across as many spindles as possible. Or, as stated below, make it contiguous. I don''t know if that is automatic with ZFS today, but it''s an idea. On May 12, 2006, at 6:18 AM, Franz Haberhauer wrote:> > > Roch Bourbonnais - Performance Engineering wrote On 05/12/06 09:30,: >> Tao Chen writes: >> > On 5/11/06, Peter Rival <peter.rival at sun.com> wrote: >> > > Richard Elling wrote: >> > > > Oracle will zero-fill the tablespace with 128kByte iops -- >> it is not >> > > > sparse. I''ve got a scar. Has this changed in the past few >> years? >> > > >> > > Multiple parallel tablespace creates is usually a big pain >> point for filesystem / cache interaction, and also fragmentation >> once in a while. The latter ZFS should take care of; the former, >> well, I dunno. >> > > >> > > The purpose of zero-filled tablespace is to prevent >> fragmentation by >> > future writes, in the case when multiple tablespaces are being >> > updated/filled on the same disk, correct? >> That and also there was a need for block reservation. Thus >> posix_fallocate was added (recently). > > > > > This becomes pointless on ZFS, since it never overwrites the same > > > pre-allocated block, i.e. the tablespace becomes fragmented in > that > > > case no matter what. > > > > is fragmented the right word here ? > > Anyway: random writes can be turned into sequential. > > This really optimizes the data on disk for full table scans > (sequential reads of the whole tables or portions of them). > Random access may be supported from a DMBS perspective using > indexes - then read access patterns are random anyway. > In contrast to usual files DBMS files exhibit different access > patterns: > They are often loaded sequentially (resulting in a nice layout for > later sequential reads) yet updates to the tables will erode this > sequential layout over time (e.g. updating accounts as they are > credited > and debited online), so later full table scans will suffer. > ZFS optimizes random writes versus potential sequential reads. > this may hurt if there are many full table scans - but DBMS > designers try to avoid unnecessary full table scans anyway. > On the other hand there are use cases in which tables are updated > randomly and read sequentially many times (e.g. in batch runs) - > here overall performance may suffer. > An way to resolve this could be an option to transparently > "reorganize" > a file (a tool triggering an internal rewrite - such that the > data become sequential on disk again - on-the-fly while the > DBMS is running/file is open). > > - Franz > _______________________________________________ > zfs-discuss mailing list > zfs-discuss at opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss----- Gregory Shaw, IT Architect Phone: (303) 673-8273 Fax: (303) 673-8273 ITCTO Group, Sun Microsystems Inc. 1 StorageTek Drive ULVL4-382 greg.shaw at sun.com (work) Louisville, CO 80028-4382 shaw at fmsoft.com (home) "When Microsoft writes an application for Linux, I''ve Won." - Linus Torvalds
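For what it's worth, the "reorganize on request" idea can be crudely approximated from user space today by rewriting the file sequentially and renaming the copy into place. This is only a sketch - it is not safe while the file is open for writing, and it loses holes, ownership and extra links - but it shows the shape of what a built-in, online version would do better:

    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define CHUNK   (1024 * 1024)           /* copy in 1 MB pieces */

    int
    main(int argc, char **argv)
    {
            char tmp[1024];
            char *buf;
            int in, out;
            ssize_t n;

            if (argc != 2) {
                    (void) fprintf(stderr, "usage: %s file\n", argv[0]);
                    return (1);
            }
            (void) snprintf(tmp, sizeof (tmp), "%s.reorg", argv[1]);
            if ((buf = malloc(CHUNK)) == NULL)
                    return (1);
            if ((in = open(argv[1], O_RDONLY)) < 0 ||
                (out = open(tmp, O_WRONLY | O_CREAT | O_EXCL, 0600)) < 0) {
                    perror("open");
                    return (1);
            }

            /* Sequential copy: the allocator lays the new file out afresh. */
            while ((n = read(in, buf, CHUNK)) > 0) {
                    if (write(out, buf, n) != n) {
                            perror("write");
                            return (1);
                    }
            }
            if (n < 0 || fsync(out) != 0 || rename(tmp, argv[1]) != 0) {
                    perror("reorganize");
                    return (1);
            }
            (void) close(in);
            (void) close(out);
            return (0);
    }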
On Fri, May 12, 2006 at 12:36:53PM -0500, Anton Rang wrote:> >We might want an interface for the app to know what the natural block > >size of the file is, so it can read at proper file offsets. > > Seems that stat(2) could be used for this ... > > long st_blksize; /* Preferred I/O block size */And in fact, it is! :-) In general, this will be 128k on ZFS filesystems. If you have changed the ''recordsize'' property, then it will be that value (for files created after the property was changed). --matt
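A quick way to see this from an application is to look at st_blksize directly; the sketch below is hypothetical only in whatever sample paths you pass it - the interface itself is plain stat(2). Setting the property first (for example, zfs set recordsize=8k tank/db, with a made-up dataset name) and then creating files there should show up in the output.

    #include <sys/stat.h>
    #include <stdio.h>

    int
    main(int argc, char **argv)
    {
            struct stat st;
            int i;

            for (i = 1; i < argc; i++) {
                    if (stat(argv[i], &st) != 0) {
                            perror(argv[i]);
                            continue;
                    }
                    /* Preferred I/O size; on ZFS this reflects the record size. */
                    (void) printf("%s: st_blksize = %ld\n",
                        argv[i], (long)st.st_blksize);
            }
            return (0);
    }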
Given that ISV apps can only be changed by the ISV, who may or may not be willing to use such a new interface, having a "no cache" property for the file - or, given that filesystems are now really cheap with ZFS, for the filesystem - would be important as well, like the forcedirectio mount option for UFS. No caching at the filesystem level is always appropriate if the application itself maintains a buffer of application data and does its own application-specific buffer management, like DBMSes or large matrix solvers. Double caching these typically huge amounts of data in the filesystem is always a waste of RAM.

- Franz

Nicolas Williams wrote:
>An interface to request no caching on a per-file basis would be good
>(madvise(2) should do for mmap'ed files, an fcntl(2) or open(2) flag
>would be better).
[...]
On Sat, May 13, 2006 at 08:23:55AM +0200, Franz Haberhauer wrote:
> having a "no cache" property for the file - or, given that filesystems
> are now really cheap with ZFS, for the filesystem - would be important
> as well, like the forcedirectio mount option for UFS.
[...]
> Double caching these typically huge amounts of data in the filesystem
> is always a waste of RAM.

Yes, but remember, DB vendors have adopted new features before -- they want to have the fastest DB. Same with open source web servers. So I'm a bit optimistic.

Also, an LD_PRELOAD library could be provided to enable direct I/O as necessary.

Nico --
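A rough sketch of such a preload library, assuming the filesystem underneath honors directio(3C) - UFS does; ZFS would need to grow support. The library name and the choice to ask for direct I/O on every open(2) are purely illustrative:

    /* forcedio.c - interpose on open(2) and request direct I/O on every file.
     *
     * Build (roughly):  cc -D__EXTENSIONS__ -Kpic -G -o libforcedio.so forcedio.c -ldl
     * Use:              LD_PRELOAD=./libforcedio.so dbms ...
     */
    #include <dlfcn.h>
    #include <fcntl.h>
    #include <stdarg.h>
    #include <sys/types.h>

    extern int directio(int, int);          /* directio(3C) */

    int
    open(const char *path, int oflag, ...)
    {
            static int (*real_open)(const char *, int, ...);
            va_list ap;
            mode_t mode = 0;
            int fd;

            if (real_open == NULL)
                    real_open = (int (*)(const char *, int, ...))
                        dlsym(RTLD_NEXT, "open");

            va_start(ap, oflag);
            if (oflag & O_CREAT)
                    mode = va_arg(ap, mode_t);
            va_end(ap);

            fd = real_open(path, oflag, mode);

            /* Best effort: a filesystem that lacks the hint just says no. */
            if (fd >= 0)
                    (void) directio(fd, DIRECTIO_ON);

            return (fd);
    }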
Roch Bourbonnais - Performance Engineering
2006-May-15 14:47 UTC
[zfs-discuss] ZFS and databases
Gregory Shaw writes: > I really like the below idea: > - the ability to defragment a file ''live''. > > I can see instances where that could be very useful. For instance, > if you have multiple LUNs (or spindles, whatever) using ZFS, you > could re-optimize large files to spread the chunks across as many > spindles as possible. Or, as stated below, make it contiguous. > I don''t know if that is automatic with ZFS today, but it''s an idea. I think the expected benefits of making it contiguous is rooted in the belief that bigger I/Os is the only way to reach top performance. I think that before ZFS, both physical and logical contiguity was required to enable sufficient readahead and get performance. Once we have good readahead based on detected logical contiguous accesses, It may well be possible to get top device speed through reasonably-sized I/O concurrency. -r
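To make "reasonably-sized I/O concurrency" concrete: an application (or a prefetcher) can keep several 128K reads in flight with POSIX AIO instead of issuing one huge request. A sketch only, with most error handling trimmed (link with -lrt; the request count and size are arbitrary choices):

    #include <aio.h>
    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #define NREQ    8               /* requests kept in flight */
    #define IOSIZE  (128 * 1024)    /* "reasonably sized" I/O */

    int
    main(int argc, char **argv)
    {
            struct aiocb cb[NREQ];
            const struct aiocb *list[NREQ];
            char *buf;
            off_t off = 0;
            long long total = 0;
            int i, fd, busy = 0;

            if (argc != 2 || (fd = open(argv[1], O_RDONLY)) < 0)
                    return (1);
            if ((buf = malloc((size_t)NREQ * IOSIZE)) == NULL)
                    return (1);
            (void) memset(cb, 0, sizeof (cb));

            /* Prime the pump: NREQ reads at consecutive 128K offsets. */
            for (i = 0; i < NREQ; i++) {
                    cb[i].aio_fildes = fd;
                    cb[i].aio_buf = buf + (size_t)i * IOSIZE;
                    cb[i].aio_nbytes = IOSIZE;
                    cb[i].aio_offset = off;
                    off += IOSIZE;
                    list[i] = &cb[i];
                    if (aio_read(&cb[i]) == 0)
                            busy++;
                    else
                            list[i] = NULL;
            }

            /* As each read completes, reissue that slot at the next offset. */
            while (busy > 0) {
                    (void) aio_suspend(list, NREQ, NULL);
                    for (i = 0; i < NREQ; i++) {
                            ssize_t n;

                            if (list[i] == NULL ||
                                aio_error(&cb[i]) == EINPROGRESS)
                                    continue;
                            n = aio_return(&cb[i]);
                            busy--;
                            if (n <= 0) {           /* EOF or error: retire slot */
                                    list[i] = NULL;
                                    continue;
                            }
                            total += n;
                            cb[i].aio_offset = off;
                            off += IOSIZE;
                            if (aio_read(&cb[i]) == 0)
                                    busy++;
                            else
                                    list[i] = NULL;
                    }
            }
            (void) printf("read %lld bytes\n", total);
            return (0);
    }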
Rich, correct me if I''m wrong, but here''s the scenario I was thinking of: - A large file is created. - Over time, the file grows and shrinks. The anticipated layout on disk due to this is that extents are allocated as the file changes. The extents may or may not be on multiple spindles. I envision a fragmentation over time that will cause sequential access to jump all over the place. If you use smart controllers or disks with read caching, their use of stripes and read-ahead (if enabled) could cause performance to be bad. So, my thought was to de-fragment the file to make it more contiguous and to allow hardware read-ahead to be effective. An additional benefit would be to spread it across multiple spindles in a contiguous fashion, such as: disk0: 32mb disk1: 32mb disk2: 32mb ... etc. Perhaps this is unnecessary. I''m simply trying to grasp the long term performance implications of COW data. On May 15, 2006, at 8:47 AM, Roch Bourbonnais - Performance Engineering wrote:> > Gregory Shaw writes: > >> I really like the below idea: >> - the ability to defragment a file ''live''. >> >> I can see instances where that could be very useful. For instance, >> if you have multiple LUNs (or spindles, whatever) using ZFS, you >> could re-optimize large files to spread the chunks across as many >> spindles as possible. Or, as stated below, make it contiguous. >> I don''t know if that is automatic with ZFS today, but it''s an idea. > > I think the expected benefits of making it contiguous is > rooted in the belief that bigger I/Os is the only way to > reach top performance. > > I think that before ZFS, both physical and logical > contiguity was required to enable sufficient readahead and > get performance. > > Once we have good readahead based on detected logical > contiguous accesses, It may well be possible to get top > device speed through reasonably-sized I/O concurrency. > > -r >----- Gregory Shaw, IT Architect Phone: (303) 673-8273 Fax: (303) 673-8273 ITCTO Group, Sun Microsystems Inc. 1 StorageTek Drive ULVL4-382 greg.shaw at sun.com (work) Louisville, CO 80028-4382 shaw at fmsoft.com (home) "When Microsoft writes an application for Linux, I''ve Won." - Linus Torvalds
The problem I see with "sequential access jump all over the place" is that this increases the utilization of the disks - over the years disks have become even faster for sequential access, whereas random access (as they have to move the actuator) has not improved at the same pace - this is what ZFS exploits when writing. With its fancy detection of sequential access patterns and improved readahead, ZFS should be able to deal with the latency aspect of randomized read accesses - but at the expense of a higher disk utilization. If you think of many processes accessing the same disks this may result in disks "running out of IOPS" earlier than in an environment with sequential accesses (though contiguos data). Obviously this heavily depends on the workload - but with the trend towards even higher capacity disks, IOPS become a valuable resource and it may be worth to think about how to most efficiently use disks - a "self-optimizing" mechanism that in the background or on request rearranges files to become contigous may therefore be useful. - Franz Gregory Shaw wrote:> Rich, correct me if I''m wrong, but here''s the scenario I was thinking > of: > > - A large file is created. > - Over time, the file grows and shrinks. > > The anticipated layout on disk due to this is that extents are > allocated as the file changes. The extents may or may not be on > multiple spindles. > > I envision a fragmentation over time that will cause sequential > access to jump all over the place. If you use smart controllers or > disks with read caching, their use of stripes and read-ahead (if > enabled) could cause performance to be bad. > > So, my thought was to de-fragment the file to make it more contiguous > and to allow hardware read-ahead to be effective. > > An additional benefit would be to spread it across multiple spindles > in a contiguous fashion, such as: > > disk0: 32mb > disk1: 32mb > disk2: 32mb > ... etc. > > Perhaps this is unnecessary. I''m simply trying to grasp the long > term performance implications of COW data. > > On May 15, 2006, at 8:47 AM, Roch Bourbonnais - Performance > Engineering wrote: > >> >> Gregory Shaw writes: >> >>> I really like the below idea: >>> - the ability to defragment a file ''live''. >>> >>> I can see instances where that could be very useful. For instance, >>> if you have multiple LUNs (or spindles, whatever) using ZFS, you >>> could re-optimize large files to spread the chunks across as many >>> spindles as possible. Or, as stated below, make it contiguous. >>> I don''t know if that is automatic with ZFS today, but it''s an idea. >> >> >> I think the expected benefits of making it contiguous is >> rooted in the belief that bigger I/Os is the only way to >> reach top performance. >> >> I think that before ZFS, both physical and logical >> contiguity was required to enable sufficient readahead and >> get performance. >> >> Once we have good readahead based on detected logical >> contiguous accesses, It may well be possible to get top >> device speed through reasonably-sized I/O concurrency. >> >> -r >> > > ----- > Gregory Shaw, IT Architect > Phone: (303) 673-8273 Fax: (303) 673-8273 > ITCTO Group, Sun Microsystems Inc. > 1 StorageTek Drive ULVL4-382 greg.shaw at sun.com (work) > Louisville, CO 80028-4382 shaw at fmsoft.com (home) > "When Microsoft writes an application for Linux, I''ve Won." - Linus > Torvalds > >
Nicolas Williams wrote:
>On Sat, May 13, 2006 at 08:23:55AM +0200, Franz Haberhauer wrote:
[...]
>Yes, but remember, DB vendors have adopted new features before -- they
>want to have the fastest DB.  Same with open source web servers.  So I'm
>a bit optimistic.

Yes, but they usually adopt it only with their latest releases, but it takes time until these are adopted by customers. And it's not just DB vendors, there are other apps around which could benefit, and there are always some who may not adopt a new feature in Solaris at all. Remember when UFS Directio was introduced - forcedirectio was in much wider use than apps which used the API directly.

>Also, an LD_PRELOAD library could be provided to enable direct I/O as
>necessary.

This would work technically, but whether ISVs are willing to support such usage is a different topic (there may be startup scripts involved, making it a little tricky to pass a library path to the app). So while having the app request no caching may be the architecturally cleaner approach, having it as a property on a file or filesystem is a pragmatic approach, with faster time-to-market and a potential for much broader use.

- Franz
Franz Haberhauer wrote:> This would work technically, but wether ISVs are willing to support such > usage is a different > topic (there may be startup scripts involved making it a little tricky > to pass an library path > to the app).Yet another reason to start the applications from SMF. -- Darren J Moffat
On Mon, May 15, 2006 at 07:16:38PM +0200, Franz Haberhauer wrote:> Nicolas Williams wrote: > >Yes, but remember, DB vendors have adopted new features before -- they > >want to have the fastest DB. Same with open source web servers. So I''m > >a bit optimistic. > > > > > Yes, but they usually adopt it only with their latest releases, but it > takes time until these are > adopted by customers. And it''s not just DB vendors, there are other apps > around which could > benefit, and there are always some who may not adopt a new feature in > Solaris at all. > Remember when UFS Directio was introduced - forcedirectio was in much > wider use than > apps which used the API directly.I (but I''m not in the ZFS team) don''t oppose a file attribute of some sort to provide hints to the FS about the utility of direct I/O to processes that open such files. Ideally the OS could just figure it out every time with enough accuracy that no interface should be necessary at all, but I''m not sure that this is possible. But really, the right interface is for the application to tell the OS. I don''t know what others (marketing particularly -- you may well be right about time to market) here think of it but if we could just stick to proper interfaces that would be best. Nico --
Nicolas Williams wrote:> On Mon, May 15, 2006 at 07:16:38PM +0200, Franz Haberhauer wrote: >> Nicolas Williams wrote: >>> Yes, but remember, DB vendors have adopted new features before -- they >>> want to have the fastest DB. Same with open source web servers. So I''m >>> a bit optimistic. >>> >>> >> Yes, but they usually adopt it only with their latest releases, but it >> takes time until these are >> adopted by customers. And it''s not just DB vendors, there are other apps >> around which could >> benefit, and there are always some who may not adopt a new feature in >> Solaris at all. >> Remember when UFS Directio was introduced - forcedirectio was in much >> wider use than >> apps which used the API directly. > > I (but I''m not in the ZFS team) don''t oppose a file attribute of some > sort to provide hints to the FS about the utility of direct I/O to > processes that open such files. > > Ideally the OS could just figure it out every time with enough accuracy > that no interface should be necessary at all, but I''m not sure that this > is possible. > > But really, the right interface is for the application to tell the OS. > I don''t know what others (marketing particularly -- you may well be > right about time to market) here think of it but if we could just stick > to proper interfaces that would be best. > > NicoPerhaps an fadvise call is in order? - Bart -- Bart Smaalders Solaris Kernel Performance barts at cyber.eng.sun.com http://blogs.sun.com/barts
On Mon, May 15, 2006 at 11:17:17AM -0700, Bart Smaalders wrote:> Perhaps an fadvise call is in order?We already have directio(3C). (That was a surprise for me also.)
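Using it is about as simple as an interface gets - the call is advisory, so a filesystem that does not implement it (ZFS, as of this thread) just fails the request and the caller can carry on. A minimal sketch; the path is made up, and where posix_fadvise() exists, POSIX_FADV_NOREUSE would be the portable way to say roughly the same thing:

    #include <sys/types.h>
    #include <sys/fcntl.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int
    main(void)
    {
            char buf[8192];
            int fd;

            /* The path is a made-up example. */
            fd = open("/tank/db/table.dbf", O_RDONLY);
            if (fd < 0) {
                    perror("open");
                    return (1);
            }

            /*
             * Advisory: ask that I/O on this descriptor bypass the
             * filesystem cache.  UFS honors this; a filesystem that does
             * not implement it fails the call and we simply continue.
             */
            if (directio(fd, DIRECTIO_ON) != 0)
                    perror("directio (ignored)");

            if (pread(fd, buf, sizeof (buf), 0) < 0)
                    perror("pread");

            (void) close(fd);
            return (0);
    }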
Roch Bourbonnais - Performance Engineering
2006-May-22 15:07 UTC
[zfs-discuss] ZFS and databases
Gregory Shaw writes:
 > So, my thought was to de-fragment the file to make it more contiguous
 > and to allow hardware read-ahead to be effective.
 [...]
 > Perhaps this is unnecessary.  I'm simply trying to grasp the long
 > term performance implications of COW data.

And my take is that, if I spread the 128K block all over but read then sufficiently ahead (say 2MB) then I shall be very much OK from the performance standpoint.

Actually I just ran this (8M reads) and am getting pretty good numbers (single disk pool):

rbourbon at crazycanucks(44): zpool iostat 1
               capacity     operations    bandwidth
pool         used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
zfs         24.4G  9.38G      0      0  59.5K     42
zfs         24.4G  9.38G    496      0  62.1M      0
zfs         24.4G  9.38G    497      0  62.1M      0
zfs         24.4G  9.38G    496      0  62.0M      0
zfs         24.4G  9.38G    496      0  62.0M      0
zfs         24.4G  9.38G    497      0  62.1M      0
zfs         24.4G  9.38G    493      0  61.6M      0
zfs         24.4G  9.38G    491      0  61.4M      0
zfs         24.4G  9.38G    492      0  61.5M      0
zfs         24.4G  9.38G    485      0  60.6M      0

So what benefit do you see from re-laying out the file, is it just performance, or something else ?

One benefit that ZFS gets out of working with smaller chunks is that every time one I/O completes then ZFS can decide which of the ZIOs in the in-kernel queue has the highest priority. If you swamp a disk with a ton of very large I/Os, they will take more time to complete and high priority ones that show up in-between will just have to wait more.

-r
On May 22, 2006, at 9:07 AM, Roch Bourbonnais - Performance Engineering wrote:
> And my take is that, if I spread the 128K block all over but read then
> sufficiently ahead (say 2MB) then I shall be very much OK from the
> performance standpoint.
[...]
> So what benefit do you see from re-laying out the file, is it just
> performance, or something else ?

Were the above reads random access? I was thinking of the case of a database. I've got a number of 2g (or larger) files that over time get written in a random fashion. With COW, I would expect the rewrite to cause the file extents to slowly migrate to be all over the disk. I think this would be sub-optimal, as the disk would have to do a lot of seeks to access data that may be contiguous in the file, but not on the disk.

An example of this would be to create a 2g file, and re-write the file in random chunks. After a set percentage of the data were replaced (say 10% intervals), a comparison sequential read test would be performed. If the numbers grow as the file is updated, that would indicate changes due to COW and file fragmentation.

Have you heard of stkio? It's a simple tool to simulate different types of disk loads that could be used for this testing. I've attached it for reference:

[Attachments scrubbed by the list archive:
 stkio.tar (application/x-tar, 133120 bytes):
 <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20060522/09dd0c5c/attachment.tar>
 stkio.txt:
 <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20060522/09dd0c5c/attachment.txt>]

-----
Gregory Shaw, IT Architect
Phone: (303) 673-8273        Fax: (303) 673-8273
ITCTO Group, Sun Microsystems Inc.
1 StorageTek Drive ULVL4-382          greg.shaw at sun.com (work)
Louisville, CO 80028-4382             shaw at fmsoft.com (home)
"When Microsoft writes an application for Linux, I've Won." - Linus Torvalds
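That experiment is small enough to sketch directly. The program below (the 8K page size and fill pattern are arbitrary choices) rewrites a chosen percentage of an existing file in random 8K chunks; after creating the file (mkfile 2g bigfile, say) one could alternate it with a timed sequential read such as dd if=bigfile of=/dev/null bs=128k at each 10% step and watch whether the read rate erodes:

    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/stat.h>
    #include <unistd.h>

    #define BLK     8192            /* database-style page size */

    int
    main(int argc, char **argv)
    {
            struct stat st;
            char buf[BLK];
            long long nblk, todo, i;
            int fd, pct;

            if (argc != 3) {
                    (void) fprintf(stderr, "usage: %s file percent\n", argv[0]);
                    return (1);
            }
            pct = atoi(argv[2]);
            if ((fd = open(argv[1], O_RDWR)) < 0 || fstat(fd, &st) != 0) {
                    perror(argv[1]);
                    return (1);
            }
            (void) memset(buf, 0x5a, sizeof (buf));
            nblk = st.st_size / BLK;
            todo = nblk * pct / 100;

            srand48(getpid());
            for (i = 0; i < todo; i++) {
                    /* pick a random 8K-aligned offset and overwrite it */
                    off_t off = (off_t)(drand48() * nblk) * BLK;

                    if (pwrite(fd, buf, BLK, off) != BLK) {
                            perror("pwrite");
                            return (1);
                    }
            }
            (void) fsync(fd);
            (void) close(fd);
            (void) printf("rewrote %lld of %lld blocks\n", todo, nblk);
            return (0);
    }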
Roch Bourbonnais - Performance Engineering
2006-May-22 15:50 UTC
[zfs-discuss] ZFS and databases
Cool, I'll try the tool. And for good measure: the data I posted was sequential access (from a logical point of view). As for the physical layout, I don't know; it's quite possible that ZFS has laid out all blocks sequentially on the physical side, so certainly this is not a good way to test random-read access. Looked too good. -r
Sorry for resurrecting this interesting discussion so late: I'm skimming backwards through the forum.

One comment about segregating database logs is that people who take their data seriously often want a 'belt plus suspenders' approach to recovery. Conventional RAID, even supplemented with ZFS's self-healing scrubbing, isn't sufficient (though RAID-6 might be): they want at least the redo logs separate so that in the extremely unlikely event that they lose something in the (already replicated) database the failure is guaranteed not to have affected the redo logs as well, from which they can reconstruct the current database state from a backup.

True, this will mean that you can't aggregate redo log activity with other transaction bulk-writes, but that's at least partly good as well: databases are often extremely sensitive to redo log write latency and would not want such writes delayed by combination with other updates, let alone by up to a 5-second delay. ZFS's synchronous write intent log could help here (if you replicate it: serious database people would consider even the very temporary exposure to a single failure inherent in an unmirrored log completely unacceptable), but that could also be slowed by other synch small write activity; conversely, databases often couldn't care less about the latency of many of their other writes, because their own (replicated) redo log has already established the persistence that they need.

As for direct I/O, it's not clear why ZFS couldn't support it: it could verify each read in user memory against its internal checksum and perform its self-healing magic if necessary before returning completion status (which would be the same status it would return if the same situation occurred during its normal mode of operation: either unconditional success or success-after-recovery if the application might care to know that); it could handle each synchronous write analogously, and if direct I/O mechanisms support lazy writes then presumably they tie up the user buffer until the write completes such that you could use your normal mechanisms there as well (just operating on the user buffer instead of your cache). In this I'm assuming that 'direct I/O' refers not to raw device access but to file-oriented access that simply avoids any internal cache use, such that you could still use your no-overwrite approach.

Of course, this also assumes that the direct I/O is always being performed in aligned integral multiples of checksum units by the application; if not, you'd either have to bag the checksum facility (this would not be an entirely unreasonable option to offer, given that some sophisticated applications might want to use their own even higher-level integrity mechanisms, e.g., across geographically-separated sites, and would not need yours) or run everything through cache as you normally do. In suitably-aligned cases where you do validate the data you could avoid half the copy overhead (an issue of memory bandwidth as well as simply operation latency: TPC-C submissions can be affected by this, though it may be rare in real-world use) by integrating the checksum calculation with the copy, but would still have multiple copies of the data taking up memory in a situation (direct I/O) where the application *by definition* does not expect you to be caching the data (quite likely because it is doing any desirable caching itself).
Tablespace contiguity may, however, be a deal-breaker for some users: it is common for tablespaces to be scanned sequentially (when selection criteria don't mesh with existing indexes, perhaps especially in joins where the smaller tablespace (still too large to be retained in cache, though) is scanned repeatedly in an inner loop), and a DBMS often goes to some effort to keep them defragmented. Until ZFS provides some effective continuous defragmenting mechanisms of its own, its no-overwrite policy may do more harm than good in such cases (since the database's own logs keep persistence latency low, while the backing tablespaces can then be updated at leisure).

I do want to comment on the observation that "enough concurrent 128K I/O can saturate a disk" - the apparent implication being that one could therefore do no better with larger accesses, an incorrect conclusion. Current disks can stream out 128 KB in 1.5 - 3 ms., while taking 5.5 - 12.5 ms. for the average-seek-plus-partial-rotation required to get to that 128 KB in the first place. Thus on a full drive serial random accesses to 128 KB chunks will yield only about 20% of the drive's streaming capability (by contrast, accessing data using serial random accesses in 4 MB contiguous chunks achieves around 90% of a drive's streaming capability): one can do better on disks that support queuing if one allows queues to form, but this trades significantly increased average operation latency for the increase in throughput (and said increase still falls far short of the 90% utilization one could achieve using 4 MB chunking).

Enough concurrent 0.5 KB I/O can also saturate a disk, after all - but this says little about effective utilization.

Others have touched on several of these points as well - apologies for any repetition arising from writing while I read.

- bill
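The arithmetic behind those percentages is worth spelling out. With purely illustrative numbers - say a 64 MB/s streaming rate and roughly 9 ms of seek plus partial rotation per random access - utilization is just transfer time divided by positioning-plus-transfer time:

    #include <stdio.h>

    int
    main(void)
    {
            double rate = 64.0;     /* assumed streaming rate, MB/s */
            double tpos = 9.0;      /* assumed avg seek + partial rotation, ms */
            double sizes_kb[] = { 0.5, 128.0, 4096.0 };
            int i;

            for (i = 0; i < 3; i++) {
                    /* transfer time for this chunk size, in milliseconds */
                    double txfer = sizes_kb[i] / 1024.0 / rate * 1000.0;
                    double util = txfer / (tpos + txfer);

                    (void) printf("%7.1f KB chunks: %5.1f%% of streaming rate\n",
                        sizes_kb[i], util * 100.0);
            }
            return (0);
    }

Under those assumptions this prints roughly 0.1% for 0.5 KB, 18% for 128 KB and 87% for 4 MB chunks - the ~20% and ~90% figures quoted above, and also why enough tiny concurrent I/Os can "saturate" a disk while accomplishing very little.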
billtodd wrote:> I do want to comment on the observation that "enough concurrent 128K I/O can > saturate a disk" - the apparent implication being that one could therefore do > no better with larger accesses, an incorrect conclusion. Current disks can > stream out 128 KB in 1.5 - 3 ms., while taking 5.5 - 12.5 ms. for the > average-seek-plus-partial-rotation required to get to that 128 KB in the first > place. Thus on a full drive serial random accesses to 128 KB chunks will yield > only about 20% of the drive''s streaming capability (by contrast, accessing > data using serial random accesses in 4 MB contiguous chunks achieves around > 90% of a drive''s streaming capability): one can do better on disks that > support queuing if one allows queues to form, but this trades significantly > increased average operation latency for the increase in throughput (and said > increase still falls far short of the 90% utilization one could achieve using > 4 MB chunking). > > Enough concurrent 0.5 KB I/O can also saturate a disk, after all - but this > says little about effective utilization.I think I can summarize where we are at on this. This is the classic big-{packet|block|$-line|bikini} versus small-{packet|block|$-line|bikini} argument. One size won''t fit all. The jury is still out on what all of this means for any given application. -- richard
For output ops, ZFS could set up a 10MB I/O transfer to disk starting at sector X, or chunk that up into 128K pieces while still assigning the same range of disk blocks for the operations. Yes, there will be more control information going around, a little more CPU consumed, but the disk will be streaming all right, I would guess. Most heavy output load will behave this way with ZFS, random or not. The throughput will depend more on the availability of contiguous chunks of disk blocks than on the actual record size in use.

As for random input, the issue is that ZFS does not get a say as to what the application is requesting in terms of size and location. Yes, doing a 4M input of contiguous disk blocks will be faster than random reading 128K chunks spread out. But if the application is manipulating 4M objects, those will stream out and land on contiguous disk blocks (if available) and those should stream in as well (if our read codepath is clever enough).

The fundamental question is really: is there something that ZFS does that causes data that represents an application's logical unit of information (likely to be read as one chunk, and so data we would like to have contiguous on disk) to actually spread out everywhere on the platter?

-r

Richard Elling writes:
 [...]
 > This is the classic big-{packet|block|$-line|bikini} versus
 > small-{packet|block|$-line|bikini} argument.  One size won't fit all.
 >
 > The jury is still out on what all of this means for any given application.
 > -- richard
-r: ZFS's output aggregation mechanisms seem entirely adequate in terms of throughput, given that the ZIL should mask what would otherwise be poor disk utilization in the event of many small, synchronous writes. The problems are purely on the input side (just as they are with RAID-Z).

The read-side fragmentation problem occurs when an application writes at fine grain and subsequently reads at coarse grain, as I mentioned in the example of a tablespace which is updated at fine grain and then streamed back in bulk for sequential scans. Ironically, you already have part of a solution in the ZIL, at least if the fine-grained updates are small enough to place there: once in the ZIL, you no longer need worry about over-writing the original data (ignoring for the moment the impact on your snapshot facility - a drawback of block-oriented snapshots, but one you'll need to resolve if you ever want to defragment anything), since you can simply reapply the ZIL images until they stick and update checksums (if they're maintained - see earlier comments) accordingly (this would require using the ZIL as a conventional transaction log to protect this action, but that's not all that much more of a stretch than its current small-update staging process).

ZFS does not appear to deal with such situations very well right now: either it uses coarse-grained checksumming, in which case each of those small (e.g., 4 KB) tablespace updates turns into a read/modify/write operation on a 128 KB entity, or it uses fine-grained (4 KB in this case) checksums, in which case these small blocks get spread all over the storage as they're individually updated and the subsequent sequential tablespace scans run at well under 1 MB/sec/disk (even worse if RAID-Z is used).

richard: Characterizing the disk-utilization problem as a classic big-block-vs.-small-block argument may be more a Unix mind-set issue than anything else: other file systems (including a few on Unix, for that matter) solve this by using extent-based allocation to aggregate many smaller (though still possibly variable-size) blocks into groups which can be streamed efficiently.

- bill