I have a lot of people whispering "zfs" in my virtual ear these days, and at
the same time I have an irrational attachment to xfs based entirely on its
lack of the 32000 subdirectory limit. I'm not afraid of ext4's newness, since
really a lot of that stuff has been in Lustre for years. So a-benchmarking I
went. Results at the bottom:

http://tastic.brillig.org/~jwb/zfs-xfs-ext4.html

Short version: ext4 is awesome. zfs has absurdly fast metadata operations but
falls apart on sequential transfer. xfs has great sequential transfer but
really bad metadata ops, like 3 minutes to tar up the kernel.

It would be nice if mke2fs would copy xfs's code for optimal layout on a
software raid. The mkfs defaults and the mdadm defaults interact badly.

Postmark is a somewhat bogus benchmark with some obvious quantization
problems.

Regards,
jwb
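For reference, the md geometry can be handed to mke2fs by hand so the
allocator lines up with the array. A minimal sketch, assuming a 6-disk
raid10 "near2" array with 64 KiB chunks and 4 KiB filesystem blocks (the
device names and chunk size are illustrative, not taken from the benchmark
setup, and the stripe-width extended option needs a reasonably recent
e2fsprogs):

  # chunk=64 KiB / block=4 KiB  ->  stride = 16
  # raid10 "near2" on 6 disks has 3 data-bearing stripes  ->  stripe-width = 48
  mdadm --create /dev/md0 --level=10 --layout=n2 --chunk=64 \
        --raid-devices=6 /dev/sd[b-g]
  mke2fs -j -E stride=16,stripe-width=48 /dev/md0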
Jeffrey, it would be interesting to see your zpool layout info as well. It
can significantly influence the results obtained in the benchmarks.

On 8/30/07, Jeffrey W. Baker <jwbaker at acm.org> wrote:
> I have a lot of people whispering "zfs" in my virtual ear these days, and
> at the same time I have an irrational attachment to xfs based entirely on
> its lack of the 32000 subdirectory limit. [...]
>
> http://tastic.brillig.org/~jwb/zfs-xfs-ext4.html

-- 
Regards,
Cyril
On 8/29/07, Jeffrey W. Baker <jwbaker at acm.org> wrote:
> Short version: ext4 is awesome. zfs has absurdly fast metadata operations
> but falls apart on sequential transfer. xfs has great sequential transfer
> but really bad metadata ops, like 3 minutes to tar up the kernel.
>
> It would be nice if mke2fs would copy xfs's code for optimal layout on a
> software raid. The mkfs defaults and the mdadm defaults interact badly.

this is cool to see. however, performance wouldn't be my reason for moving
to zfs. the inline checksumming and all that is what i want. if someone
could get a nearly incorruptible filesystem (or just a linux version of
zfs... btrfs looks promising) this would be even better. sadly ext4+swraid
isn't as good on that front; i might have tried it otherwise, since how long
i'd be waiting for the right hardware support for zfs is unknown at this
point.
I'll take a look at this. ZFS provides outstanding sequential IO performance
(both read and write). In my testing, I can essentially sustain "hardware
speeds" with ZFS on sequential loads. That is, assuming 30-60MB/sec per disk
of sequential IO capability (depending on hitting inner or outer cylinders),
I get linear scale-up on sequential loads as I add disks to a zpool, e.g. I
can sustain 250-300MB/sec on a 6 disk zpool, and it's pretty consistent for
raidz and raidz2.

Your numbers are in the 50-90MB/second range, or roughly 1/2 to 1/4 of what
was measured on the other 2 file systems for the same test. Very odd.

Still looking...

Thanks,
/jim

Jeffrey W. Baker wrote:
> Short version: ext4 is awesome. zfs has absurdly fast metadata operations
> but falls apart on sequential transfer. xfs has great sequential transfer
> but really bad metadata ops, like 3 minutes to tar up the kernel.
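A quick way to see whether a sequential load is actually being spread across
the vdevs is to watch per-device throughput while a large write runs. A
minimal sketch, assuming a pool named tank mounted at /tank (file name and
sizes are illustrative; it also assumes compression is off, the default,
since /dev/zero is a poor source otherwise):

  # dd if=/dev/zero of=/tank/seqtest bs=1024k count=8192 &
  # zpool iostat -v tank 5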
On Thu, 2007-08-30 at 14:33 -0400, Jim Mauro wrote:
> Your numbers are in the 50-90MB/second range, or roughly 1/2 to 1/4 of what
> was measured on the other 2 file systems for the same test. Very odd.

Yeah, it's pretty odd. I'd tend to blame the Areca HBA, but then I'd also
point out that the HBA is "Verified" by Sun.

-jwb
On Thu, 2007-08-30 at 08:37 -0500, Jose R. Santos wrote:
> On Wed, 29 Aug 2007 23:16:51 -0700
> "Jeffrey W. Baker" <jwbaker at acm.org> wrote:
> > http://tastic.brillig.org/~jwb/zfs-xfs-ext4.html
>
> FFSB:
> Could you send the patch to fix the FFSB Solaris build? I should probably
> update the Sourceforge version so that it builds out of the box.

Sadly I blew away OpenSolaris without preserving the patch, but the gist of
it is this: ctime_r takes three parameters on Solaris (the third is the
buffer length) and Solaris has directio(3c) instead of O_DIRECT.

> I'm also curious about your choices in the FFSB profiles you created.
> Specifically, the very short run time and doing fsync after every file
> close. When using FFSB, I usually run with a large run time (usually 600
> seconds) to make sure that we do enough IO to get a stable result.

With a 1GB machine and max I/O of 200MB/s, I assumed 30 seconds would be
enough for the machine to quiesce. You disagree? The fsync flag is in there
because my primary workload is PostgreSQL, which is entirely synchronous.

> Running longer means that we also use more of the disk storage and our
> results are not based on doing IO to just the beginning of the disk. When
> running for that long period of time, the fsync flag is not required since
> we do enough reads and writes to cause memory pressure and guarantee IO
> going to disk. Nothing wrong in what you did, but I wonder how it would
> affect the results of these runs.

So do I :) I did want to finish the test in a practical amount of time, and
it takes 4 hours for the RAID to build. I will do a few hours-long runs of
ffsb with Ext4 and see what it looks like.

> The agefs options you use are also interesting since you only utilize a
> very small percentage of your filesystem. Also note that since create and
> append weights are very heavy compared to deletes, the desired utilization
> would be reached very quickly and without that much fragmentation. Again,
> nothing wrong here, just very interested in your perspective in selecting
> these settings for your profile.

The aging takes forever, as you are no doubt already aware. It requires at
least 1 minute for 1% utilization. On a longer run, I can do more aging. The
create and append weights are taken from the README.

> Don't mean to invalidate the Postmark results, just merely pointing out a
> possible error in the assessment of the meta-data performance of ZFS. I
> say possible since it's still unknown if another workload will be able to
> validate these results.

I don't want to pile scorn on XFS, but the postmark workload was chosen for
a reasonable run time on XFS, and then it turned out that it runs in 1-2
seconds on the other filesystems. The scaling factors could have been better
chosen to exercise the high speeds of Ext4 and ZFS. The test needs to run
for more than a minute to get meaningful results from postmark, since it
uses truncated whole-number seconds as the denominator when reporting.

One thing that stood out from the postmark results is how ext4/sw has a
weird inverse scaling with respect to the number of subdirectories. It's
faster with 10000 files in 1 directory than with 100 files each in 100
subdirectories. Odd, no?

> Did you gather CPU statistics when running these benchmarks?

I didn't bother. If you buy a server these days and it has fewer than four
CPUs, you got ripped off.

-jwb
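For reference, a postmark configuration with larger scaling factors along
those lines might look like this, fed to the tool as a command file (the
numbers and path are illustrative, not the ones used for the benchmark):

  set location /tank/bench
  set number 50000
  set subdirectories 100
  set transactions 500000
  run
  quit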
On Aug 29, 2007, at 11:16 PM, Jeffrey W. Baker wrote:
> Short version: ext4 is awesome. zfs has absurdly fast metadata operations
> but falls apart on sequential transfer. xfs has great sequential transfer
> but really bad metadata ops, like 3 minutes to tar up the kernel.

Hey jwb,

Thanks for taking up the task - it's benchmarking, so i've got some
questions...

What does it mean to have an external vs. internal journal for ZFS?

Can you show the output of 'zpool status' when using software RAID vs.
hardware RAID for ZFS?

The hardware RAID has a cache on the controller. ZFS will flush the "cache"
when pushing out a txg (essentially before writing out the uberblock and
after writing out the uberblock). When you have a non-volatile cache with
battery backing (such as your setup), it's safe to disable that by putting
'set zfs:zfs_nocacheflush = 1' in /etc/system and rebooting. It's ugly, but
we're going through the final code review of a fix for this (it's partly
that we aren't sending down the right command, and partly that even if we
did, no storage devices actually support it quite yet).

What parameters did you give bonnie++? Compiled 64bit, right?

For the randomio test, it looks like you used an io_size of 4KB. Are those
aligned? Random? How big is the '/dev/sdb' file?

Do you have the parameters given to FFSB?

eric
On Thu, 2007-08-30 at 13:57 -0500, Eric Sandeen wrote:
> Christoph Hellwig wrote:
> > On Thu, Aug 30, 2007 at 05:07:46PM +1000, Nathan Scott wrote:
> >> To improve metadata performance, you have many options with XFS (which
> >> ones are useful depends on the type of metadata workload) - you can try
> >> a v2 format log, and mount with "-o logbsize=256k", try increasing the
> >> directory block size (e.g. mkfs.xfs -nsize=16k, etc), and also the log
> >> size (mkfs.xfs -lsize=XXXXXXb).
> >
> > Okay, these suggestions have come up once too often now. v2 log and large
> > logs/log buffers are the almost universal suggestions, and we really need
> > to make these defaults. XFS is already the laughing stock of the Linux
> > community due to its absurdly bad default settings.
>
> Agreed on reevaluating the defaults, Christoph!
>
> barrier seems to hurt badly on xfs, too. Note: barrier is off by default
> on ext[34], so if you want apples to apples there, you need to change one
> or the other filesystem's mount options. If your write cache is safe
> (battery backed?) you may as well turn barriers off. I'm not sure offhand
> who will react more poorly to an evaporating write cache (with no
> barriers), ext4 or xfs...

I didn't compare the safety of the three filesystems, but I did have disk
caches disabled and only battery-backed caches enabled. Do you need barriers
without volatile caches?

Most people benchmark ext3 with data=writeback, which is unsafe. I used
ordered (the default).

I think if you look at all the features, zfs is theoretically the safest
filesystem. But in practice, who knows?

-jwb
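To make the barrier settings comparable either way, the era's mount options
would look roughly like this (device, mountpoint, and filesystem type name
are illustrative; xfs enables barriers by default, ext3/ext4 do not):

  # both with barriers off (relying on the battery-backed cache):
  mount -t xfs  -o logbsize=256k,nobarrier /dev/md0 /mnt/bench
  mount -t ext3 -o data=ordered,barrier=0  /dev/md0 /mnt/bench

  # or both with barriers on:
  mount -t xfs  -o logbsize=256k /dev/md0 /mnt/bench
  mount -t ext3 -o data=ordered,barrier=1 /dev/md0 /mnt/bench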
On Thu, 2007-08-30 at 12:07 -0700, eric kustarz wrote:
> Hey jwb,
>
> Thanks for taking up the task - it's benchmarking, so i've got some
> questions...
>
> What does it mean to have an external vs. internal journal for ZFS?

This is my first use of ZFS, so be gentle. External == ZIL on a separate
device, e.g.

zpool create tank c2t0d0 log c2t1d0

> Can you show the output of 'zpool status' when using software RAID vs.
> hardware RAID for ZFS?

I blew away the hardware RAID but here's the one for software:

# zpool status
  pool: tank
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c2t0d0  ONLINE       0     0     0
            c2t1d0  ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c2t2d0  ONLINE       0     0     0
            c2t3d0  ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c2t4d0  ONLINE       0     0     0
            c2t5d0  ONLINE       0     0     0
        logs        ONLINE       0     0     0
          c2t6d0    ONLINE       0     0     0

errors: No known data errors

iostat shows balanced reads and writes across t[0-5], so I assume this is
working.

> The hardware RAID has a cache on the controller. ZFS will flush the
> "cache" when pushing out a txg (essentially before writing out the
> uberblock and after writing out the uberblock). When you have a
> non-volatile cache with battery backing (such as your setup), it's safe to
> disable that by putting 'set zfs:zfs_nocacheflush = 1' in /etc/system and
> rebooting.

Do you think this would matter? There's no reason to believe that the RAID
controller respects the flush commands, is there? As far as the operating
system is concerned, the flush means that data is in non-volatile storage,
and the RAID controller's cache/disk configuration is opaque.

> What parameters did you give bonnie++? Compiled 64bit, right?

Uh, whoops. As I freely admit this is my first encounter with opensolaris,
I just built the software on the assumption that it would be 64-bit by
default. But it looks like all my benchmarks were built 32-bit. Yow. I'd
better redo them with -m64, eh?

[time passes]

Well, results are _substantially_ worse with bonnie++ recompiled at 64-bit.
Way, way worse. 54MB/s linear reads, 23MB/s linear writes, 33MB/s mixed.

> For the randomio test, it looks like you used an io_size of 4KB. Are those
> aligned? random? How big is the '/dev/sdb' file?

Randomio does aligned reads and writes. I'm not sure what you mean by
/dev/sdb? The file upon which randomio operates is 4GiB.

> Do you have the parameters given to FFSB?

The parameters are linked on my page.

Regards,
jwb
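One way to do the 64-bit rebuild with gcc, as a rough sketch (bonnie++ is
C++, so the flag has to reach the C++ compiler; the exact configure
invocation is an assumption, not what was actually run here):

  $ ./configure CXX="g++ -m64" CC="gcc -m64"
  $ gmake
  $ file bonnie++      # should now report a 64-bit ELF executable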
On Aug 30, 2007, at 12:33 PM, Jeffrey W. Baker wrote:
> This is my first use of ZFS, so be gentle. External == ZIL on a separate
> device, e.g.
>
> zpool create tank c2t0d0 log c2t1d0

Ok, cool! That's the way to do it. I'm always curious to see if people know
about some of the new features in ZFS (and then there's the game of matching
lingo - "separate intent log" <-> "external journal").

So the ZIL will be responsible for handling "synchronous" operations (O_DSYNC
writes, file creates over NFS, fsync, etc). I actually don't see anything in
the tests you ran that would stress this aspect (looks like randomio is doing
1% fsyncs). If you did, then you'd want to have more log devices (i.e. a
stripe of them).

>> Can you show the output of 'zpool status' when using software RAID vs.
>> hardware RAID for ZFS?
>
> I blew away the hardware RAID but here's the one for software:

Ok, for the hardware RAID config to do a fair comparison, you'd just want to
do a RAID-0 in ZFS, so something like:

# zpool create tank c2t0d0 c2t1d0 c2t2d0 c2t3d0 c2t4d0 c2t5d0 c2t6d0

We call this "dynamic striping" in ZFS.

> [zpool status output trimmed]
>
> iostat shows balanced reads and writes across t[0-5], so I assume this is
> working.

Cool, makes sense.

>> The hardware RAID has a cache on the controller. ZFS will flush the
>> "cache" when pushing out a txg (essentially before writing out the
>> uberblock and after writing out the uberblock). When you have a
>> non-volatile cache with battery backing (such as your setup), it's safe
>> to disable that by putting 'set zfs:zfs_nocacheflush = 1' in /etc/system
>> and rebooting.
>
> Do you think this would matter? There's no reason to believe that the RAID
> controller respects the flush commands, is there? As far as the operating
> system is concerned, the flush means that data is in non-volatile storage,
> and the RAID controller's cache/disk configuration is opaque.

From my experience dealing with some Hitachi and LSI devices, it makes a big
difference (of course depending on the workload). ZFS needs to flush the
cache for every transaction group (aka "txg") and for ZIL operations. The
txg happens about every 5 seconds. The ZIL operations are of course
dependent on the workload. So a workload that does lots of synchronous
writes will trigger lots of ZIL operations, which will trigger lots of
cache flushes.

For ZFS, we can safely enable the write cache on a disk - and part of that
requires we flush the write cache at specific times. However, syncing the
non-volatile cache on a controller (with battery backup) doesn't make sense
(and some devices will actually flush their cache), and can really hurt
performance for workloads that flush a lot.

>> What parameters did you give bonnie++? Compiled 64bit, right?
>
> Uh, whoops. As I freely admit this is my first encounter with opensolaris,
> I just built the software on the assumption that it would be 64-bit by
> default. But it looks like all my benchmarks were built 32-bit. Yow. I'd
> better redo them with -m64, eh?
>
> [time passes]
>
> Well, results are _substantially_ worse with bonnie++ recompiled at
> 64-bit. Way, way worse. 54MB/s linear reads, 23MB/s linear writes, 33MB/s
> mixed.

Hmm, what are your parameters?

>> For the randomio test, it looks like you used an io_size of 4KB. Are
>> those aligned? random? How big is the '/dev/sdb' file?
>
> Randomio does aligned reads and writes. I'm not sure what you mean by
> /dev/sdb? The file upon which randomio operates is 4GiB.

Sorry, I was grabbing "dev/sdb" from the http://arctic.org/~dean/randomio/
link (that was kinda silly). Ok cool, just making sure the file wasn't
completely cacheable.

Another thing to know about ZFS is that it has a variable block size (that
maxes out at 128KB). And since ZFS is COW, we can grow the block size on
demand. For instance, if you just create a small file, say 1B, your block
size is 512B. If you go over to 513B, we double you to 1KB, etc.

Why it matters here (and you see this especially on databases) is that this
particular benchmark is doing aligned random 2KB reads/writes. If the file
is big, then all of its blocks will max out at the biggest allowable block
size for that file system (which by default is 128KB). Which means, if you
need to read in 2KB and have to go to disk, then you're really reading in
128KB. Most other filesystems have a blocksize of 8KB.

We added a special property (recordsize) to accommodate workloads/apps like
this benchmark. By setting the recordsize property to 2K, that will make the
maximum blocksize 2KB (instead of 128KB) for that file system. You'll see a
nice win. To set it, try:

fsh-hake# zfs set recordsize=2k tank
fsh-hake# zfs get recordsize tank
NAME  PROPERTY    VALUE    SOURCE
tank  recordsize  2K       local
fsh-hake#

>> Do you have the parameters given to FFSB?
>
> The parameters are linked on my page.

Whoops, my bad. Let me go take a look.

eric
On Thu, 2007-08-30 at 13:07 -0700, eric kustarz wrote:
> On Aug 30, 2007, at 12:33 PM, Jeffrey W. Baker wrote:
> >
> > Well, results are _substantially_ worse with bonnie++ recompiled at
> > 64-bit. Way, way worse. 54MB/s linear reads, 23MB/s linear writes,
> > 33MB/s mixed.
>
> Hmm, what are your parameters?

bonnie++ -g daemon -d /tank/bench/ -f

This becomes more interesting. The very slow numbers above were on an aged
(post-benchmark) filesystem. After destroying and recreating the zpool, the
numbers are similar to the originals (55/87/37). Does ZFS really age that
quickly? I think I need to do more investigating here.

> Another thing to know about ZFS is that it has a variable block size (that
> maxes out at 128KB). And since ZFS is COW, we can grow the block size on
> demand. For instance, if you just create a small file, say 1B, your block
> size is 512B. If you go over to 513B, we double you to 1KB, etc.

# zfs set recordsize=2K tank/bench
# randomio bigfile 10 .25 .01 2048 60 1

  total |  read:         latency (ms)       |  write:        latency (ms)
   iops |   iops   min    avg    max   sdev |   iops   min    avg    max   sdev
--------+-----------------------------------+----------------------------------
  463.9 |  346.8   0.0   21.6  761.9   33.7 |  117.1   0.0   21.3  883.9   33.5

Roughly the same as when the RS was 128K. But, if I set the RS to 2K before
creating bigfile:

  total |  read:         latency (ms)       |  write:        latency (ms)
   iops |   iops   min    avg    max   sdev |   iops   min    avg    max   sdev
--------+-----------------------------------+----------------------------------
  614.7 |  460.4   0.0   18.5  249.3   14.2 |  154.4   0.0    9.6  989.0   27.6

Much better! Yay! So I assume you would always set RS=8K when using
PostgreSQL, etc?

-jwb
On Thu, Aug 30, 2007 at 01:53:53PM -0700, Jeffrey W. Baker wrote:
> Roughly the same as when the RS was 128K. But, if I set the RS to 2K
> before creating bigfile:
>
>   total |  read:         latency (ms)       |  write:        latency (ms)
>    iops |   iops   min    avg    max   sdev |   iops   min    avg    max   sdev
> --------+-----------------------------------+----------------------------------
>   614.7 |  460.4   0.0   18.5  249.3   14.2 |  154.4   0.0    9.6  989.0   27.6
>
> Much better! Yay! So I assume you would always set RS=8K when using
> PostgreSQL, etc?

That's what we recommend (set recordsize to match the page size of the
database you're using).
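For PostgreSQL (8 KiB pages) that would look roughly like this; the pool,
dataset name, and mountpoint are illustrative, and the property has to be in
place before the data files are written, since existing blocks keep the
record size they were created with:

  # zfs create tank/pgdata
  # zfs set recordsize=8k tank/pgdata
  # zfs set mountpoint=/var/postgres tank/pgdata   # then run initdb there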
Jeffrey W. Baker wrote:
> # zfs set recordsize=2K tank/bench
> # randomio bigfile 10 .25 .01 2048 60 1
>
> Roughly the same as when the RS was 128K. But, if I set the RS to 2K
> before creating bigfile:
>
>   total |  read:         latency (ms)       |  write:        latency (ms)
>    iops |   iops   min    avg    max   sdev |   iops   min    avg    max   sdev
> --------+-----------------------------------+----------------------------------
>   614.7 |  460.4   0.0   18.5  249.3   14.2 |  154.4   0.0    9.6  989.0   27.6
>
> Much better! Yay! So I assume you would always set RS=8K when using
> PostgreSQL, etc?

I presume these are something like Seagate DB35.3 series SATA 400 GByte
drives? If so, then the spec'ed average read seek time is < 11 ms and the
rotational speed is 7,200 rpm. So the theoretical peak random read rate per
drive is ~66 iops.
http://www.seagate.com/ww/v/index.jsp?vgnextoid=01117ea70fafd010VgnVCM100000dd04090aRCRD&locale=en-US#

For an 8-disk mirrored set, the max theoretical random read rate is 527
iops. I see you're getting 460, so you're at 87% of theoretical. Not bad.

When writing, the max theoretical rate is a little smaller because of the
longer seek time (see datasheet), so we can get ~62 iops per disk. Also, the
total is divided in half because we have to write to both sides of the
mirror. Thus the peak is 248 iops. You see 154, or 62% of peak. Not quite so
good.

But there is another behaviour here which is peculiar to ZFS. All writes are
COW and allocated from free space. But this is done in 1 MByte chunks. For
2 kByte I/Os, that means you need to get to a very high rate before the
workload is spread out across all of the disks simultaneously. You should be
able to see this if you look at iostat with a small interval. For an 8 kByte
recordsize, you should see that it is easier to spread the wealth across all
4 mirrored pairs. For other RAID systems, you can vary the stripe interlace,
usually to much smaller values, to help spread the wealth. It is difficult
to predict how this will affect your application performance, though.

For simultaneous reads and writes, 614 iops is pretty decent, but it makes
me wonder if the spread is much smaller than the full disk.

If the application only does 8 kByte iops, then I wouldn't even bother doing
large, sequential workload testing... you'll never be able to approach that
limit before you run out of some other resource, usually CPU or controller.
 -- richard
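The arithmetic behind those peaks, as a quick sanity check (the ~12 ms write
seek is an assumed figure for this drive class, and the 8-disk/4-mirror
count is the one used above):

  $ awk 'BEGIN {
      rot = 0.5 * 60000 / 7200;                               # avg rotational delay: ~4.2 ms
      print "read iops, 8 disks:",    8 * 1000 / (11 + rot);  # ~527
      print "write iops, mirrored:",  8 * 1000 / (12 + rot) / 2;  # ~248
    }'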
On Thu, 2007-08-30 at 15:28 -0700, Richard Elling wrote:
> I presume these are something like Seagate DB35.3 series SATA 400 GByte
> drives? If so, then the spec'ed average read seek time is < 11 ms and the
> rotational speed is 7,200 rpm. So the theoretical peak random read rate
> per drive is ~66 iops.

400GB 7200.10, which have slightly better seek specs.

> For an 8-disk mirrored set, the max theoretical random read rate is 527
> iops. I see you're getting 460, so you're at 87% of theoretical. Not bad.
>
> When writing, the max theoretical rate is a little smaller because of the
> longer seek time (see datasheet), so we can get ~62 iops per disk. Also,
> the total is divided in half because we have to write to both sides of the
> mirror. Thus the peak is 248 iops. You see 154, or 62% of peak.

I think this line of reasoning is a bit misleading, since the reads and the
writes are happening simultaneously, with a ratio of 3:1 in favor of reads,
and 1% of all writes followed by an fsync. With all writes and no fsyncs,
it's more like this:

   iops |   iops   min    avg    max   sdev |   iops   min    avg    max   sdev
--------+-----------------------------------+----------------------------------
  364.1 |    0.0   Inf   -NaN    0.0   -NaN |  364.1   0.0   27.4 1795.8   69.3

Which is altogether respectable.

> For simultaneous reads and writes, 614 iops is pretty decent, but it makes
> me wonder if the spread is much smaller than the full disk.

Sure it is. 4GiB << 1.2TiB. If I spread it out over 128GiB, it's much
slower, but it seems that would apply to any filesystem.

  190.8 |  143.4   0.0   53.4  254.4   26.6 |   47.4   3.6   49.4  558.8   29.4

-jwb
Hello Jeffrey,

Thursday, August 30, 2007, 9:53:53 PM, you wrote:

JWB> # zfs set recordsize=2K tank/bench
JWB> # randomio bigfile 10 .25 .01 2048 60 1
JWB>
JWB> Roughly the same as when the RS was 128K. But, if I set the RS to 2K
JWB> before creating bigfile:

You have to. If a large file was created with a 128K recordsize then all of
its blocks will be 128K. Only newly written blocks will be 2K - however I'm
not sure if a modified block in the same file will start to be 2K...

-- 
Best regards,
Robert Milkowski                       mailto:rmilkowski at task.gda.pl
                                       http://milek.blogspot.com
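For what it's worth, one way to get an existing file onto the new record
size is to rewrite it after changing the property, since the copy's blocks
are allocated fresh (paths are illustrative):

  # zfs set recordsize=2k tank/bench
  # cp /tank/bench/bigfile /tank/bench/bigfile.new
  # mv /tank/bench/bigfile.new /tank/bench/bigfile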
> Aside from the different kernels and filesystems, I tested internal and
> external journal devices and software and hardware RAIDs. Software RAIDs
> are "raid-10 near2" with 6 disks on Linux. On Solaris the zpool is created
> with three mirrors of two disks each. Hardware RAIDs use the Areca's
> RAID-10 for both Linux and Solaris. Drive caches are disabled throughout,
> but the battery-backed cache on the controller is enabled when using
> hardware RAID.

Apart from the filesystem tests, you may want to repeat the RAID testing
with a combination of three hardware-mirrored pairs striped together using
software. At least under Linux, I would predict you will see performance
increase by 20% or so, due to the combination of the hardware cache and
parallel kernel I/O queues. I don't have experience with a similar
configuration under Solaris, unfortunately.
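A minimal sketch of that hybrid layout on Linux, assuming the Areca exports
the three hardware mirrors as /dev/sda, /dev/sdb and /dev/sdc (device names
and chunk size are illustrative):

  # stripe (RAID-0) across the three controller-managed mirrors
  mdadm --create /dev/md0 --level=0 --chunk=64 --raid-devices=3 \
        /dev/sda /dev/sdb /dev/sdc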