Thanks to the ZFS folks for completing this file system. It's amazing. My question is about performance on what old-style file systems would view as large sequential reads (backups, database full table scans, etc.). Now that the data is not laid out sequentially, what kind of hit should we expect? Specifically, should we avoid ZFS for our large data warehouses or other DSS systems? Thanks!!

-nathan
On Wed, Nov 16, 2005 at 04:31:21PM -0800, nathan wrote:
> Thanks to the ZFS folks for completing this file system. It's amazing.

Thanks. We're glad to get the monkey off our back, so to speak. :)

> My question is about performance on what old-style file systems would view
> as large sequential reads (backups, database full table scans, etc.). Now
> that the data is not laid out sequentially, what kind of hit should we
> expect? Specifically, should we avoid ZFS for our large data warehouses or
> other DSS systems?

We view any performance deficiencies as bugs, and will treat them as such.

To answer your question, the real advantage of ZFS is that a random write workload becomes sequential writes on disk (really fast). As you point out, a sequential read then becomes random reads. The thing to note, however, is that when you're doing sequential access to a file, we can easily predict the next N blocks that will be accessed and send all the I/O requests down to our scheduler to pull them off disk in an efficient and non-random fashion. So it's not as bad as you think.

And there is no way to predict the order of random writes and somehow optimize them. You could hold on to them for a while and hope to batch the I/O that way, but then you're making the worst tradeoff ever: giving up correctness for improved performance.

To help ZFS even further, large database installations have been transitioning the workload they present to disk over time. As memory has gotten larger and cheaper, the I/O the disk subsystem sees has changed from being mostly reads (about 15 years ago), to a read/write mix (about 5-10 years ago), to being mostly writes these days. So the thing you are now most interested in optimizing is random writes, which is what we do really well.

--Bill
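For anyone who wants to watch this behaviour on their own system, the standard observability tools are enough. A rough sketch (the pool name "tank" below is just a placeholder) is to run a random-write workload against a file in the pool and, in another terminal, watch how the I/O actually reaches the disks:

    zpool iostat tank 5    # pool-wide read/write bandwidth at 5 second intervals
    iostat -xn 5           # per-device view; dividing kw/s by w/s gives the average
                           # size of the writes hitting each disk, which should stay
                           # relatively large even when the application is issuing
                           # small random writes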
> > My question is about performance on what old-style file systems would view
> > as large sequential reads (backups, database full table scans, etc.). Now
> > that the data is not laid out sequentially, what kind of hit should we
> > expect? Specifically, should we avoid ZFS for our large data warehouses or
> > other DSS systems?
>
> We view any performance deficiencies as bugs, and will treat them as such.
>
> To answer your question, the real advantage of ZFS is that a random
> write workload becomes sequential writes on disk (really fast). As you
> point out, a sequential read then becomes random reads. The thing to
> note, however, is that when you're doing sequential access to a file,
> we can easily predict the next N blocks that will be accessed and send all
> the I/O requests down to our scheduler to pull them off disk in an
> efficient and non-random fashion. So it's not as bad as you think.

I should've paid more attention to the information you all have already put out on the scheduler!

> And there is no way to predict the order of random writes and somehow
> optimize them. You could hold on to them for a while and hope to batch
> the I/O that way, but then you're making the worst tradeoff ever:
> giving up correctness for improved performance.

Brilliant point. May I quote you when I propose to immediately change our datacenter to OpenSolaris b27 w/ ZFS? ;-)

> To help ZFS even further, large database installations have been
> transitioning the workload they present to disk over time. As memory
> has gotten larger and cheaper, the I/O the disk subsystem sees has
> changed from being mostly reads (about 15 years ago), to a read/write mix
> (about 5-10 years ago), to being mostly writes these days. So the thing
> you are now most interested in optimizing is random writes, which is
> what we do really well.

Unfortunately, the data warehouses I've worked with are mostly read, which it sounds like you may have already figured out. Thanks again for your response.

-nathan

In case it matters, our read workload is: from the source databases to the data warehouse's staging database, I see up to 80MB/s of large, sequential reads from the sources (that temporarily changes the OLTP sources' mix from 80% write to 80% read). After the transform and load into the DW, batch reports on the warehouse generate over 200MB/s of large sequential reads (1MB operations) over a few hours. The warehouse may be writing for the initial load, but after that it's all reads.
Does ZFS reorganize (i.e., defrag) the files over time? If it doesn't, it might not perform well in "write-little, read-much" scenarios (where read performance is much more important than write performance). Thank you.
On Thu, Nov 17, 2005 at 05:21:36AM -0800, Jim Lin wrote:
> Does ZFS reorganize (i.e., defrag) the files over time?

Not yet.

> If it doesn't, it might not perform well in "write-little, read-much"
> scenarios (where read performance is much more important than write
> performance).

As always, the correct answer is "it depends". Let's take a look at several cases:

- Random reads: Whether the data was written randomly or sequentially, random reads are random for any filesystem, regardless of its layout policy. Not much you can do to optimize these, except have the best I/O scheduler possible.

- Sequential writes, sequential reads: With ZFS, sequential writes lead to sequential layout on disk, so sequential reads will perform quite well in this case.

- Random writes, sequential reads: This is the most interesting case. With random writes, ZFS turns them into sequential writes, which go *really* fast. With sequential reads, you know which order the reads are going to be coming in, so you can kick off a bunch of prefetch reads. Again, with a good I/O scheduler (which ZFS just happens to have), you can turn this into good read performance, if not entirely as good as totally sequential.

Believe me, we've thought about this a lot. There is a lot we can do to improve performance, and we're just getting started.

--Bill
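To make the "random writes, sequential reads" case concrete, here is one crude way to set it up on a scratch pool (the pool name, file size, and block size below are arbitrary, and the loop uses bash/ksh arithmetic; the dd loop is a very slow stand-in for a real random-write benchmark, but that doesn't matter because only the read-back times are being compared):

    # write one file purely sequentially, and rewrite a second one at random offsets
    dd if=/dev/zero of=/tank/seq  bs=1024k count=1024
    dd if=/dev/zero of=/tank/rand bs=1024k count=1024
    i=0
    while [ $i -lt 8192 ]; do
        # 8192 random 128k overwrites scattered across the 1GB file
        dd if=/dev/zero of=/tank/rand bs=128k count=1 \
            seek=$((RANDOM % 8192)) conv=notrunc 2>/dev/null
        i=$((i+1))
    done
    # export/import so the reads come from disk rather than the cache,
    # then compare sequential read-back times for the two files
    zpool export tank
    zpool import tank
    time -p dd if=/tank/seq  of=/dev/null bs=1024k
    time -p dd if=/tank/rand of=/dev/null bs=1024k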
Thank you very much. That was very interesting.
Jason Ozolins
2005-Nov-23 06:52 UTC
[zfs-discuss] Re: Re: Old-style sequential read performance
> - Sequential writes, sequential reads: With ZFS, sequential writes
> lead to sequential layout on disk, so sequential reads will
> perform quite well in this case.

I was just testing sequential read performance and saw some really strange behaviour. Read performance from a pool of disk slices was much *lower* than from a raidz pool with the same number of slices, on the same disks!

Here's the setup: two pools, /tank (no redundancy) and /pool (raid-z), created from parallel slices across 4 physical disks:

-bash-3.00# zpool status
  pool: pool
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        pool        ONLINE       0     0     0
          raidz     ONLINE       0     0     0
            c1d0s7  ONLINE       0     0     0
            c2d0s7  ONLINE       0     0     0
            c3d0s7  ONLINE       0     0     0
            c4d0s7  ONLINE       0     0     0

  pool: tank
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          c1d0s3    ONLINE       0     0     0
          c2d0s3    ONLINE       0     0     0
          c3d0s3    ONLINE       0     0     0
          c4d0s3    ONLINE       0     0     0

/pool has a bit of data in it (<10GB), /tank was newly created.

-bash-3.00# time -p dd if=/dev/zero of=/tank/tf bs=1024k count=16384
16384+0 records in
16384+0 records out
real 122.82
user 0.03
sys 12.62
-bash-3.00# time -p dd of=/dev/null if=/tank/tf bs=1024k count=16384
16384+0 records in
16384+0 records out
real 314.02
user 0.07
sys 9.22

I repeated this, and the second run took 305 seconds. Oh well, I think to myself, 52MB/sec reads is kind of okay... what's it like for the raidz pool?

-bash-3.00# time -p dd if=/dev/zero of=/pool/tf bs=1024k count=16384
16384+0 records in
16384+0 records out
real 218.88
user 0.08
sys 32.73
-bash-3.00# time -p dd of=/dev/null if=/pool/tf bs=1024k count=16384
16384+0 records in
16384+0 records out
real 229.50
user 0.08
sys 21.13

So writing is slower, which I can expect, but read speed is a third *faster* than for the plain striped /tank. This seemed odd enough to be worth mentioning.

-Jason =:^)
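For reference, the throughput those timings work out to (16384 MB moved in each run, taking 1MB = 1024k as dd does):

    striped /tank, write:  16384 MB / 122.82 s  ~ 133 MB/sec
    striped /tank, read:   16384 MB / 314.02 s  ~  52 MB/sec
    raid-z  /pool, write:  16384 MB / 218.88 s  ~  75 MB/sec
    raid-z  /pool, read:   16384 MB / 229.50 s  ~  71 MB/sec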
Al Hopper
2005-Nov-23 13:25 UTC
[zfs-discuss] Re: Re: Old-style sequential read performance
On Tue, 22 Nov 2005, Jason Ozolins wrote:

> > - Sequential writes, sequential reads: With ZFS, sequential writes
> > lead to sequential layout on disk, so sequential reads will
> > perform quite well in this case.

.... reformatted ....

> I was just testing sequential read performance and saw some really
> strange behaviour. Read performance from a pool of disk slices was much
> *lower* than from a raidz pool with the same number of slices, on
> the same disks!
>
> Here's the setup: two pools, /tank (no redundancy) and /pool (raid-z),
> created from parallel slices across 4 physical disks:
>
> -bash-3.00# zpool status
>   pool: pool
>  state: ONLINE
>  scrub: none requested
> config:
>
>         NAME        STATE     READ WRITE CKSUM
>         pool        ONLINE       0     0     0
>           raidz     ONLINE       0     0     0
>             c1d0s7  ONLINE       0     0     0
>             c2d0s7  ONLINE       0     0     0
>             c3d0s7  ONLINE       0     0     0
>             c4d0s7  ONLINE       0     0     0
>
>   pool: tank
>  state: ONLINE
>  scrub: none requested
> config:
>
>         NAME        STATE     READ WRITE CKSUM
>         tank        ONLINE       0     0     0
>           c1d0s3    ONLINE       0     0     0
>           c2d0s3    ONLINE       0     0     0
>           c3d0s3    ONLINE       0     0     0
>           c4d0s3    ONLINE       0     0     0
>
> /pool has a bit of data in it (<10GB), /tank was newly created.
>
> -bash-3.00# time -p dd if=/dev/zero of=/tank/tf bs=1024k count=16384
> 16384+0 records in
> 16384+0 records out
> real 122.82
> user 0.03
> sys 12.62
> -bash-3.00# time -p dd of=/dev/null if=/tank/tf bs=1024k count=16384
> 16384+0 records in
> 16384+0 records out
> real 314.02
> user 0.07
> sys 9.22
>
> I repeated this, and the second run took 305 seconds. Oh well, I think
> to myself, 52MB/sec reads is kind of okay... what's it like for the raidz
> pool?
>
> -bash-3.00# time -p dd if=/dev/zero of=/pool/tf bs=1024k count=16384
> 16384+0 records in
> 16384+0 records out
> real 218.88
> user 0.08
> sys 32.73
> -bash-3.00# time -p dd of=/dev/null if=/pool/tf bs=1024k count=16384
> 16384+0 records in
> 16384+0 records out
> real 229.50
> user 0.08
> sys 21.13
>
> So writing is slower, which I can expect, but read speed is a third
> *faster* than for the plain striped /tank. This seemed odd
> enough to be worth mentioning.

Most modern disks have a variable number of sectors per track. If you're on the cylinders close to the outer edge of the disk (platters), then you have a physically longer track than you do if you're on an inner track. The drive maintains a constant number of magnetic transitions per linear distance (bit density), so more sectors are written on the outer tracks than on the inner tracks. Usually the disk is divided into different zones, and the number of sectors per track is the same within each zone. In the old days it used to be 4 or 5 zones, but now there are many more zones (I don't have a good number - but think in terms of 25 or 30).

So the issue with the above test is that you're running it on different cylinder ranges, which have different bit densities and, therefore, will give different results. To make the test meaningful, you'd have to create pool A using some set of slices, test, destroy pool A. Then create pool B using the same set of slices and test again.

Please let us know the results.

PS: look at:
http://www.tomshardware.com/storage/20051117/new_toshiba_sata_drives_good_for_the_mainstream-06.html
and the read or write performance charts. Taking one drive (randomly), for example the Hitachi TravelStar 7K100: write performance is 53MB/sec on the outer cylinders and 26MB/sec on the inner cylinders. It's not unusual to see a (close to) 2:1 ratio in performance between the outermost/innermost cylinders.
Regards,

Al Hopper  Logical Approach Inc, Plano, TX.  al at logical-approach.com
           Voice: 972.379.2133  Fax: 972.379.2134  Timezone: US CDT
OpenSolaris.Org Community Advisory Board (CAB) Member - Apr 2005
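In command terms, the apples-to-apples test suggested above would look something like the following sketch (reusing the s3 slices from the earlier post; zpool destroy/create wipes the pool, so this is only for scratch data):

    zpool destroy tank
    zpool create tank c1d0s3 c2d0s3 c3d0s3 c4d0s3           # pool A: plain stripe
    time -p dd if=/dev/zero of=/tank/tf bs=1024k count=16384
    time -p dd if=/tank/tf of=/dev/null bs=1024k count=16384
    zpool destroy tank
    zpool create tank raidz c1d0s3 c2d0s3 c3d0s3 c4d0s3     # pool B: raid-z on the same slices
    time -p dd if=/dev/zero of=/tank/tf bs=1024k count=16384
    time -p dd if=/tank/tf of=/dev/null bs=1024k count=16384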
Jason Ozolins
2005-Nov-23 23:41 UTC
[zfs-discuss] Re: Re: Old-style sequential read performance
Al Hopper wrote:
> On Tue, 22 Nov 2005, Jason Ozolins wrote:
> .... reformatted ....

Oops, sorry about that - I was using the ZFS forum web page to compose the posting, and it munged the indentation on the "zpool status" output. Thanks for fixing it up!

> Most modern disks have a variable number of sectors per track. [...]
> So the issue with the above test is that you're running it on different
> cylinder ranges, which have different bit densities and, therefore, will
> give different results. To make the test meaningful, you'd have to create
> pool A using some set of slices, test, destroy pool A. Then create pool B
> using the same set of slices and test again.

Fair enough. I realised last night that I should have made explicit that slice 3 on all the disks actually covers a lower cylinder range than slice 7. Given that the component slices of /tank (striped) were all on lower-numbered cylinders than the component slices of /pool (raid-z), I would expect that I/O on /tank should see better media transfer rates than I/O on /pool. The fact that the difference went the other way is part of what surprised me.

> Please let us know the results.

I just tried a couple of experiments. First, because /tank is only 23GB, I wondered if creating a 16GB file was filling the filesystem to the point where the allocation policy might cause the file to be fragmented. Unlikely, but anyway, I removed the old file and tried again with a 4GB file. The machine only has 1GB of memory, so there's unlikely to be much difference due to cache between the 4GB and 16GB sequential read cases.

# time -p dd if=/dev/zero of=/tank/tf bs=1024k count=4096
4096+0 records in
4096+0 records out
real 32.34
user 0.01
sys 6.31
# time -p dd if=/tank/tf bs=1024k count=4096 of=/dev/null
4096+0 records in
4096+0 records out
real 73.45
user 0.01
sys 2.28

Roughly the same read rate (56MB/sec) as for the 16GB test file (52MB/sec). Okay, let's re-create /tank as RAID-Z, using the same slices:

# zpool destroy tank
# zpool create tank raidz c1d0s3 c2d0s3 c3d0s3 c4d0s3
# time -p dd if=/dev/zero of=/tank/tf bs=1024k count=4096
4096+0 records in
4096+0 records out
real 39.48
user 0.02
sys 9.52
# time -p dd if=/tank/tf bs=1024k count=4096 of=/dev/null
4096+0 records in
4096+0 records out
real 49.28
user 0.02
sys 2.91

Now we get a read rate of ~83MB/sec, and it's definitely not down to zoned recording.

Machine details, FWIW:
Socket 754 Athlon 64 3000+ (1MB cache) on an Asus K8N-E motherboard
NForce 3-250 chipset
1GB of memory
4 * Seagate ST3160827AS 160GB SATA drives
c1d0 and c2d0 attached to the NForce controller
c3d0 and c4d0 attached to a Silicon Image 3114 controller

The machine was idle when all these tests were carried out.

> [...] It's not unusual to
> see a (close to) 2:1 ratio in performance between the outermost/innermost
> cylinders.

For sure. I knew about zoned recording (first heard of the concept in 1984 on the Commodore PET floppy drive ;-), but there's something else at work here.

Cheers,
Jason =:^)

--
Jason.Ozolins at anu.edu.au           ANU Supercomputer Facility
APAC Data Grid Program                Leonard Huxley Bldg 56, Mills Road
Ph:  +61 2 6125 5449                  Australian National University
Fax: +61 2 6125 8199                  Canberra, ACT, 0200, Australia