I know there has been a bunch of discussion of various ZFS performance issues, but I
did not see anything specifically on this. In testing a new configuration of an SE-3511
(SATA) array, I ran into an interesting ZFS performance issue. I do not believe that this
is creating a major issue for our end users (though it may be), but it is certainly
impacting our nightly backups. I am only seeing 10-20 MB/sec per thread of random read
throughput using iozone for testing. Here is the full config:

SF-V480
--- 4 x 1.2 GHz III+
--- 16 GB memory
--- Solaris 10U6 with ZFS patch and IDR for the snapshot / resilver bug
SE-3511
--- 12 x 500 GB SATA drives
--- 11 disk R5
--- dual 2 Gbps FC host connection

I have the ARC size limited to 1 GB so that I can test with a rational data set size.
The total amount of data that I am testing with is 3 GB, with a 256 KB record size. I
tested with 1 through 20 threads.

With 1 thread I got the following results:
sequential write: 112 MB/sec.
sequential read: 221 MB/sec.
random write: 96 MB/sec.
random read: 18 MB/sec.

As I scaled the number of threads (and kept the total data size the same) I got the
following (throughput is in MB/sec):

threads    sw     sr     rw     rr
   2      105    218     93     34
   4      106    219     88     52
   8       95    189     69     92
  16       71    153     76    128

As the number of threads climbs, the first three values drop once you get above 4
threads (one per CPU), but the fourth (random read) climbs well past 4 threads. It is
just about linear through 9 threads and then it starts fluctuating, but continues
climbing to at least 20 threads (I did not test past 20). Above 16 threads the random
read even exceeds the sequential read values.

Looking at iostat output for the LUN I am using, in the 1 thread case, for the first
three tests (sequential write, sequential read, random write) I see %b at 100 and actv
climb to 35 and hang out there. For the random read test I see %b at 5 to 7, actv at
less than 1 (usually around 0.5 to 0.6), wsvc_t essentially 0, and asvc_t around 14. As
the number of threads increases, the iostat values don't really change for the first
three tests, but they climb for the random read. The array is close to saturated at
about 170 MB/sec. random read (18 threads), so I know that the 18 MB/sec. value for one
thread is _not_ limited by the array.

I know the 3511 is not a high performance array, but we needed lots of bulk storage and
could not afford better when we bought these 3 years ago. Still, it seems to me that
there is something wrong with the random read performance of ZFS. To test whether this
is an effect of the 3511, I ran some tests on another system we have, as follows:

T2000
--- 32 thread 1 GHz
--- 32 GB memory
--- Solaris 10U8
--- 4 internal 72 GB SAS drives

We have a zpool built of one slice on each of the 4 internal drives configured as a
striped mirror layout (2 vdevs, each of 2 slices), so I/O is spread over all 4 spindles.
I started with 4 threads of 8 GB each (32 GB total, to ensure I got past the ARC; it is
not tuned down on this system). I saw exactly the same ratio of sequential read to
random read (the random read performance was 23% of the sequential read performance in
both cases). Based on iostat values during the test, I am saturating all four drives
with the write operations with just 1 thread. The sequential read saturates the drives
with anything more than 1 thread, and the random read does not saturate the drives until
I get to about 6 threads.

threads    sw     sr     rw     rr
   1      100    207     88     30
   2      103    370     88     53
   4       98    350     90     82
   8      101    434     92     95

I confirmed that the problem is not unique to either 10U6 or the IDR; 10U8 has the same
behavior. I confirmed that the problem is not unique to an FC attached disk array or the
SE-3511 in particular.

Then I went back and took another look at my original data (SF-V480/SE-3511) and looked
at throughput per thread. For the sequential operations and the random write, the
throughput per thread fell pretty far and pretty fast, but the per-thread random read
numbers fell very slowly.

Per thread throughput in MB/sec.

threads    sw     sr     rw     rr
   1      112    221     96     18
   2       53    109     46     17
   4       26     55     22     13
   8       12     24      9     12
  16        5     10      5      8

So this makes me think that the random read performance issue is a per-thread
limitation. Does anyone have any idea why ZFS is not reading as fast as the underlying
storage can handle in the case of random reads? Or am I seeing an artifact of iozone
itself? Is there another benchmark I should be using?

P.S. I posted an OpenOffice.org spreadsheet of my test results here:
http://www.ilk.org/~ppk/Geek/throughput-summary.ods

--
{--------1---------2---------3---------4---------5---------6---------7---------}
Paul Kraus
-> Senior Systems Architect, Garnet River ( http://www.garnetriver.com/ )
-> Sound Coordinator, Schenectady Light Opera Company ( http://www.sloctheater.org/ )
-> Technical Advisor, Lunacon 2010 ( http://www.lunacon.org/ )
-> Technical Advisor, RPI Players
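For readers trying to reproduce this: the thread does not show the exact commands used,
so the following is only a sketch of an equivalent setup. The /etc/system value matches
the 1 GB ARC cap described above; the iozone file names and the split of the 3 GB
working set across threads are illustrative.

    * /etc/system: cap the ARC at 1 GB (0x40000000 bytes), reboot to take effect
    set zfs:zfs_arc_max = 0x40000000

    # iozone throughput-mode run approximating the test described above:
    # write/rewrite, read/reread, and random read/write, 256 KB records,
    # ~3 GB total split across 4 threads (paths are placeholders)
    iozone -i 0 -i 1 -i 2 -r 256k -s 768m -t 4 \
        -F /testpool/io/f1 /testpool/io/f2 /testpool/io/f3 /testpool/io/f4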
Try disabling prefetch.
 -- richard

On Nov 24, 2009, at 6:45 AM, Paul Kraus wrote:
> I know there has been a bunch of discussion of various ZFS performance
> issues, but I did not see anything specifically on this. In testing a
> new configuration of an SE-3511 (SATA) array, I ran into an interesting
> ZFS performance issue.
<snip>
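On Solaris 10 of that era, prefetch is controlled by the zfs_prefetch_disable tunable;
the thread does not say which method was used, but a sketch of the usual two ways:

    # temporarily, on the live system (takes effect immediately)
    echo "zfs_prefetch_disable/W0t1" | mdb -kw

    * persistently, in /etc/system (takes effect after reboot)
    set zfs:zfs_prefetch_disable = 1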
On Tue, Nov 24, 2009 at 11:03 AM, Richard Elling <richard.elling at gmail.com> wrote:
> Try disabling prefetch.

Just tried it... no change in random read (still 17-18 MB/sec for a single thread), but
sequential read performance dropped from about 200 MB/sec. to 100 MB/sec. (as expected).
The test case is a 3 GB file accessed in 256 KB records. The ARC is set to a max of 1 GB
for testing. arcstat.pl shows that the vast majority (>95%) of reads are missing the
cache.

The reason I don't think that this is hitting our end users is that the cache hit ratio
(reported by arc_summary.pl) is 95% on the production system (I am working on our test
system and am the only one using it right now, so all the I/O load is iozone).

I think my next step (beyond more poking with DTrace) is to try a backup and see what I
get for ARC hit ratio ... I expect it to be low, but I may be surprised (then I have to
figure out why backups are as slow as they are). We are using NetBackup and it takes
about 3 days to do a FULL on a 3.3 TB zfs with about 30 million files. Differential
incrementals take 16-22 hours (and almost no data changes). The production server is an
M4000, 4 dual core CPUs, 16 GB memory, and about 25 TB of data overall. A big SAMBA file
server.

--
Paul Kraus
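The observability tools referenced here are the usual ones; a sketch of how they would
be run while the benchmark is going (the 5-second interval is illustrative):

    # sample ARC reads/hits/misses every 5 seconds during the test
    arcstat.pl 5

    # one-shot summary including the overall ARC hit ratio
    arc_summary.pl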
On Tue, 24 Nov 2009, Paul Kraus wrote:
> On Tue, Nov 24, 2009 at 11:03 AM, Richard Elling
> <richard.elling at gmail.com> wrote:
>> Try disabling prefetch.
>
> Just tried it... no change in random read (still 17-18 MB/sec for a
> single thread), but sequential read performance dropped from about 200
> MB/sec. to 100 MB/sec. (as expected). Test case is a 3 GB file
> accessed in 256 KB records. ARC is set to a max of 1 GB for testing.
> arcstat.pl shows that the vast majority (>95%) of reads are missing
> the cache.

You will often see the best random access performance if you access the data using the
same record size that zfs uses. For example, if you request data in 256KB records, but
zfs is using 128KB records, then zfs needs to access, reconstruct, and concatenate two
128K zfs records before it can return any data to the user. This increases the access
latency and decreases opportunity to take advantage of concurrency.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
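Checking and matching the record size is a two-liner; a sketch (the dataset and file
paths are placeholders, and recordsize only affects files written after it is set):

    # see what record size the dataset is actually using (128K is the default)
    zfs get recordsize testpool/iozone

    # re-run write plus random read with the application record size
    # matched to the ZFS recordsize
    iozone -i 0 -i 2 -r 128k -s 3g -f /testpool/iozone/testfile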
more below...

On Nov 24, 2009, at 9:29 AM, Paul Kraus wrote:
> Just tried it... no change in random read (still 17-18 MB/sec for a
> single thread), but sequential read performance dropped from about 200
> MB/sec. to 100 MB/sec. (as expected). Test case is a 3 GB file
> accessed in 256 KB records. ARC is set to a max of 1 GB for testing.
> arcstat.pl shows that the vast majority (>95%) of reads are missing
> the cache.

hmmm... more testing needed. The question is whether the low I/O rate is because of zfs
itself, or the application? Disabling prefetch will expose the application, because zfs
is not creating additional and perhaps unnecessary read I/O.

Your data, which shows the sequential write, random write, and sequential read driving
actv to 35, is because prefetching is enabled for the read. We expect the writes to
drive actv to 35 with a sustained write workload of any flavor. The random read (with
cache misses) will stall the application, so it takes a lot of threads (>>16?) to keep
35 concurrent I/Os in the pipeline without prefetching. The ZFS prefetching algorithm is
"intelligent", so it actually complicates the interpretation of the data.

You're peaking at 658 256KB random IOPS for the 3511, or ~66 IOPS per drive. Since ZFS
will max out at 128KB per I/O, the disks see something more than 66 IOPS each. The IOPS
data from iostat would be a better metric to observe than bandwidth. These drives are
good for about 80 random IOPS each, so you may be close to disk saturation. The iostat
data for IOPS and svc_t will confirm.

The T2000 data (sheet 3) shows pretty consistently around 90 256KB IOPS per drive. Like
the 3511 case, this is perhaps 20% less than I would expect, perhaps due to the
measurement.

Also, the 3511 RAID-5 configuration will perform random reads at around 1/2 of IOPS
capacity if the partition offset is 34. This was the default long ago. The new default
is 256. The reason is that with a 34 block offset, you are almost guaranteed that a
larger I/O will stride 2 disks. You won't notice this as easily with a single thread,
but it will be measurable with more threads. Double check the offset with prtvtoc or
format.

Writes are a completely different matter. ZFS has a tendency to turn random writes into
sequential writes, so it is pretty much useless to look at random write data. The
sequential writes should easily blow through the cache on the 3511. Squinting my eyes, I
would expect the array can do around 70 MB/s writes, or 25 256KB IOPS saturated writes.
By contrast, the T2000 JBOD data shows consistent IOPS at the disk level and exposes the
track cache effect on the sequential read test.

Did I mention that I'm a member of BAARF? www.baarf.com :-)

Hint: for performance work with HDDs, pay close attention to IOPS, then convert to
bandwidth for the PHB.

> The reason I don't think that this is hitting our end users is the
> cache hit ratio (reported by arc_summary.pl) is 95% on the production
> system (I am working on our test system and am the only one using it
> right now, so all the I/O load is iozone).
>
> I think my next step (beyond more poking with DTrace) is to try a
> backup and see what I get for ARC hit ratio ... I expect it to be low,
> but I may be surprised (then I have to figure out why backups are as
> slow as they are). We are using NetBackup and it takes about 3 days to
> do a FULL on a 3.3 TB zfs with about 30 million files. Differential
> incrementals take 16-22 hours (and almost no data changes). The
> production server is an M4000, 4 dual core CPUs, 16 GB memory, and
> about 25 TB of data overall. A big SAMBA file server.

b119 has improved stat() performance, which should make a positive improvement of such
backups. But eventually you may need to move to a multi-stage backup, depending on your
business requirements.
 -- richard
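Checking the slice offset is a one-liner with prtvtoc; a sketch (the device name is a
placeholder for the 3511 LUN):

    # print the label; the "First Sector" column shows the starting offset
    # of each slice (34 vs. 256 is the concern Richard raises above)
    prtvtoc /dev/rdsk/c2t40d0s0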
Richard,

First, thank you for the detailed reply ... (comments inline below)

On Tue, Nov 24, 2009 at 6:31 PM, Richard Elling <richard.elling at gmail.com> wrote:

> hmmm... more testing needed. The question is whether the low
> I/O rate is because of zfs itself, or the application? Disabling
> prefetch will expose the application, because zfs is not creating
> additional and perhaps unnecessary read I/O.

The values reported by iozone are in pretty close agreement with what we are seeing with
iostat during the test runs. Compression is off on zfs (the iozone test data compresses
very well and yields bogus results). I am looking for a good alternative to iozone for
random testing. I did put together a crude script to spawn many dd processes accessing
the block device itself, each with a different seek over the range of the disk, and saw
results much greater than the iozone single threaded random performance.

> Your data which shows the sequential write, random write, and
> sequential read driving actv to 35 is because prefetching is enabled
> for the read. We expect the writes to drive to 35 with a sustained
> write workload of any flavor.

Understood. I tried tuning the queue size to 50 and observed that actv went to 50 (with
very little difference in performance), so returned it to the default of 35.

> The random read (with cache misses) will stall the application, so it
> takes a lot of threads (>>16?) to keep 35 concurrent I/Os in the
> pipeline without prefetching. The ZFS prefetching algorithm is
> "intelligent" so it actually complicates the interpretation of the
> data.

What bothers me is that iostat is showing the 'disk' device as not being saturated
during the random read test. I'll post iostat output that I captured yesterday to
http://www.ilk.org/~ppk/Geek/ You can clearly see the various test phases (sequential
write, rewrite, sequential read, reread, random read, then random write).

> You're peaking at 658 256KB random IOPS for the 3511, or ~66 IOPS per
> drive. Since ZFS will max out at 128KB per I/O, the disks see
> something more than 66 IOPS each. The IOPS data from iostat would be
> a better metric to observe than bandwidth. These drives are good for
> about 80 random IOPS each, so you may be close to disk saturation.
> The iostat data for IOPS and svc_t will confirm.

But ... if I am saturating the 3511 with one thread, then why do I get many times that
performance with multiple threads?

> The T2000 data (sheet 3) shows pretty consistently around 90 256KB
> IOPS per drive. Like the 3511 case, this is perhaps 20% less than I
> would expect, perhaps due to the measurement.

I ran the T2000 test to see if 10U8 behaved better and to make sure I wasn't seeing an
oddity of the 480 / 3511 case. I wanted to see if the random read behavior was similar,
and it was (in relative terms).

> Also, the 3511 RAID-5 configuration will perform random reads at
> around 1/2 IOPS capacity if the partition offset is 34. This was the
> default long ago. The new default is 256.

Our 3511's have been running 421F (latest) for a long time :-) We are religious about
keeping all the 3511 FW current and matched.

> The reason is that with a 34 block offset, you are almost guaranteed
> that a larger I/O will stride 2 disks. You won't notice this as
> easily with a single thread, but it will be measurable with more
> threads. Double check the offset with prtvtoc or format.

How do I check the offset? Output of format -> verify from one of the partitions is
below:

format> ver

Volume name = <        >
ascii name  = <SUN-StorEdge 3511-421F-517.23GB>
bytes/sector       = 512
sectors            = 1084710911
accessible sectors = 1084710878
Part      Tag    Flag     First Sector        Size        Last Sector
  0        usr    wm              256     517.22GB         1084694494
  1 unassigned    wm                0           0                   0
  2 unassigned    wm                0           0                   0
  3 unassigned    wm                0           0                   0
  4 unassigned    wm                0           0                   0
  5 unassigned    wm                0           0                   0
  6 unassigned    wm                0           0                   0
  8   reserved    wm       1084694495        8.00MB        1084710878

format>

> Writes are a completely different matter. ZFS has a tendency to turn
> random writes into sequential writes, so it is pretty much useless to
> look at random write data. The sequential writes should easily blow
> through the cache on the 3511.

I am seeing cache utilization of 25-30% during write tests, with occasional peaks close
to 50%. Which is expected, as I am testing against one partition on one logical drive.

> Squinting my eyes, I would expect the array can do around 70 MB/s
> writes, or 25 256KB IOPS saturated writes.

iostat and the 3511 transfer rate monitor are showing peaks of 150-180 MB/sec with
sustained throughput of 100 MB/sec.

> By contrast, the T2000 JBOD data shows consistent IOPS at the disk
> level and exposes the track cache effect on the sequential read test.

Yup, it is clear that we are easily hitting the read I/O limits of the drives in the
T2000.

> Did I mention that I'm a member of BAARF? www.baarf.com :-)

Not yet :-)

> Hint: for performance work with HDDs, pay close attention to IOPS,
> then convert to bandwidth for the PHB.

PHB ??? I do look at IOPS, but what struck me as odd was the disparate results.

<snip>

> b119 has improved stat() performance, which should make a positive
> improvement of such backups. But eventually you may need to move to a
> multi-stage backup, depending on your business requirements.

Due to contract issues (I am consulting at a government agency), we cannot yet run
OpenSolaris in production.

On our previous server for this application (Apple G5) we had 4 TB of data and about 50
million files (under HFS+) and a full backup took 3 WEEKS. We went the route of
explicitly specifying each directory in the NetBackup config and got _some_ reliability.
Today we have about 22 TB in over 200 ZFS datasets (not evenly distributed,
unfortunately), the largest of which is about 3.5 TB and 30 million files.

BTW, our overall configuration is based on h/w we bought years ago and we are having to
adapt as best we can. We are pushing to replace the SE-3511 arrays with J4400 JBODs. Our
current config has 11 disk R5 sets and 1 hot spare per 3511 tray; we carve up 'standard'
512 GB partitions, which we mirror at the zpool layer across 3511 arrays. We just add
additional mirror pairs as the data in each department grows, keeping the mirrors on
different arrays :-)

More testing results in a separate email, this one is already too long.

--
Paul Kraus
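Paul's "crude script" is not included in the thread; a minimal sketch of the same idea,
many concurrent dd readers starting at pseudo-random offsets on the raw LUN, could look
like this (device name, reader count, and sizes are all illustrative, scaled for a
roughly 500 GB LUN read in 256 KB blocks):

    #!/bin/ksh
    # spawn NPROC dd readers, each at a different pseudo-random offset,
    # to measure concurrent random read throughput from the raw device
    DEV=/dev/rdsk/c2t40d0s0     # placeholder device
    NPROC=8                     # number of concurrent readers
    BS=256k                     # read size per I/O
    COUNT=4096                  # I/Os per reader (~1 GB each)

    i=0
    while [ $i -lt $NPROC ]; do
            # RANDOM is 0-32767; scale into a 256 KB block offset that
            # stays within ~480 GB of the LUN
            SEEK=$(( RANDOM * 60 ))
            dd if=$DEV of=/dev/null bs=$BS iseek=$SEEK count=$COUNT &
            i=$(( i + 1 ))
    done
    wait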
I posted baseline stats at http://www.ilk.org/~ppk/Geek/

The baseline test was 1 thread, 3 GiB file, 64 KiB to 512 KiB record size.

480-3511-baseline.xls is an iozone output file.
iostat-baseline.txt is the iostat output for the device in use (annotated).

I also noted an odd behavior yesterday and have not had a chance to better qualify it. I
was testing various combinations of vdev quantities and mirror quantities.

As I changed the number of vdevs (stripes) from 1 through 8 (all backed by partitions on
the same logical disk on the 3511) there was no real change in sequential write, random
write, or random read performance. Sequential read performance did show a drop from 216
MiB/sec. at 1 vdev to 180 MiB/sec. at 8 vdevs. This was about as expected.

As I changed the number of mirror components things got interesting. Keep in mind that I
only have one 3511 for testing right now; I had to use partitions from two other
production 3511's to get three mirror components on different arrays. As expected, as I
went from 1 to 2 to 3 mirror components the write performance did not change, but the
read performance was interesting... see below:

             read performance
mirrors    sequential       random
   1       174 MiB/sec.      23 MiB/sec.
   2       229 MiB/sec.      30 MiB/sec.
   3       223 MiB/sec.     125 MiB/sec.

What the heck happened here? Going from 1 to 2 mirrors gave a large increase in
sequential read performance, and going from 2 to 3 mirrors gave a HUGE increase in
random read performance. It "feels" like the behavior of the zfs code changed between 2
and 3 mirrors for the random read data.

To investigate further, I tried multiple mirror components on the same array (my test
3511), not that you would do this in production, but I was curious what would happen. In
this case the throughput degraded across the board as I added mirror components, as one
would expect. In the random read case the array was delivering less overall performance
than it was when it was one part of the earlier test (16 MiB/sec. combined vs. 1/3 of
125 MiB/sec.). See sheet 7 of http://www.ilk.org/~ppk/Geek/throughput-summary.ods for
these test results.

Sheet 8 is the last test I did last night, using the NRAID logical disk type to try to
get the 3511 to pass a disk through to zfs but keep the advantage of the cache on the
3511. I'm not sure what to read into those numbers.

--
Paul Kraus
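The pool manipulations behind the mirror-component test are not shown; growing a single
vdev from 1-way to 2-way to 3-way is done with zpool attach, roughly as follows (device
names are placeholders for partitions on the three 3511 arrays):

    # single-device vdev ("1 mirror component")
    zpool create testpool c2t40d0s0

    # attach a second side on a second array -> 2-way mirror
    zpool attach testpool c2t40d0s0 c3t44d0s0

    # attach a third side on a third array -> 3-way mirror
    zpool attach testpool c2t40d0s0 c4t46d0s0

    # wait for the resilver to finish before benchmarking
    zpool status testpool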
If you are using (3) 3511's, then won't it be possible that your 3 GB workload will be
largely or entirely served out of RAID controller cache?

Also, I had a question about your production backups (millions of small files): do you
have atime=off set for the filesystems? That might be helpful.
--
This message posted from opensolaris.org
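Turning atime off is a one-line property change per filesystem (the dataset name here is
a placeholder):

    # stop updating access times on reads; avoids a metadata write per file
    # touched during backups of millions of small files
    zfs set atime=off testpool/department1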
On Wed, Nov 25, 2009 at 7:54 AM, Paul Kraus <pk1048 at gmail.com> wrote:
>> You're peaking at 658 256KB random IOPS for the 3511, or ~66 IOPS per
>> drive. Since ZFS will max out at 128KB per I/O, the disks see
>> something more than 66 IOPS each. The IOPS data from iostat would be
>> a better metric to observe than bandwidth. These drives are good for
>> about 80 random IOPS each, so you may be close to disk saturation.
>> The iostat data for IOPS and svc_t will confirm.
>
> But ... if I am saturating the 3511 with one thread, then why do I get
> many times that performance with multiple threads?

I'm having trouble making sense of the iostat data (I can't tell how many threads are
running at any given point), but I do see lots of times where asvc_t * reads is in the
range of 850 ms to 950 ms. That is, this is as fast as a single threaded app with a
little bit of think time can issue reads (100 reads * 9 ms svc_t + 100 reads * 1 ms
think_time = 1 sec). The %busy shows that 90+% of the time there is an I/O in flight
(100 reads * 9 ms = 900/1000 = 90%). However, %busy isn't aware of how many I/Os could
be in flight simultaneously. When you fire up more threads, you are able to have more
I/Os in flight concurrently.

I don't believe that the IOPS per drive is really a limiting factor in the single
threaded case, as the spec sheet for the 3511 says that it has 1 GB of cache per
controller. Your working set is small enough that it is somewhat likely that many of
those random reads will be served from cache.

A dtrace analysis of just how random the reads are would be interesting. I think that
hotspot.d from the DTrace Toolkit would be a good starting place.

--
Mike Gerdts
http://mgerdts.blogspot.com/
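A back-of-envelope check of that reading (the ~9 ms figure comes from the asvc_t in the
iostat data above; the rest is straight arithmetic): one outstanding read at a time at
roughly 9 ms service time plus ~1 ms of think time allows about 100 reads per second,
and 100 reads/sec of 128 KB ZFS records is about 12.8 MB/s, or roughly 25 MB/s if each
256 KB application record maps to two back-to-back 128 KB reads. Either way it is in the
same ballpark as the 18-30 MB/s single-thread numbers reported earlier, which is
consistent with a per-thread latency bound rather than an array limit.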
more below...

On Nov 25, 2009, at 5:54 AM, Paul Kraus wrote:
> The values reported by iozone are in pretty close agreement with what
> we are seeing with iostat during the test runs. Compression is off on
> zfs (the iozone test data compresses very well and yields bogus
> results). I am looking for a good alternative to iozone for random
> testing. I did put together a crude script to spawn many dd processes
> accessing the block device itself, each with a different seek over the
> range of the disk, and saw results much greater than the iozone single
> threaded random performance.

filebench is usually bundled in /usr/benchmarks or as a pkg. vdbench is easy to use and
very portable, www.vdbench.org

> Understood. I tried tuning the queue size to 50 and observed that actv
> went to 50 (with very little difference in performance), so returned
> it to the default of 35.

Yep, the bottleneck is on the back end (physical HDDs). For arrays with lots of HDDs,
this queue can be deeper, but the 3500 series is way too small to see this. If SSDs are
used on the back end, then you can revisit this.

From the data, it does look like the random read tests are converging on the media
capabilities of the disks in the array. For the array you can see the read-modify-write
penalty of RAID-5 as well as the caching and prefetching of reads.

Note: the physical I/Os are 128 KB, regardless of the iozone size setting. This is
expected, since 128 KB is the default recordsize limit for ZFS.

> What bothers me is that iostat is showing the 'disk' device as not
> being saturated during the random read test. I'll post iostat output
> that I captured yesterday to http://www.ilk.org/~ppk/Geek/

Is this a single thread? Usually this means that you aren't creating enough load. ZFS
won't be prefetching (as much) for a random read workload, so iostat will expose client
bottlenecks.

> How do I check the offset? Output of format -> verify from one of the
> partitions is below:
>
> Part      Tag    Flag     First Sector        Size        Last Sector
>   0        usr    wm              256     517.22GB         1084694494

This is it: First Sector = 256. Good.

> iostat and the 3511 transfer rate monitor are showing peaks of 150-180
> MB/sec with sustained throughput of 100 MB/sec.

[Richard tries to remember if the V480 uses schizo?]
[searching...]
[found it]
Ok, a quick browse shows that the V480 uses two schizo ASICs as the UPA to PCI bridges.
Don't expect more than 200 MB/s from a schizo.
http://www.sun.com/processors/manuals/External_Schizo_PRM.pdf

>> Did I mention that I'm a member of BAARF? www.baarf.com :-)
>
> Not yet :-)
>
>> Hint: for performance work with HDDs, pay close attention to IOPS,
>> then convert to bandwidth for the PHB.
>
> PHB ???

Not a fan of Dilbert? :-)

>> b119 has improved stat() performance, which should make a positive
>> improvement of such backups. But eventually you may need to move to a
>> multi-stage backup, depending on your business requirements.
>
> Due to contract issues (I am consulting at a government agency), we
> cannot yet run OpenSolaris in production.

Look for CR6775100 to be rolled into a Solaris 10 patch. It might take another 6 months
or so, if it gets backported.

> On our previous server for this application (Apple G5) we had 4 TB of
> data and about 50 million files (under HFS+) and a full backup took 3
> WEEKS. We went the route of explicitly specifying each directory in
> the NetBackup config and got _some_ reliability. Today we have about
> 22 TB in over 200 ZFS datasets (not evenly distributed,
> unfortunately), the largest of which is about 3.5 TB and 30 million
> files.

Yep, this is becoming more common as people build larger file systems. I briefly
describe the multistage backup here:
http://richardelling.blogspot.com/2009/08/backups-for-file-systems-with-millions.html
Of course, there are quite a few design details that will vary based on business
requirements...

> BTW, our overall configuration is based on h/w we bought years ago and
> we are having to adapt as best we can. We are pushing to replace the
> SE-3511 arrays with J4400 JBODs. Our current config has 11 disk R5
> sets and 1 hot spare per 3511 tray; we carve up 'standard' 512 GB
> partitions, which we mirror at the zpool layer across 3511 arrays. We
> just add additional mirror pairs as the data in each department grows,
> keeping the mirrors on different arrays :-)

In general, RAID-5 (or raidz) performs poorly for random reads. It gets worse when the
reads are small.
 -- richard
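To make the IOPS-to-bandwidth conversion concrete, using the figures quoted earlier in
the thread: 658 random IOPS at 256 KB is about 658 x 0.25 MB ≈ 165 MB/s, which matches
the ~170 MB/s saturation point Paul saw at 18 threads; spread over the 10 data drives of
an 11-disk RAID-5 set that is roughly 66 IOPS per drive, close to the ~80 random IOPS
per drive Richard cites as the drives' capability.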
more below...

On Nov 25, 2009, at 7:10 AM, Paul Kraus wrote:
> As I changed the number of mirror components things got interesting.
<snip>
>              read performance
> mirrors    sequential       random
>    1       174 MiB/sec.      23 MiB/sec.
>    2       229 MiB/sec.      30 MiB/sec.
>    3       223 MiB/sec.     125 MiB/sec.
>
> What the heck happened here? Going from 1 to 2 mirrors gave a large
> increase in sequential read performance, and going from 2 to 3 mirrors
> gave a HUGE increase in random read performance. It "feels" like the
> behavior of the zfs code changed between 2 and 3 mirrors for the
> random read data.

I can't explain this. It may require a detailed understanding of the hardware
configuration to identify the potential bottleneck. The ZFS mirroring code doesn't care
how many mirrors there are, it just goes through the list. If the performance is not
symmetrical from all sides of the mirror, then YMMV.

> To investigate further, I tried multiple mirror components on the same
> array (my test 3511), not that you would do this in production, but I
> was curious what would happen. In this case the throughput degraded
> across the board as I added mirror components, as one would expect. In
> the random read case the array was delivering less overall performance
> than it was when it was one part of the earlier test (16 MiB/sec.
> combined vs. 1/3 of 125 MiB/sec.). See sheet 7 of
> http://www.ilk.org/~ppk/Geek/throughput-summary.ods for these test
> results. Sheet 8 is the last test I did last night, using the NRAID
> logical disk type to try to get the 3511 to pass a disk through to zfs
> but keep the advantage of the cache on the 3511. I'm not sure what to
> read into those numbers.

I read it as the single array, as configured, with 10+1 RAID-5 can deliver around 130
random read IOPS @ 128 KB.
 -- richard
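For reference, 130 random read IOPS at 128 KB works out to about 16 MB/s, which lines up
with the 16 MiB/sec. combined figure Paul measured when all of the mirror components sat
on the single test array.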