Marko Milisavljevic
2007-May-14 07:53 UTC
[zfs-discuss] Lots of overhead with ZFS - what am I doing wrong?
I was trying to simply test the bandwidth that Solaris/ZFS (Nevada b63) can deliver from a drive, and doing this: dd if=(raw disk) of=/dev/null gives me around 80MB/s, while dd if=(file on ZFS) of=/dev/null gives me only 35MB/s!? I am getting basically the same result whether it is a single zfs drive, a mirror or a stripe (I am testing with two Seagate 7200.10 320G drives hanging off the same interface card).

On the test machine I also have an old disk with UFS on a PATA interface (Seagate 7200.7 120G). dd from the raw disk gives 58MB/s and dd from a file on UFS gives 45MB/s - far less relative slowdown compared to raw disk.

This is just an AthlonXP 2500+ with a 32-bit PCI SATA sil3114 card, but nonetheless the hardware has the bandwidth to fully saturate the hard drive, as seen by dd from the raw disk device. What is going on? Am I doing something wrong, or is ZFS just not designed to be used on humble hardware?

My goal is to have it go fast enough to saturate gigabit ethernet - around 75MB/s. I don't plan on replacing hardware - after all, Linux with RAID10 gives me this already. I was hoping to switch to Solaris/ZFS to get checksums (which wouldn't seem to account for the slowness, because CPU stays under 25% during all this).

I can temporarily scrape together an x64 machine with an ICH7 SATA interface - I'll try the same test with the same drives on that to eliminate 32-bitness and PCI slowness from the equation. And while someone will say dd has little to do with real-life file server performance - it actually has a lot to do with it, because most of the use of this server is to copy multi-gigabyte files to and fro a few times per day. Hardly any random access involved (fragmentation aside).
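For reference, a minimal sketch of the two commands being compared; the device and file names below are placeholders, not the actual paths on this box:

dd if=/dev/dsk/c0d1 of=/dev/null bs=128k count=10000        (raw disk, no filesystem involved)
dd if=/tank/test/bigfile of=/dev/null bs=128k count=10000   (the same amount of data read through a file on ZFS)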
Al Hopper
2007-May-14 13:11 UTC
[zfs-discuss] Lots of overhead with ZFS - what am I doing wrong?
On Mon, 14 May 2007, Marko Milisavljevic wrote:

[ ... reformatted .... ]

> I was trying to simply test bandwidth that Solaris/ZFS (Nevada b63) can
> deliver from a drive, and doing this: dd if=(raw disk) of=/dev/null
> gives me around 80MB/s, while dd if=(file on ZFS) of=/dev/null gives me
> only 35MB/s!?. I am getting basically the same result whether it is
> single zfs drive, mirror or a stripe (I am testing with two Seagate
> 7200.10 320G drives hanging off the same interface card).

Which interface card?

... snip ....

Al Hopper  Logical Approach Inc, Plano, TX.  al at logical-approach.com
           Voice: 972.379.2133 Fax: 972.379.2134  Timezone: US CDT
OpenSolaris Governing Board (OGB) Member - Apr 2005 to Mar 2007
http://www.opensolaris.org/os/community/ogb/ogb_2005-2007/
Richard Elling
2007-May-14 15:57 UTC
[zfs-discuss] Lots of overhead with ZFS - what am I doing wrong?
Marko Milisavljevic wrote:
> I was trying to simply test bandwidth that Solaris/ZFS (Nevada b63) can deliver from a drive, and doing this:
> dd if=(raw disk) of=/dev/null gives me around 80MB/s, while dd if=(file on ZFS) of=/dev/null gives me only 35MB/s!? I am getting basically the same result whether it is single zfs drive, mirror or a stripe (I am testing with two Seagate 7200.10 320G drives hanging off the same interface card).

Checksum is a contributor. AthlonXPs are long in the tooth.
Disable checksum and experiment.
 -- richard

> On the test machine I also have an old disk with UFS on PATA interface (Seagate 7200.7 120G). dd from raw disk gives 58MB/s and dd from file on UFS gives 45MB/s - far less relative slowdown compared to raw disk.
>
> This is just an AthlonXP 2500+ with 32bit PCI SATA sil3114 card, but nonetheless, the hardware has the bandwidth to fully saturate the hard drive, as seen by dd from the raw disk device. What is going on? Am I doing something wrong or is ZFS just not designed to be used on humble hardware?
>
> My goal is to have it go fast enough to saturate gigabit ethernet - around 75MB/s. I don't plan on replacing hardware - after all, Linux with RAID10 gives me this already. I was hoping to switch to Solaris/ZFS to get checksums (which wouldn't seem to account for slowness, because CPU stays under 25% during all this).
>
> I can temporarily scrape together an x64 machine with ICH7 SATA interface - I'll try the same test with same drives on that to eliminate 32-bitness and PCI slowness from the equation. And while someone will say dd has little to do with real-life file server performance - it actually has a lot to do with it, because most of the use of this server is to copy multi-gigabyte files to and fro a few times per day. Hardly any random access involved (fragmentation aside).
Marko Milisavljevic
2007-May-14 20:02 UTC
[zfs-discuss] Re: Lots of overhead with ZFS - what am I doing wrong?
To reply to my own message.... this article offers lots of insight into why dd access directly through the raw disk is fast, while accessing a file through the file system may be slow.

http://www.informit.com/articles/printerfriendly.asp?p=606585&rl=1

So, I guess what I'm wondering now is, does it happen to everyone that ZFS is under half the speed of raw disk access? What speeds are other people getting trying to dd a file through the zfs file system? Something like

dd if=/pool/mount/file of=/dev/null bs=128k (assuming you are using the default ZFS block size)

how does that compare to:

dd if=/dev/dsk/diskinzpool of=/dev/null bs=128k count=10000

If you could please post your MB/s and show the output of zpool status so we can see your disk configuration I would appreciate it. Please use a file that is 100MB or more - the result is too random with small files. Also make sure zfs is not caching the file already!

What I am seeing is that ZFS performance for sequential access is about 45% of raw disk access, while UFS (as well as ext3 on Linux) is around 70%. For a workload consisting mostly of reading large files sequentially, it would seem then that ZFS is the wrong tool performance-wise. But, it could be just my setup, so I would appreciate more data points.
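One way to make sure ZFS is not serving the file from cache (a sketch, assuming the pool can be briefly taken offline; pool and path names are placeholders):

# zpool export mypool        (exporting drops the pool's cached data)
# zpool import mypool
# dd if=/mypool/fs/bigfile of=/dev/null bs=128k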
johansen-osdev at sun.com
2007-May-14 20:43 UTC
[zfs-discuss] Re: Lots of overhead with ZFS - what am I doing wrong?
This certainly isn't the case on my machine.

$ /usr/bin/time dd if=/test/filebench/largefile2 of=/dev/null bs=128k count=10000
10000+0 records in
10000+0 records out

real        1.3
user        0.0
sys         1.2

# /usr/bin/time dd if=/dev/dsk/c0t0d0 of=/dev/null bs=128k count=10000
10000+0 records in
10000+0 records out

real       22.3
user        0.0
sys         2.2

This looks like 56 MB/s on the /dev/dsk and 961 MB/s on the pool.

My pool is configured into a 46 disk RAID-0 stripe. I'm going to omit the zpool status output for the sake of brevity.

> What I am seeing is that ZFS performance for sequential access is
> about 45% of raw disk access, while UFS (as well as ext3 on Linux) is
> around 70%. For workload consisting mostly of reading large files
> sequentially, it would seem then that ZFS is the wrong tool
> performance-wise. But, it could be just my setup, so I would
> appreciate more data points.

This isn't what we've observed in much of our performance testing. It may be a problem with your config, although I'm not an expert on storage configurations. Would you mind providing more details about your controller, disks, and machine setup?

-j
Marko Milisavljevic
2007-May-14 21:41 UTC
[zfs-discuss] Re: Lots of overhead with ZFS - what am I doing wrong?
Thank you for those numbers. I should have mentioned that I was mostly interested in single disk or small array performance, as it is not possible for dd to meaningfully access multiple-disk configurations without going through the file system. I find it curious that there is such a large slowdown by going through the file system (with single drive configuration), especially compared to UFS or ext3.

I simply have a small SOHO server and I am trying to evaluate which OS to use to keep a redundant disk array. With unreliable consumer-level hardware, ZFS and the checksum feature are very interesting and the primary selling point compared to a Linux setup, for as long as ZFS can generate enough bandwidth from the drive array to saturate single gigabit ethernet.

My hardware at the moment is the "wrong" choice for Solaris/ZFS - PCI 3114 SATA controller on a 32-bit AthlonXP, according to many posts I found. However, since dd over raw disk is capable of extracting 75+MB/s from this setup, I keep feeling that surely I must be able to get at least that much from reading a pair of striped or mirrored ZFS drives. But I can't - single drive or 2-drive stripes or mirrors, I only get around 34MB/s going through ZFS. (I made sure mirror was rebuilt and I resilvered the stripes.)

Everything is stock Nevada b63 installation, so I haven't messed it up with misguided tuning attempts. Don't know if it matters, but the test file was created originally from /dev/random. Compression is off, and everything is default. CPU utilization remains low at all times (haven't seen it go over 25%).

On 5/14/07, johansen-osdev at sun.com <johansen-osdev at sun.com> wrote:
>
> This certainly isn't the case on my machine.
>
> $ /usr/bin/time dd if=/test/filebench/largefile2 of=/dev/null bs=128k count=10000
> 10000+0 records in
> 10000+0 records out
>
> real        1.3
> user        0.0
> sys         1.2
>
> # /usr/bin/time dd if=/dev/dsk/c0t0d0 of=/dev/null bs=128k count=10000
> 10000+0 records in
> 10000+0 records out
>
> real       22.3
> user        0.0
> sys         2.2
>
> This looks like 56 MB/s on the /dev/dsk and 961 MB/s on the pool.
>
> My pool is configured into a 46 disk RAID-0 stripe. I'm going to omit
> the zpool status output for the sake of brevity.
>
> > What I am seeing is that ZFS performance for sequential access is
> > about 45% of raw disk access, while UFS (as well as ext3 on Linux) is
> > around 70%. For workload consisting mostly of reading large files
> > sequentially, it would seem then that ZFS is the wrong tool
> > performance-wise. But, it could be just my setup, so I would
> > appreciate more data points.
>
> This isn't what we've observed in much of our performance testing.
> It may be a problem with your config, although I'm not an expert on
> storage configurations. Would you mind providing more details about
> your controller, disks, and machine setup?
>
> -j
Marko Milisavljevic
2007-May-14 22:16 UTC
[zfs-discuss] Re: Lots of overhead with ZFS - what am I doing wrong?
I missed an important conclusion from j's data, and that is that single disk raw access gives him 56MB/s, and RAID 0 array gives him 961/46=21MB/s per disk, which comes in at 38% of potential performance. That is in the ballpark of getting 45% of potential performance, as I am seeing with my puny setup of single or dual drives. Of course, I don't expect a complex file system to match raw disk dd performance, but it doesn't compare favourably to common file systems like UFS or ext3, so the question remains, is ZFS overhead normally this big? That would mean that one needs to have at least a 4-5 way stripe to generate enough data to saturate gigabit ethernet, compared to a 2-3 way stripe on a "lesser" filesystem, a possibly important consideration in a SOHO situation.

On 5/14/07, johansen-osdev at sun.com <johansen-osdev at sun.com> wrote:
>
> This certainly isn't the case on my machine.
>
> $ /usr/bin/time dd if=/test/filebench/largefile2 of=/dev/null bs=128k count=10000
> 10000+0 records in
> 10000+0 records out
>
> real        1.3
> user        0.0
> sys         1.2
>
> # /usr/bin/time dd if=/dev/dsk/c0t0d0 of=/dev/null bs=128k count=10000
> 10000+0 records in
> 10000+0 records out
>
> real       22.3
> user        0.0
> sys         2.2
>
> This looks like 56 MB/s on the /dev/dsk and 961 MB/s on the pool.
>
> My pool is configured into a 46 disk RAID-0 stripe. I'm going to omit
> the zpool status output for the sake of brevity.
>
> > What I am seeing is that ZFS performance for sequential access is
> > about 45% of raw disk access, while UFS (as well as ext3 on Linux) is
> > around 70%. For workload consisting mostly of reading large files
> > sequentially, it would seem then that ZFS is the wrong tool
> > performance-wise. But, it could be just my setup, so I would
> > appreciate more data points.
>
> This isn't what we've observed in much of our performance testing.
> It may be a problem with your config, although I'm not an expert on
> storage configurations. Would you mind providing more details about
> your controller, disks, and machine setup?
>
> -j
Al Hopper
2007-May-14 22:44 UTC
[zfs-discuss] Re: Lots of overhead with ZFS - what am I doing wrong?
On Mon, 14 May 2007, Marko Milisavljevic wrote:

> To reply to my own message.... this article offers lots of insight into why dd access directly through raw disk is fast, while accessing a file through the file system may be slow.
>
> http://www.informit.com/articles/printerfriendly.asp?p=606585&rl=1
>
> So, I guess what I'm wondering now is, does it happen to everyone that ZFS is under half the speed of raw disk access? What speeds are other people getting trying to dd a file through zfs file system? Something like
>
> dd if=/pool/mount/file of=/dev/null bs=128k (assuming you are using default ZFS block size)
>
> how does that compare to:
>
> dd if=/dev/dsk/diskinzpool of=/dev/null bs=128k count=10000
>
> If you could please post your MB/s and show output of zpool status so we
> can see your disk configuration I would appreciate it. Please use a file
> that is 100MB or more - the result is too random with small files. Also
> make sure zfs is not caching the file already!

# ptime dd if=./allhomeal20061209_01.tar of=/dev/null bs=128k count=10000
10000+0 records in
10000+0 records out

real        6.407
user        0.008
sys         1.624

  pool: tank
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            c2t0d0  ONLINE       0     0     0
            c2t1d0  ONLINE       0     0     0
            c2t2d0  ONLINE       0     0     0
            c2t3d0  ONLINE       0     0     0
            c2t4d0  ONLINE       0     0     0

3-way mirror:

10000+0 records in
10000+0 records out

real       12.500
user        0.007
sys         1.216

2-way mirror:

10000+0 records in
10000+0 records out

real       18.356
user        0.006
sys         0.935

# psrinfo -v
Status of virtual processor 0 as of: 05/14/2007 17:31:18
  on-line since 05/03/2007 08:01:21.
  The i386 processor operates at 2009 MHz,
        and has an i387 compatible floating point processor.
Status of virtual processor 1 as of: 05/14/2007 17:31:18
  on-line since 05/03/2007 08:01:24.
  The i386 processor operates at 2009 MHz,
        and has an i387 compatible floating point processor.
Status of virtual processor 2 as of: 05/14/2007 17:31:18
  on-line since 05/03/2007 08:01:26.
  The i386 processor operates at 2009 MHz,
        and has an i387 compatible floating point processor.
Status of virtual processor 3 as of: 05/14/2007 17:31:18
  on-line since 05/03/2007 08:01:28.
  The i386 processor operates at 2009 MHz,
        and has an i387 compatible floating point processor.

> What I am seeing is that ZFS performance for sequential access is about 45% of raw disk access, while UFS (as well as ext3 on Linux) is around 70%. For workload consisting mostly of reading large files sequentially, it would seem then that ZFS is the wrong tool performance-wise. But, it could be just my setup, so I would appreciate more data points.

Regards,

Al Hopper  Logical Approach Inc, Plano, TX.  al at logical-approach.com
           Voice: 972.379.2133 Fax: 972.379.2134  Timezone: US CDT
OpenSolaris Governing Board (OGB) Member - Apr 2005 to Mar 2007
http://www.opensolaris.org/os/community/ogb/ogb_2005-2007/
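For anyone converting the ptime output above into a throughput figure, a quick sketch (10000 records of 128k is 1,280,000 KB, i.e. 1250 MB):

$ echo "scale=1; 10000*128/1024/6.407" | bc       (about 195 MB/s for the 5-disk raidz1 run)
$ echo "scale=1; 10000*128/1024/18.356" | bc      (about 68 MB/s for the 2-way mirror)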
Richard Elling
2007-May-14 22:52 UTC
[zfs-discuss] Re: Lots of overhead with ZFS - what am I doing wrong?
Marko Milisavljevic wrote:
> I missed an important conclusion from j's data, and that is that single
> disk raw access gives him 56MB/s, and RAID 0 array gives him
> 961/46=21MB/s per disk, which comes in at 38% of potential performance.
> That is in the ballpark of getting 45% of potential performance, as I am
> seeing with my puny setup of single or dual drives. Of course, I don't
> expect a complex file system to match raw disk dd performance, but it
> doesn't compare favourably to common file systems like UFS or ext3, so
> the question remains, is ZFS overhead normally this big? That would mean
> that one needs to have at least 4-5 way stripe to generate enough data
> to saturate gigabit ethernet, compared to 2-3 way stripe on a "lesser"
> filesystem, a possibly important consideration in SOHO situation.

Could you post iostat data for these runs?

Also, as I suggested previously, try with checksum off. AthlonXP doesn't have a reputation as a speed demon.

BTW, for 7,200 rpm drives, which are typical in desktops, 56 MBytes/s isn't bad. The media speed will range from perhaps [30-40]-[60-75] MBytes/s judging from a quick scan of disk vendor datasheets. In other words, it would not surprise me to see a 4-5 way stripe being required to keep a GbE saturated.
 -- richard
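For reference, a minimal sketch of the checksum experiment being suggested (the dataset name is a placeholder; the property only affects newly written blocks, so the test file has to be rewritten, and checksums should be turned back on afterwards since they are the point of using ZFS here):

# zfs set checksum=off tank/test
  ... recreate the test file and rerun the dd read ...
# zfs set checksum=on tank/test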
Marko Milisavljevic
2007-May-14 22:53 UTC
[zfs-discuss] Re: Lots of overhead with ZFS - what am I doing wrong?
Thank you, Al.

Would you mind also doing:

ptime dd if=/dev/dsk/c2t1d0 of=/dev/null bs=128k count=10000

to see the raw performance of underlying hardware.

On 5/14/07, Al Hopper <al at logical-approach.com> wrote:
>
> # ptime dd if=./allhomeal20061209_01.tar of=/dev/null bs=128k count=10000
> 10000+0 records in
> 10000+0 records out
>
> real        6.407
> user        0.008
> sys         1.624
>
>   pool: tank
>  state: ONLINE
>  scrub: none requested
> config:
>
>         NAME        STATE     READ WRITE CKSUM
>         tank        ONLINE       0     0     0
>           raidz1    ONLINE       0     0     0
>             c2t0d0  ONLINE       0     0     0
>             c2t1d0  ONLINE       0     0     0
>             c2t2d0  ONLINE       0     0     0
>             c2t3d0  ONLINE       0     0     0
>             c2t4d0  ONLINE       0     0     0
>
> 3-way mirror:
>
> 10000+0 records in
> 10000+0 records out
>
> real       12.500
> user        0.007
> sys         1.216
>
> 2-way mirror:
>
> 10000+0 records in
> 10000+0 records out
>
> real       18.356
> user        0.006
> sys         0.935
Bart Smaalders
2007-May-14 22:59 UTC
[zfs-discuss] Re: Lots of overhead with ZFS - what am I doing wrong?
Marko Milisavljevic wrote:
> I missed an important conclusion from j's data, and that is that single
> disk raw access gives him 56MB/s, and RAID 0 array gives him
> 961/46=21MB/s per disk, which comes in at 38% of potential performance.
> That is in the ballpark of getting 45% of potential performance, as I am
> seeing with my puny setup of single or dual drives. Of course, I don't
> expect a complex file system to match raw disk dd performance, but it
> doesn't compare favourably to common file systems like UFS or ext3, so
> the question remains, is ZFS overhead normally this big? That would mean
> that one needs to have at least 4-5 way stripe to generate enough data
> to saturate gigabit ethernet, compared to 2-3 way stripe on a "lesser"
> filesystem, a possibly important consideration in SOHO situation.

I don't see this on my system, but it has more CPU (dual core 2.6 GHz). It saturates a GB net w/ 4 drives & samba, not working hard at all. A thumper does 2 GB/sec w 2 dual core CPUs.

Do you have compression enabled? This can be a choke point for weak CPUs.

- Bart

Bart Smaalders                    Solaris Kernel Performance
barts at cyber.eng.sun.com        http://blogs.sun.com/barts
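A quick way to confirm whether compression is in play on the dataset being tested (a sketch; the dataset name is a placeholder):

# zfs get compression tank/test      (should report "off" for these tests)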
Ian Collins
2007-May-14 23:15 UTC
[zfs-discuss] Re: Lots of overhead with ZFS - what am I doing wrong?
Marko Milisavljevic wrote:
> To reply to my own message.... this article offers lots of insight into why dd access directly through raw disk is fast, while accessing a file through the file system may be slow.
>
> http://www.informit.com/articles/printerfriendly.asp?p=606585&rl=1
>
> So, I guess what I'm wondering now is, does it happen to everyone that ZFS is under half the speed of raw disk access? What speeds are other people getting trying to dd a file through zfs file system? Something like
>
> dd if=/pool/mount/file of=/dev/null bs=128k (assuming you are using default ZFS block size)
>
> how does that compare to:
>
> dd if=/dev/dsk/diskinzpool of=/dev/null bs=128k count=10000
>
Testing on an old Athlon MP box, two U160 10K SCSI drives.

bash-3.00# time dd if=/dev/dsk/c2t0d0 of=/dev/null bs=128k count=10000
10000+0 records in
10000+0 records out

real    0m44.470s
user    0m0.018s
sys     0m8.290s

time dd if=/test/play/sol-nv-b62-x86-dvd.iso of=/dev/null bs=128k count=10000
10000+0 records in
10000+0 records out

real    0m22.714s
user    0m0.020s
sys     0m3.228s

zpool status
  pool: test
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        test        ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c2t0d0  ONLINE       0     0     0
            c2t1d0  ONLINE       0     0     0

Ian
Marko Milisavljevic
2007-May-14 23:27 UTC
[zfs-discuss] Re: Lots of overhead with ZFS - what am I doing wrong?
Right now, the AthlonXP machine is booted into Linux, and I'm getting the same raw speed as when it is in Solaris, from the PCI Sil3114 with a Seagate 320G (7200.10):

dd if=/dev/sdb of=/dev/null bs=128k count=10000
10000+0 records in
10000+0 records out
1310720000 bytes (1.3 GB) copied, 16.7756 seconds, 78.1 MB/s

sudo dd if=./test.mov of=/dev/null bs=128k count=10000
10000+0 records in
10000+0 records out
1310720000 bytes (1.3 GB) copied, 24.2731 seconds, 54.0 MB/s   <-- some overhead compared to raw speed of same disk above

same machine, onboard ATA, Seagate 120G:

dd if=/dev/hda of=/dev/null bs=128k count=10000
10000+0 records in
10000+0 records out
1310720000 bytes (1.3 GB) copied, 22.5892 seconds, 58.0 MB/s

On another machine with a Pentium D 3.0GHz and ICH7 onboard SATA in AHCI mode, running Darwin OS, from a Seagate 500G (7200.10):

dd if=/dev/rdisk0 of=/dev/null bs=128k count=10000
10000+0 records in
10000+0 records out
1310720000 bytes transferred in 17.697512 secs (74062388 bytes/sec)

same disk, access through the file system (HFS+):

dd if=./Summer\ 2006\ with\ Cohen\ 4 of=/dev/null bs=128k count=10000
10000+0 records in
10000+0 records out
1310720000 bytes transferred in 20.381901 secs (64308035 bytes/sec)   <- very small overhead compared to raw access above!

same Intel machine, Seagate 200G (7200.8, I think):

dd if=/dev/rdisk1 of=/dev/null bs=128k count=10000
10000+0 records in
10000+0 records out
1310720000 bytes transferred in 20.850229 secs (62863578 bytes/sec)

Modern disk drives are definitely fast and pushing close to 80MB/s raw performance. And some file systems can get over 85% of that with simple sequential access. So far, on these particular hardware and software combinations, I have the following filesystem performance as a percentage of raw disk performance for sequential uncached reads:

HFS+: 86%
ext3 and UFS: 70%
ZFS: 45%

On 5/14/07, Richard Elling <Richard.Elling at sun.com> wrote:
>
> Marko Milisavljevic wrote:
> > I missed an important conclusion from j's data, and that is that single
> > disk raw access gives him 56MB/s, and RAID 0 array gives him
> > 961/46=21MB/s per disk, which comes in at 38% of potential performance.
> > That is in the ballpark of getting 45% of potential performance, as I am
> > seeing with my puny setup of single or dual drives. Of course, I don't
> > expect a complex file system to match raw disk dd performance, but it
> > doesn't compare favourably to common file systems like UFS or ext3, so
> > the question remains, is ZFS overhead normally this big? That would mean
> > that one needs to have at least 4-5 way stripe to generate enough data
> > to saturate gigabit ethernet, compared to 2-3 way stripe on a "lesser"
> > filesystem, a possibly important consideration in SOHO situation.
>
> Could you post iostat data for these runs?
>
> Also, as I suggested previously, try with checksum off. AthlonXP doesn't
> have a reputation as a speed demon.
>
> BTW, for 7,200 rpm drives, which are typical in desktops, 56 MBytes/s
> isn't bad. The media speed will range from perhaps [30-40]-[60-75] MBytes/s
> judging from a quick scan of disk vendor datasheets. In other words, it
> would not surprise me to see a 4-5 way stripe being required to keep a
> GbE saturated.
> -- richard
Marko Milisavljevic
2007-May-14 23:39 UTC
[zfs-discuss] Re: Lots of overhead with ZFS - what am I doing wrong?
Thank you, Ian,

You are getting ZFS over a 2-disk RAID-0 to be twice as fast as a dd raw disk read on one disk, which sounds more encouraging. But there is something odd with dd from the raw drive - it is only 28MB/s or so, if I divided that right? I would expect it to be around 100MB/s on 10K drives, or at least that should be roughly the potential throughput rate, compared to the throughput from the ZFS 2-disk RAID-0, which is showing 57MB/s. Any idea why the raw dd read is so slow?

Also, I wonder if everyone is using a different dd command than I am - I get a summary line that shows elapsed time and MB/s.

On 5/14/07, Ian Collins <ian at ianshome.com> wrote:
>
> Marko Milisavljevic wrote:
> > To reply to my own message.... this article offers lots of insight into why dd access directly through raw disk is fast, while accessing a file through the file system may be slow.
> >
> > http://www.informit.com/articles/printerfriendly.asp?p=606585&rl=1
> >
> > So, I guess what I'm wondering now is, does it happen to everyone that ZFS is under half the speed of raw disk access? What speeds are other people getting trying to dd a file through zfs file system? Something like
> >
> > dd if=/pool/mount/file of=/dev/null bs=128k (assuming you are using default ZFS block size)
> >
> > how does that compare to:
> >
> > dd if=/dev/dsk/diskinzpool of=/dev/null bs=128k count=10000
> >
> Testing on an old Athlon MP box, two U160 10K SCSI drives.
>
> bash-3.00# time dd if=/dev/dsk/c2t0d0 of=/dev/null bs=128k count=10000
> 10000+0 records in
> 10000+0 records out
>
> real    0m44.470s
> user    0m0.018s
> sys     0m8.290s
>
> time dd if=/test/play/sol-nv-b62-x86-dvd.iso of=/dev/null bs=128k count=10000
> 10000+0 records in
> 10000+0 records out
>
> real    0m22.714s
> user    0m0.020s
> sys     0m3.228s
>
> zpool status
>   pool: test
>  state: ONLINE
>  scrub: none requested
> config:
>
>         NAME        STATE     READ WRITE CKSUM
>         test        ONLINE       0     0     0
>           mirror    ONLINE       0     0     0
>             c2t0d0  ONLINE       0     0     0
>             c2t1d0  ONLINE       0     0     0
>
> Ian
Nick G
2007-May-15 01:19 UTC
[zfs-discuss] Re: Lots of overhead with ZFS - what am I doing wrong?
Don't know how much this will help, but my results:

Ultra 20 we just got at work:

# uname -a
SunOS unknown 5.10 Generic_118855-15 i86pc i386 i86pc

raw disk:
dd if=/dev/dsk/c1d0s6 of=/dev/null bs=128k count=10000
0.00s user 2.16s system 14% cpu 15.131 total
1,280,000k in 15.131 seconds: 84768k/s

through filesystem:
dd if=testfile of=/dev/null bs=128k count=10000
0.01s user 0.88s system 4% cpu 19.666 total
1,280,000k in 19.666 seconds: 65087k/s

AMD64 Freebsd 7 on a Lenovo something or other, Athlon X2 3800+:

uname -a
FreeBSD 7.0-CURRENT-200705 FreeBSD 7.0-CURRENT-200705 #0: Fri May 11 14:41:37 UTC 2007 root@:/usr/src/sys/amd64/compile/ZFS amd64

raw disk:
dd if=/dev/ad6p1 of=/dev/null bs=128k count=10000
10000+0 records in
10000+0 records out
1310720000 bytes transferred in 17.126926 secs (76529787 bytes/sec) (74735k/s)

filesystem:
# dd of=/dev/null if=testfile bs=128k count=10000
10000+0 records in
10000+0 records out
1310720000 bytes transferred in 17.174395 secs (76318263 bytes/sec) (74529k/s)

Odd to say the least, since "du" for instance is faster on Solaris ZFS...

FWIW Freebsd is running version 6 of ZFS and the unpatched but _new_ Ultra 20 is running version 2 of ZFS according to zdb.

Make sure you're all patched up?
johansen-osdev at sun.com
2007-May-15 01:42 UTC
[zfs-discuss] Re: Lots of overhead with ZFS - what am I doing wrong?
Marko,

I tried this experiment again using 1 disk and got nearly identical times:

# /usr/bin/time dd if=/dev/dsk/c0t0d0 of=/dev/null bs=128k count=10000
10000+0 records in
10000+0 records out

real       21.4
user        0.0
sys         2.4

$ /usr/bin/time dd if=/test/filebench/testfile of=/dev/null bs=128k count=10000
10000+0 records in
10000+0 records out

real       21.0
user        0.0
sys         0.7

> [I]t is not possible for dd to meaningfully access multiple-disk
> configurations without going through the file system. I find it
> curious that there is such a large slowdown by going through file
> system (with single drive configuration), especially compared to UFS
> or ext3.

Comparing a filesystem to raw dd access isn't a completely fair comparison either. Few filesystems actually lay out all of their data and metadata so that every read is a completely sequential read.

> I simply have a small SOHO server and I am trying to evaluate which OS to
> use to keep a redundant disk array. With unreliable consumer-level hardware,
> ZFS and the checksum feature are very interesting and the primary selling
> point compared to a Linux setup, for as long as ZFS can generate enough
> bandwidth from the drive array to saturate single gigabit ethernet.

I would take Bart's recommendation and go with Solaris on something like a dual-core box with 4 disks.

> My hardware at the moment is the "wrong" choice for Solaris/ZFS - PCI 3114
> SATA controller on a 32-bit AthlonXP, according to many posts I found.

Bill Moore lists some controller recommendations here:

http://mail.opensolaris.org/pipermail/zfs-discuss/2006-March/016874.html

> However, since dd over raw disk is capable of extracting 75+MB/s from this
> setup, I keep feeling that surely I must be able to get at least that much
> from reading a pair of striped or mirrored ZFS drives. But I can't - single
> drive or 2-drive stripes or mirrors, I only get around 34MB/s going through
> ZFS. (I made sure mirror was rebuilt and I resilvered the stripes.)

Maybe this is a problem with your controller? What happens when you have two simultaneous dd's to different disks running? This would simulate the case where you're reading from the two disks at the same time.

-j
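A minimal sketch of the simultaneous-read experiment suggested above (the device names are placeholders for the two disks behind the same controller):

# dd if=/dev/dsk/c0d0 of=/dev/null bs=128k count=10000 &
# dd if=/dev/dsk/c0d1 of=/dev/null bs=128k count=10000 &
# wait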
Al Hopper
2007-May-15 03:16 UTC
[zfs-discuss] Re: Lots of overhead with ZFS - what am I doing wrong?
On Mon, 14 May 2007, Marko Milisavljevic wrote:

> Thank you, Al.
>
> Would you mind also doing:
>
> ptime dd if=/dev/dsk/c2t1d0 of=/dev/null bs=128k count=10000
>
> to see the raw performance of underlying hardware.

# ptime dd if=/dev/dsk/c2t1d0 of=/dev/null bs=128k count=10000

real       20.046
user        0.013
sys         3.568

Regards,

Al Hopper
Marko Milisavljevic
2007-May-15 05:48 UTC
[zfs-discuss] Re: Lots of overhead with ZFS - what am I doing wrong?
I am very grateful to everyone who took the time to run a few tests to help me figure out what is going on. As per j's suggestions, I tried some simultaneous reads, and a few other things, and I am getting interesting and confusing results.

All tests are done using two Seagate 320G drives on the sil3114. In each test I am using dd if=.... of=/dev/null bs=128k count=10000. Each drive is freshly formatted with one 2G file copied to it. That way dd from the raw disk and from the file are using roughly the same area of the disk. I tried using raw, zfs and ufs, single drives and two simultaneously (just executing the dd commands in separate terminal windows). These are snapshots of iostat -xnczpm 3 captured somewhere in the middle of the operation. I am not bothering to report CPU% as it never rose over 50%, and was uniformly proportional to reported throughput.

single drive, raw:
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
 1378.4    0.0 77190.7    0.0  0.0  1.7    0.0    1.2   0  98 c0d1

single drive, ufs file:
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
 1255.1    0.0 69949.6    0.0  0.0  1.8    0.0    1.4   0 100 c0d0

A small slowdown, but pretty good.

single drive, zfs file:
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  258.3    0.0 33066.6    0.0 33.0  2.0  127.7    7.7 100 100 c0d1

Now that is odd. Why so much waiting? Also, unlike with raw or UFS, kr/s / r/s gives 128K, as I would imagine it should.

simultaneous raw:
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  797.0    0.0 44632.0    0.0  0.0  1.8    0.0    2.3   0 100 c0d0
  795.7    0.0 44557.4    0.0  0.0  1.8    0.0    2.3   0 100 c0d1

This PCI interface seems to be saturated at 90MB/s. Adequate if the goal is to serve files on a gigabit SOHO network.

simultaneous raw on c0d1 and ufs on c0d0:
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  722.4    0.0 40246.8    0.0  0.0  1.8    0.0    2.5   0 100 c0d0
  717.1    0.0 40156.2    0.0  0.0  1.8    0.0    2.5   0  99 c0d1

Hmm, can no longer get the 90MB/sec.

simultaneous zfs on c0d1 and raw on c0d0:
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0    0.7    0.0    1.8  0.0  0.0    0.0    0.1   0   0 c1d0
  334.9    0.0 18756.0    0.0  0.0  1.9    0.0    5.5   0  97 c0d0
  172.5    0.0 22074.6    0.0 33.0  2.0  191.3   11.6 100 100 c0d1

Everything is slow.

What happens if we throw the onboard IDE interface into the mix?

simultaneous raw SATA and raw PATA:
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
 1036.3    0.3 58033.9    0.3  0.0  1.6    0.0    1.6   0  99 c1d0
 1422.6    0.0 79668.3    0.0  0.0  1.6    0.0    1.1   1  98 c0d0

Both at maximum throughput.

Read ZFS on the SATA drive and raw disk on the PATA interface:
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
 1018.9    0.3 57056.1    4.0  0.0  1.7    0.0    1.7   0  99 c1d0
  268.4    0.0 34353.1    0.0 33.0  2.0  122.9    7.5 100 100 c0d0

SATA is slower with ZFS, as expected by now, but ATA remains at full speed. So they are operating quite independently. Except...

What if we read a UFS file from the PATA disk and ZFS from SATA:
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  792.8    0.0 44092.9    0.0  0.0  1.8    0.0    2.2   1  98 c1d0
  224.0    0.0 28675.2    0.0 33.0  2.0  147.3    8.9 100 100 c0d0

Now that is confusing! Why did SATA/ZFS slow down too? I've retried this a number of times, not a fluke.

Finally, after reviewing all this, I've noticed another interesting bit... whenever I read from raw disks or UFS files, SATA or PATA, kr/s over r/s is 56k, suggesting that the underlying IO system is using that as some kind of a native block size? (even though dd is requesting 128k). But when reading ZFS files, this always comes to 128k, which is expected, since that is the ZFS default (and the same thing happens regardless of bs= in dd).

On the theory that my system just doesn't like 128k reads (I'm desperate!), and that this would explain the whole slowdown and the wait/wsvc_t column, I tried changing recsize to 32k and rewriting the test file. However, accessing ZFS files continues to show 128k reads, and it is just as slow. Is there a way to either confirm that the ZFS file in question is indeed written with 32k records or, even better, to force ZFS to use 56k when accessing the disk? Or perhaps I just misunderstand the implications of the iostat output.

I've repeated each of these tests a few times and double-checked, and the numbers, although snapshots of a point in time, fairly represent averages. I have no idea what to make of all this, except that ZFS has a problem with this hardware/drivers that UFS and other traditional file systems don't. Is it a bug in the driver that ZFS is inadvertently exposing? A specific feature that ZFS assumes the hardware to have, but it doesn't? Who knows! I will have to give up on Solaris/ZFS on this hardware for now, but I hope to try it again sometime in the future. I'll give FreeBSD/ZFS a spin to see if it fares better (although at this point in its development it is probably more risky than just sticking with Linux and missing out on ZFS).

(Another contributor suggested turning checksumming off - it made no difference. Same for atime. Compression was always off.)

On 5/14/07, johansen-osdev at sun.com <johansen-osdev at sun.com> wrote:
>
> Marko,
>
> I tried this experiment again using 1 disk and got nearly identical times:
>
> # /usr/bin/time dd if=/dev/dsk/c0t0d0 of=/dev/null bs=128k count=10000
> 10000+0 records in
> 10000+0 records out
>
> real       21.4
> user        0.0
> sys         2.4
>
> $ /usr/bin/time dd if=/test/filebench/testfile of=/dev/null bs=128k count=10000
> 10000+0 records in
> 10000+0 records out
>
> real       21.0
> user        0.0
> sys         0.7
>
> > [I]t is not possible for dd to meaningfully access multiple-disk
> > configurations without going through the file system. I find it
> > curious that there is such a large slowdown by going through file
> > system (with single drive configuration), especially compared to UFS
> > or ext3.
>
> Comparing a filesystem to raw dd access isn't a completely fair
> comparison either. Few filesystems actually lay out all of their data
> and metadata so that every read is a completely sequential read.
>
> > I simply have a small SOHO server and I am trying to evaluate which OS to
> > use to keep a redundant disk array. With unreliable consumer-level hardware,
> > ZFS and the checksum feature are very interesting and the primary selling
> > point compared to a Linux setup, for as long as ZFS can generate enough
> > bandwidth from the drive array to saturate single gigabit ethernet.
>
> I would take Bart's recommendation and go with Solaris on something like a
> dual-core box with 4 disks.
>
> > My hardware at the moment is the "wrong" choice for Solaris/ZFS - PCI 3114
> > SATA controller on a 32-bit AthlonXP, according to many posts I found.
>
> Bill Moore lists some controller recommendations here:
>
> http://mail.opensolaris.org/pipermail/zfs-discuss/2006-March/016874.html
>
> > However, since dd over raw disk is capable of extracting 75+MB/s from this
> > setup, I keep feeling that surely I must be able to get at least that much
> > from reading a pair of striped or mirrored ZFS drives. But I can't - single
> > drive or 2-drive stripes or mirrors, I only get around 34MB/s going through
> > ZFS. (I made sure mirror was rebuilt and I resilvered the stripes.)
>
> Maybe this is a problem with your controller? What happens when you
> have two simultaneous dd's to different disks running? This would
> simulate the case where you're reading from the two disks at the same
> time.
>
> -j
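Regarding the recsize experiment described above, a sketch of how it is usually done and verified (the dataset name is a placeholder): the recordsize property only applies to blocks written after it is changed, so the test file has to be recreated rather than overwritten in place, and the current setting can be checked with zfs get.

# zfs set recordsize=32k tank/test
# zfs get recordsize tank/test
# cp ~/bigfile /tank/test/bigfile.32k      (the newly written copy picks up the 32k recordsize)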
Nick G
2007-May-15 11:31 UTC
[zfs-discuss] Re: Re: Lots of overhead with ZFS - what am I doing wrong?
> I have no idea what to make of all this, except that ZFS has a problem
> with this hardware/drivers that UFS and other traditional file systems
> don't. Is it a bug in the driver that ZFS is inadvertently exposing?
> A specific feature that ZFS assumes the hardware to have, but it
> doesn't? Who knows! I will have to give up on Solaris/ZFS on this
> hardware for now, but I hope to try it again sometime in the future.
> I'll give FreeBSD/ZFS a spin to see if it fares better (although at
> this point in its development it is probably more risky than just
> sticking with Linux and missing out on ZFS).

If you do give FreeBSD a try, if just for the sake of seeing if ZFS continues to perform badly on your hardware, use the 200705 snapshot or newer, and make sure you turn off the debugging support that is built into -CURRENT by default; ZFS seems to like _fast_ memory.

Make malloc behave like a release:

# cd /etc
# ln -s aj malloc.conf

Rebuild your kernel to disable sanity checks in -CURRENT. You could probably just comment out WITNESS* and INVARIANT*, but I wanted to test the equivalent of a production release system here, so I commented all of it out and recompiled:

#makeoptions    DEBUG=-g                # Build kernel with gdb(1) debug symbols
#options        KDB                     # Enable kernel debugger support.
#options        DDB                     # Support DDB.
#options        GDB                     # Support remote GDB.
#options        INVARIANTS              # Enable calls of extra sanity checking
#options        INVARIANT_SUPPORT       # Extra sanity checks of internal structures, required by INVARIANTS
#options        WITNESS                 # Enable checks to detect deadlocks and cycles
#options        WITNESS_SKIPSPIN        # Don't run witness on spinlocks for speed

Your filesystem/data should be safe on FreeBSD right now since pretty much all of the core ZFS code is the same. That doesn't mean something else won't cause a panic/reboot, since it is a devel branch! You are right to be hesitant to put it into production for a client. If it's just for home use, I say go for it; I've been beating on it for a few days and have been pleasantly surprised. Obviously if you can trigger a panic, you'd want to re-enable debugging if you care to fix it.
Jürgen Keil
2007-May-15 17:13 UTC
[zfs-discuss] Re: Re: Lots of overhead with ZFS - what am I doing wrong?
> Would you mind also doing:
>
> ptime dd if=/dev/dsk/c2t1d0 of=/dev/null bs=128k count=10000
>
> to see the raw performance of underlying hardware.

This dd command is reading from the block device, which might cache data and probably splits requests into "maxphys" pieces (which happens to be 56K on an x86 box).

I'd read from the raw device, /dev/rdsk/c2t1d0 ...
Jonathan Edwards
2007-May-15 17:35 UTC
[zfs-discuss] Re: Re: Lots of overhead with ZFS - what am I doing wrong?
On May 15, 2007, at 13:13, Jürgen Keil wrote:

>> Would you mind also doing:
>>
>> ptime dd if=/dev/dsk/c2t1d0 of=/dev/null bs=128k count=10000
>>
>> to see the raw performance of underlying hardware.
>
> This dd command is reading from the block device,
> which might cache data and probably splits requests
> into "maxphys" pieces (which happens to be 56K on an
> x86 box).

to increase this to say 8MB, add the following to /etc/system:

set maxphys=0x800000

and you'll probably want to increase sd_max_xfer_size as well (should be 256K on x86/x64) .. add the following to /kernel/drv/sd.conf:

sd_max_xfer_size=0x800000;

then reboot to get the kernel and sd tunings to take.

---
.je

btw - the defaults on sparc:
maxphys = 128K
ssd_max_xfer_size = maxphys
sd_max_xfer_size = maxphys
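A quick way to confirm the new maxphys value took effect after the reboot (a sketch using mdb against the running kernel, which prints the variable as a decimal):

# echo "maxphys/D" | mdb -k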
johansen-osdev at sun.com
2007-May-15 21:03 UTC
[zfs-discuss] Re: Lots of overhead with ZFS - what am I doing wrong?
> Each drive is freshly formatted with one 2G file copied to it.

How are you creating each of these files?

Also, would you please include the output from the isalist(1) command?

> These are snapshots of iostat -xnczpm 3 captured somewhere in the
> middle of the operation.

Have you double-checked that this isn't a measurement problem by measuring zfs with zpool iostat (see zpool(1M)) and verifying that outputs from both iostats match?

> single drive, zfs file:
>     r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
>   258.3    0.0 33066.6    0.0 33.0  2.0  127.7    7.7 100 100 c0d1
>
> Now that is odd. Why so much waiting? Also, unlike with raw or UFS, kr/s /
> r/s gives 128K, as I would imagine it should.

Not sure. If we can figure out why ZFS is slower than raw disk access in your case, it may explain why you're seeing these results.

> What if we read a UFS file from the PATA disk and ZFS from SATA:
>     r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
>   792.8    0.0 44092.9    0.0  0.0  1.8    0.0    2.2   1  98 c1d0
>   224.0    0.0 28675.2    0.0 33.0  2.0  147.3    8.9 100 100 c0d0
>
> Now that is confusing! Why did SATA/ZFS slow down too? I've retried this a
> number of times, not a fluke.

This could be cache interference. ZFS and UFS use different caches. How much memory is in this box?

> I have no idea what to make of all this, except that ZFS has a problem
> with this hardware/drivers that UFS and other traditional file systems
> don't. Is it a bug in the driver that ZFS is inadvertently exposing? A
> specific feature that ZFS assumes the hardware to have, but it doesn't?
> Who knows!

This may be a more complicated interaction than just ZFS and your hardware. There are a number of layers of drivers underneath ZFS that may also be interacting with your hardware in an unfavorable way.

If you'd like to do a little poking with MDB, we can see the features that your SATA disks claim they support. As root, type mdb -k, and then at the ">" prompt that appears, enter the following command (this is one very long line):

*sata_hba_list::list sata_hba_inst_t satahba_next | ::print sata_hba_inst_t satahba_dev_port | ::array void* 32 | ::print void* | ::grep ".!=0" | ::print sata_cport_info_t cport_devp.cport_sata_drive | ::print -a sata_drive_info_t satadrv_features_support satadrv_settings satadrv_features_enabled

This should show satadrv_features_support, satadrv_settings, and satadrv_features_enabled for each SATA disk on the system.

The values for these variables are defined in:

http://cvs.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/sys/sata/impl/sata.h

This is the relevant snippet for interpreting these values:

/*
 * Device feature_support (satadrv_features_support)
 */
#define SATA_DEV_F_DMA                  0x01
#define SATA_DEV_F_LBA28                0x02
#define SATA_DEV_F_LBA48                0x04
#define SATA_DEV_F_NCQ                  0x08
#define SATA_DEV_F_SATA1                0x10
#define SATA_DEV_F_SATA2                0x20
#define SATA_DEV_F_TCQ                  0x40    /* Non NCQ tagged queuing */

/*
 * Device features enabled (satadrv_features_enabled)
 */
#define SATA_DEV_F_E_TAGGED_QING        0x01    /* Tagged queuing enabled */
#define SATA_DEV_F_E_UNTAGGED_QING      0x02    /* Untagged queuing enabled */

/*
 * Drive settings flags (satdrv_settings)
 */
#define SATA_DEV_READ_AHEAD             0x0001  /* Read Ahead enabled */
#define SATA_DEV_WRITE_CACHE            0x0002  /* Write cache ON */
#define SATA_DEV_SERIAL_FEATURES        0x8000  /* Serial ATA feat. enabled */
#define SATA_DEV_ASYNCH_NOTIFY          0x2000  /* Asynch-event enabled */

This may give us more information if this is indeed a problem with hardware/drivers supporting the right features.

-j
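As an illustration of decoding those flag words (the value below is hypothetical, not output from any particular machine): a satadrv_features_support of 0x3f would decode, using the masks above, as DMA | LBA28 | LBA48 | NCQ | SATA1 | SATA2, and individual bits can be checked with shell arithmetic:

$ echo $(( 0x3f & 0x08 ))      (a non-zero result means the NCQ bit is set)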
Matthew Ahrens
2007-May-16 02:10 UTC
[zfs-discuss] Lots of overhead with ZFS - what am I doing wrong?
Marko Milisavljevic wrote:
> I was trying to simply test bandwidth that Solaris/ZFS (Nevada b63) can
> deliver from a drive, and doing this: dd if=(raw disk) of=/dev/null gives
> me around 80MB/s, while dd if=(file on ZFS) of=/dev/null gives me only
> 35MB/s!?

Our experience is that ZFS gets very close to raw performance for streaming reads (assuming that there is adequate CPU and memory available).

When doing reads, prefetching (and thus caching) is a critical component of performance. It may be that ZFS's prefetching or caching is misbehaving somehow.

Your machine is 32-bit, right? This could be causing some caching pain... How much memory do you have? While you're running the test on ZFS, can you send the output of:

echo ::memstat | mdb -k
echo ::arc | mdb -k

Next, try running your test with prefetch disabled, by putting

set zfs:zfs_prefetch_disable=1

in /etc/system and rebooting before running your test. Send the 'iostat -xnpcz' output while this test is running.

Finally, on modern drives the streaming performance can vary by up to 2x when reading the outside vs. the inside of the disk. If your pool had been used before you created your test file, it could be laid out on the inside part of the disk. Then you would be comparing raw reads of the outside of the disk vs. zfs reads of the inside of the disk. When the pool is empty, ZFS will start allocating from the outside, so you can try destroying and recreating your pool and creating the file on the fresh pool. Alternatively, create a small partition (say, 10% of the disk size) and do your tests on that to ensure that the file is not far from where your raw reads are going.

Let us know how that goes.

--matt
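A minimal sketch of the fresh-pool test described above (pool, device, and file names are placeholders; zpool destroy erases the pool's contents, so only do this on a scratch pool):

# zpool destroy tank
# zpool create tank c0d1
# zfs create tank/test
# dd if=/dev/urandom of=/tank/test/bigfile bs=128k count=16000      (roughly a 2 GB file near the outer edge of the fresh pool)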
Marko Milisavljevic
2007-May-16 05:09 UTC
[zfs-discuss] Lots of overhead with ZFS - what am I doing wrong?
Hello Matthew,

Yes, my machine is 32-bit, with 1.5G of RAM.

-bash-3.00# echo ::memstat | mdb -k
Page Summary                Pages                MB  %Tot
------------     ----------------  ----------------  ----
Kernel                     123249               481   32%
Anon                        33704               131    9%
Exec and libs                7637                29    2%
Page cache                   1116                 4    0%
Free (cachelist)           222661               869   57%
Free (freelist)              2685                10    1%

Total                      391052              1527
Physical                   391051              1527

-bash-3.00# echo ::arc | mdb -k
{
    anon = -759566176
    mru = -759566136
    mru_ghost = -759566096
    mfu = -759566056
    mfu_ghost = -759566016
    size = 0x17f20c00
    p = 0x160ef900
    c = 0x17f16ae0
    c_min = 0x4000000
    c_max = 0x1da00000
    hits = 0x353b
    misses = 0x264b
    deleted = 0x13bc
    recycle_miss = 0x31
    mutex_miss = 0
    evict_skip = 0
    hash_elements = 0x127b
    hash_elements_max = 0x1a19
    hash_collisions = 0x61
    hash_chains = 0x4c
    hash_chain_max = 0x1
    no_grow = 1
}

Now let's try:

set zfs:zfs_prefetch_disable=1

Bingo!

    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  609.0    0.0 77910.0    0.0  0.0  0.8    0.0    1.4   0  83 c0d0

Only 1-2% slower than dd from /dev/dsk. Do you think this is a general 32-bit problem, or specific to this combination of hardware? I am using a PCI/SATA Sil3114 card, and other than ZFS, performance of this interface has some limitations in Solaris. That is, a single drive gives 80MB/s, but doing dd /dev/dsk/xyz simultaneously on 2 drives attached to the card gives only 46MB/s each. On Linux, however, that gives 60MB/s each, close to saturating the theoretical throughput of the PCI bus. Having both drives in a zpool stripe gives, with prefetch disabled, close to 45MB/s each through dd from a zfs file. I think that under Solaris, this card is accessed through the ATA driver.

There shouldn't be any issues on inside vs outside. All the reading is done on the first gig or two of the drive, as there is nothing else on them, except one 2 gig file. (Well, I'm assuming a simple copy onto a newly formatted zfs drive puts it at the start of the drive.) Drives are completely owned by ZFS, using zpool create c0d0 c0d1.

Finally, should I file a bug somewhere regarding prefetch, or is this a known issue?

Many thanks.

On 5/15/07, Matthew Ahrens <Matthew.Ahrens at sun.com> wrote:
> Marko Milisavljevic wrote:
> > I was trying to simply test bandwidth that Solaris/ZFS (Nevada b63) can
> > deliver from a drive, and doing this: dd if=(raw disk) of=/dev/null gives
> > me around 80MB/s, while dd if=(file on ZFS) of=/dev/null gives me only
> > 35MB/s!?
>
> Our experience is that ZFS gets very close to raw performance for streaming
> reads (assuming that there is adequate CPU and memory available).
>
> When doing reads, prefetching (and thus caching) is a critical component of
> performance. It may be that ZFS's prefetching or caching is misbehaving somehow.
>
> Your machine is 32-bit, right? This could be causing some caching pain...
> How much memory do you have? While you're running the test on ZFS, can you
> send the output of:
>
> echo ::memstat | mdb -k
> echo ::arc | mdb -k
>
> Next, try running your test with prefetch disabled, by putting
>         set zfs:zfs_prefetch_disable=1
> in /etc/system and rebooting before running your test. Send the 'iostat
> -xnpcz' output while this test is running.
>
> Finally, on modern drives the streaming performance can vary by up to 2x when
> reading the outside vs. the inside of the disk. If your pool had been used
> before you created your test file, it could be laid out on the inside part of
> the disk. Then you would be comparing raw reads of the outside of the disk
> vs. zfs reads of the inside of the disk. When the pool is empty, ZFS will
> start allocating from the outside, so you can try destroying and recreating
> your pool and creating the file on the fresh pool. Alternatively, create a
> small partition (say, 10% of the disk size) and do your tests on that to
> ensure that the file is not far from where your raw reads are going.
>
> Let us know how that goes.
>
> --matt
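As a side note, the same tunable can usually be flipped on a live system without a reboot, assuming the running zfs module exposes zfs_prefetch_disable as a 32-bit variable as it does in these Nevada builds (a sketch; mdb -kw writes live kernel memory, so use with care):

# echo "zfs_prefetch_disable/W0t1" | mdb -kw      (disable prefetch)
# echo "zfs_prefetch_disable/W0t0" | mdb -kw      (re-enable it)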
Marko Milisavljevic
2007-May-16 05:14 UTC
[zfs-discuss] Re: Re: Lots of overhead with ZFS - what am I doing wrong?
I tried as you suggested, but I notice that output from iostat while doing dd if=/dev/dsk/... still shows that reading is done in 56k chunks. I haven't seen any change in performance. Perhaps iostat doesn't say what I think it does. Using dd if=/dev/rdsk/.. gives 256k, and dd if=zfsfile gives 128k read sizes.

On 5/15/07, Jonathan Edwards <Jonathan.Edwards at sun.com> wrote:
>
> On May 15, 2007, at 13:13, Jürgen Keil wrote:
>
> >> Would you mind also doing:
> >>
> >> ptime dd if=/dev/dsk/c2t1d0 of=/dev/null bs=128k count=10000
> >>
> >> to see the raw performance of underlying hardware.
> >
> > This dd command is reading from the block device,
> > which might cache data and probably splits requests
> > into "maxphys" pieces (which happens to be 56K on an
> > x86 box).
>
> to increase this to say 8MB, add the following to /etc/system:
>
> set maxphys=0x800000
>
> and you'll probably want to increase sd_max_xfer_size as
> well (should be 256K on x86/x64) .. add the following to
> /kernel/drv/sd.conf:
>
> sd_max_xfer_size=0x800000;
>
> then reboot to get the kernel and sd tunings to take.
>
> ---
> .je
>
> btw - the defaults on sparc:
> maxphys = 128K
> ssd_max_xfer_size = maxphys
> sd_max_xfer_size = maxphys
Marko Milisavljevic
2007-May-16 05:41 UTC
[zfs-discuss] Re: Lots of overhead with ZFS - what am I doing wrong?
On 5/15/07, johansen-osdev at sun.com <johansen-osdev at sun.com> wrote:
> > Each drive is freshly formatted with one 2G file copied to it.
>
> How are you creating each of these files?

zpool create tank c0d0 c0d1; zfs create tank/test; cp ~/bigfile /tank/test/

Actual content of the file is random junk from /dev/random.

> Also, would you please include the output from the isalist(1) command?

pentium_pro+mmx pentium_pro pentium+mmx pentium i486 i386 i86

> Have you double-checked that this isn't a measurement problem by
> measuring zfs with zpool iostat (see zpool(1M)) and verifying that
> outputs from both iostats match?

Both give the same kb/s.

> How much memory is in this box?

1.5g, I can see in /var/adm/messages that it is recognized.

> As root, type mdb -k, and then at the ">" prompt that appears, enter the
> following command (this is one very long line):
>
> *sata_hba_list::list sata_hba_inst_t satahba_next | ::print sata_hba_inst_t satahba_dev_port | ::array void* 32 | ::print void* | ::grep ".!=0" | ::print sata_cport_info_t cport_devp.cport_sata_drive | ::print -a sata_drive_info_t satadrv_features_support satadrv_settings satadrv_features_enabled

This gives me "mdb: failed to dereference symbol: unknown symbol name". I don't know enough about the syntax here to try to isolate which token it is complaining about. But, I don't know if my PCI/SATA card is going through the sd driver, if that is what the commands above assume... my understanding is that the sil3114 goes through the ata driver, as per this blog:

http://blogs.sun.com/mlf/entry/ata_on_solaris_x86_at

If there is any other testing I can do, I would be happy to.
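One way to check which driver the controller is actually bound to (a sketch; the grep pattern is a guess at how the Silicon Image device shows up in the device tree):

# prtconf -D | grep -i -e sil -e ide -e sata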
Marko Milisavljevic
2007-May-16 09:47 UTC
[zfs-discuss] Lots of overhead with ZFS - what am I doing wrong?
Got excited too quickly on one thing... reading a single zfs file does give me almost the same speed as dd /dev/dsk... around 78MB/s... however, creating a 2-drive stripe still doesn't perform as well as it ought to:

    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  294.3    0.0 37675.6    0.0  0.0  0.4    0.0    1.4   0  40 c3d0
  293.0    0.0 37504.9    0.0  0.0  0.4    0.0    1.4   0  40 c3d1

Simultaneous dd on those 2 drives from /dev/dsk runs at 46MB/s per drive.

    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  800.4    0.0 44824.6    0.0  0.0  1.8    0.0    2.2   0  99 c3d0
  792.1    0.0 44357.9    0.0  0.0  1.8    0.0    2.2   0  98 c3d1

(and in Linux it saturates the PCI bus at 60MB/s per drive)

On 5/15/07, Marko Milisavljevic <marko at cognistudio.com> wrote:
>
> set zfs:zfs_prefetch_disable=1
>
> Bingo!
>
>     r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
>   609.0    0.0 77910.0    0.0  0.0  0.8    0.0    1.4   0  83 c0d0
>
> Only 1-2% slower than dd from /dev/dsk. Do you think this is a general
> 32-bit problem, or specific to this combination of hardware? I am
> using a PCI/SATA Sil3114 card, and other than ZFS, performance of this
> interface has some limitations in Solaris. That is, a single drive gives
> 80MB/s, but doing dd /dev/dsk/xyz simultaneously on 2 drives attached
> to the card gives only 46MB/s each. On Linux, however, that gives
> 60MB/s each, close to saturating the theoretical throughput of the PCI bus.
> Having both drives in a zpool stripe gives, with prefetch disabled,
> close to 45MB/s each through dd from a zfs file.
Matthew Ahrens
2007-May-16 16:29 UTC
[zfs-discuss] Lots of overhead with ZFS - what am I doing wrong?
Marko Milisavljevic wrote:> Got excited too quickly on one thing... reading single zfs file does > give me almost same speed as dd /dev/dsk... around 78MB/s... however, > creating a 2-drive stripe, still doesn''t perform as well as it ought to:Yes, that makes sense. Because prefetch is disabled, ZFS will only issue one read i/o at a time (for that stream). This is one of the reasons prefetch is important :-) Eg, in your output below you can see that each disk is only busy 40% of the time when using ZFS with no prefetch:> r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device > 294.3 0.0 37675.6 0.0 0.0 0.4 0.0 1.4 0 40 c3d0 > 293.0 0.0 37504.9 0.0 0.0 0.4 0.0 1.4 0 40 c3d1 > > Simultaneous dd on those 2 drives from /dev/dsk runs at 46MB/s per drive. > r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device > 800.4 0.0 44824.6 0.0 0.0 1.8 0.0 2.2 0 99 c3d0 > 792.1 0.0 44357.9 0.0 0.0 1.8 0.0 2.2 0 98 c3d1--matt
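The 40% figure can be sanity-checked from the iostat columns themselves: with prefetch off there is at most one 128k read in flight per disk, and 294 reads/s x 1.4 ms of service time works out to about 0.4 I/Os outstanding on average, which matches the actv column and the roughly 40 %b above (294 reads/s x 128k also reproduces the ~37.6 MB/s in kr/s). In the raw dd case about 1.8 I/Os stay queued per disk, so both disks run at essentially 100 %b.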
Matthew Ahrens
2007-May-16 16:32 UTC
[zfs-discuss] Lots of overhead with ZFS - what am I doing wrong?
Marko Milisavljevic wrote:> now lets try: > set zfs:zfs_prefetch_disable=1 > > bingo! > > r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device > 609.0 0.0 77910.0 0.0 0.0 0.8 0.0 1.4 0 83 c0d0 > > only 1-2 % slower then dd from /dev/dsk. Do you think this is general > 32-bit problem, or specific to this combination of hardware?I suspect that it''s fairly generic, but more analysis will be necessary.> Finally, should I file a bug somewhere regarding prefetch, or is this > a known issue?It may be related to 6469558, but yes please do file another bug report. I''ll have someone on the ZFS team take a look at it. --matt
Marko Milisavljevic
2007-May-16 17:06 UTC
[zfs-discuss] Lots of overhead with ZFS - what am I doing wrong?
I will do that, but I''ll do a couple of things first, to try to isolate the problem more precisely: - Use ZFS on a plain PATA drive on onboard IDE connector to see if it works with prefetch on this 32-bit machine. - Use this PCI-SATA card in a 64-bit, 2g RAM machine and see how it performs there, and also compare it to that machine''s onboard ICH7 SATA interface (I assume I can force it to use AHCI drivers or not by changing the mode of operation for ICH7 in BIOS). Marko On 5/16/07, Matthew Ahrens <Matthew.Ahrens at sun.com> wrote:> > > > Finally, should I file a bug somewhere regarding prefetch, or is this > > a known issue? > > It may be related to 6469558, but yes please do file another bug report. > I''ll have someone on the ZFS team take a look at it. > > --matt >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20070516/023d80f5/attachment.html>
johansen-osdev at sun.com
2007-May-16 17:26 UTC
[zfs-discuss] Re: Lots of overhead with ZFS - what am I doing wrong?
> >*sata_hba_list::list sata_hba_inst_t satahba_next | ::print > >sata_hba_inst_t satahba_dev_port | ::array void* 32 | ::print void* | > >::grep ".!=0" | ::print sata_cport_info_t cport_devp.cport_sata_drive | > >::print -a sata_drive_info_t satadrv_features_support satadrv_settings > >satadrv_features_enabled> This gives me "mdb: failed to dereference symbol: unknown symbol > name".You may not have the SATA module installed. If you type: ::modinfo ! grep sata and don''t get any output, your sata driver is attached some other way. My apologies for the confusion. -K
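The same check can be run non-interactively from a root shell, and prtconf can show which driver each controller is actually bound to; a sketch using only stock commands:

    # list loaded kernel modules and look for the sata framework
    echo "::modinfo" | mdb -k | grep sata
    # show driver bindings in the device tree; a card attached through
    # the legacy ata driver should appear with that driver name
    prtconf -D | grep ata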
johansen-osdev at sun.com
2007-May-16 18:38 UTC
[zfs-discuss] Lots of overhead with ZFS - what am I doing wrong?
At Matt''s request, I did some further experiments and have found that this appears to be particular to your hardware. This is not a general 32-bit problem. I re-ran this experiment on a 1-disk pool using a 32 and 64-bit kernel. I got identical results: 64-bit ===== $ /usr/bin/time dd if=/testpool1/filebench/testfile of=/dev/null bs=128k count=10000 10000+0 records in 10000+0 records out real 20.1 user 0.0 sys 1.2 62 Mb/s # /usr/bin/time dd if=/dev/dsk/c1t3d0 of=/dev/null bs=128k count=10000 10000+0 records in 10000+0 records out real 19.0 user 0.0 sys 2.6 65 Mb/s 32-bit ===== /usr/bin/time dd if=/testpool1/filebench/testfile of=/dev/null bs=128k count=10000 10000+0 records in 10000+0 records out real 20.1 user 0.0 sys 1.7 62 Mb/s # /usr/bin/time dd if=/dev/dsk/c1t3d0 of=/dev/null bs=128k count=10000 10000+0 records in 10000+0 records out real 19.1 user 0.0 sys 4.3 65 Mb/s -j On Wed, May 16, 2007 at 09:32:35AM -0700, Matthew Ahrens wrote:> Marko Milisavljevic wrote: > >now lets try: > >set zfs:zfs_prefetch_disable=1 > > > >bingo! > > > > r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device > > 609.0 0.0 77910.0 0.0 0.0 0.8 0.0 1.4 0 83 c0d0 > > > >only 1-2 % slower then dd from /dev/dsk. Do you think this is general > >32-bit problem, or specific to this combination of hardware? > > I suspect that it''s fairly generic, but more analysis will be necessary. > > >Finally, should I file a bug somewhere regarding prefetch, or is this > >a known issue? > > It may be related to 6469558, but yes please do file another bug report. > I''ll have someone on the ZFS team take a look at it. > > --matt > _______________________________________________ > zfs-discuss mailing list > zfs-discuss at opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
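For context on the throughput figures: 10000 records x 128 KiB is 1250 MiB, so about 20 seconds of elapsed time corresponds to the quoted ~62 MB/s and about 19 seconds to ~65 MB/s. The arithmetic, if anyone wants to repeat it for other runs:

    # 10000 * 128 KiB = 1250 MiB; divide by the elapsed (real) seconds
    echo "scale=1; 10000*128/1024/20.1" | bc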
johansen-osdev at sun.com
2007-May-16 20:18 UTC
[zfs-discuss] Lots of overhead with ZFS - what am I doing wrong?
Marko, Matt and I discussed this offline some more and he had a couple of ideas about double-checking your hardware. It looks like your controller (or disks, maybe?) is having trouble with multiple simultaneous I/Os to the same disk. It looks like prefetch aggravates this problem. When I asked Matt what we could do to verify that it''s the number of concurrent I/Os that is causing performance to be poor, he had the following suggestions: set zfs_vdev_{min,max}_pending=1 and run with prefetch on, then iostat should show 1 outstanding io and perf should be good. or turn prefetch off, and have multiple threads reading concurrently, then iostat should show multiple outstanding ios and perf should be bad. Let me know if you have any additional questions. -j On Wed, May 16, 2007 at 11:38:24AM -0700, johansen-osdev at sun.com wrote:> At Matt''s request, I did some further experiments and have found that > this appears to be particular to your hardware. This is not a general > 32-bit problem. I re-ran this experiment on a 1-disk pool using a 32 > and 64-bit kernel. I got identical results: > > 64-bit > =====> > $ /usr/bin/time dd if=/testpool1/filebench/testfile of=/dev/null bs=128k > count=10000 > 10000+0 records in > 10000+0 records out > > real 20.1 > user 0.0 > sys 1.2 > > 62 Mb/s > > # /usr/bin/time dd if=/dev/dsk/c1t3d0 of=/dev/null bs=128k count=10000 > 10000+0 records in > 10000+0 records out > > real 19.0 > user 0.0 > sys 2.6 > > 65 Mb/s > > 32-bit > =====> > /usr/bin/time dd if=/testpool1/filebench/testfile of=/dev/null bs=128k > count=10000 > 10000+0 records in > 10000+0 records out > > real 20.1 > user 0.0 > sys 1.7 > > 62 Mb/s > > # /usr/bin/time dd if=/dev/dsk/c1t3d0 of=/dev/null bs=128k count=10000 > 10000+0 records in > 10000+0 records out > > real 19.1 > user 0.0 > sys 4.3 > > 65 Mb/s > > -j > > On Wed, May 16, 2007 at 09:32:35AM -0700, Matthew Ahrens wrote: > > Marko Milisavljevic wrote: > > >now lets try: > > >set zfs:zfs_prefetch_disable=1 > > > > > >bingo! > > > > > > r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device > > > 609.0 0.0 77910.0 0.0 0.0 0.8 0.0 1.4 0 83 c0d0 > > > > > >only 1-2 % slower then dd from /dev/dsk. Do you think this is general > > >32-bit problem, or specific to this combination of hardware? > > > > I suspect that it''s fairly generic, but more analysis will be necessary. > > > > >Finally, should I file a bug somewhere regarding prefetch, or is this > > >a known issue? > > > > It may be related to 6469558, but yes please do file another bug report. > > I''ll have someone on the ZFS team take a look at it. > > > > --matt > > _______________________________________________ > > zfs-discuss mailing list > > zfs-discuss at opensolaris.org > > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
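A sketch of the second test, with illustrative file names: prefetch disabled, two readers running against the same pool while iostat is watched from another terminal. If the controller really struggles with concurrent I/O, actv should rise above 1 per disk and throughput should drop:

    dd if=/tank/test/bigfile1 of=/dev/null bs=128k &
    dd if=/tank/test/bigfile2 of=/dev/null bs=128k &
    iostat -xnz 3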
Marko Milisavljevic
2007-May-17 06:58 UTC
[zfs-discuss] Lots of overhead with ZFS - what am I doing wrong?
Thank you, following your suggestion improves things - reading a ZFS file from a RAID-0 pair now gives me 95MB/sec - about the same as from /dev/dsk. What I find surprising is that reading from RAID-1 2-drive zpool gives me only 56MB/s - I imagined it would be roughly like reading from RAID-0. I can see that it can''t be identical - when reading mirrored drives simultaneously, some data will need to be skipped if the file is laid out sequentially, but it doesn''t seem intuitively obvious how my broken drivers/card would affect it to that degree, especially since reading from a file from one-disk zpool gives me 70MB/s. My plan was to make 4-disk RAID-Z - we''ll see how it works out when all drives arrive.

Given how common Sil3114 chipset is in my-old-computer-became-home-server segment, I am sure this workaround will be appreciated by many who google their way here. And just in case it is not clear, what j means below is to add these two lines in /etc/system:

set zfs:zfs_vdev_min_pending=1
set zfs:zfs_vdev_max_pending=1

I''ve been doing a lot of reading, and it seems unlikely that any effort will be made to address the driver performance with either ATA or Sil311x chipset specifically - by the time more pressing enhancements are made with various SATA drivers, this will be too obsolete to matter. With your workaround things are working well enough for the purpose that I am able to choose Solaris over Linux - thanks again.

Marko

On 5/16/07, johansen-osdev at sun.com <johansen-osdev at sun.com> wrote:
> Marko,
> Matt and I discussed this offline some more and he had a couple of ideas
> about double-checking your hardware.
>
> It looks like your controller (or disks, maybe?) is having trouble with
> multiple simultaneous I/Os to the same disk.  It looks like prefetch
> aggravates this problem.
>
> When I asked Matt what we could do to verify that it''s the number of
> concurrent I/Os that is causing performance to be poor, he had the
> following suggestions:
>
>         set zfs_vdev_{min,max}_pending=1 and run with prefetch on, then
>         iostat should show 1 outstanding io and perf should be good.
>
>         or turn prefetch off, and have multiple threads reading
>         concurrently, then iostat should show multiple outstanding ios
>         and perf should be bad.
>
> Let me know if you have any additional questions.
>
> -j
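One way to confirm that the two settings actually took effect after the reboot is to read them back from the live kernel (this only works on builds where the variables exist; see the b60 report later in the thread):

    echo "zfs_vdev_min_pending/D" | mdb -k
    echo "zfs_vdev_max_pending/D" | mdb -k
    # both should print 1; while a zfs file is read sequentially,
    # iostat -xnz should then show actv staying at about 1 per disk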
Richard Elling
2007-May-17 14:50 UTC
[zfs-discuss] Re: Lots of overhead with ZFS - what am I doing wrong?
queuing theory should explain this rather nicely. iostat measures %busy by counting if there is an entry in the queue for the clock ticks. There are two queues, one in the controller and one on the disk. As you can clearly see the way ZFS pushes the load is very different than dd or UFS. -- richard Marko Milisavljevic wrote:> I am very grateful to everyone who took the time to run a few tests to > help me figure what is going on. As per j''s suggestions, I tried some > simultaneous reads, and a few other things, and I am getting interesting > and confusing results. > > All tests are done using two Seagate 320G drives on sil3114. In each > test I am using dd if=.... of=/dev/null bs=128k count=10000. Each drive > is freshly formatted with one 2G file copied to it. That way dd from raw > disk and from file are using roughly same area of disk. I tried using > raw, zfs and ufs, single drives and two simultaneously (just executing > dd commands in separate terminal windows). These are snapshots of iostat > -xnczpm 3 captured somewhere in the middle of the operation. I am not > bothering to report CPU% as it never rose over 50%, and was uniformly > proportional to reported throughput. > > single drive raw: > r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device > 1378.4 0.0 77190.7 0.0 0.0 1.7 0.0 1.2 0 98 c0d1 > > single drive, ufs file > r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device > 1255.1 0.0 69949.6 0.0 0.0 1.8 0.0 1.4 0 100 c0d0 > > Small slowdown, but pretty good. > > single drive, zfs file > r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device > 258.3 0.0 33066.6 0.0 33.0 2.0 127.7 7.7 100 100 c0d1 > > Now that is odd. Why so much waiting? Also, unlike with raw or UFS, kr/s > / r/s gives 256K, as I would imagine it should. > > simultaneous raw: > r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device > 797.0 0.0 44632.0 0.0 0.0 1.8 0.0 2.3 0 100 c0d0 > 795.7 0.0 44557.4 0.0 0.0 1.8 0.0 2.3 0 100 c0d1 > > This PCI interface seems to be saturated at 90MB/s. Adequate if the goal > is to serve files on gigabit SOHO network. > > sumultaneous raw on c0d1 and ufs on c0d0: > extended device statistics > r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device > 722.4 0.0 40246.8 0.0 0.0 1.8 0.0 2.5 0 100 c0d0 > 717.1 0.0 40156.2 0.0 0.0 1.8 0.0 2.5 0 99 c0d1 > > hmm, can no longer get the 90MB/sec. > > simultaneous zfs on c0d1 and raw on c0d0: > r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device > 0.0 0.7 0.0 1.8 0.0 0.0 0.0 0.1 0 0 c1d0 > 334.9 0.0 18756.0 0.0 0.0 1.9 0.0 5.5 0 97 c0d0 > 172.5 0.0 22074.6 0.0 33.0 2.0 191.3 11.6 100 100 c0d1 > > Everything is slow. > > What happens if we throw onboard IDE interface into the mix? > simultaneous raw SATA and raw PATA: > r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device > 1036.3 0.3 58033.9 0.3 0.0 1.6 0.0 1.6 0 99 c1d0 > 1422.6 0.0 79668.3 0.0 0.0 1.6 0.0 1.1 1 98 c0d0 > > Both at maximum throughput. > > Read ZFS on SATA drive and raw disk on PATA interface: > r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device > 1018.9 0.3 57056.1 4.0 0.0 1.7 0.0 1.7 0 99 c1d0 > 268.4 0.0 34353.1 0.0 33.0 2.0 122.9 7.5 100 100 c0d0 > > SATA is slower with ZFS as expected by now, but ATA remains at full > speed. So they are operating quite independantly. Except... > > What if we read a UFS file from the PATA disk and ZFS from SATA: > r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device > 792.8 0.0 44092.9 0.0 0.0 1.8 0.0 2.2 1 98 c1d0 > 224.0 0.0 28675.2 0.0 33.0 2.0 147.3 8.9 100 100 c0d0 > > Now that is confusing! 
Why did SATA/ZFS slow down too? I''ve retried this > a number of times, not a fluke. > > Finally, after reviewing all this, I''ve noticed another interesting > bit... whenever I read from raw disks or UFS files, SATA or PATA, kr/s > over r/s is 56k, suggesting that underlying IO system is using that as > some kind of a native block size? (even though dd is requesting 128k). > But when reading ZFS files, this always comes to 128k, which is > expected, since that is ZFS default (and same thing happens regardless > of bs= in dd). On the theory that my system just doesn''t like 128k reads > (I''m desperate!), and that this would explain the whole slowdown and > wait/wsvc_t column, I tried changing recsize to 32k and rewriting the > test file. However, accessing ZFS files continues to show 128k reads, > and it is just as slow. Is there a way to either confirm that the ZFS > file in question is indeed written with 32k records or, even better, to > force ZFS to use 56k when accessing the disk. Or perhaps I just > misunderstand implications of iostat output. > > I''ve repeated each of these tests a few times and doublechecked, and the > numbers, although snapshots of a point in time, fairly represent averages. > > I have no idea what to make of all this, except that it ZFS has a > problem with this hardware/drivers that UFS and other traditional file > systems, don''t. Is it a bug in the driver that ZFS is inadvertently > exposing? A specific feature that ZFS assumes the hardware to have, but > it doesn''t? Who knows! I will have to give up on Solaris/ZFS on this > hardware for now, but I hope to try it again sometime in the future. > I''ll give FreeBSD/ZFS a spin to see if it fares better (although at this > point in its development it is probably more risky then just sticking > with Linux and missing out on ZFS). > > (Another contributor suggested turning checksumming off - it made no > difference. Same for atime. Compression was always off.) > > On 5/14/07, * johansen-osdev at sun.com <mailto:johansen-osdev at sun.com>* > <johansen-osdev at sun.com <mailto:johansen-osdev at sun.com>> wrote: > > Marko, > > I tried this experiment again using 1 disk and got nearly identical > times: > > # /usr/bin/time dd if=/dev/dsk/c0t0d0 of=/dev/null bs=128k count=10000 > 10000+0 records in > 10000+0 records out > > real 21.4 > user 0.0 > sys 2.4 > > $ /usr/bin/time dd if=/test/filebench/testfile of=/dev/null bs=128k > count=10000 > 10000+0 records in > 10000+0 records out > > real 21.0 > user 0.0 > sys 0.7 > > > > [I]t is not possible for dd to meaningfully access multiple-disk > > configurations without going through the file system. I find it > > curious that there is such a large slowdown by going through file > > system (with single drive configuration), especially compared to UFS > > or ext3. > > Comparing a filesystem to raw dd access isn''t a completely fair > comparison either. Few filesystems actually layout all of their data > and metadata so that every read is a completely sequential read. > > > I simply have a small SOHO server and I am trying to evaluate > which OS to > > use to keep a redundant disk array. With unreliable > consumer-level hardware, > > ZFS and the checksum feature are very interesting and the primary > selling > > point compared to a Linux setup, for as long as ZFS can generate > enough > > bandwidth from the drive array to saturate single gigabit ethernet. > > I would take Bart''s reccomendation and go with Solaris on something > like a > dual-core box with 4 disks. 
> > > My hardware at the moment is the "wrong" choice for Solaris/ZFS - > PCI 3114 > > SATA controller on a 32-bit AthlonXP, according to many posts I > found. > > Bill Moore lists some controller reccomendations here: > > http://mail.opensolaris.org/pipermail/zfs-discuss/2006-March/016874.html > <http://mail.opensolaris.org/pipermail/zfs-discuss/2006-March/016874.html> > > > However, since dd over raw disk is capable of extracting 75+MB/s > from this > > setup, I keep feeling that surely I must be able to get at least > that much > > from reading a pair of striped or mirrored ZFS drives. But I > can''t - single > > drive or 2-drive stripes or mirrors, I only get around 34MB/s > going through > > ZFS. (I made sure mirror was rebuilt and I resilvered the stripes.) > > Maybe this is a problem with your controller? What happens when you > have two simultaneous dd''s to different disks running? This would > simulate the case where you''re reading from the two disks at the same > time. > > -j > > > > ------------------------------------------------------------------------ > > _______________________________________________ > zfs-discuss mailing list > zfs-discuss at opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
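The queueing view can be checked directly against the iostat samples quoted above using Little''s Law (outstanding requests = request rate x time in the system): for the single-drive ZFS read, 258.3 reads/s x 7.7 ms of active service time is about 2.0 requests on the device (the actv column), and 258.3 reads/s x 127.7 ms of wait time is about 33 (the wait column). The raw and UFS runs keep only about 1.7-1.8 requests active with an empty wait queue, which is the difference in how ZFS pushes the load that Richard is pointing at.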
Torrey McMahon
2007-May-19 20:49 UTC
[zfs-discuss] Re: Re: Lots of overhead with ZFS - what am I doing wrong?
Jonathan Edwards wrote:
>
> On May 15, 2007, at 13:13, Jürgen Keil wrote:
>
>>> Would you mind also doing:
>>>
>>> ptime dd if=/dev/dsk/c2t1d0 of=/dev/null bs=128k count=10000
>>>
>>> to see the raw performance of underlying hardware.
>>
>> This dd command is reading from the block device,
>> which might cache data and probably splits requests
>> into "maxphys" pieces (which happens to be 56K on an
>> x86 box).
>
> to increase this to say 8MB, add the following to /etc/system:
>
> set maxphys=0x800000
>
> and you''ll probably want to increase sd_max_xfer_size as
> well (should be 256K on x86/x64) .. add the following to
> /kernel/drv/sd.conf:
>
> sd_max_xfer_size=0x800000;
>
> then reboot to get the kernel and sd tunings to take.
>
> ---
> .je
>
> btw - the defaults on sparc:
> maxphys = 128K
> ssd_max_xfer_size = maxphys
> sd_max_xfer_size = maxphys

Maybe we should file a bug to increase the max transfer request sizes?
Trygve Laugstøl
2007-May-20 10:48 UTC
[zfs-discuss] Lots of overhead with ZFS - what am I doing wrong?
Marko Milisavljevic wrote:
> Thank you, following your suggestion improves things - reading a ZFS
> file from a RAID-0 pair now gives me 95MB/sec - about the same as from
> /dev/dsk. What I find surprising is that reading from RAID-1 2-drive
> zpool gives me only 56MB/s - I imagined it would be roughly like
> reading from RAID-0. I can see that it can''t be identical - when
> reading mirrored drives simultaneously, some data will need to be
> skipped if the file is laid out sequentially, but it doesn''t seem
> intuitively obvious how my broken drivers/card would affect it to that
> degree, especially since reading from a file from one-disk zpool gives
> me 70MB/s. My plan was to make 4-disk RAID-Z - we''ll see how it works
> out when all drives arrive.
>
> Given how common Sil3114 chipset is in
> my-old-computer-became-home-server segment, I am sure this workaround
> will be appreciated by many who google their way here. And just in
> case it is not clear, what j means below is to add these two lines in
> /etc/system:
>
> set zfs:zfs_vdev_min_pending=1
> set zfs:zfs_vdev_max_pending=1

I just tried the same myself but got these warnings when booting:

May 20 01:22:29 deservio genunix: [ID 492708 kern.notice] sorry,
variable ''zfs_vdev_min_pending'' is not defined in the ''zfs''
May 20 01:22:29 deservio genunix: [ID 966847 kern.notice] module
May 20 01:22:29 deservio genunix: [ID 100000 kern.notice]
May 20 01:22:29 deservio genunix: [ID 492708 kern.notice] sorry,
variable ''zfs_vdev_max_pending'' is not defined in the ''zfs''
May 20 01:22:29 deservio genunix: [ID 966847 kern.notice] module
May 20 01:22:29 deservio genunix: [ID 100000 kern.notice]

I''m running b60.
Marko Milisavljevic
2007-May-21 05:42 UTC
[zfs-discuss] Lots of overhead with ZFS - what am I doing wrong?
It is definitely defined in b63... not sure when it got introduced.

http://src.opensolaris.org/source/xref/onnv/aside/usr/src/cmd/mdb/common/modules/zfs/zfs.c

shows tunable parameters for ZFS, under "zfs_params(...)"

On 5/20/07, Trygve Laugstøl <trygvis at codehaus.org> wrote:
> Marko Milisavljevic wrote:
> > Given how common Sil3114 chipset is in
> > my-old-computer-became-home-server segment, I am sure this workaround
> > will be appreciated by many who google their way here. And just in
> > case it is not clear, what j means below is to add these two lines in
> > /etc/system:
> >
> > set zfs:zfs_vdev_min_pending=1
> > set zfs:zfs_vdev_max_pending=1
>
> I just tried the same myself but got these warnings when booting:
>
> May 20 01:22:29 deservio genunix: [ID 492708 kern.notice] sorry,
> variable ''zfs_vdev_min_pending'' is not defined in the ''zfs''
> May 20 01:22:29 deservio genunix: [ID 966847 kern.notice] module
> May 20 01:22:29 deservio genunix: [ID 100000 kern.notice]
> May 20 01:22:29 deservio genunix: [ID 492708 kern.notice] sorry,
> variable ''zfs_vdev_max_pending'' is not defined in the ''zfs''
> May 20 01:22:29 deservio genunix: [ID 966847 kern.notice] module
> May 20 01:22:29 deservio genunix: [ID 100000 kern.notice]
>
> I''m running b60.
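The linked zfs.c is the mdb zfs module, so on builds where its ::zfs_params dcmd is present, the set of ZFS tunables and their current values can be dumped from the running kernel before deciding what to put in /etc/system; a sketch, since availability of the dcmd depends on the build:

    echo "::zfs_params" | mdb -k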