Robert Milkowski
2006-Apr-21 10:49 UTC
[zfs-discuss] Due to 128KB limit in ZFS it can't saturate disks
Hi.

I was doing some tests with sequential writes of large files. I created a mirrored pool (12 x FC 15K disks, JBOD, FC-AL, MPxIO).

bash-3.00# zpool status p-86-102
  pool: p-86-102
 state: ONLINE
 scrub: scrub completed with 0 errors on Fri Apr 21 12:31:37 2006
config:

        NAME                       STATE     READ WRITE CKSUM
        p-86-102                   ONLINE       0     0     0
          mirror                   ONLINE       0     0     0
            c5t500000E0119495A0d0  ONLINE       0     0     0
            c5t500000E0118EDB20d0  ONLINE       0     0     0
          mirror                   ONLINE       0     0     0
            c5t500000E01194A720d0  ONLINE       0     0     0
            c5t500000E0118EDA10d0  ONLINE       0     0     0
          mirror                   ONLINE       0     0     0
            c5t500000E01194A750d0  ONLINE       0     0     0
            c5t500000E01190E6B0d0  ONLINE       0     0     0
          mirror                   ONLINE       0     0     0
            c5t500000E01194A8C0d0  ONLINE       0     0     0
            c5t500000E0118F21C0d0  ONLINE       0     0     0
          mirror                   ONLINE       0     0     0
            c5t500000E011949570d0  ONLINE       0     0     0
            c5t500000E0118F1FD0d0  ONLINE       0     0     0
          mirror                   ONLINE       0     0     0
            c5t500000E011949480d0  ONLINE       0     0     0
            c5t500000E01190E5B0d0  ONLINE       0     0     0

errors: No known data errors
bash-3.00#

dd if=/dev/zero of=/p-86-102/q1 bs=1024k

Using iostat I get something like:

    r/s    w/s   kr/s      kw/s wait  actv wsvc_t asvc_t  %w   %b device
    0.0 1942.4    0.0  248622.7  0.0 420.0    0.0  216.2   0 1200 c5
    0.0  161.0    0.0   20611.2  0.0  35.0    0.0  217.3   0  100 c5t500000E0118F1FD0d0
    0.0  160.0    0.0   20483.3  0.0  35.0    0.0  218.7   0  100 c5t500000E01194A8C0d0
    0.0  163.0    0.0   20867.4  0.0  35.0    0.0  214.7   0  100 c5t500000E01194A720d0
    0.0  169.0    0.0   21635.7  0.0  35.0    0.0  207.0   0  100 c5t500000E011949570d0
    0.0  152.0    0.0   19459.4  0.0  35.0    0.0  230.2   0  100 c5t500000E0118EDA10d0
    0.0  160.0    0.0   20483.7  0.0  35.0    0.0  218.7   0  100 c5t500000E01190E5B0d0
    0.0  161.0    0.0   20611.8  0.0  35.0    0.0  217.3   0  100 c5t500000E0118F21C0d0
    0.0  159.0    0.0   20355.8  0.0  35.0    0.0  220.1   0  100 c5t500000E01190E6B0d0
    0.0  168.0    0.0   21508.2  0.0  35.0    0.0  208.3   0  100 c5t500000E01194A750d0
    0.0  172.0    0.0   22020.3  0.0  35.0    0.0  203.4   0  100 c5t500000E011949480d0
    0.0  163.0    0.0   20868.1  0.0  35.0    0.0  214.7   0  100 c5t500000E0119495A0d0
    0.0  154.0    0.0   19715.9  0.0  35.0    0.0  227.2   0  100 c5t500000E0118EDB20d0

So ZFS issues 128KB I/Os to the individual disks.

Now let's see how much we can get from a single disk with different I/O sizes.

dd if=/dev/zero of=/dev/rdsk/c5t500000E0119495A0d0s0 bs=128k

    extended device statistics
    r/s    w/s   kr/s     kw/s wait  actv wsvc_t asvc_t  %w  %b device
    0.0  185.0    0.0  23684.9  0.0   1.0    0.0    5.3   0  98 c5
    0.0  185.0    0.0  23685.3  0.0   1.0    0.0    5.3   0  98 c5t500000E0119495A0d0

dd if=/dev/zero of=/dev/rdsk/c5t500000E0119495A0d0s0 bs=512k

    r/s    w/s   kr/s     kw/s wait  actv wsvc_t asvc_t  %w  %b device
    0.0  102.9    0.0  52707.6  0.0   1.0    0.0    9.4   0  97 c5
    0.0  102.9    0.0  52706.5  0.0   1.0    0.0    9.4   0  97 c5t500000E0119495A0d0

dd if=/dev/zero of=/dev/rdsk/c5t500000E0119495A0d0s0 bs=1024k

    r/s    w/s   kr/s     kw/s wait  actv wsvc_t asvc_t  %w  %b device
    0.0   65.0    0.0  66566.3  0.0   1.0    0.0   14.7   0  96 c5
    0.0   65.0    0.0  66567.1  0.0   1.0    0.0   14.7   0  96 c5t500000E0119495A0d0

OK, I tried again, this time with the second side of each mirror offlined (so the host doesn't have to write the data twice, which could affect the performance).

bash-3.00# zpool offline p-86-102 c5t500000E0118EDB20d0 c5t500000E0118EDA10d0 c5t500000E01190E6B0d0 c5t500000E0118F21C0d0 c5t500000E0118F1FD0d0 c5t500000E01190E5B0d0
Bringing device c5t500000E0118EDB20d0 offline
Bringing device c5t500000E0118EDA10d0 offline
Bringing device c5t500000E01190E6B0d0 offline
Bringing device c5t500000E0118F21C0d0 offline
Bringing device c5t500000E0118F1FD0d0 offline
Bringing device c5t500000E01190E5B0d0 offline
bash-3.00# zpool status p-86-102
  pool: p-86-102
 state: DEGRADED
status: One or more devices has been taken offline by the adminstrator.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Online the device using 'zpool online' or replace the device with
        'zpool replace'.
 scrub: none requested
config:

        NAME                       STATE     READ WRITE CKSUM
        p-86-102                   DEGRADED     0     0     0
          mirror                   DEGRADED     0     0     0
            c5t500000E0119495A0d0  ONLINE       0     0     0
            c5t500000E0118EDB20d0  OFFLINE      0     0     0
          mirror                   DEGRADED     0     0     0
            c5t500000E01194A720d0  ONLINE       0     0     0
            c5t500000E0118EDA10d0  OFFLINE      0     0     0
          mirror                   DEGRADED     0     0     0
            c5t500000E01194A750d0  ONLINE       0     0     0
            c5t500000E01190E6B0d0  OFFLINE      0     0     0
          mirror                   DEGRADED     0     0     0
            c5t500000E01194A8C0d0  ONLINE       0     0     0
            c5t500000E0118F21C0d0  OFFLINE      0     0     0
          mirror                   DEGRADED     0     0     0
            c5t500000E011949570d0  ONLINE       0     0     0
            c5t500000E0118F1FD0d0  OFFLINE      0     0     0
          mirror                   DEGRADED     0     0     0
            c5t500000E011949480d0  ONLINE       0     0     0
            c5t500000E01190E5B0d0  OFFLINE      0     0     0

errors: No known data errors

And now:

dd if=/dev/zero of=/p-86-102/q2 bs=1024k

    extended device statistics
    r/s    w/s   kr/s      kw/s wait  actv wsvc_t asvc_t  %w   %b device
    0.0 1532.1    0.0  196107.5  0.0 210.0    0.0  137.0   0  600 c5
    0.0  239.0    0.0   30591.4  0.0  35.0    0.0  146.4   0  100 c5t500000E01194A8C0d0
    0.0  256.0    0.0   32767.4  0.0  35.0    0.0  136.7   0  100 c5t500000E01194A720d0
    0.0  255.0    0.0   32639.7  0.0  35.0    0.0  137.2   0  100 c5t500000E011949570d0
    0.0  263.0    0.0   33668.1  0.0  35.0    0.0  133.0   0  100 c5t500000E01194A750d0
    0.0  278.0    0.0   35588.6  0.0  35.0    0.0  125.9   0  100 c5t500000E011949480d0
    0.0  241.0    0.0   30852.0  0.0  35.0    0.0  145.2   0  100 c5t500000E0119495A0d0

Well, it looks like if ZFS could issue larger I/Os it could greatly improve performance for large sequential writes (2x?).
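A quick per-disk restatement of the numbers above (a worked check added for clarity, not part of the original post): every iostat row is consistent with 128 KB per write, and the per-disk rate roughly doubles when the raw device is fed 1 MB writes instead of 128 KB ones:

\[
161\ \mathrm{w/s} \times 128\ \mathrm{KB} \approx 20.6\ \mathrm{MB/s\ per\ disk\ (mirrored)},\qquad
255\ \mathrm{w/s} \times 128\ \mathrm{KB} \approx 32\ \mathrm{MB/s\ per\ disk\ (mirrors\ offlined)},\qquad
65\ \mathrm{w/s} \times 1\ \mathrm{MB} \approx 65\ \mathrm{MB/s\ (raw\ dd,\ 1\ MB)}
\]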
Gregory Shaw
2006-Apr-21 12:54 UTC
[zfs-discuss] Due to 128KB limit in ZFS it can't saturate disks
Would the maxphys system parameter have an impact on the below?

On Apr 21, 2006, at 4:49 AM, Robert Milkowski wrote:

> [full quote of Robert's original message trimmed; see above]

-----
Gregory Shaw, IT Architect          Phone: (303) 673-8273      Fax: (303) 673-8273
ITCTO Group, Sun Microsystems Inc.
1 StorageTek Drive MS 4382          greg.shaw at sun.com (work)
Louisville, CO 80028-4382           shaw at fmsoft.com (home)
"When Microsoft writes an application for Linux, I've Won." - Linus Torvalds
Roch Bourbonnais - Performance Engineering
2006-Apr-21 13:35 UTC
[zfs-discuss] Due to 128KB limit in ZFS it can't saturate disks
We believe that issuing larger I/Os would not impact ZFS performance that much. What must be considered is both I/O size and concurrency. ZFS will issue 128K I/Os, but many of them concurrently, whereas

        dd if=/dev/zero of=/dev/rdsk/c5t500000E0119495A0d0s0 bs=128k

only issues one 128K I/O at a time (does it not?).

Now, we still have to explain why you saturate at 250MB/sec. Is your dd process eating up a full CPU?

-r

Robert Milkowski writes:

> [full quote of Robert's original message trimmed; see above]
Robert Milkowski
2006-Apr-29 13:13 UTC
[zfs-discuss] Re: Due to 128KB limit in ZFS it can't saturate disks
To make things simpler I did a test with only one disk.

bash-3.00# dd if=/dev/zero of=/dev/rdsk/c5t500000E0119495A0d0s0

    extended device statistics
    r/s    w/s   kr/s     kw/s wait  actv wsvc_t asvc_t  %w  %b device
    0.0   62.8    0.0  64333.2  0.0   1.0    0.0   15.3   0  96 c5
    0.0   62.8    0.0  64334.1  0.0   1.0    0.0   15.3   0  96 c5t500000E0119495A0d0
    extended device statistics
    0.0   65.0    0.0  66554.5  0.0   1.0    0.0   14.7   0  96 c5
    0.0   65.0    0.0  66554.3  0.0   1.0    0.0   14.7   0  96 c5t500000E0119495A0d0
    extended device statistics
    0.0   62.0    0.0  63494.3  0.0   1.0    0.0   15.5   0  96 c5
    0.0   62.0    0.0  63493.9  0.0   1.0    0.0   15.5   0  96 c5t500000E0119495A0d0
    extended device statistics
    0.0   64.0    0.0  65531.7  0.0   1.0    0.0   15.0   0  96 c5
    0.0   64.0    0.0  65532.3  0.0   1.0    0.0   15.0   0  96 c5t500000E0119495A0d0

bash-3.00# zpool create one c5t500000E0119495A0d0
bash-3.00# dd if=/dev/zero of=/one/q2 bs=1024k

    extended device statistics
    r/s    w/s   kr/s     kw/s wait  actv wsvc_t asvc_t  %w  %b device
    0.0  388.0    0.0  49666.2  0.0  35.0    0.0   90.2   0 100 c5
    0.0  388.0    0.0  49667.7  0.0  35.0    0.0   90.2   0 100 c5t500000E0119495A0d0
    extended device statistics
    0.0  380.0    0.0  48640.1  0.0  35.0    0.0   92.1   0 100 c5
    0.0  380.0    0.0  48640.1  0.0  35.0    0.0   92.1   0 100 c5t500000E0119495A0d0
    extended device statistics
    0.0  379.0    0.0  48516.8  0.0  35.0    0.0   92.3   0 100 c5
    0.0  379.0    0.0  48517.1  0.0  35.0    0.0   92.3   0 100 c5t500000E0119495A0d0
    extended device statistics
    0.0  371.0    0.0  47484.3  0.0  35.0    0.0   94.3   0 100 c5
    0.0  371.0    0.0  47484.0  0.0  35.0    0.0   94.3   0 100 c5t500000E0119495A0d0
    extended device statistics
    0.0  378.0    0.0  48382.0  0.0  35.0    0.0   92.6   0 100 c5
    0.0  378.0    0.0  48382.0  0.0  35.0    0.0   92.6   0 100 c5t500000E0119495A0d0

Well, it looks like ZFS is slower in this case and can't saturate a single disk (~30% less performance than dd).

Let's try dd with a 128KB block:

bash-3.00# dd if=/dev/zero of=/dev/rdsk/c5t500000E0119495A0d0s0 bs=128k

    extended device statistics
    r/s    w/s   kr/s     kw/s wait  actv wsvc_t asvc_t  %w  %b device
    0.0  182.1    0.0  23305.7  0.0   1.0    0.0    5.4   0  98 c5
    0.0  182.1    0.0  23306.0  0.0   1.0    0.0    5.4   0  98 c5t500000E0119495A0d0
    extended device statistics
    0.0  181.9    0.0  23284.8  0.0   1.0    0.0    5.4   0  98 c5
    0.0  181.9    0.0  23284.3  0.0   1.0    0.0    5.4   0  98 c5t500000E0119495A0d0
    extended device statistics
    0.0  185.0    0.0  23681.8  0.0   1.0    0.0    5.3   0  98 c5
    0.0  185.0    0.0  23682.0  0.0   1.0    0.0    5.3   0  98 c5t500000E0119495A0d0

OK, so when writing with a 128KB block ZFS issues about 2x more I/Os to the single disk than dd does with a 128KB block, giving ZFS double the throughput of dd.

I also checked with an 8MB block size:

bash-3.00# dd if=/dev/zero of=/dev/rdsk/c5t500000E0119495A0d0s0 bs=8192k

    extended device statistics
    r/s    w/s   kr/s     kw/s wait  actv wsvc_t asvc_t  %w  %b device
    0.0    9.8    0.0  80642.8  0.0   0.9    0.0   93.7   0  92 c5
    0.0    9.8    0.0  80641.9  0.0   0.9    0.0   93.7   0  92 c5t500000E0119495A0d0
    extended device statistics
    0.0   10.2    0.0  83267.8  0.0   0.9    0.0   91.1   0  93 c5
    0.0   10.2    0.0  83268.1  0.0   0.9    0.0   91.1   0  93 c5t500000E0119495A0d0
    extended device statistics
    0.0    9.8    0.0  80552.4  0.0   0.9    0.0   93.8   0  92 c5
    0.0    9.8    0.0  80551.6  0.0   0.9    0.0   93.8   0  92 c5t500000E0119495A0d0

However, using larger I/O sizes dd can easily beat ZFS by a margin of 30% (ZFS is even almost two times slower when an 8MB block size is used).

I would say that ZFS would definitely benefit from larger block sizes - at least for large sequential writes of large files.
Richard Elling
2006-May-01 01:51 UTC
[zfs-discuss] Re: Due to 128KB limit in ZFS it can't saturate disks
comment below...

On Sat, 2006-04-29 at 06:13 -0700, Robert Milkowski wrote:

> [test commands and iostat output trimmed; see Robert's message above]
>
> Well, it looks like ZFS is slower in this case and can't saturate a single disk (~30% less performance than dd).

I disagree with your cause, but agree with your observations. Here's why.

In the dd case, you are doing pure sequential I/O. The asvc_t (service time of the queue in the disk) is ~15 ms, leading to the disk being 96% busy. In the ZFS case, asvc_t is ~92 ms and the disk is 100% busy. ZFS is clearly saturating the disk (100%) and dd is not (96%).

ZFS, or any filesystem, will at some point need to update the metadata, which will cause extra, perhaps long, seeks. A dd to the raw device won't. For disks which handle multiple outstanding commands, or for RAID controllers, this is more difficult to see, as there is another entity rescheduling the I/O at or near the disk which will try to avoid the long seeks. To go along with this rescheduling is another layer of caching with its own optimal block size, and so on...
 -- richard
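One way to connect the iostat columns in the two runs being compared (a worked check added here, not part of the original reply): per-disk IOPS is roughly the number of outstanding I/Os divided by the service time, and throughput is IOPS times the I/O size:

\[
\mathrm{IOPS} \approx \frac{\mathrm{actv}}{\mathrm{asvc\_t}}:\qquad
\text{raw dd: } \frac{1.0}{15\ \mathrm{ms}} \approx 67\ \mathrm{IOPS} \times 1\ \mathrm{MB} \approx 65\ \mathrm{MB/s};\qquad
\text{ZFS: } \frac{35}{92\ \mathrm{ms}} \approx 380\ \mathrm{IOPS} \times 128\ \mathrm{KB} \approx 47\ \mathrm{MB/s}
\]

Both observed throughputs follow directly from the queue depth, the service time, and the I/O size.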
Robert Milkowski
2006-May-12 11:22 UTC
[zfs-discuss] Re: Re: Due to 128KB limit in ZFS it can't saturate disks
Well, I have just tested UFS on the same disk.

bash-3.00# newfs -v /dev/rdsk/c5t500000E0119495A0d0s0
newfs: construct a new file system /dev/rdsk/c5t500000E0119495A0d0s0: (y/n)? y
mkfs -F ufs /dev/rdsk/c5t500000E0119495A0d0s0 143358287 128 48 8192 1024 16 1 1 8192 t 0 -1 1 1024 n
Warning: 5810 sector(s) in last cylinder unallocated
/dev/rdsk/c5t500000E0119495A0d0s0: 143358286 sectors in 23334 cylinders of 48 tracks, 128 sectors
        69999.2MB in 1459 cyl groups (16 c/g, 48.00MB/g, 5824 i/g)
super-block backups (for fsck -F ufs -o b=#) at:
 32, 98464, 196896, 295328, 393760, 492192, 590624, 689056, 787488, 885920,
Initializing cylinder groups:
............................
super-block backups for last 10 cylinder groups at:
 142447776, 142546208, 142644640, 142743072, 142841504, 142939936, 143038368,
 143136800, 143235232, 143333664
bash-3.00# mkdir /mnt/1
bash-3.00# mount -o noatime /dev/dsk/c5t500000E0119495A0d0s0 /mnt/1

bash-3.00# dd if=/dev/zero of=/mnt/1/q1 bs=8192k
^C110+0 records in
110+0 records out
bash-3.00#

    extended device statistics
    r/s    w/s   kr/s     kw/s wait  actv wsvc_t asvc_t  %w  %b device
    5.0   25.0   35.0  82408.8  0.0   3.6    0.0  120.3   0  99 c5
    5.0   25.0   35.0  82409.7  0.0   3.6    0.0  120.3   0  99 c5t500000E0119495A0d0
    extended device statistics
    4.0   25.0   28.0  79832.1  0.0   3.9    0.0  133.4   0  97 c5
    4.0   25.0   28.0  79831.5  0.0   3.9    0.0  133.4   0  97 c5t500000E0119495A0d0
    extended device statistics
    6.0   25.0   42.0  81921.3  0.0   4.7    0.0  151.6   0 100 c5
    6.0   25.0   42.0  81921.4  0.0   4.7    0.0  151.6   0 100 c5t500000E0119495A0d0
    extended device statistics
    4.0   21.0   28.0  73555.6  0.0   3.5    0.0  138.7   0  97 c5
    4.0   21.0   28.0  73555.7  0.0   3.5    0.0  138.7   0  97 c5t500000E0119495A0d0

bash-3.00# tunefs -a 2048 /mnt/1

    extended device statistics
    r/s    w/s   kr/s     kw/s wait  actv wsvc_t asvc_t  %w  %b device
    0.0   22.0    0.0  83240.1  0.0   3.5    0.0  157.1   0  97 c5
    0.0   22.0    0.0  83240.5  0.0   3.5    0.0  157.1   0  97 c5t500000E0119495A0d0
    extended device statistics
    0.0   19.0    0.0  81837.1  0.0   3.4    0.0  180.1   0  98 c5
    0.0   19.0    0.0  81837.2  0.0   3.4    0.0  180.1   0  98 c5t500000E0119495A0d0
    extended device statistics
    0.0   21.0    0.0  94004.1  0.0   4.6    0.0  218.1   0 100 c5
    0.0   21.0    0.0  94002.6  0.0   4.6    0.0  218.1   0 100 c5t500000E0119495A0d0
    extended device statistics
    0.0   20.0    0.0  70116.6  0.0   4.3    0.0  216.5   0 100 c5
    0.0   20.0    0.0  70116.7  0.0   4.3    0.0  216.5   0 100 c5t500000E0119495A0d0
    extended device statistics
    0.0   21.0    0.0  82140.7  0.0   3.3    0.0  158.0   0  95 c5
    0.0   21.0    0.0  82140.8  0.0   3.3    0.0  158.0   0  95 c5t500000E0119495A0d0
    extended device statistics
    0.0   72.0    0.0  82279.7  0.0   5.0    0.0   69.9   0  98 c5
    0.0   72.0    0.0  82279.6  0.0   5.0    0.0   69.9   0  98 c5t500000E0119495A0d0

So sometimes it can push even more out of the disk.

So even UFS is much faster than ZFS in this case. And UFS issued I/Os of something like 3.5MB.

bash-3.00# tunefs -a 16 /mnt/1
maximum contiguous block count changes from 2048 to 16

    extended device statistics
    r/s    w/s   kr/s     kw/s wait   actv wsvc_t asvc_t  %w  %b device
    0.0  350.9    0.0  44533.6  0.0  118.1    0.0  336.6   0 100 c5
    0.0  350.9    0.0  44531.0  0.0  118.1    0.0  336.6   0 100 c5t500000E0119495A0d0
    extended device statistics
    0.0  381.0    0.0  48466.4  0.0  112.9    0.0  296.4   0 100 c5
    0.0  381.0    0.0  48468.7  0.0  112.9    0.0  296.4   0 100 c5t500000E0119495A0d0
    extended device statistics
    0.0  369.9    0.0  47057.3  0.0  110.8    0.0  299.6   0 100 c5
    0.0  369.9    0.0  47057.3  0.0  110.8    0.0  299.6   0 100 c5t500000E0119495A0d0
    extended device statistics
    0.0  399.1    0.0  50566.4  0.0  108.8    0.0  272.7   0 100 c5
    0.0  399.1    0.0  50566.5  0.0  108.8    0.0  272.7   0 100 c5t500000E0119495A0d0
    extended device statistics
    0.0  345.0    0.0  44171.3  0.0   87.7    0.0  254.3   0 100 c5
    0.0  345.0    0.0  44171.4  0.0   87.7    0.0  254.3   0 100 c5t500000E0119495A0d0

So now UFS is issuing 128KB I/Os, and with UFS I now get performance similar to ZFS.

So I would say that larger I/Os could greatly help ZFS performance when writing large sequential files (with large writes).
Roch Bourbonnais - Performance Engineering
2006-May-12 12:28 UTC
[zfs-discuss] Re: Re: Due to 128KB limit in ZFS it can't saturate disks
Hi Robert,

Could you try 35 concurrent dd's, each issuing 128K I/Os? That would be closer to how ZFS would behave.

-r

Robert Milkowski writes:

> [full quote of Robert's UFS results trimmed; see above]
Robert Milkowski
2006-May-12 15:07 UTC
[zfs-discuss] Re: Re: Due to 128KB limit in ZFS it can't saturate disks
Hello Roch,

Friday, May 12, 2006, 2:28:59 PM, you wrote:

RBPE> Hi Robert,
RBPE> Could you try 35 concurrent dd each issuing 128K I/O ?
RBPE> That would be closer to how ZFS would behave.

You mean to UFS?

OK, I did try it, and I get about 8-9MB/s with about 1100 IO/s (w/s).

But what does it prove?

-- 
Best regards,
 Robert                    mailto:rmilkowski at task.gda.pl
                           http://milek.blogspot.com
Roch Bourbonnais - Performance Engineering
2006-May-12 15:31 UTC
[zfs-discuss] Re: Re: Due to 128KB limit in ZFS it can't saturate disks
Robert Milkowski writes:
 > Hello Roch,
 >
 > Friday, May 12, 2006, 2:28:59 PM, you wrote:
 >
 > RBPE> Hi Robert,
 >
 > RBPE> Could you try 35 concurrent dd each issuing 128K I/O ?
 > RBPE> That would be closer to how ZFS would behave.
 >
 > You mean to UFS?
 >
 > ok, I did try and I get about 8-9MB/s with about 1100 IO/s (w/s).
 >
 > But what does it prove?

It does not prove my point, at least. Actually I also tried it, and it does not generate the I/O pattern that ZFS uses; I did not analyze this, but UFS gets in the way.

I don't have a raw device to play with at this instant, but what we (I) have to do is find the right script that will cause 35 concurrent 128K I/Os to be dumped onto a spindle repeatedly. They can be as random as you like. This, I guarantee you, will saturate your spindle (or get really close to it). And this is the I/O pattern that ZFS generates during a pool sync operation.

-r
Robert Milkowski
2006-May-14 20:55 UTC
[zfs-discuss] Re: Re: Due to 128KB limit in ZFS it can't saturate disks
Hello Roch,

Friday, May 12, 2006, 5:31:10 PM, you wrote:

RBPE> [full quote of Roch's message trimmed; see above]

OK, the same disk, the same host.

bash-3.00# cat dd32.sh
#!/bin/sh
dd if=/dev/zero of=/dev/rdsk/c5t500000E0119495A0d0s0 bs=128k &
dd if=/dev/zero of=/dev/rdsk/c5t500000E0119495A0d0s0 bs=128k &
dd if=/dev/zero of=/dev/rdsk/c5t500000E0119495A0d0s0 bs=128k &
dd if=/dev/zero of=/dev/rdsk/c5t500000E0119495A0d0s0 bs=128k &
dd if=/dev/zero of=/dev/rdsk/c5t500000E0119495A0d0s0 bs=128k &
dd if=/dev/zero of=/dev/rdsk/c5t500000E0119495A0d0s0 bs=128k &
dd if=/dev/zero of=/dev/rdsk/c5t500000E0119495A0d0s0 bs=128k &
dd if=/dev/zero of=/dev/rdsk/c5t500000E0119495A0d0s0 bs=128k &
dd if=/dev/zero of=/dev/rdsk/c5t500000E0119495A0d0s0 bs=128k &
dd if=/dev/zero of=/dev/rdsk/c5t500000E0119495A0d0s0 bs=128k &
dd if=/dev/zero of=/dev/rdsk/c5t500000E0119495A0d0s0 bs=128k &
dd if=/dev/zero of=/dev/rdsk/c5t500000E0119495A0d0s0 bs=128k &
dd if=/dev/zero of=/dev/rdsk/c5t500000E0119495A0d0s0 bs=128k &
dd if=/dev/zero of=/dev/rdsk/c5t500000E0119495A0d0s0 bs=128k &
dd if=/dev/zero of=/dev/rdsk/c5t500000E0119495A0d0s0 bs=128k &
dd if=/dev/zero of=/dev/rdsk/c5t500000E0119495A0d0s0 bs=128k &
dd if=/dev/zero of=/dev/rdsk/c5t500000E0119495A0d0s0 bs=128k &
dd if=/dev/zero of=/dev/rdsk/c5t500000E0119495A0d0s0 bs=128k &
dd if=/dev/zero of=/dev/rdsk/c5t500000E0119495A0d0s0 bs=128k &
dd if=/dev/zero of=/dev/rdsk/c5t500000E0119495A0d0s0 bs=128k &
dd if=/dev/zero of=/dev/rdsk/c5t500000E0119495A0d0s0 bs=128k &
dd if=/dev/zero of=/dev/rdsk/c5t500000E0119495A0d0s0 bs=128k &
dd if=/dev/zero of=/dev/rdsk/c5t500000E0119495A0d0s0 bs=128k &
dd if=/dev/zero of=/dev/rdsk/c5t500000E0119495A0d0s0 bs=128k &
dd if=/dev/zero of=/dev/rdsk/c5t500000E0119495A0d0s0 bs=128k &
dd if=/dev/zero of=/dev/rdsk/c5t500000E0119495A0d0s0 bs=128k &
dd if=/dev/zero of=/dev/rdsk/c5t500000E0119495A0d0s0 bs=128k &
dd if=/dev/zero of=/dev/rdsk/c5t500000E0119495A0d0s0 bs=128k &
dd if=/dev/zero of=/dev/rdsk/c5t500000E0119495A0d0s0 bs=128k &
dd if=/dev/zero of=/dev/rdsk/c5t500000E0119495A0d0s0 bs=128k &
dd if=/dev/zero of=/dev/rdsk/c5t500000E0119495A0d0s0 bs=128k &
dd if=/dev/zero of=/dev/rdsk/c5t500000E0119495A0d0s0 bs=128k &
dd if=/dev/zero of=/dev/rdsk/c5t500000E0119495A0d0s0 bs=128k &
bash-3.00# ./dd32.sh
bash-3.00# iostat -xnzC 1

    extended device statistics
    r/s    w/s   kr/s     kw/s wait  actv wsvc_t asvc_t  %w  %b device
    0.0  374.0    0.0  47874.6  0.0  33.0    0.0   88.1   0 100 c5
    0.0  374.0    0.0  47875.2  0.0  33.0    0.0   88.1   0 100 c5t500000E0119495A0d0
    extended device statistics
    0.0  367.1    0.0  46985.6  0.0  33.0    0.0   89.8   0 100 c5
    0.0  367.1    0.0  46985.7  0.0  33.0    0.0   89.8   0 100 c5t500000E0119495A0d0
    extended device statistics
    0.0  355.0    0.0  45440.3  0.0  33.0    0.0   92.9   0 100 c5
    0.0  355.0    0.0  45439.9  0.0  33.0    0.0   92.9   0 100 c5t500000E0119495A0d0
    extended device statistics
    0.0  385.9    0.0  49395.4  0.0  33.0    0.0   85.4   0 100 c5
    0.0  385.9    0.0  49395.3  0.0  33.0    0.0   85.4   0 100 c5t500000E0119495A0d0
    extended device statistics
    0.0  380.0    0.0  48635.9  0.0  33.0    0.0   86.7   0 100 c5
    0.0  380.0    0.0  48635.4  0.0  33.0    0.0   86.7   0 100 c5t500000E0119495A0d0
    extended device statistics
    0.0  361.1    0.0  46224.7  0.0  33.0    0.0   91.3   0 100 c5
    0.0  361.1    0.0  46225.3  0.0  33.0    0.0   91.3   0 100 c5t500000E0119495A0d0

These numbers are very similar to those I get with ZFS.

But it's much less than a single dd writing with an 8MB block size to UFS or to the raw device.

It still looks like issuing larger I/Os does in fact offer much better throughput.

-- 
Best regards,
 Robert                    mailto:rmilkowski at task.gda.pl
                           http://milek.blogspot.com
Robert Milkowski
2006-May-14 21:08 UTC
[zfs-discuss] Re: Re: Due to 128KB limit in ZFS it can't saturate disks
Hello Robert,

Sunday, May 14, 2006, 10:55:42 PM, you wrote:

RM> [full quote of the previous message - the 33 concurrent dd's against the raw device - trimmed; see above]
bash-3.00# zpool create one c5t500000E0119495A0d0
bash-3.00# zfs set atime=off one
bash-3.00# cat dd32-1.sh
#!/bin/sh
dd if=/dev/zero of=/one/q1 bs=128k &
dd if=/dev/zero of=/one/q2 bs=128k &
dd if=/dev/zero of=/one/q3 bs=128k &
dd if=/dev/zero of=/one/q4 bs=128k &
dd if=/dev/zero of=/one/q5 bs=128k &
dd if=/dev/zero of=/one/q6 bs=128k &
dd if=/dev/zero of=/one/q7 bs=128k &
dd if=/dev/zero of=/one/q8 bs=128k &
dd if=/dev/zero of=/one/q9 bs=128k &
dd if=/dev/zero of=/one/q10 bs=128k &
dd if=/dev/zero of=/one/q11 bs=128k &
dd if=/dev/zero of=/one/q12 bs=128k &
dd if=/dev/zero of=/one/q13 bs=128k &
dd if=/dev/zero of=/one/q14 bs=128k &
dd if=/dev/zero of=/one/q15 bs=128k &
dd if=/dev/zero of=/one/q16 bs=128k &
dd if=/dev/zero of=/one/q17 bs=128k &
dd if=/dev/zero of=/one/q18 bs=128k &
dd if=/dev/zero of=/one/q19 bs=128k &
dd if=/dev/zero of=/one/q20 bs=128k &
dd if=/dev/zero of=/one/q21 bs=128k &
dd if=/dev/zero of=/one/q22 bs=128k &
dd if=/dev/zero of=/one/q23 bs=128k &
dd if=/dev/zero of=/one/q24 bs=128k &
dd if=/dev/zero of=/one/q25 bs=128k &
dd if=/dev/zero of=/one/q26 bs=128k &
dd if=/dev/zero of=/one/q27 bs=128k &
dd if=/dev/zero of=/one/q28 bs=128k &
dd if=/dev/zero of=/one/q29 bs=128k &
dd if=/dev/zero of=/one/q30 bs=128k &
dd if=/dev/zero of=/one/q31 bs=128k &
dd if=/dev/zero of=/one/q32 bs=128k &
bash-3.00# iostat -xnzC 1

    extended device statistics
    r/s    w/s   kr/s     kw/s wait  actv wsvc_t asvc_t  %w  %b device
    0.0  390.0    0.0  49916.6  0.0  34.9    0.0   89.5   0 100 c5
    0.0  390.0    0.0  49917.7  0.0  34.9    0.0   89.5   0 100 c5t500000E0119495A0d0
    extended device statistics
    0.0  389.9    0.0  49911.5  0.0  34.9    0.0   89.5   0 100 c5
    0.0  389.9    0.0  49911.4  0.0  34.9    0.0   89.5   0 100 c5t500000E0119495A0d0
    extended device statistics
    0.0  383.5    0.0  49089.1  0.0  34.9    0.0   91.0   0 100 c5
    0.0  383.5    0.0  49087.8  0.0  34.9    0.0   91.0   0 100 c5t500000E0119495A0d0
    extended device statistics
    0.0  393.5    0.0  50371.9  0.0  34.9    0.0   88.6   0 100 c5
    0.0  393.5    0.0  50373.3  0.0  34.9    0.0   88.6   0 100 c5t500000E0119495A0d0

bash-3.00# newfs -v /dev/rdsk/c5t500000E0119495A0d0s0
newfs: construct a new file system /dev/rdsk/c5t500000E0119495A0d0s0: (y/n)? y
mkfs -F ufs /dev/rdsk/c5t500000E0119495A0d0s0 143358287 128 48 8192 1024 16 1 1 8192 t 0 -1 1 1024 n
Warning: 5810 sector(s) in last cylinder unallocated
/dev/rdsk/c5t500000E0119495A0d0s0: 143358286 sectors in 23334 cylinders of 48 tracks, 128 sectors
        69999.2MB in 1459 cyl groups (16 c/g, 48.00MB/g, 5824 i/g)
super-block backups (for fsck -F ufs -o b=#) at:
 32, 98464, 196896, 295328, 393760, 492192, 590624, 689056, 787488, 885920,
Initializing cylinder groups:
............................
super-block backups for last 10 cylinder groups at:
 142447776, 142546208, 142644640, 142743072, 142841504, 142939936, 143038368,
 143136800, 143235232, 143333664
bash-3.00# mount -o noatime /dev/dsk/c5t500000E0119495A0d0s0 /one
bash-3.00#
bash-3.00# ./dd32-1.sh
bash-3.00# iostat -xnzC 1

    extended device statistics
    r/s    w/s   kr/s    kw/s   wait   actv wsvc_t asvc_t  %w  %b device
    1.0  833.7    7.0  6885.6  137.5  256.0  164.7  306.7   0 100 c5
    1.0  833.7    7.0  6885.5  137.5  256.0  164.7  306.7 100 100 c5t500000E0119495A0d0
    extended device statistics
    0.0  829.9    0.0  6855.4  130.6  256.0  157.3  308.5   0 100 c5
    0.0  829.9    0.0  6855.4  130.6  256.0  157.3  308.5 100 100 c5t500000E0119495A0d0
    extended device statistics
    0.0  799.1    0.0  6488.8  113.6  256.0  142.2  320.4   0 100 c5
    0.0  799.1    0.0  6488.8  113.6  256.0  142.2  320.4 100 100 c5t500000E0119495A0d0
    extended device statistics
    1.0  813.0    7.0  6527.8  110.7  217.3  136.0  267.0   0 100 c5
    1.0  813.0    7.0  6527.8  110.7  217.3  136.0  267.0  68 100 c5t500000E0119495A0d0

So with many concurrent sequential write streams ZFS behaves much better. But still, with one stream ZFS is much worse.

-- 
Best regards,
 Robert                    mailto:rmilkowski at task.gda.pl
                           http://milek.blogspot.com
Roch Bourbonnais - Performance Engineering
2006-May-15 13:23 UTC
[zfs-discuss] Re: Re: Due to 128KB limit in ZFS it can't saturate disks
The question put forth is whether the ZFS 128K blocksize is sufficient to saturate a regular disk. There is a great body of evidence showing that bigger write sizes and a matching large FS cluster size lead to more throughput. The counterpoint is that ZFS schedules its I/O like nothing else seen before and manages to saturate a single disk using enough concurrent 128K I/Os.

<There are a few things I did here for the first time, so I may have erred in places. So I am proposing this for review by the community.>

I first measured the throughput of a write(2) to the raw device using, for instance, this:

        dd if=/dev/zero of=/dev/rdsk/c1t1d0s0 bs=8192k count=1024

On Solaris we would see some overhead of reading the block from /dev/zero and then issuing the write call. The tightest function that fences the I/O is default_physio(). That function will issue the I/O to the device and then wait for it to complete. If we take the elapsed time spent in this function and count the bytes that are I/O-ed, this should give a good hint as to the throughput the device is providing. The above dd command will issue a single I/O at a time (the d-script to measure this is attached).

Trying different block sizes I see:

        Bytes sent   Elapse of phys I/O
          8 MB;  3576 ms of phys; avg sz :   16 KB; throughput  2 MB/s
          9 MB;  1861 ms of phys; avg sz :   32 KB; throughput  4 MB/s
         31 MB;  3450 ms of phys; avg sz :   64 KB; throughput  8 MB/s
         78 MB;  4932 ms of phys; avg sz :  128 KB; throughput 15 MB/s
        124 MB;  4903 ms of phys; avg sz :  256 KB; throughput 25 MB/s
        178 MB;  4868 ms of phys; avg sz :  512 KB; throughput 36 MB/s
        226 MB;  4824 ms of phys; avg sz : 1024 KB; throughput 46 MB/s
        226 MB;  4816 ms of phys; avg sz : 2048 KB; throughput 46 MB/s
         32 MB;   686 ms of phys; avg sz : 4096 KB; throughput 46 MB/s
        224 MB;  4741 ms of phys; avg sz : 8192 KB; throughput 47 MB/s

Now let's see what ZFS gets. I measure using a single dd process. ZFS will chunk up the data in 128K blocks. Now, the dd command interacts with memory, but the I/Os are scheduled under the control of spa_sync(). So in the d-script (attached) I check for the start of an spa_sync and time it based on elapsed time. At the same time I gather the number of bytes and keep a count of the I/Os (bdev_strategy) that are being issued. When the spa_sync completes, we are sure that all of those are on stable storage. The script is a bit more complex because there are two threads that issue spa_sync, but only one of them actually becomes activated, so the script will print out some spurious lines of output at times.

I measure I/O with the script while this runs:

        dd if=/dev/zero of=/zfs2/roch/f1 bs=1024k count=8000

And I see:

        1431 MB; 23723 ms of spa_sync; avg sz : 127 KB; throughput 60 MB/s
        1387 MB; 23044 ms of spa_sync; avg sz : 127 KB; throughput 60 MB/s
        2680 MB; 44209 ms of spa_sync; avg sz : 127 KB; throughput 60 MB/s
        1359 MB; 24223 ms of spa_sync; avg sz : 127 KB; throughput 56 MB/s
        1143 MB; 19183 ms of spa_sync; avg sz : 126 KB; throughput 59 MB/s

OK, I cheated. Here, ZFS is given a full disk to play with; in this case ZFS enables the write cache. Note that even with the write cache enabled, when the spa_sync() completes it is after a flush of the cache has been executed, so the 60MB/sec does correspond to data sent to the platter. I just tried disabling the cache (with format -e), but I am not sure if that is taken into account by ZFS; results are the same, 60MB/sec. This will have to be confirmed.

With the write cache enabled, the physio test reaches 66 MB/s as soon as we are issuing 16KB I/Os. Here, clearly, data is not on the platter when the timed function completes. Another variable not fully controlled is the physical (cylinder) location of the I/Os. It could be that some of the differences come from that.

What do I take away?

A single 2MB physical I/O will get 46 MB/sec out of my disk. 35 concurrent 128K I/Os sustained, followed by metadata I/O, followed by a flush of the write cache, allow ZFS to get 60 MB/sec out of the same disk.

This is what underwrites my belief that the 128K blocksize is sufficiently large. Now, nothing here proves that 256K would not give more throughput, so nothing is really settled. But I hope this helps put us on common ground.

-r

-------------- next part --------------
A non-text attachment was scrubbed...
Name: phys.d
Type: application/octet-stream
Size: 586 bytes
Desc: not available
URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20060515/0c9f0c86/attachment.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: spa_sync.d
Type: application/octet-stream
Size: 986 bytes
Desc: not available
URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20060515/0c9f0c86/attachment-0001.obj>
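The phys.d and spa_sync.d scripts themselves were lost when the archive scrubbed the attachments. As a rough sketch of the kind of measurement described above (a reconstruction, not Roch's actual scripts: the probe points default_physio(), bdev_strategy() and spa_sync() are the functions named in the text, everything else here is an assumption):

#!/usr/sbin/dtrace -s

#pragma D option quiet

/* Time spent inside default_physio(), which fences each raw-device write(2). */
fbt::default_physio:entry  { self->phys = timestamp; }
fbt::default_physio:return
/self->phys/
{
        @physns = sum(timestamp - self->phys);
        self->phys = 0;
}

/* Bytes and I/O count handed to the block layer. */
fbt::bdev_strategy:entry
{
        @bytes = sum(args[0]->b_bcount);
        @ios   = count();
}

/* spa_sync() brackets the ZFS pool-sync phase discussed above. */
fbt::spa_sync:entry  { self->spa = timestamp; }
fbt::spa_sync:return
/self->spa/
{
        @syncns = sum(timestamp - self->spa);
        self->spa = 0;
}

/* Report and reset every 10 seconds. */
tick-10s
{
        normalize(@physns, 1000000);    /* ns -> ms */
        normalize(@syncns, 1000000);    /* ns -> ms */
        normalize(@bytes, 1048576);     /* bytes -> MB */
        printa("MB: %@d  physio ms: %@d  spa_sync ms: %@d  I/Os: %@d\n",
            @bytes, @physns, @syncns, @ios);
        clear(@bytes); clear(@physns); clear(@syncns); clear(@ios);
}

Dividing the MB column by either elapsed-time column gives throughput figures of the kind quoted above; dividing the bytes by the I/O count gives the average I/O size.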
Anton B. Rang
2006-May-16 14:17 UTC
[zfs-discuss] Re: Re[5]: Re: Re: Due to 128KB limit in ZFS it can't saturate disks
One issue is what we mean by "saturation." It's easy to bring a disk to 100% busy. We need to keep this discussion in the context of a workload. Generally when people care about streaming throughput of a disk, it's because they are reading or writing a single large file, and they want to reach as closely as possible the full media rate.

Once the write cache on a device is enabled, it's quite easy to maximize write performance. All you need is to move data quickly enough into the device's buffer so that it's never found to be empty while the head is writing; and, of course, avoid ever moving the disk head away. Since the media rate is typically fairly low (e.g. 20-80 MB/sec), this isn't that hard on either FibreChannel or SCSI, and shouldn't be too difficult for ATA either. Very small requests are hurt by protocol and stack overhead, but moderately large requests (1-2 MB) can usually reach the full rate, at least for a single disk. (Disk arrays often have faster back ends than the interconnect, so they are always limited by protocol and stack overhead, even for large transfers.)

With a disabled write cache, there will always be some protocol and stack overhead; and with less sophisticated disks, you'll miss on average half a revolution of the disk each time you write (as you wait for the first sector to go beneath the head). More sophisticated disks will reorder data during the write, and the most sophisticated (with FC/SCSI interfaces) can actually get the data from the host out of order to match the sectors passing underneath the head. In this case the only way to come close to disk rates with smaller writes is to issue overlapping commands, with tags allowing the device to reorder them, and hope that the device has enough buffering to reorder all writes into sequence.

Disk seeks remain the worst enemy of streaming performance, however. There's no way to avoid that. ZFS should be able to achieve high write performance as long as it can allocate blocks (including metadata) in a forward direction and minimize the number of times the uberblock must be written. Reads will be more challenging unless the data was written contiguously. The biggest issue with the 128K block size for ZFS, I suspect, will be the seek between each read. Even a fast (1 ms) seek represents 60KB of lost data transfer on a disk which can transfer data at 60 MBps.
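To put a number on that last point (an illustrative calculation using the 60 MB/s and 1 ms figures above, not part of the original post): transferring 128 KB at media rate takes about 2.1 ms, so one fast seek per block already limits a read stream to roughly two-thirds of the media rate:

\[
t_{\mathrm{xfer}} = \frac{128\ \mathrm{KB}}{60\ \mathrm{MB/s}} \approx 2.1\ \mathrm{ms},
\qquad
\frac{t_{\mathrm{xfer}}}{t_{\mathrm{xfer}} + 1\ \mathrm{ms}} \approx 0.68
\]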
Roch Bourbonnais - Performance Engineering
2006-May-16 14:33 UTC
[zfs-discuss] Re: Re[5]: Re: Re: Due to 128KB limit in ZFS it can't saturate disks
Anton B. Rang writes:

> One issue is what we mean by "saturation." It's easy to bring a disk to 100% busy. We need to keep this discussion in the context of a workload. Generally when people care about streaming throughput of a disk, it's because they are reading or writing a single large file, and they want to reach as closely as possible the full media rate.
>
> Once the write cache on a device is enabled, it's quite easy to maximize write performance. All you need is to move data quickly enough into the device's buffer so that it's never found to be empty while the head is writing; and, of course, avoid ever moving the disk head away. Since the media rate is typically fairly low (e.g. 20-80 MB/sec), this isn't that hard on either FibreChannel or SCSI, and shouldn't be too difficult for ATA either. Very small requests are hurt by protocol and stack overhead, but moderately large requests (1-2 MB) can usually reach the full rate, at least for a single disk. (Disk arrays often have faster back ends than the interconnect, so are always limited by protocol and stack overhead, even for large transfers.)
>
> With a disabled write cache, there will always be some protocol and stack overhead; and with less sophisticated disks, you'll miss on average half a revolution of the disk each time you write (as you wait for the first sector to go beneath the head). More sophisticated disks will reorder data during the write, and the most sophisticated (with FC/SCSI interfaces) can actually get the data from the host out-of-order to match the sectors passing underneath the head. In this case the only way to come close to disk rates with smaller writes is to issue overlapping commands, with tags allowing the device to reorder them, and hope that the device has enough buffering to reorder all writes into sequence.
>
> Disk seeks remain the worst enemy of streaming performance, however. There's no way to avoid that. ZFS should be able to achieve high write performance as long as it can allocate blocks (including metadata) in a forward direction and minimize the number of times the überblock must be written. Reads will be more challenging unless the data was written contiguously. The biggest issue with 128K block size for ZFS, I suspect, will be the seek between each read. Even a fast (1 ms) seek represents 60KB of lost data transfer on a disk which can transfer data at 60 MBps.
>
> This message posted from opensolaris.org
> _______________________________________________
> zfs-discuss mailing list
> zfs-discuss at opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

Ok, so let's consider your 2MB read. You have the option of setting it in one contiguous place on the disk or splitting it into 16 x 128K chunks, somewhat spread all over.

Now you issue a read to that 2MB of data.

As you noted, you either have to wait for the head to find the 2MB block and stream it, or you dump 16 I/O descriptors into an intelligent controller; wherever the head is, there is data to be gotten from the get go. I can't swear it wins the game, but it should be real close.

I just did an experiment and could see > 60MB of data out of a 35G disk using 128K chunks (> 450 IOPS).

Disruptive.

-r
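For anyone who wants to repeat a comparison along those lines, here is a rough userland Python sketch (the device path is a placeholder, and this only approximates what the controller sees, since the OS and driver also queue and reorder I/O):

# Compare one contiguous 2 MB read against 16 concurrent 128 KB reads at
# scattered offsets on a raw device.  Read-only, but run as root.
import os, time, random
from concurrent.futures import ThreadPoolExecutor

DEV = "/dev/rdsk/c1t1d0s0"      # placeholder raw device path
BLK = 128 * 1024

fd = os.open(DEV, os.O_RDONLY)

t0 = time.time()
os.pread(fd, 16 * BLK, 0)       # one contiguous 2 MB read
t_big = time.time() - t0

# Sixteen 128 KB reads at scattered, block-aligned offsets within ~16 GB,
# issued concurrently so the drive can service them in whatever order it likes.
offsets = [random.randrange(0, 1 << 34, BLK) for _ in range(16)]
t0 = time.time()
with ThreadPoolExecutor(max_workers=16) as pool:
    list(pool.map(lambda off: os.pread(fd, BLK, off), offsets))
t_small = time.time() - t0

print("contiguous 2MB read:        %.1f ms" % (t_big * 1e3))
print("16 x 128K concurrent reads: %.1f ms" % (t_small * 1e3))
os.close(fd)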
Anton Rang
2006-May-16 15:50 UTC
[zfs-discuss] Re: Re[5]: Re: Re: Due to 128KB limit in ZFS it can't saturate disks
> Ok, so let's consider your 2MB read. You have the option of setting it in one contiguous place on the disk or splitting it into 16 x 128K chunks, somewhat spread all over.
>
> Now you issue a read to that 2MB of data.
>
> As you noted, you either have to wait for the head to find the 2MB block and stream it, or you dump 16 I/O descriptors into an intelligent controller; wherever the head is, there is data to be gotten from the get go. I can't swear it wins the game, but it should be real close.

Well, the full specs aren't available, but a little math and studying some models can get us close. :-)

Let's presume we're using an enterprise-class disk, say a 37 GB Seagate Cheetah. This is best-case for seeks as it uses so little of the platter and runs at 15K RPM.

Large-block case: On average, to reach the 2 MB, we'll take 3.5 ms. Transfer can then proceed at media rate (average 110 MB/sec) and be sent to the host over a 200 MB/sec channel. 3.5 ms seek, 18.1 ms data transfer, total time 21.6 ms, for a rate of 92.6 MB/sec.

Small-block case: Each seek will be shorter than the average since we are ordering them optimally. A single-track seek is 0.2 ms; average is 3.5 ms; if we assume linear scaling (which isn't quite right) then we're looking at 1/8 of 3.7 ms = 0.46 ms. We do 16 seeks, for 7.36 ms, and our data transfer time is the same (18.1 ms), for a total of 25.46 ms, a rate of 78.5 MB/sec. Not too bad. It's pretty clear why these drives are pricey. :-)

Mmmm, actually it's not that good. There are 50K tracks on this 35 GB disk, so each track holds 700 KB. We're only storing 128KB on each track, so on average we'll need to wait nearly 1/2 of a revolution before we see any of our data under the head. At 15K RPM, that's not so bad, only 2ms, but we've got 16 times to wait, adding 32 ms, dropping our rate to roughly half what we'd get otherwise. (Older disks should, surprisingly, do better since they have less data packed onto each track!)

Looking at a 250 GB "near-line" SATA disk, and presuming its controller does the same optimizations, things are different. Average seek time is 8ms, with single-track seek time of 0.8ms, so 15 additional seeks will cost roughly 30 ms. A half-rotation wait is 4ms (60ms in total). Things are going pretty slow now.

> I just did an experiment and could see > 60MB of data out of a 35G disk using 128K chunks (> 450 IOPS).

On the only disk I have handy, I get 36 MB/sec with concurrent 128 KB chunks, 38 MB/sec with non-concurrent 2 MB chunks, 39 MB/sec with 2 MB chunks. But I'm issuing all of these I/O operations sequentially -- no seeks.

> Disruptive.

What is? Multiple I/Os outstanding to a device isn't precisely new. ;-)

Honestly, adding seeks is -never- going to improve performance. Giving the drive the opportunity to reorder I/O operations will, but splitting a single operation up can never speed it up, though if you get lucky it won't slow down.

Anton
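The arithmetic above is easy to lose track of, so here is the same model written out as a small Python sketch; the figures are the assumed values from this thread (not datasheet numbers), and the model only covers the 15K RPM case:

# Rough model: one contiguous 2 MB I/O vs 16 scattered 128 KB I/Os.
MB = 1e6
data     = 2.0 * MB              # total data to read
media    = 110 * MB              # average media rate, bytes/s (assumed)
seek_avg = 3.5e-3                # average seek for the single large I/O
seek_hop = 0.46e-3               # short seek between nearby 128K chunks
half_rev = 0.5 * 60.0 / 15000    # half a revolution at 15K RPM = 2 ms
chunks   = 16

xfer = data / media              # media transfer time, the same in every case

cases = {
    "one contiguous 2MB I/O":            seek_avg + xfer,
    "16 x 128K, short seeks only":       chunks * seek_hop + xfer,
    "16 x 128K, seeks + half-rev waits": chunks * (seek_hop + half_rev) + xfer,
}
for name, t in cases.items():
    print("%-36s %5.1f MB/s" % (name, data / t / MB))

This reproduces the ~92 MB/s, ~78 MB/s and "roughly half" figures above; small differences come from rounding the 18.1 ms transfer time.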
Robert Milkowski
2006-May-19 00:26 UTC
[zfs-discuss] Re: Re: Due to 128KB limit in ZFS it can't saturate disks
Hello Roch,

Monday, May 15, 2006, 3:23:14 PM, you wrote:

RBPE> The question put forth is whether the ZFS 128K blocksize is sufficient to saturate a regular disk. There is a great body of evidence that shows that bigger write sizes and a matching large FS clustersize lead to more throughput. The counter point is that ZFS schedules its I/O like nothing else seen before and manages to saturate a single disk using enough concurrent 128K I/O.

Nevertheless I get much more throughput using UFS and writing with large blocks than using ZFS on the same disk. And the difference is actually quite big in favor of UFS.

RBPE> <There are a few things I did here for the first time; so I may have erred in places. So I am proposing this for review by the community>

RBPE> I first measured the throughput of a write(2) to a raw device using, for instance, this:

RBPE>   dd if=/dev/zero of=/dev/rdsk/c1t1d0s0 bs=8192k count=1024

RBPE> On Solaris we would see some overhead of reading the block from /dev/zero and then issuing the write call. The tightest function that fences the I/O is default_physio(). That function will issue the I/O to the device then wait for it to complete. If we take the elapsed time spent in this function and count the bytes that are I/O-ed, this should give a good hint as to the throughput the device is providing. The above dd command will issue a single I/O at a time (d-script to measure is attached).

RBPE> Trying different blocksizes I see:

RBPE>   Bytes Sent    Elapse of phys      IO Size
RBPE>
RBPE>     8 MB;  3576 ms of phys; avg sz :    16 KB; throughput  2 MB/s
RBPE>     9 MB;  1861 ms of phys; avg sz :    32 KB; throughput  4 MB/s
RBPE>    31 MB;  3450 ms of phys; avg sz :    64 KB; throughput  8 MB/s
RBPE>    78 MB;  4932 ms of phys; avg sz :   128 KB; throughput 15 MB/s
RBPE>   124 MB;  4903 ms of phys; avg sz :   256 KB; throughput 25 MB/s
RBPE>   178 MB;  4868 ms of phys; avg sz :   512 KB; throughput 36 MB/s
RBPE>   226 MB;  4824 ms of phys; avg sz :  1024 KB; throughput 46 MB/s
RBPE>   226 MB;  4816 ms of phys; avg sz :  2048 KB; throughput 46 MB/s
RBPE>    32 MB;   686 ms of phys; avg sz :  4096 KB; throughput 46 MB/s
RBPE>   224 MB;  4741 ms of phys; avg sz :  8192 KB; throughput 47 MB/s

Just to be sure - you did reconfigure the system to actually allow larger IO sizes?

RBPE> Now let's see what ZFS gets. I measure using a single dd process. ZFS will chunk up data in 128K blocks. Now the dd command interacts with memory. But the I/Os are scheduled under the control of spa_sync(). So in the d-script (attached) I check for the start of an spa_sync and time that based on elapsed time. At the same time I gather the number of bytes and keep a count of I/Os (bdev_strategy) that are being issued. When the spa_sync completes we are sure that all those are on stable storage. The script is a bit more complex because there are 2 threads that issue spa_sync, but only one of them actually becomes activated. So the script will print out some spurious lines of output at times.
RBPE> I measure I/O with the script while this runs:

RBPE>   dd if=/dev/zero of=/zfs2/roch/f1 bs=1024k count=8000

RBPE> And I see:

RBPE>   1431 MB; 23723 ms of spa_sync; avg sz : 127 KB; throughput 60 MB/s
RBPE>   1387 MB; 23044 ms of spa_sync; avg sz : 127 KB; throughput 60 MB/s
RBPE>   2680 MB; 44209 ms of spa_sync; avg sz : 127 KB; throughput 60 MB/s
RBPE>   1359 MB; 24223 ms of spa_sync; avg sz : 127 KB; throughput 56 MB/s
RBPE>   1143 MB; 19183 ms of spa_sync; avg sz : 126 KB; throughput 59 MB/s

RBPE> OK, I cheated. Here, ZFS is given a full disk to play with. In this case ZFS enables the write cache. Note that even with the write cache enabled, when the spa_sync() completes, it will be after a flush of the cache has been executed. So the 60 MB/sec do correspond to data sent to the platter. I just tried disabling the cache (with format -e) but I am not sure if that is taken into account by ZFS; results are the same 60 MB/sec. This will have to be confirmed.

RBPE> With the write cache enabled, the physio test reaches 66 MB/s as soon as we are issuing 16KB I/Os. Here clearly though, data is not on the platter when the timed function completes.

RBPE> Another variable not fully controlled is the physical (cylinder) location of the I/O. It could be that some of the differences come from that.

RBPE> What do I take away?

RBPE>   a single 2MB physical I/O will get 46 MB/sec out of my disk.

RBPE>   35 concurrent 128K I/Os sustained, followed by metadata I/O, followed by
RBPE>   a flush of the write cache, allow ZFS to get 60 MB/sec out of the same disk.

RBPE> This is what underwrites my belief that 128K blocksize is sufficiently large. Now, nothing here proves that 256K would not give more throughput; so nothing is really settled. But I hope this helps put us on common ground.

This is really interesting, because what I see here with a very similar test is the opposite.

What kind of disk do you use? (Mine is a 15K 73GB FC disk, connected with dual paths to the host with MPxIO.)

I use iostat to see actual throughput - you use DTrace - maybe we measure different things?

--
Best regards,
 Robert                          mailto:rmilkowski at task.gda.pl
                                 http://milek.blogspot.com
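One way to take both measurement methods out of the equation is to time the writes from userland. A minimal sketch follows, assuming Python is available on the box; the device path is a placeholder and the test is destructive (it writes to the raw device):

# Time synchronous writes of various sizes straight to a raw device and
# report MB/s.  This includes syscall overhead that the default_physio()
# timing excludes, but for large blocks the figures should converge.
# WARNING: destroys data on DEV.
import os, time

DEV = "/dev/rdsk/c1t1d0s0"          # placeholder raw device
TOTAL = 256 * 1024 * 1024           # bytes written per block size

fd = os.open(DEV, os.O_WRONLY)
for kb in (16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192):
    buf = b"\0" * (kb * 1024)
    written = 0
    t0 = time.time()
    while written < TOTAL:
        os.pwrite(fd, buf, written)         # sequential, block-aligned offsets
        written += len(buf)
    dt = time.time() - t0
    print("%5d KB writes: %5.1f MB/s" % (kb, written / dt / 1e6))
os.close(fd)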
Roch Bourbonnais - Performance Engineering
2006-May-19 13:53 UTC
[zfs-discuss] Re: Re: Due to 128KB limit in ZFS it can't saturate disks
Robert Milkowski writes:

> Hello Roch,
>
> Monday, May 15, 2006, 3:23:14 PM, you wrote:
>
> RBPE> The question put forth is whether the ZFS 128K blocksize is sufficient to saturate a regular disk. There is a great body of evidence that shows that bigger write sizes and a matching large FS clustersize lead to more throughput. The counter point is that ZFS schedules its I/O like nothing else seen before and manages to saturate a single disk using enough concurrent 128K I/O.
>
> Nevertheless I get much more throughput using UFS and writing with large blocks than using ZFS on the same disk. And the difference is actually quite big in favor of UFS.
>

Absolutely. Isn't this the following issue, though?

    6415647 Sequential writing is jumping

We will have to fix this to allow dd to get more throughput. I'm pretty sure the fix won't need to increase the blocksize though.

I'll be picking up this thread again I hope next week. I have lots of homework to do to respond properly.

-r
Roch Bourbonnais - Performance Engineering
2006-May-22 13:42 UTC
[zfs-discuss] Re: Re: Due to 128KB limit in ZFS it can't saturate disks
Robert says:

  Just to be sure - you did reconfigure the system to actually allow larger IO sizes?

Sure enough, I messed up (I had no tuning to get the above data); so 1 MB was my max transfer size. Using 8MB I now see:

  Bytes Sent    Elapse of phys      IO Size

    8 MB;  3576 ms of phys; avg sz :    16 KB; throughput  2 MB/s
    9 MB;  1861 ms of phys; avg sz :    32 KB; throughput  4 MB/s
   31 MB;  3450 ms of phys; avg sz :    64 KB; throughput  8 MB/s
   78 MB;  4932 ms of phys; avg sz :   128 KB; throughput 15 MB/s
  124 MB;  4903 ms of phys; avg sz :   256 KB; throughput 25 MB/s
  178 MB;  4868 ms of phys; avg sz :   512 KB; throughput 36 MB/s
  226 MB;  4824 ms of phys; avg sz :  1024 KB; throughput 46 MB/s
  226 MB;  4816 ms of phys; avg sz :  2048 KB; throughput 54 MB/s (was 46 MB/s)
   32 MB;   686 ms of phys; avg sz :  4096 KB; throughput 58 MB/s (was 46 MB/s)
  224 MB;  4741 ms of phys; avg sz :  8192 KB; throughput 59 MB/s (was 47 MB/s)
  272 MB;  4336 ms of phys; avg sz : 16384 KB; throughput 58 MB/s (new data)
  288 MB;  4327 ms of phys; avg sz : 32768 KB; throughput 59 MB/s (new data)

Data was corrected after it was pointed out that physio will be throttled by maxphys. New data was obtained after setting:

  /etc/system:           set maxphys=8338608
  /kernel/drv/sd.conf:   sd_max_xfer_size=0x800000
  /kernel/drv/ssd.conf:  ssd_max_xfer_size=0x800000

and setting un_max_xfer_size in "struct sd_lun". That address was figured out using dtrace and knowing that sdmin() calls ddi_get_soft_state (details available upon request). And of course disabling the write cache (using format -e).

With this in place I verified that each sdwrite() up to 8M would lead to a single biodone interrupt, using this:

  dtrace -n 'biodone:entry,sdwrite:entry{@a[probefunc, stack(20)]=count()}'

Note that for 16M and 32M raw device writes, each default_physio will issue a series of 8M I/Os. And so we don't expect any more throughput from that.

The script used to measure the rates (phys.d) was also modified, since I was counting the bytes before the I/O had completed and that made a big difference for the very large I/O sizes.

If you take the 8M case, the above rates correspond to the time it takes to issue and wait for a single 8M I/O to the sd driver. So this time certainly does include 1 seek and ~0.13 seconds of data transfer, then the time to respond to the interrupt, and finally the wakeup of the thread waiting in default_physio(). Given that the data transfer rate using 4 MB is very close to the one using 8 MB, I'd say that at 60 MB/sec all the fixed-cost elements are well amortized. So I would conclude from this that the limiting factor is now at the device itself or on the data channel between the disk and the host.

Now recall the throughput that ZFS gets during an spa_sync when submitted to a single dd, knowing that ZFS will work with 128K I/O:

  1431 MB; 23723 ms of spa_sync; avg sz : 127 KB; throughput 60 MB/s
  1387 MB; 23044 ms of spa_sync; avg sz : 127 KB; throughput 60 MB/s
  2680 MB; 44209 ms of spa_sync; avg sz : 127 KB; throughput 60 MB/s
  1359 MB; 24223 ms of spa_sync; avg sz : 127 KB; throughput 56 MB/s
  1143 MB; 19183 ms of spa_sync; avg sz : 126 KB; throughput 59 MB/s

My disk is <HITACHI-DK32EJ36NSUN36G-PQ08-33.92GB>.

As you say, we don't measure things the same way. At the dd-to-raw level I think our data, with my mistake corrected, will now be similar. At the ZFS level, we cannot use iostat quite _yet_ because of

  6415647 Sequential writing is jumping

With iostat, the 1-second average will see, at times, some periods in which we won't issue any I/O.
So it's not a good measure of the capacity of a disk. This is why I reverted to my script, which times the I/O rate but only "when it counts". When we fix 6415647, the expectation is that we will sustain that throughput for whatever time is necessary. At that point, I expect the throughput as seen from iostat and the throughput from a ptime of dd itself will all converge.

And so, after a moment of doubt, I am still inclined to believe that 128K I/Os, when issued properly, can lead to, if not saturation, a very good throughput from a basic disk.

Now, Anton's demonstration is convincing in its own way. I can concur that any seek time is unproductive and will degrade throughput at the device level. But if the weak link is the data transfer rate between the device and the host, then it can be argued that the seek time can actually be hidden behind some data transfer time? At 60MB/sec, a 128K data transfer takes 2ms, which maybe is sufficient to get the head to the next block? My disk does reach > 450 IOPS when controlled by ZFS, so it all adds up.

Bear in mind also that throughput is not the only consideration when setting the ZFS recordsize. The smaller the record size, the more manageable the disk blocks will be. So everything is a tradeoff, and at this point 128K appears sufficiently large ... at least for a while.

-r

____________________________________________________________________________________
Roch Bourbonnais                    Sun Microsystems, Icnc-Grenoble
Senior Performance Analyst          180, Avenue De L'Europe, 38330,
                                    Montbonnot Saint Martin, France
Performance & Availability Engineering
http://icncweb.france/~rbourbon     http://blogs.sun.com/roller/page/roch
Roch.Bourbonnais at Sun.Com          (+33).4.76.18.83.20

New scripts to measure dd to raw throughput:

-------------- next part --------------
A non-text attachment was scrubbed...
Name: phys.d
Type: application/octet-stream
Size: 606 bytes
Desc: not available
URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20060522/26b1ec85/attachment.obj>
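As a quick sanity check of the two figures in the message above (pure arithmetic, nothing measured; a Python sketch only):

# 450 IOPS of 128 KB each, and the transfer time of one 128 KB block at 60 MB/s.
iops  = 450
block = 128 * 1024        # bytes
rate  = 60e6              # bytes per second

print("450 x 128K      = %.1f MB/s" % (iops * block / 1e6))     # ~59 MB/s
print("128K at 60 MB/s = %.2f ms"   % (block / rate * 1e3))     # ~2.2 ms

Both numbers land where the message puts them: just under 60 MB/s, and a shade over 2 ms per block.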
Robert Milkowski
2006-May-23 21:19 UTC
[zfs-discuss] Re: Re: Due to 128KB limit in ZFS it can't saturate disks
Hello Roch,

Monday, May 22, 2006, 3:42:41 PM, you wrote:

RBPE> Robert says:

RBPE>   Just to be sure - you did reconfigure the system to actually allow larger IO sizes?

RBPE> Sure enough, I messed up (I had no tuning to get the above data); so 1 MB was my max transfer size. Using 8MB I now see:

RBPE>   Bytes Sent    Elapse of phys      IO Size

RBPE>     8 MB;  3576 ms of phys; avg sz :    16 KB; throughput  2 MB/s
RBPE>     9 MB;  1861 ms of phys; avg sz :    32 KB; throughput  4 MB/s
RBPE>    31 MB;  3450 ms of phys; avg sz :    64 KB; throughput  8 MB/s
RBPE>    78 MB;  4932 ms of phys; avg sz :   128 KB; throughput 15 MB/s
RBPE>   124 MB;  4903 ms of phys; avg sz :   256 KB; throughput 25 MB/s
RBPE>   178 MB;  4868 ms of phys; avg sz :   512 KB; throughput 36 MB/s
RBPE>   226 MB;  4824 ms of phys; avg sz :  1024 KB; throughput 46 MB/s
RBPE>   226 MB;  4816 ms of phys; avg sz :  2048 KB; throughput 54 MB/s (was 46 MB/s)
RBPE>    32 MB;   686 ms of phys; avg sz :  4096 KB; throughput 58 MB/s (was 46 MB/s)
RBPE>   224 MB;  4741 ms of phys; avg sz :  8192 KB; throughput 59 MB/s (was 47 MB/s)
RBPE>   272 MB;  4336 ms of phys; avg sz : 16384 KB; throughput 58 MB/s (new data)
RBPE>   288 MB;  4327 ms of phys; avg sz : 32768 KB; throughput 59 MB/s (new data)

RBPE> Data was corrected after it was pointed out that physio will be throttled by maxphys. New data was obtained after setting:

RBPE>   /etc/system:           set maxphys=8338608
RBPE>   /kernel/drv/sd.conf:   sd_max_xfer_size=0x800000
RBPE>   /kernel/drv/ssd.conf:  ssd_max_xfer_size=0x800000

RBPE> and setting un_max_xfer_size in "struct sd_lun". That address was figured out using dtrace and knowing that sdmin() calls ddi_get_soft_state (details available upon request). And of course disabling the write cache (using format -e).

RBPE> With this in place I verified that each sdwrite() up to 8M would lead to a single biodone interrupt, using this:

RBPE>   dtrace -n 'biodone:entry,sdwrite:entry{@a[probefunc, stack(20)]=count()}'

RBPE> Note that for 16M and 32M raw device writes, each default_physio will issue a series of 8M I/Os. And so we don't expect any more throughput from that.

RBPE> The script used to measure the rates (phys.d) was also modified, since I was counting the bytes before the I/O had completed and that made a big difference for the very large I/O sizes.

RBPE> If you take the 8M case, the above rates correspond to the time it takes to issue and wait for a single 8M I/O to the sd driver. So this time certainly does include 1 seek and ~0.13 seconds of data transfer, then the time to respond to the interrupt, and finally the wakeup of the thread waiting in default_physio(). Given that the data transfer rate using 4 MB is very close to the one using 8 MB, I'd say that at 60 MB/sec all the fixed-cost elements are well amortized. So I would conclude from this that the limiting factor is now at the device itself or on the data channel between the disk and the host.
RBPE> Now recall the throughput that ZFS gets during an spa_sync when submitted to a single dd, knowing that ZFS will work with 128K I/O:

RBPE>   1431 MB; 23723 ms of spa_sync; avg sz : 127 KB; throughput 60 MB/s
RBPE>   1387 MB; 23044 ms of spa_sync; avg sz : 127 KB; throughput 60 MB/s
RBPE>   2680 MB; 44209 ms of spa_sync; avg sz : 127 KB; throughput 60 MB/s
RBPE>   1359 MB; 24223 ms of spa_sync; avg sz : 127 KB; throughput 56 MB/s
RBPE>   1143 MB; 19183 ms of spa_sync; avg sz : 126 KB; throughput 59 MB/s

RBPE> My disk is <HITACHI-DK32EJ36NSUN36G-PQ08-33.92GB>.

Is it over FC or just SCSI/SAS?

I have to try again with SAS/SCSI - maybe due to more overhead in FC, larger IOs give better results there than on SCSI?

--
Best regards,
 Robert                          mailto:rmilkowski at task.gda.pl
                                 http://milek.blogspot.com
Robert Milkowski
2006-May-23 21:23 UTC
[zfs-discuss] Re: Re: Due to 128KB limit in ZFS it can't saturate disks
Hello Roch,

Friday, May 19, 2006, 3:53:35 PM, you wrote:

RBPE> Robert Milkowski writes:
>> Hello Roch,
>>
>> Monday, May 15, 2006, 3:23:14 PM, you wrote:
>>
>> RBPE> The question put forth is whether the ZFS 128K blocksize is sufficient to saturate a regular disk. There is a great body of evidence that shows that bigger write sizes and a matching large FS clustersize lead to more throughput. The counter point is that ZFS schedules its I/O like nothing else seen before and manages to saturate a single disk using enough concurrent 128K I/O.
>>
>> Nevertheless I get much more throughput using UFS and writing with large blocks than using ZFS on the same disk. And the difference is actually quite big in favor of UFS.
>>

RBPE> Absolutely. Isn't this the following issue, though?

RBPE>     6415647 Sequential writing is jumping

RBPE> We will have to fix this to allow dd to get more throughput. I'm pretty sure the fix won't need to increase the blocksize though.

Maybe - but it also means that until this is addressed, it doesn't make any sense to compare ZFS to other filesystems for sequential writing...

The question is how well the above problem is understood and when it is going to be corrected.

And why, in your test cases, which are similar to mine, do you not see dd to the raw device being faster by any significant factor? (Again, maybe you are using SCSI and I use FC.)

--
Best regards,
 Robert                          mailto:rmilkowski at task.gda.pl
                                 http://milek.blogspot.com
Torsten "Paul" Eichstädt
2007-Oct-24 17:44 UTC
[zfs-discuss] Due to 128KB limit in ZFS it can't saturate disks
Hi,

what is the exact syntax to enable large transfer sizes? I didn't find any documentation, so I guessed the following:

/etc/system:
  set maxphys=8338608

/kernel/drv/sd.conf:
  name="sd" parent="scsi" sd_max_xfer_size=0x800000;

/kernel/drv/ssd.conf:
  name="ssd" parent="scsi_vhci" sd_max_xfer_size=0x800000;

(I have FC drives.)

Where can I teach myself about the disadvantages? I searched for an article or paper about "Why 128k blocksize is enough" written by the ZFS designer, but could not find it...

Thx in advance,
Paul

This message posted from opensolaris.org
See the QFS documentation: http://docs.sun.com/source/817-4091-10/chapter8.html#59255 (Steps 1 through 4 would apply to any file system which can issue multi-megabyte I/O requests.) This message posted from opensolaris.org