Robert Milkowski
2006-Apr-21 10:49 UTC
[zfs-discuss] Due to 128KB limit in ZFS it can't saturate disks
Hi.

I was doing some tests with sequential writes of large files. I created a mirrored pool (12 x FC 15K disks, JBOD, FC-AL, MPxIO).

bash-3.00# zpool status p-86-102
  pool: p-86-102
 state: ONLINE
 scrub: scrub completed with 0 errors on Fri Apr 21 12:31:37 2006
config:

        NAME                       STATE     READ WRITE CKSUM
        p-86-102                   ONLINE       0     0     0
          mirror                   ONLINE       0     0     0
            c5t500000E0119495A0d0  ONLINE       0     0     0
            c5t500000E0118EDB20d0  ONLINE       0     0     0
          mirror                   ONLINE       0     0     0
            c5t500000E01194A720d0  ONLINE       0     0     0
            c5t500000E0118EDA10d0  ONLINE       0     0     0
          mirror                   ONLINE       0     0     0
            c5t500000E01194A750d0  ONLINE       0     0     0
            c5t500000E01190E6B0d0  ONLINE       0     0     0
          mirror                   ONLINE       0     0     0
            c5t500000E01194A8C0d0  ONLINE       0     0     0
            c5t500000E0118F21C0d0  ONLINE       0     0     0
          mirror                   ONLINE       0     0     0
            c5t500000E011949570d0  ONLINE       0     0     0
            c5t500000E0118F1FD0d0  ONLINE       0     0     0
          mirror                   ONLINE       0     0     0
            c5t500000E011949480d0  ONLINE       0     0     0
            c5t500000E01190E5B0d0  ONLINE       0     0     0

errors: No known data errors
bash-3.00#

dd if=/dev/zero of=/p-86-102/q1 bs=1024k

Using iostat I get something like:

    r/s    w/s   kr/s      kw/s wait  actv wsvc_t asvc_t  %w   %b device
    0.0 1942.4    0.0  248622.7  0.0 420.0    0.0  216.2   0 1200 c5
    0.0  161.0    0.0   20611.2  0.0  35.0    0.0  217.3   0  100 c5t500000E0118F1FD0d0
    0.0  160.0    0.0   20483.3  0.0  35.0    0.0  218.7   0  100 c5t500000E01194A8C0d0
    0.0  163.0    0.0   20867.4  0.0  35.0    0.0  214.7   0  100 c5t500000E01194A720d0
    0.0  169.0    0.0   21635.7  0.0  35.0    0.0  207.0   0  100 c5t500000E011949570d0
    0.0  152.0    0.0   19459.4  0.0  35.0    0.0  230.2   0  100 c5t500000E0118EDA10d0
    0.0  160.0    0.0   20483.7  0.0  35.0    0.0  218.7   0  100 c5t500000E01190E5B0d0
    0.0  161.0    0.0   20611.8  0.0  35.0    0.0  217.3   0  100 c5t500000E0118F21C0d0
    0.0  159.0    0.0   20355.8  0.0  35.0    0.0  220.1   0  100 c5t500000E01190E6B0d0
    0.0  168.0    0.0   21508.2  0.0  35.0    0.0  208.3   0  100 c5t500000E01194A750d0
    0.0  172.0    0.0   22020.3  0.0  35.0    0.0  203.4   0  100 c5t500000E011949480d0
    0.0  163.0    0.0   20868.1  0.0  35.0    0.0  214.7   0  100 c5t500000E0119495A0d0
    0.0  154.0    0.0   19715.9  0.0  35.0    0.0  227.2   0  100 c5t500000E0118EDB20d0

So ZFS issues 128KB I/Os to the individual disks.

Now let's see how much we can get from a single disk with different I/O sizes.

dd if=/dev/zero of=/dev/rdsk/c5t500000E0119495A0d0s0 bs=128k

    extended device statistics
    r/s    w/s   kr/s     kw/s wait  actv wsvc_t asvc_t  %w  %b device
    0.0  185.0    0.0  23684.9  0.0   1.0    0.0    5.3   0  98 c5
    0.0  185.0    0.0  23685.3  0.0   1.0    0.0    5.3   0  98 c5t500000E0119495A0d0

dd if=/dev/zero of=/dev/rdsk/c5t500000E0119495A0d0s0 bs=512k

    r/s    w/s   kr/s     kw/s wait  actv wsvc_t asvc_t  %w  %b device
    0.0  102.9    0.0  52707.6  0.0   1.0    0.0    9.4   0  97 c5
    0.0  102.9    0.0  52706.5  0.0   1.0    0.0    9.4   0  97 c5t500000E0119495A0d0

dd if=/dev/zero of=/dev/rdsk/c5t500000E0119495A0d0s0 bs=1024k

    r/s    w/s   kr/s     kw/s wait  actv wsvc_t asvc_t  %w  %b device
    0.0   65.0    0.0  66566.3  0.0   1.0    0.0   14.7   0  96 c5
    0.0   65.0    0.0  66567.1  0.0   1.0    0.0   14.7   0  96 c5t500000E0119495A0d0

OK, I tried again, this time with the second side of each mirror offlined (so the host doesn't have to write the data twice, which could affect the performance).

bash-3.00# zpool offline p-86-102 c5t500000E0118EDB20d0 c5t500000E0118EDA10d0 c5t500000E01190E6B0d0 c5t500000E0118F21C0d0 c5t500000E0118F1FD0d0 c5t500000E01190E5B0d0
Bringing device c5t500000E0118EDB20d0 offline
Bringing device c5t500000E0118EDA10d0 offline
Bringing device c5t500000E01190E6B0d0 offline
Bringing device c5t500000E0118F21C0d0 offline
Bringing device c5t500000E0118F1FD0d0 offline
Bringing device c5t500000E01190E5B0d0 offline
bash-3.00# zpool status p-86-102
  pool: p-86-102
 state: DEGRADED
status: One or more devices has been taken offline by the adminstrator.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Online the device using 'zpool online' or replace the device with
        'zpool replace'.
 scrub: none requested
config:

        NAME                       STATE     READ WRITE CKSUM
        p-86-102                   DEGRADED     0     0     0
          mirror                   DEGRADED     0     0     0
            c5t500000E0119495A0d0  ONLINE       0     0     0
            c5t500000E0118EDB20d0  OFFLINE      0     0     0
          mirror                   DEGRADED     0     0     0
            c5t500000E01194A720d0  ONLINE       0     0     0
            c5t500000E0118EDA10d0  OFFLINE      0     0     0
          mirror                   DEGRADED     0     0     0
            c5t500000E01194A750d0  ONLINE       0     0     0
            c5t500000E01190E6B0d0  OFFLINE      0     0     0
          mirror                   DEGRADED     0     0     0
            c5t500000E01194A8C0d0  ONLINE       0     0     0
            c5t500000E0118F21C0d0  OFFLINE      0     0     0
          mirror                   DEGRADED     0     0     0
            c5t500000E011949570d0  ONLINE       0     0     0
            c5t500000E0118F1FD0d0  OFFLINE      0     0     0
          mirror                   DEGRADED     0     0     0
            c5t500000E011949480d0  ONLINE       0     0     0
            c5t500000E01190E5B0d0  OFFLINE      0     0     0

errors: No known data errors

And now:

dd if=/dev/zero of=/p-86-102/q2 bs=1024k

    extended device statistics
    r/s    w/s   kr/s      kw/s wait  actv wsvc_t asvc_t  %w   %b device
    0.0 1532.1    0.0  196107.5  0.0 210.0    0.0  137.0   0  600 c5
    0.0  239.0    0.0   30591.4  0.0  35.0    0.0  146.4   0  100 c5t500000E01194A8C0d0
    0.0  256.0    0.0   32767.4  0.0  35.0    0.0  136.7   0  100 c5t500000E01194A720d0
    0.0  255.0    0.0   32639.7  0.0  35.0    0.0  137.2   0  100 c5t500000E011949570d0
    0.0  263.0    0.0   33668.1  0.0  35.0    0.0  133.0   0  100 c5t500000E01194A750d0
    0.0  278.0    0.0   35588.6  0.0  35.0    0.0  125.9   0  100 c5t500000E011949480d0
    0.0  241.0    0.0   30852.0  0.0  35.0    0.0  145.2   0  100 c5t500000E0119495A0d0

Well, it looks like if ZFS could issue larger I/Os it could greatly improve performance for large sequential writes (2x?).
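A quick per-disk restatement of the numbers above (a worked check added for clarity, not part of the original post): every iostat row is consistent with 128 KB per write, and the per-disk rate roughly doubles when the raw device is fed 1 MB writes instead of 128 KB ones:

\[
161\ \mathrm{w/s} \times 128\ \mathrm{KB} \approx 20.6\ \mathrm{MB/s\ per\ disk\ (mirrored)},\qquad
255\ \mathrm{w/s} \times 128\ \mathrm{KB} \approx 32\ \mathrm{MB/s\ per\ disk\ (mirrors\ offlined)},\qquad
65\ \mathrm{w/s} \times 1\ \mathrm{MB} \approx 65\ \mathrm{MB/s\ (raw\ dd,\ 1\ MB)}
\]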
Gregory Shaw
2006-Apr-21 12:54 UTC
[zfs-discuss] Due to 128KB limit in ZFS it can't saturate disks
Would the maxphys system parameter have an impact on the below?

On Apr 21, 2006, at 4:49 AM, Robert Milkowski wrote:

> [full quote of Robert's original message trimmed; see above]

-----
Gregory Shaw, IT Architect          Phone: (303) 673-8273      Fax: (303) 673-8273
ITCTO Group, Sun Microsystems Inc.
1 StorageTek Drive MS 4382          greg.shaw at sun.com (work)
Louisville, CO 80028-4382           shaw at fmsoft.com (home)
"When Microsoft writes an application for Linux, I've Won." - Linus Torvalds
Roch Bourbonnais - Performance Engineering
2006-Apr-21 13:35 UTC
[zfs-discuss] Due to 128KB limit in ZFS it can't saturate disks
We believe that issuing larger I/Os would not impact ZFS performance that much. What must be considered is both I/O size and concurrency. ZFS will issue 128K I/Os, but many of them concurrently, whereas

        dd if=/dev/zero of=/dev/rdsk/c5t500000E0119495A0d0s0 bs=128k

only issues one 128K I/O at a time (does it not?).

Now, we still have to explain why you saturate at 250MB/sec. Is your dd process eating up a full CPU?

-r

Robert Milkowski writes:

> [full quote of Robert's original message trimmed; see above]
Robert Milkowski
2006-Apr-29 13:13 UTC
[zfs-discuss] Re: Due to 128KB limit in ZFS it can't saturate disks
To make things simpler I did a test with only one disk.

bash-3.00# dd if=/dev/zero of=/dev/rdsk/c5t500000E0119495A0d0s0

    extended device statistics
    r/s    w/s   kr/s     kw/s wait  actv wsvc_t asvc_t  %w  %b device
    0.0   62.8    0.0  64333.2  0.0   1.0    0.0   15.3   0  96 c5
    0.0   62.8    0.0  64334.1  0.0   1.0    0.0   15.3   0  96 c5t500000E0119495A0d0
    extended device statistics
    0.0   65.0    0.0  66554.5  0.0   1.0    0.0   14.7   0  96 c5
    0.0   65.0    0.0  66554.3  0.0   1.0    0.0   14.7   0  96 c5t500000E0119495A0d0
    extended device statistics
    0.0   62.0    0.0  63494.3  0.0   1.0    0.0   15.5   0  96 c5
    0.0   62.0    0.0  63493.9  0.0   1.0    0.0   15.5   0  96 c5t500000E0119495A0d0
    extended device statistics
    0.0   64.0    0.0  65531.7  0.0   1.0    0.0   15.0   0  96 c5
    0.0   64.0    0.0  65532.3  0.0   1.0    0.0   15.0   0  96 c5t500000E0119495A0d0

bash-3.00# zpool create one c5t500000E0119495A0d0
bash-3.00# dd if=/dev/zero of=/one/q2 bs=1024k

    extended device statistics
    r/s    w/s   kr/s     kw/s wait  actv wsvc_t asvc_t  %w  %b device
    0.0  388.0    0.0  49666.2  0.0  35.0    0.0   90.2   0 100 c5
    0.0  388.0    0.0  49667.7  0.0  35.0    0.0   90.2   0 100 c5t500000E0119495A0d0
    extended device statistics
    0.0  380.0    0.0  48640.1  0.0  35.0    0.0   92.1   0 100 c5
    0.0  380.0    0.0  48640.1  0.0  35.0    0.0   92.1   0 100 c5t500000E0119495A0d0
    extended device statistics
    0.0  379.0    0.0  48516.8  0.0  35.0    0.0   92.3   0 100 c5
    0.0  379.0    0.0  48517.1  0.0  35.0    0.0   92.3   0 100 c5t500000E0119495A0d0
    extended device statistics
    0.0  371.0    0.0  47484.3  0.0  35.0    0.0   94.3   0 100 c5
    0.0  371.0    0.0  47484.0  0.0  35.0    0.0   94.3   0 100 c5t500000E0119495A0d0
    extended device statistics
    0.0  378.0    0.0  48382.0  0.0  35.0    0.0   92.6   0 100 c5
    0.0  378.0    0.0  48382.0  0.0  35.0    0.0   92.6   0 100 c5t500000E0119495A0d0

Well, it looks like ZFS is slower in this case and can't saturate a single disk (~30% less performance than dd).

Let's try dd with a 128KB block:

bash-3.00# dd if=/dev/zero of=/dev/rdsk/c5t500000E0119495A0d0s0 bs=128k

    extended device statistics
    r/s    w/s   kr/s     kw/s wait  actv wsvc_t asvc_t  %w  %b device
    0.0  182.1    0.0  23305.7  0.0   1.0    0.0    5.4   0  98 c5
    0.0  182.1    0.0  23306.0  0.0   1.0    0.0    5.4   0  98 c5t500000E0119495A0d0
    extended device statistics
    0.0  181.9    0.0  23284.8  0.0   1.0    0.0    5.4   0  98 c5
    0.0  181.9    0.0  23284.3  0.0   1.0    0.0    5.4   0  98 c5t500000E0119495A0d0
    extended device statistics
    0.0  185.0    0.0  23681.8  0.0   1.0    0.0    5.3   0  98 c5
    0.0  185.0    0.0  23682.0  0.0   1.0    0.0    5.3   0  98 c5t500000E0119495A0d0

OK, so when writing with a 128KB block ZFS issues about 2x more I/Os to the single disk than dd does with a 128KB block, giving ZFS double the throughput of dd.

I also checked with an 8MB block size:

bash-3.00# dd if=/dev/zero of=/dev/rdsk/c5t500000E0119495A0d0s0 bs=8192k

    extended device statistics
    r/s    w/s   kr/s     kw/s wait  actv wsvc_t asvc_t  %w  %b device
    0.0    9.8    0.0  80642.8  0.0   0.9    0.0   93.7   0  92 c5
    0.0    9.8    0.0  80641.9  0.0   0.9    0.0   93.7   0  92 c5t500000E0119495A0d0
    extended device statistics
    0.0   10.2    0.0  83267.8  0.0   0.9    0.0   91.1   0  93 c5
    0.0   10.2    0.0  83268.1  0.0   0.9    0.0   91.1   0  93 c5t500000E0119495A0d0
    extended device statistics
    0.0    9.8    0.0  80552.4  0.0   0.9    0.0   93.8   0  92 c5
    0.0    9.8    0.0  80551.6  0.0   0.9    0.0   93.8   0  92 c5t500000E0119495A0d0

However, using larger I/O sizes dd can easily beat ZFS by a margin of 30% (ZFS is even almost two times slower when an 8MB block size is used).

I would say that ZFS would definitely benefit from larger block sizes - at least for large sequential writes of large files.
Richard Elling
2006-May-01 01:51 UTC
[zfs-discuss] Re: Due to 128KB limit in ZFS it can't saturate disks
comment below...

On Sat, 2006-04-29 at 06:13 -0700, Robert Milkowski wrote:

> [test commands and iostat output trimmed; see Robert's message above]
>
> Well, it looks like ZFS is slower in this case and can't saturate a single disk (~30% less performance than dd).

I disagree with your cause, but agree with your observations. Here's why.

In the dd case, you are doing pure sequential I/O. The asvc_t (service time of the queue in the disk) is ~15 ms, leading to the disk being 96% busy. In the ZFS case, asvc_t is ~92 ms and the disk is 100% busy. ZFS is clearly saturating the disk (100%) and dd is not (96%).

ZFS, or any filesystem, will at some point need to update the metadata, which will cause extra, perhaps long, seeks. A dd to the raw device won't. For disks which handle multiple outstanding commands, or for RAID controllers, this is more difficult to see, as there is another entity rescheduling the I/O at or near the disk which will try to avoid the long seeks. To go along with this rescheduling is another layer of caching with its own optimal block size, and so on...
 -- richard
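One way to connect the iostat columns in the two runs being compared (a worked check added here, not part of the original reply): per-disk IOPS is roughly the number of outstanding I/Os divided by the service time, and throughput is IOPS times the I/O size:

\[
\mathrm{IOPS} \approx \frac{\mathrm{actv}}{\mathrm{asvc\_t}}:\qquad
\text{raw dd: } \frac{1.0}{15\ \mathrm{ms}} \approx 67\ \mathrm{IOPS} \times 1\ \mathrm{MB} \approx 65\ \mathrm{MB/s};\qquad
\text{ZFS: } \frac{35}{92\ \mathrm{ms}} \approx 380\ \mathrm{IOPS} \times 128\ \mathrm{KB} \approx 47\ \mathrm{MB/s}
\]

Both observed throughputs follow directly from the queue depth, the service time, and the I/O size.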
Robert Milkowski
2006-May-12 11:22 UTC
[zfs-discuss] Re: Re: Due to 128KB limit in ZFS it can't saturate disks
Well, I have just tested UFS on the same disk.

bash-3.00# newfs -v /dev/rdsk/c5t500000E0119495A0d0s0
newfs: construct a new file system /dev/rdsk/c5t500000E0119495A0d0s0: (y/n)? y
mkfs -F ufs /dev/rdsk/c5t500000E0119495A0d0s0 143358287 128 48 8192 1024 16 1 1 8192 t 0 -1 1 1024 n
Warning: 5810 sector(s) in last cylinder unallocated
/dev/rdsk/c5t500000E0119495A0d0s0: 143358286 sectors in 23334 cylinders of 48 tracks, 128 sectors
        69999.2MB in 1459 cyl groups (16 c/g, 48.00MB/g, 5824 i/g)
super-block backups (for fsck -F ufs -o b=#) at:
 32, 98464, 196896, 295328, 393760, 492192, 590624, 689056, 787488, 885920,
Initializing cylinder groups:
............................
super-block backups for last 10 cylinder groups at:
 142447776, 142546208, 142644640, 142743072, 142841504, 142939936, 143038368,
 143136800, 143235232, 143333664
bash-3.00# mkdir /mnt/1
bash-3.00# mount -o noatime /dev/dsk/c5t500000E0119495A0d0s0 /mnt/1

bash-3.00# dd if=/dev/zero of=/mnt/1/q1 bs=8192k
^C110+0 records in
110+0 records out
bash-3.00#

    extended device statistics
    r/s    w/s   kr/s     kw/s wait  actv wsvc_t asvc_t  %w  %b device
    5.0   25.0   35.0  82408.8  0.0   3.6    0.0  120.3   0  99 c5
    5.0   25.0   35.0  82409.7  0.0   3.6    0.0  120.3   0  99 c5t500000E0119495A0d0
    extended device statistics
    4.0   25.0   28.0  79832.1  0.0   3.9    0.0  133.4   0  97 c5
    4.0   25.0   28.0  79831.5  0.0   3.9    0.0  133.4   0  97 c5t500000E0119495A0d0
    extended device statistics
    6.0   25.0   42.0  81921.3  0.0   4.7    0.0  151.6   0 100 c5
    6.0   25.0   42.0  81921.4  0.0   4.7    0.0  151.6   0 100 c5t500000E0119495A0d0
    extended device statistics
    4.0   21.0   28.0  73555.6  0.0   3.5    0.0  138.7   0  97 c5
    4.0   21.0   28.0  73555.7  0.0   3.5    0.0  138.7   0  97 c5t500000E0119495A0d0

bash-3.00# tunefs -a 2048 /mnt/1

    extended device statistics
    r/s    w/s   kr/s     kw/s wait  actv wsvc_t asvc_t  %w  %b device
    0.0   22.0    0.0  83240.1  0.0   3.5    0.0  157.1   0  97 c5
    0.0   22.0    0.0  83240.5  0.0   3.5    0.0  157.1   0  97 c5t500000E0119495A0d0
    extended device statistics
    0.0   19.0    0.0  81837.1  0.0   3.4    0.0  180.1   0  98 c5
    0.0   19.0    0.0  81837.2  0.0   3.4    0.0  180.1   0  98 c5t500000E0119495A0d0
    extended device statistics
    0.0   21.0    0.0  94004.1  0.0   4.6    0.0  218.1   0 100 c5
    0.0   21.0    0.0  94002.6  0.0   4.6    0.0  218.1   0 100 c5t500000E0119495A0d0
    extended device statistics
    0.0   20.0    0.0  70116.6  0.0   4.3    0.0  216.5   0 100 c5
    0.0   20.0    0.0  70116.7  0.0   4.3    0.0  216.5   0 100 c5t500000E0119495A0d0
    extended device statistics
    0.0   21.0    0.0  82140.7  0.0   3.3    0.0  158.0   0  95 c5
    0.0   21.0    0.0  82140.8  0.0   3.3    0.0  158.0   0  95 c5t500000E0119495A0d0
    extended device statistics
    0.0   72.0    0.0  82279.7  0.0   5.0    0.0   69.9   0  98 c5
    0.0   72.0    0.0  82279.6  0.0   5.0    0.0   69.9   0  98 c5t500000E0119495A0d0

So sometimes it can push even more out of the disk.

So even UFS is much faster than ZFS in this case. And UFS issued I/Os of something like 3.5MB.

bash-3.00# tunefs -a 16 /mnt/1
maximum contiguous block count changes from 2048 to 16

    extended device statistics
    r/s    w/s   kr/s     kw/s wait   actv wsvc_t asvc_t  %w  %b device
    0.0  350.9    0.0  44533.6  0.0  118.1    0.0  336.6   0 100 c5
    0.0  350.9    0.0  44531.0  0.0  118.1    0.0  336.6   0 100 c5t500000E0119495A0d0
    extended device statistics
    0.0  381.0    0.0  48466.4  0.0  112.9    0.0  296.4   0 100 c5
    0.0  381.0    0.0  48468.7  0.0  112.9    0.0  296.4   0 100 c5t500000E0119495A0d0
    extended device statistics
    0.0  369.9    0.0  47057.3  0.0  110.8    0.0  299.6   0 100 c5
    0.0  369.9    0.0  47057.3  0.0  110.8    0.0  299.6   0 100 c5t500000E0119495A0d0
    extended device statistics
    0.0  399.1    0.0  50566.4  0.0  108.8    0.0  272.7   0 100 c5
    0.0  399.1    0.0  50566.5  0.0  108.8    0.0  272.7   0 100 c5t500000E0119495A0d0
    extended device statistics
    0.0  345.0    0.0  44171.3  0.0   87.7    0.0  254.3   0 100 c5
    0.0  345.0    0.0  44171.4  0.0   87.7    0.0  254.3   0 100 c5t500000E0119495A0d0

So now UFS is issuing 128KB I/Os, and with UFS I now get performance similar to ZFS.

So I would say that larger I/Os could greatly help ZFS performance when writing large sequential files (with large writes).
Roch Bourbonnais - Performance Engineering
2006-May-12 12:28 UTC
[zfs-discuss] Re: Re: Due to 128KB limit in ZFS it can't saturate disks
Hi Robert,

Could you try 35 concurrent dd's, each issuing 128K I/Os? That would be closer to how ZFS would behave.

-r

Robert Milkowski writes:

> [full quote of Robert's UFS results trimmed; see above]
Robert Milkowski
2006-May-12 15:07 UTC
[zfs-discuss] Re: Re: Due to 128KB limit in ZFS it can't saturate disks
Hello Roch,

Friday, May 12, 2006, 2:28:59 PM, you wrote:

RBPE> Hi Robert,
RBPE> Could you try 35 concurrent dd each issuing 128K I/O ?
RBPE> That would be closer to how ZFS would behave.

You mean to UFS?

OK, I did try it, and I get about 8-9MB/s with about 1100 IO/s (w/s).

But what does it prove?

-- 
Best regards,
 Robert                    mailto:rmilkowski at task.gda.pl
                           http://milek.blogspot.com
Roch Bourbonnais - Performance Engineering
2006-May-12 15:31 UTC
[zfs-discuss] Re: Re: Due to 128KB limit in ZFS it can't saturate disks
Robert Milkowski writes:
 > Hello Roch,
 >
 > Friday, May 12, 2006, 2:28:59 PM, you wrote:
 >
 > RBPE> Hi Robert,
 >
 > RBPE> Could you try 35 concurrent dd each issuing 128K I/O ?
 > RBPE> That would be closer to how ZFS would behave.
 >
 > You mean to UFS?
 >
 > ok, I did try and I get about 8-9MB/s with about 1100 IO/s (w/s).
 >
 > But what does it prove?

It does not prove my point, at least. Actually I also tried it, and it does not generate the I/O pattern that ZFS uses; I did not analyze this, but UFS gets in the way.

I don't have a raw device to play with at this instant, but what we (I) have to do is find the right script that will cause 35 concurrent 128K I/Os to be dumped onto a spindle repeatedly. They can be as random as you like. This, I guarantee you, will saturate your spindle (or get really close to it). And this is the I/O pattern that ZFS generates during a pool sync operation.

-r
Robert Milkowski
2006-May-14 20:55 UTC
[zfs-discuss] Re: Re: Due to 128KB limit in ZFS it can't saturate disks
Hello Roch,

Friday, May 12, 2006, 5:31:10 PM, you wrote:

RBPE> [full quote of Roch's message trimmed; see above]

OK, the same disk, the same host.

bash-3.00# cat dd32.sh
#!/bin/sh
dd if=/dev/zero of=/dev/rdsk/c5t500000E0119495A0d0s0 bs=128k &
dd if=/dev/zero of=/dev/rdsk/c5t500000E0119495A0d0s0 bs=128k &
dd if=/dev/zero of=/dev/rdsk/c5t500000E0119495A0d0s0 bs=128k &
dd if=/dev/zero of=/dev/rdsk/c5t500000E0119495A0d0s0 bs=128k &
dd if=/dev/zero of=/dev/rdsk/c5t500000E0119495A0d0s0 bs=128k &
dd if=/dev/zero of=/dev/rdsk/c5t500000E0119495A0d0s0 bs=128k &
dd if=/dev/zero of=/dev/rdsk/c5t500000E0119495A0d0s0 bs=128k &
dd if=/dev/zero of=/dev/rdsk/c5t500000E0119495A0d0s0 bs=128k &
dd if=/dev/zero of=/dev/rdsk/c5t500000E0119495A0d0s0 bs=128k &
dd if=/dev/zero of=/dev/rdsk/c5t500000E0119495A0d0s0 bs=128k &
dd if=/dev/zero of=/dev/rdsk/c5t500000E0119495A0d0s0 bs=128k &
dd if=/dev/zero of=/dev/rdsk/c5t500000E0119495A0d0s0 bs=128k &
dd if=/dev/zero of=/dev/rdsk/c5t500000E0119495A0d0s0 bs=128k &
dd if=/dev/zero of=/dev/rdsk/c5t500000E0119495A0d0s0 bs=128k &
dd if=/dev/zero of=/dev/rdsk/c5t500000E0119495A0d0s0 bs=128k &
dd if=/dev/zero of=/dev/rdsk/c5t500000E0119495A0d0s0 bs=128k &
dd if=/dev/zero of=/dev/rdsk/c5t500000E0119495A0d0s0 bs=128k &
dd if=/dev/zero of=/dev/rdsk/c5t500000E0119495A0d0s0 bs=128k &
dd if=/dev/zero of=/dev/rdsk/c5t500000E0119495A0d0s0 bs=128k &
dd if=/dev/zero of=/dev/rdsk/c5t500000E0119495A0d0s0 bs=128k &
dd if=/dev/zero of=/dev/rdsk/c5t500000E0119495A0d0s0 bs=128k &
dd if=/dev/zero of=/dev/rdsk/c5t500000E0119495A0d0s0 bs=128k &
dd if=/dev/zero of=/dev/rdsk/c5t500000E0119495A0d0s0 bs=128k &
dd if=/dev/zero of=/dev/rdsk/c5t500000E0119495A0d0s0 bs=128k &
dd if=/dev/zero of=/dev/rdsk/c5t500000E0119495A0d0s0 bs=128k &
dd if=/dev/zero of=/dev/rdsk/c5t500000E0119495A0d0s0 bs=128k &
dd if=/dev/zero of=/dev/rdsk/c5t500000E0119495A0d0s0 bs=128k &
dd if=/dev/zero of=/dev/rdsk/c5t500000E0119495A0d0s0 bs=128k &
dd if=/dev/zero of=/dev/rdsk/c5t500000E0119495A0d0s0 bs=128k &
dd if=/dev/zero of=/dev/rdsk/c5t500000E0119495A0d0s0 bs=128k &
dd if=/dev/zero of=/dev/rdsk/c5t500000E0119495A0d0s0 bs=128k &
dd if=/dev/zero of=/dev/rdsk/c5t500000E0119495A0d0s0 bs=128k &
dd if=/dev/zero of=/dev/rdsk/c5t500000E0119495A0d0s0 bs=128k &
bash-3.00# ./dd32.sh
bash-3.00# iostat -xnzC 1

    extended device statistics
    r/s    w/s   kr/s     kw/s wait  actv wsvc_t asvc_t  %w  %b device
    0.0  374.0    0.0  47874.6  0.0  33.0    0.0   88.1   0 100 c5
    0.0  374.0    0.0  47875.2  0.0  33.0    0.0   88.1   0 100 c5t500000E0119495A0d0
    extended device statistics
    0.0  367.1    0.0  46985.6  0.0  33.0    0.0   89.8   0 100 c5
    0.0  367.1    0.0  46985.7  0.0  33.0    0.0   89.8   0 100 c5t500000E0119495A0d0
    extended device statistics
    0.0  355.0    0.0  45440.3  0.0  33.0    0.0   92.9   0 100 c5
    0.0  355.0    0.0  45439.9  0.0  33.0    0.0   92.9   0 100 c5t500000E0119495A0d0
    extended device statistics
    0.0  385.9    0.0  49395.4  0.0  33.0    0.0   85.4   0 100 c5
    0.0  385.9    0.0  49395.3  0.0  33.0    0.0   85.4   0 100 c5t500000E0119495A0d0
    extended device statistics
    0.0  380.0    0.0  48635.9  0.0  33.0    0.0   86.7   0 100 c5
    0.0  380.0    0.0  48635.4  0.0  33.0    0.0   86.7   0 100 c5t500000E0119495A0d0
    extended device statistics
    0.0  361.1    0.0  46224.7  0.0  33.0    0.0   91.3   0 100 c5
    0.0  361.1    0.0  46225.3  0.0  33.0    0.0   91.3   0 100 c5t500000E0119495A0d0

These numbers are very similar to those I get with ZFS.

But it's much less than a single dd writing with an 8MB block size to UFS or to the raw device.

It still looks like issuing larger I/Os does in fact offer much better throughput.

-- 
Best regards,
 Robert                    mailto:rmilkowski at task.gda.pl
                           http://milek.blogspot.com
Robert Milkowski
2006-May-14 21:08 UTC
[zfs-discuss] Re: Re: Due to 128KB limit in ZFS it can't saturate disks
Hello Robert,

Sunday, May 14, 2006, 10:55:42 PM, you wrote:

RM> [full quote of the previous message - the 33 concurrent dd's against the raw device - trimmed; see above]
bash-3.00# zpool create one c5t500000E0119495A0d0
bash-3.00# zfs set atime=off one
bash-3.00# cat dd32-1.sh
#!/bin/sh
dd if=/dev/zero of=/one/q1 bs=128k &
dd if=/dev/zero of=/one/q2 bs=128k &
dd if=/dev/zero of=/one/q3 bs=128k &
dd if=/dev/zero of=/one/q4 bs=128k &
dd if=/dev/zero of=/one/q5 bs=128k &
dd if=/dev/zero of=/one/q6 bs=128k &
dd if=/dev/zero of=/one/q7 bs=128k &
dd if=/dev/zero of=/one/q8 bs=128k &
dd if=/dev/zero of=/one/q9 bs=128k &
dd if=/dev/zero of=/one/q10 bs=128k &
dd if=/dev/zero of=/one/q11 bs=128k &
dd if=/dev/zero of=/one/q12 bs=128k &
dd if=/dev/zero of=/one/q13 bs=128k &
dd if=/dev/zero of=/one/q14 bs=128k &
dd if=/dev/zero of=/one/q15 bs=128k &
dd if=/dev/zero of=/one/q16 bs=128k &
dd if=/dev/zero of=/one/q17 bs=128k &
dd if=/dev/zero of=/one/q18 bs=128k &
dd if=/dev/zero of=/one/q19 bs=128k &
dd if=/dev/zero of=/one/q20 bs=128k &
dd if=/dev/zero of=/one/q21 bs=128k &
dd if=/dev/zero of=/one/q22 bs=128k &
dd if=/dev/zero of=/one/q23 bs=128k &
dd if=/dev/zero of=/one/q24 bs=128k &
dd if=/dev/zero of=/one/q25 bs=128k &
dd if=/dev/zero of=/one/q26 bs=128k &
dd if=/dev/zero of=/one/q27 bs=128k &
dd if=/dev/zero of=/one/q28 bs=128k &
dd if=/dev/zero of=/one/q29 bs=128k &
dd if=/dev/zero of=/one/q30 bs=128k &
dd if=/dev/zero of=/one/q31 bs=128k &
dd if=/dev/zero of=/one/q32 bs=128k &
bash-3.00# iostat -xnzC 1

    extended device statistics
    r/s    w/s   kr/s     kw/s wait  actv wsvc_t asvc_t  %w  %b device
    0.0  390.0    0.0  49916.6  0.0  34.9    0.0   89.5   0 100 c5
    0.0  390.0    0.0  49917.7  0.0  34.9    0.0   89.5   0 100 c5t500000E0119495A0d0
    extended device statistics
    0.0  389.9    0.0  49911.5  0.0  34.9    0.0   89.5   0 100 c5
    0.0  389.9    0.0  49911.4  0.0  34.9    0.0   89.5   0 100 c5t500000E0119495A0d0
    extended device statistics
    0.0  383.5    0.0  49089.1  0.0  34.9    0.0   91.0   0 100 c5
    0.0  383.5    0.0  49087.8  0.0  34.9    0.0   91.0   0 100 c5t500000E0119495A0d0
    extended device statistics
    0.0  393.5    0.0  50371.9  0.0  34.9    0.0   88.6   0 100 c5
    0.0  393.5    0.0  50373.3  0.0  34.9    0.0   88.6   0 100 c5t500000E0119495A0d0

bash-3.00# newfs -v /dev/rdsk/c5t500000E0119495A0d0s0
newfs: construct a new file system /dev/rdsk/c5t500000E0119495A0d0s0: (y/n)? y
mkfs -F ufs /dev/rdsk/c5t500000E0119495A0d0s0 143358287 128 48 8192 1024 16 1 1 8192 t 0 -1 1 1024 n
Warning: 5810 sector(s) in last cylinder unallocated
/dev/rdsk/c5t500000E0119495A0d0s0: 143358286 sectors in 23334 cylinders of 48 tracks, 128 sectors
        69999.2MB in 1459 cyl groups (16 c/g, 48.00MB/g, 5824 i/g)
super-block backups (for fsck -F ufs -o b=#) at:
 32, 98464, 196896, 295328, 393760, 492192, 590624, 689056, 787488, 885920,
Initializing cylinder groups:
............................
super-block backups for last 10 cylinder groups at:
 142447776, 142546208, 142644640, 142743072, 142841504, 142939936, 143038368,
 143136800, 143235232, 143333664
bash-3.00# mount -o noatime /dev/dsk/c5t500000E0119495A0d0s0 /one
bash-3.00#
bash-3.00# ./dd32-1.sh
bash-3.00# iostat -xnzC 1

    extended device statistics
    r/s    w/s   kr/s    kw/s   wait   actv wsvc_t asvc_t  %w  %b device
    1.0  833.7    7.0  6885.6  137.5  256.0  164.7  306.7   0 100 c5
    1.0  833.7    7.0  6885.5  137.5  256.0  164.7  306.7 100 100 c5t500000E0119495A0d0
    extended device statistics
    0.0  829.9    0.0  6855.4  130.6  256.0  157.3  308.5   0 100 c5
    0.0  829.9    0.0  6855.4  130.6  256.0  157.3  308.5 100 100 c5t500000E0119495A0d0
    extended device statistics
    0.0  799.1    0.0  6488.8  113.6  256.0  142.2  320.4   0 100 c5
    0.0  799.1    0.0  6488.8  113.6  256.0  142.2  320.4 100 100 c5t500000E0119495A0d0
    extended device statistics
    1.0  813.0    7.0  6527.8  110.7  217.3  136.0  267.0   0 100 c5
    1.0  813.0    7.0  6527.8  110.7  217.3  136.0  267.0  68 100 c5t500000E0119495A0d0

So with many concurrent sequential write streams ZFS behaves much better. But still, with one stream ZFS is much worse.

-- 
Best regards,
 Robert                    mailto:rmilkowski at task.gda.pl
                           http://milek.blogspot.com
Roch Bourbonnais - Performance Engineering
2006-May-15 13:23 UTC
[zfs-discuss] Re: Re: Due to 128KB limit in ZFS it can't saturate disks
The question put forth is whether the ZFS 128K blocksize is sufficient to saturate a regular disk. There is a great body of evidence showing that bigger write sizes and a matching large FS cluster size lead to more throughput. The counterpoint is that ZFS schedules its I/O like nothing else seen before and manages to saturate a single disk using enough concurrent 128K I/Os.

<There are a few things I did here for the first time, so I may have erred in places. So I am proposing this for review by the community.>

I first measured the throughput of a write(2) to the raw device using, for instance, this:

        dd if=/dev/zero of=/dev/rdsk/c1t1d0s0 bs=8192k count=1024

On Solaris we would see some overhead of reading the block from /dev/zero and then issuing the write call. The tightest function that fences the I/O is default_physio(). That function will issue the I/O to the device and then wait for it to complete. If we take the elapsed time spent in this function and count the bytes that are I/O-ed, this should give a good hint as to the throughput the device is providing. The above dd command will issue a single I/O at a time (the d-script to measure this is attached).

Trying different block sizes I see:

        Bytes sent   Elapse of phys I/O
          8 MB;  3576 ms of phys; avg sz :   16 KB; throughput  2 MB/s
          9 MB;  1861 ms of phys; avg sz :   32 KB; throughput  4 MB/s
         31 MB;  3450 ms of phys; avg sz :   64 KB; throughput  8 MB/s
         78 MB;  4932 ms of phys; avg sz :  128 KB; throughput 15 MB/s
        124 MB;  4903 ms of phys; avg sz :  256 KB; throughput 25 MB/s
        178 MB;  4868 ms of phys; avg sz :  512 KB; throughput 36 MB/s
        226 MB;  4824 ms of phys; avg sz : 1024 KB; throughput 46 MB/s
        226 MB;  4816 ms of phys; avg sz : 2048 KB; throughput 46 MB/s
         32 MB;   686 ms of phys; avg sz : 4096 KB; throughput 46 MB/s
        224 MB;  4741 ms of phys; avg sz : 8192 KB; throughput 47 MB/s

Now let's see what ZFS gets. I measure using a single dd process. ZFS will chunk up the data in 128K blocks. Now, the dd command interacts with memory, but the I/Os are scheduled under the control of spa_sync(). So in the d-script (attached) I check for the start of an spa_sync and time it based on elapsed time. At the same time I gather the number of bytes and keep a count of the I/Os (bdev_strategy) that are being issued. When the spa_sync completes, we are sure that all of those are on stable storage. The script is a bit more complex because there are two threads that issue spa_sync, but only one of them actually becomes activated, so the script will print out some spurious lines of output at times.

I measure I/O with the script while this runs:

        dd if=/dev/zero of=/zfs2/roch/f1 bs=1024k count=8000

And I see:

        1431 MB; 23723 ms of spa_sync; avg sz : 127 KB; throughput 60 MB/s
        1387 MB; 23044 ms of spa_sync; avg sz : 127 KB; throughput 60 MB/s
        2680 MB; 44209 ms of spa_sync; avg sz : 127 KB; throughput 60 MB/s
        1359 MB; 24223 ms of spa_sync; avg sz : 127 KB; throughput 56 MB/s
        1143 MB; 19183 ms of spa_sync; avg sz : 126 KB; throughput 59 MB/s

OK, I cheated. Here, ZFS is given a full disk to play with; in this case ZFS enables the write cache. Note that even with the write cache enabled, when the spa_sync() completes it is after a flush of the cache has been executed, so the 60MB/sec does correspond to data sent to the platter. I just tried disabling the cache (with format -e), but I am not sure if that is taken into account by ZFS; results are the same, 60MB/sec. This will have to be confirmed.

With the write cache enabled, the physio test reaches 66 MB/s as soon as we are issuing 16KB I/Os. Here, clearly, data is not on the platter when the timed function completes. Another variable not fully controlled is the physical (cylinder) location of the I/Os. It could be that some of the differences come from that.

What do I take away?

A single 2MB physical I/O will get 46 MB/sec out of my disk. 35 concurrent 128K I/Os sustained, followed by metadata I/O, followed by a flush of the write cache, allow ZFS to get 60 MB/sec out of the same disk.

This is what underwrites my belief that the 128K blocksize is sufficiently large. Now, nothing here proves that 256K would not give more throughput, so nothing is really settled. But I hope this helps put us on common ground.

-r

-------------- next part --------------
A non-text attachment was scrubbed...
Name: phys.d
Type: application/octet-stream
Size: 586 bytes
Desc: not available
URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20060515/0c9f0c86/attachment.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: spa_sync.d
Type: application/octet-stream
Size: 986 bytes
Desc: not available
URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20060515/0c9f0c86/attachment-0001.obj>
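The phys.d and spa_sync.d scripts themselves were lost when the archive scrubbed the attachments. As a rough sketch of the kind of measurement described above (a reconstruction, not Roch's actual scripts: the probe points default_physio(), bdev_strategy() and spa_sync() are the functions named in the text, everything else here is an assumption):

#!/usr/sbin/dtrace -s

#pragma D option quiet

/* Time spent inside default_physio(), which fences each raw-device write(2). */
fbt::default_physio:entry  { self->phys = timestamp; }
fbt::default_physio:return
/self->phys/
{
        @physns = sum(timestamp - self->phys);
        self->phys = 0;
}

/* Bytes and I/O count handed to the block layer. */
fbt::bdev_strategy:entry
{
        @bytes = sum(args[0]->b_bcount);
        @ios   = count();
}

/* spa_sync() brackets the ZFS pool-sync phase discussed above. */
fbt::spa_sync:entry  { self->spa = timestamp; }
fbt::spa_sync:return
/self->spa/
{
        @syncns = sum(timestamp - self->spa);
        self->spa = 0;
}

/* Report and reset every 10 seconds. */
tick-10s
{
        normalize(@physns, 1000000);    /* ns -> ms */
        normalize(@syncns, 1000000);    /* ns -> ms */
        normalize(@bytes, 1048576);     /* bytes -> MB */
        printa("MB: %@d  physio ms: %@d  spa_sync ms: %@d  I/Os: %@d\n",
            @bytes, @physns, @syncns, @ios);
        clear(@bytes); clear(@physns); clear(@syncns); clear(@ios);
}

Dividing the MB column by either elapsed-time column gives throughput figures of the kind quoted above; dividing the bytes by the I/O count gives the average I/O size.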
Anton B. Rang
2006-May-16 14:17 UTC
[zfs-discuss] Re: Re[5]: Re: Re: Due to 128KB limit in ZFS it can't saturate disks
One issue is what we mean by "saturation." It's easy to bring a disk to 100% busy. We need to keep this discussion in the context of a workload. Generally when people care about streaming throughput of a disk, it's because they are reading or writing a single large file, and they want to reach as closely as possible the full media rate.

Once the write cache on a device is enabled, it's quite easy to maximize write performance. All you need is to move data quickly enough into the device's buffer so that it's never found to be empty while the head is writing; and, of course, avoid ever moving the disk head away. Since the media rate is typically fairly low (e.g. 20-80 MB/sec), this isn't that hard on either FibreChannel or SCSI, and shouldn't be too difficult for ATA either. Very small requests are hurt by protocol and stack overhead, but moderately large requests (1-2 MB) can usually reach the full rate, at least for a single disk. (Disk arrays often have faster back ends than the interconnect, so they are always limited by protocol and stack overhead, even for large transfers.)

With a disabled write cache, there will always be some protocol and stack overhead; and with less sophisticated disks, you'll miss on average half a revolution of the disk each time you write (as you wait for the first sector to go beneath the head). More sophisticated disks will reorder data during the write, and the most sophisticated (with FC/SCSI interfaces) can actually get the data from the host out of order to match the sectors passing underneath the head. In this case the only way to come close to disk rates with smaller writes is to issue overlapping commands, with tags allowing the device to reorder them, and hope that the device has enough buffering to reorder all writes into sequence.

Disk seeks remain the worst enemy of streaming performance, however. There's no way to avoid that. ZFS should be able to achieve high write performance as long as it can allocate blocks (including metadata) in a forward direction and minimize the number of times the uberblock must be written. Reads will be more challenging unless the data was written contiguously. The biggest issue with the 128K block size for ZFS, I suspect, will be the seek between each read. Even a fast (1 ms) seek represents 60KB of lost data transfer on a disk which can transfer data at 60 MBps.
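To put a number on that last point (an illustrative calculation using the 60 MB/s and 1 ms figures above, not part of the original post): transferring 128 KB at media rate takes about 2.1 ms, so one fast seek per block already limits a read stream to roughly two-thirds of the media rate:

\[
t_{\mathrm{xfer}} = \frac{128\ \mathrm{KB}}{60\ \mathrm{MB/s}} \approx 2.1\ \mathrm{ms},
\qquad
\frac{t_{\mathrm{xfer}}}{t_{\mathrm{xfer}} + 1\ \mathrm{ms}} \approx 0.68
\]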
Roch Bourbonnais - Performance Engineering
2006-May-16 14:33 UTC
[zfs-discuss] Re: Re[5]: Re: Re: Due to 128KB limit in ZFS it can't saturate disks
Anton B. Rang writes:

> One issue is what we mean by "saturation." It's easy to bring a disk to 100% busy. We need to keep this discussion in the context of a workload. Generally when people care about streaming throughput of a disk, it's because they are reading or writing a single large file, and they want to reach as closely as possible the full media rate.
>
> Once the write cache on a device is enabled, it's quite easy to maximize write performance. All you need is to move data quickly enough into the device's buffer so that it's never found to be empty while the head is writing; and, of course, avoid ever moving the disk head away. Since the media rate is typically fairly low (e.g. 20-80 MB/sec), this isn't that hard on either FibreChannel or SCSI, and shouldn't be too difficult for ATA either. Very small requests are hurt by protocol and stack overhead, but moderately large requests (1-2 MB) can usually reach the full rate, at least for a single disk. (Disk arrays often have faster back ends than the interconnect, so are always limited by protocol and stack overhead, even for large transfers.)
>
> With a disabled write cache, there will always be some protocol and stack overhead; and with less sophisticated disks, you'll miss on average half a revolution of the disk each time you write (as you wait for the first sector to go beneath the head). More sophisticated disks will reorder data during the write, and the most sophisticated (with FC/SCSI interfaces) can actually get the data from the host out-of-order to match the sectors passing underneath the head. In this case the only way to come close to disk rates with smaller writes is to issue overlapping commands, with tags allowing the device to reorder them, and hope that the device has enough buffering to reorder all writes into sequence.
>
> Disk seeks remain the worst enemy of streaming performance, however. There's no way to avoid that. ZFS should be able to achieve high write performance as long as it can allocate blocks (including metadata) in a forward direction and minimize the number of times the überblock must be written. Reads will be more challenging unless the data was written contiguously. The biggest issue with 128K block size for ZFS, I suspect, will be the seek between each read. Even a fast (1 ms) seek represents 60KB of lost data transfer on a disk which can transfer data at 60 MBps.
>
> This message posted from opensolaris.org
> _______________________________________________
> zfs-discuss mailing list
> zfs-discuss at opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

Ok, so let's consider your 2MB read. You have the option of setting it in one contiguous place on the disk or splitting it into 16 x 128K chunks, somewhat spread all over.

Now you issue a read to that 2MB of data.

As you noted, you either have to wait for the head to find the 2MB block and stream it, or you dump 16 I/O descriptors into an intelligent controller; wherever the head is, there is data to be gotten from the get go. I can't swear it wins the game, but it should be real close.

I just did an experiment and could see > 60MB of data out of a 35G disk using 128K chunks (> 450 IOPS).

Disruptive.

-r
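For anyone who wants to repeat a comparison along those lines, here is a rough userland Python sketch (the device path is a placeholder, and this only approximates what the controller sees, since the OS and driver also queue and reorder I/O):

# Compare one contiguous 2 MB read against 16 concurrent 128 KB reads at
# scattered offsets on a raw device.  Read-only, but run as root.
import os, time, random
from concurrent.futures import ThreadPoolExecutor

DEV = "/dev/rdsk/c1t1d0s0"      # placeholder raw device path
BLK = 128 * 1024

fd = os.open(DEV, os.O_RDONLY)

t0 = time.time()
os.pread(fd, 16 * BLK, 0)       # one contiguous 2 MB read
t_big = time.time() - t0

# Sixteen 128 KB reads at scattered, block-aligned offsets within ~16 GB,
# issued concurrently so the drive can service them in whatever order it likes.
offsets = [random.randrange(0, 1 << 34, BLK) for _ in range(16)]
t0 = time.time()
with ThreadPoolExecutor(max_workers=16) as pool:
    list(pool.map(lambda off: os.pread(fd, BLK, off), offsets))
t_small = time.time() - t0

print("contiguous 2MB read:        %.1f ms" % (t_big * 1e3))
print("16 x 128K concurrent reads: %.1f ms" % (t_small * 1e3))
os.close(fd)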
Anton Rang
2006-May-16 15:50 UTC
[zfs-discuss] Re: Re[5]: Re: Re: Due to 128KB limit in ZFS it can't saturate disks
> Ok, so let's consider your 2MB read. You have the option of setting it in one contiguous place on the disk or splitting it into 16 x 128K chunks, somewhat spread all over.
>
> Now you issue a read to that 2MB of data.
>
> As you noted, you either have to wait for the head to find the 2MB block and stream it, or you dump 16 I/O descriptors into an intelligent controller; wherever the head is, there is data to be gotten from the get go. I can't swear it wins the game, but it should be real close.

Well, the full specs aren't available, but a little math and studying some models can get us close. :-)

Let's presume we're using an enterprise-class disk, say a 37 GB Seagate Cheetah. This is best-case for seeks as it uses so little of the platter and runs at 15K RPM.

Large-block case: On average, to reach the 2 MB, we'll take 3.5 ms. Transfer can then proceed at media rate (average 110 MB/sec) and be sent to the host over a 200 MB/sec channel. 3.5 ms seek, 18.1 ms data transfer, total time 21.6 ms, for a rate of 92.6 MB/sec.

Small-block case: Each seek will be shorter than the average since we are ordering them optimally. A single-track seek is 0.2 ms; average is 3.5 ms; if we assume linear scaling (which isn't quite right) then we're looking at 1/8 of 3.7 ms = 0.46 ms. We do 16 seeks, for 7.36 ms, and our data transfer time is the same (18.1 ms), for a total of 25.46 ms, a rate of 78.5 MB/sec. Not too bad. It's pretty clear why these drives are pricey. :-)

Mmmm, actually it's not that good. There are 50K tracks on this 35 GB disk, so each track holds 700 KB. We're only storing 128KB on each track, so on average we'll need to wait nearly 1/2 of a revolution before we see any of our data under the head. At 15K RPM, that's not so bad, only 2ms, but we've got 16 times to wait, adding 32 ms, dropping our rate to roughly half what we'd get otherwise. (Older disks should, surprisingly, do better since they have less data packed onto each track!)

Looking at a 250 GB "near-line" SATA disk, and presuming its controller does the same optimizations, things are different. Average seek time is 8ms, with single-track seek time of 0.8ms, so 15 additional seeks will cost roughly 30 ms. A half-rotation wait is 4ms (60ms in total). Things are going pretty slow now.

> I just did an experiment and could see > 60MB of data out of a 35G disk using 128K chunks (> 450 IOPS).

On the only disk I have handy, I get 36 MB/sec with concurrent 128 KB chunks, 38 MB/sec with non-concurrent 2 MB chunks, 39 MB/sec with 2 MB chunks. But I'm issuing all of these I/O operations sequentially -- no seeks.

> Disruptive.

What is? Multiple I/Os outstanding to a device isn't precisely new. ;-)

Honestly, adding seeks is -never- going to improve performance. Giving the drive the opportunity to reorder I/O operations will, but splitting a single operation up can never speed it up, though if you get lucky it won't slow down.

Anton
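The arithmetic above is easy to lose track of, so here is the same model written out as a small Python sketch; the figures are the assumed values from this thread (not datasheet numbers), and the model only covers the 15K RPM case:

# Rough model: one contiguous 2 MB I/O vs 16 scattered 128 KB I/Os.
MB = 1e6
data     = 2.0 * MB              # total data to read
media    = 110 * MB              # average media rate, bytes/s (assumed)
seek_avg = 3.5e-3                # average seek for the single large I/O
seek_hop = 0.46e-3               # short seek between nearby 128K chunks
half_rev = 0.5 * 60.0 / 15000    # half a revolution at 15K RPM = 2 ms
chunks   = 16

xfer = data / media              # media transfer time, the same in every case

cases = {
    "one contiguous 2MB I/O":            seek_avg + xfer,
    "16 x 128K, short seeks only":       chunks * seek_hop + xfer,
    "16 x 128K, seeks + half-rev waits": chunks * (seek_hop + half_rev) + xfer,
}
for name, t in cases.items():
    print("%-36s %5.1f MB/s" % (name, data / t / MB))

This reproduces the ~92 MB/s, ~78 MB/s and "roughly half" figures above; small differences come from rounding the 18.1 ms transfer time.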
Robert Milkowski
2006-May-19 00:26 UTC
[zfs-discuss] Re: Re: Due to 128KB limit in ZFS it can't saturate disks
Hello Roch,

Monday, May 15, 2006, 3:23:14 PM, you wrote:

RBPE> The question put forth is whether the ZFS 128K blocksize is sufficient to saturate a regular disk. There is a great body of evidence that shows that bigger write sizes and a matching large FS clustersize lead to more throughput. The counter point is that ZFS schedules its I/O like nothing else seen before and manages to saturate a single disk using enough concurrent 128K I/O.

Nevertheless I get much more throughput using UFS and writing with large blocks than using ZFS on the same disk. And the difference is actually quite big in favor of UFS.

RBPE> <There are a few things I did here for the first time; so I may have erred in places. So I am proposing this for review by the community>

RBPE> I first measured the throughput of a write(2) to a raw device using, for instance, this:

RBPE>   dd if=/dev/zero of=/dev/rdsk/c1t1d0s0 bs=8192k count=1024

RBPE> On Solaris we would see some overhead of reading the block from /dev/zero and then issuing the write call. The tightest function that fences the I/O is default_physio(). That function will issue the I/O to the device then wait for it to complete. If we take the elapsed time spent in this function and count the bytes that are I/O-ed, this should give a good hint as to the throughput the device is providing. The above dd command will issue a single I/O at a time (d-script to measure is attached).

RBPE> Trying different blocksizes I see:

RBPE>   Bytes Sent    Elapse of phys      IO Size
RBPE>
RBPE>     8 MB;  3576 ms of phys; avg sz :    16 KB; throughput  2 MB/s
RBPE>     9 MB;  1861 ms of phys; avg sz :    32 KB; throughput  4 MB/s
RBPE>    31 MB;  3450 ms of phys; avg sz :    64 KB; throughput  8 MB/s
RBPE>    78 MB;  4932 ms of phys; avg sz :   128 KB; throughput 15 MB/s
RBPE>   124 MB;  4903 ms of phys; avg sz :   256 KB; throughput 25 MB/s
RBPE>   178 MB;  4868 ms of phys; avg sz :   512 KB; throughput 36 MB/s
RBPE>   226 MB;  4824 ms of phys; avg sz :  1024 KB; throughput 46 MB/s
RBPE>   226 MB;  4816 ms of phys; avg sz :  2048 KB; throughput 46 MB/s
RBPE>    32 MB;   686 ms of phys; avg sz :  4096 KB; throughput 46 MB/s
RBPE>   224 MB;  4741 ms of phys; avg sz :  8192 KB; throughput 47 MB/s

Just to be sure - you did reconfigure the system to actually allow larger IO sizes?

RBPE> Now let's see what ZFS gets. I measure using a single dd process. ZFS will chunk up data in 128K blocks. Now the dd command interacts with memory. But the I/Os are scheduled under the control of spa_sync(). So in the d-script (attached) I check for the start of an spa_sync and time that based on elapsed time. At the same time I gather the number of bytes and keep a count of I/Os (bdev_strategy) that are being issued. When the spa_sync completes we are sure that all those are on stable storage. The script is a bit more complex because there are 2 threads that issue spa_sync, but only one of them actually becomes activated. So the script will print out some spurious lines of output at times.
RBPE> I measure I/O with the script while this runs:

RBPE>   dd if=/dev/zero of=/zfs2/roch/f1 bs=1024k count=8000

RBPE> And I see:

RBPE>   1431 MB; 23723 ms of spa_sync; avg sz : 127 KB; throughput 60 MB/s
RBPE>   1387 MB; 23044 ms of spa_sync; avg sz : 127 KB; throughput 60 MB/s
RBPE>   2680 MB; 44209 ms of spa_sync; avg sz : 127 KB; throughput 60 MB/s
RBPE>   1359 MB; 24223 ms of spa_sync; avg sz : 127 KB; throughput 56 MB/s
RBPE>   1143 MB; 19183 ms of spa_sync; avg sz : 126 KB; throughput 59 MB/s

RBPE> OK, I cheated. Here, ZFS is given a full disk to play with. In this case ZFS enables the write cache. Note that even with the write cache enabled, when the spa_sync() completes, it will be after a flush of the cache has been executed. So the 60 MB/sec do correspond to data sent to the platter. I just tried disabling the cache (with format -e) but I am not sure if that is taken into account by ZFS; results are the same 60 MB/sec. This will have to be confirmed.

RBPE> With the write cache enabled, the physio test reaches 66 MB/s as soon as we are issuing 16KB I/Os. Here clearly though, data is not on the platter when the timed function completes.

RBPE> Another variable not fully controlled is the physical (cylinder) location of the I/O. It could be that some of the differences come from that.

RBPE> What do I take away?

RBPE>   a single 2MB physical I/O will get 46 MB/sec out of my disk.

RBPE>   35 concurrent 128K I/Os sustained, followed by metadata I/O, followed by
RBPE>   a flush of the write cache, allow ZFS to get 60 MB/sec out of the same disk.

RBPE> This is what underwrites my belief that 128K blocksize is sufficiently large. Now, nothing here proves that 256K would not give more throughput; so nothing is really settled. But I hope this helps put us on common ground.

This is really interesting, because what I see here with a very similar test is the opposite.

What kind of disk do you use? (Mine is a 15K 73GB FC disk, connected with dual paths to the host with MPxIO.)

I use iostat to see actual throughput - you use DTrace - maybe we measure different things?

--
Best regards,
 Robert                          mailto:rmilkowski at task.gda.pl
                                 http://milek.blogspot.com
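One way to take both measurement methods out of the equation is to time the writes from userland. A minimal sketch follows, assuming Python is available on the box; the device path is a placeholder and the test is destructive (it writes to the raw device):

# Time synchronous writes of various sizes straight to a raw device and
# report MB/s.  This includes syscall overhead that the default_physio()
# timing excludes, but for large blocks the figures should converge.
# WARNING: destroys data on DEV.
import os, time

DEV = "/dev/rdsk/c1t1d0s0"          # placeholder raw device
TOTAL = 256 * 1024 * 1024           # bytes written per block size

fd = os.open(DEV, os.O_WRONLY)
for kb in (16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192):
    buf = b"\0" * (kb * 1024)
    written = 0
    t0 = time.time()
    while written < TOTAL:
        os.pwrite(fd, buf, written)         # sequential, block-aligned offsets
        written += len(buf)
    dt = time.time() - t0
    print("%5d KB writes: %5.1f MB/s" % (kb, written / dt / 1e6))
os.close(fd)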
Roch Bourbonnais - Performance Engineering
2006-May-19 13:53 UTC
[zfs-discuss] Re: Re: Due to 128KB limit in ZFS it can't saturate disks
Robert Milkowski writes:

> Hello Roch,
>
> Monday, May 15, 2006, 3:23:14 PM, you wrote:
>
> RBPE> The question put forth is whether the ZFS 128K blocksize is sufficient to saturate a regular disk. There is a great body of evidence that shows that bigger write sizes and a matching large FS clustersize lead to more throughput. The counter point is that ZFS schedules its I/O like nothing else seen before and manages to saturate a single disk using enough concurrent 128K I/O.
>
> Nevertheless I get much more throughput using UFS and writing with large blocks than using ZFS on the same disk. And the difference is actually quite big in favor of UFS.
>

Absolutely. Isn't this the following issue, though?

    6415647 Sequential writing is jumping

We will have to fix this to allow dd to get more throughput. I'm pretty sure the fix won't need to increase the blocksize though.

I'll be picking up this thread again I hope next week. I have lots of homework to do to respond properly.

-r
Roch Bourbonnais - Performance Engineering
2006-May-22 13:42 UTC
[zfs-discuss] Re: Re: Due to 128KB limit in ZFS it can't saturate disks
Robert says:

  Just to be sure - you did reconfigure the system to actually allow larger IO sizes?

Sure enough, I messed up (I had no tuning to get the above data); so 1 MB was my max transfer size. Using 8MB I now see:

  Bytes Sent    Elapse of phys      IO Size

    8 MB;  3576 ms of phys; avg sz :    16 KB; throughput  2 MB/s
    9 MB;  1861 ms of phys; avg sz :    32 KB; throughput  4 MB/s
   31 MB;  3450 ms of phys; avg sz :    64 KB; throughput  8 MB/s
   78 MB;  4932 ms of phys; avg sz :   128 KB; throughput 15 MB/s
  124 MB;  4903 ms of phys; avg sz :   256 KB; throughput 25 MB/s
  178 MB;  4868 ms of phys; avg sz :   512 KB; throughput 36 MB/s
  226 MB;  4824 ms of phys; avg sz :  1024 KB; throughput 46 MB/s
  226 MB;  4816 ms of phys; avg sz :  2048 KB; throughput 54 MB/s (was 46 MB/s)
   32 MB;   686 ms of phys; avg sz :  4096 KB; throughput 58 MB/s (was 46 MB/s)
  224 MB;  4741 ms of phys; avg sz :  8192 KB; throughput 59 MB/s (was 47 MB/s)
  272 MB;  4336 ms of phys; avg sz : 16384 KB; throughput 58 MB/s (new data)
  288 MB;  4327 ms of phys; avg sz : 32768 KB; throughput 59 MB/s (new data)

Data was corrected after it was pointed out that physio will be throttled by maxphys. New data was obtained after setting:

  /etc/system:           set maxphys=8338608
  /kernel/drv/sd.conf:   sd_max_xfer_size=0x800000
  /kernel/drv/ssd.conf:  ssd_max_xfer_size=0x800000

and setting un_max_xfer_size in "struct sd_lun". That address was figured out using dtrace and knowing that sdmin() calls ddi_get_soft_state (details available upon request). And of course disabling the write cache (using format -e).

With this in place I verified that each sdwrite() up to 8M would lead to a single biodone interrupt, using this:

  dtrace -n 'biodone:entry,sdwrite:entry{@a[probefunc, stack(20)]=count()}'

Note that for 16M and 32M raw device writes, each default_physio will issue a series of 8M I/Os. And so we don't expect any more throughput from that.

The script used to measure the rates (phys.d) was also modified, since I was counting the bytes before the I/O had completed and that made a big difference for the very large I/O sizes.

If you take the 8M case, the above rates correspond to the time it takes to issue and wait for a single 8M I/O to the sd driver. So this time certainly does include 1 seek and ~0.13 seconds of data transfer, then the time to respond to the interrupt, and finally the wakeup of the thread waiting in default_physio(). Given that the data transfer rate using 4 MB is very close to the one using 8 MB, I'd say that at 60 MB/sec all the fixed-cost elements are well amortized. So I would conclude from this that the limiting factor is now at the device itself or on the data channel between the disk and the host.

Now recall the throughput that ZFS gets during an spa_sync when submitted to a single dd, knowing that ZFS will work with 128K I/O:

  1431 MB; 23723 ms of spa_sync; avg sz : 127 KB; throughput 60 MB/s
  1387 MB; 23044 ms of spa_sync; avg sz : 127 KB; throughput 60 MB/s
  2680 MB; 44209 ms of spa_sync; avg sz : 127 KB; throughput 60 MB/s
  1359 MB; 24223 ms of spa_sync; avg sz : 127 KB; throughput 56 MB/s
  1143 MB; 19183 ms of spa_sync; avg sz : 126 KB; throughput 59 MB/s

My disk is <HITACHI-DK32EJ36NSUN36G-PQ08-33.92GB>.

As you say, we don't measure things the same way. At the dd-to-raw level I think our data, with my mistake corrected, will now be similar. At the ZFS level, we cannot use iostat quite _yet_ because of

  6415647 Sequential writing is jumping

With iostat, the 1-second average will see, at times, some periods in which we won't issue any I/O.
So it's not a good measure of the capacity of a disk. This is why I reverted to my script, which times the I/O rate but only "when it counts". When we fix 6415647, the expectation is that we will sustain that throughput for whatever time is necessary. At that point, I expect the throughput as seen from iostat and the throughput from a ptime of dd itself will all converge.

And so, after a moment of doubt, I am still inclined to believe that 128K I/Os, when issued properly, can lead to, if not saturation, a very good throughput from a basic disk.

Now, Anton's demonstration is convincing in its own way. I can concur that any seek time is unproductive and will degrade throughput at the device level. But if the weak link is the data transfer rate between the device and the host, then it can be argued that the seek time can actually be hidden behind some data transfer time? At 60MB/sec, a 128K data transfer takes 2ms, which maybe is sufficient to get the head to the next block? My disk does reach > 450 IOPS when controlled by ZFS, so it all adds up.

Bear in mind also that throughput is not the only consideration when setting the ZFS recordsize. The smaller the record size, the more manageable the disk blocks will be. So everything is a tradeoff, and at this point 128K appears sufficiently large ... at least for a while.

-r

____________________________________________________________________________________
Roch Bourbonnais                    Sun Microsystems, Icnc-Grenoble
Senior Performance Analyst          180, Avenue De L'Europe, 38330,
                                    Montbonnot Saint Martin, France
Performance & Availability Engineering
http://icncweb.france/~rbourbon     http://blogs.sun.com/roller/page/roch
Roch.Bourbonnais at Sun.Com          (+33).4.76.18.83.20

New scripts to measure dd to raw throughput:

-------------- next part --------------
A non-text attachment was scrubbed...
Name: phys.d
Type: application/octet-stream
Size: 606 bytes
Desc: not available
URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20060522/26b1ec85/attachment.obj>
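As a quick sanity check of the two figures in the message above (pure arithmetic, nothing measured; a Python sketch only):

# 450 IOPS of 128 KB each, and the transfer time of one 128 KB block at 60 MB/s.
iops  = 450
block = 128 * 1024        # bytes
rate  = 60e6              # bytes per second

print("450 x 128K      = %.1f MB/s" % (iops * block / 1e6))     # ~59 MB/s
print("128K at 60 MB/s = %.2f ms"   % (block / rate * 1e3))     # ~2.2 ms

Both numbers land where the message puts them: just under 60 MB/s, and a shade over 2 ms per block.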
Robert Milkowski
2006-May-23 21:19 UTC
[zfs-discuss] Re: Re: Due to 128KB limit in ZFS it can't saturate disks
Hello Roch,

Monday, May 22, 2006, 3:42:41 PM, you wrote:

RBPE> Robert says:

RBPE>   Just to be sure - you did reconfigure the system to actually allow larger IO sizes?

RBPE> Sure enough, I messed up (I had no tuning to get the above data); so 1 MB was my max transfer size. Using 8MB I now see:

RBPE>   Bytes Sent    Elapse of phys      IO Size

RBPE>     8 MB;  3576 ms of phys; avg sz :    16 KB; throughput  2 MB/s
RBPE>     9 MB;  1861 ms of phys; avg sz :    32 KB; throughput  4 MB/s
RBPE>    31 MB;  3450 ms of phys; avg sz :    64 KB; throughput  8 MB/s
RBPE>    78 MB;  4932 ms of phys; avg sz :   128 KB; throughput 15 MB/s
RBPE>   124 MB;  4903 ms of phys; avg sz :   256 KB; throughput 25 MB/s
RBPE>   178 MB;  4868 ms of phys; avg sz :   512 KB; throughput 36 MB/s
RBPE>   226 MB;  4824 ms of phys; avg sz :  1024 KB; throughput 46 MB/s
RBPE>   226 MB;  4816 ms of phys; avg sz :  2048 KB; throughput 54 MB/s (was 46 MB/s)
RBPE>    32 MB;   686 ms of phys; avg sz :  4096 KB; throughput 58 MB/s (was 46 MB/s)
RBPE>   224 MB;  4741 ms of phys; avg sz :  8192 KB; throughput 59 MB/s (was 47 MB/s)
RBPE>   272 MB;  4336 ms of phys; avg sz : 16384 KB; throughput 58 MB/s (new data)
RBPE>   288 MB;  4327 ms of phys; avg sz : 32768 KB; throughput 59 MB/s (new data)

RBPE> Data was corrected after it was pointed out that physio will be throttled by maxphys. New data was obtained after setting:

RBPE>   /etc/system:           set maxphys=8338608
RBPE>   /kernel/drv/sd.conf:   sd_max_xfer_size=0x800000
RBPE>   /kernel/drv/ssd.conf:  ssd_max_xfer_size=0x800000

RBPE> and setting un_max_xfer_size in "struct sd_lun". That address was figured out using dtrace and knowing that sdmin() calls ddi_get_soft_state (details available upon request). And of course disabling the write cache (using format -e).

RBPE> With this in place I verified that each sdwrite() up to 8M would lead to a single biodone interrupt, using this:

RBPE>   dtrace -n 'biodone:entry,sdwrite:entry{@a[probefunc, stack(20)]=count()}'

RBPE> Note that for 16M and 32M raw device writes, each default_physio will issue a series of 8M I/Os. And so we don't expect any more throughput from that.

RBPE> The script used to measure the rates (phys.d) was also modified, since I was counting the bytes before the I/O had completed and that made a big difference for the very large I/O sizes.

RBPE> If you take the 8M case, the above rates correspond to the time it takes to issue and wait for a single 8M I/O to the sd driver. So this time certainly does include 1 seek and ~0.13 seconds of data transfer, then the time to respond to the interrupt, and finally the wakeup of the thread waiting in default_physio(). Given that the data transfer rate using 4 MB is very close to the one using 8 MB, I'd say that at 60 MB/sec all the fixed-cost elements are well amortized. So I would conclude from this that the limiting factor is now at the device itself or on the data channel between the disk and the host.
RBPE> Now recall the throughput that ZFS gets during an spa_sync when submitted to a single dd, knowing that ZFS will work with 128K I/O:

RBPE>   1431 MB; 23723 ms of spa_sync; avg sz : 127 KB; throughput 60 MB/s
RBPE>   1387 MB; 23044 ms of spa_sync; avg sz : 127 KB; throughput 60 MB/s
RBPE>   2680 MB; 44209 ms of spa_sync; avg sz : 127 KB; throughput 60 MB/s
RBPE>   1359 MB; 24223 ms of spa_sync; avg sz : 127 KB; throughput 56 MB/s
RBPE>   1143 MB; 19183 ms of spa_sync; avg sz : 126 KB; throughput 59 MB/s

RBPE> My disk is <HITACHI-DK32EJ36NSUN36G-PQ08-33.92GB>.

Is it over FC or just SCSI/SAS?

I have to try again with SAS/SCSI - maybe due to more overhead in FC, larger IOs give better results there than on SCSI?

--
Best regards,
 Robert                          mailto:rmilkowski at task.gda.pl
                                 http://milek.blogspot.com
Robert Milkowski
2006-May-23 21:23 UTC
[zfs-discuss] Re: Re: Due to 128KB limit in ZFS it can't saturate disks
Hello Roch,

Friday, May 19, 2006, 3:53:35 PM, you wrote:

RBPE> Robert Milkowski writes:
>> Hello Roch,
>>
>> Monday, May 15, 2006, 3:23:14 PM, you wrote:
>>
>> RBPE> The question put forth is whether the ZFS 128K blocksize is sufficient to saturate a regular disk. There is a great body of evidence that shows that bigger write sizes and a matching large FS clustersize lead to more throughput. The counter point is that ZFS schedules its I/O like nothing else seen before and manages to saturate a single disk using enough concurrent 128K I/O.
>>
>> Nevertheless I get much more throughput using UFS and writing with large blocks than using ZFS on the same disk. And the difference is actually quite big in favor of UFS.
>>

RBPE> Absolutely. Isn't this the following issue, though?

RBPE>     6415647 Sequential writing is jumping

RBPE> We will have to fix this to allow dd to get more throughput. I'm pretty sure the fix won't need to increase the blocksize though.

Maybe - but it also means that until this is addressed, it doesn't make any sense to compare ZFS to other filesystems for sequential writing...

The question is how well the above problem is understood and when it is going to be corrected.

And why, in your test cases, which are similar to mine, do you not see dd to the raw device being faster by any significant factor? (Again, maybe you are using SCSI and I use FC.)

--
Best regards,
 Robert                          mailto:rmilkowski at task.gda.pl
                                 http://milek.blogspot.com
Torsten "Paul" Eichstädt
2007-Oct-24 17:44 UTC
[zfs-discuss] Due to 128KB limit in ZFS it can't saturate disks
Hi,

what is the exact syntax to enable large transfer sizes? I didn't find any documentation, so I guessed the following:

/etc/system:
  set maxphys=8338608

/kernel/drv/sd.conf:
  name="sd" parent="scsi" sd_max_xfer_size=0x800000;

/kernel/drv/ssd.conf:
  name="ssd" parent="scsi_vhci" sd_max_xfer_size=0x800000;

(I have FC drives.)

Where can I teach myself about the disadvantages? I searched for an article or paper about "Why 128k blocksize is enough" written by the ZFS designer, but could not find it...

Thx in advance,
Paul

This message posted from opensolaris.org
See the QFS documentation: http://docs.sun.com/source/817-4091-10/chapter8.html#59255 (Steps 1 through 4 would apply to any file system which can issue multi-megabyte I/O requests.) This message posted from opensolaris.org