I''m doing a "zfs send -R | zfs receive" on a snv_129 system. The target filesystem has dedup enabled, but since it was upgraded from b125 the existing data is not deduped. The pool is an 8-disk raidz2. The system has 8gb of memory, and a dual core Athlon 4850e cpu. I''ve set dedup=verify at the top level filesystem, which is inherited by everything. I started the send this morning, and as of now it''s only send 590gb of a 867gb filesystem. According to "zpool iostat 60", it''s writing at about 12-13mb/sec. Reads tend to be bursty, which indicates that the bottleneck is in writing. I realize that dedup is an intensive task, especially with the verify option. However, the system load is only at 0.8, and zpool-tank is only using ~ 10% of the cpu. There''s no time spent in iowait. Any ideas on what I might do to speed things up? $ arcstat.pl -x 5 3 Time mfu mru mfug mrug eskip mtxmis rmis dread pread read 01:06:33 28M 2M 694K 694K 1M 1K 0 20M 11M 31M <State Changed> 01:06:38 721 187 18 18 2 0 0 785 114 899 01:06:43 755 162 46 46 2 0 0 733 217 950 $ arcstat.pl 5 3 Time read miss miss% dmis dm% pmis pm% mmis mm% arcsz c 01:06:59 31M 3M 10 1M 6 1M 16 1M 5 3G 3G 01:07:04 785 130 16 87 11 42 83 87 12 3G 3G 01:07:09 647 195 30 84 16 111 88 89 20 3G 3G Thanks -B -- Brandon High : bhigh at freaks.com When in doubt, use brute force.
I'll first suggest questioning the measurement of speed you're getting,
12.5MB/sec. I'll suggest another, more accurate method:

  date ; zfs send somefilesystem | pv -b | ssh somehost "zfs receive foo" ; date

At any given time you can see how many bytes have transferred in
aggregate, and what time it started. I know you have to do some
calculation this way, but it's really the best aggregator that I know of
to measure the average speed. You'll stop seeing the "burstiness" caused
by buffering, and actually see the speed of whatever is the bottleneck.

Assuming you're on GB ether ... You said 12.5 MB/s, which is
coincidentally exactly 100Mb/sec. But assuming you're tunneling over ssh,
the actual network utilization is probably 200Mb/sec. The best you could
possibly hope for is 4x faster, so while you might call that "very slow,"
I think it's only a little bit slow.

Also, are you reading from raidz2, or writing to raidz2? Raidz2 is not
exactly the fastest configuration in the world. You might try doing a
"zfs send" to /dev/null and see how fast the disks themselves actually
go. Reading is probably not the issue, but receiving, with dedup and
verification, while writing to raidz2? Maybe.

If you want better performance, think about using a bunch of mirrors and
concatenating (striping) them together. You'll have better performance
but less usable space that way.
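For what it's worth, pv can also do the averaging itself, so the two
date calls and the arithmetic can be skipped. A minimal sketch
(untested here; "tank/fs@snap", "somehost", and "backup/fs" are
placeholder names, and -F on the receive is only safe if the target can
be overwritten):

  # -t elapsed time, -a running average rate, -b total bytes
  zfs send tank/fs@snap | pv -tab | ssh somehost "zfs receive -F backup/fs"

  # same-pool copy, as in the original post, with no network or ssh involved
  zfs send -R tank/fs@snap | pv -tab | zfs receive -F tank/fs-dedup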
On Wed, 16 Dec 2009, Brandon High wrote:

> I've set dedup=verify at the top level filesystem, which is inherited
> by everything.
>
> I started the send this morning, and as of now it's only sent 590GB of
> an 867GB filesystem. According to "zpool iostat 60", it's writing at
> about 12-13MB/sec. Reads tend to be bursty, which indicates that the
> bottleneck is in writing.

Someone else here complained just a day or two ago about a similar
problem, but with an update to the sending system rather than to the
receiving system. In his case 'zfs send' to /dev/null was still quite
fast and the network was also quite fast (when tested with benchmark
software). The implication is that ssh network transfer performance may
have dropped with the update. It is wise to rule out other factors
before blaming zfs.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
On Wed, Dec 16, 2009 at 7:41 AM, Edward Ned Harvey <solaris at nedharvey.com> wrote:
> I'll first suggest questioning the measurement of speed you're getting,
> 12.5MB/sec. I'll suggest another, more accurate method:
> date ; zfs send somefilesystem | pv -b | ssh somehost "zfs receive foo" ; date

The send failed (I toggled com.sun:auto-snapshot on the new fs so
snapshots would continue running elsewhere), but the summary line reads:

summary:  760 GByte in 18 h 35 min  11.6 MB/s, 9x empty

> speed. You'll stop seeing the "burstiness" caused by buffering, and
> actually see the speed of whatever is the bottleneck.

60-second intervals on iostat provided enough smoothing, it would appear.

> exactly 100Mb/sec. But assuming you're tunneling over ssh, the actual

I'm sending to another fs in the same pool to enable dedup. No network
involved.

> Also, are you reading from raidz2, or writing to raidz2?

Same zpool, raidz2.

> Raidz2 is not exactly the fastest configuration in the world. You might try
> doing a "zfs send" to /dev/null and see how fast the disks themselves
> actually go.

I'm running it right now through mbuffer, since that gives a speed
estimate. It's well above 250 MB/s.

Maybe I should have said that zfs receive is slow, since that appears to
be the case. I noticed the same issue on another host, but assumed that
was because it has a slow CPU and less RAM (Atom 330, 2GB). It was doing
a receive at about the same speed to a 2-disk non-redundant pool over
gigabit.

> If you want better performance, think about using a bunch of mirrors and
> concatenating (striping) them together. You'll have better performance
> but less usable space that way.

I don't need the performance for normal use. Space and MTTDL were my
priorities.

-B

--
Brandon High : bhigh at freaks.com
Better to understand a little than to misunderstand a lot.
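For anyone repeating the mbuffer test, the pipelines were along these
lines (a sketch; the pool, dataset, and snapshot names are placeholders,
and the buffer sizes are just what was handy):

  # read-speed test: send to /dev/null through mbuffer to get a rate readout
  zfs send -R tank/fs@snap | mbuffer -s 128k -m 100M > /dev/null

  # same-pool copy, so the dedup work all lands on the receive side
  zfs send -R tank/fs@snap | mbuffer -s 128k -m 100M | zfs receive -F tank/fs-dedup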
On Wed, Dec 16, 2009 at 8:05 AM, Bob Friesenhahn
<bfriesen at simple.dallas.tx.us> wrote:
> In his case 'zfs send' to /dev/null was still quite fast and the network
> was also quite fast (when tested with benchmark software). The implication
> is that ssh network transfer performance may have dropped with the update.

zfs send appears to be fast still, but receive is slow.

I tried a pipe from the send straight into the receive, as well as using
mbuffer with a 100MB buffer; both wrote at ~12 MB/s.

-B

--
Brandon High : bhigh at freaks.com
Indecision is the key to flexibility.
Mine is similar (4-disk RAIDZ1):
 - send/recv with dedup on: <4MB/sec
 - send/recv with dedup off: ~80MB/sec
 - send > /dev/null: ~200MB/sec

I know dedup can save some disk bandwidth on write, but it shouldn't save
much read bandwidth (so I think these numbers are right).

There's a warning in a Jeff Bonwick post that if the DDT (de-dupe tables)
doesn't fit in RAM, things will be "slower". Wonder what that threshold is?

A second try of the same "recv" appears to go randomly faster (5-12MB/sec,
bursting to 100MB/sec briefly). With the DDT in core the second try should
be quite a bit faster, but it's not as fast as I'd expect.

My zdb -D output:

DDT-sha256-zap-duplicate: 633396 entries, size 361 on disk, 179 in core
DDT-sha256-zap-unique: 5054608 entries, size 350 on disk, 185 in core

6M entries doesn't sound like that much for a box with 6GB of RAM.

CPU load is also low.

mike
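Taking those zdb -D lines at face value (an assumption on my part that
"size N on disk, M in core" means bytes per entry), the back-of-the-envelope
arithmetic looks like this:

  # in-core footprint: entries * "in core" bytes per entry
  echo $(( (633396 * 179 + 5054608 * 185) / 1024 / 1024 )) MB    # ~999 MB
  # on-disk footprint: entries * "on disk" bytes per entry
  echo $(( (633396 * 361 + 5054608 * 350) / 1024 / 1024 )) MB    # ~1905 MB

So roughly 1GB of DDT would want to stay resident in the ARC on a 6GB
box, which may be why even the second pass still takes misses.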
On Wed, Dec 16, 2009 at 12:19 PM, Michael Herf <mbherf at gmail.com> wrote:
> Mine is similar (4-disk RAIDZ1):
>  - send/recv with dedup on: <4MB/sec
>  - send/recv with dedup off: ~80MB/sec
>  - send > /dev/null: ~200MB/sec
> [...]
> CPU load is also low.

I'm seeing similar results, though my file systems currently have
de-dupe disabled and only compression enabled; both systems are running
snv_129.

An old build 111 host is also sending to the snv_129 main file server
slowly: the 111 box used to send about 25MB/sec over SSH to the main
file server when it ran snv_127. Since 128, however, the main file
server has been receiving ZFS snapshots at a fraction of the previous
speed. 129 fixed it a bit; I was literally getting just a couple hundred
-BYTES- a second on 128, but on 129 I can get about 9-10MB/sec if I'm
lucky, usually 4-5MB/sec.

No other configuration changes on the network occurred, except for my
X4540s being upgraded to snv_129. It does appear to be the zfs receive
part, because I can send to /dev/null at close to 800MB/sec (42 drives
in 5-6 disk vdevs, RAID-Z).

Something must've changed in either SSH or the ZFS receive bits to cause
this, but sadly, since I upgraded my pool, I cannot roll back these
hosts :(

--
Brent Jones
brent at servuhome.net
> I'm seeing similar results, though my file systems currently have
> de-dupe disabled and only compression enabled, both systems being

I can't say this is your issue, but you can count on slower writes with
compression on. How slow is slow? Don't know. Irrelevant in this case?
Possibly.
On Wed, Dec 16, 2009 at 7:43 PM, Edward Ned Harvey <solaris at nedharvey.com> wrote:
>> I'm seeing similar results, though my file systems currently have
>> de-dupe disabled and only compression enabled, both systems being
>
> I can't say this is your issue, but you can count on slower writes with
> compression on. How slow is slow? Don't know. Irrelevant in this case?
> Possibly.

I'm willing to accept slower writes with compression enabled; par for
the course. Local writes, even with compression enabled, can still
exceed 500MB/sec with moderate to high CPU usage.

These problems seem to have manifested after snv_128, and they only
affect ZFS receive speeds. Local pool performance is still very fast.

--
Brent Jones
brent at servuhome.net
On 17 Dec 2009, at 03:19, Brent Jones wrote:

> Something must've changed in either SSH or the ZFS receive bits to cause
> this, but sadly, since I upgraded my pool, I cannot roll back these
> hosts :(

I'm not sure that's the best way, but to look at how ssh is slowing down
the transfer, I usually do something like this:

  zfs send -R data/voxel@2009-12-14_01h44m47 | pv | ssh hex "cat > /dev/null"

I usually get approximately 60 MB/s. Changing the cipher can improve the
transfer rate noticeably: "-c arcfour" gives an extra 10 MB/s here.

If you find that ssh is what is slowing down the transfer, maybe a
cipher other than the default one can help.

Gaëtan

--
Gaëtan Lehmann
Biologie du Développement et de la Reproduction
INRA de Jouy-en-Josas (France)
tel: +33 1 34 65 29 66    fax: 01 34 65 29 09
http://voxel.jouy.inra.fr  http://www.itk.org
http://www.mandriva.org    http://www.bepo.fr
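To compare ciphers directly on the same path, something like this (a
sketch; "hex" and the dataset name come from the example above, and
whether a given cipher is allowed depends on the ssh build and its
configuration):

  # default cipher
  zfs send -R data/voxel@2009-12-14_01h44m47 | pv | ssh hex "cat > /dev/null"

  # arcfour is usually the cheapest cipher CPU-wise
  zfs send -R data/voxel@2009-12-14_01h44m47 | pv | ssh -c arcfour hex "cat > /dev/null"

  # blowfish-cbc is another common lower-overhead choice
  zfs send -R data/voxel@2009-12-14_01h44m47 | pv | ssh -c blowfish-cbc hex "cat > /dev/null"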
> I'm willing to accept slower writes with compression enabled; par for
> the course. Local writes, even with compression enabled, can still
> exceed 500MB/sec with moderate to high CPU usage.
> These problems seem to have manifested after snv_128, and they only
> affect ZFS receive speeds. Local pool performance is still very fast.

Now we're getting somewhere. ;-)
You've tested the source disk (result: fast).
You've tested the destination disk without zfs receive (result: fast).
Now the only two ingredients left are:

  ssh performance, or zfs receive performance.

So, to conclusively identify, prove, and measure that zfs receive is the
problem, how about this:

  zfs send somefilesystem | ssh somehost 'cat > /dev/null'

If that goes slow, then ssh is the culprit.
If that goes fast ... and then you change to "zfs receive" and that goes
slow ... now you've scientifically shown that zfs receive is slow.
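Spelled out as separate steps (a sketch with placeholder names:
"tank/fs@snap", "somehost", "backup/fs"), each one adds a single
component, so whichever step the rate collapses at is the bottleneck:

  # 1. source read speed only
  zfs send tank/fs@snap | pv > /dev/null

  # 2. add the network and ssh, but still no receive on the far end
  zfs send tank/fs@snap | pv | ssh somehost 'cat > /dev/null'

  # 3. add zfs receive; if only this one is slow, the receive itself is the problem
  zfs send tank/fs@snap | pv | ssh somehost 'zfs receive -F backup/fs'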
I have observed the opposite, and I believe that all writes are slow to
my dedup'd pool.

I used local rsync (no ssh) for one of my migrations (so it was
restartable, as it took *4 days*), and the writes were slow just like
zfs recv. I have not seen fast writes of real data to the deduped
volume, if you're copying enough data. (I assume there's some sort of
writeback behavior to make small writes faster?)

Of course if you just use mkfile, it does run amazingly fast.

mike

Edward Ned Harvey wrote:
> So, to conclusively identify, prove, and measure that zfs receive is the
> problem, how about this:
> zfs send somefilesystem | ssh somehost 'cat > /dev/null'
>
> If that goes slow, then ssh is the culprit.
> If that goes fast ... and then you change to "zfs receive" and that goes
> slow ... now you've scientifically shown that zfs receive is slow.
It looks like the kernel is using a lot of memory, which may be part of
the performance problem. The ARC has shrunk to 1G, and the kernel is
using over 5G.

I'm doing a send|receive of 683G of data. I started it last night around
1am, and as of right now it's only sent 450GB. That's about 8.5MB/sec.

Are there any other stats, or dtrace scripts, I can look at to determine
what's happening?

bhigh at basestar:~$ pfexec mdb -k
Loading modules: [ unix genunix specfs dtrace mac cpu.generic
cpu_ms.AuthenticAMD.15 uppc pcplusmp rootnex scsi_vhci zfs sata sd
sockfs ip hook neti sctp arp usba fctl random crypto cpc fcip smbsrv
nfs lofs ufs logindmux ptm sppp ipc ]
> ::memstat
Page Summary                Pages                MB  %Tot
------------     ----------------  ----------------  ----
Kernel                    1405991              5492   67%
ZFS File Data              223137               871   11%
Anon                       396743              1549   19%
Exec and libs                1936                 7    0%
Page cache                   5221                20    0%
Free (cachelist)             9181                35    0%
Free (freelist)             52685               205    3%

Total                     2094894              8183
Physical                  2094893              8183

bhigh at basestar:~$ arcstat.pl 5 3
    Time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  arcsz     c
16:05:33  204M    6M      3    3M    5    3M    2    3M    1     1G    1G
16:05:38   562   101     18    97   17     4   23    97   17     1G    1G
16:05:43    1K   709     39    71    6   637   94    79   15     1G    1G

-B

--
Brandon High : bhigh at freaks.com
Always try to do things in chronological order; it's less confusing that way.
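A couple of things that might help narrow it down (a sketch; ::arc and
::kmastat are standard mdb dcmds, but the ddt_lookup fbt probe is an
assumption about what this build's zfs module exposes):

  # point-in-time ARC breakdown (size, target c, MRU/MFU split)
  echo "::arc" | pfexec mdb -k

  # kernel memory usage by kmem cache, to see what is holding the ~5.5GB
  echo "::kmastat" | pfexec mdb -k

  # rough count of dedup-table lookups over 10 seconds while the receive runs
  # (assumes an fbt entry probe exists for ddt_lookup in this build)
  pfexec dtrace -n 'fbt::ddt_lookup:entry { @lookups = count(); } tick-10s { exit(0); }'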
My ARC is ~3GB. I'm doing a test that copies 10GB of data to a volume
where the blocks should dedupe 100% with existing data.

The first time, the test runs at <5MB/sec and seems to average a 10-30%
ARC *miss* rate, with <400 ARC reads/sec. When things are working at
disk bandwidth, I'm getting 3-5% ARC misses and up to 7k ARC reads/sec.

If I do a "recv" on a small dataset, then immediately destroy & replay
the same thing, I get "in-core" dedupe performance, and it's truly
amazing.

Does anyone know how big the dedupe tables are, and if they can be given
some priority/prefetch in ARC? I think I have enough RAM to make this
work.

mike

On Thu, Dec 17, 2009 at 4:12 PM, Brandon High <bhigh at freaks.com> wrote:
> It looks like the kernel is using a lot of memory, which may be part
> of the performance problem. The ARC has shrunk to 1G, and the kernel
> is using over 5G.
> [...]
On Wed, Dec 16, 2009 at 8:19 AM, Brandon High <bhigh at freaks.com> wrote:
> On Wed, Dec 16, 2009 at 8:05 AM, Bob Friesenhahn
> <bfriesen at simple.dallas.tx.us> wrote:
>> In his case 'zfs send' to /dev/null was still quite fast and the network
>> was also quite fast (when tested with benchmark software). The implication
>> is that ssh network transfer performance may have dropped with the update.
>
> zfs send appears to be fast still, but receive is slow.
>
> I tried a pipe from the send straight into the receive, as well as using
> mbuffer with a 100MB buffer; both wrote at ~12 MB/s.

I did a little bit of testing today. I'm sending from a snv_129 system,
using a 2.31GB filesystem to test.

The sender has 8GB of DDR2-800 memory and an Athlon X2 4850e CPU. It's
using 8x WD Green 5400rpm 1TB drives on a PCI-X controller, in a raidz2.

The receiver has 2GB of DDR2-533 memory and an Atom 330 CPU. It's using
2 Hitachi 7200rpm 1TB drives in a non-redundant zpool. I destroyed and
recreated the zpool on the receiver between tests.

Doing a send to /dev/null completes in under a second, since the entire
dataset can be cached.

Sending across the network to a snv_118 system via netcat, then to
/dev/null, took 45.496s and 40.384s.
Sending across the network to a snv_118 system via netcat, then to
/tank/test, took 45.496s and 40.384s.
Sending across the network via netcat and recv'ing on a snv_118 system
took 101s and 97s.

I rebooted the receiver to a snv_128a BE and did the same tests.

Sending across the network to a snv_128a system via netcat, then to
/dev/null, took 43.067s.
Sending across the network via netcat and recv'ing on a snv_128a system
took 98s with dedup=off.
Sending across the network via netcat and recv'ing on a snv_128a system
took 121s with dedup=on.
Sending across the network via netcat and recv'ing on a snv_128a system
took 134s with dedup=verify.

It looks like the receive times didn't change much for a small dataset.
The change from fletcher4 to sha256 when enabling dedup is probably
responsible for the slowdown. I suspect that the dataset is too small to
run into the performance problems I was seeing. I'll try later with a
larger filesystem and see what the numbers look like.

-B

--
Brandon High : bhigh at freaks.com
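For reference, the netcat runs were along these lines (a sketch; the
host name "receiver", port 9000, and the dataset names are placeholders,
and the listen syntax varies between nc builds):

  # on the receiver: either discard, write a flat file, or actually zfs receive
  nc -l 9000 > /dev/null
  nc -l 9000 > /tank/test/stream.zfs
  nc -l 9000 | zfs receive -F tank/test

  # on the sender: time the same send against each receiver pipeline
  ptime sh -c 'zfs send tank/fs@snap | nc receiver 9000'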