Good day,

The speed of send/recv is around 30-60 MBytes/s for the initial send and 17-25 MBytes/s for incrementals. I have seen lots of setups, from 1 disk to 100+ disks in the pool, but the speed doesn't vary to any noticeable degree. As I understand it, 'zfs send' is the limiting factor. I did tests by sending to /dev/null. It worked out too slow and absolutely not scalable. None of the cpu/memory/disk activity was at peak load, so there is room for improvement.

Is there any bug report or article that addresses this problem? Any workaround or solution?

I found these guys have the same result - around 7 MBytes/s for 'send' and 70 MBytes/s for 'recv'.
http://wikitech-static.wikimedia.org/articles/z/f/s/Zfs_replication.html

Thank you in advance,
Anatoly Legkodymov.
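For anyone reproducing this, a minimal way to time 'zfs send' in isolation is to pipe it through pv to /dev/null, as the poster describes. The pool, filesystem, and snapshot names below are placeholders, and pv is assumed to be installed:

# full send, rate shown by pv
zfs send tank/fs@now | pv > /dev/null

# incremental send between two existing snapshots
zfs send -i tank/fs@earlier tank/fs@now | pv > /dev/null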
On 11/15/11 23:05, Anatoly wrote:
> Good day,
>
> The speed of send/recv is around 30-60 MBytes/s for initial send and
> 17-25 MBytes/s for incremental. I have seen lots of setups with 1 disk
> to 100+ disks in pool. But the speed doesn't vary in any degree. As I
> understand 'zfs send' is a limiting factor. I did tests by sending to
> /dev/null. It worked out too slow and absolutely not scalable.
> None of cpu/memory/disk activity were in peak load, so there is
> room for improvement.
>
> Is there any bug report or article that addresses this problem? Any
> workaround or solution?
>
> I found these guys have the same result - around 7 Mbytes/s for 'send'
> and 70 Mbytes for 'recv'.
> http://wikitech-static.wikimedia.org/articles/z/f/s/Zfs_replication.html

Well, if I do a zfs send/recv over 1Gbit ethernet from a 2 disk mirror, the send runs at almost 100Mbytes/sec, so it's pretty much limited by the ethernet.

Since you have provided none of the diagnostic data you collected, it's difficult to guess what the limiting factor is for you.

--
Andrew Gabriel
On Tue, Nov 15, 2011 at 5:17 PM, Andrew Gabriel <andrew.gabriel at oracle.com> wrote:
> On 11/15/11 23:05, Anatoly wrote:
>> Good day,
>>
>> The speed of send/recv is around 30-60 MBytes/s for initial send and
>> 17-25 MBytes/s for incremental. I have seen lots of setups with 1 disk to
>> 100+ disks in pool. But the speed doesn't vary in any degree. As I
>> understand 'zfs send' is a limiting factor. I did tests by sending to
>> /dev/null. It worked out too slow and absolutely not scalable.
>> None of cpu/memory/disk activity were in peak load, so there is room
>> for improvement.
>>
>> Is there any bug report or article that addresses this problem? Any
>> workaround or solution?
>>
>> I found these guys have the same result - around 7 Mbytes/s for 'send'
>> and 70 Mbytes for 'recv'.
>> http://wikitech-static.wikimedia.org/articles/z/f/s/Zfs_replication.html
>
> Well, if I do a zfs send/recv over 1Gbit ethernet from a 2 disk mirror,
> the send runs at almost 100Mbytes/sec, so it's pretty much limited by the
> ethernet.
>
> Since you have provided none of the diagnostic data you collected, it's
> difficult to guess what the limiting factor is for you.
>
> --
> Andrew Gabriel

So all the bugs have been fixed? I seem to recall people on this mailing list using mbuffer to speed it up because it was so bursty and slow at one point. IE:
http://blogs.everycity.co.uk/alasdair/2010/07/using-mbuffer-to-speed-up-slow-zfs-send-zfs-receive/

--Tim
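For context, the approach in that blog post replaces the ssh tunnel with mbuffer's built-in network mode, so a memory buffer absorbs the bursts on both ends of the transfer. A minimal sketch of that setup; the host and dataset names are placeholders and the buffer sizes are only illustrative (see the post and mbuffer(1) for tuned values):

# on the receiving host: listen on a port, buffer, then receive
mbuffer -s 128k -m 1G -I 9090 | zfs recv tank/backup

# on the sending host: send the snapshot into the remote buffer
zfs send pool/fs@snap | mbuffer -s 128k -m 1G -O recvhost:9090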
On Wed, Nov 16 at 3:05, Anatoly wrote:
> Good day,
>
> The speed of send/recv is around 30-60 MBytes/s for initial send and
> 17-25 MBytes/s for incremental. I have seen lots of setups with 1
> disk to 100+ disks in pool. But the speed doesn't vary in any degree.
> As I understand 'zfs send' is a limiting factor. I did tests by
> sending to /dev/null. It worked out too slow and absolutely not
> scalable.
> None of cpu/memory/disk activity were in peak load, so there is
> room for improvement.

My belief is that initial/incremental may be affecting it because of initial versus incremental efficiency of the data layout in the pools, not because of something inherent in the send/recv process itself.

There are various send/recv improvements (e.g. don't use SSH as a tunnel) but even that shouldn't be capping you at 17MBytes/sec.

My incrementals get me ~35MB/s consistently. Each incremental is 10-50GB worth of transfer.

cheap gig switch, no jumbo frames
Source = 2 mirrored vdevs + l2arc ssd, CPU = xeon E5520, 6GB RAM
Destination = 4-drive raidz1, CPU = c2d E4500 @2.2GHz, 2GB RAM
tunnel is un-tuned SSH

> I found these guys have the same result - around 7 Mbytes/s for
> 'send' and 70 Mbytes for 'recv'.
> http://wikitech-static.wikimedia.org/articles/z/f/s/Zfs_replication.html

Their data doesn't match mine.

--
Eric D. Mudama
edmudama at bounceswoosh.org
On 11/15/11 23:40, Tim Cook wrote:
> On Tue, Nov 15, 2011 at 5:17 PM, Andrew Gabriel
> <andrew.gabriel at oracle.com> wrote:
>
>> On 11/15/11 23:05, Anatoly wrote:
>>
>>> Good day,
>>>
>>> The speed of send/recv is around 30-60 MBytes/s for initial
>>> send and 17-25 MBytes/s for incremental. I have seen lots of
>>> setups with 1 disk to 100+ disks in pool. But the speed
>>> doesn't vary in any degree. As I understand 'zfs send' is a
>>> limiting factor. I did tests by sending to /dev/null. It
>>> worked out too slow and absolutely not scalable.
>>> None of cpu/memory/disk activity were in peak load, so there
>>> is room for improvement.
>>>
>>> Is there any bug report or article that addresses this
>>> problem? Any workaround or solution?
>>>
>>> I found these guys have the same result - around 7 Mbytes/s
>>> for 'send' and 70 Mbytes for 'recv'.
>>> http://wikitech-static.wikimedia.org/articles/z/f/s/Zfs_replication.html
>>
>> Well, if I do a zfs send/recv over 1Gbit ethernet from a 2 disk
>> mirror, the send runs at almost 100Mbytes/sec, so it's pretty much
>> limited by the ethernet.
>>
>> Since you have provided none of the diagnostic data you collected,
>> it's difficult to guess what the limiting factor is for you.
>>
>> --
>> Andrew Gabriel
>
> So all the bugs have been fixed?

Probably not, but the OP's implication that zfs send has a specific rate limit in the range suggested is demonstrably untrue. So I don't know what's limiting the OP's send rate. (I could guess a few possibilities, but that's pointless without the data.)

> I seem to recall people on this mailing list using mbuffer to speed it
> up because it was so bursty and slow at one point. IE:
> http://blogs.everycity.co.uk/alasdair/2010/07/using-mbuffer-to-speed-up-slow-zfs-send-zfs-receive/

Yes, this idea originally came from me, having analyzed the send/receive traffic behavior in combination with network connection behavior. However, it's the receive side that's bursty around the TXG commits, not the send side, so that doesn't match the issue the OP is seeing. (The buffer sizes in that blog are not optimal, although any buffer at the receive side will make a significant improvement if the network bandwidth is the same order of magnitude as what the send/recv are capable of.)

--
Andrew Gabriel
On 11/16/11 01:01 PM, Eric D. Mudama wrote:
> On Wed, Nov 16 at 3:05, Anatoly wrote:
>> Good day,
>>
>> The speed of send/recv is around 30-60 MBytes/s for initial send and
>> 17-25 MBytes/s for incremental. I have seen lots of setups with 1
>> disk to 100+ disks in pool. But the speed doesn't vary in any degree.
>> As I understand 'zfs send' is a limiting factor. I did tests by
>> sending to /dev/null. It worked out too slow and absolutely not
>> scalable.
>> None of cpu/memory/disk activity were in peak load, so there is
>> room for improvement.
>
> My belief is that initial/incremental may be affecting it because of
> initial versus incremental efficiency of the data layout in the pools,
> not because of something inherent in the send/recv process itself.
>
> There are various send/recv improvements (e.g. don't use SSH as a
> tunnel) but even that shouldn't be capping you at 17MBytes/sec.
>
> My incrementals get me ~35MB/s consistently. Each incremental is
> 10-50GB worth of transfer.

While my incremental sizes are much smaller, the rate I see for dense incrementals (large blocks of changes, such as media files) is about the same. I do see much lower rates for more scattered changes (such as filesystems with documents).

--
Ian.
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Anatoly
>
> The speed of send/recv is around 30-60 MBytes/s for initial send and
> 17-25 MBytes/s for incremental. I have seen lots of setups with 1 disk

I suggest watching zpool iostat before, during, and after the send to /dev/null. Actually, I take that back - zpool iostat seems to measure virtual IOPS. I just did this on my laptop a minute ago and saw 1.2k ops, which is at least 5-6x higher than my hard drive can handle, which can only mean it's reading a lot of previously aggregated small blocks from disk, which are now sequentially organized on disk. How do you measure physical iops? Is it just regular iostat? I have seriously put zero effort into answering this question (sorry.)

I have certainly noticed a delay in the beginning, while the system thinks about stuff for a little while to kick off an incremental... And it's acknowledged and normal that incrementals are likely fragmented all over the place, so you could be IOPS limited (hence watching the iostat).

Also, whenever I sit and watch it for long times, I see that it varies enormously. For 5 minutes it will be (some speed), and for 5 minutes it will be 5x higher...

Whatever it is, it's something we likely are all seeing, but probably just ignoring. If you can find it in your heart to just ignore it too, then great, no problem. ;-) Otherwise, it's a matter of digging in and characterizing it to learn more about it.
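On the physical-versus-virtual IOPS question: on a Solaris-derived system the usual sketch is to watch plain iostat for per-device (physical) activity alongside zpool iostat for the pool-level (virtual) view while the send runs. The pool name and interval below are placeholders:

# per-device extended statistics, 5-second intervals
iostat -xn 5

# pool and per-vdev statistics over the same window
zpool iostat -v tank 5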
On Tue, November 15, 2011 20:08, Edward Ned Harvey wrote:
>> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
>> bounces at opensolaris.org] On Behalf Of Anatoly
>>
>> The speed of send/recv is around 30-60 MBytes/s for initial send and
>> 17-25 MBytes/s for incremental. I have seen lots of setups with 1 disk
>
> I suggest watching zpool iostat before, during, and after the send to
> /dev/null. Actually, I take that back - zpool iostat seems to measure
> virtual IOPS. I just did this on my laptop a minute ago and saw 1.2k
> ops, which is at least 5-6x higher than my hard drive can handle, which
> can only mean it's reading a lot of previously aggregated small blocks
> from disk, which are now sequentially organized on disk. How do you
> measure physical iops? Is it just regular iostat? I have seriously put
> zero effort into answering this question (sorry.)
>
> I have certainly noticed a delay in the beginning, while the system thinks
> about stuff for a little while to kick off an incremental... And it's
> acknowledged and normal that incrementals are likely fragmented all over
> the place, so you could be IOPS limited (hence watching the iostat).
>
> Also, whenever I sit and watch it for long times, I see that it varies
> enormously. For 5 minutes it will be (some speed), and for 5 minutes it
> will be 5x higher...
>
> Whatever it is, it's something we likely are all seeing, but probably just
> ignoring. If you can find it in your heart to just ignore it too, then
> great, no problem. ;-) Otherwise, it's a matter of digging in and
> characterizing it to learn more about it.

I see rather variable io stats while sending incremental backups. The receiver is a USB disk, so fairly slow, but I get 30MB/s in a good stretch.

I'm compressing the ZFS filesystem on the receiving end, but much of my content is already-compressed photo files, so it doesn't make a huge difference. Helps some, though, and at 30MB/s there's no shortage of CPU horsepower to handle the compression.

The raw files are around 12MB each, probably not fragmented much (they're just copied over from memory cards). For a small number of the files, there's a Photoshop file that's much bigger (sometimes more than 1GB, if it's a stitched panorama with layers of changes). And then there are sidecar XMP files, mostly two per image, and for most of them web-resolution images, around 100kB.

--
David Dyer-Bennet, dd-b at dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info
On Tue, November 15, 2011 17:05, Anatoly wrote:
> Good day,
>
> The speed of send/recv is around 30-60 MBytes/s for initial send and
> 17-25 MBytes/s for incremental. I have seen lots of setups with 1 disk
> to 100+ disks in pool. But the speed doesn't vary in any degree. As I
> understand 'zfs send' is a limiting factor. I did tests by sending to
> /dev/null. It worked out too slow and absolutely not scalable.
> None of cpu/memory/disk activity were in peak load, so there is room
> for improvement.

What you're probably seeing with incremental sends is that the disks being read are hitting their IOPS limits. Zfs send does random reads all over the place -- every block that's changed since the last incremental send is read, in TXG order. So that's essentially random reads all over the disk.

--
David Dyer-Bennet, dd-b at dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info
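A rough back-of-envelope shows why an IOPS-bound incremental send can look so slow. The figures below are typical assumptions (a 7200rpm disk doing on the order of 100 random reads per second, the default 128 KB ZFS recordsize), not measurements from the poster's system:

  ~100 random reads/s x 128 KB per block ~= 12-13 MB/s per disk

So when the changed blocks are scattered, each disk contributes little more than its random-read rate times the block size, regardless of how much sequential bandwidth the pool has in aggregate.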
Good day,

I've just made a clean test for sequential data read. The system has 45 mirror vdevs.

1. Create a 160GB random file.
2. Read it to /dev/null.
3. Do a snapshot and send it to /dev/null.
4. Compare results.

1. Write speed is slow due to 'urandom':
# dd if=/dev/urandom bs=128k | pv > big_file
161118683136 bytes (161 GB) copied, 3962.15 seconds, 40.7 MB/s

2. Read file normally:
# time dd if=./big_file bs=128k of=/dev/null
161118683136 bytes (161 GB) copied, 103.455 seconds, 1.6 GB/s
real    1m43.459s
user    0m0.899s
sys     1m25.078s

3. Snapshot & send:
# zfs snapshot volume/test@A
# time zfs send volume/test@A > /dev/null
real    7m20.635s
user    0m0.004s
sys     0m52.760s

4. As you see, there is a 4x difference on a pure sequential read, under greenhouse conditions. I repeated the tests a couple of times to check ARC influence - not much difference.

Real send speed on this system is around 60 MBytes/s with peaks around 100. The file read operation scales well with a large number of disks, but 'zfs send' is lame. In normal conditions, moving large portions of data may take days to weeks. It can't fill a 10G Ethernet connection, sometimes not even 1G.

Best regards,
Anatoly Legkodymov.

On 16.11.2011 06:08, Edward Ned Harvey wrote:
>> From: zfs-discuss-bounces@opensolaris.org [mailto:zfs-discuss-
>> bounces@opensolaris.org] On Behalf Of Anatoly
>>
>> The speed of send/recv is around 30-60 MBytes/s for initial send and
>> 17-25 MBytes/s for incremental. I have seen lots of setups with 1 disk
>
> I suggest watching zpool iostat before, during, and after the send to
> /dev/null. Actually, I take that back - zpool iostat seems to measure
> virtual IOPS. I just did this on my laptop a minute ago and saw 1.2k
> ops, which is at least 5-6x higher than my hard drive can handle, which
> can only mean it's reading a lot of previously aggregated small blocks
> from disk, which are now sequentially organized on disk. How do you
> measure physical iops? Is it just regular iostat? I have seriously put
> zero effort into answering this question (sorry.)
>
> I have certainly noticed a delay in the beginning, while the system thinks
> about stuff for a little while to kick off an incremental... And it's
> acknowledged and normal that incrementals are likely fragmented all over
> the place, so you could be IOPS limited (hence watching the iostat).
>
> Also, whenever I sit and watch it for long times, I see that it varies
> enormously. For 5 minutes it will be (some speed), and for 5 minutes it
> will be 5x higher...
>
> Whatever it is, it's something we likely are all seeing, but probably just
> ignoring. If you can find it in your heart to just ignore it too, then
> great, no problem. ;-) Otherwise, it's a matter of digging in and
> characterizing it to learn more about it.
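One way to sanity-check the ARC-influence question in a test like this is to compare the ARC hit/miss counters before and after each read. A minimal sketch, assuming a Solaris-derived system where the counters are exposed under the standard arcstats kstat:

# record the counters, run the dd or zfs send, then record them again and diff
kstat -m zfs -n arcstats -s hits
kstat -m zfs -n arcstats -s misses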
On Wed, Nov 16, 2011 at 11:07 AM, Anatoly <legko777 at fastmail.fm> wrote:
> I've just made a clean test for sequential data read. The system has 45
> mirror vdevs.
>
> 1. Create a 160GB random file.
> 2. Read it to /dev/null.
> 3. Do a snapshot and send it to /dev/null.
> 4. Compare results.

What OS? The following is under Solaris 10U9 with CPU_2010-10 + an IDR for a SAS/SATA drive bug.

I just had to replicate over 20TB of small files, `zfs send -R <zfs@snap> | zfs recv -e <zfs>`, and I got an AVERAGE throughput of over 77MB/sec (over 6TB/day). The entire replication took just over 3 days.

The source zpool is on J4400 750GB SATA drives, 110 of them in a RAIDz2 configuration (22 vdevs of 5 disks each). The target was a pair of old h/w raid boxes (one without any NVRAM cache) and a zpool configuration of 6 striped vdevs (a total of 72 drives behind the h/w raid controller doing raid5; this is temporary and only for moving data physically around, so the lack of ZFS redundancy is not an issue). There are over 2300 snapshots on the source side and we were replicating close to 2000 of them.

--
{--------1---------2---------3---------4---------5---------6---------7---------}
Paul Kraus
-> Senior Systems Architect, Garnet River ( http://www.garnetriver.com/ )
-> Sound Coordinator, Schenectady Light Opera Company ( http://www.sloctheater.org/ )
-> Technical Advisor, RPI Players
On Wed, Nov 16 at 9:35, David Dyer-Bennet wrote:
> On Tue, November 15, 2011 17:05, Anatoly wrote:
>> Good day,
>>
>> The speed of send/recv is around 30-60 MBytes/s for initial send and
>> 17-25 MBytes/s for incremental. I have seen lots of setups with 1 disk
>> to 100+ disks in pool. But the speed doesn't vary in any degree. As I
>> understand 'zfs send' is a limiting factor. I did tests by sending to
>> /dev/null. It worked out too slow and absolutely not scalable.
>> None of cpu/memory/disk activity were in peak load, so there is room
>> for improvement.
>
> What you're probably seeing with incremental sends is that the disks being
> read are hitting their IOPS limits. Zfs send does random reads all over
> the place -- every block that's changed since the last incremental send is
> read, in TXG order. So that's essentially random reads all over the disk.

Anatoly didn't state whether his 160GB file test was done on a virgin pool, or whether it was allocated out of an existing pool. If the latter, your comment is the likely explanation. If the former, your comment wouldn't explain the slow performance.

--eric

--
Eric D. Mudama
edmudama at bounceswoosh.org
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Anatoly
>
> I've just made a clean test for sequential data read. The system has 45
> mirror vdevs.

90 disks in the system... I bet you have a lot of ram?

> 2. Read file normally:
> # time dd if=./big_file bs=128k of=/dev/null
> 161118683136 bytes (161 GB) copied, 103.455 seconds, 1.6 GB/s

I wonder how much of that is being read back from cache. Would it be impossible to reboot, or otherwise invalidate the cache, before reading the file back?

With 90 disks, in theory, you should be able to read something like 90 Gbit (roughly 11 GB) / sec. But of course various bus speed bottlenecks come into play, so I don't think the 1.6GB/s is unrealistically high in any way.

> 3. Snapshot & send:
> # zfs snapshot volume/test@A
> # time zfs send volume/test@A > /dev/null
> real    7m20.635s
> user    0m0.004s
> sys     0m52.760s

This doesn't surprise me; based on gut feel, I don't think zfs send performs optimally in general. I think your results are probably correct, and even if you revisit all this, doing the reboots (or cache invalidation) and/or using a newly created pool, as anyone here might suggest... I think you'll still see the same results. Somewhat unpredictably.

Even so, I always find zfs send performance still beats the pants off any alternative... rsync and whatnot.
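One way to do the cache invalidation suggested above without a full reboot is to export and re-import the pool, which drops its cached data from the ARC before the file is re-read. A minimal sketch, assuming the data pool from the earlier test is named 'volume', nothing else is using it, and the file path is a placeholder:

# zpool export volume
# zpool import volume
# time dd if=/volume/test/big_file bs=128k of=/dev/null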
On Nov 16, 2011, at 7:35 AM, David Dyer-Bennet wrote:
> On Tue, November 15, 2011 17:05, Anatoly wrote:
>> Good day,
>>
>> The speed of send/recv is around 30-60 MBytes/s for initial send and
>> 17-25 MBytes/s for incremental. I have seen lots of setups with 1 disk
>> to 100+ disks in pool. But the speed doesn't vary in any degree. As I
>> understand 'zfs send' is a limiting factor. I did tests by sending to
>> /dev/null. It worked out too slow and absolutely not scalable.
>> None of cpu/memory/disk activity were in peak load, so there is room
>> for improvement.
>
> What you're probably seeing with incremental sends is that the disks being
> read are hitting their IOPS limits. Zfs send does random reads all over
> the place -- every block that's changed since the last incremental send is
> read, in TXG order. So that's essentially random reads all over the disk.

Not necessarily. I've seen sustained zfs sends in the 600+ MB/sec range for modest servers. It does depend on how the data is used more than the hardware it is stored upon.
 -- richard

--
ZFS and performance consulting
http://www.RichardElling.com
LISA '11, Boston, MA, December 4-9