hi,

i have two systems, A (Solaris 10 update 5) and B (Solaris 10 update 6). i'm
using 'zfs send -i' to replicate changes on A to B. however, the 'zfs recv' on
B is running extremely slowly. if i run the zfs send on A and redirect output
to a file, it sends at 2MB/sec. but when i use 'zfs send ... | ssh B zfs
recv', the speed drops to 200KB/sec. according to iostat, B (which is
otherwise idle) is doing ~20MB/sec of disk reads, and very little writing.

i don't believe the problem is ssh, as the systems are on the same LAN, and
running 'tar' over ssh runs much faster (20MB/sec or more).

is this slowness normal? is there any way to improve it? (the idea here is to
use B as a backup of A, but if i can only replicate at 200KB/s, it's not going
to be able to keep up with the load...)

both systems are X4500s with 16GB ram, 48 SATA disks and 4 2.8GHz cores.

thanks, river.
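For reference, a minimal sketch of the kind of replication pipeline described above; the pool, dataset and snapshot names are placeholders, not river's actual configuration:

    # on A: take a new snapshot and send the changes since the previous one to B
    zfs snapshot tank/data@backup2
    zfs send -i tank/data@backup1 tank/data@backup2 | ssh B zfs recv -F tank/data

    # baseline comparison: write the same stream to a local file instead
    zfs send -i tank/data@backup1 tank/data@backup2 > /var/tmp/incr.zfs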
On Fri 07/11/08 12:09, River Tarnell <river at loreley.flyingparchment.org.uk> sent:

> i have two systems, A (Solaris 10 update 5) and B (Solaris 10 update 6). i'm
> using 'zfs send -i' to replicate changes on A to B. however, the 'zfs recv'
> on B is running extremely slowly. if i run the zfs send on A and redirect
> output to a file, it sends at 2MB/sec. but when i use 'zfs send ... | ssh B
> zfs recv', the speed drops to 200KB/sec. according to iostat, B (which is
> otherwise idle) is doing ~20MB/sec of disk reads, and very little writing.
>
> i don't believe the problem is ssh, as the systems are on the same LAN, and
> running 'tar' over ssh runs much faster (20MB/sec or more).
>
> is this slowness normal? is there any way to improve it? (the idea here is
> to use B as a backup of A, but if i can only replicate at 200KB/s, it's not
> going to be able to keep up with the load...)

That's very slow. What's the nature of your data? I'm currently replicating
data between an x4500 and an x4540 and I see about 50% of ftp transfer speed
for zfs send/receive (about 60GB/hour).

Time each phase (send to a file, copy the file to B and receive from the
file). When I tried this on a filesystem with a range of file sizes, I had
about 30% of the total transfer time in send, 50% in copy and 20% in receive.

--
Ian.
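A minimal sketch of timing the three phases Ian describes; the dataset, snapshot and host names are placeholders:

    # phase 1: send the incremental stream to a local file
    time zfs send -i tank/data@backup1 tank/data@backup2 > /var/tmp/incr.zfs

    # phase 2: copy the file to B
    time scp /var/tmp/incr.zfs B:/var/tmp/incr.zfs

    # phase 3: receive from the file on B
    time ssh B 'zfs recv -F tank/data < /var/tmp/incr.zfs'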
Ian Collins:
> That's very slow. What's the nature of your data?

mainly two sets of mid-sized files; one of 200KB-2MB in size and the other
under 50KB. they are organised into subdirectories, A/B/C/<file>. each
directory has 18,000-25,000 files. total data size is around 2.5TB.

hm, something changed while i was writing this mail: now the transfer is
running at 2MB/sec, and the read i/o has disappeared. that's still slower than
i'd expect, but an improvement.

> Time each phase (send to a file, copy the file to B and receive from the
> file). When I tried this on a filesystem with a range of file sizes, I had
> about 30% of the total transfer time in send, 50% in copy and 20% in
> receive.

i'd rather not interrupt the current send, as it's quite large. once it's
finished, i'll test with smaller changes...

- river.
On Thu, Nov 6, 2008 at 4:19 PM, River Tarnell
<river at loreley.flyingparchment.org.uk> wrote:
> mainly two sets of mid-sized files; one of 200KB-2MB in size and the other
> under 50KB. they are organised into subdirectories, A/B/C/<file>. each
> directory has 18,000-25,000 files. total data size is around 2.5TB.
>
> hm, something changed while i was writing this mail: now the transfer is
> running at 2MB/sec, and the read i/o has disappeared. that's still slower
> than i'd expect, but an improvement.

There's been a couple of threads about this now, tracked under some bug
IDs/tickets, if you want to see the status:

6333409
6418042
66104157

--
Brent Jones
brent at servuhome.net
River Tarnell wrote:
> hm, something changed while i was writing this mail: now the transfer is
> running at 2MB/sec, and the read i/o has disappeared. that's still slower
> than i'd expect, but an improvement.

The transfer I mentioned just completed: 1.45TB sent in 84832 seconds
(17.9MB/sec). This was during a working day when the server and network were
busy. The best ftp speed I managed was 59MB/sec over the same network.

--
Ian.
River Tarnell wrote:
> i have two systems, A (Solaris 10 update 5) and B (Solaris 10 update 6). i'm
> using 'zfs send -i' to replicate changes on A to B. however, the 'zfs recv'
> on B is running extremely slowly.

I'm sorry, I didn't notice the "-i" in your original message. I get the same
problem sending incremental streams between Thumpers.

--
Ian.
Brent Jones wrote:
> There's been a couple of threads about this now, tracked under some bug
> IDs/tickets:
>
> 6333409
> 6418042

I see these are fixed in build 102.

Are they targeted to get back to Solaris 10 via a patch?

If not, is it worth escalating the issue with support to get a patch?

--
Ian.
Ian Collins wrote:
> Brent Jones wrote:
>> There's been a couple of threads about this now, tracked under some bug
>> IDs/tickets:
>>
>> 6333409
>> 6418042
> I see these are fixed in build 102.
>
> Are they targeted to get back to Solaris 10 via a patch?
>
> If not, is it worth escalating the issue with support to get a patch?

Given the issue described is slow zfs recv over network, I suspect this is:

6729347 Poor zfs receive performance across networks

This is quite easily worked around by putting a buffering program between the
network and the zfs receive. There is a public domain "mbuffer" which should
work, although I haven't tried it as I wrote my own. The buffer size you need
is about 5 seconds worth of data. In my case of 7200RPM disks (in a mirror and
not striped) and a gigabit ethernet link, the disks are the limiting factor at
around 57MB/sec sustained i/o, so I used a 250MB buffer to best effect. If I
recall correctly, that speeded up the zfs send/recv across the network by
about 3 times, and it then ran at the disk platter speed.

--
Andrew
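As an illustration of this workaround (not Andrew's own program), mbuffer can be placed in the pipeline roughly as follows; the port, buffer size and dataset names are assumptions, and the exact options may vary between mbuffer versions:

    # on the receiving host: listen on a TCP port, buffering ahead of zfs recv
    mbuffer -I 9090 -m 250M | zfs recv -F tank/data

    # on the sending host: stream the snapshot through mbuffer to the receiver
    zfs send -i tank/data@backup1 tank/data@backup2 | mbuffer -m 250M -O recvhost:9090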
Andrew Gabriel wrote:
> This is quite easily worked around by putting a buffering program between
> the network and the zfs receive. There is a public domain "mbuffer" which
> should work, although I haven't tried it as I wrote my own. The buffer size
> you need is about 5 seconds worth of data. In my case of 7200RPM disks (in a
> mirror and not striped) and a gigabit ethernet link, the disks are the
> limiting factor at around 57MB/sec sustained i/o, so I used a 250MB buffer
> to best effect. If I recall correctly, that speeded up the zfs send/recv
> across the network by about 3 times, and it then ran at the disk platter
> speed.

Did this apply to incremental sends as well? I can live with ~20MB/sec for
full sends, but ~1MB/sec for incremental sends is a killer.

--
Ian.
Ian Collins wrote:
> Andrew Gabriel wrote:
>> This is quite easily worked around by putting a buffering program between
>> the network and the zfs receive. [...] If I recall correctly, that speeded
>> up the zfs send/recv across the network by about 3 times, and it then ran
>> at the disk platter speed.
>
> Did this apply to incremental sends as well? I can live with ~20MB/sec for
> full sends, but ~1MB/sec for incremental sends is a killer.

It doesn't help the ~1MB/sec periods in incrementals, but it does help the
fast periods in incrementals.

--
Andrew
Andrew Gabriel wrote:
> Ian Collins wrote:
>> Did this apply to incremental sends as well? I can live with ~20MB/sec for
>> full sends, but ~1MB/sec for incremental sends is a killer.
>
> It doesn't help the ~1MB/sec periods in incrementals, but it does help the
> fast periods in incrementals.

:)

I don't see the 5 second bursty behaviour described in the bug report. It's
more like 5 second interval gaps in the network traffic while the data is
written to disk.

--
Ian.
Ian Collins wrote:
> I don't see the 5 second bursty behaviour described in the bug report. It's
> more like 5 second interval gaps in the network traffic while the data is
> written to disk.

That is exactly the issue. When the zfs recv data has been written, zfs recv
starts reading the network again, but there's only a tiny amount of data
buffered in the TCP/IP stack, so it has to wait for the network to heave more
data across. In effect, it's a single buffered copy. The addition of a buffer
program turns it into a double-buffered (or cyclic buffered) copy, with the
disks running flat out continuously, and the network streaming data across
continuously at the disk platter speed.

What are your theoretical max speeds for network and disk i/o? Taking the
smaller of these two, are you seeing the sustained send/recv performance match
that (excluding the ~1MB/sec periods, which is some other problem)?

The effect described in that bug is most obvious when the disk and network
speeds are the same order of magnitude (as in the example I gave above). Given
my disk i/o rate above, if the network is much faster (say, 10Gb), then it's
going to cope with the bursty nature of the traffic better. If the network is
much slower (say, 100Mb), then it's going to be running flat out anyway and
again you won't notice the bursty reads (a colleague measured only 20% gain in
that case, rather than my 200% gain).

--
Andrew
Andrew Gabriel:
> This is quite easily worked around by putting a buffering program between
> the network and the zfs receive.

i tested inserting mbuffer with a 250MB buffer between the zfs send and zfs
recv. unfortunately, it seems to make very little difference to my incremental
send speed. mbuffer reported the average speed after the transfer as:

summary: 81.3 GByte in 30 h 28 min 32.4 sec - average of 777 kB/s

i suppose this is only a benefit when the send is running at a reasonable
speed, i.e. for full sends, not incrementals.

- river.
I have an open ticket to have these putback into Solaris 10.

On Fri, Nov 7, 2008 at 3:24 PM, Ian Collins <ian at ianshome.com> wrote:
> I see these are fixed in build 102.
>
> Are they targeted to get back to Solaris 10 via a patch?
>
> If not, is it worth escalating the issue with support to get a patch?
River Tarnell wrote:
> Andrew Gabriel:
>> This is quite easily worked around by putting a buffering program between
>> the network and the zfs receive.
>
> i tested inserting mbuffer with a 250MB buffer between the zfs send and zfs
> recv. unfortunately, it seems to make very little difference to my
> incremental send speed. mbuffer reported the average speed after the
> transfer as:
>
> summary: 81.3 GByte in 30 h 28 min 32.4 sec - average of 777 kB/s
>
> i suppose this is only a benefit when the send is running at a reasonable
> speed, i.e. for full sends, not incrementals.

A similar test (which yields similar results) is to send an incremental to a
file over NFS and then receive from the file.

--
Ian.
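A minimal sketch of that test, assuming the receiver exports a filesystem that is NFS-mounted on the sender; the paths and dataset names are placeholders:

    # on the sending host: write the incremental stream to an NFS share on the receiver
    zfs send -i tank/data@backup1 tank/data@backup2 > /net/recvhost/backup/incr.zfs

    # on the receiving host: receive from the local copy of that file
    zfs recv -F tank/data < /backup/incr.zfs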
If anyone out there has a support contract with Sun that covers Solaris 10,
feel free to email me and/or Sun and have them add you to my support case. The
Sun case is 66104157 and I am seeking to have 6333409 and 6418042 putback into
Solaris 10.

CR 6712788 was closed as a duplicate of CR 6421958, the fix for which is
scheduled to be included in Update 6.

On Mon, Nov 10, 2008 at 12:24 PM, Scott Williamson
<scott.williamson at gmail.com> wrote:
> I have an open ticket to have these putback into Solaris 10.
Andrew Gabriel <Andrew.Gabriel at Sun.COM> wrote:
> That is exactly the issue. When the zfs recv data has been written, zfs recv
> starts reading the network again, but there's only a tiny amount of data
> buffered in the TCP/IP stack, so it has to wait for the network to heave
> more data across. In effect, it's a single buffered copy. The addition of a
> buffer program turns it into a double-buffered (or cyclic buffered) copy,
> with the disks running flat out continuously, and the network streaming
> data across continuously at the disk platter speed.

rmt and star increase the socket read/write buffer size via

    setsockopt(STDOUT_FILENO, SOL_SOCKET, SO_SNDBUF, ...);
    setsockopt(STDIN_FILENO, SOL_SOCKET, SO_RCVBUF, ...);

when doing "remote tape access". This has a notable effect on throughput.

Jörg

--
EMail: joerg at schily.isdn.cs.tu-berlin.de (home) Jörg Schilling D-13353 Berlin
       js at cs.tu-berlin.de (uni)
       schilling at fokus.fraunhofer.de (work) Blog: http://schily.blogspot.com/
URL: http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily
Joerg Schilling schrieb:
> rmt and star increase the socket read/write buffer size via
>
>     setsockopt(STDOUT_FILENO, SOL_SOCKET, SO_SNDBUF, ...);
>     setsockopt(STDIN_FILENO, SOL_SOCKET, SO_RCVBUF, ...);
>
> when doing "remote tape access". This has a notable effect on throughput.

yesterday, I released a new version of mbuffer, which also enlarges the
default TCP buffer size. So everybody using mbuffer for network data transfer
might want to update.

For everybody unfamiliar with mbuffer, it might be worth noting that it has a
bunch of additional features, e.g. sending to multiple clients at once and
high/low watermark flushing to prevent tape drives from stop/rewind/restart
cycles.

- Thomas
Hello Thomas,

What is mbuffer? Where might I go to read more about it?

Thanks,

Jerry

> yesterday, I released a new version of mbuffer, which also enlarges the
> default TCP buffer size. So everybody using mbuffer for network data
> transfer might want to update.
>
> For everybody unfamiliar with mbuffer, it might be worth noting that it has
> a bunch of additional features, e.g. sending to multiple clients at once and
> high/low watermark flushing to prevent tape drives from stop/rewind/restart
> cycles.
>
> - Thomas
Thomas Maier-Komor, 2008-Nov-14 16:35 UTC, [zfs-discuss] mbuffer WAS 'zfs recv' is very slow
Jerry K schrieb:
> Hello Thomas,
>
> What is mbuffer? Where might I go to read more about it?
>
> Thanks,
>
> Jerry

The man page is included in the source, which you can get over here:

http://www.maier-komor.de/mbuffer.html

New releases are announced on freshmeat.org. Maybe I should add an HTML
version of the man page to the homepage of mbuffer...

- Thomas
Joerg Schilling wrote:
> rmt and star increase the socket read/write buffer size via
>
>     setsockopt(STDOUT_FILENO, SOL_SOCKET, SO_SNDBUF, ...);
>     setsockopt(STDIN_FILENO, SOL_SOCKET, SO_RCVBUF, ...);
>
> when doing "remote tape access". This has a notable effect on throughput.

Interesting idea, but for 7200 RPM disks (and a 1Gb ethernet link), I need a
250GB buffer (enough to buffer 4-5 seconds worth of data). That's many orders
of magnitude bigger than SO_RCVBUF can go.

--
Andrew
Andrew Gabriel wrote:
> Interesting idea, but for 7200 RPM disks (and a 1Gb ethernet link), I need a
> 250GB buffer (enough to buffer 4-5 seconds worth of data). That's many
> orders of magnitude bigger than SO_RCVBUF can go.

No -- that's wrong -- should read 250MB buffer! Still some orders of magnitude
bigger than SO_RCVBUF can go.

--
Andrew
Andrew Gabriel <Andrew.Gabriel at Sun.COM> wrote:
> Andrew Gabriel wrote:
>> Interesting idea, but for 7200 RPM disks (and a 1Gb ethernet link), I need
>> a 250GB buffer (enough to buffer 4-5 seconds worth of data). That's many
>> orders of magnitude bigger than SO_RCVBUF can go.
>
> No -- that's wrong -- should read 250MB buffer!
> Still some orders of magnitude bigger than SO_RCVBUF can go.

It's affordable e.g. on an X4540 with 64 GB of RAM.

ZFS started with constraints that could not be made true in 2001.

On my first Sun at home (a Sun 2/50 with 1 MB of RAM) in 1986, I could set the
socket buffer size to 63 kB. 63kB : 1 MB is the same ratio as 256 MB : 4 GB.

BTW: a lot of numbers in Solaris have not grown for a long time and thus
create problems now. Just think about the maxphys values.... 63 kB on x86 does
not even allow writing a single BluRay disk sector with a single transfer.

Jörg
On Fri, Nov 14, 2008 at 10:04 AM, Joerg Schilling
<Joerg.Schilling at fokus.fraunhofer.de> wrote:
> BTW: a lot of numbers in Solaris have not grown for a long time and thus
> create problems now. Just think about the maxphys values.... 63 kB on x86
> does not even allow writing a single BluRay disk sector with a single
> transfer.

I'd like to see Sun's position on the speed at which large file systems
perform ZFS send/receive. I expect my X4540's to nearly fill 48TB (or more
considering compression), and taking 24 hours to transfer 100GB is, well, I
could do better on an ISDN line from 1995.

--
Brent Jones
brent at servuhome.net
On Fri, 14 Nov 2008, Joerg Schilling wrote:
> On my first Sun at home (a Sun 2/50 with 1 MB of RAM) in 1986, I could set
> the socket buffer size to 63 kB. 63kB : 1 MB is the same ratio as
> 256 MB : 4 GB.
>
> BTW: a lot of numbers in Solaris have not grown for a long time and thus
> create problems now. Just think about the maxphys values.... 63 kB on x86
> does not even allow writing a single BluRay disk sector with a single
> transfer.

Bloating kernel memory is not the right answer. Solaris comes with a quite
effective POSIX threads library (standard since 1996) which makes it easy to
quickly shuttle the data into a buffer in your own application. One thread
deals with the network while the other thread deals with the device. I imagine
that this is what the supreme mbuffer program is doing.

Bob

=====================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
Joerg Schilling wrote:
> It's affordable e.g. on an X4540 with 64 GB of RAM.

I guess the architectures with limited 256MB and 512MB kernel address space
are mostly retired now.

> ZFS started with constraints that could not be made true in 2001.
>
> On my first Sun at home (a Sun 2/50 with 1 MB of RAM) in 1986, I could set
> the socket buffer size to 63 kB. 63kB : 1 MB is the same ratio as
> 256 MB : 4 GB.
>
> BTW: a lot of numbers in Solaris have not grown for a long time and thus
> create problems now. Just think about the maxphys values.... 63 kB on x86
> does not even allow writing a single BluRay disk sector with a single
> transfer.

I have put together a simple set of figures I use to compare how disks and
systems have changed over the 25 year life of ufs/ffs, which I sometimes use
when I give ZFS presentations...

                      25 years ago        Now       factor
                      ------------        ---       ------
    Disk RPM                 3,600     10,000           x3
    Disk IOPS                   30        300          x10
    Disk data rate        0.96MB/s     75MB/s          x80
    Capacity                 100MB        1TB      x10,000
    System MIPS                  4    400,000     x100,000

--
Andrew
----- original message --------
Subject: Re: [zfs-discuss] 'zfs recv' is very slow
Sent: Fri, 14 Nov 2008
From: Bob Friesenhahn <bfriesen at simple.dallas.tx.us>

> Bloating kernel memory is not the right answer. Solaris comes with a quite
> effective POSIX threads library (standard since 1996) which makes it easy
> to quickly shuttle the data into a buffer in your own application. One
> thread deals with the network while the other thread deals with the device.
> I imagine that this is what the supreme mbuffer program is doing.
>
> Bob

Basically, mbuffer just does this - but it additionally has a whole bunch of
extra functionality. At least there are people who use it to lengthen the life
of their tape drives with the high/low watermark feature...

Thomas
> BTW: a lot of numbers in Solaris have not grown for a long time and thus
> create problems now. Just think about the maxphys values.... 63 kB on x86
> does not even allow writing a single BluRay disk sector with a single
> transfer.

Any "fixed value" will soon be too small (think about ufs_throttles, socket
buffers, etc.)

I'm not sure, however, that making a bigger socket buffer will help all that
much; it's somewhat wrong to give that much kernel memory to the data even
though we know that it won't all be in flight.

But zfs could certainly use bigger buffers; just like mbuffer, I also wrote my
own "pipebuffer" which does pretty much the same.

Casper
Andrew Gabriel <Andrew.Gabriel at Sun.COM> wrote:
> I have put together a simple set of figures I use to compare how disks and
> systems have changed over the 25 year life of ufs/ffs, which I sometimes use
> when I give ZFS presentations...
>
>                       25 years ago        Now       factor
>                       ------------        ---       ------
>     Disk RPM                 3,600     10,000           x3
>     Disk IOPS                   30        300          x10
>     Disk data rate        0.96MB/s     75MB/s          x80
>     Capacity                 100MB        1TB      x10,000
>     System MIPS                  4    400,000     x100,000

The best rate I saw in 1985 was 800 kB/s (with linear reads); now I see
120 MB/s, which is more than x100 ;-)

Jörg
On Fri, 14 Nov 2008, Joerg Schilling wrote:
>>     Disk RPM                 3,600     10,000           x3
>
> The best rate I saw in 1985 was 800 kB/s (with linear reads); now I see
> 120 MB/s, which is more than x100 ;-)

Yes. And now that SSDs are entering the market, the disk RPM has dropped down
to zero. 10,000 --> 0. I am not sure how to interpret that.

Bob
Casper.Dik at Sun.COM wrote:
> But zfs could certainly use bigger buffers; just like mbuffer, I also wrote
> my own "pipebuffer" which does pretty much the same.

You too? (My "buffer" program which I used to diagnose the problem is attached
to the bugid ;-) I know Chris Gerhard wrote one too.

Seems like there's a strong case to have such a program bundled in Solaris.

--
Andrew
Bob Friesenhahn <bfriesen at simple.dallas.tx.us> wrote:
> Yes. And now that SSDs are entering the market, the disk RPM has dropped
> down to zero. 10,000 --> 0. I am not sure how to interpret that.

My tests on an OCZ SSD show a "transfer latency" of ~ 0.1 ms; even SSDs have
something similar to "seek times".

Jörg
> Yes. And now that SSDs are entering the market, the disk RPM has dropped
> down to zero. 10,000 --> 0. I am not sure how to interpret that.

Not zero, infinite RPMs. (Latency is 1/RPM and when RPM becomes infinite, then
latency is 0.)

Casper
Bob Friesenhahn wrote:
> Yes. And now that SSDs are entering the market, the disk RPM has dropped
> down to zero. 10,000 --> 0. I am not sure how to interpret that.

I don't have a data rate for SSDs, but a hard limit is going to be the 3Gb/s
SATA/SAS bus, which is going to be around 300MB/s. I've no idea how close they
come to this in practice.

For IOPS (Input/Output operations per second), the figures are mind-blowing...

    15K SAS drive          Enterprise SSD
    -------------          --------------
    180 Write IOPS          7,000 Write IOPS
    320 Read IOPS          35,000 Read IOPS

I don't have figures for a SATA drive, but they're lower than SAS. The SSD
figures exceed the capabilities of some disk controllers, which can make them
difficult to measure. The read IOPS figure is pretty close to being limited by
the SAS bus.

--
Andrew
>> Seems like there's a strong case to have such a program bundled in Solaris.

I think the idea of having a separate configurable buffer program with a rich
feature set fits the UNIX philosophy of having small programs that can be used
as building blocks to solve larger problems.

mbuffer is already bundled with several Linux distros, and that is also the
reason its feature set expanded over time. In the beginning there wasn't even
support for network transfers. Today mbuffer supports direct transfer to
multiple receivers, data transfer rate limiting, a high/low water mark
algorithm, on-the-fly md5 calculation, multi-volume tape access, use of
sendfile, and a configurable buffer size/layout. So ZFS send/receive is just
another use case for this tool.

- Thomas
Casper.Dik at Sun.COM wrote:
>> BTW: a lot of numbers in Solaris have not grown for a long time and thus
>> create problems now. Just think about the maxphys values.... 63 kB on x86
>> does not even allow writing a single BluRay disk sector with a single
>> transfer.
>
> Any "fixed value" will soon be too small (think about ufs_throttles, socket
> buffers, etc.)

The maxphys limit of 56kB or 63kB in the early 1980s was a result of the fact
that many DMA controllers could only handle 16-bit counters, and because (in a
multi-tasking environment) a typical DMA speed of 600 kB/s would result in
~ 0.1 seconds of wait time for other users.

In 1980, disk sector sizes were 512 bytes. In 1995, the DVD was introduced
with a 32 kB sector size. Now we have BluRay disks with a 64 kB sector size.
On many systems, cdrecord cannot write a single BluRay sector in a single SCSI
transfer. This is bad.

With today's constraints, I would expect to see typical maxphys values of
~ 2 MB. Linux typically allows this but Solaris does not. In addition, the
ioctl DKIOCINFO in many cases returns wrong (too big) numbers for maxphys,
which causes cdrecord to fail.

Solaris needs to approach today's reality with some parameters.

Jörg
Andrew Gabriel wrote:
> Ian Collins wrote:
>> I don't see the 5 second bursty behaviour described in the bug report.
>> It's more like 5 second interval gaps in the network traffic while the
>> data is written to disk.
>
> That is exactly the issue. When the zfs recv data has been written, zfs recv
> starts reading the network again, but there's only a tiny amount of data
> buffered in the TCP/IP stack, so it has to wait for the network to heave
> more data across. In effect, it's a single buffered copy. The addition of a
> buffer program turns it into a double-buffered (or cyclic buffered) copy,
> with the disks running flat out continuously, and the network streaming
> data across continuously at the disk platter speed.
>
> What are your theoretical max speeds for network and disk i/o? Taking the
> smaller of these two, are you seeing the sustained send/recv performance
> match that (excluding the ~1MB/sec periods, which is some other problem)?

I've just finished a small application to couple zfs_send and zfs_receive
through a socket to remove ssh from the equation, and the speed up is better
than 2x. I have a small (140K) buffer on the sending side to ensure the
minimum number of sent packets.

The times I get for 3.1GB of data (b101 ISO and some smaller files) to a
modest mirror at the receive end are:

1m36s for cp over NFS,
2m48s for zfs send through ssh and
1m14s through a socket.

--
Ian.
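Ian's application isn't shown in the thread, but a similar socket-coupled transfer can be sketched with netcat; the port and dataset names are placeholders, and nc option syntax varies between implementations:

    # on the receiving box: listen on a TCP port and feed the stream to zfs recv
    nc -l 9090 | zfs recv -F tank/data

    # on the sending box: pipe zfs send straight into the socket, bypassing ssh
    zfs send tank/data@backup2 | nc recvhost 9090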
Ian Collins wrote:
> I've just finished a small application to couple zfs_send and zfs_receive
> through a socket to remove ssh from the equation, and the speed up is
> better than 2x. I have a small (140K) buffer on the sending side to ensure
> the minimum number of sent packets.
>
> The times I get for 3.1GB of data (b101 ISO and some smaller files) to a
> modest mirror at the receive end are:
>
> 1m36s for cp over NFS,
> 2m48s for zfs send through ssh and
> 1m14s through a socket.

So the best speed is equivalent to 42MB/s.

Can't tell from this what the limiting factor is (might be the disks).

It would be interesting to try putting a buffer (5 x 42MB = 210MB initial
stab) at the recv side and see if you get any improvement.

--
Andrew
Andrew Gabriel wrote:
> Ian Collins wrote:
>> The times I get for 3.1GB of data (b101 ISO and some smaller files) to a
>> modest mirror at the receive end are:
>>
>> 1m36s for cp over NFS,
>> 2m48s for zfs send through ssh and
>> 1m14s through a socket.
>
> So the best speed is equivalent to 42MB/s.
>
> Can't tell from this what the limiting factor is (might be the disks).

It probably is.

> It would be interesting to try putting a buffer (5 x 42MB = 210MB initial
> stab) at the recv side and see if you get any improvement.

That's my next test....

--
Ian.
Ian Collins wrote:
> Andrew Gabriel wrote:
>> It would be interesting to try putting a buffer (5 x 42MB = 210MB initial
>> stab) at the recv side and see if you get any improvement.

It took a while...

I was able to get about 47MB/s with a 256MB circular input buffer. I think
that's about as fast as it can go; the buffer fills, so receive processing is
the bottleneck. Bonnie++ shows the pool (a mirror) block write speed is
58MB/s.

When I reverse the transfer to the faster box, the rate drops to 35MB/s with
neither the send nor receive buffer filling. So send processing appears to be
the limit in this case.

--
Ian.
Richard Elling wrote:
> Ian Collins wrote:
>> I was able to get about 47MB/s with a 256MB circular input buffer. I think
>> that's about as fast as it can go; the buffer fills, so receive processing
>> is the bottleneck. Bonnie++ shows the pool (a mirror) block write speed is
>> 58MB/s.
>>
>> When I reverse the transfer to the faster box, the rate drops to 35MB/s
>> with neither the send nor receive buffer filling. So send processing
>> appears to be the limit in this case.
>
> Those rates are what I would expect writing to a single disk.
> How is the pool configured?

The "slow" system has a single mirror pool of two SATA drives, the faster one
a stripe of 4 mirrors and an IDE SD boot drive.

ZFS send through ssh from the slow to the fast box takes 189 seconds, the
direct socket connection send takes 82 seconds.

--
Ian.
On Sat, Dec 6, 2008 at 11:40 AM, Ian Collins <ian at ianshome.com> wrote:
> The "slow" system has a single mirror pool of two SATA drives, the faster
> one a stripe of 4 mirrors and an IDE SD boot drive.
>
> ZFS send through ssh from the slow to the fast box takes 189 seconds, the
> direct socket connection send takes 82 seconds.

Reviving an old discussion, but has the core issue been addressed with regard
to zfs send/recv performance? I'm not able to find any new bug reports on
bugs.opensolaris.org related to this, but my search kung-fu may be weak.

Using mbuffer can speed it up dramatically, but this seems like a hack that
doesn't address the real problem with zfs send/recv. Trying to send any
meaningful-sized snapshots from, say, an X4540 takes up to 24 hours, for as
little as a 300GB change rate.

--
Brent Jones
brent at servuhome.net
Hi,

Brent Jones wrote:
> Using mbuffer can speed it up dramatically, but this seems like a hack that
> doesn't address the real problem with zfs send/recv. Trying to send any
> meaningful-sized snapshots from, say, an X4540 takes up to 24 hours, for as
> little as a 300GB change rate.

I have not found a solution yet either. But it seems to depend highly on the
distribution of file sizes, the number of files per directory or whatever. The
last tests I made still showed more than 50 hours for 700 GB and ~45 hours for
5 TB (both tests were null tests where zfs send wrote to /dev/null).

Cheers from a still puzzled

Carsten
Brent Jones wrote:
> Reviving an old discussion, but has the core issue been addressed with
> regard to zfs send/recv performance? I'm not able to find any new bug
> reports on bugs.opensolaris.org related to this, but my search kung-fu may
> be weak.

I raised:

CR 6729347 Poor zfs receive performance across networks

(Seems to still be in the Dispatched state nearly half a year later.)

This relates mainly to full sends, and is most obvious when the disk
throughput is the same order of magnitude as the network throughput. (It
becomes less obvious if one is significantly different from the other, either
way around.)

There appears to be an additional problem for incrementals, which spend long
periods sending almost no data at all (I presume this is when zfs send is
searching for changed blocks to send). I don't know off-hand of a bugid for
this.

> Using mbuffer can speed it up dramatically, but this seems like a hack that
> doesn't address the real problem with zfs send/recv.

I don't think it's a hack, but something along these lines should be more
properly integrated into the zfs receive command or documented.

> Trying to send any meaningful-sized snapshots from, say, an X4540 takes up
> to 24 hours, for as little as a 300GB change rate.

Are those incrementals from a much larger filesystem? If so, that's probably
mainly the other problem.

--
Andrew
On Wed, Jan 7, 2009 at 12:36 AM, Andrew Gabriel <Andrew.Gabriel at sun.com> wrote:
> There appears to be an additional problem for incrementals, which spend
> long periods sending almost no data at all (I presume this is when zfs send
> is searching for changed blocks to send). I don't know off-hand of a bugid
> for this.
>
>> Trying to send any meaningful-sized snapshots from, say, an X4540 takes up
>> to 24 hours, for as little as a 300GB change rate.
>
> Are those incrementals from a much larger filesystem? If so, that's
> probably mainly the other problem.

Yah, the incrementals are from a 30TB volume, with about 1TB used.

Watching iostat on each side during the incremental sends, the sender side is
hardly doing anything, maybe 50 iops of reads, and that could be from other
machines accessing it; really light load.

The receiving side, however, peaks at around 1500 iops of reads, and no
writes. It will do that for 3-5 minutes, then it will calm down and only read
sporadically, and write about 1MB/sec.

Using mbuffer can get the writes to spike to 20-30MB/sec, but the initial
massive reads still remain.

I have yet to devise a script that starts an mbuffer zfs recv on the receiving
side with the proper parameters, then starts an mbuffer zfs send on the
sending side, but I may work on one later this week. I'd like the snapshots to
be sent every 15 minutes, just to keep the amount of change that needs to be
sent as low as possible.

Not sure if it's worth opening a case with Sun since we have a support
contract...

--
Brent Jones
brent at servuhome.net
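A rough sketch of the kind of script Brent describes, run from the sending host; the hostnames, dataset, port, buffer size and snapshot naming are all assumptions, and error handling and snapshot cleanup are omitted:

    #!/bin/sh
    # periodic incremental replication through mbuffer (placeholder names throughout)
    SRC=tank/data
    DST_HOST=recvhost
    PORT=9090

    # most recent existing snapshot of $SRC becomes the incremental source
    PREV=`zfs list -H -t snapshot -o name -s creation | grep "^$SRC@" | tail -1`
    NOW=$SRC@repl-`date +%Y%m%d%H%M`
    zfs snapshot $NOW

    # start the receiver first: mbuffer listens on the port and feeds zfs recv
    ssh $DST_HOST "mbuffer -q -I $PORT -m 250M | zfs recv -F $SRC" &
    sleep 5

    # then stream the incremental through mbuffer to the receiver
    zfs send -i $PREV $NOW | mbuffer -q -m 250M -O $DST_HOST:$PORT
    wait

Run from cron every 15 minutes, this would keep each incremental small, as Brent suggests.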
Hello,

> Yah, the incrementals are from a 30TB volume, with about 1TB used.
>
> Watching iostat on each side during the incremental sends, the sender side
> is hardly doing anything, maybe 50 iops of reads, and that could be from
> other machines accessing it; really light load.
>
> The receiving side, however, peaks at around 1500 iops of reads, and no
> writes.

Have you tried truss on both sides? From my experiments I found that at the
beginning of the transfer the sending side mostly sleeps while the receiving
side lists all available snapshots on the file system being synced. So if you
have a lot of snapshots on the receiving side (as in my case) the process will
take a long time sending no data, just listing the snapshots. The worst case
is if you use a recursive sync of hundreds of file systems with hundreds of
snapshots on each.

I'm sure this must be optimized somehow, otherwise it's almost useless in
practice.

Regards
Mike