samba-bugs@samba.org
2007-Dec-06 09:39 UTC
DO NOT REPLY [Bug 5124] New: Lessons to learn from other tools, better use of resources, speed gains
https://bugzilla.samba.org/show_bug.cgi?id=5124
Summary: Lessons to learn from other tools, better use of resources, speed gains
Product: rsync
Version: 3.0.0
Platform: All
OS/Version: All
Status: NEW
Severity: enhancement
Priority: P3
Component: core
AssignedTo: wayned@samba.org
ReportedBy: forge@dr.ea.ms
QAContact: rsync-qa@samba.org

I would love to see rsync grow up to work better. One huge lesson that could be learned would be in taking ideas from other tools. One such tool would be lftp. I would totally go bananas if rsync could scan dirs while syncing in parallel. The gains from doing this are phenomenal. Since lftp already has a very good working model on how to accomplish this feat, I suggest taking a look at it. If you have never experienced how much better lftp works over wget, give it a try, and you will see exactly what I am talking about.

Granted, the way rsync does its job is vastly different, and for good reasons. However, with today's more modern systems, which have dual cores, hyper-threading, and plain amazing speeds, I see no reason not to at least offer the option to do such a very useful thing. Having limits on the server side, and limits set from the client side, is a good thing too, which is something ftpds and httpds have.

Thanks to all who have helped to develop this killer tool! Let's make it get faster now!

--
Configure bugmail: https://bugzilla.samba.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the QA contact for the bug, or are watching the QA contact.
samba-bugs@samba.org
2007-Dec-06 13:02 UTC
DO NOT REPLY [Bug 5124] Lessons to learn from other tools, better use of resources, speed gains
https://bugzilla.samba.org/show_bug.cgi?id=5124 ------- Comment #1 from matt@mattmccutchen.net 2007-12-06 07:02 CST ------- How does the feature you want differ from the incremental recursion mode that is being added to rsync 3.0.0?
samba-bugs@samba.org
2007-Dec-07 10:40 UTC
DO NOT REPLY [Bug 5124] Lessons to learn from other tools, better use of resources, speed gains
https://bugzilla.samba.org/show_bug.cgi?id=5124 ------- Comment #2 from forge@dr.ea.ms 2007-12-07 04:40 CST ------- It does so in parallel, via fork()... I have not looked to see if rsync is doing the same thing, but the point is that it opens multiple sockets to do its job.
samba-bugs at samba.org
2009-Oct-28 14:34 UTC
DO NOT REPLY [Bug 5124] Parallelize the rsync run using multiple threads and/or connections
https://bugzilla.samba.org/show_bug.cgi?id=5124

matt at mattmccutchen.net changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|Lessons to learn from other |Parallelize the rsync run
                   |tools, better use of        |using multiple threads
                   |resources, speed gains      |and/or connections

------- Comment #3 from matt at mattmccutchen.net 2009-10-28 09:34 CST -------
A stab at a more meaningful summary, and some thoughts:

My first reaction to the suggestion to use multiple connections is that it's a gimmick to get a higher total bandwidth allocation from routers that allocate bandwidth per connection; IMO, that would not be an appropriate goal. But there's another more fundamental benefit, even if the total bandwidth were to remain the same: loss of a single packet won't stall the rsync run because the other connections can continue (at least for a while) without that packet.

But why stop at several streams? Rsync could use datagrams (UDP) and just act on packets as they arrive, so that loss of a packet doesn't affect /any/ of the other packets. The only drawback is that we have really good tooling for working with streams (pipes, nc, port forwarding, TLS, etc.), while the tooling for datagrams is nonexistent or less mature (there is Datagram TLS, but I've never tried it).

Rather than implement the UDP stuff ad-hoc for rsync, I would like to see it adopt an application-level scheduler that maintains a list of active tasks (scanning a directory, transferring a file, etc.) and handles the rudiments of accepting a packet and calling the appropriate routine to take the next step on that task. If the scheduler would support asynchronous I/O, rsync could use that to dramatically cut time blocked on I/O by letting the OS decide the order in which to fulfill requests based on the actual layout of the files on disk.

Once rsync exposes a set of available tasks to the scheduler, it becomes trivial to vary the number of OS threads in which the tasks run. This would be awesome but is probably better pursued in a successor to rsync.
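The scheduler idea in comment #3 can be illustrated with a small Python sketch. This is only an illustration under stated assumptions, not anything rsync implements: the task names (`scan_directory`, `transfer_file`), the file names, and the delays are hypothetical, and `asyncio.sleep` stands in for time spent waiting on disk or network. The point it shows is that an event loop holding a list of active tasks finishes each one as its I/O completes, rather than in the order the tasks were submitted.

```python
import asyncio


async def scan_directory(name, delay, log):
    # Hypothetical task: the sleep stands in for blocking directory I/O.
    await asyncio.sleep(delay)
    log.append(("scanned", name))


async def transfer_file(name, delay, log):
    # Hypothetical task: the sleep stands in for sending file data.
    await asyncio.sleep(delay)
    log.append(("sent", name))


async def run_tasks():
    log = []
    # The event loop plays the role of the application-level scheduler:
    # all three tasks are active at once, and each resumes when its
    # (simulated) I/O is ready.
    await asyncio.gather(
        transfer_file("big.iso", 0.05, log),
        scan_directory("src/", 0.01, log),
        transfer_file("small.txt", 0.02, log),
    )
    return log


def main():
    return asyncio.run(run_tasks())


if __name__ == "__main__":
    for event in main():
        print(event)
```

The large transfer is submitted first but does not hold up the directory scan or the small file, which is the behaviour the comment asks for.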
samba-bugs at samba.org
2014-Jan-14 17:40 UTC
[Bug 5124] Parallelize the rsync run using multiple threads and/or connections
https://bugzilla.samba.org/show_bug.cgi?id=5124 --- Comment #4 from Haravikk <me at haravikk.com> 2014-01-14 17:40:22 UTC ---
I see this is quite old, and to be honest I'm not completely familiar with rsync's implementation, but more rsync performance is of benefit to everyone so I thought I'd chip in my thoughts.

While UDP would be a good option, it's a fairly complex one to implement as you'd essentially be reinventing the wheel when it came to re-requesting packets etc., though I believe there may be new libraries out there that could help with this; many BitTorrent clients for example now use µTP (micro Transport Protocol) which is basically just UDP with some failure tolerance, though this would still require some form of SSL support for widespread adoption.

Personally I don't think the number of TCP connections is the problem though, as a single connection should be capable of utilising all available bandwidth. That said, one of the problems with TCP is the self-adjusting window size, so to get the most out of a connection you really need to utilise it at a constant rate, otherwise the window size will go down. This means any long pauses waiting for the next chunk of the file-list can result in performance dropping until the next file starts being sent.

An alternative fix for this problem is to do something similar to Google's SPDY protocol for HTTP, which is to multiplex several streams over a single TCP connection. Basically, rsync would add its own information to packets, allowing them to be quickly routed to/from multiple threads at each end, while sending all packets over a single connection. This means you can have file-list packets mixed in with multiple packets from various different files being transferred etc.; TCP will continue to ensure they arrive in the correct order etc., and all rsync has to do is set up an appropriate number of threads for generating chunks of the file-list, performing delta comparisons, and transferring files.

Basically you end up with one thread acting as a message dispatch service for this single connection, taking all messages received and sending them to appropriate worker threads, and packing outgoing messages ready to send down the TCP connection; the worker threads then perform file/folder comparison for different parts of the sync operation. Not that this latter option isn't still complex and a lot of work, but IMO it's the best way to do things (if rsync isn't already), and it allows rsync to run multiple file/folder comparisons simultaneously depending upon the hardware at each end and the current speed of the sync operation.
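The multiplexing described in comment #4 can be sketched as follows. The framing here is a hypothetical one invented for illustration (a 2-byte channel id and a 4-byte payload length in front of each payload); it is not rsync's wire format, and the channel numbers and payloads are made up. The sketch shows how a dispatcher could split one ordered byte stream back into per-channel message lists that worker threads would then consume.

```python
import struct
from collections import defaultdict

# Hypothetical frame header: channel id (unsigned 16-bit) followed by
# payload length (unsigned 32-bit), both in network byte order.
HEADER = struct.Struct("!HI")


def pack_frame(channel, payload):
    """Prefix a payload with its channel id and length."""
    return HEADER.pack(channel, len(payload)) + payload


def demux(stream):
    """Split a concatenated byte stream into per-channel payload lists."""
    channels = defaultdict(list)
    offset = 0
    while offset < len(stream):
        channel, length = HEADER.unpack_from(stream, offset)
        offset += HEADER.size
        channels[channel].append(stream[offset:offset + length])
        offset += length
    return dict(channels)


if __name__ == "__main__":
    # Channel 1 carries file-list chunks, channel 2 carries file data
    # (an assumed assignment, for illustration only).
    wire = (
        pack_frame(1, b"list-chunk-1")
        + pack_frame(2, b"file-a-part-1")
        + pack_frame(1, b"list-chunk-2")
    )
    print(demux(wire))
```

Because TCP preserves ordering, frames for one channel arrive in the order they were sent even when frames for other channels are interleaved between them.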
samba-bugs at samba.org
2014-Jan-19 03:10 UTC
[Bug 5124] Parallelize the rsync run using multiple threads and/or connections
https://bugzilla.samba.org/show_bug.cgi?id=5124 --- Comment #5 from Andrew J. Kroll <forge at dr.ea.ms> 2014-01-19 03:10:35 UTC ---
Actually having two or three TCP streams at the same time has proven to be faster, because it can scan ahead. It is proven that if you download one large file while downloading several smaller ones, the entire transfer is faster because the handshake turn-around is hidden. It has nothing to do with getting around per-connection bandwidth limiters, although in some cases it can help with that too.

Another proven case is your typical modern web browser. There is a very good reason why multiple connections are used to load in those pretty pictures you see. It is all about getting around the latency by using TCP as a double buffer.

What is needed is the ability to be scanning on one side while transferring a file, and if we have a match as we go with the other process, start sending it on a second stream. Again, look at how lftp does it. The concept simply works fantastically. You get multiple dir scans in parallel, and data when it is to be updated, while still scanning.

UDP? Interesting idea, but not needed. Just do > 1 scan and send process.
samba-bugs at samba.org
2018-Oct-11 17:51 UTC
[Bug 5124] Parallelize the rsync run using multiple threads and/or connections
https://bugzilla.samba.org/show_bug.cgi?id=5124 --- Comment #7 from Luiz Angelo Daros de Luca <luizluca at gmail.com> --- I also vote for this feature. Using multiple connections, rsync can use multiple internet connections at the same time.
samba-bugs at samba.org
2019-Feb-07 02:24 UTC
[Bug 5124] Parallelize the rsync run using multiple threads and/or connections
https://bugzilla.samba.org/show_bug.cgi?id=5124 --- Comment #8 from Michael <michael.williams at infatech.co.nz> --- +1 from me on this. We have several situations where we need to copy a large number of very small files, and I expect that having multiple file transfer threads, allowing say ~5 transfers concurrently, would speed up the process considerably. I expect that this would also make better use of the available network bandwidth as each transfer appears to have an overhead for starting and completing the transfer which makes the effective transfer rate far less than the available network bandwidth. This is the method one of our pieces of backup software uses to speed up backups and is also implemented in FileZilla for file transfers. Consider a very large file that needs to be transferred, along with a number of small files. In a single transfer mode, all other files would need to wait while the large file is transferred. If there are multiple transfers happening concurrently, the smaller files will continue transferring while the large file transfers. I have seen the benefits of this sort of implementation in other software. I can also see benefits in having file transfers begin whilst rsync is comparing files. This could logically work if you consider rsync makes a 'list' of files to be transferred and that it begins transferring files as soon as this list begins to be populated. In situations where there are a large number of files and few of these files changed, the sync could effectively be completed by the time rsync is finished comparing files (given the few changed files may have already been transferred during the file comparison). This also is effectively implemented in FileZilla (consider copying a directory in which FileZilla has to recurse into each directory and add each file to copy into the queue). Interestingly, I assumed this was already an option for rsync, so I went looking to find the necessary option. 
However, all I found were the previously mentioned hacks, which weren't what I was going for.
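The two ideas in comment #8 (several transfers at once, and transfers starting while the file list is still being built) amount to a producer/consumer queue. Below is a minimal Python sketch under stated assumptions: `sync_tree`, `scan`, and `worker` are hypothetical names, the default of 5 workers just mirrors the "~5 transfers" in the comment, and reading the file size stands in for an actual transfer. It does not reflect how rsync works internally.

```python
import os
import queue
import threading


def scan(root, work):
    """Producer: walk the tree and queue each file as soon as it is found."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            work.put(os.path.join(dirpath, name))


def worker(work, done, lock):
    """Consumer: take paths off the queue until a None sentinel arrives."""
    while True:
        path = work.get()
        if path is None:
            break
        size = os.path.getsize(path)  # stand-in for transferring the file
        with lock:
            done.append((path, size))


def sync_tree(root, n_workers=5):
    work = queue.Queue()
    done = []
    lock = threading.Lock()
    threads = [
        threading.Thread(target=worker, args=(work, done, lock))
        for _ in range(n_workers)
    ]
    # Workers start before the scan, so "transfers" begin while the
    # list of files is still being populated.
    for t in threads:
        t.start()
    scan(root, work)
    for _ in threads:
        work.put(None)  # one sentinel per worker
    for t in threads:
        t.join()
    return done


if __name__ == "__main__":
    for path, size in sync_tree("."):
        print(size, path)
```

Whether this helps in practice depends on where the bottleneck is; if the disks are already saturated, more workers only add seeking.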
Marc Roos
2019-Feb-07 10:47 UTC
[Bug 5124] Parallelize the rsync run using multiple threads and/or connections
I made a bash script doing this in parallel; it checks how many rsyncs are running and then starts another 'concurrent one'. My parallel sessions are against different servers. I doubt if it would make any sense doing multiple sessions between the same two hosts. My single rsync session was already limited by the hosts' IOPS, so two threads would run at half speed.

IMO rsync does what it needs to do; if you want it to run in parallel, execute it in parallel.

>
>--- Comment #8 from Michael <michael.williams at infatech.co.nz> ---
>+1 from me on this.
>
>We have several situations where we need to copy a large number of very small
>files, and I expect that having multiple file transfer threads, allowing say ~5
>transfers concurrently, would speed up the process considerably. I expect that
>this would also make better use of the available network bandwidth as each
>transfer appears to have an overhead for starting and completing the transfer

So test it with two or 3 concurrent sessions.

>which makes the effective transfer rate far less than the available network
>bandwidth. This is the method one of our pieces of backup software uses to
>speed up backups and is also implemented in FileZilla for file transfers.
>Consider a very large file that needs to be transferred, along with a number of
>small files. In a single transfer mode, all other files would need to wait
>while the large file is transferred. If there are multiple transfers happening
>concurrently, the smaller files will continue transferring while the large file
>transfers. I have seen the benefits of this sort of implementation in other
>software.
>
>I can also see benefits in having file transfers begin whilst rsync is
>comparing files. This could logically work if you consider rsync makes a 'list'
>of files to be transferred and that it begins transferring files as soon as
>this list begins to be populated. In situations where there are a large number
>of files and few of these files changed, the sync could effectively be
>completed by the time rsync is finished comparing files (given the few changed
>files may have already been transferred during the file comparison). This also
>is effectively implemented in FileZilla (consider copying a directory in which
>FileZilla has to recurse into each directory and add each file to copy into the
>queue).
>
>Interestingly, I assumed this was already an option for rsync, so I went
>looking to find the necessary option. However, all I found were the previously
>mentioned hacks, which weren't what I was going for.
samba-bugs at samba.org
2019-Feb-07 10:58 UTC
[Bug 5124] Parallelize the rsync run using multiple threads and/or connections
https://bugzilla.samba.org/show_bug.cgi?id=5124 --- Comment #9 from Paul Slootman <paul at debian.org> --- The issue when copying a large number of small files is disk IO / seeking. Check the wait-for-IO values using top / whatever when doing such a transfer. Running multiple threads in such a situation will only cause the disk to thrash even more. Multiple threads make sense on high-latency links.
samba-bugs at samba.org
2019-Feb-07 16:26 UTC
[Bug 5124] Parallelize the rsync run using multiple threads and/or connections
https://bugzilla.samba.org/show_bug.cgi?id=5124 --- Comment #10 from Scott Peterson <scott.d.peterson at intel.com> --- (In reply to Paul Slootman from comment #9) Multiple connections also make sense on high-bandwidth links. I've never been able to rsync at wire speed on a 40G link using only one connection.
samba-bugs at samba.org
2019-Feb-07 16:50 UTC
[Bug 5124] Parallelize the rsync run using multiple threads and/or connections
https://bugzilla.samba.org/show_bug.cgi?id=5124 --- Comment #11 from Scott Peterson <scott.d.peterson at intel.com> --- (In reply to Haravikk from comment #4) SPDY has apparently evolved into QUIC. QUIC supports multiple streams, which can be created by either end. There can be a huge number of these. It seems like a sender of files could create a stream per file it wanted to send, then send to that stream as async reads completed. The reads that complete first are sent first. Complete files on fast storage might be sent as one on slow storage streamed out at a lower rate. This should also allow the receiver to consume the incoming streams at different rates, as they might do if their destination media had different write performance.
samba-bugs at samba.org
2023-Apr-17 10:40 UTC
[Bug 5124] Parallelize the rsync run using multiple threads and/or connections
https://bugzilla.samba.org/show_bug.cgi?id=5124 --- Comment #12 from Paulo Marques <paulo.marques at bitfile.pt> --- Using multiple connections also helps when you have LACP network links, which are relatively common in data center setups to have both redundancy and increased bandwidth. If you have two 1Gbps links aggregated, you can only use 1Gbps using rsync, but you could use 2Gbps if rsync made several connections from different TCP ports.
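The reasoning in comment #12 can be made concrete with a small Python sketch. It assumes the aggregated links choose an outgoing link per flow by hashing addresses and ports; the formula below is only loosely modeled on a layer3+4 style hash policy and is an assumption, as real switches and bonding drivers differ. `pick_link`, the addresses, and the source ports are hypothetical; 873 is the standard rsync daemon port.

```python
import ipaddress


def pick_link(src_ip, src_port, dst_ip, dst_port, n_links):
    """Map a TCP flow to one of n_links aggregated links (assumed policy)."""
    ip_bits = (
        int(ipaddress.ip_address(src_ip)) ^ int(ipaddress.ip_address(dst_ip))
    ) & 0xFFFF
    return ((src_port ^ dst_port) ^ ip_bits) % n_links


if __name__ == "__main__":
    # One connection: every packet of the flow hashes to the same link,
    # so a single rsync session is capped at one link's bandwidth.
    print(pick_link("10.0.0.1", 50000, "10.0.0.2", 873, 2))

    # Several connections from different source ports can land on
    # different links and so use the aggregate bandwidth.
    for port in (50000, 50001, 50002, 50003):
        print(port, pick_link("10.0.0.1", port, "10.0.0.2", 873, 2))
```

Under this assumed policy, spreading is not guaranteed for any particular set of ports; it only becomes likely as the number of connections grows.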