David Favro
2005-Jul-06 19:26 UTC
Other possible solutions to: rsync memory usage, paid feature request
Hi, Matthew -- Regarding your message of 05-Jul-2005 concerning rsync memory usage (sorry that I am not replying to it directly; I am not yet subscribed to the list and my mailer doesn't allow me to hard-code an In-Reply-To or References header): while I applaud anyone who wants to encourage open-source development, it seems to me that, if your problem really is that you are running out of memory from transferring too many files at a time, there are much cheaper solutions for your company than paying for the change to rsync described in the FAQ.

1) Free: break your rsync runs into several executions rather than one huge one. Do several sub-directory trees, each separately. If your data files are not organized in such a way that they can easily be divided into a reasonable number of sub-directory trees, consider re-organizing them so that they can be: it will pay off in many other sys-admin benefits as well.

2) Cheap: buy more swap space. These days random-access magnetic storage is running close to 0.50 USD per gig (e.g. here: http://www.buy.com/retail/product.asp?sku=10360313 is 200GB for $105 in the US, including shipping). At the stated rate of 100 bytes per file, this is enough storage to add 2 billion files to each rsync that you run, for a price that is less than many programmers want for a week of coding. If you have much more than 2 billion files in each sub-directory tree, you are probably doing something very wrong. :-)

3) Free: if your problem is not that you are running *out* of memory, but rather that rsync is (temporarily) 'stealing' the core (solid-state) memory from other 'more important' (i.e. quicker-response-time) processes -- causing their data to get swapped out, which might reduce response time when that data later needs to be swapped back in -- you might also consider using the operating system either to lock down the memory used by your important server programs so that it cannot be swapped out, or to give them higher priority (memory priority, not CPU-scheduling priority, though that might be a good idea as well) in such a way that rsync gets swapped out before they do, so that it maintains a small footprint in physical memory. (I am not sure whether this is possible, or how to do it under Linux, but I would be interested to know -- a sort of variant of the nice command but for core usage, or a per-process maximum-in-core parameter.) I would, however, use some caution with either of these, since the general-purpose VM swap-out algorithms used by most modern operating systems usually do a pretty good job of getting everything serviced in a reasonable response time: forcing rsync to thrash the swap cache, as it might do if the lists are traversed as often as the FAQ implies, will not necessarily increase the overall performance of the system. Solution (1) above will also greatly improve this situation.

Otherwise, the final suggestion:

4) Expensive: buy more solid-state memory. Possibly still cheaper than paying for coding, but at any rate, in my experience, more core is rarely the best solution for lack-of-core problems.

Another thought to consider: the method for the proposed "week of coding" solution isn't specified, but it may well involve spooling the lists to temporary files, in which case you'll probably need to buy the storage from solution (2) above anyhow, in addition to paying for the coding, and get what amounts to nearly the same solution as (2) and/or (3).
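Suggestion (1) can be sketched as a small wrapper script. The paths, destination host, and function name below are hypothetical illustrations; the rsync commands are echoed rather than executed, so you can inspect what would run before dropping the `echo`:

```shell
#!/bin/sh
# Sketch of suggestion (1): run one rsync per top-level subdirectory,
# so each invocation builds a file list for only one subtree and the
# per-run memory footprint stays small.
# SRC and DEST below are hypothetical; drop the 'echo' to run for real.
sync_subtrees() {
    src=$1
    dest=$2
    for dir in "$src"/*/; do
        [ -d "$dir" ] || continue
        name=$(basename "$dir")
        # -a: archive mode; -H: preserve hardlinks within this subtree;
        # --delete: remove files on the destination that are gone here
        echo rsync -aH --delete "$dir" "$dest/$name/"
    done
}

sync_subtrees /home backup.example.com:/home
```

Note that `-H` only preserves hardlinks *within* a single run, which is exactly the limitation of splitting the transfer.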
That said, I am always in favor of frugal use of core -- it all depends on what the proposed solution is. If it involves substituting user-space 'swapping' to disk for kernel-space swapping, it's likely not worth the (apparently large) effort, which could be better directed at other improvements, especially considering the likely decrease in the cost of solid-state memory in the future.

Finally, experimenting first with solutions (1) and (2) above may help you determine whether the problem really is what you think it is, before you shell out for a software solution. Also keep in mind that, in my experience, when most programmers estimate "1 week of coding" it often ends up taking 2 or 3, or sometimes 8.

Just my (rambling) thoughts as a fellow programmer and system administrator. Anyhow, I really admire someone who is willing to shell out for improvements to open-source code!

-- 
David Favro
Senior Partner
Meta-Dynamic Solutions
meta-dynamic.com
Paul Slootman
2005-Jul-07 07:45 UTC
Other possible solutions to: rsync memory usage, paid feature request
On Wed 06 Jul 2005, David Favro wrote:

> 1) Free: break your rsync's into several executions rather than one huge
> one. Do several sub-directory trees, each separately. If your data
[...]
> 2) Cheap: buy more swap space. These days random-access magnetic
[...]
> 4) Expensive: buy more solid-state memory. Possibly still cheaper than
[...]

None of these proposals would have helped when I wanted to move two years' worth of Debian archive images to another system using rsync. The Debian archive is currently around 88,000 files (at least what we mirror of it). Every day a snapshot is taken; common files are hardlinked across days. This means an incredible number of directory entries and hundreds of thousands of distinct files.

Doing 1) was not feasible, as that would result in very many hardlinks being lost and files effectively duplicated, leading to wasted space.

Doing 2) was tried (actually: creating swap files on disk), but then we ran into the virtual address limitations of the 32-bit system: 3GB wasn't enough by far. Doing 3) would have the same problem as 2).

Going to a 64-bit system might have helped, but I think the memory usage would have exceeded what's reasonable in solid-state memory, and using swap would have slowed it all down horribly, as the lists in memory are apparently traversed quite regularly. As it is, it took a couple of days before the virtual memory limit was reached...

I ended up rsyncing the days separately, and using a perl program to build a tree of md5sums which were hardlinks to the corresponding files. With each new directory the md5sums could be compared and hardlinks recreated.

However, I would *love* to see rsync be more memory-efficient...

Paul Slootman
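Paul's perl program isn't shown, but the checksum-and-hardlink idea can be sketched in shell. The function and paths here are hypothetical illustrations of the technique, not his actual code; the checksum tree must live on the same filesystem as the snapshots for hardlinks to work:

```shell
#!/bin/sh
# Sketch of the checksum-and-hardlink idea: keep a tree of files named
# by their md5sum; any file in a new snapshot whose checksum is already
# in the tree is replaced by a hardlink to the canonical copy, so
# identical content across days shares one inode.
link_by_md5() {
    tree=$1    # directory of canonical copies, named by checksum
    root=$2    # snapshot directory to deduplicate (same filesystem)
    mkdir -p "$tree"
    find "$root" -type f | while read -r f; do
        sum=$(md5sum "$f" | cut -d' ' -f1)
        if [ -e "$tree/$sum" ]; then
            ln -f "$tree/$sum" "$f"   # duplicate: relink to canonical copy
        else
            ln "$f" "$tree/$sum"      # first occurrence becomes canonical
        fi
    done
}
```

Run once per rsynced day (e.g. `link_by_md5 /backup/.md5 /backup/2005-07-06`), and files unchanged since earlier snapshots collapse back into single inodes.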
Matthew S. Hallacy
2005-Jul-07 18:04 UTC
Other possible solutions to: rsync memory usage, paid feature request
[...]
> 1) Free: break your rsync's into several executions rather than one huge
> one. Do several sub-directory trees, each separately. If your data
> files are not organized in such a way that they can easily be divided
> into a reasonable number of sub-directory trees, consider re-organizing
> them so that they can be: it will pay off in many other sys-admin
> benefits as well.

It's customer data, so we have no control over how it's organized. All we can depend on is the data being in /home, which is what we rsync.

> 2) Cheap: buy more swap space. These days random-access magnetic
> storage is running close to 0.50 USD per gig (e.g. here:
> http://www.buy.com/retail/product.asp?sku=10360313 is 200GB for $105 in
> the US, including shipping). At the stated rate of 100 bytes per file,
> this is enough storage to add 2 billion files to each rsync that you
> run, for a price that is less than many programmers want for a week of
> coding. If you have much more than 2 billion files in each sub-
> directory tree, you are probably doing something very wrong. :-)

The servers already have 2-4GB of ram, with another gig of swap. Yes, these servers have a LOT of small files.

> 3) Free: If your problem is not that you are running *out* of memory but
> rather that rsync is (temporarily) 'stealing' the core (solid-state)
> memory from the other 'more important' (i.e. requiring quicker response
> time) processes (causing their data to get swapped out, which might
> reduce response-time when that data later needs to get swapped back in),
> you might also consider using the operating system to either lock-down
> the memory used by your important server programs so that it cannot be
[snip]

Yes, the problem is that memory is being stolen from the processes that the servers exist for to begin with. Your solution isn't really viable though =)

> 4) Expensive: buy more solid-state memory. Possibly still cheaper than
> paying for coding, but at any rate, in my experience, more core is
> rarely the best solution for lack-of-core problems.

I agree, which is why we're willing to pay someone with rsync coding fu to fix it =)

-- 
Matthew S. Hallacy