Well, I solved this problem myself, it seems. It was not an rsync
problem per se, but it's interesting anyway on big filesystems like
this, so I'll outline what went down:
Because my rsyncs were mostly just statting millions of files very
quickly, RAM filled up with inode cache. At a certain point, the kernel
stopped allowing new cache entries to be added to the slab memory it had
been using, and was slow to reclaim memory from old, clean inode cache
entries. The net effect was that I/O on the whole machine slowed to a crawl.
Slab memory can be checked by looking at the /proc/meminfo file. If you
see that slab memory is using up a fair portion of your total memory,
run the 'slabtop' program to see the top offenders. In my case, it was
the filesystem that was screwing me (by way of the kernel).
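If you want to check this yourself, something like this will show how
much memory slab is using and which caches are the biggest offenders
(the slabtop options may vary slightly between procps versions):
# grep Slab /proc/meminfo
# slabtop -o -s c | head -20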
I was able to speed up the reclaiming of clean, unused inode cache
entries by tweaking this in the kernel:
# sysctl -w vm.vfs_cache_pressure=10000
The default value for that is 100; higher values make the kernel release
dentry and inode memory more aggressively. It helped, but my rsyncs were
still filling the cache faster than the kernel could reclaim it, so it
didn't help that much. What really fixed it was this:
# echo 3 > /proc/sys/vm/drop_caches
That immediately drops the page cache plus all clean dentry and inode
entries from slab memory. When I did that, memory usage dropped from
35GB to 500MB, my rsyncs fired themselves up again magically, and the
computer was responsive again.
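As I understand it, the three values that file accepts are:
# echo 1 > /proc/sys/vm/drop_caches    # page cache only
# echo 2 > /proc/sys/vm/drop_caches    # dentries and inodes only
# echo 3 > /proc/sys/vm/drop_caches    # both of the above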
Slab memory began to fill up again of course, as the rsyncs were still
going. But slowly. For this edge case, I'm just going to configure a
cron job to flush the dentry/inode cache every five minutes or so. But
things look much better now!
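Something along these lines in root's crontab should do the trick (an
untested sketch; the sync beforehand writes out dirty data so more of
the cache entries are clean and can actually be dropped):
*/5 * * * * /bin/sync && echo 3 > /proc/sys/vm/drop_caches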
A word of warning for folks rsyncing HUGE numbers of files under linux. ;)
As a side note, Solaris does not seem to have this problem, presumably
because the kernel handles inode/dentry caching in a different way.
-erich
Erich Weiler wrote:
> Hi Y'all,
>
> I'm seeing some interesting behavior that I was hoping someone could
> shed some light on. Basically I'm trying to rsync a lot of files, in a
> series of about 60 rsyncs, from one server to another. There are about
> 160 million files. I'm running 3 rsyncs concurrently to increase the
> speed, and as each one finishes, another starts, until all 60 are done.
>
> The machine I'm initiating the rsyncs on has 48GB RAM. This is CentOS
> Linux 5.4, kernel revision 2.6.18-164.15.1.el5. Rsync version 3.0.5 (on
> both sides).
>
> I was able to rsync all the data over to the new machine. But, because
> there was so much data, I need to run the rsyncs again to catch data
> that changed during the last rsync run. It sort of hangs midway through.
>
> What happens is that as the rsyncs run, the memory usage on the machine
> slowly creeps up, using quite a bit of RAM, which is odd because I
> thought the rsyncs were counting files incrementally, to reduce RAM
> impact. But, looking at top, the rsync processes aren't using much RAM
> at all:
>
> top - 12:22:10 up 1 day, 27 min, 1 user, load average: 46.85, 46.37, 44.97
> Tasks: 309 total, 8 running, 301 sleeping, 0 stopped, 0 zombie
> Cpu(s): 1.0%us, 13.8%sy, 0.0%ni, 84.9%id, 0.0%wa, 0.0%hi, 0.3%si, 0.0%st
> Mem: 49435196k total, 34842524k used, 14592672k free, 141748k buffers
> Swap: 10241428k total, 0k used, 10241428k free, 49428k cached
>
>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM     TIME+ COMMAND
>  7351 root      25   0 19892 9.8m  844 R 100.1  0.0 552:58.55 rsync
>  9084 root      16   0 13108 2904  820 R 100.1  0.0 299:24.59 rsync
>  4759 root       0 -20 1447m  94m  15m S  29.9  0.2 667:34.21 mmfsd
>  9539 root      16   0 30136  19m  820 R   6.3  0.0   6:29.28 rsync
>  9540 root      15   0  271m  46m  260 S   0.3  0.1   0:12.13 rsync
> 10047 root      15   0 10992 1212  768 R   0.3  0.0   0:00.01 top
>     1 root      15   0 10348  700  592 S   0.0  0.0   0:02.15 init
> ...etc...
>
> But nevertheless, 34GB of RAM is in use. What really kills things is
> that at some point each rsync suddenly ramps up to 100% CPU usage, and
> all activity for that rsync essentially stops. In the above example, 2
> of the 3 rsyncs are in that 100% CPU state, while the third rsync is
> only at 6.3%, but that is the one actually doing something. In some
> cases all 3 rsyncs get to 100% and they all stall: there is no network
> traffic on the NIC at all and they make no progress.
>
> Now mostly what they are doing is counting files, since most of the
> files are the same on both sides, but there are just so many files (160
> million). I don't seem to be out of memory, but I don't know why rsync
> would go to 100% CPU and just stall.
>
> I am rsyncing from an rsync server to my local server, with commands
> similar to this:
>
> rsync -a --delete rsync://encodek-0-4/data/genomes/ /hive/data/genomes/
>
> Again, both sides are at version 3.0.5. Nothing fancy or special. I have
> confirmed that it does count the files incrementally by running a few
> manually; they do report "getting incremental file list...".
>
> Any ideas why the processes go to 100% CPU and then stall? I should
> also note that the initial run of rsyncs, where it was actually copying
> a ton of data, did not seem to have this problem; it only showed up now
> that the data is there and I'm rsyncing again. Is it
> somehow related to the fact that it is mostly comparing a ton of files
> very quickly but not actually copying many of them?
>
> Thanks for any ideas!
>
> -erich
>