Andrew Martin
2013-Sep-06 19:55 UTC
rsync 3.0.9 hangs when syncing from NFSv3 share - possible to retry after timeout?
Hello,

I'm using rsync 3.0.9 to back up several NFS shares from a fileserver, mounted over NFSv3, to a local RAID on a backup server. Both servers are running Ubuntu 12.04 Server LTS. The fileserver's filesystem is ext4. The NFS shares are mounted on the backup server as follows:

    fileserver:/mnt/storage/share1 /mnt/share1 type nfs (ro,tcp,bg,soft,intr,addr=192.168.1.1)
    fileserver:/mnt/storage/share2 /mnt/share2 type nfs (ro,tcp,bg,soft,intr,addr=192.168.1.1)
    fileserver:/mnt/storage/share3 /mnt/share3 type nfs (ro,tcp,bg,soft,intr,addr=192.168.1.1)

These shares contain a large number of files, including SVN checkouts, extracted kernel trees, etc. I've run into a problem where rsync appears to hang or block indefinitely when backing up one particular share, share3, although occasionally it happens with one of the other shares instead. A cron job starts backing up share3 nightly at 20:15. When the problem does not occur, the backup typically finishes around 20:45. When it does occur, rsync blocks indefinitely. I run rsync under the "timeout" command so that it is killed if it has not finished by 9:00 the next day:

    timeout -k 30s 764m rsync -av --modify-window=2 --link-dest=/mnt/backups/share3/2013-09-04 --exclude .svn/ /mnt/share3/ /mnt/backups/share3/2013-09-05

The exit code is 137, which I believe is 128 plus 9, the number of the KILL signal sent by timeout.

Here are the processes involved. As you can see, 1915 is in uninterruptible sleep (state D), but I believe that is normal:

    root 1914 0.0 0.0 10148 492 ? S Sep05 0:00 timeout -k 30s 764m rsync -av --modify-window=2 --link-dest=/mnt/backups/share3/2013-09-04 --exclude .svn/ /mnt/share3/ /mnt/backups/share3/2013-09-05
    root 1915 0.0 0.3 81240 27784 ? D Sep05 0:20 rsync -av --modify-window=2 --link-dest=/mnt/backups/share3/2013-09-04 --exclude .svn/ /mnt/share3/ /mnt/backups/share3/2013-09-05
    root 1916 0.0 0.2 120028 19032 ? S Sep05 0:22 rsync -av --modify-window=2 --link-dest=/mnt/backups/share3/2013-09-04 --exclude .svn/ /mnt/share3/ /mnt/backups/share3/2013-09-05
    root 1917 0.0 0.3 138272 26612 ? S Sep05 0:07 rsync -av --modify-window=2 --link-dest=/mnt/backups/share3/2013-09-04 --exclude .svn/ /mnt/share3/ /mnt/backups/share3/2013-09-05

Running strace on the processes shows that they are not actively doing anything:

    # strace -p 1914
    Process 1914 attached - interrupt to quit
    wait4(1915,

    # strace -p 1915
    Process 1915 attached - interrupt to quit

    # strace -p 1916
    Process 1916 attached - interrupt to quit
    select(4, [3], [], NULL, {10, 731653}^C <unfinished ...>
    Process 1916 detached

    # strace -p 1917
    Process 1917 attached - interrupt to quit
    select(1, [0], [], NULL, {27, 691627}^C <unfinished ...>
    Process 1917 detached

Based on the output in my rsync log file, I can see the last directory that it copied a file from. I ran "time find /path/to/that/dir -type f" on that directory and on some other directories on share3, and all of them returned quickly; I was not able to make "find" block. The rsync cron jobs that run for share1 and share2 typically complete successfully, and those shares are mounted over NFS with the same mount options from the same fileserver.

I do not see anything obviously related in dmesg on either the backup server or the fileserver. Does anyone have an idea what is causing rsync to hang, or whether there is a way to have it retry or skip a file if there is a problem rather than blocking forever? The --timeout option seems like it will abort the entire sync, but I would like to just skip over the bad section and continue with the rest of the backup. Is this possible?

Thanks,

Andrew
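On the exit status mentioned above: GNU timeout reports 124 when the command stops after the initial TERM signal, and 128 + 9 = 137 when it has to fall back to KILL, which matches the 137 seen here. The following is a minimal sketch of a wrapper that logs the difference, assuming GNU coreutils timeout. The function names describe_status and run_limited are hypothetical and are not part of rsync or coreutils. It does not make rsync skip a bad file; it only makes the outcome of each run easier to read in a log.

```shell
#!/bin/sh
# Hypothetical helper: translate the exit status of a
# "timeout -k ... command" run into a human-readable message.
describe_status() {
    status=$1
    if [ "$status" -eq 0 ]; then
        echo "completed"
    elif [ "$status" -eq 124 ]; then
        # GNU timeout: the time limit expired and TERM was enough
        echo "timed out after SIGTERM"
    elif [ "$status" -gt 128 ]; then
        # 128 + N means signal number N ended the run (137 -> 9, KILL)
        echo "killed by signal $((status - 128))"
    else
        echo "failed with exit status $status"
    fi
}

# Hypothetical wrapper: run one command under a time limit, print the
# outcome, and pass the original exit status back to the caller.
run_limited() {
    limit=$1
    shift
    timeout -k 30s "$limit" "$@"
    status=$?
    echo "$1: $(describe_status "$status")"
    return "$status"
}

# Example (not executed here):
# run_limited 764m rsync -av --modify-window=2 \
#     --link-dest=/mnt/backups/share3/2013-09-04 --exclude .svn/ \
#     /mnt/share3/ /mnt/backups/share3/2013-09-05
```

Calling describe_status 137 prints "killed by signal 9". Note that a process in uninterruptible sleep does not act on any signal, including KILL, until the blocked call returns, so the wrapper cannot shorten a hang inside the NFS client itself.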
Kevin Korb
2013-Sep-06 20:53 UTC
rsync 3.0.9 hangs when syncing from NFSv3 share - possible to retry after timeout?
Is there a special reason why you don't use rsync or rsync over ssh as the communication method instead of NFS? In this configuration you are stuck with --whole-file, not to mention the expense of doing a ton of stat() calls over NFS.

Also, you can use lsof to see exactly which file or directory rsync has open.
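The lsof suggestion can be sketched as follows. This is a minimal example, assuming a Linux system; the function name show_open_files is hypothetical, and the fallback to /proc/PID/fd is an addition for machines where lsof is not installed, not something proposed in the thread.

```shell
#!/bin/sh
# Hypothetical helper: list the files a process currently has open.
# Uses lsof when it is installed; otherwise falls back to the
# Linux-specific /proc/PID/fd directory.
show_open_files() {
    pid=$1
    if command -v lsof >/dev/null 2>&1; then
        lsof -p "$pid"
    else
        ls -l "/proc/$pid/fd"
    fi
}

# Example (not executed here): inspect the rsync process that is
# stuck in state D in the listing from the original message.
# show_open_files 1915
```

The paths listed for the blocked process show which file or directory on the NFS mount it was working on when it stopped making progress.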