I have two machines, speed and mass. speed has a fast Internet
connection and is running a crawler which downloads a lot of files to
disk. mass has a lot of disk space. I want to move the files from
speed to mass after they're done downloading. Ideally, I'd just run:

$ rsync --remove-source-files speed:/var/crawldir .

but I worry that rsync will unlink a source file that hasn't finished
downloading yet. (I looked at the source code and I didn't see
anything protecting against this.)

Any suggestions? Ideas I had were:
- a pause between downloading the file list and downloading the files
- an exclude rule for recently modified files
- a check to not delete a file if its file size has changed since it
  was copied
but I don't see any way to do any of these.
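The "exclude recently modified files" idea can be approximated outside rsync. The sketch below is only an illustration, not something rsync provides: the helper name stable_files and the min_age_seconds threshold are hypothetical. It walks the crawl directory and returns the files whose modification time is older than the threshold. It is a heuristic, since a stalled download would look finished.

```python
import os
import time


def stable_files(root, min_age_seconds, now=None):
    """Return paths (relative to root) of files not modified recently.

    Hypothetical helper: "recently" means within min_age_seconds of now.
    """
    if now is None:
        now = time.time()
    stable = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            try:
                st = os.stat(full)
            except FileNotFoundError:
                continue  # the file vanished between listing and stat
            if now - st.st_mtime >= min_age_seconds:
                stable.append(os.path.relpath(full, root))
    return sorted(stable)
```

The resulting list, written one path per line, could then be handed to rsync through its --files-from option so that only those files are transferred and removed.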
On Sun, 2008-09-07 at 10:59 -0400, Aaron Swartz wrote:
> I have two machines, speed and mass. speed has a fast Internet
> connection and is running a crawler which downloads a lot of files to
> disk. mass has a lot of disk space. I want to move the files from
> speed to mass after they're done downloading. Ideally, I'd just run:
>
> $ rsync --remove-source-files speed:/var/crawldir .
>
> but I worry that rsync will unlink a source file that hasn't finished
> downloading yet. (I looked at the source code and I didn't see
> anything protecting against this.)

Yes, that could happen.

> Ideas I had were:
> - a pause between downloading the file list and downloading the files

This approach would fail for very large files unless the pause is
correspondingly long.

> - an exclude rule for recently modified files
> - a check to not delete a file if its file size has changed since it
> was copied

Either of these would probably work, and they would not be hard to
implement by modifying rsync, but they seem hackish.

IMO, a proper solution is to have the crawler indicate somehow which
files are unfinished so rsync can avoid copying those. E.g., the
crawler could name unfinished files according to a special pattern so
that you could exclude them with --exclude, or it could keep them in a
temporary directory that rsync doesn't visit.

Matt
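A minimal sketch of the "special pattern" approach, assuming the crawler is written in Python and can be changed: each download is written under a temporary name and renamed only once it is complete. The ".part" suffix and the function name save_atomically are assumptions made for illustration, not part of any existing crawler.

```python
import os

PARTIAL_SUFFIX = ".part"  # hypothetical marker for unfinished downloads


def save_atomically(final_path, chunks):
    """Write chunks to final_path + ".part", then rename when complete."""
    partial_path = final_path + PARTIAL_SUFFIX
    with open(partial_path, "wb") as out:
        for chunk in chunks:
            out.write(chunk)
        out.flush()
        os.fsync(out.fileno())  # push the data to disk before renaming
    # The rename is atomic when both names are on the same filesystem,
    # so final_path never refers to a half-written file.
    os.replace(partial_path, final_path)
    return final_path
```

With the crawler saving files this way, adding --exclude='*.part' to the rsync command should keep it from copying or removing files that are still being downloaded.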
Aaron, please CC rsync@lists.samba.org so that others can help you and
your message is archived for others' future benefit.

On Sun, 2008-09-07 at 16:16 -0400, Aaron Swartz wrote:
> > IMO, a proper solution is to have the crawler indicate somehow which
> > files are unfinished so rsync can avoid copying those. E.g., the
> > crawler could name unfinished files according to a special pattern so
> > that you could exclude them with --exclude, or it could keep them in a
> > temporary directory that rsync doesn't visit.
>
> I agree, but this is not how most crawlers are written. (Imagine, e.g. wget.)

Then I would modify the crawler.

Matt