Hi folks, I've been googling around for a while but I can't seem to find an answer to my question.

I have a number of filesystems that contain thousands of hard links due to some bad organization of data. Rsync, cpio and various other utilities fail to copy this data because I think there might be some cycles in it. (You know you have trouble if cpio can't copy it!)

What I thought I would do instead is to copy the data but skip any files that are hard links. Then after the copy is finished, I will use some kind of "find . -type l" command that finds the hard links, and then make a script to recreate them. This saves me a lot of trouble by not having to stat the files and not having the receive side balloon up.

Is there a way to have it skip hard links when doing an rsync? Or is there some other mystic incantation I can use that might accomplish the same thing?

Thanks,
sri
On Wed, Mar 07, 2007 at 09:22:08PM -0800, Sriram Ramkrishna wrote:
> Hi folks, I've been googling around for a while but I can't seem to
> find an answer to my question.
>
> I have a number of filesystems that contain thousands of hard links
> due to some bad organization of data. Rsync, cpio and various other
> utilities fail to copy this data because I think there might be some
> cycles in it. (You know you have trouble if cpio can't copy it!)
>
> What I thought I would do instead is to copy the data but skip any
> files that are hard links. Then after the copy is finished, I will use
> some kind of "find . -type l" command that finds the hard links, and
> then make a script to recreate them. This saves me a lot of trouble by
> not having to stat the files and not having the receive side balloon
> up.
>
> Is there a way to have it skip hard links when doing an rsync?
> Or is there some other mystic incantation I can use that might
> accomplish the same thing?

Surely a hard link is just 'a file'; that's what a file is. Thus it's impossible to skip them without skipping everything (except symbolic links, FIFOs, etc.). The only clue to something having more than one link to it is the 'number of links', but then how do you decide which link is the 'right' one to copy, as they're all the file?

--
Chris Green
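To make that concrete, a quick experiment (the scratch names 'a' and 'b' are made up for illustration; any POSIX shell behaves the same):

  echo data > a
  ln a b       # second hard link to the same inode
  ls -li a b   # both lines show the same inode number and a link
               # count of 2; nothing marks 'a' or 'b' as the original
  rm a         # the data survives; 'b' still refers to the inode

The file is the inode; each name is just a directory entry pointing at it, which is why there is no 'right' link to prefer.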
On Wed 07 Mar 2007, Sriram Ramkrishna wrote:
> that are hard links. Then after the copy is finished, I will use some
> kind of "find . -type l" command that finds the hard links, and then

find -type l will find symbolic links, *not* hard links.


Paul Slootman
On Wed, Mar 07, 2007 at 09:22:08PM -0800, Sriram Ramkrishna wrote:
| Hi folks, I've been googling around for a while but I can't seem to
| find an answer to my question.
|
| I have a number of filesystems that contain thousands of hard links
| due to some bad organization of data. Rsync, cpio and various other
| utilities fail to copy this data because I think there might be some
| cycles in it. (You know you have trouble if cpio can't copy it!)
|
| What I thought I would do instead is to copy the data but skip any
| files that are hard links. Then after the copy is finished, I will use
| some kind of "find . -type l" command that finds the hard links, and
| then make a script to recreate them. This saves me a lot of trouble by
| not having to stat the files and not having the receive side balloon
| up.
|
| Is there a way to have it skip hard links when doing an rsync?
| Or is there some other mystic incantation I can use that might
| accomplish the same thing?

The following command pipeline can give you a list which you could isolate to just the first occurrence of each file that is sharing the same inode:

  find . ! -type d -printf '%10i %P\n' | awk '{n=substr($0,12);if(a[$1]==1){print "other",n;}else{a[$1]=1;print "first",n;}}'

Note the above is 123 characters long. You may have issues with mail programs that truncate or wrap it, so be careful. The fixed-size formatting of the inode number in the find output is to make it easy to extract the name, or the name plus the symlink target, in the awk command using substr().

One approach in the situation you have, if the filesystem is not corrupt (which it might be, because files don't create cycles), is to create a list of files based on their inode number, and hardlink each file to one named by its inode number. Just rsync the directory full of inode numbers. Then re-expand on the destination based on that list.

You should not be following symlinks in a file tree recursion. Rsync, find, cpio, and others know not to. But I suspect some kind of filesystem corruption, or at least some hard links being applied to directories. The latter can create cycles if not done carefully (and there is virtually no case to ever do that at all by intent).

I do not consider it bad organization to have lots of files be hardlinked. In fact, I have a program that actually seeks out identical files and makes them be hardlinked to save space (not safe in all cases, but safe in most).

The command "find . -type l" will only find symlinks. You can find files that have hard links with "find . ! -type d -links +1 -print". Note that all file types can have hard links, even symlinks. Do exclude directories, as those will have many links for other reasons (e.g. 1 for self-reference, 1 for being inside a directory, and 1 each for each subdirectory within).

--
|---------------------------------------/----------------------------------|
| Phil Howard KA9WGN (ka9wgn.ham.org) / Do not send to the address below |
| first name lower case at ipal.net / spamtrap-2007-03-08-0651@ipal.net |
|------------------------------------/-------------------------------------|
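For concreteness, a rough sketch of that inode-farm approach, assuming GNU find, GNU ln, and bash, and filenames free of embedded newlines. The /src and dest: paths and the .farm and name-map.txt names are made up for illustration; .farm must live on the same filesystem as the tree so ln(1) can hard-link into it:

  # on the source host: map every non-directory name to its inode,
  # then link each inode once into a flat "farm" directory
  cd /src
  mkdir .farm
  find . -path ./.farm -prune -o ! -type d -printf '%i %P\n' \
      > .farm/name-map.txt
  awk '!seen[$1]++' .farm/name-map.txt |
  while read -r ino name; do
      ln -- "$name" ".farm/$ino"   # GNU ln hard-links symlinks themselves;
  done                             # devices etc. may need special casing

  # ship the flat farm plus the map; rsync no longer tracks hard links
  rsync -a .farm/ dest:/dest/.farm/

  # on the destination: rebuild every original name from the map
  cd /dest
  while read -r ino name; do
      mkdir -p -- "$(dirname -- "$name")"
      ln -- ".farm/$ino" "$name"
  done < .farm/name-map.txt

Because only non-directory inodes pass through the farm, any hard links on directories, the likely source of the cycles, never get recreated on the destination, and GNU find itself detects and skips filesystem loops while building the map.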
On Wed, Mar 07, 2007 at 09:22:08PM -0800, Sriram Ramkrishna wrote:
> Is there a way to have it skip hard links when doing an rsync?

If you mean you want to skip any file that has more than one link, you could do this:

  find . -type f -links +1 >/path/exclude.txt

Then you'd use the exclude.txt file via the --exclude-from option.

However, you mentioned loops, and that makes me think that the problem is not with file loops, but with dirs looping back in the hierarchy. The above won't help you if that is the case (since find will loop too).

..wayne..
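Spelled out, that might look like the following (paths are illustrative; the sed turns find's "./foo" lines into "/foo" so each exclude is anchored at the root of the transfer, and filenames containing rsync wildcard characters would need extra escaping):

  cd /src
  find . -type f -links +1 | sed 's|^\.||' > /tmp/exclude.txt
  rsync -a --exclude-from=/tmp/exclude.txt . dest:/copy-of-src/

Note that rsync still scans the whole source tree in order to apply the excludes; what you save is transferring and recreating the multiply-linked files on the receiving side.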
On Wed, Mar 07, 2007 at 09:22:08PM -0800, Sriram Ramkrishna wrote:

Hi there,

For some reason I sent this mail before I was fully subscribed, and I have missed out on the replies. If I don't answer all the responses, this is why.

> The following command pipeline can give you a list which you could
> isolate to just the first occurrence of each file that is sharing the
> same inode:
>
>   find . ! -type d -printf '%10i %P\n' | awk '{n=substr($0,12);if(a[$1]==1){print "other",n;}else{a[$1]=1;print "first",n;}}'

Yes, I think I have something similar that someone else has used to do the same thing. Thank you, this is most useful.

> One approach in the situation you have, if the filesystem is not
> corrupt (which it might be, because files don't create cycles), is to
> create a list of files based on their inode number, and hardlink each
> file to one named by its inode number. Just rsync the directory full
> of inode numbers. Then re-expand on the destination based on that list.

I think I probably have hard links to directories. I have observed cpio going through a loop continuously. Since I was doing this on an AIX JFS filesystem (on an AIX fileserver), it might not have the same protections that I believe Linux has when hitting a circular loop.

> You should not be following symlinks in a file tree recursion. Rsync,
> find, cpio, and others know not to.
>
> But I suspect some kind of filesystem corruption, or at least some
> hard links being applied to directories. The latter can create cycles
> if not done carefully (and there is virtually no case to ever do that
> at all by intent).

I think this is exactly what's happening. I think I have a number of cycles that are causing the data to go loopy. (Pardon the pun.) If that's the case, how does one find self-referential hard/soft links?

> I do not consider it bad organization to have lots of files be
> hardlinked. In fact, I have a program that actually seeks out
> identical files and makes them be hardlinked to save space (not
> safe in all cases, but safe in most).

Sure, but in a large filesystem it's been very painful to copy this data when rsync is taking days instead of hours.

> The command "find . -type l" will only find symlinks. You can find
> files that have hard links with "find . ! -type d -links +1 -print".
> Note that all file types can have hard links, even symlinks. Do
> exclude directories, as those will have many links for other reasons
> (e.g. 1 for self-reference, 1 for being inside a directory, and 1
> each for each subdirectory within).

Can I also use find to create a list of files that are not hardlinked and then use --include-from and --exclude='*'? I had thought that might be an alternative way. If I use this rule, does rsync still stat through the filesystem?

sri
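As for spotting the cycles themselves, one possible approach, assuming GNU findutils are available (-printf is GNU-specific, so on the AIX box this may require GNU find to be installed):

  # directories sharing an inode are hard-linked together -- exactly
  # the thing that creates cycles; any duplicate inode is a suspect
  find . -type d -printf '%i %p\n' |
      awk 'seen[$1]++ { print "duplicate dir inode:", $0 }'

  # for symlink loops, walk with -L and watch stderr: GNU find prints
  # a "File system loop detected" warning when it revisits an ancestor
  find -L . >/dev/null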