Andrew J. Romero
2013-Sep-05 16:08 UTC
rsync -H option yields corrupt replicas (due to non-unique inode ids)
Hi,

Our organization hosts a specialized Linux distribution. As is typical with Linux distributions, the set of files that make up our Linux distro contains a very complex web of self-referential hard links.

Several other sites use our Linux distro and maintain either partial or full internal mirror copies of it. The standard method used by Linux mirror sites to pull/replicate a subset of a Linux distribution (or a complete Linux distribution) from a master repository is to use rsync with options that produce the following behavior: the first time a unique file is encountered, its content is replicated; however, when subsequent hard links to the file are detected, only the hard links are replicated.

The primary copy of our Linux distro is stored on our BlueArc Titan NAS (NFS server). Relative to the mirror sites, our rsync server "sits in front of" the NAS.

Internally the BlueArc Titan has a unique object id for files; however, the inode ID presented to clients by the BlueArc Titan is not unique. As a result, rsync (with the -H option) erroneously identifies unique files as hard links to different files, leaving the mirror repositories essentially corrupt and unusable.

It is my understanding that the NFS v3 spec does not require NFS servers to present unique inode ids to clients. I believe the reasoning is that large-scale NAS appliances internally need to use very wide object ids but externally need to present (when asked) inode ids that any client can deal with.

Are there options to rsync that will allow me to reliably replicate my hard-link-rich Linux distro from my NAS?

Thanks

Andy
Kevin Korb
2013-Sep-05 17:21 UTC
rsync -H option yields corrupt replicas (due to non-unique inode ids)
Rsync determines hard links via inode numbers. That is the only way to determine that two files are actually the same file.
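To illustrate the point about inode numbers, here is a minimal Python sketch of inode-based hard-link detection. It is not rsync's actual implementation; the function names `group_by_inode` and `suspicious_groups` are made up for this example. It groups regular files by their (device, inode) pair, which is how a client would conclude that two paths are the same file, and then flags groups whose members disagree on size or mtime. True hard links share one inode and so cannot disagree on those fields; a group that does is a likely inode-id collision.

```python
import os
import stat
import sys
from collections import defaultdict


def group_by_inode(root):
    """Map (st_dev, st_ino) -> list of (path, size, mtime_ns) for regular files."""
    groups = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.lstat(path)  # do not follow symlinks
            if not stat.S_ISREG(st.st_mode):
                continue
            groups[(st.st_dev, st.st_ino)].append(
                (path, st.st_size, st.st_mtime_ns)
            )
    return groups


def suspicious_groups(groups):
    """Return groups whose paths share an inode id but differ in size or mtime."""
    bad = {}
    for key, entries in groups.items():
        distinct_metadata = {(size, mtime) for _path, size, mtime in entries}
        if len(distinct_metadata) > 1:
            bad[key] = [path for path, _size, _mtime in entries]
    return bad


if __name__ == "__main__":
    # Usage: python this_script.py /path/to/nfs/mounted/tree
    for (dev, ino), paths in suspicious_groups(group_by_inode(sys.argv[1])).items():
        print(f"dev={dev} ino={ino} is shared by files that look different:")
        for p in paths:
            print(f"  {p}")
```

Run against the NFS-mounted tree on the rsync server, any group it prints is a set of paths that a purely inode-based check would treat as hard links to each other even though their metadata differs. Files modified during the scan can also show up, so it is best run on a quiet tree.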
Matthias Schniedermeyer
2013-Sep-05 23:43 UTC
rsync -H option yields corrupt replicas (due to non-unique inode ids)
It could be a plain 32-bit/64-bit problem: in this case 64-bit inodes, and I'm not sure NFS v3 supports 64-bit inodes. I'm pretty sure that NFS v4 supports 64-bit inodes and NFS v2 doesn't.
Google didn't give me a straight answer, and the Wikipedia page only says that NFS v3 gained support for 64-bit file sizes/offsets; inodes aren't mentioned.

So, assuming NFS v3 either doesn't support 64-bit inodes or somehow isn't set up correctly: just as Kevin said, rsync determines "is the same file" by inode, so if the filesystem has 64-bit inodes and NFS truncates them to 32 bits, totally unrelated files APPEAR to have the same inode. If rsync doesn't also check size/mtime/owner(...), it can crosslink totally unrelated files.

As you should have examples of "crosslinked" files, just "stat" them on the command line and see which identical inode numbers are shown. On the NAS itself, assuming you can get a command prompt, also stat the files and check whether the inode numbers are below or above 2^32. And assuming you get different numbers below or above 2^32, check whether the lower 32 bits are identical.

And if you ask yourself "hey, 32 bits is a large number space, how can I get collisions?": that's called the birthday paradox: http://en.wikipedia.org/wiki/Birthday_paradox

-- 
Matthias
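The two checks suggested above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `collision_probability` uses the standard birthday approximation and assumes ids are spread uniformly at random over the id space, which a real NAS's id allocation may not do, and `same_low32` assumes that any truncation keeps the low 32 bits. Both function names are hypothetical.

```python
import math


def collision_probability(n_files, id_bits=32):
    """Approximate chance that at least two of n_files share an id.

    Assumes ids are uniformly random over 2**id_bits values and uses the
    birthday approximation p = 1 - exp(-n(n-1) / (2 * space)).
    """
    space = 2 ** id_bits
    return -math.expm1(-n_files * (n_files - 1) / (2 * space))


def same_low32(ino_a, ino_b):
    """True if two different inode numbers agree in their lower 32 bits."""
    mask = 0xFFFFFFFF
    return ino_a != ino_b and (ino_a & mask) == (ino_b & mask)


if __name__ == "__main__":
    for n in (10_000, 77_163, 1_000_000):
        print(f"{n} files: p(collision) ~ {collision_probability(n):.3f}")
```

Under the uniform assumption, roughly 77,000 files are enough for about a 50% chance of at least one collision in a 32-bit space, so a distribution tree with hundreds of thousands of files would be expected to hit some. To use `same_low32`, feed it the inode numbers that "stat" reports for a crosslinked pair on the NAS itself.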