Hi list, and Wayne in particular, It was almost a year since we had the discussion (with http://lists.samba.org/archive/rsync/2005-March/011875.html as it's conclusion) regarding chances for hash collisions and large files. As now we have someone asking about synching 5TB files, I decided to actually submit a patch. Attached is a patch that uses a non-predetermined hash table size, so that the hash cell load (alpha) is never more than 80%. As far as my understanding of rsync goes, this requires no change in the rsync protocol. Comments welcome, Shachar -------------- next part -------------- A non-text attachment was scrubbed... Name: dynamic_hash.patch Type: text/x-patch Size: 1454 bytes Desc: not available Url : http://lists.samba.org/archive/rsync/attachments/20060225/62f748cc/dynamic_hash.bin
On Sat, Feb 25, 2006 at 01:25:52PM +0200, Shachar Shemesh wrote:> Attached is a patch that uses a non-predetermined hash table size, so > that the hash cell load (alpha) is never more than 80%.Thanks for the patch! Here's some comments: - You didn't change the size of the "tag" typedef (an unsigned short), and your patch makes the value potentially overflow. - For smaller hash-table sizes, your algorithm does a lookup in the table based only on the s1 value (due to the (s2 << 16) value being too large to have any remainder less than the tablesize). So, I think this probably needs to leave gettag() calling gettag2(), and change gettag2() to factor both s1 and s2 into some kind of an improved tag-generating computation. ..wayne..