On Tue, 5 Oct 2004, Khamenia, Valery wrote:> BTW, do you mean that current hash-based implementation brings *clearly*
> better performance than any O(n*log(n)) sort based algorithm?
> If I have correctly understood src/main/unique.c then current
> hash function is niether minimal perfect hash function nor even
> minimal hash function. In addition, as might be expected,
> current hash function uses full pass through the string to get
> a hash key.
>
> So, in particular, can anyone clearly show that the current
> hash-based algorithm will be quicker then sort-based
> algorithm if the input has:
>
> 1. a lot of strings;
> 2. strings are very long;
> 3. strings are quite unsimilar
>
> ?
>
> Hm, I don't believe you are ready to prove smth. like that.
Clearly the algorithm is not optimal for all possible sets of
inputs (eg if the inputs are already known to be unique then an even
faster implementation is just to do nothing).
As R's string comparison function use C's strcmp, for which the C
standard
makes no performance guarantees whatsoever, it is not possible to prove
*anything* about the relative performance of algorithms without making
some additional assumptions.
However, if this was a serious question, it is known that
a) in some versions of R on some byte-orderings the hash was broken and
the performance was far inferior, suggesting that it is ordinarily quite
effective.
b) Switching the rowsum function from a sort-based implementation to
hashing produced a substantial speed increase.
-thomas