I'd rather keep discussion on the list - others may be interested or
have useful comments...
On Fri, Sep 10, 2004 at 05:22:41PM -0400, Georges Dupret wrote:
> Actually, I am really looking for the weights, and not the term
> frequencies. I found the term frequencies, but I thought it would be
> inelegant to re-compute the weights based on them.
Ah, I think I see. Given a document in an mset, you want to know what
component of the weight came from each term?
That information isn't kept - we sum the component weights to give the
total weight for each document, but we don't keep the component weights.
It's cheap to recalculate the components, but the current API doesn't
allow you to use a Xapian::Weight object to do it, which is a shame.
It probably wouldn't be hard to allow this though. It's possible to
keep the component weights around too, but I'd want to look carefully
at the overhead of doing so for something most people aren't interested
in and which can be recalculated.
Mind you, even if it is recalculated, that could be done behind the
scenes so the API behaves as if the components of the weights were kept
around.
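For what it's worth, the default weighting scheme is BM25, and each
term's contribution only depends on statistics you can already get at
(within-document frequency, document length, collection size and so on),
so recomputing really is cheap. Here's a rough Python sketch of the
textbook BM25 component - this is the generic formula with parameter
names of my own choosing, not the Xapian API, so the values won't match
Xapian's internal weights exactly:

import math

def bm25_component(tf, doc_len, avg_doc_len, num_docs, docs_with_term,
                   k1=1.2, b=0.75):
    # Textbook BM25 contribution of one term to one document's weight.
    # tf             - within-document frequency of the term
    # doc_len        - length of this document (in terms)
    # avg_doc_len    - average document length in the collection
    # num_docs       - total number of documents
    # docs_with_term - number of documents containing the term
    idf = math.log((num_docs - docs_with_term + 0.5) /
                   (docs_with_term + 0.5))
    length_norm = doc_len / avg_doc_len
    tf_part = (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * length_norm))
    return idf * tf_part

The total weight for a document is then just the sum of these components
over the terms it has in common with the query.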
> Initially, I need to find the single most similar document to each
> document for some work on search-engine logs, so I don't need
> clustering. In the medium term, I will need the term weights to
> continue with some work based on Principal Component Analysis.
>
> Why do you recommend single link clustering instead of k-means?
K-means clustering (and in general most partition-based schemes) requires
that you decide how many clusters you want (the value of K for K-means
clustering) before you start clustering. You perhaps have some idea of
the maximum or average cluster size, or of the number of clusters you
want, but being forced to pick an exact number means the algorithm may
have to split or merge clusters where doing so is unnatural.
On the other hand, with hierarchical clustering schemes (such as single
link), you can stop clustering when the largest cluster hits a certain
size, or the average cluster size passes a threshold, or you can look
at how the minimum distance between clusters is changing and stop at
the biggest change near the number of clusters you'd like.
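To make those stopping criteria concrete, here's a naive Python sketch
of agglomerative single link clustering - the straightforward
repeated-merge loop, not the efficient SLINK algorithm, and the
parameter names are just illustrative. It stops either when the next
merge would push a cluster past a size limit, or when the merge
distance jumps sharply:

import math

def single_link_cluster(points, max_cluster_size=None, jump_factor=None):
    # Returns a list of clusters, each a list of indices into points.
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    clusters = [[i] for i in range(len(points))]
    prev_merge_dist = None
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest single-link
        # (i.e. minimum pairwise) distance.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist(points[a], points[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        # Stop if the merged cluster would be too big...
        if (max_cluster_size is not None and
                len(clusters[i]) + len(clusters[j]) > max_cluster_size):
            break
        # ...or if the merge distance has jumped sharply.
        if (jump_factor is not None and prev_merge_dist and
                d > jump_factor * prev_merge_dist):
            break
        clusters[i].extend(clusters[j])
        del clusters[j]
        prev_merge_dist = d
    return clusters

For example, single_link_cluster([(0, 0), (0, 1), (1, 0), (10, 10),
(10, 11)], jump_factor=3.0) stops once the only remaining merge is the
big jump between the two natural groups, giving [[0, 1, 2], [3, 4]].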
K-means clustering is also not really deterministic. You need K starting
centres to cluster around, and the choice of centres can affect where the
clusters end up. That seems a bit unsatisfactory really.
There's an illustration of what I'm talking about here:
http://www.clustan.com/k-means_critique.html#Differentsolutions
I've seen suggestions that you can address this by running the
clustering several times with different starting centres and somehow
picking the best run (say, by minimising the sum of squares of each
point's distance from its cluster centre). But it seems likely that the
number of different stable solutions will depend on n, so this scheme
worsens the time complexity of K-means.
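In Python terms, the "run it several times and pick the best" idea
looks something like this - a plain Lloyd's-algorithm sketch with
made-up names, scoring each run by the within-cluster sum of squares:

import random

def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=100, rng=random):
    # One run of Lloyd's algorithm from randomly chosen starting centres.
    centres = rng.sample(points, k)
    groups = [[] for _ in range(k)]
    for _ in range(iters):
        # Assign each point to its nearest centre.
        groups = [[] for _ in range(k)]
        for p in points:
            groups[min(range(k), key=lambda c: dist2(p, centres[c]))].append(p)
        # Recompute each centre as the mean of its group (keeping the old
        # centre if a group ends up empty).
        new_centres = [tuple(sum(xs) / len(g) for xs in zip(*g)) if g else c
                       for c, g in zip(centres, groups)]
        if new_centres == centres:
            break
        centres = new_centres
    return centres, groups

def within_cluster_ss(centres, groups):
    # Sum of squared distances of each point from its cluster centre.
    return sum(dist2(p, c) for c, g in zip(centres, groups) for p in g)

def best_of_restarts(points, k, restarts=10, seed=0):
    # Run k-means several times and keep the run with the smallest
    # within-cluster sum of squares.
    rng = random.Random(seed)
    best = None
    for _ in range(restarts):
        centres, groups = kmeans(points, k, rng=rng)
        score = within_cluster_ss(centres, groups)
        if best is None or score < best[0]:
            best = (score, centres, groups)
    return best[1], best[2]

Of course, as I say, the number of restarts needed to reliably find the
best solution probably grows with n.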
But if you have a very large number of points, hierarchical clustering
eventually just isn't going to be fast enough (especially not for
interactive work). Single link is O(n^2). I've not actually seen a
time complexity given for K-means. Each iteration is O(n) for a fixed
K, and the number of iterations required will probably depend on the
number of points, so it's probably worse than O(n) overall, but surely
much better than O(n^2).
I've seen a combined approach suggested - use K-means clustering (or
similar) to group your myriad of points into small clusters, then use a
hierarchical scheme to cluster the small clusters together. You then
don't need to choose the final number of clusters exactly.
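Reusing the kmeans() and single_link_cluster() sketches above, the
combined approach might look roughly like this (note the size cap here
counts k-means centres rather than original points):

def two_stage_cluster(points, coarse_k=50, max_cluster_size=None):
    # Stage 1: k-means reduces the data to coarse_k small clusters.
    centres, groups = kmeans(points, coarse_k)
    # Stage 2: single link on the centres; each cluster of centres
    # stands for the union of the underlying k-means groups.
    centre_clusters = single_link_cluster(centres,
                                          max_cluster_size=max_cluster_size)
    merged = []
    for cc in centre_clusters:
        cluster = []
        for idx in cc:
            cluster.extend(groups[idx])
        merged.append(cluster)
    return merged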
I should probably point out I'm not a clustering expert! I've just read
various books and papers discussing the subject, and done a little
experimenting of my own.
Cheers,
Olly