Hello,

I've been working on the KMeans clustering algorithm recently, and for the past week I have been stuck on a problem I'm not able to find a solution to.

Since we are representing documents as tf-idf vectors, they are really sparse vectors (a usual corpus can have around 5000 terms). So it gets really difficult to represent these sparse vectors in a way that is computationally efficient for calculating Euclidean distances. I implemented a K-Medoids algorithm using PAM just to try it out, after modifying the API for whatever more was required, and that seems fine, since we are dealing with document vectors and not arbitrary vectors. But with KMeans, I am not able to figure out how to represent the centroids during each iteration, when the average of a cluster is to be computed.

So my confusion is: how could I represent an arbitrary sparse vector to be used as the centroid in k-means? Can anyone please guide me on this? Would using Boost C++ be a solution?

Thanks
Hi Richhiey,

Usually, a corpus can easily have around a few hundred thousand terms. As you have already identified, the vectors are sparse, i.e. only a few tens or hundreds of entries are non-zero. One common optimisation is to store only the non-zero entries. You can use, for example, an unordered set to store those entries. To store more information for each entry, e.g. the frequency of the term, you can store a custom object in the unordered set and use it while computing the distance.

Now, centroids are usually initialised by some algorithm or randomly, and they can be dense. Let's calculate the in-memory storage for our clustering project. Say k = 10 (in practice it can be smaller) and our vocabulary size is 1M. To hold these centroids in main memory, we would need 10 * 10^6 * 8 bytes, which is around 80 MB. If you compromise on precision and choose single-precision floats, you can store them in 40 MB. Now consider performing one iteration of k-means with 30 documents. If you store them in dense format you need 240 MB, which is still not bad, but it can be worse if you want to cluster 50 documents and the collection is huge (a vocabulary size of a few million).

In this case, standard k-means with a Euclidean distance metric is very risky, because centroid entries at the zero positions of the document vectors also contribute to the distance; moreover, the vectors are sparse, which can heavily influence the score. Hence, you should consider using spherical k-means, which uses the cosine distance. Then you can store document vectors in sparse format, because zero entries do not contribute to the distance metric. If you do choose a dense format, there are BLAS libraries with optimised implementations of the L2 norm.
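A minimal C++ sketch of the cosine-distance idea above, assuming sparse tf-idf vectors stored as a map from term to weight. The type alias `SparseVector` and the helper name are illustrative, not part of Xapian's API; only the non-zero entries of the document need to be touched for the dot product:

```cpp
#include <cassert>
#include <cmath>
#include <string>
#include <unordered_map>

// Sparse tf-idf vector: only non-zero entries are stored (term -> weight).
using SparseVector = std::unordered_map<std::string, double>;

// Cosine similarity between a sparse document and a (possibly denser)
// centroid stored in the same map form. Only terms present in `doc` can
// contribute to the dot product, so we iterate over the sparse document.
double cosine_similarity(const SparseVector& doc, const SparseVector& centroid) {
    double dot = 0.0, doc_norm_sq = 0.0, centroid_norm_sq = 0.0;
    for (const auto& entry : doc) {
        doc_norm_sq += entry.second * entry.second;
        auto it = centroid.find(entry.first);
        if (it != centroid.end())
            dot += entry.second * it->second;
    }
    for (const auto& entry : centroid)
        centroid_norm_sq += entry.second * entry.second;
    if (doc_norm_sq == 0.0 || centroid_norm_sq == 0.0)
        return 0.0;
    return dot / (std::sqrt(doc_norm_sq) * std::sqrt(centroid_norm_sq));
}
```

The cosine distance for spherical k-means is then simply `1.0 - cosine_similarity(doc, centroid)`.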
Cheers,
Parth

On Tue, Jul 26, 2016 at 10:18 AM, Richhiey Thomas <richhiey.thomas at gmail.com> wrote:
> Hello, I've been working on the KMeans clustering algorithm recently and
> since the past week, I have been stuck on a problem which I'm not able to
> find a solution to. [...]
Hey Parth,

Thanks for the reply. I am considering implementing a cosine distance metric because of the dimensionality issue that comes in with K-Means and the Euclidean distance metric.

Currently, the way I find distances between documents is by finding their terms and looking up their term frequencies, which I've stored in a map, so I've not stored a unique vector for every document. Now in KMeans, when we find the mean of a cluster, the result need not be a document vector, so representing these centroids is becoming a problem, since the centroids will be dense. Should I use a map for that too, storing all the terms and their average values? Or would it be a better approach to store a document vector for every document?

Thanks.
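The map-of-averages idea mentioned above could be sketched as follows, assuming the same term-to-weight map representation for documents. This is a hypothetical sketch, not Xapian code: the centroid is the term-wise mean of the cluster's members, and terms absent from a document contribute an implicit zero, so only non-zero entries are ever touched:

```cpp
#include <cassert>
#include <cmath>
#include <string>
#include <unordered_map>
#include <vector>

// Sparse tf-idf vector: only non-zero entries are stored (term -> weight).
using SparseVector = std::unordered_map<std::string, double>;

// Term-wise mean of a cluster of sparse vectors, accumulated in a map.
// The result is typically denser than any single member, but still only
// holds terms that occur somewhere in the cluster.
SparseVector compute_centroid(const std::vector<SparseVector>& cluster) {
    SparseVector centroid;
    if (cluster.empty())
        return centroid;
    for (const auto& doc : cluster)
        for (const auto& entry : doc)
            centroid[entry.first] += entry.second;
    for (auto& entry : centroid)
        entry.second /= cluster.size();
    return centroid;
}
```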
Hey Parth,

Thanks for the reply. I am considering implementing a cosine distance metric too, along with Euclidean distance, because of the dimensionality issue that comes in with K-Means and the Euclidean distance metric. That does help when we deal with sparse vectors for documents.

The particular problem I'm having is representing centroids in an efficient way. For example, when we find the mean vector of a cluster, the resulting centroid need not be the document vector of any document belonging to that cluster. Hence, representing that centroid, which will be dense, as a C++ map is inefficient because of the number of terms associated with it, and calculating distances with it doesn't work or scale too well. On top of that, my distance calculation works over two documents, so will I need to modify it to accommodate arbitrary vectors that might not represent document vectors?

Would be great if everyone could add their inputs on this.
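One way to avoid a separate code path for documents versus centroids, sketched here under the assumption that both are kept as term-to-weight maps (names illustrative, not Xapian API): write the distance over two sparse maps, handling terms that appear in either argument, so the same routine works whether the second operand is a real document or a mean vector that is not itself a document:

```cpp
#include <cassert>
#include <cmath>
#include <string>
#include <unordered_map>

// Sparse tf-idf vector: only non-zero entries are stored (term -> weight).
using SparseVector = std::unordered_map<std::string, double>;

// Euclidean distance over two sparse maps. Terms missing from one side
// are treated as zero, so documents and non-document centroids are
// handled uniformly by the same function.
double euclidean_distance(const SparseVector& a, const SparseVector& b) {
    double sum = 0.0;
    // Terms present in a, paired with b's value (zero if absent).
    for (const auto& entry : a) {
        auto it = b.find(entry.first);
        double wb = (it != b.end()) ? it->second : 0.0;
        double diff = entry.second - wb;
        sum += diff * diff;
    }
    // Terms present only in b.
    for (const auto& entry : b)
        if (a.find(entry.first) == a.end())
            sum += entry.second * entry.second;
    return std::sqrt(sum);
}
```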