Hey Parth,

Thanks for the reply.

I am considering implementing a cosine distance metric too, along with Euclidean distance, because of the dimensionality issue that comes in with K-Means and the Euclidean distance metric. That does help when we deal with sparse vectors for documents.

The particular problem I'm having is representing centroids in an efficient way. For example, when we find the mean vector of a cluster, the resultant centroid need not be the document vector of any document belonging to that cluster. Hence representing that centroid, which will be dense, as a C++ map is inefficient because of the number of terms associated with it, and calculating distances with that doesn't work or scale well.

On top of that, my distance calculation works over two documents. So will I need to modify it to accommodate arbitrary vectors which might not represent document vectors?

Would be great if everyone could add their inputs on this.
Hi Richhiey,

Storing the centroids as double arrays is a better choice because of their dense nature and the simplicity of operating over them in parallel, e.g. if you want to pass them to a BLAS subroutine. In my previous email, I tried to calculate the space requirements for it.

A document can be stored in the sparse format (like you do with a map), but before passing it to a cosine similarity subroutine you can make it dense (create a new double array and set its particular non-zero indexes). Alternatively, if you decide to operate in the sparse space, you can efficiently access the centroid entries at the indexes for which the document has non-zero entries.

Basically, here you have to make a design choice. In the former case, you can use BLAS-like subroutines, while in the latter, you save computation. The former is also valid for the Euclidean distance metric while the latter is not. Olly/James might have an opinion (maybe from the previous cluster branch).

Cheers,
Parth

On Wed, Jul 27, 2016 at 6:17 PM, Richhiey Thomas <richhiey.thomas at gmail.com> wrote:

> Hey Parth,
>
> Thanks for the reply.
> I am considering implementing a cosine distance metric too, along with
> Euclidean distance because of the dimensionality issue that comes in with
> K-Means and the Euclidean distance metric.
> That does help when we deal with sparse vectors for documents. The
> particular problem I'm having is representing centroids in an efficient way.
> For example, when we find the mean vector of a cluster, the resultant
> centroid need not be a document vector of a document belonging to that
> cluster. Hence representing that centroid, which will be dense, as a C++ map
> is inefficient because of the number of terms associated with it and
> calculating distances with that doesn't work or scale too well.
> Over that, my distance calculation works over two documents. So will I
> need to modify that in a way to accommodate arbitrary vectors which might
> not represent document vectors?
> Would be great if everyone could add their inputs on this.
On Thu, Jul 28, 2016 at 12:15:51PM +0530, Parth Gupta wrote:

> Storing the centroids as double arrays is a better choice because of their
> dense nature and simplicity for operating over in parallel e.g. if you want
> to pass it to BLAS subroutine. In my previous email, I tried to calculate
> space requirements for it.

Centroids are certainly more likely to be dense than document vectors in this case. I'd initially just get something working, probably dense, and worry about performance once you have a test suite. (Everything is easier once you have a test suite.)

> A document can be stored in the sparse format (like you do with map) but
> before passing it to a cosine similarity subroutine you can make it dense
> (create a new double array and set its particular non-zero indexes).
> Alternatively, if you decide to operate in the sparse space, you can
> efficiently access centroid entries at indexes for which document has
> non-zero entry.

Computing cosine similarity is pretty easy between two sparsely-stored document vectors, providing you know the size of the equivalent array. Just iterate over the indices, and keep track of the three terms making up the similarity measure; on a zero component the vector-specific term won't increase, and the numerator term won't increase if the component in either vector is zero.

Don't forget that once you've initialised the system, you shouldn't need the map from term to id.

> The former is also valid for the euclidean distance metric while the
> latter is not. Olly/James might have an opinion (may be from the
> previous cluster branch).

I don't have thoughts from previous clustering projects, but I will note that no decision made at this point is final, and it should be reasonably easy to change things once there's a running version that we can profile to see how it behaves.

J

-- 
James Aylett, occasional trouble-maker
xapian.org