Hey Parth,

Thanks for the reply.

I am considering implementing a cosine distance metric too, along with Euclidean distance, because of the dimensionality issue that comes in with K-Means and the Euclidean distance metric. That does help when we deal with sparse vectors for documents.

The particular problem I'm having is representing centroids in an efficient way. For example, when we find the mean vector of a cluster, the resultant centroid need not be the document vector of any document belonging to that cluster. Hence representing that centroid, which will be dense, as a C++ map is inefficient because of the number of terms associated with it, and calculating distances with that doesn't work or scale well.

On top of that, my distance calculation works over two documents. So will I need to modify it to accommodate arbitrary vectors which might not represent document vectors?

Would be great if everyone could add their inputs on this.
Hi Richhiey,

Storing the centroids as double arrays is a better choice because of their dense nature and the simplicity of operating over them in parallel, e.g. if you want to pass them to a BLAS subroutine. In my previous email, I tried to calculate the space requirements for it.

A document can be stored in the sparse format (like you do with a map), but before passing it to a cosine similarity subroutine you can make it dense (create a new double array and set its particular non-zero indexes). Alternatively, if you decide to operate in the sparse space, you can efficiently access the centroid entries at the indexes for which the document has non-zero entries.

Basically, here you have to make a design choice. In the former case, you can use BLAS-like subroutines, while in the latter, you save computation. The former is also valid for the Euclidean distance metric while the latter is not. Olly/James might have an opinion (maybe from the previous cluster branch).

Cheers,
Parth

On Wed, Jul 27, 2016 at 6:17 PM, Richhiey Thomas <richhiey.thomas at gmail.com> wrote:

> Hey Parth,
>
> Thanks for the reply.
> I am considering implementing a cosine distance metric too, along with
> Euclidean distance because of the dimensionality issue that comes in with
> K-Means and the Euclidean distance metric.
> That does help when we deal with sparse vectors for documents. The
> particular problem I'm having is representing centroids in an efficient way.
> For example, when we find the mean vector of a cluster, the resultant
> centroid need not be a document vector of a document belonging to that
> cluster. Hence representing that centroid, which will be dense, as a C++ map
> is inefficient because of the number of terms associated with it and
> calculating distances with that doesn't work or scale too well.
> Over that, my distance calculation works over two documents. So will I
> need to modify that in a way to accommodate arbitrary vectors which might
> not represent document vectors?
> Would be great if everyone could add their inputs on this.
On Thu, Jul 28, 2016 at 12:15:51PM +0530, Parth Gupta wrote:

> Storing the centroids as double arrays is a better choice because of their
> dense nature and simplicity for operating over in parallel e.g. if you want
> to pass it to BLAS subroutine. In my previous email, I tried to calculate
> space requirements for it.

Centroids are certainly more likely to be dense than document vectors in this case. I'd initially just get something working, probably dense, and worry about performance once you have a test suite. (Everything is easier once you have a test suite.)

> A document can be stored in the sparse format (like you do with map) but
> before passing it to a cosine similarity subroutine you can make it dense
> (create a new double array and set its particular non-zero indexes).
> Alternatively, if you decide to operate in the sparse space, you can
> efficiently access centroid entries at indexes for which document has
> non-zero entry.

Computing cosine similarity is pretty easy between two sparsely-stored document vectors, providing you know the size of the equivalent array. Just iterate over the indices, and keep track of the three terms making up the similarity measure; on a zero component the vector-specific term won't increase, and the numerator term won't increase if the component in either vector is zero.

Don't forget that once you've initialised the system, you shouldn't need the map from term to id.

> The former is also valid for the euclidean distance metric while the
> latter is not. Olly/James might have an opinion (may be from the
> previous cluster branch).

I don't have thoughts from previous clustering projects, but I will note that no decision made at this point is final, and it should be reasonably easy to change things once there's a running version that we can profile to see how it behaves.

J

-- 
James Aylett, occasional trouble-maker
xapian.org