Hello,

Thanks, James, for the reply. That cleared a few things up. Apologies for the late reply; I have exams going on.

I was going through the previous clustering API to understand how it worked. It seems that the termlists used for the distance metrics are built with TF-IDF weighting and compared using cosine similarity, which is very similar to the approach I would need for this project. The only difference is that Euclidean distance would be the metric.

Would it be good to structure it in a way similar to the previous API, with a few changes?

For example, the Xapian::DocSimCosine::similarity() function itself both calculates the TF-IDF vectors and calculates the similarity. Instead, would it be possible to have a custom weighting scheme subclassing Xapian::Weight? This would give the user a choice of which weighting scheme to use to create document vectors in K-means.

More ways of creating document sources should be allowed, for example from a vector of docids that the user has.

I have also been looking at the existing test API, and I'll create a new PR for a simple test in the next 1-2 days, maybe checking whether the value of k is valid, or checking the Euclidean distance calculations for document vectors.

Thanks.
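As a rough illustration of the approach described in this message (TF-IDF weighted term vectors compared with cosine similarity), here is a minimal standalone sketch. It does not use the Xapian API: TermVector, tfidf() and cosine_similarity() are hypothetical names, and the tf * log(N / df) weighting is only one common TF-IDF variant, assumed here for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <map>
#include <string>

// Hypothetical sparse vector: term -> weight. Not a Xapian type.
using TermVector = std::map<std::string, double>;

// Weight raw within-document term counts by tf * log(N / df).
// doc_freq maps each term to the number of documents containing it.
TermVector tfidf(const std::map<std::string, unsigned>& term_counts,
                 const std::map<std::string, unsigned>& doc_freq,
                 unsigned num_docs) {
    assert(num_docs > 0);
    TermVector v;
    for (const auto& tc : term_counts) {
        auto df = doc_freq.find(tc.first);
        // Skip terms with no known document frequency.
        if (df == doc_freq.end() || df->second == 0) continue;
        double idf = std::log(static_cast<double>(num_docs) / df->second);
        v[tc.first] = tc.second * idf;
    }
    return v;
}

// Cosine similarity: dot(a, b) / (|a| * |b|).
// Returns 0 if either vector has zero length.
double cosine_similarity(const TermVector& a, const TermVector& b) {
    double dot = 0.0, na = 0.0, nb = 0.0;
    for (const auto& e : a) {
        na += e.second * e.second;
        auto it = b.find(e.first);
        if (it != b.end()) dot += e.second * it->second;
    }
    for (const auto& e : b) nb += e.second * e.second;
    if (na == 0.0 || nb == 0.0) return 0.0;
    return dot / (std::sqrt(na) * std::sqrt(nb));
}
```

Two documents with identical term counts get a similarity of 1 (up to rounding), and two documents with no terms in common get 0.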
On Fri, May 06, 2016 at 02:29:48AM +0530, Richhiey Thomas wrote:

> I was going through the previous clustering API to understand how it worked
> and it seems like the approach for construction of the termlists which
> are used for distance metrics use TF-IDF weighting with cosine similarity,
> which is very similar to the approach I would need for this project. Just
> in this case, Euclidean distance would be the metric.
>
> Would it be good to structure it in a way similar to the previous API with
> a few changes?

I suspect that the public API will want to be fairly similar to the previous one, yes.

> For example, the Xapian::DocSimCosine::similarity( ) function in itself
> calculates the tf idf vectors and calculates the similarity. Instead would
> it be possible to have a custom weighting scheme sub classing
> Xapian::Weight? This can help in providing the user an option about which
> weighting scheme to use to create document vectors in K-means.

I doubt that will work. Xapian::Weight computes a single score for a document against a given query. Similarity metrics in clustering generally work by providing a distance between two vectors, each of which represents a document. So the API you'll want is different to ::Weight.

It probably will be useful in future to allow for different metrics to be used, though. That will probably involve separating creation of the tf-idf vectors from calculating the similarity.

> More ways of creating document sources should be allowed, for example from
> a vector of docid's that the user has.

There does seem to be value (as a future extension of clustering) in allowing people to cluster based on just a set of documents.

> I have also been looking at the existing test API and I'll create a new PR
> for a simple test in the next 1-2 days, maybe for checking whether the
> value of k is valid or checking the Euclidean distance calculations for
> document vectors.

Writing a tested Euclidean distance calculation between two document vectors sounds reasonably small, but it does require you to decide how the document vectors are going to be represented. I don't think that's particularly hard, but it means you should think of it in terms of the public APIs that will be used to construct the set of doc vecs out of an MSet, and how they'll be passed into the clustering system (and how you'll then get the clusters out again).

J

-- 
James Aylett, occasional trouble-maker
xapian.org
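To make the distinction above concrete, here is a minimal sketch of a metric that takes two document vectors, kept separate from whatever code builds those vectors. It is an illustration under assumptions, not the Xapian API: DocVector, DistanceMetric and EuclideanDistance are hypothetical names, and a std::map from term to weight is only one possible representation of a document vector.

```cpp
#include <cmath>
#include <map>
#include <string>

// Hypothetical document vector: term -> weight (for example a tf-idf weight).
using DocVector = std::map<std::string, double>;

// Hypothetical metric interface: a distance between two document vectors,
// rather than a score for one document against a query.
class DistanceMetric {
  public:
    virtual ~DistanceMetric() = default;
    virtual double distance(const DocVector& a, const DocVector& b) const = 0;
};

class EuclideanDistance : public DistanceMetric {
  public:
    double distance(const DocVector& a, const DocVector& b) const override {
        double sum = 0.0;
        // Terms in a; a term missing from b counts as weight 0 there.
        for (const auto& e : a) {
            auto it = b.find(e.first);
            double diff = e.second - (it == b.end() ? 0.0 : it->second);
            sum += diff * diff;
        }
        // Terms that appear only in b.
        for (const auto& e : b) {
            if (a.find(e.first) == a.end()) sum += e.second * e.second;
        }
        return std::sqrt(sum);
    }
};
```

Because the metric only sees finished vectors, another metric could be added as a further subclass without touching the code that weights the terms.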
Hello,

I have been thinking about how to write tests that help us come up with the public API that will be used for clustering. I'd like to describe two tests and the way I am thinking about the API, and to know whether I'm on the right path or how this can be improved.

1) Test case to check the Euclidean similarity of document vectors

    DEFINE_TESTCASE(euclidian, backend)
    {
        Xapian::Database db(get_database("euclidian"));
        // Make this file contain two sentences which are identical and
        // treated as two different docs
        // Get MSet containing two docs
        Document doc1 = mset[0].get_document();
        Document doc2 = mset[1].get_document();
        DocSim d;
        int sim = d.get_distance(doc1.termlist_begin(), doc1.termlist_end(),
                                 doc2.termlist_begin(), doc2.termlist_end(),
                                 SIMILARITY_OPTION /* (in this case, euclidian) */);
        TEST(sim == 0);
    }

The creation of TF-IDF vectors from the termlists of the documents will be done inside the DocSim class. The get_distance() function calculates the distance, and we can support many similarity measures later on. The default can be Euclidean distance.

2) Test case to check whether clusters are valid by checking whether any cluster is empty

    DEFINE_TESTCASE(cluster1, backend)
    {
        Xapian::Database db(get_database("cluster_api"));
        // Get MSet against a query, MSet -> matches
        Xapian::Cluster c;
        Xapian::ClusterSet cset = c.cluster(matches, k);
        if (cset != NULL) {
            for (Xapian::ClusterSetIterator i = cset.begin(); i != cset.end(); i++) {
                Xapian::DocumentSet d = i.get_clusterdocs();
                TEST(d.size() != 0);
            }
        }
    }

The Xapian::Cluster class will contain the main clustering functionality. It will cluster the documents and store the results in a class Xapian::ClusterSet, which is returned by Xapian::Cluster::cluster(). It will also contain a vector of the cluster IDs and a map from each document ID to its associated cluster ID.

Xapian::ClusterSet contains the cluster ID and a vector of the documents belonging to that cluster. Xapian::ClusterSetIterator can be used to go through the ClusterSet objects.

The documents belonging to a certain cluster can be retrieved by a function which returns them as a DocumentSet. This could again be made iterable, but I don't know how productive making a DocumentSet would be.

This is a very rough outline of how I think the API would look. I'd like to know if there are places where I am going wrong, so I can improve on them before the coding period starts.

Also, I apologize for not being very responsive on the mailing list; I have exams going on. They finish on the 26th of this month, after which I can concentrate on the project completely.

Thanks,
Richhiey
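The two checks discussed in this thread (a valid value of k, and no empty clusters) can be tried out on a toy K-means, sketched below. This is only a standalone illustration under assumptions: Point, Cluster, ClusterSet and kmeans() are hypothetical stand-ins rather than Xapian classes, documents are dense vectors of equal length, and seeding the centroids with the first k points is chosen for determinism, not quality.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <stdexcept>
#include <vector>

using Point = std::vector<double>;         // hypothetical dense document vector
using Cluster = std::vector<std::size_t>;  // indices of the documents in a cluster
using ClusterSet = std::vector<Cluster>;   // one entry per cluster

static double squared_distance(const Point& a, const Point& b) {
    assert(a.size() == b.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        double d = a[i] - b[i];
        sum += d * d;
    }
    return sum;
}

// Plain K-means with a fixed iteration count. The first k points seed the
// centroids. A cluster can end up empty in general; this sketch leaves
// such a cluster's centroid unchanged.
ClusterSet kmeans(const std::vector<Point>& points, std::size_t k,
                  unsigned iterations = 10) {
    if (k == 0 || k > points.size())
        throw std::invalid_argument("k must be between 1 and the number of documents");
    std::vector<Point> centroids(points.begin(),
                                 points.begin() + static_cast<std::ptrdiff_t>(k));
    ClusterSet clusters(k);
    for (unsigned it = 0; it < iterations; ++it) {
        clusters.assign(k, Cluster());
        // Assignment step: each point joins its nearest centroid.
        for (std::size_t p = 0; p < points.size(); ++p) {
            std::size_t best = 0;
            double best_d = std::numeric_limits<double>::max();
            for (std::size_t c = 0; c < k; ++c) {
                double d = squared_distance(points[p], centroids[c]);
                if (d < best_d) { best_d = d; best = c; }
            }
            clusters[best].push_back(p);
        }
        // Update step: move each non-empty cluster's centroid to its mean.
        for (std::size_t c = 0; c < k; ++c) {
            if (clusters[c].empty()) continue;
            Point mean(points[0].size(), 0.0);
            for (std::size_t p : clusters[c])
                for (std::size_t i = 0; i < mean.size(); ++i)
                    mean[i] += points[p][i];
            for (double& x : mean) x /= static_cast<double>(clusters[c].size());
            centroids[c] = mean;
        }
    }
    return clusters;
}
```

A test in the spirit of the second test case above would call kmeans() on a handful of well-separated points, check that every returned cluster is non-empty, and check that an out-of-range k throws std::invalid_argument.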