thr3ads.net - Xapian devel - GSoC 2016 - Introduction [May 2016]

Richhiey Thomas

2016-May-05 20:59 UTC

GSoC 2016 - Introduction

Hello,

Thanks, James, for the reply. That cleared a few things up. Apologies for
replying late; I have exams going on.

I was going through the previous clustering API to understand how it worked.
It seems that the termlists used for the distance metrics are built using
TF-IDF weighting with cosine similarity, which is very similar to the
approach I would need for this project. The only difference is that in this
case, Euclidean distance would be the metric.

Would it be good to structure it in a way similar to the previous API with
a few changes?

For example, the Xapian::DocSimCosine::similarity() function itself both
computes the TF-IDF vectors and calculates the similarity. Instead, would
it be possible to have a custom weighting scheme subclassing
Xapian::Weight? This would give the user a choice of which
weighting scheme to use to create document vectors in K-means.

More ways of creating document sources should be allowed, for example from
a vector of docids that the user has.

I have also been looking at the existing test API and I'll create a new PR
for a simple test in the next 1-2 days, maybe for checking whether the
value of k is valid or checking the Euclidean distance calculations for
document vectors.

Thanks.

James Aylett

2016-May-09 10:42 UTC

GSoC 2016 - Introduction

On Fri, May 06, 2016 at 02:29:48AM +0530, Richhiey Thomas wrote:
> I was going through the previous clustering API to understand how it worked
> and it seems like the approach for construction of the termlists which
> are used for distance metrics uses TF-IDF weighting with cosine similarity,
> which is very similar to the approach I would need for this project. Just
> in this case, Euclidean distance would be the metric.
> 
> Would it be good to structure it in a way similar to the previous API with
> a few changes?
I suspect that the public API will want to be fairly similar to the
previous one, yes.
> For example, the Xapian::DocSimCosine::similarity( ) function in itself
> calculates the tf idf vectors and calculates the similarity. Instead would
> it be possible to have a custom weighting scheme sub classing
> Xapian::Weight? This can help in providing the user an option about which
> weighting scheme to use to create document vectors in K-means.
I doubt that will work. Xapian::Weight computes a single score for a
document against a given query. Similarity metrics in clustering
generally work by providing a distance between two vectors, each of
which represents a document. So the API you'll want is different to
::Weight.

It probably will be useful in future to allow for different metrics to
be used, though. That will probably involve separating creation of the
tf-idf vectors from calculating the similarity.
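
As an illustration of the separation described above, here is a minimal standalone sketch (not Xapian code). It builds a tf-idf vector in one step and leaves the metric behind a small interface, so a different metric can be swapped in later. All names here (DocVector, make_tfidf_vector, SimilarityMetric, CosineSimilarity) are hypothetical, and the plain tf * log(N/df) weighting is an assumption rather than what the previous clustering branch used.

```cpp
#include <cassert>
#include <cmath>
#include <map>
#include <string>

// Hypothetical sparse document vector: term -> tf-idf weight.
using DocVector = std::map<std::string, double>;

// Step 1: build a tf-idf vector from raw counts (assumed weighting:
// tf * log(N / df)).
//   term_freqs: term -> frequency within this document
//   doc_freqs:  term -> number of documents containing the term
//   num_docs:   total number of documents in the collection
DocVector make_tfidf_vector(const std::map<std::string, unsigned>& term_freqs,
                            const std::map<std::string, unsigned>& doc_freqs,
                            unsigned num_docs)
{
    assert(num_docs > 0);
    DocVector vec;
    for (const auto& tf : term_freqs) {
        auto df = doc_freqs.find(tf.first);
        // Skip terms with no collection statistics.
        if (df == doc_freqs.end() || df->second == 0) continue;
        double idf = std::log(static_cast<double>(num_docs) / df->second);
        vec[tf.first] = tf.second * idf;
    }
    return vec;
}

// Step 2: a metric only ever sees two finished vectors.
struct SimilarityMetric {
    virtual ~SimilarityMetric() = default;
    virtual double similarity(const DocVector& a, const DocVector& b) const = 0;
};

// One possible metric: cosine similarity.
struct CosineSimilarity : SimilarityMetric {
    double similarity(const DocVector& a, const DocVector& b) const override
    {
        double dot = 0.0, norm_a = 0.0, norm_b = 0.0;
        for (const auto& p : a) {
            norm_a += p.second * p.second;
            auto it = b.find(p.first);
            if (it != b.end()) dot += p.second * it->second;
        }
        for (const auto& p : b) norm_b += p.second * p.second;
        // Define the similarity with an all-zero vector as 0.
        if (norm_a == 0.0 || norm_b == 0.0) return 0.0;
        return dot / (std::sqrt(norm_a) * std::sqrt(norm_b));
    }
};
```

With this split, adding another metric means adding another SimilarityMetric subclass, and the vector-building code is untouched.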
> More ways of creating document sources should be allowed, for example from
> a vector of docids that the user has.
There does seem to be value (as a future extension of clustering) in
allowing people to cluster based on just a set of documents.
> I have also been looking at the existing test API and I'll create a new PR
> for a simple test in the next 1-2 days, maybe for checking whether the
> value of k is valid or checking the Euclidean distance calculations for
> document vectors.
Writing a tested Euclidean distance calculation between two document
vectors sounds reasonably small, but it does require you to decide how
the document vectors are going to be represented. I don't think
that's particularly hard, but it means you should think of it in terms
of the public APIs that will be used to construct the set of doc vecs
out of an MSet, and how they'll be passed into the clustering system
(and how you'll then get the clusters out again).
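
To make the representation question concrete, here is one possible sketch: a document vector held as a sorted map from term to weight, with the Euclidean distance computed by walking both maps in step. This is standalone C++, not part of any Xapian API; DocVector and euclidean_distance are hypothetical names, and the sparse-map representation is only one of several reasonable choices.

```cpp
#include <cmath>
#include <map>
#include <string>

// Hypothetical representation: term -> weight, kept sorted by term.
using DocVector = std::map<std::string, double>;

// Euclidean distance between two sparse vectors. A term missing from one
// vector is treated as having weight 0 there.
double euclidean_distance(const DocVector& a, const DocVector& b)
{
    double sum_sq = 0.0;
    auto ia = a.begin();
    auto ib = b.begin();
    while (ia != a.end() || ib != b.end()) {
        double diff;
        if (ib == b.end() || (ia != a.end() && ia->first < ib->first)) {
            // Term only in a.
            diff = ia->second;
            ++ia;
        } else if (ia == a.end() || ib->first < ia->first) {
            // Term only in b.
            diff = ib->second;
            ++ib;
        } else {
            // Term in both vectors.
            diff = ia->second - ib->second;
            ++ia;
            ++ib;
        }
        sum_sq += diff * diff;
    }
    return std::sqrt(sum_sq);
}
```

Two identical vectors give a distance of exactly 0, which is the property a first test could check.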

J

-- 
  James Aylett, occasional trouble-maker
  xapian.org

Richhiey Thomas

2016-May-18 20:06 UTC

GSoC 2016 - Introduction

Hello,

I had been thinking about how to write tests that help us come up with the
public API that will be used for clustering and I'd just like to describe
two tests and the way I am thinking about the API. I'd like to know whether
I'm on the right path or how this can be improved.

1) Test case to check the Euclidean distance between document vectors

DEFINE_TESTCASE(euclidian, backend)
{
    Xapian::Database db(get_database("euclidian"));
    // The data file should contain two identical sentences, indexed as two
    // different docs.
    // Get an MSet containing the two docs (proposed API; not yet written).
    Xapian::Document doc1 = mset[0].get_document();
    Xapian::Document doc2 = mset[1].get_document();
    DocSim d;
    // A distance is a real number, so store it in a double rather than an int.
    double dist = d.get_distance(doc1.termlist_begin(), doc1.termlist_end(),
                                 doc2.termlist_begin(), doc2.termlist_end(),
                                 SIMILARITY_OPTION /* in this case, Euclidean */);
    // Identical documents should be at distance 0.
    TEST(dist == 0);
}

The creation of TF-IDF vectors from the termlists of the documents will be
done inside the DocSim class. The get_distance() function calculates the
distance, and we can support more similarity measures later on. The default
can be Euclidean distance.

2) Test case to check whether clusters are valid by checking whether any
cluster is empty

DEFINE_TESTCASE(cluster1, backend)
{
    Xapian::Database db(get_database("cluster_api"));
    // Get an MSet against a query: MSet -> matches
    // k is the requested number of clusters.
    Xapian::Cluster c;
    Xapian::ClusterSet cset = c.cluster(matches, k);
    if (cset != NULL)
    {
        for (Xapian::ClusterSetIterator i = cset.begin(); i != cset.end(); ++i)
        {
            // No cluster should be empty.
            Xapian::DocumentSet d = i.get_clusterdocs();
            TEST(d.size() != 0);
        }
    }
}

The Xapian::Cluster class will contain the main clustering functionality. It
will cluster the documents and store the results in a class
Xapian::ClusterSet, which is returned by Xapian::Cluster::cluster(). This
will also contain a vector of the cluster IDs and a map from document IDs to
their associated cluster IDs.

Xapian::ClusterSet contains the cluster ID and vector of documents
belonging to that cluster. Xapian::ClusterSetIterator can be used to go
through the ClusterSet objects.

The documents belonging to a certain cluster can be retrieved by a function
which returns documents to a DocumentSet. This can again be made iterable
but I don't know how productive making a DocumentSet would be.
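
A rough standalone sketch of the result structure described above might look like the following. It is not the proposed Xapian API: ClusterSetSketch, Cluster, DocId and has_empty_cluster are hypothetical names, DocId stands in for Xapian::docid, and plain std::vector iteration stands in for ClusterSetIterator and DocumentSet.

```cpp
#include <cstddef>
#include <map>
#include <vector>

using DocId = unsigned;  // stand-in for Xapian::docid (assumption)

// One cluster: its ID and the documents assigned to it.
struct Cluster {
    unsigned id;
    std::vector<DocId> docs;
};

// Hypothetical result container: k clusters plus a docid -> cluster ID map.
class ClusterSetSketch {
    std::vector<Cluster> clusters_;
    std::map<DocId, unsigned> doc_to_cluster_;

  public:
    explicit ClusterSetSketch(unsigned k)
    {
        for (unsigned i = 0; i < k; ++i) clusters_.push_back(Cluster{i, {}});
    }

    // Record that a document belongs to a cluster. This sketch assumes each
    // document is assigned exactly once; at() throws on an invalid cluster ID.
    void assign(DocId did, unsigned cluster_id)
    {
        clusters_.at(cluster_id).docs.push_back(did);
        doc_to_cluster_[did] = cluster_id;
    }

    std::size_t size() const { return clusters_.size(); }
    std::vector<Cluster>::const_iterator begin() const { return clusters_.begin(); }
    std::vector<Cluster>::const_iterator end() const { return clusters_.end(); }

    // Look up the cluster a document was assigned to.
    unsigned cluster_of(DocId did) const { return doc_to_cluster_.at(did); }

    // The validity check from the second test case: no cluster may be empty.
    bool has_empty_cluster() const
    {
        for (const auto& c : clusters_) {
            if (c.docs.empty()) return true;
        }
        return false;
    }
};
```

Holding only docids keeps the container cheap to copy; whether a separate iterable DocumentSet is worth having on top of this is left open, as in the text above.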

This is a very rough outline of how I think the API would look. I'd like to
know if there are places where I am going wrong so I can improve on them
before the coding period starts.

Also, I apologize for not being too responsive on the mailing list, but
I have exams going on. They finish on the 26th of
this month, after which I can concentrate on the project completely.

Thanks,
Richhiey