thr3ads.net - Xapian devel - Introduction and Doubts [Mar 2016]

If this information is useful, please help other people find it:
Share via:

nirmal singhania

2016-Mar-10 00:17 UTC

Introduction and Doubts

I was not sharing it on maling list because i thought that someone can use
all ideas i proposed in their GSOC proposal.

Surely i will contribute to xapian project.
sorry if that was against the rules
The algorithm is not developed by me but after having much research on
various clustering techniques.
I found that there is a new algorithm called CLUBS(Clustering Using Binary
Splitting) which gives better results than kmeans++ and hierarchical
agglomerative clustering.
It is faster and produces good results based on various metrics of cluster
quality.

the algorithm works in following way
The first phase of the algorithm is
divisive, as the original data set(in this case, set of search documents to
cluster) is split recursively into miniclusters through successive
binary splits: the algorithm?s second phase is agglomerative since these
miniclusters
are recombined into the final result. Further, the approach induces during
execution a dynamic
hierarchical grid that will better fit the dataset w.r.t. classical grid
approaches that exploit
a fixed grid instead. Finally, the algorithm exploits the analytical
properties of the
Quadratic Sums of Squares (SSQ in the following) function to minimize the
cost of
merge and split operations.

you can refer the paper for more details about time complexity and
performance
http://web.cs.ucla.edu/~zaniolo/papers/chp%253A10.1007%252F978-3-642-37456-2_10.pdf

for implementing it,we can use Documentsource class in our previous
clustering approach and create a binary tree
and perform and topdownsplitting and then bottomup merging.
First we have to implement feature extraction from text document(TFIDF
would be a good choice) which is implemented in xapian weighting schemes.
Then we will implement function to compute distances between documents
based on normalized TF-IDF Matrix.
Based on distances we will initially assign cluster and improve on it
using topdownsplitting
and then bottomup merging.

We will compute SSQ of a cluster from all points in a cluster. and then
Improve our clustering based on SSQ measures.

The functions for topdownsplitting  will include computeAverageDeltaSSQ to
average the SSQ of all points in cluster.
The functions for bottomupmerging will include computeweightedSSQ.
.

Please Give your valuable suggestions.
Have a Nice Day.

Regards,
Nirmal Singhania
B.tech III Yr

On Wed, Mar 9, 2016 at 10:33 PM, James Aylett <james-xapian at
tartarus.org>
wrote:
> On Wed, Mar 09, 2016 at 10:27:38AM +0530, nirmal singhania wrote:
>
> > based on my some time of Research,I have in mind a clustering
algorithm
> > that can overcome Quality issues of K-means(and its variants) and
Speed
> > Issues of Hierarchical Agglomerative Clustering.
> > Theoretically it can work O(n) and Can produce results better than HAC
> > based on various metrics.
> > I can't discuss it on mailing-list but you say we can discuss more
about
> it
> > and its implementation in xapian in PM.
>
> Hi, Nirmal -- welcome to Xapian! What do you mean when you say that
> you can't discuss it on the mailing list? If you can't discuss it
in
> our public forums, then you aren't going to be able to contribute that
> work to Xapian, are you? (Since we'll need it documented clearly,
> which will likely include useful details of that discussion.)
>
> J
>
> --
>   James Aylett, occasional trouble-maker
>   xapian.org
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.xapian.org/pipermail/xapian-devel/attachments/20160310/5ac9af7f/attachment-0001.html>

James Aylett

2016-Mar-10 12:16 UTC

head link

Introduction and Doubts

On Thu, Mar 10, 2016 at 05:47:29AM +0530, nirmal singhania wrote:
> I was not sharing it on maling list because i thought that someone
> can use all ideas i proposed in their GSOC proposal.
It's usually pretty obvious to us if someone has copied parts of
someone else's proposal.
> The algorithm is not developed by me but after having much research
> on various clustering techniques.  I found that there is a new
> algorithm called CLUBS(Clustering Using Binary Splitting) which
> gives better results than kmeans++ and hierarchical agglomerative
> clustering.  It is faster and produces good results based on various
> metrics of cluster quality.
I've only skimmed the paper for now, but it certainly looks
interesting. Do you have a reason for picking TFIDF for feature
extraction? Are there other approaches that might make sense? You may
want to include in your project proposal how you intend to evaluate
the speed and accuracy of the final clustering system.

It sounds like you have a good handle on how you're going to go about
implementing CLUBS in Xapian. Having a detailed plan in your proposal
is a good way of demonstrating that you've thought through the
practical aspect of adding a feature. When writing up your proposed
timeline, remember to break things into small pieces -- no more than a
week at most, and you'll probably find some that come in shorter than
that.

J

-- 
  James Aylett, occasional trouble-maker
  xapian.org

nirmal singhania

2016-Mar-10 19:51 UTC

head link

Introduction and Doubts

Tf-idf is most used used weighting scheme is easy to understand and has
been used in other frameworks like lucene and many other places.
okapi bm25(implemented in xapian) is theoretically better/improved measure
than tf-idf and

i am looking into various other weighting scheme which are there in xapian
or can be implemented like TF-ICF(term frequecy inverse corpus
frequency),TF-RF(term frequency-relevance frequency)


for evaluating the speed and accuracy of final clustering system we can
benchmark it against various other algos like k-means,HAC based on the
measures mentioned in previous mail.(purity,F-measure,Entropy,F-Measure,Overall
Similarity,Relative Margin,Variance Ratio)

Please give your suggestions
Have a Nice day



Regards,
Nirmal Singhania
B.tech III Yr

On Thu, Mar 10, 2016 at 5:46 PM, James Aylett <james-xapian at
tartarus.org>
wrote:
> On Thu, Mar 10, 2016 at 05:47:29AM +0530, nirmal singhania wrote:
>
> > I was not sharing it on maling list because i thought that someone
> > can use all ideas i proposed in their GSOC proposal.
>
> It's usually pretty obvious to us if someone has copied parts of
> someone else's proposal.
>
> > The algorithm is not developed by me but after having much research
> > on various clustering techniques.  I found that there is a new
> > algorithm called CLUBS(Clustering Using Binary Splitting) which
> > gives better results than kmeans++ and hierarchical agglomerative
> > clustering.  It is faster and produces good results based on various
> > metrics of cluster quality.
>
> I've only skimmed the paper for now, but it certainly looks
> interesting. Do you have a reason for picking TFIDF for feature
> extraction? Are there other approaches that might make sense? You may
> want to include in your project proposal how you intend to evaluate
> the speed and accuracy of the final clustering system.
>
> It sounds like you have a good handle on how you're going to go about
> implementing CLUBS in Xapian. Having a detailed plan in your proposal
> is a good way of demonstrating that you've thought through the
> practical aspect of adding a feature. When writing up your proposed
> timeline, remember to break things into small pieces -- no more than a
> week at most, and you'll probably find some that come in shorter than
> that.
>
> J
>
> --
>   James Aylett, occasional trouble-maker
>   xapian.org
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.xapian.org/pipermail/xapian-devel/attachments/20160311/eb39a175/attachment.html>

Maybe Matching Threads

Search for more reasonably related threads

Xapian devel - Mar 2016 - Introduction and Doubts

Introduction and Doubts

Introduction and Doubts

Introduction and Doubts

Maybe Matching Threads