Hello all, I am Nirmal Singhania from NIIT University, India. I am interested in the "Clustering of search results" topic.

I have been in the field of practical machine learning and information retrieval for quite some time. I took various courses/MOOCs on information retrieval and text mining and have been working on real-life datasets (KDD99, AWID, MovieLens), because the problems you face in real-life ML/IR scenarios are different from what is taught in theory. I am also doing R&D on "Hybrid Techniques for Intrusion Detection using Data Mining and Clustering on Newer Datasets".

Taking an initial look at the docsim folder in xapian-core, these are my insights.

The clustering used is single-link agglomerative hierarchical clustering. Its time complexity is O(n^2), where n is the number of documents.

At first, choosing K-means seems a viable solution, as K-means has O(n) time complexity. But it has various shortcomings:
1) The learning algorithm requires a priori specification of the number of cluster centres.
2) Different initial partitions can result in different final clusters.
3) It does not work well with clusters of different sizes and different densities.

After that we can think of K-means++. The k-means++ algorithm addresses the first of these obstacles by specifying a procedure to initialize the cluster centres before proceeding with the standard k-means optimization iterations, but it is a little slow due to the cluster initialization.

Then we can think of bisecting K-means, which is better than K-means, but the bisecting K-means algorithm is a divisive hierarchical clustering algorithm. It is a little faster than the original K-means, but the results of clustering are poorer than hierarchical agglomerative clustering based on various metrics of cluster quality such as entropy, F-measure, overall similarity, relative margin and variance ratio.
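To make the k-means++ seeding step concrete, here is a rough Python sketch (illustrative only, not Xapian code; the function names are mine): each new centre is drawn with probability proportional to its squared distance from the nearest centre already chosen, which is the extra work that makes initialization slower than plain random seeding.

```python
import random

def kmeans_pp_init(points, k, dist):
    """k-means++ seeding: pick the first centre uniformly at random,
    then pick each later centre with probability proportional to its
    squared distance from the nearest centre chosen so far."""
    centres = [random.choice(points)]
    while len(centres) < k:
        # Squared distance from each point to its nearest current centre.
        d2 = [min(dist(p, c) ** 2 for c in centres) for p in points]
        total = sum(d2)
        r = random.uniform(0, total)
        acc = 0.0
        for p, w in zip(points, d2):
            acc += w
            if acc >= r:
                centres.append(p)
                break
    return centres

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
```

The standard k-means iterations then start from these centres instead of arbitrary ones.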
Based on my research so far, I have in mind a clustering algorithm that can overcome the quality issues of K-means (and its variants) and the speed issues of hierarchical agglomerative clustering. Theoretically it can run in O(n) and can produce results better than HAC based on various metrics. I can't discuss it on the mailing list, but if you agree we can discuss more about it and its implementation in Xapian in PM.

Thank you for your time.

Regards,
Nirmal Singhania
B.Tech III Yr
And yes, the similarity measure for document similarity is cosine similarity. For the algorithm I proposed in the trailing mail, I would have to implement a Euclidean distance similarity measure and tweak it to make it work well with the algorithm. Waiting for your suggestions.

Regards,
Nirmal Singhania
B.Tech III Yr

On Wed, Mar 9, 2016 at 10:27 AM, nirmal singhania <nirmal.singhania at st.niituniversity.in> wrote:
> Hello All, I am Nirmal Singhania from NIIT University, India.
> I am interested in the Clustering of search results topic.
> [...]
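The switch between the two measures is smaller than it may look. For L2-normalised vectors (which TF-IDF document vectors usually are), squared Euclidean distance and cosine similarity are related by the identity ||a - b||^2 = 2 * (1 - cos(a, b)), so they induce the same nearest-neighbour ordering. A small illustrative Python sketch (my own helper names, not Xapian API):

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

# For unit-length vectors, ||a - b||^2 == 2 * (1 - cos(a, b)),
# so ranking by Euclidean distance matches ranking by cosine similarity.
a = normalize([1.0, 2.0, 0.0])
b = normalize([2.0, 1.0, 1.0])
lhs = euclidean_distance(a, b) ** 2
rhs = 2 * (1 - cosine_similarity(a, b))
```

So the "tweak" is mostly about normalising the vectors before clustering.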
On Wed, Mar 09, 2016 at 10:27:38AM +0530, nirmal singhania wrote:
> based on my some time of Research, I have in mind a clustering algorithm
> that can overcome Quality issues of K-means (and its variants) and Speed
> Issues of Hierarchical Agglomerative Clustering.
> Theoretically it can work O(n) and Can produce results better than HAC
> based on various metrics.
> I can't discuss it on mailing-list but you say we can discuss more about it
> and its implementation in xapian in PM.

Hi, Nirmal -- welcome to Xapian! What do you mean when you say that you can't discuss it on the mailing list? If you can't discuss it in our public forums, then you aren't going to be able to contribute that work to Xapian, are you? (Since we'll need it documented clearly, which will likely include useful details of that discussion.)

J

--
James Aylett, occasional trouble-maker
xapian.org
I was not sharing it on the mailing list because I thought that someone could use all the ideas I proposed in their GSoC proposal. I will certainly contribute to the Xapian project; sorry if that was against the rules.

The algorithm was not developed by me. After much research on various clustering techniques, I found that there is a new algorithm called CLUBS (Clustering Using Binary Splitting) which gives better results than K-means++ and hierarchical agglomerative clustering. It is faster and produces good results based on various metrics of cluster quality.

The algorithm works in the following way. The first phase is divisive, as the original data set (in this case, the set of search documents to cluster) is split recursively into miniclusters through successive binary splits; the second phase is agglomerative, since these miniclusters are recombined into the final result. Further, the approach induces during execution a dynamic hierarchical grid that fits the dataset better than classical grid approaches, which use a fixed grid instead. Finally, the algorithm exploits the analytical properties of the quadratic sum of squares (SSQ in the following) function to minimize the cost of merge and split operations.

You can refer to the paper for more details about time complexity and performance:
http://web.cs.ucla.edu/~zaniolo/papers/chp%253A10.1007%252F978-3-642-37456-2_10.pdf

For implementing it, we can use the DocumentSource class from our previous clustering approach, create a binary tree, and perform top-down splitting and then bottom-up merging. First we have to implement feature extraction from the text documents (TF-IDF would be a good choice), which is implemented in Xapian's weighting schemes. Then we will implement a function to compute distances between documents based on the normalized TF-IDF matrix. Based on those distances we will make an initial cluster assignment and improve on it using top-down splitting and then bottom-up merging.
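As a toy illustration of the feature-extraction step, here is a plain-Python sketch of building L2-normalised TF-IDF vectors from tokenised documents (in the real implementation these statistics would come from Xapian's weighting schemes; the function name is mine):

```python
import math
from collections import Counter

def tfidf_matrix(docs):
    """Build L2-normalised TF-IDF row vectors for a list of token lists.
    Toy sketch of the feature extraction step only."""
    n = len(docs)
    df = Counter()                       # document frequency per term
    for doc in docs:
        df.update(set(doc))
    vocab = sorted(df)
    rows = []
    for doc in docs:
        tf = Counter(doc)
        vec = [tf[t] * math.log(n / df[t]) for t in vocab]
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        rows.append([x / norm for x in vec])  # unit-length row
    return vocab, rows
```

The distance function for the splitting/merging phases would then operate on these rows.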
We will compute the SSQ of a cluster from all the points in the cluster, and then improve our clustering based on SSQ measures. The functions for top-down splitting will include computeAverageDeltaSSQ to average the SSQ of all points in a cluster. The functions for bottom-up merging will include computeweightedSSQ.

Please give your valuable suggestions. Have a nice day.

Regards,
Nirmal Singhania
B.Tech III Yr

On Wed, Mar 9, 2016 at 10:33 PM, James Aylett <james-xapian at tartarus.org> wrote:
> Hi, Nirmal -- welcome to Xapian! What do you mean when you say that
> you can't discuss it on the mailing list? [...]
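For anyone unfamiliar with the SSQ quantities mentioned above: the SSQ of a cluster is the sum of squared distances of its points to their centroid, and the cost of merging two clusters decomposes analytically, which is what makes the merge/split bookkeeping cheap. A minimal Python sketch of the underlying quantities (the computeAverageDeltaSSQ/computeweightedSSQ names above are from my proposal; these helpers are just illustrative):

```python
def centroid(points):
    d = len(points[0])
    return [sum(p[i] for p in points) / len(points) for i in range(d)]

def ssq(points):
    """Sum of squared Euclidean distances from each point to the centroid."""
    c = centroid(points)
    return sum(sum((x - y) ** 2 for x, y in zip(p, c)) for p in points)

def merge_delta_ssq(a, b):
    """Increase in SSQ caused by merging clusters a and b:
    SSQ(a+b) = SSQ(a) + SSQ(b) + |a||b|/(|a|+|b|) * ||c_a - c_b||^2."""
    ca, cb = centroid(a), centroid(b)
    na, nb = len(a), len(b)
    return (na * nb / (na + nb)) * sum((x - y) ** 2 for x, y in zip(ca, cb))
```

The decomposition means the merge cost needs only cluster sizes and centroids, not a pass over all the points.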