Hello devs.
I would like to propose how I plan to go about improving and getting a
system that can be integrated into Xapian in this GSoC for the clustering
branch.
I have identified three areas of work which were not touched last time.
1) Automated Performance Analysis
I had roughly implemented 2 evaluation techniques previously (Distance b/w
document and centroids within clusters and Silhouette coefficient) but I
hadn't implemented them within Xapian, and thus it wasn't possible to
automate the process of evaluating the clustering results in the
ClusterSet. It is thus important to implement cluster evaluation techniques
within a module (as a ClusterEvaluation class) so that users can get output
on how they can improve their clustering by passing in the ClusterSet (and
the labels if necessary).
The cluster evaluation techniques that I would like to consider are :
      a) Silhouette coefficient
      b) Adjusted Rand Index
      c) Fowlkes Mallows index
      d) F - Measure
      e) Homogeneity, Completeness and V-Measure
2) Dimensionality Reduction
Due to high dimensionality of text documents, it is necessary to have
atleast one semantic dimensionality reduction technique. For this, I would
like to implement Latent Semantic Analysis for dimensionality reduction of
input document vectors.
LSA transforms the original data in a different space so that two
documents/words about the same concept are mapped close (so that they have
higher cosine similarity). LSA achieves this by Singular Value
Decomposition (SVD) of term-document matrix.
What I have found currently is that when we eliminate words that occur
rarely in documents, we can have the algorithm run very fast. The main
problem in runtime performance stems out of document vectors ending up to
be very high dimensional.
3) Hierarchical Clustering
Since the clustering API is already in place, I would like to implement a
hierarchical clusterer to cluster the search results.
Currently, I have created a new PR for reviewing the work done previously
so that it can be merged as soon as possible and trying to optimize the
code and find out different bottlenecks, so that speed can be improved.
It would be great to have feedback on what everyone thinks about this so
that I can re-implement or improve things and make them better.
Thanks. :)
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.xapian.org/pipermail/xapian-devel/attachments/20170309/6f875819/attachment.html>
On 9 Mar 2017, at 05:18, Richhiey Thomas <richhiey.thomas at gmail.com> wrote:> Hello devs.Hi Richhiey :-)> 1) Automated Performance AnalysisThis is definitely interesting. However it's not terribly worth having this unless/until we have more than one clustering system to evaluate, is it? (Beyond the uniform/random one, although I guess if any clusterer performs _worse_ than that, it's a bad sign!)> 2) Dimensionality ReductionDo you think we'll need to implement several of these, for different uses? If not, is there a reason you think LSA will work best? You talk about eliminating words that occur rarely in documents — could we have a quick-and-dirty approach that looks at the within-corpus frequency of terms? (If multiple different approaches makes sense, we'll end up having each encapsulated as a class or similar, in the same way we're doing with Clusterer. It's worth having an idea of this up front, in case it changes the approach.)> 3) Hierarchical Clustering > Since the clustering API is already in place, I would like to implement a hierarchical clusterer to cluster the search results.Do you have a particular approach you think is a good one? Are you thinking agglomerative or divisive? J -- James Aylett devfort.com — spacelog.org — tartarus.org/james/
> This is definitely interesting. However it's not terribly worth having > this unless/until we have more than one clustering system to evaluate, is > it? (Beyond the uniform/random one, although I guess if any clusterer > performs _worse_ than that, it's a bad sign!) >This module can still help us with the KMeans clusterer implemented and since I would like to implement a hierarchal clusterer, it could help in relative comparison too. Before, I was looking at both internal as well as external evaluation techniques. But I guess a regular use case of this API will not provide a way to have ground truth labels for documents. Thus internal evaluation techniques would be the better option. I would thus like to change my approach and introduce a few internal clustering evaluation techniques. They are: 1) Silhouette coefficient 2) Dunn Index 3) Root Mean Square Standard Deviation 4) Calinski-Harabasz index 5) Davies-Bouldin index I would be trying to aim at implementing this Performance analysis module by the end of the community bonding period. Also, once this is set up, it would be easier to code and evaluate newer clusterers, with minimal changes in API.> > Do you think we'll need to implement several of these, for different uses? > If not, is there a reason you think LSA will work best? You talk about > eliminating words that occur rarely in documents — could we have a > quick-and-dirty approach that looks at the within-corpus frequency of terms? >I did try the quick-and-dirty approach that you mentioned, but looking at within-corpus frequency removes a lot of words that could otherwise add meaning. This will also be corpus-dependent, and hence a bad idea. For now, removing stop words and the stemmed duplicates within the Document has helped, but it could be better to add functionality for semantic dimensionality reduction like LSA. LSA would work best because over various other methods, it is a text mining tool rather than a statistical tool. I'm not sure whether implementing more than one technique will be in the scope of GSoC, but I see no harm in believing that we could use more. So we could create a class like DimReduction and sub-class it for implementing various techniques.> > > Do you have a particular approach you think is a good one? Are you > thinking agglomerative or divisive? >I was thinking of agglomerative clustering where we start from individual clusters going to a cluster containing all documents. This would be fairly simple to implement since we have the API fairly in place. We would only need to find a way to merge two clusters while going up the hierarchy tree. So as we start off, we can initialize multiple clusters having their own documents with the Cluster class, and then at each step, merging upwards by merging two of the Cluster object contents into one and calculating new cluster centroids. We just need to do this iteratively till the number of clusters end up being one. I will document all my findings from this conversation and my previous ideas into a proposal and send it soon. Thanks. Richhiey -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.xapian.org/pipermail/xapian-devel/attachments/20170320/10885eed/attachment.html>