search for: docsim

Displaying 5 results from an estimated 5 matches for "docsim".

2014 Dec 30
2
Help with xapian
Hi,
Can someone tell me what Gaurav Arora's exact contribution was to the Clustering Search Results part during GSoC 2014? I think that would be more helpful in understanding his code.
Regards,
Karthik

On Tue, Dec 16, 2014 at 4:06 AM, Olly Betts <olly at survex.com> wrote:
> On Mon, Dec 15, 2014 at 06:56:39PM +0530, karthik iyer wrote:
>> Could some one tell me some specific
2016 Mar 06
3
GSOC-2016 Project : Clustering of search results
...irst off, the distance metric used in the current implementation is the cosine measure. Cosine is useful, but K-means implicitly uses Euclidean distance as its measure of similarity between two document term vectors. Hence, simply adding one more distance-metric class that inherits from the DocSim base class should be enough. Using tf-idf, we can compute term weights and then, instead of using these vectors for cosine similarity, compute the Euclidean distance between them. With a similarity measure in place, we can initialize the k centroids using k-means++, an algorithm used for choosing the initial...
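To make the distinction in the snippet above concrete, here is a minimal Python sketch (not Xapian code) of cosine similarity and Euclidean distance over sparse term-weight vectors, followed by k-means++ seeding. Representing vectors as dicts and the function names `cosine_similarity`, `euclidean_distance` and `kmeans_pp_seeds` are assumptions made for illustration; they are not part of the DocSim API.

```python
import math
import random


def cosine_similarity(a, b):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Euclidean distance between two sparse term-weight vectors (dicts)."""
    terms = set(a) | set(b)
    return math.sqrt(sum((a.get(t, 0.0) - b.get(t, 0.0)) ** 2 for t in terms))


def kmeans_pp_seeds(vectors, k, rng):
    """Choose k initial centroids with k-means++ seeding.

    The first seed is picked uniformly at random; each later seed is picked
    with probability proportional to its squared distance from the nearest
    seed chosen so far.
    """
    seeds = [rng.choice(vectors)]
    while len(seeds) < k:
        d2 = [min(euclidean_distance(v, s) ** 2 for s in seeds) for v in vectors]
        total = sum(d2)
        if total == 0.0:
            # Every vector coincides with a seed; fall back to a uniform pick.
            seeds.append(rng.choice(vectors))
            continue
        r = rng.random() * total
        acc = 0.0
        for v, d in zip(vectors, d2):
            acc += d
            if acc > r:
                seeds.append(v)
                break
        # If rounding left acc <= r, nothing was appended and the loop retries.
    return seeds
```

Two vectors pointing in the same direction, such as `{"x": 1.0}` and `{"x": 2.0}`, have cosine similarity 1 but a non-zero Euclidean distance, which is why the choice of metric changes what K-means treats as "close".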
2016 May 05
2
GSoC 2016 - Introduction
...e metrics use TF-IDF weighting with cosine similarity, which is very similar to the approach I would need for this project, except that in this case Euclidean distance would be the metric. Would it be good to structure it in a way similar to the previous API, with a few changes? For example, the Xapian::DocSimCosine::similarity() function itself calculates the TF-IDF vectors and then calculates the similarity. Instead, would it be possible to have a custom weighting scheme subclassing Xapian::Weight? This would give the user a choice of which weighting scheme to use to create document vecto...
2016 Mar 09
3
Introduction and Doubts
...KDD99, AWID, Movielens). This is because the problems you face in real-life ML/IR scenarios are different from what is taught in theory. I am also working on R&D on "Hybrid Techniques for Intrusion Detection using Data Mining and Clustering on Newer Datasets". I have taken an initial look at the docsim folder in xapian-core, and these are my insights: The clustering used is single-link agglomerative hierarchical clustering. Its time complexity is O(n^2), where n is the number of documents. At first, choosing K-means seems to be a viable solution, as K-means has O(n) time complexity per iteration for a fixed number of clusters. But it has various shortcomings...
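For readers unfamiliar with the technique named above, the following is a naive Python sketch of single-link agglomerative clustering. It is an illustration only, not the code in the docsim folder: the function names and the use of dense coordinate tuples are assumptions. Note that this naive version recomputes cluster distances on every merge and is therefore slower than O(n^2); reaching O(n^2) needs a more careful algorithm such as SLINK.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two equal-length coordinate tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def single_link(points, num_clusters):
    """Naive single-link agglomerative clustering.

    Starts with one cluster per point and repeatedly merges the two clusters
    whose closest members are nearest to each other, until num_clusters
    clusters remain. Returns a sorted list of sorted index lists.
    """
    if num_clusters < 1:
        raise ValueError("num_clusters must be at least 1")
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > num_clusters:
        best = None  # (distance, index of first cluster, index of second)
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single link: cluster distance is the minimum pairwise distance.
                d = min(
                    euclidean(points[a], points[b])
                    for a in clusters[i]
                    for b in clusters[j]
                )
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters[j])
        del clusters[j]
    return sorted(sorted(c) for c in clusters)
```

Unlike K-means, this procedure needs no initial centroids and is deterministic for a given input, which is part of the trade-off against its higher cost.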
2016 Mar 05
2
GSOC-2016 Project : Clustering of search results
Hello devs, I am Richhiey Thomas, in my third year of undergraduate studies in Computer Science at Mumbai University. I went through the project list for this year, and the project idea based on clustering caught my attention. I spoke to Assem Chelli on IRC, who guided me to the code and got me started. I have started going through the code and have successfully built Xapian on my machine.