thr3ads.net - search: "docsim"

Displaying 5 results from an estimated 5 matches for "docsim".

2014 Dec 30

Help with xapian

Hi,
Can someone tell me what Gaurav Arora's exact contribution was in the Clustering Search Results part during GSoC 2014? I guess that will be more helpful in understanding his code.
Regards,
Karthik

On Tue, Dec 16, 2014 at 4:06 AM, Olly Betts <olly at survex.com> wrote:
> On Mon, Dec 15, 2014 at 06:56:39PM +0530, karthik iyer wrote:
>> Could some one tell me some specific

2016 Mar 06

GSOC-2016 Project : Clustering of search results

...irst off, the distance metric used in the current implementation is the cosine measure. Though useful, K-means implicitly uses Euclidean distance as a measure of document similarity between two document term vectors. Hence, simply creating one more class for a distance metric by inheriting from the DocSim base class should suffice. Using the tf-idf weights, we can build the term-weight vectors and, instead of using these vectors for cosine similarity, compute the Euclidean distance between them. With a similarity measure in place, we can initialize the k centroids using k-means++, an algorithm used for choosing the initial...
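The idea in this snippet (same tf-idf vectors, Euclidean distance instead of cosine, then k-means++ seeding) can be illustrated with a minimal Python sketch. This is not Xapian's DocSim code: the function names, the sparse-dict vector representation, and the smoothed idf formula are all assumptions made for illustration.

```python
import math
import random

def tfidf_vectors(docs):
    """Build sparse tf-idf vectors (dict: term -> weight) from tokenised docs."""
    n = len(docs)
    df = {}
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    vectors = []
    for doc in docs:
        vec = {}
        for term in doc:
            vec[term] = vec.get(term, 0.0) + 1.0  # raw term frequency
        for term in vec:
            # Smoothed idf; the exact formula is an assumption of this sketch.
            vec[term] *= math.log(n / df[term]) + 1.0
        vectors.append(vec)
    return vectors

def cosine_similarity(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def euclidean_distance(a, b):
    terms = set(a) | set(b)
    return math.sqrt(sum((a.get(t, 0.0) - b.get(t, 0.0)) ** 2 for t in terms))

def kmeanspp_seeds(vectors, k, rng=None):
    """k-means++ seeding: pick each new seed with probability proportional
    to its squared distance from the nearest seed chosen so far."""
    rng = rng or random.Random(0)
    seeds = [rng.choice(vectors)]
    while len(seeds) < k:
        d2 = [min(euclidean_distance(v, s) ** 2 for s in seeds) for v in vectors]
        if sum(d2) == 0:
            # Every vector coincides with a seed; fall back to a uniform pick.
            seeds.append(rng.choice(vectors))
        else:
            seeds.append(rng.choices(vectors, weights=d2, k=1)[0])
    return seeds
```

Both measures read the same vectors, so swapping the metric does not require changing how the vectors are built.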

2016 May 05

GSoC 2016 - Introduction

...e metrics use TF-IDF weighting with cosine similarity, which is very similar to the approach I would need for this project. Only in this case, Euclidean distance would be the metric. Would it be good to structure it in a way similar to the previous API, with a few changes? For example, the Xapian::DocSimCosine::similarity() function itself both calculates the TF-IDF vectors and calculates the similarity. Instead, would it be possible to have a custom weighting scheme subclassing Xapian::Weight? This could give the user an option about which weighting scheme to use to create document vecto...
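The separation proposed in this snippet, a distance measure that takes the weighting scheme as a pluggable part rather than computing TF-IDF itself, might look roughly like the Python sketch below. It does not use Xapian::Weight or any Xapian API; `EuclideanMetric`, `tf_weight` and `tfidf_weight` are hypothetical names chosen for illustration.

```python
import math
from collections import Counter

def tf_weight(tf, df, n_docs):
    """Hypothetical weighting scheme: raw term frequency only."""
    return float(tf)

def tfidf_weight(tf, df, n_docs):
    """Hypothetical weighting scheme: tf * log(N / df)."""
    return tf * math.log(n_docs / df)

class EuclideanMetric:
    """Distance measure that delegates term weighting to a supplied scheme."""

    def __init__(self, weight):
        self.weight = weight  # callable (tf, df, n_docs) -> float

    def vector(self, doc, df, n_docs):
        # Build a sparse document vector using the injected weighting scheme.
        return {t: self.weight(tf, df[t], n_docs)
                for t, tf in Counter(doc).items()}

    def distance(self, a, b, df, n_docs):
        va = self.vector(a, df, n_docs)
        vb = self.vector(b, df, n_docs)
        terms = set(va) | set(vb)
        return math.sqrt(sum((va.get(t, 0.0) - vb.get(t, 0.0)) ** 2
                             for t in terms))
```

With this split, `EuclideanMetric(tf_weight)` and `EuclideanMetric(tfidf_weight)` share all of the distance code and differ only in how the document vectors are weighted.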

2016 Mar 09

Introduction and Doubts

...KDD99, AWID, MovieLens). Because the problems you face in real-life ML/IR scenarios are different from what is taught in theory. I am also working on R&D on "Hybrid Techniques for Intrusion Detection using Data Mining and Clustering on Newer Datasets". I took an initial look at the docsim folder in xapian-core. These are my insights: the clustering used is single-link agglomerative hierarchical clustering. Its time complexity is O(n^2) for n = number of documents. At first, choosing K-means seems to be a viable solution, as K-means has O(n) time complexity. But it has various shortcomings...
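To make the quadratic cost mentioned in this snippet concrete, here is a minimal Python sketch of single-link agglomerative clustering. It is not the xapian-core docsim implementation: the function name and the union-find approach are assumptions for illustration. It enumerates all n(n-1)/2 pairwise distances, which is where the quadratic term comes from, and sorting them adds a logarithmic factor, so this naive version is slightly worse than the O(n^2) quoted above.

```python
from itertools import combinations

def single_link_clusters(points, k, dist):
    """Merge the closest pair of clusters until only k clusters remain.

    Single link: the distance between two clusters is the distance between
    their closest members, so processing pairs in ascending order of
    distance and merging with union-find gives the same result.
    """
    parent = list(range(len(points)))

    def find(i):
        # Find the cluster representative, compressing the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All n(n-1)/2 pairs, sorted by distance: the quadratic part.
    pairs = sorted(combinations(range(len(points)), 2),
                   key=lambda p: dist(points[p[0]], points[p[1]]))
    clusters = len(points)
    for i, j in pairs:
        if clusters <= k:
            break
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[rj] = ri
            clusters -= 1

    groups = {}
    for i in range(len(points)):
        groups.setdefault(find(i), []).append(points[i])
    return sorted(groups.values())
```

By contrast, one K-means assignment pass compares each of the n documents with only k centroids, which is linear in n for a fixed k.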

2016 Mar 05

GSOC-2016 Project : Clustering of search results

Hello devs, I am Richhiey Thomas, pursuing my third year of undergraduate studies in Computer Science at Mumbai University. I went through the project list for this year, and the project idea based on clustering caught my attention. I spoke to Assem Chelli on IRC, who guided me to the code and got me started. I have started going through the code and have successfully built Xapian on my machine.
