Displaying 5 results from an estimated 5 matches for "docsim".
2014 Dec 30
2
Help with xapian
Hi,
Can someone tell me what Gaurav Arora's exact contribution was to the
Clustering Search Results part during GSoC 2014? I think that would
help me understand his code.
Regards
Karthik
On Tue, Dec 16, 2014 at 4:06 AM, Olly Betts <olly at survex.com> wrote:
> On Mon, Dec 15, 2014 at 06:56:39PM +0530, karthik iyer wrote:
>> Could some one tell me some specific
2016 Mar 06
3
GSOC-2016 Project : Clustering of search results
...irst off, the distance metric used in the current implementation is the
cosine measure. Though useful, K-means implicitly uses Euclidean distance
as its measure of document similarity between two document term vectors.
Hence, simply adding one more class for this distance metric by
inheriting from the DocSim base class should be enough. Using the tf-idf
weights, we can compute term weights and, instead of using these vectors
for cosine similarity, compute the Euclidean distance between them.
With a similarity measure in place, we can initialize the k centroids using
k-means++, an algorithm used for choosing the initial...
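
To make the contrast between the two metrics concrete, here is a minimal C++ sketch that computes both over sparse term vectors. It is only an illustration under stated assumptions: `TermVector`, `dot`, `cosine_similarity` and `euclidean_distance` are hypothetical names, not part of Xapian's DocSim classes, and the weights are assumed to be tf-idf values computed elsewhere.

```cpp
#include <cassert>
#include <cmath>
#include <map>
#include <string>

// Hypothetical sparse term vector: term -> tf-idf weight (not a Xapian type).
using TermVector = std::map<std::string, double>;

// Dot product over the terms that both vectors contain.
double dot(const TermVector& a, const TermVector& b) {
    double sum = 0.0;
    for (const auto& entry : a) {
        auto it = b.find(entry.first);
        if (it != b.end()) sum += entry.second * it->second;
    }
    return sum;
}

// Cosine of the angle between the two vectors; 0 if either vector is empty.
double cosine_similarity(const TermVector& a, const TermVector& b) {
    double norm_a = std::sqrt(dot(a, a));
    double norm_b = std::sqrt(dot(b, b));
    if (norm_a == 0.0 || norm_b == 0.0) return 0.0;
    return dot(a, b) / (norm_a * norm_b);
}

// Euclidean distance, using ||a - b||^2 = a.a + b.b - 2 a.b so that only
// the sparse dot product is needed.
double euclidean_distance(const TermVector& a, const TermVector& b) {
    double squared = dot(a, a) + dot(b, b) - 2.0 * dot(a, b);
    // Guard against tiny negative values caused by floating-point rounding.
    return std::sqrt(squared < 0.0 ? 0.0 : squared);
}
```

Note that for vectors normalized to unit length, the squared Euclidean distance equals 2 * (1 - cosine similarity), so the two metrics rank pairs of documents identically once the vectors are normalized.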
2016 May 05
2
GSoC 2016 - Introduction
...e metrics use TF-IDF weighting with cosine similarity,
which is very similar to the approach I would need for this project. Just
in this case, Euclidean distance would be the metric.
Would it be good to structure it in a way similar to the previous API with
a few changes?
For example, the Xapian::DocSimCosine::similarity() function itself
computes the tf-idf vectors and then calculates the similarity. Instead, would
it be possible to have a custom weighting scheme subclassing
Xapian::Weight? This can help in providing the user an option about which
weighting scheme to use to create document vecto...
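
One way to picture the separation being asked about is sketched below in C++: the weighting scheme is passed in when the document vector is built, so a similarity or distance function never needs to know how the weights were produced. Everything here is an assumption made for illustration. `TermStats`, `TermWeighter`, `TfIdfWeighter`, `RawTfWeighter` and `build_vector` are hypothetical names, and `TermWeighter` is deliberately not `Xapian::Weight`, whose real interface is different.

```cpp
#include <cassert>
#include <cmath>
#include <map>
#include <string>

// Hypothetical per-term statistics for one document (not a Xapian type).
struct TermStats {
    double tf;  // occurrences of the term in this document
    double df;  // number of documents in the collection containing the term
};

// Hypothetical weighting interface; NOT Xapian::Weight.
class TermWeighter {
public:
    virtual ~TermWeighter() = default;
    virtual double weight(const TermStats& stats, double num_docs) const = 0;
};

// Plain tf * log(N / df) weighting.
class TfIdfWeighter : public TermWeighter {
public:
    double weight(const TermStats& stats, double num_docs) const override {
        assert(stats.df > 0.0 && num_docs >= stats.df);
        return stats.tf * std::log(num_docs / stats.df);
    }
};

// Raw term frequency, ignoring collection statistics.
class RawTfWeighter : public TermWeighter {
public:
    double weight(const TermStats& stats, double) const override {
        return stats.tf;
    }
};

using TermVector = std::map<std::string, double>;

// Build a document vector with whichever weighting scheme the caller picks.
TermVector build_vector(const std::map<std::string, TermStats>& doc,
                        double num_docs, const TermWeighter& weighter) {
    TermVector vec;
    for (const auto& entry : doc) {
        vec[entry.first] = weighter.weight(entry.second, num_docs);
    }
    return vec;
}
```

With this shape, switching from tf-idf to another scheme means passing a different `TermWeighter` to `build_vector`, while the metric code stays unchanged.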
2016 Mar 09
3
Introduction and Doubts
...KDD99, AWID, MovieLens).
Because the problems you face in a real-life ML/IR scenario are
different from what is taught in theory. I am also working on R&D on "Hybrid
Techniques for Intrusion Detection using Data Mining and Clustering on
Newer Datasets".
I took an initial look at the docsim folder in xapian-core.
These are my observations:
The clustering used is single-link agglomerative hierarchical clustering.
Its time complexity is O(n^2) for n = number of documents.
At first, choosing K-means seems to be a viable solution, as K-means has O(n)
time complexity.
But it has various shortcomings...
2016 Mar 05
2
GSOC-2016 Project : Clustering of search results
Hello devs,
I am Richhiey Thomas, pursuing my third year of undergraduate studies in
Computer Science at Mumbai University. I went through the project
list for this year and the project idea based on clustering caught my
attention. I spoke to Assem Chelli on IRC, who guided me to the code and got
me started.
I started going through the code and have successfully built Xapian on my
machine.