similar to: KMeans Clusterer - Going forward

Displaying 20 results from an estimated 1000 matches similar to: "KMeans Clusterer - Going forward"

2009 Mar 26
ideas on picking stopwords
I'm looking at adding some stopwords to my indexing procedure, and was wondering if anyone had any good rules of thumb on how to pick which words to blacklist. It all seems a little... well... vague. Although I guess it kind of depends on the sort of documents you're wanting to index. My current idea is to write a little script to output the terms with the highest frequency in my
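The idea in the excerpt above, listing the highest-frequency terms as stopword candidates, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper name `top_terms`, the simple regex tokenizer, and the two sample documents are all hypothetical and do not come from the original post or from Xapian.

```python
from collections import Counter
import re

def top_terms(documents, n=10):
    """Return the n most frequent lowercase word tokens across all documents.

    Hypothetical helper: the tokenizer is a plain regex, not Xapian's
    TermGenerator, so counts will differ from what an index would hold.
    """
    counts = Counter()
    for text in documents:
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return counts.most_common(n)

# Made-up sample documents, for illustration only.
docs = ["the cat sat on the mat", "the dog chased the cat"]
candidates = top_terms(docs, n=2)
print(candidates)
```

High-frequency terms are only candidates; whether a term is safe to blacklist still depends on the kind of documents being indexed, as the post itself notes.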
2008 Mar 27
Proper noun stemming
Hi All, I was wondering if anyone had a solution for the following problem. I use QueryParser to stem my documents before adding them to a database. During the stemming process I would like to find a way of keeping proper nouns that span two or more words together as a phrase. For example "New York" or "Gordon Brown" or "Prime Minister" get split up. I see
2008 Mar 12
how can i use stopwords?
Hi, I do not understand the stopword function... I've set the termgenerator like this: $self->{'Stemmer'} = new Search::Xapian::Stem(german2); $self->{'Stopper'} = new Search::Xapian::SimpleStopper(); $self->{'TermGenerator'} = new Search::Xapian::TermGenerator; $self->{'TermGenerator'}->set_stemmer( $self->{'Stemmer'} );
2016 Aug 19
KMeans - Evaluation Results
On 18 Aug 2016, at 23:59, Richhiey Thomas <richhiey.thomas at> wrote: > I've currently added a few classes which don't really belong to the public API (currently) into private headers and used PIMPL with the Cluster class. I'm having difficulty reading your changes, because you aren't keeping to one complete change per commit. So for instance you've added a
2016 Aug 17
KMeans - Evaluation Results
On Wed, Aug 17, 2016 at 7:23 PM, James Aylett <james-xapian at> wrote: > >> How long does 200-300 documents take to cluster? How does it grow as > more documents are included in the MSet? We'd expect an MSet of 1000 > documents to take longer to cluster than one with 100, but the important > thing is _how_ the time increases as the number of documents
2016 Aug 17
KMeans - Evaluation Results
I've gone through the link that you sent me and I currently understand how this helps and works to some extent, but I am not too sure of how I should start with converting the current interface to PIMPL design. I'm not used to this design pattern so it's taking some time to sink in :) Say I start with the Clusterer class, I create a ClustererImpl class which is the internal class that
2016 Aug 18
KMeans - Evaluation Results
> > > > Actually, you're doing something slightly unusual there: making the > internal member public. Protected would be better, and private is I think > most usual; library clients aren't going to have access to the Internal > class declaration, so they can't call things on it. This means it's > actually difficult right now to subclass Feature. > > I
2016 Aug 17
KMeans - Evaluation Results
> How long does 200-300 documents take to cluster? How does it grow as more > documents are included in the MSet? We'd expect an MSet of 1000 documents > to take longer to cluster than one with 100, but the important thing is > _how_ the time increases as the number of documents grows. > > Currently, the number of seconds taken for clustering a set of documents for varying
2016 Aug 15
KMeans - Evaluation Results
Hello, I've recently finished with an implementation of KMeans with two initialization techniques, random initialization and KMeans++. I would like to share my findings after evaluating the same. I have tested this implementation of KMeans with a BBC news article dataset. I am currently working on evaluating the same with FIRE datasets. Currently, clustering more than 500 documents
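For readers unfamiliar with the second initialization technique named above, here is a minimal sketch of KMeans++ seeding in plain Python. It is not the implementation discussed in the thread: the function names `squared_distance` and `kmeans_pp_seeds`, the dense tuple representation of points, and the sample data are all assumptions made for illustration.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_pp_seeds(points, k, rng=None):
    """Pick k initial centroids using KMeans++ style seeding.

    The first seed is chosen uniformly at random; each later seed is
    drawn with probability proportional to its squared distance from
    the nearest seed chosen so far.
    """
    rng = rng or random.Random()
    seeds = [rng.choice(points)]
    while len(seeds) < k:
        weights = [min(squared_distance(p, s) for s in seeds) for p in points]
        if sum(weights) == 0:
            # Every point coincides with an existing seed: fall back to uniform.
            seeds.append(rng.choice(points))
        else:
            seeds.append(rng.choices(points, weights=weights, k=1)[0])
    return seeds

# Made-up 2-D points forming two well-separated groups.
points = [(0.0, 0.0), (0.0, 0.1), (10.0, 10.0), (10.0, 10.1)]
seeds = kmeans_pp_seeds(points, 2, random.Random(0))
print(seeds)
```

Random initialization corresponds to drawing every seed uniformly; the distance weighting is what tends to spread the initial centroids across the data.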
2017 Mar 09
GSoC 2017 Project Proposal
Hello devs. I would like to propose how I plan to go about improving and getting a system that can be integrated into Xapian in this GSoC for the clustering branch. I have identified three areas of work which were not touched last time. 1) Automated Performance Analysis I had roughly implemented 2 evaluation techniques previously (Distance b/w document and centroids within clusters and
2007 Jun 28
TermGenerator and SimpleStopper
Hi, I'm using SimpleStopper with TermGenerator in a Python indexing script, in an attempt to keep my index size down (currently 30K per doc, and I have 200 million docs to index, which I think implies 6TB.) However, unprefixed (positional?) terms are not affected by the stopper, though Z-prefixed terms are. I assume this is intentional for phrase queries, but I need to reduce my
2010 Nov 15
Stopword addition and stemming
Hi, Two questions which I'm unsure about: Stemming: I've turned on stemming, etc, but how can I confirm that it's being used in searches? What should I look/search for? Stopwords: I'm trying out xapian on a regional dataset (searching data from a * TLD, eg) . I've noticed that searching for [bob] results in *very* slow search times (tens of seconds), since it
2009 Nov 12
How can this code be improved?
I am running the following code on a MacBook Pro 17" Unibody early 2009 with 8GB RAM, OS X 10.5.8, R 2.10.0 Patch from Nov. 2, 2009, in 64-bit mode. freq.stopwords <- numeric(0) freq.nonstopwords <- numeric(0) token.tables <- list(0) <- c(0) cat("Beginning at ", date(), ".\n") for (i.d in 1:length(tokens)) { tt <- list(0) for (i.s in
2007 Mar 04
Getting non-stemmed terms from IndexReader
I need to get a set of terms being indexed using Ferret. I used IndexReader.terms and it returns a list of TermEnum nicely. The only problem is that my analyzer includes a stemming filter. So now, the terms I'm getting back are all stemmed. Is there any way to get the original unstemmed terms back from the index somehow? Thanks. -- Posted via
2010 Oct 08
Get a list of all terms in an indexed corpus
Hello, I have a corpus that I have indexed with xapian/xappy and I would now like to generate a corpus-specific list of stopwords. (This is a technical corpus, so a typical stopword list wouldn't be helpful.) My first thought was to ask the xapian database for a list of terms followed by their frequency. My intuition is that I could probably bring together a list of stopwords by examining
2015 Jul 26
Get term from document by position
> Snippet highlighting is something that was worked on for a GSoC project a > few years ago, and is mentioned in our FAQ: <>. > It's not available in the 1.2 series, but as I understand it should work out of the > box in 1.3.3. I tried it, this approach returns snippets that have nothing to do with the search string. Moreover, it takes too
2016 Jun 09
2nd week progress
Hello devs, I have filled out the repo link on TRAC as suggested. I'll also keep the journal updated on TRAC from now on. I am almost done with defining all the base classes required for the clusterer and have started coding the Euclidean distance metric. This should be completed by tomorrow after which I'll be spending one day to test and make sure everything functions as expected, so
2016 Jul 26
K MEANS clustering
Hello, I've been working on the KMeans clustering algorithm recently and for the past week, I have been stuck on a problem which I'm not able to find a solution to. Since we are representing documents as TF-IDF vectors, they are really sparse vectors (a usual corpus can have around 5000 terms). So it gets really difficult to represent these sparse vectors in a way that would be
2006 Aug 11
Proposed changes to omindex
Proposed changes to omindex Currently Available Items ========================= 1) Have the Q prefix contain the 16 byte MD5 of the full file name used for document lookup during indexing. 2) Add the document's last modified time to the value table (ID 0). This would allow incremental indexing based on the timestamp and also sorting by date in omega (SORT=0) a. Currently I store the timestamp
2006 Jul 26
tweaking minimum word length?
Hi, Can Ferret be configured to change the minimum word length of what it indexes? Right now it seems to drop words 3 characters or less, but I'd like to include words going down to 2 characters. How would I do that? Francis