Displaying 20 results from an estimated 1000 matches similar to: "KMeans Clusterer - Going forward"
2009 Mar 26
1
ideas on picking stopwords
I'm looking at adding some stopwords to my indexing procedure, and was
wondering if anyone had any good rules of thumb on how to pick which
words to blacklist. It all seems a little... well... vague. Although I
guess it kind of depends on the sort of documents you're wanting to index.
My current idea is to write a little script to output the terms with the
highest frequency in my
2008 Mar 27
2
Proper noun stemming
Hi All
I was wondering if anyone had a solution for the following problem.
I user QueryParser to stem my documents before adding them to a
database. During the stemming process I would like to find a way of
keeping proper nouns that span two or more words together as a phrase.
For example "New York" or "Gordon Brown" or "Prime Minister" get spilt
up. I see
2008 Mar 12
1
how can i use stopwords?
Hi,
I do not understand the stopword function...
I've set the termgenerator like this:
$self->{'Stemmer'} = new Search::Xapian::Stem(german2);
$self->{'Stopper'} = new Search::Xapian::SimpleStopper();
$self->{'TermGenerator'} = new Search::Xapian::TermGenerator;
$self->{'TermGenerator'}->set_stemmer( $self->{'Stemmer'} );
2016 Aug 19
2
KMeans - Evaluation Results
On 18 Aug 2016, at 23:59, Richhiey Thomas <richhiey.thomas at gmail.com> wrote:
> I've currently added a few classes which don't really belong to the public API (currently) into private headers and used PIMPL with the Cluster class.
I'm having difficulty reading your changes, because you aren't keeping to one complete change per commit. So for instance you've added a
2016 Aug 17
2
KMeans - Evaluation Results
On Wed, Aug 17, 2016 at 7:23 PM, James Aylett <james-xapian at tartarus.org>
wrote:
> >> How long does 200?300 documents take to cluster? How does it grow as
> more documents are included in the MSet? We'd expect an MSet of 1000
> documents to take longer to cluster than one with 100, but the important
> thing is _how_ the time increases as the number of documents
2016 Aug 17
2
KMeans - Evaluation Results
I've gone through the link that you sent me and I currently understand how
this helps and works to some extent, but I am not too sure of how I should
start with converting the current interface to PIMPL design. I'm not used
to this design pattern so its taking some time to sink in :)
Say I start with the Clusterer class, I create a ClustererImpl class which
is the internal class that
2016 Aug 18
3
KMeans - Evaluation Results
>
>
>
> Actually, you're doing something slightly unusual there: making the
> internal member public. Protected would be better, and private is I think
> most usual; library clients aren't going to have access to the Internal
> class declaration, so they can't call things on it. This means it's
> actually difficult right now to subclass Feature.
>
> I
2016 Aug 17
2
KMeans - Evaluation Results
> How long does 200?300 documents take to cluster? How does it grow as more
> documents are included in the MSet? We'd expect an MSet of 1000 documents
> to take longer to cluster than one with 100, but the important thing is
> _how_ the time increases as the number of documents grows.
>
> Currently, the number of seconds taken for clustering a set of documents
for varying
2016 Aug 15
2
KMeans - Evaluation Results
Hello,
I've recently finished with an implementation of KMeans with two
initialization techniques, random initialization and KMeans++. I would like
to share my findings after evaluating the same.
I have tested this implementation of KMeans with a BBC news article
dataset. I am currently working on evaluating the same with FIRE datasets.
Currently, clustering more than 500 documents
2017 Mar 09
2
GSoC 2017 Project Proposal
Hello devs.
I would like to propose how I plan to go about improving and getting a
system that can be integrated into Xapian in this GSoC for the clustering
branch.
I have identified three areas of work which were not touched last time.
1) Automated Performance Analysis
I had roughly implemented 2 evaluation techniques previously (Distance b/w
document and centroids within clusters and
2007 Jun 28
1
TermGenerator and SimpleStopper
Hi,
I'm using SimpleStopper with TermGenerator in a Python indexing
script, in an attempt to keep my index size down (currently 30K per
doc, and I have 200 million docs to index, which I think implies
6TB.) However, unprefixed (positional?) terms are not affected by
the stopper, though Z-prefixed terms are.
I assume this is intentional for phrase queries, but I need to reduce
my
2010 Nov 15
4
Stopword addition and stemming
Hi,
Two questions which I'm unsure about:
Stemming: I've turned on stemming, etc, but how can I confirm that
it's being used in searches? What should I look/search for?
Stopwords: I'm trying out xapian on a regional dataset (searching
data from a *.co.us TLD, eg) . I've noticed that searching for [bob
co.us] results in *very* slow search times (tens of seconds), since it
2009 Nov 12
1
How can this code be improved?
I am running the following code on a MacBook Pro 17" Unibody early
2009 with 8GB RAM, OS X 10.5.8, R 2.10.0 Patch from Nov. 2, 2009, in
64-bit mode.
freq.stopwords <- numeric(0)
freq.nonstopwords <- numeric(0)
token.tables <- list(0)
i.ss <- c(0)
cat("Beginning at ", date(), ".\n")
for (i.d in 1:length(tokens)) {
tt <- list(0)
for (i.s in
2007 Mar 04
5
Getting non-stemmed terms from IndexReader
I need to get a set of terms being indexed using Ferret. I used
IndexReader.terms and it returns a list of TermEnum nicely. The only
problem is that my analyzer includes a stemming filter.
So now, the terms I''m getting back are all stemmed. Is there anyway to
get the original unstemmed terms back from the index somehow? Thanks.
--
Posted via http://www.ruby-forum.com/.
2010 Oct 08
1
Get a list of all terms in an indexed corpus
Hello,
I have a corpus that I have indexed with xapian/xappy and I would now
like to generate a corpus-specific list of stopwords. (This is a
technical corpus, so a typical stopword list wouldn't be helpful.)
My first thought was to ask the xapian database for a list of terms
followed by their frequency. My intuition is that I could probably bring
together a list of stopwords by examining
2015 Jul 26
1
Get term from document by position
> Snippet highlighting is something that was worked on for a GSoC project a
> few years ago, and is mentioned in our FAQ: <http://trac.xapian.org/wiki/FAQ/Snippets>.
> It?s not available in the 1.2 series, but as I understand it should work out of the
> box in 1.3.3.
I tried it, this approach returns snippet that have nothing to do with the search string. Moreover, it takes too
2016 Jun 09
2
2nd week progress
Hello devs,
I have filled out the repo link on TRAC as suggested. I'll also keep the
journal updated on TRAC from now on.
I am almost done with defining all the base classes required for the
clusterer and have started coding the euclidian distance metric. This
should be completed by tomorrow after which I'll be spending one day to
test and make sure everything functions as expected, so
2016 Jul 26
3
K MEANS clustering
Hello,
I've been working on the KMeans clustering algorithm recently and since the
past week, I have been stuck on a problem which I'm not able to find a
solution to.
Since we are representing documents as Tf-idf vectors, they are really
sparse vectors (a usual corpus can have around 5000 terms). So it gets
really difficult to represent these sparse vectors in a way that would be
2006 Aug 11
3
Proposed changes to omindex
Proposed changes to omindex
Currently Available Items
=========================
1) Have the Q prefix contain the 16 byte MD5 of the full file name used for document lookup during
indexing.
2) Add the document?s last modified time to the value table (ID 0). This would allow incremental
indexing based on the timestamp and also sorting by date in omega (SORT=0)
a. Currently I store the timestamp
2006 Jul 26
13
tweaking minimum word length?
Hi,
Can Ferret be configured to change the minimum word length of what it
indexes? Right now it seems to drop words 3 characters or less, but
I''d like to include words going down to 2 characters. How would I do
that?
Francis