Displaying 20 results from an estimated 5000 matches similar to: "Stopword addition and stemming"
2008 Mar 12
1
how can i use stopwords?
Hi,
I do not understand the stopword function...
I've set the termgenerator like this:
$self->{'Stemmer'} = new Search::Xapian::Stem('german2');
$self->{'Stopper'} = new Search::Xapian::SimpleStopper();
$self->{'TermGenerator'} = new Search::Xapian::TermGenerator;
$self->{'TermGenerator'}->set_stemmer( $self->{'Stemmer'} );
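A minimal sketch of the same stopper/stemmer wiring through Xapian's Python bindings, with the stopword list and sample text assumed for illustration:
import xapian

stem = xapian.Stem("german2")
stopper = xapian.SimpleStopper()
for word in ("und", "oder", "der", "die", "das"):  # assumed example stopwords
    stopper.add(word)

tg = xapian.TermGenerator()
tg.set_stemmer(stem)
tg.set_stopper(stopper)  # stopped words still get unstemmed terms, but no stemmed (Z-prefixed) forms

doc = xapian.Document()
tg.set_document(doc)
tg.index_text("Das ist ein Beispieltext")  # assumed sample text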
2017 Jun 14
2
KMeans Clusterer - Going forward
Hello,
I have finished moving the API to PIMPL classes and will fix issues within
the current code over the next week, based on reviews from mentors.
The next step going forward is to start with forming document vectors that
are reduced and more useful. This helps considerably in reducing run time
(since the time for distance calculation depends on the number of terms).
Getting the useful terms within a
2008 Mar 27
2
Proper noun stemming
Hi All
I was wondering if anyone had a solution for the following problem.
I use QueryParser to stem my documents before adding them to a
database. During the stemming process I would like to find a way of
keeping proper nouns that span two or more words together as a phrase.
For example "New York" or "Gordon Brown" or "Prime Minister" get split
up. I see
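One possible approach, sketched below with the Python bindings: index_text() already stores positional data, so quoted phrase searches for "New York" work at query time; on top of that you can add a single combined term per known multi-word proper noun. The phrase list, the XP prefix and the underscore joining are assumptions for illustration, not a Xapian convention.
import xapian

KNOWN_PHRASES = ["new york", "gordon brown", "prime minister"]  # hypothetical list

def add_phrase_terms(doc, text):
    # Add one combined, unstemmed term per known proper noun found in the text.
    lowered = text.lower()
    for phrase in KNOWN_PHRASES:
        if phrase in lowered:
            doc.add_term("XP" + phrase.replace(" ", "_"))

db = xapian.WritableDatabase("index.db", xapian.DB_CREATE_OR_OPEN)  # assumed path
tg = xapian.TermGenerator()
tg.set_stemmer(xapian.Stem("english"))

doc = xapian.Document()
tg.set_document(doc)
text = "The Prime Minister visited New York."
tg.index_text(text)          # normal stemmed terms plus positional data
add_phrase_terms(doc, text)  # plus one term per multi-word proper noun
db.add_document(doc)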
2009 Mar 26
1
ideas on picking stopwords
I'm looking at adding some stopwords to my indexing procedure, and was
wondering if anyone had any good rules of thumb on how to pick which
words to blacklist. It all seems a little... well... vague. Although I
guess it kind of depends on the sort of documents you're wanting to index.
My current idea is to write a little script to output the terms with the
highest frequency in my
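A sketch of that script in plain Python, assuming you can iterate over your documents as lists of tokens (the sample data is made up): terms that appear in nearly every document are the usual stopword candidates.
from collections import Counter

def candidate_stopwords(documents, top_n=20):
    # Rank terms by document frequency: in how many documents each term occurs.
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each term at most once per document
    return doc_freq.most_common(top_n)

docs = [
    ["the", "court", "ruled", "on", "the", "appeal"],
    ["the", "jury", "was", "not", "instructed"],
]
for term, df in candidate_stopwords(docs, top_n=5):
    print(term, df)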
2006 Nov 13
1
Stemming, stop words, acts_as_ferret
I'd like to get the following behavior:
1. Stemming. The search is on a database of summaries of California legal
cases. Searches like "thermal image" need to hit "thermal
imaging."
2. Stop words. Searches for "failing to instruct the jury" should come up
with hits on a search for "fail to instruct."
3. Case-insensitive.
What I
2005 Jun 09
1
Query parser and stemming of norwegian letters
Hello, can I get an explanation of the following.
Running the following code:
....
pqp=new QueryParser();
Stem stem("norwegian");
cout << "DEBUG " << stem.stem_word(_sXapian)<< endl;
pqp->set_stemmer(stem);
pqp->set_database(*_pdatabase);
pqp->set_default_op(Query::OP_AND);
//Set the
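The snippet above uses the old C++ API; for comparison, a rough equivalent with the current Python bindings (the example text is an assumption) looks like:
import xapian

stem = xapian.Stem("norwegian")
qp = xapian.QueryParser()
qp.set_stemmer(stem)
qp.set_stemming_strategy(xapian.QueryParser.STEM_SOME)  # stem terms, producing Z-prefixed forms
qp.set_default_op(xapian.Query.OP_AND)

query = qp.parse_query("blåbær syltetøy")  # assumed example with Norwegian letters
print(query.get_description())  # shows exactly which (stemmed) terms were produced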
2012 Jan 13
4
Troubles with stemming (tm + Snowball packages) under MacOS
Dear all,
I have some troubles using the stemming algorithm provided by the tm
(text mining) + Snowball packages.
Here is my config:
MacOS 10.5
R 2.12.0 / R 2.13.1 / R 2.14.1 (I have tried several versions)
I have installed all the needed packages (tm, rJava, rWeka, Snowball)
+ dependencies. I have deactivated AWT (as described in
2013 Apr 09
3
Question on Stopword Removal from a Cyrillic (Bulgarian) Text
Hi,
I bumped into a serious issue while trying to analyse some texts in
Bulgarian language (with the tm package). I import a tab-separated csv
file, which holds a total of 22 variables, most of which are text cells
(not factors), using the read.delim function:
data<-read.delim("bigcompanies_ascii.csv",
header=TRUE,
quote="'",
2009 Nov 12
1
How can this code be improved?
I am running the following code on a MacBook Pro 17" Unibody early
2009 with 8GB RAM, OS X 10.5.8, R 2.10.0 Patch from Nov. 2, 2009, in
64-bit mode.
freq.stopwords <- numeric(0)
freq.nonstopwords <- numeric(0)
token.tables <- list(0)
i.ss <- c(0)
cat("Beginning at ", date(), ".\n")
for (i.d in 1:length(tokens)) {
tt <- list(0)
for (i.s in
2004 Dec 14
1
stopwords
Hi!
I would like to use the lists of stopwords provided with Xapian. Is
there a standard way to remove stopwords automatically, or should I
implement it myself in the indexer?
Regards,
Georges Dupret
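For reference, a minimal sketch of how a stopword list is usually wired in through the Python bindings; the file name and its format (one word per line) are assumptions:
import xapian

stopper = xapian.SimpleStopper()
with open("english.stopwords", encoding="utf-8") as fh:  # assumed file, one word per line
    for line in fh:
        word = line.strip()
        if word:
            stopper.add(word)

tg = xapian.TermGenerator()
tg.set_stopper(stopper)  # indexing side: stopwords get no stemmed forms

qp = xapian.QueryParser()
qp.set_stopper(stopper)  # query side: stopwords are dropped outside of phrases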
2006 Jul 26
13
tweaking minimum word length?
Hi,
Can Ferret be configured to change the minimum word length of what it
indexes? Right now it seems to drop words of 3 characters or fewer, but
I'd like to include words going down to 2 characters. How would I do
that?
Francis
2010 Oct 08
1
Get a list of all terms in an indexed corpus
Hello,
I have a corpus that I have indexed with xapian/xappy and I would now
like to generate a corpus-specific list of stopwords. (This is a
technical corpus, so a typical stopword list wouldn't be helpful.)
My first thought was to ask the xapian database for a list of terms
followed by their frequency. My intuition is that I could probably bring
together a list of stopwords by examining
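That is exactly what the allterms iterator gives you; a short sketch with the Python bindings (the database path is an assumption):
import xapian

db = xapian.Database("path/to/index")  # assumed path
total_docs = db.get_doccount()

# Each item yielded by allterms() carries the term (bytes in the Python 3 bindings)
# and its document frequency; you may want to skip Z-prefixed stemmed forms.
terms = [(item.term.decode("utf-8"), item.termfreq) for item in db.allterms()]
terms.sort(key=lambda pair: pair[1], reverse=True)

for term, freq in terms[:50]:
    print(f"{term}\t{freq}\t{freq / total_docs:.1%} of documents")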
2009 Nov 12
2
package "tm" fails to remove "the" with remove stopwords
I am using code that previously worked to remove stopwords using package
"tm". Even manually adding "the" to the list does not work to remove "the".
This package has undergone extensive redevelopment with changes to the
function syntax, so perhaps I am just missing something.
Please see my simple example, output, and sessionInfo() below.
Thanks!
Mark
require(tm)
2007 Aug 18
2
Problem with lsa package (data.frame) on Windows XP
Dear R team,
The following piece of code (to use the lsa package) works fine on my
mac os x, but when I run the same code on Windows XP, it doesn't work
any more.
### code:
library("lsa")
matrix1 = textmatrix("C:\\Documents and Settings\\tine stalmans.TINE.
000\\LSA\\cuentos\\", stemming=TRUE, language="spanish",
minWordLength=2, minDocFreq=1,
2017 Jul 18
1
Help-Multi class classification for large datasets
Hai all,
We are working on Multi-class Classification. Currently up to 1.1 million
records Ranger package in R is able to handle. Training time on 128 GB RAM
is 12 days, which is not a practically feasible method to proceed further.
In future we will have dataset of dimension 10 million records, we are in
search for a package or framework which can handle 10 million records with
at least 12000
2011 Dec 14
1
How to enable stemming with default_op set to OP_NEAR
Hi All,
I know that from version 1.2.6, if default_op is OP_NEAR or OP_PHRASE then stemming of the terms is disabled, since positional information isn't indexed for stemmed terms by default. However, I would like to try using OP_NEAR as default_op with stemming, because the near operator behaves differently from an exact phrase. Then I want to see how the search results look with this
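One way to experiment without touching default_op is to build the NEAR query explicitly over stemmed terms, which (as noted above) only matches if positional data was indexed for the stemmed forms. A hedged sketch with the Python bindings; the words, window and index path are assumptions:
import xapian

stem = xapian.Stem("english")
words = ["failing", "instruct"]  # assumed example words

stemmed_terms = []
for w in words:
    s = stem(w)
    if isinstance(s, bytes):  # the bindings return terms as bytes in Python 3
        s = s.decode("utf-8")
    stemmed_terms.append("Z" + s)  # Z prefix is the convention for stemmed terms

near_query = xapian.Query(xapian.Query.OP_NEAR, stemmed_terms, 10)  # window of 10 positions

db = xapian.Database("path/to/index")  # assumed path
enquire = xapian.Enquire(db)
enquire.set_query(near_query)
for match in enquire.get_mset(0, 10):
    print(match.docid, match.percent)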
2011 Sep 04
5
Ranking and term proximity
Hi,
I was reading an article recently about how google ranks results
(among many other things of course) based on the proximity of the
search terms in the source documents. In addition, the position of
the search terms in the search query string itself is also taken into
consideration when determining how important each term is.
Does Xapian do something similar - at least for the first part?
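Xapian's default BM25 weighting does not factor in proximity by itself, but one common pattern is to combine the main query with an OP_NEAR version of it using OP_AND_MAYBE, so documents where the terms occur close together get extra weight without excluding the rest. A sketch (Python bindings; terms, window and path are assumptions):
import xapian

db = xapian.Database("path/to/index")  # assumed path
qp = xapian.QueryParser()
qp.set_database(db)

base = qp.parse_query("thermal imaging")  # assumed example query
near = xapian.Query(xapian.Query.OP_NEAR, ["thermal", "imaging"], 5)

# OP_AND_MAYBE: documents must match `base`; `near` only contributes extra weight.
boosted = xapian.Query(xapian.Query.OP_AND_MAYBE, base, near)

enquire = xapian.Enquire(db)
enquire.set_query(boosted)
for m in enquire.get_mset(0, 10):
    print(m.rank, m.docid, m.percent)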
2007 Jan 19
9
Double-quoted query with "and" fails.
Hi,
We're using Ferret 0.9.4 and we've observed the following behavior.
Searching for 'fieldname: foo and bar' works fine while 'fieldname:
"foo and bar"' doesn't return any results. Is there a way to make
ferret recognize the 'and' inside the query as a search term and not
an operator? (I hope I got the
2009 Jul 17
3
Help with the text mining (TM) package
Dear all, I am writing to ask about the following:
I am doing a text mining project and need to import a series of texts to
preprocess them, that is, remove stopwords, do stemming, remove punctuation
marks, etc. I can do that part with the datasets that come with the TM
library. What I cannot manage is to import text from any source, even
though there are functions