Displaying 20 results from an estimated 5000 matches similar to: "Stopword addition and stemming"
2008 Mar 12
1
how can i use stopwords?
Hi,
I do not understand the stopword function...
I've set the termgenerator like this:
$self->{'Stemmer'} = new Search::Xapian::Stem('german2');
$self->{'Stopper'} = new Search::Xapian::SimpleStopper();
$self->{'TermGenerator'} = new Search::Xapian::TermGenerator;
$self->{'TermGenerator'}->set_stemmer( $self->{'Stemmer'} );
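A minimal sketch of the same stopper/stemmer wiring through Xapian's Python bindings, with the stopword list and sample text assumed for illustration:
import xapian

stem = xapian.Stem("german2")
stopper = xapian.SimpleStopper()
for word in ("und", "oder", "der", "die", "das"):  # assumed example stopwords
    stopper.add(word)

tg = xapian.TermGenerator()
tg.set_stemmer(stem)
tg.set_stopper(stopper)  # stopped words still get unstemmed terms, but no stemmed (Z-prefixed) forms

doc = xapian.Document()
tg.set_document(doc)
tg.index_text("Das ist ein Beispieltext")  # assumed sample text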
2017 Jun 14
2
KMeans Clusterer - Going forward
Hello,
I have finished moving the API to PIMPL classes and will fix issues within
the current code over the next week, based on reviews from mentors.
The next step going forward is to start with forming document vectors that
are reduced and more useful. This helps considerably in reducing run time
(since the time for distance calculation depends on the number of terms).
Getting the useful terms within a
2008 Mar 27
2
Proper noun stemming
Hi All
I was wondering if anyone had a solution for the following problem.
I use QueryParser to stem my documents before adding them to a
database. During the stemming process I would like to find a way of
keeping proper nouns that span two or more words together as a phrase.
For example "New York" or "Gordon Brown" or "Prime Minister" get split
up. I see
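One possible approach, sketched below with the Python bindings: index_text() already stores positional data, so quoted phrase searches for "New York" work at query time; on top of that you can add a single combined term per known multi-word proper noun. The phrase list, the XP prefix and the underscore joining are assumptions for illustration, not a Xapian convention.
import xapian

KNOWN_PHRASES = ["new york", "gordon brown", "prime minister"]  # hypothetical list

def add_phrase_terms(doc, text):
    # Add one combined, unstemmed term per known proper noun found in the text.
    lowered = text.lower()
    for phrase in KNOWN_PHRASES:
        if phrase in lowered:
            doc.add_term("XP" + phrase.replace(" ", "_"))

db = xapian.WritableDatabase("index.db", xapian.DB_CREATE_OR_OPEN)  # assumed path
tg = xapian.TermGenerator()
tg.set_stemmer(xapian.Stem("english"))

doc = xapian.Document()
tg.set_document(doc)
text = "The Prime Minister visited New York."
tg.index_text(text)          # normal stemmed terms plus positional data
add_phrase_terms(doc, text)  # plus one term per multi-word proper noun
db.add_document(doc)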
2009 Mar 26
1
ideas on picking stopwords
I'm looking at adding some stopwords to my indexing procedure, and was
wondering if anyone had any good rules of thumb on how to pick which
words to blacklist. It all seems a little... well... vague. Although I
guess it kind of depends on the sort of documents you're wanting to index.
My current idea is to write a little script to output the terms with the
highest frequency in my
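A sketch of that script in plain Python, assuming you can iterate over your documents as lists of tokens (the sample data is made up): terms that appear in nearly every document are the usual stopword candidates.
from collections import Counter

def candidate_stopwords(documents, top_n=20):
    # Rank terms by document frequency: in how many documents each term occurs.
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each term at most once per document
    return doc_freq.most_common(top_n)

docs = [
    ["the", "court", "ruled", "on", "the", "appeal"],
    ["the", "jury", "was", "not", "instructed"],
]
for term, df in candidate_stopwords(docs, top_n=5):
    print(term, df)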
2006 Nov 13
1
Stemming, stop words, acts_as_ferret
I'd like to get the following behavior:
1. Stemming. The search is on a database of summaries of California legal
cases. Searches like "thermal image" need to hit "thermal
imaging."
2. Stop words. Searches for "failing to instruct the jury" should come up
with hits on a search for "fail to instruct."
3. Case-insensitive.
What I
2005 Jun 09
1
Query parser and stemming of norwegian letters
Hello, can I get an explanation of the following.
Running the following code:
....
pqp=new QueryParser();
Stem stem("norwegian");
cout << "DEBUG " << stem.stem_word(_sXapian)<< endl;
pqp->set_stemmer(stem);
pqp->set_database(*_pdatabase);
pqp->set_default_op(Query::OP_AND);
//Set the
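The snippet above uses the old C++ API; for comparison, a rough equivalent with the current Python bindings (the example text is an assumption) looks like:
import xapian

stem = xapian.Stem("norwegian")
qp = xapian.QueryParser()
qp.set_stemmer(stem)
qp.set_stemming_strategy(xapian.QueryParser.STEM_SOME)  # stem terms, producing Z-prefixed forms
qp.set_default_op(xapian.Query.OP_AND)

query = qp.parse_query("blåbær syltetøy")  # assumed example with Norwegian letters
print(query.get_description())  # shows exactly which (stemmed) terms were produced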
2012 Jan 13
4
Troubles with stemming (tm + Snowball packages) under MacOS
Dear all,
I have some troubles using the stemming algorithm provided by the tm
(text mining) + Snowball packages.
Here is my config:
MacOS 10.5
R 2.12.0 / R 2.13.1 / R 2.14.1 (I have tried several versions)
I have installed all the needed packages (tm, rJava, rWeka, Snowball)
+ dependencies. I have deactivated AWT (as described in
2013 Apr 09
3
Question on Stopword Removal from a Cyrillic (Bulgarian) Text
Hi,
I bumped into a serious issue while trying to analyse some texts in
Bulgarian language (with the tm package). I import a tab-separated csv
file, which holds a total of 22 variables, most of which are text cells
(not factors), using the read.delim function:
data<-read.delim("bigcompanies_ascii.csv",
header=TRUE,
quote="'",
2009 Nov 12
1
How can this code be improved?
I am running the following code on a MacBook Pro 17" Unibody early
2009 with 8GB RAM, OS X 10.5.8, R 2.10.0 Patch from Nov. 2, 2009, in
64-bit mode.
freq.stopwords <- numeric(0)
freq.nonstopwords <- numeric(0)
token.tables <- list(0)
i.ss <- c(0)
cat("Beginning at ", date(), ".\n")
for (i.d in 1:length(tokens)) {
tt <- list(0)
for (i.s in
2004 Dec 14
1
stopwords
Hi!
I would like to use the lists of stopwords provided with Xapian. Is
there a standard way to remove stopwords automatically, or should I
implement it myself in the indexer?
Regards,
Georges Dupret
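For reference, a minimal sketch of how a stopword list is usually wired in through the Python bindings; the file name and its format (one word per line) are assumptions:
import xapian

stopper = xapian.SimpleStopper()
with open("english.stopwords", encoding="utf-8") as fh:  # assumed file, one word per line
    for line in fh:
        word = line.strip()
        if word:
            stopper.add(word)

tg = xapian.TermGenerator()
tg.set_stopper(stopper)  # indexing side: stopwords get no stemmed forms

qp = xapian.QueryParser()
qp.set_stopper(stopper)  # query side: stopwords are dropped outside of phrases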
2006 Jul 26
13
tweaking minimum word length?
Hi,
Can Ferret be configured to change the minimum word length of what it
indexes? Right now it seems to drop words of 3 characters or fewer, but
I'd like to include words going down to 2 characters. How would I do
that?
Francis
2010 Oct 08
1
Get a list of all terms in an indexed corpus
Hello,
I have a corpus that I have indexed with xapian/xappy and I would now
like to generate a corpus-specific list of stopwords. (This is a
technical corpus, so a typical stopword list wouldn't be helpful.)
My first thought was to ask the xapian database for a list of terms
followed by their frequency. My intuition is that I could probably bring
together a list of stopwords by examining
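That is exactly what the allterms iterator gives you; a short sketch with the Python bindings (the database path is an assumption):
import xapian

db = xapian.Database("path/to/index")  # assumed path
total_docs = db.get_doccount()

# Each item yielded by allterms() carries the term (bytes in the Python 3 bindings)
# and its document frequency; you may want to skip Z-prefixed stemmed forms.
terms = [(item.term.decode("utf-8"), item.termfreq) for item in db.allterms()]
terms.sort(key=lambda pair: pair[1], reverse=True)

for term, freq in terms[:50]:
    print(f"{term}\t{freq}\t{freq / total_docs:.1%} of documents")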
2009 Nov 12
2
package "tm" fails to remove "the" with remove stopwords
I am using code that previously worked to remove stopwords using package
"tm". Even manually adding "the" to the list does not work to remove "the".
This package has undergone extensive redevelopment with changes to the
function syntax, so perhaps I am just missing something.
Please see my simple example, output, and sessionInfo() below.
Thanks!
Mark
require(tm)
2007 Aug 18
2
Problem with lsa package (data.frame) on Windows XP
Dear R team,
The following piece of code (to use the lsa package) works fine on my
mac os x, but when I run the same code on Windows XP, it doesn't work
any more.
### code:
library("lsa")
matrix1 = textmatrix("C:\\Documents and Settings\\tine stalmans.TINE.
000\\LSA\\cuentos\\", stemming=TRUE, language="spanish",
minWordLength=2, minDocFreq=1,
2017 Jul 18
1
Help-Multi class classification for large datasets
Hai all,
We are working on Multi-class Classification. Currently up to 1.1 million
records Ranger package in R is able to handle. Training time on 128 GB RAM
is 12 days, which is not a practically feasible method to proceed further.
In future we will have dataset of dimension 10 million records, we are in
search for a package or framework which can handle 10 million records with
at least 12000
2011 Dec 14
1
How to enable stemming with default_op set to OP_NEAR
Hi All,
I know that from version 1.2.6, if default_op is OP_NEAR or OP_PHRASE then stemming of the terms is disabled, since positional information isn't indexed for stemmed terms by default. However, I would like to try using OP_NEAR as default_op with stemming, because the near operator behaves differently from an exact phrase. Then I want to see how the search results look with this
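One way to experiment without touching default_op is to build the NEAR query explicitly over stemmed terms, which (as noted above) only matches if positional data was indexed for the stemmed forms. A hedged sketch with the Python bindings; the words, window and index path are assumptions:
import xapian

stem = xapian.Stem("english")
words = ["failing", "instruct"]  # assumed example words

stemmed_terms = []
for w in words:
    s = stem(w)
    if isinstance(s, bytes):  # the bindings return terms as bytes in Python 3
        s = s.decode("utf-8")
    stemmed_terms.append("Z" + s)  # Z prefix is the convention for stemmed terms

near_query = xapian.Query(xapian.Query.OP_NEAR, stemmed_terms, 10)  # window of 10 positions

db = xapian.Database("path/to/index")  # assumed path
enquire = xapian.Enquire(db)
enquire.set_query(near_query)
for match in enquire.get_mset(0, 10):
    print(match.docid, match.percent)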
2011 Sep 04
5
Ranking and term proximity
Hi,
I was reading an article recently about how google ranks results
(among many other things of course) based on the proximity of the
search terms in the source documents. In addition, the position of
the search terms in the search query string itself is also taken into
consideration when determining how important each term is.
Does Xapian do something similar - at least for the first part?
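Xapian's default BM25 weighting does not factor in proximity by itself, but one common pattern is to combine the main query with an OP_NEAR version of it using OP_AND_MAYBE, so documents where the terms occur close together get extra weight without excluding the rest. A sketch (Python bindings; terms, window and path are assumptions):
import xapian

db = xapian.Database("path/to/index")  # assumed path
qp = xapian.QueryParser()
qp.set_database(db)

base = qp.parse_query("thermal imaging")  # assumed example query
near = xapian.Query(xapian.Query.OP_NEAR, ["thermal", "imaging"], 5)

# OP_AND_MAYBE: documents must match `base`; `near` only contributes extra weight.
boosted = xapian.Query(xapian.Query.OP_AND_MAYBE, base, near)

enquire = xapian.Enquire(db)
enquire.set_query(boosted)
for m in enquire.get_mset(0, 10):
    print(m.rank, m.docid, m.percent)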
2007 Jan 19
9
Double-quoted query with "and" fails.
Hi,
We're using Ferret 0.9.4 and we've observed the following behavior.
Searching for 'fieldname: foo and bar' works fine while 'fieldname:
"foo and bar"' doesn't return any results. Is there a way to make
ferret recognize the 'and' inside the query as a search term and not
an operator? (I hope I got the
2009 Jul 17
3
Help with the text mining (TM) package
Dear all, I am writing to ask about the following:
I am doing a text mining project and need to import a series of texts to
preprocess them, that is, remove stopwords, do stemming, remove punctuation
marks, etc. I can do that part with the datasets that come with the TM
library. What I cannot manage is to import text from any source, even
though there are functions