similar to: ideas on picking stopwords

Displaying 20 results from an estimated 2000 matches similar to: "ideas on picking stopwords"

2008 Mar 27
2
Proper noun stemming
Hi All, I was wondering if anyone had a solution for the following problem. I use QueryParser to stem my documents before adding them to a database. During the stemming process I would like to find a way of keeping proper nouns that span two or more words together as a phrase. For example "New York" or "Gordon Brown" or "Prime Minister" get split up. I see
2008 Mar 12
1
how can i use stopwords?
Hi, I do not understand the stopword function... I've set the termgenerator like this: $self->{'Stemmer'} = new Search::Xapian::Stem(german2); $self->{'Stopper'} = new Search::Xapian::SimpleStopper(); $self->{'TermGenerator'} = new Search::Xapian::TermGenerator; $self->{'TermGenerator'}->set_stemmer( $self->{'Stemmer'} );
2017 Jun 14
2
KMeans Clusterer - Going forward
Hello, I have finished moving the API to PIMPL classes and will fix issues within the current code over the next week, based on reviews from mentors. The next step going forward is to start with forming document vectors that are reduced and more useful. This majorly helps in saving run time (since time for distance calculation depends on number of terms). Getting the useful terms within a
2010 Nov 15
4
Stopword addition and stemming
Hi, Two questions which I'm unsure about: Stemming: I've turned on stemming, etc, but how can I confirm that it's being used in searches? What should I look/search for? Stopwords: I'm trying out xapian on a regional dataset (searching data from a *.co.us TLD, e.g.). I've noticed that searching for [bob co.us] results in *very* slow search times (tens of seconds), since it
2012 Jan 13
4
Troubles with stemming (tm + Snowball packages) under MacOS
Dear all, I have some troubles using the stemming algorithm provided by the tm (text mining) + Snowball packages. Here is my config: MacOS 10.5 R 2.12.0 / R 2.13.1 / R 2.14.1 (I have tried several versions) I have installed all the needed packages (tm, rJava, rWeka, Snowball) + dependencies. I have deactivated AWT (as written in
2002 Jul 12
2
HP-UX slow login problem found?
I think I finally figured out the problem that many people have been having with extremely long login times under HP-UX 11.x. The problem is really in OpenSSL, and in particular the Diffie-Hellman parameter generation routines under the PA-RISC processor. I suspect this may not be a problem with the IA64 (Itanium) processors. This especially shows up if you use the gcc compiler. Fortunately I
2009 Jul 17
3
Help with the text mining package (TM)
Dear all, I am writing to ask the following: I am working on a text mining project and need to import a set of texts in order to preprocess them, that is, remove stopwords, apply stemming, remove punctuation marks, etc. I can do the latter with the datasets that ship with the TM library. What I cannot manage is to import text from some medium, even though there are functions
2009 Nov 12
2
package "tm" fails to remove "the" with remove stopwords
I am using code that previously worked to remove stopwords using package "tm". Even manually adding "the" to the list does not work to remove "the". This package has undergone extensive redevelopment with changes to the function syntax, so perhaps I am just missing something. Please see my simple example, output, and sessionInfo() below. Thanks! Mark require(tm)
2007 Aug 28
1
flintlock fork causes hang in Apache+Python+mod_python
I am trying to use Xapian 1.0.2 with the Python SWIG bindings within an environment consisting of Apache httpd with mod_python. (not as a CGI) Also this is Linux. Whenever the python code attempts to open a database the entire httpd process will hang indefinitely. The python bindings work outside of the apache/mod_python environment. From the best I can tell the hang occurs in
2002 Jul 26
1
HP-UX 11 Corrupted MAC errors
Using 3.4p1 under HP-UX 11.0 I am repeatedly getting disconnected with Corrupted MAC on input. I am connecting from a RedHat Linux client (at 3.1p1). The incorrect MAC is appearing on the server packet receive side. Never get an invalid MAC on the client side. I'm currently diving into packet.c to try to find this, but the behavior is so strange and predictable I thought I'd see if
2012 Dec 13
2
Size of the term matrix and memory. TM package
Hi all! I have some problems with the size of the term matrix I obtain. The commands I use are the following: # load libraries library(tm) library(wordcloud) library(Rstem) library(Snowball) # read the UTF-8 document and convert it to ASCII txt <-
2002 Aug 30
1
LIBCRYPTO?
Hi all, I have a question about OpenSSH configuration. In Makefile there is defined LIBS=$(LIBCRYPTO), but the problem is that the version of OpenSSL that I'm using holds only the version LIBCRYPT. When adding LIBCRYPT to the Makefile I get: sshd.elf2flt: In function `key_regeneration_alarm': /.../ssh/sshd.c:252: undefined reference to `RSA_free' /.../ssh/sshd.c:253: undefined
2020 Nov 02
1
v2.3.11.3 solr plugin search via MUA fails to match accented ascii characters; cmd line exec of `doveadm fts lookup` PANICs (assertion failed) [proposed patch]
> On 02/11/2020 15:11 PGNet Dev <pgnet.dev at gmail.com> wrote: > > > On 11/2/20 12:44 AM, Aki Tuomi wrote: > > you should try removing use_libfts from your config line and let solr do that part. > > sry, i'm a bit confused. > > you'd suggested I _add_ it, > > https://dovecot.org/pipermail/dovecot/2020-October/120258.html > > > I
2020 Oct 18
8
v2.3.11.3 solr plugin search via MUA fails to match accented ascii characters; cmd line exec of `doveadm fts lookup` PANICs (assertion failed)
I've since rebuilt/reconfig'd all parts of my setup from scratch; some good cleanup along the way. Atm, my entire system for send/recv, store/retrieve, + rules & search is working as I intend. Ok, mostly ... Except for this accented-character search mystery. I've got a _lot_ of mail with various languages in bodies, so _do_ need to get this sorted. > On 10/18/20 2:58 PM,
2011 Jun 04
1
Problem with Snowball & RWeka
I too have this problem. Everything worked fine last year, but after updating R and packages I can no longer do word stemming. Unfortunately, I didn't save the old binaries, otherwise I would just revert back. Hoping someone finds a solution for R on Windows. Thanks! There is a potential solution for R on Mac OS from Kurt Hornik copied below, but I cannot get this to work on Windows.
2020 Apr 28
3
Stopwords: Topic modelling with LDA
Good morning, I am carrying out a topic model analysis using the LDA method. To begin with, I removed the universal "stopwords" from the analysis. When I look at the topics and their most frequent words, I find that they are very similar and that some words appear in every topic. The texts I am analysing are consumer opinions about a specific category of
2004 Dec 14
1
stopwords
Hi! I would like to use the lists of stopwords provided with Xapian. Is there some standard way to remove stopwords automatically, or should I implement it myself in the indexer? Regards, Georges Dupret
2006 Jul 26
13
tweaking minimum word length?
Hi, Can Ferret be configured to change the minimum word length of what it indexes? Right now it seems to drop words 3 characters or less, but I'd like to include words going down to 2 characters. How would I do that? Francis
2007 Mar 04
5
Getting non-stemmed terms from IndexReader
I need to get a set of terms being indexed using Ferret. I used IndexReader.terms and it returns a list of TermEnum nicely. The only problem is that my analyzer includes a stemming filter. So now, the terms I'm getting back are all stemmed. Is there any way to get the original unstemmed terms back from the index somehow? Thanks. -- Posted via http://www.ruby-forum.com/.
2020 Oct 19
4
v2.3.11.3 solr plugin search via MUA fails to match accented ascii characters; cmd line exec of `doveadm fts lookup` PANICs (assertion failed)
On 10/19/20 1:18 AM, John Fawcett wrote: > I would recommend you to redo the tests after correcting the > configuration. To be doubly sure you can include accented and unique non > accented text in the same email and search for both. If the non accented > text is found you know you're searching against the updated index and > the fact that accented text is not found is not