thr3ads.net - search: "unstemmed"

Displaying 18 results from an estimated 18 matches for "unstemmed".

Get term from document by position

2015 Jul 26

...arch string. Moreover, it takes too long to generate a snippet. > Note that your suggested approach of going from terms to snippet doesn't work in the general > case, because of issues like stemming. Actually, it works just fine. I am using the following indexing scheme: First, I index unstemmed text. Next, I add a term with a unique prefix to the database. This term is used as a delimiter between stemmed and unstemmed terms. Finally, I index stemmed text. When generating a snippet (if a stemmer is being used) I get the positions of the stemmed terms (that the snippet should consist of) and the...
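The indexing scheme described in this post can be illustrated with a small, library-free sketch. It is not the poster's code and does not use the Xapian API: `toy_stem`, `DELIMITER`, `build_positions` and `unstemmed_at` are hypothetical names, the stemmer is a crude stand-in, and the final lookup step is an assumption, since the snippet is cut off before describing it.

```python
def toy_stem(word):
    # Hypothetical stand-in for a real stemmer: strips a few English suffixes.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


# Hypothetical unique-prefix term that separates the two runs of terms.
# It is upper case, so it cannot collide with the lower-cased words.
DELIMITER = "XDELIM"


def build_positions(text):
    """Return a list indexed by position: unstemmed words, delimiter, stemmed words."""
    words = text.lower().split()
    return words + [DELIMITER] + [toy_stem(w) for w in words]


def unstemmed_at(positions, stemmed_pos):
    """Map the position of a stemmed term back to the original word (assumed step)."""
    offset = positions.index(DELIMITER) + 1
    return positions[stemmed_pos - offset]
```

Because both runs have the same length and order, subtracting the offset of the first stemmed position is enough to recover the original word for a snippet.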

Getting non-stemmed terms from IndexReader

2007 Mar 04

I need to get a set of terms being indexed using Ferret. I used IndexReader.terms and it returns a list of TermEnum nicely. The only problem is that my analyzer includes a stemming filter. So now, the terms I'm getting back are all stemmed. Is there any way to get the original unstemmed terms back from the index somehow? Thanks. -- Posted via http://www.ruby-forum.com/.

Proper noun stemming

2008 Mar 27

Hi All, I was wondering if anyone had a solution for the following problem. I use QueryParser to stem my documents before adding them to a database. During the stemming process I would like to find a way of keeping proper nouns that span two or more words together as a phrase. For example "New York" or "Gordon Brown" or "Prime Minister" get split up. I see

KMeans Clusterer - Going forward

2017 Jun 14

Hello, I have finished moving the API to PIMPL classes and will fix issues within the current code over the next week, based on reviews from mentors. The next step going forward is to start forming document vectors that are reduced and more useful. This helps considerably in saving run time (since the time for distance calculation depends on the number of terms). Getting the useful terms within a

manual flushing thresholds for deletes?

2023 Mar 26

...ngth itself, > and 3x for the position overhead If I follow, you want an approximation to the number of raw bytes in the text to match the non-delete case, so I think you want something like: get_doclength() / 2 * (mean_word_length + 1) The /2 is assuming you're indexing both stemmed and unstemmed terms since with the default indexing strategy one word in the document generates one of each. The +1 is for the spaces between words in the text. This is likely to underestimate due to punctuation and runs of whitespace, so perhaps +1.<something> is better (and perhaps better to overestima...
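The estimate quoted in this reply can be written out as a small function. This is only a sketch of the arithmetic: `estimate_raw_bytes` and its parameter names are hypothetical, `doclength` stands for whatever `get_doclength()` returns, and `mean_word_length` is a value the caller has to supply.

```python
def estimate_raw_bytes(doclength, mean_word_length, per_word_overhead=1.0):
    """Approximate the raw text size of a document from its length in terms.

    doclength / 2 assumes each word produced one stemmed and one unstemmed
    term; per_word_overhead is the "+1" for the space after each word and
    can be raised to allow for punctuation and runs of whitespace.
    """
    return doclength / 2 * (mean_word_length + per_word_overhead)
```

For example, a document length of 2000 with a mean word length of 5 gives 2000 / 2 * (5 + 1) = 6000 bytes.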

ideas on picking stopwords

2009 Mar 26

I'm looking at adding some stopwords to my indexing procedure, and was wondering if anyone had any good rules of thumb on how to pick which words to blacklist. It all seems a little... well... vague. Although I guess it kind of depends on the sort of documents you're wanting to index. My current idea is to write a little script to output the terms with the highest frequency in my
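The idea of listing the highest-frequency terms as stopword candidates might look something like the sketch below. It counts words in plain text rather than reading terms from an index, and `top_terms` and its tokenizing regular expression are assumptions for illustration, not anything from the post.

```python
import re
from collections import Counter


def top_terms(documents, n=10):
    """Return the n most frequent lower-cased words across the given texts."""
    counts = Counter()
    for text in documents:
        # Simple tokenizer: runs of ASCII letters and apostrophes.
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return counts.most_common(n)
```

The resulting list still needs a human pass, since frequent words in a specialised collection are not always safe to drop.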

Proposed changes to omindex

2006 Aug 11

Proposed changes to omindex Currently Available Items ========================= 1) Have the Q prefix contain the 16 byte MD5 of the full file name used for document lookup during indexing. 2) Add the document's last modified time to the value table (ID 0). This would allow incremental indexing based on the timestamp and also sorting by date in omega (SORT=0) a. Currently I store the timestamp

Does OP_NEAR works with stemming?

2011 May 27

Hi All, I used the OP_NEAR operator for queryparser, and when I searched for "apple store" in my own collection, the query was parsed as "Zappl:(pos=1) NEAR 11 Zstore:(pos=2)" but retrieved nothing. However, if I type in "Apple Store", the query is parsed as Xapian::Query((apple:(pos=1) NEAR 11 store:(pos=2))) and some results are shown. I'm not sure whether

TermGenerator and SimpleStopper

2007 Jun 28

Hi, I'm using SimpleStopper with TermGenerator in a Python indexing script, in an attempt to keep my index size down (currently 30K per doc, and I have 200 million docs to index, which I think implies 6TB.) However, unprefixed (positional?) terms are not affected by the stopper, though Z-prefixed terms are. I assume this is intentional for phrase queries, but I need to reduce my

Problem with stop words by indexing

2010 May 27

...ocs - now done. > > This ought to be more configurable, as should some other things in > TermGenerator. I'm thinking we should look at how to improve TermGenerator > in 1.3.x. The 1.3.x release is a little too far away for my use case (I speak here only about the ability to remove unstemmed stop words). I have (in termgenerator_internal.cc, line 129) changed the default value of stop_mode from STOPWORDS_INDEX_UNSTEMMED_ONLY to STOPWORDS_IGNORE and Xapian now does exactly what I want. Wouldn't it be possible to simply add a property "stopper_strategy" to the termgenerator...

Get term from document by position

2015 Jul 23

Hello. Is there any FAST way to get a term from the xapian document by its position, something like std::string term = Xapian::Document::GetTermByPosition(int position) ? Below I have described a task that I am trying to solve, in case somebody is interested. ============================================================================ When displaying search results, I would like to

Problem with Snowball & RWeka

2011 Jun 04

...s I got a Java error "Could not initialize the GenericPropertiesCreator. This exception was produced: java.lang.NullPointerException". After receiving this error once in the session, no further error messages are generated. However, SnowballStemmer() and stemDocument() return the original unstemmed text. Possible Solution: For those on Mac OS, Kurt Hornik wrote... These issues seem to be specific to Mac OS X. Recent versions of Weka have added a package management system not unlike R's, to the effect that now when external packages (or the Snowball jar) is loaded their...

manual flushing thresholds for deletes?

2023 Mar 24

Years ago, I ran into OOM problems with the default flush threshold of 10000 documents while indexing (add/replace). Realizing I had documents of hugely varying sizes (0.5KB..20MB) and little RAM, I instead tracked the number of raw bytes in the text being indexed and flushed whenever I'd seen a configurable byte count. Not the most scientific way, but it seems to work well enough on low-end
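A byte-count flush threshold of the kind described here could be sketched as follows. This is not the poster's code: `ByteBudgetFlusher` is a hypothetical helper, and `flush` is any callable supplied by the caller (for instance the commit method of a writable database), so nothing here depends on a particular library.

```python
class ByteBudgetFlusher:
    """Call flush() whenever the text seen since the last flush reaches max_bytes."""

    def __init__(self, flush, max_bytes):
        self.flush = flush          # caller-supplied callable, e.g. a commit method
        self.max_bytes = max_bytes  # configurable byte budget between flushes
        self.pending = 0            # raw bytes seen since the last flush

    def add(self, text):
        """Account for one indexed text; return True if a flush was triggered."""
        self.pending += len(text.encode("utf-8"))
        if self.pending >= self.max_bytes:
            self.flush()
            self.pending = 0
            return True
        return False
```

Calling `add()` once per indexed document keeps memory use tied to the volume of text rather than to the number of documents, which matters when document sizes vary widely.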

manual flushing thresholds for deletes?

2023 Mar 27

...position overhead > > If I follow, you want an approximation to the number of raw bytes in the > text to match the non-delete case, so I think you want something like: > > get_doclength() / 2 * (mean_word_length + 1) > > The /2 is assuming you're indexing both stemmed and unstemmed terms > since with the default indexing strategy one word in the document > generates one of each. > > The +1 is for the spaces between words in the text. This is > likely to underestimate due to punctuation and runs of whitespace, > so perhaps +1.<something> is better (an...

Enhance synonyms feature of the query parser (patch included)

2012 Jan 05

...yparser.lemon'
---
*** queryparser.lemony 2012-01-05 12:28:39.000000000 +0800
--- queryparser.lemony.new 2012-01-05 12:52:56.000000000 +0800
***************
*** 307,316 ****
--- 307,318 ----
  for (piter = prefixes.begin(); piter != prefixes.end(); ++piter) {
      // First try the unstemmed term:
      string term;
+ #ifndef HAVE_SYNONYMS_ENH
      if (!piter->empty()) {
          term += *piter;
          if (prefix_needs_colon(*piter, name[0])) term += ':';
      }
+ #endif
      term += name;
      Xapian::Database db = state->get_database();
********...

Searching using prefixes

2011 Jul 27

Hi guys I'm trying to figure out how I can use probabilistic searching on a given field within a document; I've written to the list about this before, but haven't quite figured out what's required and, following a little research, I think I understand what I need to do but I'd like a clarification on this. o We have a database of a number of documents, with fields: title,

How to beat Google aka Xapian & Natural Language Processing.

2007 Oct 01

Xapians! If tomorrow the Xapian search engine achieved the same performance and search results as Google, we would still not be able to beat Google, because we would have created only a copy of the searches that already exist in the Google search engine. However, there is a way to beat anyone, and there is a way to beat Google successfully as well; just do not give up. Some see it as implementing Ajax, or

Full text search improvements

2013 Nov 30

...to word prefixes. This of course would mean that the "ab" referring to "a" UID list would no longer work for the first nodes. Substring searching likely wouldn't work very nicely for stemmed words. So Squat should probably index the full stemmed word and then also index the unstemmed word in small 4-letter pieces. It should be possible to also disable substring searching entirely. Squat already attempts to reduce disk space by encoding the common characters with fewer bits than other characters. This is hardcoded for the English language though. Each index compression could ana...
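Splitting an unstemmed word into 4-letter pieces, as proposed here, might look like the sketch below. It assumes overlapping windows, which the snippet does not state, and `four_letter_pieces` is a hypothetical name rather than anything in Squat.

```python
def four_letter_pieces(word, size=4):
    """Return the overlapping size-letter substrings of word (assumed scheme).

    Words no longer than size are returned whole.
    """
    if len(word) <= size:
        return [word]
    return [word[i:i + size] for i in range(len(word) - size + 1)]
```

With overlapping windows, any substring of at least four letters in a query shares at least one piece with the indexed word, which is what makes substring search possible.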
