search for: unstemmed

Displaying 18 results from an estimated 18 matches for "unstemmed".

2015 Jul 26
1
Get term from document by position
...arch string. Moreover, it takes too long to generate a snippet. > Note that your suggested approach of going from terms to snippet doesn't work in the general > case, because of issues like stemming. Actually, it works just fine. I am using the following indexing scheme: First, I index unstemmed text. Next, I add a term with a unique prefix to the database. This term is used as a delimiter between stemmed and unstemmed terms. Finally, I index stemmed text. When generating a snippet (if a stemmer is being used) I get positions of the stemmed terms (that the snippet should consist of) and the...
2007 Mar 04
5
Getting non-stemmed terms from IndexReader
I need to get a set of terms being indexed using Ferret. I used IndexReader.terms and it returns a list of TermEnum nicely. The only problem is that my analyzer includes a stemming filter. So now, the terms I'm getting back are all stemmed. Is there any way to get the original unstemmed terms back from the index somehow? Thanks. -- Posted via http://www.ruby-forum.com/.
2008 Mar 27
2
Proper noun stemming
Hi All, I was wondering if anyone had a solution for the following problem. I use QueryParser to stem my documents before adding them to a database. During the stemming process I would like to find a way of keeping proper nouns that span two or more words together as a phrase. For example "New York" or "Gordon Brown" or "Prime Minister" get split up. I see
2017 Jun 14
2
KMeans Clusterer - Going forward
Hello, I have finished moving the API to PIMPL classes and will fix issues within the current code over the next week, based on reviews from mentors. The next step going forward is to start with forming document vectors that are reduced and more useful. This majorly helps in saving run time (since time for distance calculation depends on number of terms). Getting the useful terms within a
2023 Mar 26
1
manual flushing thresholds for deletes?
...ngth itself, > and 3x for the position overhead If I follow you want an approximation to the number of raw bytes in the text to match the non-delete case, so I think you want something like: get_doclength() / 2 * (mean_word_length + 1) The /2 is assuming you're indexing both stemmed and unstemmed terms since with the default indexing strategy one word in the document generates one of each. The +1 is for the spaces between words in the text. This is likely to underestimate due to punctuation and runs of whitespace, so perhaps +1.<something> is better (and perhaps better to overestima...
2009 Mar 26
1
ideas on picking stopwords
I'm looking at adding some stopwords to my indexing procedure, and was wondering if anyone had any good rules of thumb on how to pick which words to blacklist. It all seems a little... well... vague. Although I guess it kind of depends on the sort of documents you're wanting to index. My current idea is to write a little script to output the terms with the highest frequency in my
2006 Aug 11
3
Proposed changes to omindex
Proposed changes to omindex Currently Available Items ========================= 1) Have the Q prefix contain the 16 byte MD5 of the full file name used for document lookup during indexing. 2) Add the document's last modified time to the value table (ID 0). This would allow incremental indexing based on the timestamp and also sorting by date in omega (SORT=0) a. Currently I store the timestamp
2011 May 27
1
Does OP_NEAR works with stemming?
Hi All, I used the OP_NEAR operator for queryparser, and when I searched for "apple store" from my own collection, the query is parsed as "Zappl:(pos=1) NEAR 11 Zstore:(pos=2)" but retrieved nothing. However, if I type in "Apple Store", the query is parsed as Xapian::Query((apple:(pos=1) NEAR 11 store:(pos=2))) and some results are shown. I'm not sure whether
2007 Jun 28
1
TermGenerator and SimpleStopper
Hi, I'm using SimpleStopper with TermGenerator in a Python indexing script, in an attempt to keep my index size down (currently 30K per doc, and I have 200 million docs to index, which I think implies 6TB.) However, unprefixed (positional?) terms are not affected by the stopper, though Z-prefixed terms are. I assume this is intentional for phrase queries, but I need to reduce my
2010 May 27
1
Problem with stop words by indexing
...ocs - now done. > > This ought to be more configurable, as should some other things in > TermGenerator. I'm thinking we should look at how to improve TermGenerator > in 1.3.x. The 1.3.x release is a little bit far away for my use case (I speak here only about the capacity of removing unstemmed stop words). I have (in termgenerator_internal.cc, line 129) changed the default value of stop_mode from STOPWORDS_INDEX_UNSTEMMED_ONLY to STOPWORDS_IGNORE and Xapian now does exactly what I want. Wouldn't it be possible to simply add a property "stopper_strategy" to the termgenerator...
2015 Jul 23
1
Get term from document by position
Hello. Is there any FAST way to get a term from the Xapian document by its position, something like std::string term = Xapian::Document::GetTermByPosition(int position) ? Below I have described a task that I am trying to solve, in case somebody is interested. ============================================================================ When displaying search results, I would like to
2011 Jun 04
1
Problem with Snowball & RWeka
...s I got a Java error "Could not initialize the GenericPropertiesCreator. This exception was produced: java.lang.NullPointerException". After receiving this error once in the session, no further error messages are generated. However, SnowballStemmer() and stemDocument() return the original unstemmed text. Possible Solution: For those on Mac OS, Kurt Hornik wrote... These issues seem to be specific to Mac OS X. Recent versions of Weka have added a package management system not unlike R's, to the effect that now when external packages (or the Snowball jar) is loaded their...
2023 Mar 24
1
manual flushing thresholds for deletes?
Years ago, I ran into OOM problems with the default flush threshold of 10000 documents while indexing (add/replace). Realizing I had documents of hugely varying sizes (0.5KB..20MB) and little RAM, I instead tracked the number of raw bytes in the text being indexed and flushed whenever I'd seen a configurable byte count. Not the most scientific way, but it seems to work well enough on low-end
2023 Mar 27
1
manual flushing thresholds for deletes?
...position overhead > > If I follow you want an approximation to the number of raw bytes in the > text to match the non-delete case, so I think you want something like: > > get_doclength() / 2 * (mean_word_length + 1) > > The /2 is assuming you're indexing both stemmed and unstemmed terms > since with the default indexing strategy one word in the document > generates one of each. > > The +1 is for the spaces between words in the text. This is > likely to underestimate due to punctuation and runs of whitespace, > So perhaps +1.<something> is better (an...
2012 Jan 05
1
Enhance synonyms feature of the query parser (patch included)
...yparser.lemon' --- *** queryparser.lemony 2012-01-05 12:28:39.000000000 +0800 --- queryparser.lemony.new 2012-01-05 12:52:56.000000000 +0800 *************** *** 307,316 **** --- 307,318 ---- for (piter = prefixes.begin(); piter != prefixes.end(); ++piter) { // First try the unstemmed term: string term; + #ifndef HAVE_SYNONYMS_ENH if (!piter->empty()) { term += *piter; if (prefix_needs_colon(*piter, name[0])) term += ':'; } + #endif term += name; Xapian::Database db = state->get_database(); ********...
2011 Jul 27
3
Searching using prefixes
Hi guys I'm trying to figure out how I can use probabilistic searching on a given field within a document; I've written to the list about this before, but haven't quite figured out what's required and, following a little research, I think I understand what I need to do but I'd like a clarification on this. o We have a database of a number of documents, with fields: title,
2007 Oct 01
3
How to beat Google aka Xapian & Natural Language Processing.
Xapians! If tomorrow the Xapian search engine achieved the same performance and results in searches as Google, we would not be able to beat Google, because we would only create a copy of the searches that already exist in the Google search engine. However, there is a way to beat anyone, and there is a way to beat Google successfully as well: just do not give up. Some see it as implementing Ajax, or
2013 Nov 30
4
Full text search improvements
...to word prefixes. This of course would mean that the "ab" referring to "a" UID list would no longer work for the first nodes. Substring searching likely wouldn't work very nicely for stemmed words. So Squat should probably index the full stemmed word and then also index the unstemmed word in the small 4 letter pieces. It should be possible to also disable substring searching entirely. Squat already attempts to reduce disk space by encoding the common characters with fewer bits than other characters. This is hardcoded for the English language though. Each index compression could ana...