I need to get the set of terms that Ferret has indexed. I used IndexReader.terms and it returns a TermEnum nicely. The only problem is that my analyzer includes a stemming filter, so the terms I'm getting back are all stemmed. Is there any way to get the original unstemmed terms back from the index? Thanks.

--
Posted via http://www.ruby-forum.com/.
David Balmain
2007-Mar-06 02:13 UTC
[Ferret-talk] Getting non-stemmed terms from IndexReader
On 3/5/07, Ted <admin at mightytofu.com> wrote:
> I need to get a set of terms being indexed using Ferret. I used
> IndexReader.terms and it returns a list of TermEnum nicely. The only
> problem is that my analyzer includes a stemming filter.
> So now, the terms I'm getting back are all stemmed. Is there anyway to
> get the original unstemmed terms back from the index somehow? Thanks.

Hi Ted,

Unfortunately this isn't really possible. What I'd recommend is indexing the field twice; once with a stemming analyzer and once without. See PerFieldAnalyzer;

http://ferret.davebalmain.com/api/classes/Ferret/Analysis/PerFieldAnalyzer.html

Hope that helps.

Cheers,
Dave

--
Dave Balmain
http://www.davebalmain.com/
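The two-field approach suggested above can be illustrated with a small, self-contained Ruby sketch. It does not use Ferret: the field names (content, content_stemmed), the ANALYZERS table, the crude_stem helper, and build_term_sets are all hypothetical stand-ins, and crude_stem is far cruder than a real stemming filter. In Ferret itself, routing each field to its own analyzer would be the job of PerFieldAnalyzer, documented at the link above.

```ruby
require 'set'

# Toy stand-in for a stemming filter: strips a few common English suffixes.
# A real stemmer is far more careful; this is only for illustration.
def crude_stem(word)
  word.sub(/(ing|ed|s)\z/, '')
end

# Hypothetical per-field analyzers: each maps raw text to a list of terms.
ANALYZERS = {
  content:         ->(text) { text.downcase.scan(/[a-z]+/) },
  content_stemmed: ->(text) { text.downcase.scan(/[a-z]+/).map { |w| crude_stem(w) } }
}

# Index the same text under both field names and collect each field's terms.
def build_term_sets(texts)
  terms = Hash.new { |hash, field| hash[field] = Set.new }
  texts.each do |text|
    ANALYZERS.each { |field, analyzer| terms[field].merge(analyzer.call(text)) }
  end
  terms
end

terms = build_term_sets(["Indexing indexed documents", "index a document"])
puts terms[:content].to_a.sort.inspect
# => ["a", "document", "documents", "index", "indexed", "indexing"]
puts terms[:content_stemmed].to_a.sort.inspect
# => ["a", "document", "index"]
```

Several surface forms collapse into one stem, which is why the originals cannot be recovered from the stemmed field alone. Enumerating terms from the unstemmed field gives the original word forms, while searches can still run against the stemmed field. The cost is a larger index, since every term is stored twice.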
Thanks for the response. This is exactly what I did: I indexed the field twice and used a different analyzer for each copy.
I encountered another problem:

After I removed docs from the index, the doc_freq returned by IndexReader.terms is not updated. It still shows the old number, or a bigger number after more docs containing that term are added. So it looks like doc_freq is not updated correctly when a doc is removed.
David Balmain
2007-Mar-06 03:25 UTC
[Ferret-talk] Getting non-stemmed terms from IndexReader
On 3/6/07, Ted <admin at mightytofu.com> wrote:
> I encountered another problem:
>
> After I removed docs from the index, the doc_freq returned by
> IndexReader.terms is not updated. It always shows the old number or
> bigger number after more docs with that term is added.
> So it looks like the doc_freq is not updated corrected on removal of a
> doc.

This is impossible to fix without ruining performance. To fix this problem I would basically need to optimize the index after every deletion. In fact, you can do this yourself if you like. Just optimize the index whenever you need to rely on the doc frequency being correct and you have possible deletions in the index.

Cheers,
Dave

--
Dave Balmain
http://www.davebalmain.com/
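Why the counts stay stale until an optimize can be sketched with a toy model. It assumes, as is common in segment-based indexes, that a deletion only marks a document as deleted and that stored per-term counts are rewritten only when the index is rebuilt; the thread itself only states that optimizing corrects the counts. ToySegment and all of its methods are hypothetical and are not Ferret's API.

```ruby
# Toy model of a single index segment. Not Ferret's implementation.
class ToySegment
  def initialize
    @postings = Hash.new { |hash, term| hash[term] = [] } # term => doc ids
    @deleted  = {}                                         # doc id => true
    @next_id  = 0
  end

  # Adds a document given as a list of terms and returns its doc id.
  def add(terms)
    id = @next_id
    @next_id += 1
    terms.uniq.each { |term| @postings[term] << id }
    id
  end

  # Deleting is cheap: the doc is only marked, postings are left untouched.
  def delete(id)
    @deleted[id] = true
  end

  # Reads the stored posting-list length, so marked docs are still counted.
  def doc_freq(term)
    @postings.key?(term) ? @postings[term].size : 0
  end

  # Rewrites every posting list without the marked docs. This is the
  # expensive step, which is why it is not done on every deletion.
  def optimize
    @postings.each_value { |ids| ids.reject! { |id| @deleted[id] } }
    @postings.delete_if { |_term, ids| ids.empty? }
    @deleted.clear
  end
end

segment = ToySegment.new
first = segment.add(%w[ferret index])
segment.add(%w[ferret term])
segment.delete(first)
puts segment.doc_freq("ferret") # => 2 (stale: still counts the deleted doc)
segment.optimize
puts segment.doc_freq("ferret") # => 1
```

In this model delete is constant-time while optimize touches every posting list, which matches the trade-off described above: optimize only at the points where an accurate doc frequency matters and deletions may have occurred.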
Got it. I had thought that 'flush' would do the trick, but I guess not. I think I will have to call optimize, but only when necessary. Thanks for your response.