I would like to thank all the people who have contributed to this very fine project. Great work!

I've encountered some strange results while examining the term frequency of one of my indexed documents. The indexed terms seem to vary for the very same document depending on the presence or absence of completely unrelated operations in the code, so the resulting term frequencies change, too.

I repeatedly call 'index_reader.term_docs_for' for the only document I've indexed in the snippet below, but depending on the presence of the statement 'dummy_count = 0' or of some formatting code for the output, the resulting term frequencies change from correct answers to wrong ones. Sometimes terms are not found at all.

For closer examination I have added a complete snippet which produces this behavior on my system (the text is taken from http://de.wikipedia.org/wiki/Entgelt). I'm working with Ferret version 0.11.3, C extensions compiled with VC6.0 (but the 0.10.9-mswin32 binaries from the ferret gem show the same behavior), and Ruby version 1.8.5.

Has anybody an explanation for that, or do I misuse something?

    require 'rubygems'
    require 'ferret'

    $KCODE = 'u'

    text = <<END_OF_TEXT
    Der Begriff Entgelt (n.; Plural "Entgelte") bezeichnet die in einem Vertrag...
    END_OF_TEXT

    class StemAnalyzer < Ferret::Analysis::Analyzer
      def token_stream(field, str)
        return Ferret::Analysis::StemFilter.new(Ferret::Analysis::StandardTokenizer.new(str), "german")
      end
    end

    puts "Using Ferret v#{Ferret::VERSION}..."
    puts "Using Ruby v#{VERSION}..."

    @index = Ferret::I.new(:analyzer => StemAnalyzer.new())
    @index << {:title => "Entgelt", :content => text}

    #dummy_count = 0

    index_reader = @index.reader

    tde = index_reader.term_docs_for(:content, "Vertrag")
    tde.each do |did, freq|
      puts "Term 'Vertrag' occurs in Document '#{@index[did][:title]}' #{freq} times (5 expected)\n"
    end

    tde = index_reader.term_docs_for(:content, "BGB")
    tde.each do |did, freq|
      puts "Term 'BGB' occurs in Document '#{@index[did][:title]}' #{freq} times (3 expected)\n"
    end

    tde = index_reader.term_docs_for(:content, "Leistung")
    tde.each do |did, freq|
      puts "Term 'Leistung' occurs in Document '#{@index[did][:title]}' #{freq} times (12 expected)\n"
    end

Output:

    => Using Ferret v0.11.3...
    => Using Ruby v1.8.5...
    => Term 'Vertrag' occurs in Document 'Entgelt' 4 times (5 expected)
    => Term 'Leistung' occurs in Document 'Entgelt' 3 times (12 expected)

Output after uncommenting 'dummy_count = 0':

    => Using Ferret v0.11.3...
    => Using Ruby v1.8.5...
    => Term 'Vertrag' occurs in Document 'Entgelt' 5 times (5 expected)
    => Term 'BGB' occurs in Document 'Entgelt' 3 times (3 expected)
    => Term 'Leistung' occurs in Document 'Entgelt' 12 times (12 expected)

--
Posted via http://www.ruby-forum.com/.
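One way to cross-check the "expected" numbers without involving the index at all is to tally tokens in plain Ruby. The following is only a rough sketch under stated assumptions: it needs a current Ruby (1.9 or later, not the 1.8.5 used above), `token_counts` and `SAMPLE_TEXT` are hypothetical names, and the sample sentence is made up because the post elides the article text. It also does no stemming, so inflected forms such as "Vertrags" are counted separately, whereas the German StemFilter in the snippet may fold them together.

```ruby
# Made-up stand-in text; the original post elides the Wikipedia article.
SAMPLE_TEXT = "Der Vertrag regelt die Leistung. Ohne Vertrag keine Leistung, so das BGB."

# Tally case-insensitive tokens of a text, independent of any Ferret index.
def token_counts(text)
  counts = Hash.new(0)
  # \p{L}+ matches runs of Unicode letters, so umlauts stay inside a token.
  text.scan(/\p{L}+/) { |token| counts[token.downcase] += 1 }
  counts
end

counts = token_counts(SAMPLE_TEXT)
%w[vertrag bgb leistung].each do |term|
  puts "#{term}: #{counts[term]}"
end
```

If a tally like this stays the same from run to run while the index-based frequencies fluctuate, that would point at the indexing side rather than at the input text.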
On 3/21/07, Thomas Senf <thomas.senf at web.de> wrote:
> Has anybody an explanation for that or do I misuse something?
> <snip>Test Code</snip>

Hi Thomas,

Firstly, well done compiling Ferret on Windows, and thanks for posting this. The reason I haven't yet released a win32 gem is that I'm still trying to work out the String#dump issue, which is wreaking havoc when people try to use Ferret with Rails on Windows. I suspect this issue of yours is somehow related. I'll let you know as soon as I find a solution.

Cheers,
Dave

--
Dave Balmain
http://www.davebalmain.com/
David Balmain wrote:
> On 3/21/07, Thomas Senf <thomas.senf at web.de> wrote:
>> Has anybody an explanation for that or do I misuse something?
>> <snip>Test Code</snip>

I ran the test code on both the 0.10.9 win32 gem and on 0.11.3 under Cygwin. Here are the results:

# dummy_count = 0 (commented out):

    Using Ferret v0.10.9...
    Using Ruby v1.8.5...
    Term 'Vertrag' occurs in Document 'Entgelt' 4 times (5 expected)
    Term 'BGB' occurs in Document 'Entgelt' 1 times (3 expected)
    Term 'Leistung' occurs in Document 'Entgelt' 5 times (12 expected)

    Using Ferret v0.11.3...
    Using Ruby v1.8.5...
    Term 'Vertrag' occurs in Document 'Entgelt' 5 times (5 expected)
    Term 'BGB' occurs in Document 'Entgelt' 9 times (3 expected)
    Term 'Leistung' occurs in Document 'Entgelt' 12 times (12 expected)

dummy_count = 0 (active):

    C:\Documents and Settings\Patrick Ritchie\ruby>ruby tf_test.rb
    Using Ferret v0.10.9...
    Using Ruby v1.8.5...
    Term 'Vertrag' occurs in Document 'Entgelt' 4 times (5 expected)
    Term 'BGB' occurs in Document 'Entgelt' 1 times (3 expected)
    Term 'Leistung' occurs in Document 'Entgelt' 5 times (12 expected)

    Using Ferret v0.11.3...
    Using Ruby v1.8.5...
    Term 'Vertrag' occurs in Document 'Entgelt' 5 times (5 expected)
    Term 'BGB' occurs in Document 'Entgelt' 9 times (3 expected)
    Term 'Leistung' occurs in Document 'Entgelt' 12 times (12 expected)

The results don't seem to change when dummy_count is set. I think the difference between Cygwin and the straight win32 build is the UTF-8 support.

Cheers!
Patrick
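To illustrate why UTF-8 support could plausibly affect term counts for German text, here is a small sketch in plain Ruby. It is not Ferret's tokenizer and makes no claim about what either build actually does; `ascii_tokens` and `unicode_tokens` are hypothetical helpers, and the sketch assumes Ruby 1.9 or later with UTF-8 source encoding. It only shows that a tokenizer that knows ASCII letters alone splits a word at its umlaut, while a Unicode-aware one keeps the word whole.

```ruby
word = "Verträge"

# Tokenizer that only knows ASCII letters: the umlaut acts as a separator.
def ascii_tokens(text)
  text.scan(/[A-Za-z]+/)
end

# Tokenizer that treats any Unicode letter as part of a word.
def unicode_tokens(text)
  text.scan(/\p{L}+/)
end

# The umlaut is one character but two bytes in UTF-8.
puts "characters: #{word.length}, bytes: #{word.bytesize}"
puts "ASCII-only tokens:    #{ascii_tokens(word).inspect}"
puts "Unicode-aware tokens: #{unicode_tokens(word).inspect}"
```

Two builds that differ in this way would index different terms for the same document, which is one possible reading of the differing counts above.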