I can consistently segfault the 0.10.4 gem, so I'm trying to get the
subversion version working with hopes towards tracking the problem down.

I have a fresh SVN checkout but:

a) the version (in ferret.rb) claims to be 0.9.6; and
b) Ferret::Index::FieldInfos and a couple other classes are missing at
   run time. It looks like this is because they're not exported in the C
   extension (although I do see the corresponding C objects in the
   code.)

Have I managed to acquire some outdated version of Ferret?

Thanks for any help!

--
William <wmorgan-ferret at masanjin.net>
On 9/24/06, William Morgan <wmorgan-ferret at masanjin.net> wrote:
> I can consistently segfault the 0.10.4 gem, so I'm trying to get the
> subversion version working with hopes towards tracking the problem down.
>
> I have a fresh SVN checkout but:
>
> a) the version (in ferret.rb) claims to be 0.9.6; and
> b) Ferret::Index::FieldInfos and a couple other classes are missing at
>    run time. It looks like this is because they're not exported in the C
>    extension (although I do see the corresponding C objects in the
>    code.)
>
> Have I managed to acquire some outdated version of Ferret?
>
> Thanks for any help!

Hi William,

The 0.10.* series was developed in a different subversion repository.
You can check it out from:

$ svn co svn://www.davebalmain.com/exp ferret

If I have time today I might roll it into the original repository. I'm
not sure exactly how I'm going to do it though.

By the way, the 0.10.7 gem is out and it has all changes in it,
including the fix for your TermQuery problem.

Cheers,
Dave
Excerpts from David Balmain's mail of 23 Sep 2006 (PDT):
> The 0.10.* series was developed in a different subversion repository.
> You can check it out from:
>
> $ svn co svn://www.davebalmain.com/exp ferret

Thanks! See patch in following message.

> By the way, the 0.10.7 gem is out and it has all changes in it,
> including the fix for your TermQuery problem.

Sadly it doesn't seem to fix the problem, but I'll spend some more time
playing around now that I have the updated source.

--
William <wmorgan-ferret at masanjin.net>
On 9/25/06, William Morgan <wmorgan-ferret at masanjin.net> wrote:
> Excerpts from David Balmain's mail of 23 Sep 2006 (PDT):
> > The 0.10.* series was developed in a different subversion repository.
> > You can check it out from:
> >
> > $ svn co svn://www.davebalmain.com/exp ferret
>
> Thanks! See patch in following message.
>
> > By the way, the 0.10.7 gem is out and it has all changes in it,
> > including the fix for your TermQuery problem.
>
> Sadly it doesn't seem to fix the problem, but I'll spend some more time
> playing around now that I have the updated source.

Hi William,

Did you rebuild the index? You'll need to do that before it makes any
difference.

Cheers,
Dave
Hi Dave,

Excerpts from David Balmain's mail of 24 Sep 2006 (PDT):
> Did you rebuild the index? You'll need to do that before it makes any
> difference.

Yes, the original example now works---thanks! Unfortunately, I still see
a lot of queries that return nothing in TermQuery form, but work fine in
String form.

For example:

> (0..10).each do |j|
>   m = @i[j][:message_id]
>   n1 = @i.search(Ferret::Search::TermQuery.new(:message_id, m)).total_hits
>   n2 = @i.search("message_id:#{m}").total_hits
>   puts "#{m}: #{n1} #{n2}"
> end
43134A26.5010503 at qwest.com: 0 1
20050830.032307.1370960293.aamine at loveruby.net: 1 1
43137684.4090506 at qwest.com: 1 1
39AA6550E5AA554AB1456707D6E5563D0DCCF5 at QTOMAE2K3M01.AD.QINTRA.COM: 0 1
200508292246.j7TMkwdh001657 at sharui.nakada.niregi.kanuma.tochigi.jp: 0 1
87zmr017on.fsf at m17n.org: 1 1
1125383295.382347.22398.nullmailer at x31.priv.netlab.jp: 1 1
9B68375A-AA86-4EB9-AEC9-675E7C6EFBA6 at pobox.com: 0 1
20050905154808.53555.qmail at web50313.mail.yahoo.com: 1 1
431C7204.80505 at pobox.com: 0 1
200509052114.j85LEek4030178 at rubyforge.org: 0 1

Based on the first and third entries, I can't imagine this is a
tokenization problem. What do you think?

--
William <wmorgan-ferret at masanjin.net>
On 9/27/06, William Morgan <wmorgan-ferret at masanjin.net> wrote:
> Hi Dave,
>
> Excerpts from David Balmain's mail of 24 Sep 2006 (PDT):
> > Did you rebuild the index? You'll need to do that before it makes any
> > difference.
>
> Yes, the original example now works---thanks! Unfortunately, I still see
> a lot of queries that return nothing in TermQuery form, but work fine in
> String form.
>
> For example:
>
> > (0..10).each do |j|
> >   m = @i[j][:message_id]
> >   n1 = @i.search(Ferret::Search::TermQuery.new(:message_id, m)).total_hits
> >   n2 = @i.search("message_id:#{m}").total_hits
> >   puts "#{m}: #{n1} #{n2}"
> > end
> 43134A26.5010503 at qwest.com: 0 1
> 20050830.032307.1370960293.aamine at loveruby.net: 1 1
> 43137684.4090506 at qwest.com: 1 1
> 39AA6550E5AA554AB1456707D6E5563D0DCCF5 at QTOMAE2K3M01.AD.QINTRA.COM: 0 1
> 200508292246.j7TMkwdh001657 at sharui.nakada.niregi.kanuma.tochigi.jp: 0 1
> 87zmr017on.fsf at m17n.org: 1 1
> 1125383295.382347.22398.nullmailer at x31.priv.netlab.jp: 1 1
> 9B68375A-AA86-4EB9-AEC9-675E7C6EFBA6 at pobox.com: 0 1
> 20050905154808.53555.qmail at web50313.mail.yahoo.com: 1 1
> 431C7204.80505 at pobox.com: 0 1
> 200509052114.j85LEek4030178 at rubyforge.org: 0 1
>
> Based on the first and third entries, I can't imagine this is a
> tokenization problem. What do you think?
>
> --
> William <wmorgan-ferret at masanjin.net>

Hi William,

You need to downcase the term when you add it to a TermQuery. The
StandardAnalyzer downcases all text so you need to do the same with
any terms you add to any hand built queries. One way to see what might
possibly be wrong is to run the term through the analyzer yourself.
require 'rubygems'
require 'ferret'

include Ferret::Analysis

EMAILS = [
  "43134A26.5010503 at qwest.com",
  "20050830.032307.1370960293.aamine at loveruby.net",
  "43137684.4090506 at qwest.com",
  "39AA6550E5AA554AB1456707D6E5563D0DCCF5 at QTOMAE2K3M01.AD.QINTRA.COM",
  "200508292246.j7TMkwdh001657 at sharui.nakada.niregi.kanuma.tochigi.jp",
  "87zmr017on.fsf at m17n.org",
  "1125383295.382347.22398.nullmailer at x31.priv.netlab.jp",
  "9B68375A-AA86-4EB9-AEC9-675E7C6EFBA6 at pobox.com",
  "20050905154808.53555.qmail at web50313.mail.yahoo.com",
  "431C7204.80505 at pobox.com",
  "200509052114.j85LEek4030178 at rubyforge.org"
]

a = StandardAnalyzer.new

EMAILS.each do |email|
  print email + ":"
  # Run the raw string through the analyzer, as happens at index time.
  tz = a.token_stream(:field, email)
  # Compare the raw string with the first token the analyzer emits;
  # false means the indexed term differs from what you typed.
  puts email == tz.next.text
end

Hope that clears things up.

Cheers,
Dave
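
To make the downcasing point concrete without needing Ferret installed, here is a minimal sketch in plain Ruby. It is only a toy model of an inverted index, not Ferret's implementation: ToyIndex, LOWERCASING and CASE_PRESERVING are made-up names, the message id is invented, and the "analyzers" are plain lambdas that split on whitespace. The assumption it illustrates is the one described above: text goes through the analyzer when it is indexed, whereas a hand-built term lookup uses the term exactly as given.

```ruby
# Toy model only: this is NOT Ferret's code or API. ToyIndex, LOWERCASING and
# CASE_PRESERVING are hypothetical names made up for this illustration.

# An "analyzer" here is just a callable that turns text into index terms.
LOWERCASING     = ->(text) { text.split.map(&:downcase) }
CASE_PRESERVING = ->(text) { text.split }

class ToyIndex
  def initialize(analyzer)
    @analyzer = analyzer
    @postings = Hash.new { |hash, term| hash[term] = [] } # term => doc ids
    @doc_count = 0
  end

  # Index one document: analyze the text, then record every resulting term.
  def add(text)
    doc_id = @doc_count
    @doc_count += 1
    @analyzer.call(text).uniq.each { |term| @postings[term] << doc_id }
    doc_id
  end

  # Exact-term lookup, like a hand-built term query: the term is used as-is.
  def term_hits(term)
    @postings.fetch(term, []).size
  end

  # Lookup that runs the input through the analyzer first, like a parsed
  # query string. Counts documents matching any of the analyzed terms.
  def analyzed_hits(text)
    @analyzer.call(text).flat_map { |term| @postings.fetch(term, []) }.uniq.size
  end
end

mid = "ABC123.XYZ@example.com" # made-up message id

index = ToyIndex.new(LOWERCASING)
index.add(mid)
puts index.term_hits(mid)          # raw mixed-case term
puts index.term_hits(mid.downcase) # term normalized by hand
puts index.analyzed_hits(mid)      # analyzer normalizes for us
```

With the lowercasing analyzer, the raw mixed-case lookup finds nothing while the other two lookups each find the document. Swapping in CASE_PRESERVING makes the raw lookup succeed instead, which is the behaviour one would expect from an analyzer that does not downcase.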
Hi Dave,

Excerpts from David Balmain's mail of 26 Sep 2006 (PDT):
> You need to downcase the term when you add it to a TermQuery. The
> StandardAnalyzer downcases all text so you need to do the same with
> any terms you add to any hand built queries.

Thanks for the response. Downcasing the string passed into the TermQuery
does, in fact, retrieve the document. BUT, I had used a
WhiteSpaceAnalyzer with no downcasing on that field, so it should have
preserved case in the index.

In fact, some experimentation shows:

> mid = "43134A26.5010503 at qwest.com"
> i = Ferret::Index::Index.new
> wsa = Ferret::Analysis::WhiteSpaceAnalyzer.new false
> wsa.token_stream(:message_id, mid).next
=> token["43134A26.5010503 at qwest.com":0:26:1]
> i.add_document({:message_id => mid}, wsa)
> i.search(Ferret::Search::TermQuery.new(:message_id, mid))
=> #<struct Ferret::Search::TopDocs total_hits=0, hits=[], max_score=0.0>
> i.search(Ferret::Search::TermQuery.new(:message_id, mid.downcase))
=> #<struct Ferret::Search::TopDocs total_hits=1, hits=[#<struct Ferret::Search::Hit doc=0, score=0.3068528175354>], max_score=0.3068528175354>

So it looks like WSA#token_stream does the right thing. Is it possible
it's not actually being called at insertion time? Or am I
misunderstanding something?

--
William <wmorgan-ferret at masanjin.net>
On 9/28/06, William Morgan <wmorgan-ferret at masanjin.net> wrote:
> Hi Dave,
>
> Excerpts from David Balmain's mail of 26 Sep 2006 (PDT):
> > You need to downcase the term when you add it to a TermQuery. The
> > StandardAnalyzer downcases all text so you need to do the same with
> > any terms you add to any hand built queries.
>
> Thanks for the response. Downcasing the string passed into the TermQuery
> does, in fact, retrieve the document. BUT, I had used a
> WhiteSpaceAnalyzer with no downcasing on that field, so it should have
> preserved case in the index.
>
> In fact, some experimentation shows:
>
> > mid = "43134A26.5010503 at qwest.com"
> > i = Ferret::Index::Index.new
> > wsa = Ferret::Analysis::WhiteSpaceAnalyzer.new false
> > wsa.token_stream(:message_id, mid).next
> => token["43134A26.5010503 at qwest.com":0:26:1]
> > i.add_document({:message_id => mid}, wsa)
> > i.search(Ferret::Search::TermQuery.new(:message_id, mid))
> => #<struct Ferret::Search::TopDocs total_hits=0, hits=[], max_score=0.0>
> > i.search(Ferret::Search::TermQuery.new(:message_id, mid.downcase))
> => #<struct Ferret::Search::TopDocs total_hits=1, hits=[#<struct Ferret::Search::Hit doc=0, score=0.3068528175354>], max_score=0.3068528175354>
>
> So it looks like WSA#token_stream does the right thing. Is it possible
> it's not actually being called at insertion time? Or am I
> misunderstanding something?
>
> --
> William <wmorgan-ferret at masanjin.net>

Hi William,

Ok, this is definitely a bug. I've already fixed it and it'll be out
in the next release.

By the way, you probably already know this but you can set the
analyzer used by the index.

Ferret::Index::Index.new(:analyzer => wsa)

You probably have a good reason to be doing it the way you are but I
just wanted to check.

Cheers,
Dave
Excerpts from David Balmain's mail of 27 Sep 2006 (PDT):
> Ok, this is definitely a bug. I've already fixed it and it'll be out
> in the next release.

Thank you.

> By the way, you probably already know this but you can set the
> analyzer used by the index.
>
> Ferret::Index::Index.new(:analyzer => wsa)
>
> You probably have a good reason to be doing it the way you are but I
> just wanted to check.

Nope, no good reason. Just an incomplete understanding of the API. This
way's much better.

--
William <wmorgan-ferret at masanjin.net>