Hi all,

I posted this a few weeks ago but no one answered, and this feature is REALLY important for us.

I have many objects with a url field, containing standard URLs of course. I'm trying to match them, but I'm having problems with that.

Here's a small piece of code showing what I would like to achieve:

    require 'rubygems'
    require 'ferret'
    require 'fileutils'

    class TestAnalyzer
      def token_stream(field, str)
        ts = Ferret::Analysis::AsciiStandardTokenizer.new(str)
        Ferret::Analysis::AsciiLowerCaseFilter.new(ts)
      end
    end

    # Start from an empty index directory
    FileUtils.rm_rf('/tmp/ferret_test')
    FileUtils.mkdir_p('/tmp/ferret_test')

    INDEX = Ferret::I.new(:path => '/tmp/ferret_test', :analyzer => TestAnalyzer.new)

    INDEX << {:type => :url, :url => 'http://google.fr'}
    INDEX << {:type => :url, :url => 'http://ferret.davebalmain.com'}
    INDEX << {:type => :url, :url => 'http://www.unixaumonde.com'}
    INDEX << {:type => :url, :url => 'http://www.rift.fr'}

    ['type:url AND url:*google*',
     'type:url AND url:*"://foobar"*',
     'type:url AND url:"http://goo"*',
     'type:url AND url:"http://goo*"'].each do |q|
      puts "\nSearching #{q}"
      INDEX.search(q).hits.each { |x| p INDEX[x.doc].load }
      puts "\n"
    end

I hope Dave or anyone else will be able to give us a hint or a release, something like that.

Regards,
Jeremie 'ahFeel' BORDIER
Rift Technologies

-- 
Posted via http://www.ruby-forum.com/.
On Tue, Apr 03, 2007 at 12:04:28PM +0200, ahFeel wrote:
> Hi all,
>
> I've posted that few weeks ago but no one answered, but this feature is
> REALLY important for us.
>
> I have many objects with a url field, of course containing standards
> urls...
> I'm trying to match them but i actually got problems with that.

Ok, here we go:

First of all, use INDEX.process_query(query_string) to see how Ferret sees your queries after the QueryParser has parsed them. You'll see that the results Ferret gives perfectly match the queries the parser generated from your query strings, but these are not the results you want. So you'll have to work on the analysis part.

Here it seems your problem is that your analyzer is stripping away the wildcards you use, i.e.

    a = TestAnalyzer.new
    qp = Ferret::QueryParser.new :analyzer => a
    qp.parse 'url:"http://ferret.davebalmain.com"'
    # url:ferret.davebalmain.com
    qp.parse 'url:"http://ferret*"'
    # url:ferret -> bad, won't match

A custom URLAnalyzer that strips away the protocol://, but leaves wildcards in queries intact, could help here. You should also think about further tokenizing the domain part by splitting at '.' (as a LetterTokenizer would do). That way url:ferret would match the ferret.davebalmain.com URL even without a wildcard.

Also keep in mind that you do not have to use Ferret's QueryParser if it doesn't fit your needs. You can always build your own.

Jens

-- 
Jens Krämer
webit! Gesellschaft für neue Medien mbH
Schnorrstraße 76 | 01069 Dresden
Telefon +49 351 46766-0 | Telefax +49 351 46766-66
kraemer at webit.de | www.webit.de
Amtsgericht Dresden | HRB 15422
GF Sven Haubold, Hagen Malessa
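To make the tokenizing idea above a bit more concrete, here is a plain-Ruby illustration of the terms such an analyzer might emit for a URL. It makes no Ferret calls at all: url_terms is a hypothetical name, and a real version would live in an analyzer's token_stream method. It only mimics the described steps (drop the protocol://, lowercase, split into runs of letters).

```ruby
# Plain-Ruby illustration only; url_terms is a hypothetical helper and is
# NOT a Ferret analyzer. It mimics the analysis suggested above:
#   1. drop a leading "protocol://"
#   2. lowercase
#   3. split into runs of letters (roughly what a LetterTokenizer does)
def url_terms(url)
  url.sub(%r{\A[a-z][a-z0-9+.-]*://}i, '').downcase.scan(/[a-z]+/)
end

p url_terms('http://ferret.davebalmain.com')
# prints ["ferret", "davebalmain", "com"]
```

With terms like these in the index, a query such as url:ferret could match the document without any wildcard.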
Thank you for your useful answer. Even if this is quite a weird behavior of Ferret's query parser, I'll try to go on with that :)

Thanks again, Jens, for everything you do for Ferret too! :)

Regards,
Jérémie 'ahFeel' BORDIER
Rift Technologies.
On 4/3/07, ahFeel <ahfeel_nospam_ at rift.fr> wrote:
> Thank you for you're usefull answer, even if it's quite a weird behavior
> of Ferret's query parser, i'll try to go on with that :)

I can see why this behaviour may seem a little weird. Unfortunately, the way phrase queries are implemented, it is impossible to have a wildcard term within a phrase query. So "http://goo*" treats http://goo* as a term in a phrase query and runs it through the analyzer, which then strips the wildcard character '*'. "http://goo"* is a phrase query with '*' at the end, which doesn't have any meaning in the Ferret query language.

http://goo* should work with a WhiteSpaceAnalyzer. The StandardAnalyzer strips the http:// (or file:/// or ftp://) from the beginning of terms during analysis. However, when you add a wildcard character to a query, the term doesn't get analyzed. So basically the query http://google.fr will be converted to the query google.fr and will match. The query http://goo*, on the other hand, will not be analyzed, so it looks for terms starting with http://goo. There is no http://google.fr in the index, only google.fr, so you won't get a match. Searching for goo*, however, will work.

What you might like to try is stripping http:// from your queries with a simple query.gsub(/http:\/\//, '').

Hope that helps,
Dave

-- 
Dave Balmain
http://www.davebalmain.com/
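The gsub suggestion above could be wrapped in a small helper that also covers the other two prefixes mentioned (file:/// and ftp://). This is only a sketch: strip_protocol and PROTOCOL_PREFIX are hypothetical names, not part of Ferret, and it assumes those three prefixes are the only ones that need removing before the query string is handed to the parser.

```ruby
# Hypothetical helper, not part of Ferret. It removes protocol prefixes from
# a query string so that wildcard terms line up with what the analyzer
# stored in the index (google.fr rather than http://google.fr).
# Assumption: only http://, ftp:// and file:/// need to be handled.
PROTOCOL_PREFIX = %r{\b(?:http|ftp|file)://+}i

def strip_protocol(query)
  query.gsub(PROTOCOL_PREFIX, '')
end

puts strip_protocol('type:url AND url:http://goo*')
# prints type:url AND url:goo*
```

The cleaned string can then be passed to the search as usual, for example INDEX.search(strip_protocol(q)) in the test script from the first message.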