thr3ads.net - Ferret talk - [Ferret-talk] [Repost] Problem with url searching.. [Apr 2007]

ahFeel

2007-Apr-03 10:04 UTC

[Ferret-talk] [Repost] Problem with url searching..

Hi all,

I posted this a few weeks ago but no one answered, and this feature is
REALLY important for us.

I have many objects with a url field, containing standard URLs.
I'm trying to match them, but I'm running into problems.

Here's a little code showing what I would like to achieve:
require 'rubygems'
require 'ferret'
require 'fileutils'

class TestAnalyzer
  # Tokenize with the ASCII standard tokenizer, then lowercase each token.
  def token_stream(field, str)
    ts = Ferret::Analysis::AsciiStandardTokenizer.new(str)
    Ferret::Analysis::AsciiLowerCaseFilter.new(ts)
  end
end

# Start from an empty index directory.
FileUtils.rm_rf('/tmp/ferret_test')
FileUtils.mkdir_p('/tmp/ferret_test')
INDEX = Ferret::I.new(:path => '/tmp/ferret_test',
                      :analyzer => TestAnalyzer.new)
INDEX << {:type => :url, :url => 'http://google.fr'}
INDEX << {:type => :url, :url => 'http://ferret.davebalmain.com'}
INDEX << {:type => :url, :url => 'http://www.unixaumonde.com'}
INDEX << {:type => :url, :url => 'http://www.rift.fr'}

# Query strings I would like to work.
['type:url AND url:*google*',
 'type:url AND url:*"://foobar"*',
 'type:url AND url:"http://goo"*',
 'type:url AND url:"http://goo*"'].each do |q|
  puts "\nSearching #{q}"
  INDEX.search(q).hits.each { |x| p INDEX[x.doc].load }
  puts "\n"
end

I hope Dave or anyone else will be able to give us a hint or a release,
or something like that.

Regards,
Jeremie 'ahFeel' BORDIER
Rift Technologies

-- 
Posted via http://www.ruby-forum.com/.

Jens Kraemer

2007-Apr-03 10:39 UTC

On Tue, Apr 03, 2007 at 12:04:28PM +0200, ahFeel wrote:
> Hi all,
> 
> I've posted that few weeks ago but no one answered, but this feature is
> REALLY important for us.
> 
> I have many objects with a url field, of course containing standards
> urls...
> I'm trying to match them but i actually got problems with that.

Ok, here we go:

First of all, use 

INDEX.process_query(query_string) 

to see how Ferret sees your queries after the QueryParser has parsed them.

You'll see that the results Ferret gives perfectly match the queries the
parser generated from your query strings - but these are not the results
you want.

So you'll have to work on the analysis part. Here it seems your problem
is that your analyzer is stripping away the wildcards you use, i.e.

a = TestAnalyzer.new
qp = Ferret::QueryParser.new :analyzer => a
qp.parse 'url:"http://ferret.davebalmain.com"'  # url:ferret.davebalmain.com
qp.parse 'url:"http://ferret*"'                 # url:ferret  -> bad, won't match

A custom URLAnalyzer that strips away the protocol://, but leaves
wildcards in queries intact, could help here. You should also think about
further tokenizing the domain part by splitting at '.' (as a
LetterTokenizer would do). Then url:ferret would match
the ferret.davebalmain.com url even without a wildcard.
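
As a rough illustration of the tokenization described above, here is a
plain-Ruby sketch of the terms such an analyzer might emit. It does not use
Ferret's API at all; url_tokens is a hypothetical helper name, and the
scheme-stripping regular expression is an assumption about what "protocol://"
should cover.

```ruby
# Plain-Ruby illustration only; this is NOT a Ferret analyzer.
# url_tokens is a hypothetical helper, not part of any library.
def url_tokens(text)
  # Drop a leading scheme such as http://, ftp:// or file:/// (assumed pattern).
  without_scheme = text.sub(%r{\A[a-z][a-z0-9+.-]*:/+}i, '')
  # Keep runs of ASCII letters, roughly what a letter-based tokenizer
  # would produce, and lowercase them.
  without_scheme.scan(/[A-Za-z]+/).map(&:downcase)
end

p url_tokens('http://ferret.davebalmain.com')
# prints ["ferret", "davebalmain", "com"]
```

With terms like these in the index, a plain term query for "ferret" has
something to match without any wildcard.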

Also keep in mind that you do not have to use Ferret's Query Parser if
it doesn't fit your needs - you can always build your own.

Jens

-- 
Jens Krämer
webit! Gesellschaft für neue Medien mbH
Schnorrstraße 76 | 01069 Dresden
Telefon +49 351 46766-0 | Telefax +49 351 46766-66
kraemer at webit.de | www.webit.de

Amtsgericht Dresden | HRB 15422
GF Sven Haubold, Hagen Malessa

ahFeel

2007-Apr-03 12:10 UTC

Thank you for your useful answer. Even if it's quite a weird behavior
of Ferret's query parser, I'll try to go on with that :)

Thanks again Jens for everything you do for Ferret too ! :)

Regards,
Jérémie 'ahFeel' BORDIER
Rift Technologies.

-- 
Posted via http://www.ruby-forum.com/.

David Balmain

2007-Apr-06 05:43 UTC

On 4/3/07, ahFeel <ahfeel_nospam_ at rift.fr> wrote:
> Thank you for you're usefull answer, even if it's quite a weird behavior
> of Ferret's query parser, i'll try to go on with that :)

I can see why this behaviour may seem a little weird. Unfortunately,
the way phrase queries are implemented, it is impossible to have a
wildcard term within a phrase query. So "http://goo*" treats
http://goo* as a term in a phrase query and runs it through the
analyzer which then strips the wild-card character '*'.

"http://goo"* is a phrase query with '*' at the end, which doesn't have
any meaning in ferret query language.

http://goo* should work with a WhiteSpaceAnalyzer. The
StandardAnalyzer strips the http:// (or file:/// or ftp://) from the
beginning of terms during analysis. However, when you add a wild-card
character to a query, the term doesn't get analyzed. So basically the
query http://google.fr will be converted to the query google.fr and
will match, but the query http://goo* will not be analyzed and will look
for terms starting with http://goo. There is no http://google.fr in the
index, only google.fr, so you won't get a match. Searching for goo*
however will work. What you might like to try is stripping http:// from
your queries with a simple query.gsub(/http:\/\//, '').
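
A minimal sketch of that query clean-up, extended to the other schemes
mentioned above. strip_scheme and wildcard_hits are hypothetical helper
names; including https is an assumption, and File.fnmatch is only a stand-in
for wildcard matching against indexed terms, not how Ferret evaluates
wildcard queries.

```ruby
# Hypothetical helper: remove URL schemes from a raw query string
# before handing it to the query parser.
def strip_scheme(query)
  query.gsub(%r{\b(?:https?|ftp|file):/+}i, '')
end

# Stand-in for wildcard matching over indexed terms (illustration only).
def wildcard_hits(pattern, terms)
  terms.select { |term| File.fnmatch(pattern, term) }
end

# Terms as they would look once the scheme has been stripped at index time.
indexed_terms = ['google.fr', 'ferret.davebalmain.com']

p wildcard_hits('http://goo*', indexed_terms)
# prints []
p wildcard_hits(strip_scheme('http://goo*'), indexed_terms)
# prints ["google.fr"]
p strip_scheme('type:url AND url:http://goo*')
# prints "type:url AND url:goo*"
```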

Hope that helps,
Dave

-- 
Dave Balmain
http://www.davebalmain.com/
