Carl Lerche
2007-Jan-21 17:09 UTC
[Ferret-talk] A few questions: Tweaking StemFilter, indexes, ...
Hello all,

I am new to the list, but I have been using ferret for a little bit already. I would first like to thank Dave for all his work on ferret. I had a few questions that I haven't been able to figure out after messing around with ferret and going through the documentation.

StemFilter
------

I am trying to improve the quality of my searches in context of the content of my application. I have created an analyzer using the following:

    StemFilter.new StopFilter.new(
      LowerCaseFilter.new(StandardTokenizer.new(text)), @stop_words )

This has been pretty good so far, however, I really would like to get a search for "plumber" match "plumbing" at maybe a lower score than it would match "plumbers". The thing is that plumber(s) is filtered to "plumber" and plumbing is filtered to plumb, so it doesn't match. Is there any way to tweak the filter to be able to do these matches? I would like to match all noun and verbs together (and ideally with a lower score than different verb conjugations would match). Another example would be driving and driver.

Worst case scenario, I could probably do some preprocessing to the search queries to expand "plumber" or "driving" to a query that includes both stems (for example expand the query for plumber to "plumber plumb")

Indexes
---

I was wondering how exactly indexes are implemented under the hood and if there is a way to give hints to ferret as to how our queries will be formed in order to optimize performance. Maybe I'm thinking of ferret too much as a database, but I am not too familiar with what's under ferret's hood.

The reason I ask is that for the project I am working on, I have huge amounts of text to search, but each item also has a location associated with it (longitude & latitude) and each query will only want to search the text located in a specific area (point and radius). I can add ranged parameters to the query and that will work, but is that optimal? Hopefully I am making sense.

Donations
---

I was wondering if there is a page that lists the total amount of donations so far?

Thanks,
-carl

--
EPA Rating: 3000 Lines of Code / Gallon (of coffee)
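
The point-and-radius search described in the Indexes question can be reduced to two ranges (one on latitude, one on longitude) plus an exact distance check on whatever the ranges let through. Below is a minimal sketch in plain Ruby that does not call Ferret at all. The names `bounding_box`, `haversine_km` and `within_radius`, the `:lat`/`:lon` keys and the spherical-Earth radius are all assumptions made for illustration, and the box calculation assumes the circle does not reach a pole or cross the 180° meridian.

```ruby
EARTH_RADIUS_KM = 6371.0 # mean Earth radius, spherical approximation

def deg_to_rad(deg)
  deg * Math::PI / 180.0
end

def rad_to_deg(rad)
  rad * 180.0 / Math::PI
end

# Smallest latitude/longitude box that contains the circle of radius_km
# around (lat, lon). These four numbers are what the range clauses of a
# query would use. Assumes the circle stays away from the poles and the
# 180-degree meridian.
def bounding_box(lat, lon, radius_km)
  angular = radius_km / EARTH_RADIUS_KM # radius as an angle, in radians
  dlat = rad_to_deg(angular)
  dlon = rad_to_deg(Math.asin(Math.sin(angular) / Math.cos(deg_to_rad(lat))))
  { min_lat: lat - dlat, max_lat: lat + dlat,
    min_lon: lon - dlon, max_lon: lon + dlon }
end

# Great-circle distance between two points, in kilometres.
def haversine_km(lat1, lon1, lat2, lon2)
  dlat = deg_to_rad(lat2 - lat1)
  dlon = deg_to_rad(lon2 - lon1)
  a = Math.sin(dlat / 2)**2 +
      Math.cos(deg_to_rad(lat1)) * Math.cos(deg_to_rad(lat2)) *
      Math.sin(dlon / 2)**2
  2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(a))
end

# Cheap box test first (the part a range query can do), then the exact
# distance test to drop the corners of the box.
def within_radius(items, lat, lon, radius_km)
  box = bounding_box(lat, lon, radius_km)
  items.select do |item|
    item[:lat].between?(box[:min_lat], box[:max_lat]) &&
      item[:lon].between?(box[:min_lon], box[:max_lon]) &&
      haversine_km(lat, lon, item[:lat], item[:lon]) <= radius_km
  end
end
```

In a real index only the box would go into the query, with the exact distance check applied to the hits afterwards. If the index compares range terms as strings, the coordinates would also need a fixed-width, sortable encoding before indexing.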
Ewout
2007-Jan-22 00:15 UTC
[Ferret-talk] A few questions: Tweaking StemFilter, indexes, ...
Hi,

You could use a FuzzyQuery, which will match words that have some degree of resemblance, with a lower score.

>StemFilter ------
>
>I am trying to improve the quality of my searches in context of the
>content of my application. I have created an analyzer using the
>following:
>
>StemFilter.new StopFilter.new(
>LowerCaseFilter.new(StandardTokenizer.new(text)), @stop_words )
>
>This has been pretty good so far, however, I really would like to get
>a search for "plumber" match "plumbing" at maybe a lower score than it
>would match "plumbers". The thing is that plumber(s) is filtered to
>"plumber" and plumbing is filtered to plumb, so it doesn't match. Is
>there any way to tweak the filter to be able to do these matches? I
>would like to match all noun and verbs together (and ideally with a
>lower score than different verb conjugations would match). Another
>example would be driving and driver.
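
To illustrate why a fuzzy match could rank "plumbing" below "plumbers" for the query "plumber", here is a minimal Ruby sketch of an edit-distance similarity. It does not call Ferret. The formula (1 minus the edit distance divided by the length of the shorter word) is the one classic Lucene fuzzy matching uses with no required prefix; whether Ferret's FuzzyQuery scores terms exactly this way is an assumption. The method names `edit_distance` and `fuzzy_similarity` are made up for this example.

```ruby
# Levenshtein edit distance: the minimum number of single-character
# insertions, deletions and substitutions needed to turn a into b.
def edit_distance(a, b)
  prev = (0..b.length).to_a # distances from the empty prefix of a
  a.each_char.with_index(1) do |ca, i|
    curr = [i]
    b.each_char.with_index(1) do |cb, j|
      cost = ca == cb ? 0 : 1
      curr << [prev[j] + 1,          # delete ca
               curr[j - 1] + 1,      # insert cb
               prev[j - 1] + cost].min # substitute (or keep)
    end
    prev = curr
  end
  prev.last
end

# Assumed Lucene-style similarity in the range (-inf, 1.0]; 1.0 means the
# words are identical. Returns 0.0 if either word is empty.
def fuzzy_similarity(a, b)
  shorter = [a.length, b.length].min
  return 0.0 if shorter.zero?
  1.0 - edit_distance(a, b).to_f / shorter
end
```

Under that formula, "plumber" against "plumbers" comes out at roughly 0.86 and against "plumbing" at roughly 0.57. Fuzzy matching runs against the terms stored in the index, so with the StemFilter in place the comparison would really be between the stems "plumber" and "plumb", which comes out at 0.6.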
William Morgan
2007-Jan-22 17:10 UTC
[Ferret-talk] A few questions: Tweaking StemFilter, indexes, ...
Excerpts from Carl Lerche's message of Sun Jan 21 09:09:59 -0800 2007:
> Worst case scenario, I could probably do some preprocessing to the
> search queries to expand "plumber" or "driving" to a query that
> includes both stems (for example expand the query for plumber to
> "plumber plumb")

You can either do query expansion or you can modify the stemmer. Query expansion is probably a little easier to experiment with because you don't have to worry about reindexing, but it does come with a search-time cost which may or may not be negligible. (And it gets a little tricky with phrasal queries.)

> I can add ranged parameters to the query and that will work, but is
> that optimal? Hopefully I am making sense.

I don't know for sure whether Ferret is sophisticated enough to optimize retrieval based on multiple ranges, but it may very well be. In any case, I think you're doing the right thing.

--
William <wmorgan-ferret at masanjin.net>
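
The query-expansion approach discussed above can be tried out as a plain string rewrite before the query ever reaches the search library. The following is a minimal Ruby sketch under stated assumptions: the `RELATED_STEMS` table, the `RELATED_BOOST` value and the `expand_query` method are hypothetical, and the `OR` and `^boost` notation is the Lucene-style query syntax, which should be checked against Ferret's QueryParser documentation before relying on it. Phrase queries (quoted strings) are not handled, which is the tricky case mentioned above.

```ruby
# Hypothetical hand-maintained table: query word => related words whose
# stems should also match, at a lower weight.
RELATED_STEMS = {
  "plumber" => ["plumb"],
  "driving" => ["driver"],
  "driver"  => ["drive"]
}.freeze

# Hypothetical weight for the related words; the original word keeps the
# default weight, so exact-stem matches should score higher.
RELATED_BOOST = 0.5

# Rewrites each known word into "(word OR related^boost ...)" and leaves
# every other word untouched. Splits on whitespace only.
def expand_query(query, table = RELATED_STEMS, boost = RELATED_BOOST)
  query.split.map do |word|
    related = table.fetch(word.downcase, [])
    next word if related.empty?

    alternatives = related.map { |stem| "#{stem}^#{boost}" }
    "(#{([word] + alternatives).join(' OR ')})"
  end.join(" ")
end

puts expand_query("cheap plumber")
```

Because the rewrite happens at search time, the table can be changed freely without reindexing, at the cost of a larger query for every expanded word.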