Mitchell Curtis Hatter
2007-Jul-07 03:18 UTC
[Ferret-talk] Extending/Modifying QueryParser
Hi,

I've implemented synonym searching in my Rails application, and I have an idea I'd like to implement but can't figure out how. I'd like to give the end user the choice of whether or not to search for the synonyms of a word, preferably by extending the query language to parse a construct like '%word1' and turn that word into an OR list (i.e., word1|word2|word3|...).

Currently, the query parser calls SynonymTokenFilter to get synonyms for every token. Is there a way I can go about achieving this functionality?

Here's an overview of what I've done so far:

My model classes in my Rails app use acts_as_ferret with a call that looks like:

    acts_as_ferret(
      :fields => [:body],
      :store_class_name => true,
      :ferret => {
        :or_default => false,
        :analyzer => SynonymAnalyzer.new(WordnetSynonymEngine.new, [])
      }
    )

I created a SynonymAnalyzer and a SynonymTokenFilter:

    class SynonymAnalyzer < Ferret::Analysis::Analyzer
      include Ferret::Analysis

      def initialize(synonym_engine, stop_words = FULL_ENGLISH_STOP_WORDS, lower = true)
        @synonym_engine = synonym_engine
        @lower = lower
        @stop_words = stop_words
      end

      # Tokenize, optionally lowercase, drop stop words, then inject synonyms.
      def token_stream(field, str)
        ts = StandardTokenizer.new(str)
        ts = LowerCaseFilter.new(ts) if @lower
        ts = StopFilter.new(ts, @stop_words)
        ts = SynonymTokenFilter.new(ts, @synonym_engine)
      end
    end

    class SynonymTokenFilter < Ferret::Analysis::TokenStream
      include Ferret::Analysis

      def initialize(token_stream, synonym_engine)
        @token_stream = token_stream
        @synonym_stack = []
        @synonym_engine = synonym_engine
      end

      def text=(text)
        @token_stream.text = text
      end

      # Emit any queued synonyms first, then the next real token.
      def next
        return @synonym_stack.pop if @synonym_stack.size > 0
        if token = @token_stream.next
          add_synonyms_to_stack(token) unless token.nil?
        end
        return token
      end

      private

      # Queue each synonym at the same offsets as the original token,
      # with a position increment of 0.
      def add_synonyms_to_stack(token)
        synonyms = @synonym_engine.get_synonyms(token.text)
        return if synonyms.nil?
        synonyms.each do |s|
          @synonym_stack.push(Token.new(s, token.start, token.end, 0))
        end
      end
    end

Finally, a WordnetSynonymEngine that queries the WordNet index I created:

    class WordnetSynonymEngine
      include Ferret::Search

      def initialize(index_name = "wordnet")
        @searcher = Searcher.new("#{RAILS_ROOT}/index/#{ENV['RAILS_ENV']}/#{index_name}")
      end

      # Return the :syn field of the first matching document, or nil.
      def get_synonyms(word)
        @searcher.search_each(TermQuery.new(:word, word)) do |doc_id, score|
          return @searcher[doc_id][:syn]
        end
        return nil
      end
    end

It works great, except that I'd really like the ability to run tokens through the SynonymTokenFilter only when they are prefixed by an unescaped % sign.

Also, if anyone is interested, I can post the code for turning the WordNet Prolog database into a Ferret index (it is primarily a port of the Java Lucene program that does the same thing to Ruby and Ferret).

Thanks,
Curtis
On Fri, Jul 06, 2007 at 11:18:09PM -0400, Mitchell Curtis Hatter wrote:
> Hi,
>
> I've implemented synonym searching in my rails application but have
> an idea I'd like to implement but can't figure out how to do. The
> idea is that I'd like to give the end user the choice on whether to
> search for the synonym of a word or not. Preferably by extending the
> query language to parse a construct similar to '%word1' and then have
> the word turned into a or list (i.e., word1|word2|word3|...).
>
> Currently, the query parser constantly calls SynonymTokenFilter to
> get synonyms for each token. Is there a way I can go about achieving
> this functionality?

You have to extend Ferret's Query Parser to achieve this. If you don't want to mess around with the grammar that the parser code is generated from, you could also preprocess user queries and modify them accordingly before giving them to the QueryParser. That can get complicated, too ;-)

At the moment you're doing the synonym work twice: once at indexing time and once when queries are parsed. Because synonyms are already inserted into the index at indexing time, adding synonyms to queries is not really needed any more.

So you don't really want to specify your SynonymAnalyzer for aaf as the analyzer to use for indexing and searching (aaf doesn't support different analyzers for indexing and searching because in general it's a good idea to use the same analyzer in both cases).

If you used plain Ferret and wanted synonyms everywhere or in a specific field, but for ALL queries, you could use your analyzer at indexing time, but not for query parsing. In your case, using your WordnetEngine in a customized QueryParser or a custom query preprocessor would be the better way.

> Here's an overview of what I've done so far:
[..]

That's really cool stuff, would you mind posting this to Ferret's Wiki so other people can more easily find it? If you included the WordnetSynonymEngine that would be even better :-)

Cheers,
Jens

--
Jens Krämer
webit! Gesellschaft für neue Medien mbH
Schnorrstraße 76 | 01069 Dresden
Telefon +49 351 46766-0 | Telefax +49 351 46766-66
kraemer at webit.de | www.webit.de

Amtsgericht Dresden | HRB 15422
GF Sven Haubold, Hagen Malessa
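
The query preprocessor suggested above could look roughly like the following. This is only a sketch under stated assumptions, not code from the thread: SynonymQueryPreprocessor and HashSynonymEngine are hypothetical names, the engine stands in for anything that responds to get_synonyms(word) the way WordnetSynonymEngine does, and the `|` separator simply follows the notation used in the original question. Whether the QueryParser accepts `|` in that position is not verified here, so the separator is a parameter. Escaped markers (\%word) are left untouched.

```ruby
# Hypothetical stand-in for WordnetSynonymEngine: any object that responds to
# get_synonyms(word) and returns an Array of strings (or nil) will work.
class HashSynonymEngine
  def initialize(table)
    @table = table
  end

  def get_synonyms(word)
    @table[word.downcase]
  end
end

# Hypothetical preprocessor: rewrites every unescaped %word in a raw query
# string into a group of alternatives before the string is handed to the parser.
class SynonymQueryPreprocessor
  # A % that is not preceded by a backslash, followed by word characters.
  MARKER = /(?<!\\)%(\w+)/

  # separator is an assumption; pass ' OR ' if the parser needs it spelled out.
  def initialize(synonym_engine, separator = '|')
    @synonym_engine = synonym_engine
    @separator = separator
  end

  def expand(query)
    query.gsub(MARKER) do
      word = Regexp.last_match(1)
      synonyms = @synonym_engine.get_synonyms(word) || []
      terms = ([word] + synonyms).uniq.map { |term| quote(term) }
      # A word without synonyms is emitted bare, without parentheses.
      terms.size > 1 ? "(#{terms.join(@separator)})" : terms.first
    end
  end

  private

  # Multi-word synonyms are wrapped in double quotes so they stay one phrase.
  def quote(term)
    term =~ /\s/ ? %("#{term}") : term
  end
end
```

With an engine that maps 'ferret' to ['black-footed ferret', 'polecat'], expand('rabbit %ferret') yields rabbit (ferret|"black-footed ferret"|polecat), while a query without a % marker passes through unchanged. The analyzer used for query parsing would then have to be one that does not add synonyms itself.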
Mitchell Curtis Hatter
2007-Jul-10 18:12 UTC
[Ferret-talk] Extending/Modifying QueryParser
> You have to extend Ferret's Query Parser to achieve this. If you don't
> want to mess around with the grammar stuff the parser code is generated
> from, you could also preprocess user queries to modify them accordingly
> before giving them to the QueryParser. Can get complicated, too ;-)

I do not enjoy writing parsers, and am not especially good at it. I think I'll first check out the grammar for the parser and see if I can modify that, perhaps by creating a SynonymQuery class. I did consider preprocessing user queries and then just grouping the resulting OR'd query in parens:

'rabbit %{ferret}' would parse to 'rabbit (ferret|"black-footed ferret"|etc|etc)'

I'm sure there are situations where that would not work well, but it's an option.

> So you don't really want to specify your SynonymAnalyzer for aaf as the
> analyzer to use for indexing and searching (aaf doesn't support
> different analyzers for indexing/searching bec. in general it's a good
> idea to use the same analyzer in both cases).

Thanks, I was looking at aaf wondering how I could specify a different analyzer to use for searches. I didn't find anything that would really let me get hold of the QueryParser to change the analyzer used. Glad I wasn't just missing it.

> If you used plain Ferret and wanted Synonyms everywhere or in a specific
> field, but for ALL queries, you could use your Analyzer at indexing time,
> but not for Query parsing. In your case, using your WordnetEngine in a
> customized QueryParser or a custom query preprocessor would be the
> better way.

Since this isn't for anything but fun right now (at work I'm stuck using Oracle's full-text engine, which has its own set of problems), I'll first try modifying the QueryParser grammar to account for a new query type. My C is not very good, so hopefully I won't have to do much, but I like that solution better than having to write a preprocessor for queries.

> That's really cool stuff, would you mind posting this to Ferret's Wiki
> so other people can more easily find it? If you included the
> WordnetSynonymEngine that would be even better :-)
>
> Cheers,
> Jens

Thanks, I've posted it to the Ferret wiki. It's quite long, but I hope that's not a problem. I included the WordnetSynonymEngine and created a YAMLSynonymEngine just to show how the engine can be pluggable.

Thanks for the tips, I'll see what I can accomplish,
Curtis
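
To illustrate what "pluggable" means here, the following is a minimal sketch of a YAML-backed engine exposing the same get_synonyms(word) method that SynonymTokenFilter calls. It is not the YAMLSynonymEngine posted on the wiki: the class name SimpleYAMLSynonymEngine and the assumed YAML layout (each word mapping to one synonym or a list of synonyms) are illustrative only.

```ruby
require 'yaml'

# Hypothetical YAML-backed synonym engine. The layout "word: [syn, syn, ...]"
# is an assumption made for this sketch, not the format used on the wiki.
class SimpleYAMLSynonymEngine
  # yaml_source is a String of YAML text, e.g. the result of File.read(path).
  def initialize(yaml_source)
    data = YAML.safe_load(yaml_source) || {}
    @synonyms = {}
    data.each do |word, syns|
      # Accept either a single synonym or a list, and normalise to strings.
      @synonyms[word.to_s.downcase] = Array(syns).map(&:to_s)
    end
  end

  # Same contract as WordnetSynonymEngine#get_synonyms:
  # an Array of synonyms, or nil when the word is unknown.
  def get_synonyms(word)
    @synonyms[word.to_s.downcase]
  end
end
```

Because the filter only depends on get_synonyms, an instance of this class could be passed to SynonymAnalyzer.new in place of WordnetSynonymEngine.new without any other change.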