Mitchell Curtis Hatter
2007-Jul-07 03:18 UTC
[Ferret-talk] Extending/Modifying QueryParser
Hi,
I've implemented synonym searching in my rails application and have
an idea I'd like to implement but can't figure out how to do. The
idea is that I'd like to give the end user the choice on whether to
search for the synonym of a word or not. Preferably by extending the
query language to parse a construct similar to '%word1' and then have
the word turned into an OR list (i.e., word1|word2|word3|...).
Currently, the query parser constantly calls SynonymTokenFilter to
get synonyms for each token. Is there a way I can go about achieving
this functionality?
Here's an overview of what I've done so far:
My model classes in my rails app use acts_as_ferret with a call that
looks like:
  acts_as_ferret(
    :fields => [:body],
    :store_class_name => true,
    :ferret => {
      :or_default => false,
      :analyzer => SynonymAnalyzer.new(WordnetSynonymEngine.new, [])
    }
  )
I created a SynonymAnalyzer and SynonymTokenFilter:
  class SynonymAnalyzer < Ferret::Analysis::Analyzer
    include Ferret::Analysis

    def initialize(synonym_engine, stop_words = FULL_ENGLISH_STOP_WORDS, lower = true)
      @synonym_engine = synonym_engine
      @lower = lower
      @stop_words = stop_words
    end

    # Tokenize, optionally lowercase, drop stop words, then add synonyms.
    def token_stream(field, str)
      ts = StandardTokenizer.new(str)
      ts = LowerCaseFilter.new(ts) if @lower
      ts = StopFilter.new(ts, @stop_words)
      ts = SynonymTokenFilter.new(ts, @synonym_engine)
    end
  end
  class SynonymTokenFilter < Ferret::Analysis::TokenStream
    include Ferret::Analysis

    def initialize(token_stream, synonym_engine)
      @token_stream = token_stream
      @synonym_stack = []
      @synonym_engine = synonym_engine
    end

    def text=(text)
      @token_stream.text = text
    end

    # Emit any pending synonyms before pulling the next real token.
    def next
      return @synonym_stack.pop if @synonym_stack.size > 0
      if token = @token_stream.next
        add_synonyms_to_stack(token)
      end
      return token
    end

    private

    # Queue each synonym at the same offsets as the original token, with a
    # position increment of 0 so it occupies the same position.
    def add_synonyms_to_stack(token)
      synonyms = @synonym_engine.get_synonyms(token.text)
      return if synonyms.nil?
      synonyms.each do |s|
        @synonym_stack.push(Token.new(s, token.start, token.end, 0))
      end
    end
  end
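
The emission order of the filter above can be illustrated without Ferret. This is only a sketch: SketchToken and SketchSynonymFilter are hypothetical stand-ins (not Ferret classes), and the synonym lookup is a plain Hash instead of a synonym engine. It shows that the original token comes out first, followed by its synonyms in reverse push order, each with a position increment of 0.

```ruby
# Hypothetical stand-in for Ferret's Token: text, offsets, position increment.
SketchToken = Struct.new(:text, :start, :end, :pos_inc)

# Hypothetical stand-in for SynonymTokenFilter, using the same stack logic.
class SketchSynonymFilter
  def initialize(tokens, synonyms)
    @tokens = tokens.dup      # upstream token stream, as a simple array
    @synonyms = synonyms      # Hash of word => array of synonyms
    @stack = []
  end

  def next
    # Pending synonyms are emitted before the next real token.
    return @stack.pop unless @stack.empty?
    token = @tokens.shift
    return nil if token.nil?
    (@synonyms[token.text] || []).each do |s|
      @stack.push(SketchToken.new(s, token.start, token.end, 0))
    end
    token
  end
end

filter = SketchSynonymFilter.new(
  [SketchToken.new("quick", 0, 5, 1), SketchToken.new("fox", 6, 9, 1)],
  "quick" => ["fast", "speedy"]
)

emitted = []
while (t = filter.next)
  emitted << [t.text, t.pos_inc]
end
# emitted => [["quick", 1], ["speedy", 0], ["fast", 0], ["fox", 1]]
```

Because the synonyms share the original token's offsets and add no position, they are stored at the same position as the original word.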
Finally, a WordnetSynonymEngine that queries the wordnet index I created:
  class WordnetSynonymEngine
    include Ferret::Search

    def initialize(index_name = "wordnet")
      @searcher = Searcher.new("#{RAILS_ROOT}/index/#{ENV['RAILS_ENV']}/#{index_name}")
    end

    # Return the :syn field of the first matching document, or nil.
    def get_synonyms(word)
      @searcher.search_each(TermQuery.new(:word, word)) do |doc_id, score|
        return @searcher[doc_id][:syn]
      end
      return nil
    end
  end
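
Since the filter only calls get_synonyms(word) on the engine, any object with that method can be plugged in. As a minimal sketch, a hypothetical HashSynonymEngine (not part of the original code) could back the lookup with an in-memory Hash, which may be handy for trying the filter without a WordNet index. It assumes get_synonyms is expected to return an array of strings, or nil for an unknown word.

```ruby
# Hypothetical in-memory engine; only the get_synonyms(word) contract is
# taken from the code above.
class HashSynonymEngine
  def initialize(table)
    @table = table   # Hash of word => array of synonyms
  end

  # Returns the synonym array for a known word, nil otherwise.
  def get_synonyms(word)
    @table[word]
  end
end

engine = HashSynonymEngine.new("ferret" => ["polecat", "black-footed ferret"])
engine.get_synonyms("ferret")  # => ["polecat", "black-footed ferret"]
engine.get_synonyms("rabbit")  # => nil
```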
It works great except that I'd really like the ability to only run
tokens through the SynonymTokenFilter when they are prepended by an
unescaped % sign.
Also, if anyone is interested I can post the code for turning the
wordnet prolog database into a ferret database (primarily recoding
the java lucene program that did the same thing to ruby and ferret).
Thanks,
Curtis
On Fri, Jul 06, 2007 at 11:18:09PM -0400, Mitchell Curtis Hatter wrote:
> Hi,
>
> I've implemented synonym searching in my rails application but have
> an idea I'd like to implement but can't figure out how to do. The
> idea is that I'd like to give the end user the choice on whether to
> search for the synonym of a word or not. Preferably by extending the
> query language to parse a construct similar to '%word1' and then have
> the word turned into an OR list (i.e., word1|word2|word3|...).
>
> Currently, the query parser constantly calls SynonymTokenFilter to
> get synonyms for each token. Is there a way I can go about achieving
> this functionality?

You have to extend Ferret's Query Parser to achieve this. If you don't
want to mess around with the grammar stuff the parser code is generated
from, you could also preprocess user queries to modify them accordingly
before giving them to the QueryParser. Can get complicated, too ;-)

Atm you're doing the synonym stuff twice, once at indexing time and
once when Queries are parsed. Because of the insertion of synonyms in
the index at indexing time, adding synonyms to Queries is not really
needed any more.

So you don't really want to specify your SynonymAnalyzer for aaf as the
analyzer to use for indexing and searching (aaf doesn't support
different analyzers for indexing/searching bec. in general it's a good
idea to use the same analyzer in both cases).

If you used plain Ferret and wanted Synonyms everywhere or in a specific
field, but for ALL queries, you could use your Analyzer at indexing
time, but not for Query parsing. In your case, using your WordnetEngine
in a customized QueryParser or a custom query preprocessor would be the
better way.

> Here's an overview of what I've done so far:
[..]

That's really cool stuff, would you mind posting this to Ferret's Wiki
so other people can more easily find it? If you included the
WordnetSynonymEngine that would be even better :-)

Cheers,
Jens

--
Jens Krämer
webit! Gesellschaft für neue Medien mbH
Schnorrstraße 76 | 01069 Dresden
Telefon +49 351 46766-0 | Telefax +49 351 46766-66
kraemer at webit.de | www.webit.de
Amtsgericht Dresden | HRB 15422
GF Sven Haubold, Hagen Malessa
Mitchell Curtis Hatter
2007-Jul-10 18:12 UTC
[Ferret-talk] Extending/Modifying QueryParser
> You have to extend Ferret's Query Parser to achieve this. If you don't
> want to mess around with the grammar stuff the parser code is generated
> from, you could also preprocess user queries to modify them accordingly
> before giving them to the QueryParser. Can get complicated, too ;-)

I do not enjoy writing parsers, and am not especially good at it. I
think first I'll check out the grammar for the parser and see if I can
modify that. Perhaps creating a SynonymQuery class? I did consider
preprocessing user queries and then just grouping the resulting or'd
query in parens:

  'rabbit %{ferret}'

would parse to

  'rabbit (ferret|"black-footed ferret"|etc|etc)'

I'm sure there are situations where that would not be good, but it's
an option.

> So you don't really want to specify your SynonymAnalyzer for aaf as the
> analyzer to use for indexing and searching (aaf doesn't support
> different analyzers for indexing/searching bec. in general it's a good
> idea to use the same analyzer in both cases).

Thanks, I was looking at aaf wondering how I could specify a different
analyzer to use for searches. I didn't find anything that would really
let me get a hold of the QueryParser to change the analyzer used. Glad
I wasn't just missing it.

> If you used plain Ferret and wanted Synonyms everywhere or in a specific
> field, but for ALL queries, you could use your Analyzer at indexing
> time, but not for Query parsing. In your case, using your WordnetEngine
> in a customized QueryParser or a custom query preprocessor would be the
> better way.

Since this isn't for anything but fun right now (at work I'm stuck
using Oracle's full text engine, which has its own set of problems),
first I'll try modifying the QueryParser grammar to account for a new
query type. My C is not very good so hopefully I won't have to do much,
but I like that solution better than having to write a pre-processor
for queries.

> That's really cool stuff, would you mind posting this to Ferret's Wiki
> so other people can more easily find it? If you included the
> WordnetSynonymEngine that would be even better :-)
>
> Cheers,
> Jens

Thanks, I've posted it to the Ferret wiki. It's quite long but I hope
that's not a problem. I included the WordnetSynonymEngine and created a
YAMLSynonymEngine just to show how it can be pluggable.

Thanks for the tips, I'll see what I can accomplish,
Curtis
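
The query preprocessing option discussed in this thread could be sketched roughly as follows. This is not code from the thread or from Ferret: SynonymQueryPreprocessor and StubEngine are hypothetical names, and the engine is only assumed to respond to get_synonyms(word) with an array or nil. The sketch expands %word and %{a phrase} into a parenthesized group joined with |, mirroring the 'rabbit %{ferret}' example, and leaves a backslash-escaped \% untouched. Whether the resulting string parses as intended would still need to be checked against the QueryParser.

```ruby
# Hypothetical preprocessor: rewrites %word and %{phrase} markers into an
# OR group before the string is handed to the query parser.
class SynonymQueryPreprocessor
  # Matches %{...} or %word, unless the % is preceded by a backslash.
  MARKER = /(?<!\\)%(?:\{([^}]+)\}|(\w+))/

  def initialize(synonym_engine)
    @synonym_engine = synonym_engine
  end

  def expand(query)
    query.gsub(MARKER) do
      term = $1 || $2
      synonyms = @synonym_engine.get_synonyms(term) || []
      alternatives = ([term] + synonyms).uniq.map { |t| quote(t) }
      # A term without synonyms is passed through as a plain term.
      alternatives.size > 1 ? "(#{alternatives.join('|')})" : alternatives.first
    end
  end

  private

  # Multi-word alternatives are wrapped in double quotes.
  def quote(term)
    term =~ /\s/ ? "\"#{term}\"" : term
  end
end

# Stand-in engine for the example; a real one would query the wordnet index.
class StubEngine
  def get_synonyms(word)
    { "ferret" => ["polecat", "black-footed ferret"] }[word]
  end
end

pre = SynonymQueryPreprocessor.new(StubEngine.new)
puts pre.expand('rabbit %{ferret}')
# prints: rabbit (ferret|polecat|"black-footed ferret")
```

Note that this sketch leaves the backslash of an escaped \% in place; stripping it, or handling % inside quoted phrases, would be separate decisions.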