Hi Aslak,
Ignoring things like "def" is easy using the stop filter. Something like this:

include Ferret

analyzer = Analysis::StandardAnalyzer.new(["def", "class", "end", "module"])
index = Index::Index.new(:analyzer => analyzer)
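
If you want to check what tokens an analyzer actually produces, a quick
sketch like this should do it (untested, and I'm assuming the token text
accessor is term_text, which may differ between Ferret versions):

stream = analyzer.token_stream("content", "name.split.each")
while token = stream.next
  puts token.term_text  # print each token the analyzer produced
end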
But the standard analyzer is probably not the best to use. It will parse
name.split.each as one token. I'm not sure that this is what you want. I'd
use a simple letter tokenizer, but I'd add a few other symbols. Probably
something like this:
class CodeTokenizer < Analysis::RegExpTokenizer
  protected

    # Collects only characters which satisfy the regular expression
    # /[[:alpha:]_@$?!]+/.
    def token_re
      /[[:alpha:]_@$?!]+/
    end
end
class CodeAnalyzer < Analysis::StopAnalyzer
  # An array containing some common Ruby words that are not usually useful
  # for searching.
  CODE_STOP_WORDS = ["def", "class", "end", "module"] # etc.

  # Builds an analyzer which removes words in the provided array.
  def initialize(stop_words = CODE_STOP_WORDS)
    @stop_words = stop_words
  end

  # Filters CodeTokenizer with StopFilter.
  def token_stream(field, string)
    Analysis::StopFilter.new(CodeTokenizer.new(string), @stop_words)
  end
end
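
Hooking it up would then look roughly like this (an untested sketch; the
:content field name and the sample document are just for illustration):

index = Index::Index.new(:analyzer => CodeAnalyzer.new)
index << {:content => "def foo(bar)\n  bar.zap()\nend"}

# "def" is filtered out as a stop word, so it shouldn't match anything,
# while "zap" should match even though it isn't space-delimited.
index.search_each("content:zap") do |doc, score|
  puts "hit in doc #{doc} (score #{score})"
end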
This should be a good start for you. Incidentally, my next project is
going to be to build a kind of documentation wiki. I'll be integrating
rdoc with Ferret to make the documentation searchable and commentable.
I'll actually be using Ferret quite heavily, so it should be a good
example app for people to work from. I'm working on cFerret integration
(a huge performance improvement) at the moment though, so it could be a
while.
Cheers,
Dave
On 11/17/05, aslak hellesoy <aslak.hellesoy at gmail.com> wrote:
> Hi again,
>
> I'm using ferret to index source code - DamageControl will allow users
> to search for text in source code.
>
> Currently I'm using the default index with no custom analyzer (I'm
> using the StandardAnalyzer). Do you have any recommendations about how
> to write an analyzer that will index source code in a more 'optimal'
> way? I.e. disregard common sourcecode tokens and take into account
> dots and such when tokenizing.
>
> For example, if the source code looks like this:
>
> def foo(bar)
>   bar.zap()
> end
>
> searching for 'def' should not match (too common). searching for 'zap'
> should match (even if it's not surrounded by spaces, but 'ignorable
> characters').
>
> Also, it might make sense to use different analyzers for different
> source code types (java, ruby, haskell etc).
>
> Some hints for this would be great.
>
> Cheers,
> Aslak