Hi Aslak,
Ignoring things like "def" is easy using the stop filter. Something like this:

include Ferret

analyzer = Analysis::StandardAnalyzer.new(["def", "class", "end", "module"])
index = Index::Index.new(:analyzer => analyzer)
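
If you want to check what tokens an analyzer actually produces, a quick
sketch like this should do it (untested, and I'm assuming the token text
accessor is term_text, which may differ between Ferret versions):

stream = analyzer.token_stream("content", "name.split.each")
while token = stream.next
  puts token.term_text  # print each token the analyzer produced
end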
But the standard analyzer is probably not the best to use. It will parse
name.split.each as one token. I'm not sure that this is what you want. I'd
use a simple letter tokenizer, but I'd add a few other symbols. Probably
something like this:
class CodeTokenizer < Analysis::RegExpTokenizer
  protected

    # Collects only characters which satisfy the regular expression
    # /[[:alpha:]_@$?!]+/.
    def token_re
      /[[:alpha:]_@$?!]+/
    end
end
class CodeAnalyzer < Analysis::StopAnalyzer
  # An array containing some common Ruby words that are not usually useful
  # for searching.
  CODE_STOP_WORDS = ["def", "class", "end", "module"] # etc.

  # Builds an analyzer which removes words in the provided array.
  def initialize(stop_words = CODE_STOP_WORDS)
    @stop_words = stop_words
  end

  # Filters CodeTokenizer with StopFilter.
  def token_stream(field, string)
    Analysis::StopFilter.new(CodeTokenizer.new(string), @stop_words)
  end
end
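
Hooking it up would then look roughly like this (an untested sketch; the
:content field name and the sample document are just for illustration):

index = Index::Index.new(:analyzer => CodeAnalyzer.new)
index << {:content => "def foo(bar)\n  bar.zap()\nend"}

# "def" is filtered out as a stop word, so it shouldn't match anything,
# while "zap" should match even though it isn't space-delimited.
index.search_each("content:zap") do |doc, score|
  puts "hit in doc #{doc} (score #{score})"
end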
This should be a good start for you. Incidentally, my next project is
going to be to build a kind of documentation wiki. I'll be integrating
rdoc with Ferret to make the documentation searchable and commentable.
I'll actually be using Ferret quite heavily, so it should be a good
example app for people to work from. I'm working on cFerret integration
(a huge performance improvement) at the moment though, so it could be a
while.
Cheers,
Dave
On 11/17/05, aslak hellesoy <aslak.hellesoy at gmail.com> wrote:
> Hi again,
>
> I'm using ferret to index source code - DamageControl will allow users
> to search for text in source code.
>
> Currently I'm using the default index with no custom analyzer (I'm
> using the StandardAnalyzer). Do you have any recommendations about how
> to write an analyzer that will index source code in a more 'optimal'
> way? I.e. disregard common sourcecode tokens and take into account
> dots and such when tokenizing.
>
> For example, if the source code looks like this:
>
> def foo(bar)
>   bar.zap()
> end
>
> searching for 'def' should not match (too common). searching for 'zap'
> should match (even if it's not surrounded by spaces, but 'ignorable
> characters').
>
> Also, it might make sense to use different analyzers for different
> source code types (java, ruby, haskell etc).
>
> Some hints for this would be great.
>
> Cheers,
> Aslak