Hi John,
On 4/13/06, John Butler <john at likealightbulb.com> wrote:
> Is it possible with Ferret to find the location of the matches in a
> document? For example imagine I have 100 documents and I search with
> the phrase "bob~0.5" and that returns 3 matching documents. How can I
> then find all locations in a specific document where it matched
> "bob~0.5". What I need is something like an array that contains the
> start index and length for each match within a given document. Does
> this exist? Should I break up my matching document into subdocuments
> then search on that?
A search result highlighter is coming in a future version of Ferret.
This will enable you to find the position of the match in a document.
I can't say when.
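In the meantime, if exact (non-fuzzy) matches are good enough, you could
scan the text of each returned document yourself. Here is a rough sketch
in plain Ruby. It does not use any Ferret API, and match_positions is
just a name I made up for illustration:

```ruby
# Hypothetical helper, plain Ruby only: returns [start_index, length]
# for every case-insensitive, whole-word occurrence of term in text.
# This is exact matching, so it will not find fuzzy ("bob~0.5") hits.
def match_positions(text, term)
  positions = []
  pattern = /\b#{Regexp.escape(term)}\b/i
  text.scan(pattern) do
    m = Regexp.last_match
    positions << [m.begin(0), m[0].length]
  end
  positions
end

p match_positions("Bob met bob and bobby.", "bob")
```

The offsets are character positions in the string you pass in, so they
only line up with your document if you scan the same text you indexed.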
> Also for my application I will be searching for fairly large pieces
> of text ( many sentences long ) and doing fuzzy matching. I suppose
> what I am doing is very similar to trying to find matching phrases
> within an essay to catch people plagiarizing (that's not what I'm
> doing at all, but it's close enough in terms of methods).
>
> Are both of these possible with Ferret? Is there another technology I
> should look at for doing this? I will have a relatively small index
> size ( somewhere between 100 and 500 ) and so I'm not really
> concerned with speed issues.
>
At University we had an assignment to write a program that would find
similar documents with the purpose of catching people plagiarizing. I
used a running hash over each run of five consecutive words, something
similar to this:
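    require 'ferret'

    NUM_WORDS = 5

    # Hash every run of NUM_WORDS consecutive tokens in the file and
    # return the hashes in sorted order.
    def hash_doc(filename)
      stk = Ferret::Analysis::StandardTokenizer.new("")
      words = []
      hashes = []
      File.open(filename) do |f|
        f.each do |line|
          stk.text = line
          while tk = stk.next()
            words << tk.text
            if words.size == NUM_WORDS
              hashes << words.hash
              words.shift
            end
          end
        end
      end
      return hashes.sort!
    end

    # Walk both sorted lists from the largest hash down and count the
    # hashes they share. Note that this empties both arrays.
    def hash_cmp(hash1, hash2)
      same = 0
      size_avg = (hash1.size + hash2.size)/2
      h1 = hash1.pop
      h2 = hash2.pop
      # pop returns nil once a list is used up, which ends the loop.
      # Testing h1 and h2 (rather than empty?) makes sure the last
      # popped pair still gets compared.
      while (h1 and h2)
        if (h2 == h1)
          same += 1
          h1 = hash1.pop
          h2 = hash2.pop
        else
          if (h1 > h2)
            h1 = hash1.pop
          else
            h2 = hash2.pop
          end
        end
      end
      return same.to_f/size_avg
    end

    puts hash_cmp(hash_doc(ARGV[0]), hash_doc(ARGV[1]))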
I'm not sure if this would work better for you than using a really
long phrase query.
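If you'd rather try the idea without the tokenizer, it can be sketched
in plain Ruby. This is a variant under my own assumptions, not what I
handed in: it splits words with a simple regex instead of
StandardTokenizer, and it scores overlap as Jaccard similarity over sets
of word runs rather than with the merge loop above. The names shingles
and similarity are made up for illustration:

```ruby
require 'set'

# Hypothetical helpers, plain Ruby only.
# Build the set of all runs of n consecutive lowercased words in text.
def shingles(text, n = 5)
  words = text.downcase.scan(/\w+/)
  words.each_cons(n).map { |run| run.join(' ') }.to_set
end

# Jaccard similarity: shared runs divided by all distinct runs.
def similarity(text_a, text_b, n = 5)
  a = shingles(text_a, n)
  b = shingles(text_b, n)
  union = (a | b).size
  return 0.0 if union.zero?
  (a & b).size.to_f / union
end

puts similarity("the quick brown fox jumps over the lazy dog",
                "a quick brown fox jumps over the sleepy dog")
```

A smaller n tolerates more rewording between the two texts but also
produces more accidental matches, so it is worth trying a few values on
your own data.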
Cheers,
Dave
> Thanks so much for any help!
>
> -John