thr3ads.net - Ferret talk - [Ferret-talk] Location of match? [Apr 2006]

John Butler

2006-Apr-12 19:37 UTC

[Ferret-talk] Location of match?

Is it possible with Ferret to find the location of the matches in a  
document? For example imagine I have 100 documents and I search with  
the phrase "bob~0.5" and that returns 3 matching documents. How can I
then find all locations in a specific document where it matched  
"bob~0.5". What I need is something like an array that contains the  
start index and length for each match within a given document. Does
this exist? Should I break up my matching document into subdocuments
and then search on those?

Also for my application I will be searching for fairly large pieces  
of text ( many sentences long ) and doing fuzzy matching. I suppose  
what I am doing is very similar to trying to find matching phrases  
within an essay to catch people plagiarizing (that's not what I'm
doing at all, but it's close enough in terms of methods).

Are both of these possible with Ferret? Is there another technology I  
should look at for doing this? I will have a relatively small index  
size ( somewhere between 100 and 500 ) and so I'm not really
concerned with speed issues.

Thanks so much for any help!

-John

David Balmain

2006-Apr-18 02:53 UTC

[Ferret-talk] Location of match?

Hi John,

On 4/13/06, John Butler <john at likealightbulb.com> wrote:
> Is it possible with Ferret to find the location of the matches in a
> document? For example imagine I have 100 documents and I search with
> the phrase "bob~0.5" and that returns 3 matching documents. How can I
> then find all locations in a specific document where it matched
> "bob~0.5". What I need is something like an array that contains the
> start index and length for each match within a given document. Does
> this exist? Should I break up my matching document into subdocuments
> then search on that?

A search result highlighter is coming in a future version of Ferret.
This will enable you to find the position of the match in a document.
I can't say when.
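
Until then, with only a few hundred documents, one workaround is to rescan the text of each matching document outside Ferret. The following is a minimal sketch in plain Ruby and does not use Ferret's API. The names `levenshtein` and `fuzzy_locations`, the `\w+` word splitting, and the similarity formula (1 minus the edit distance divided by the shorter word length) are assumptions made for illustration and may not agree with how Ferret scores "bob~0.5".

```ruby
# Plain edit distance between two strings (classic dynamic programming).
def levenshtein(a, b)
  prev = (0..b.size).to_a
  a.each_char.with_index(1) do |ca, i|
    curr = [i]
    b.each_char.with_index(1) do |cb, j|
      cost = ca == cb ? 0 : 1
      curr << [prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost].min
    end
    prev = curr
  end
  prev.last
end

# Returns [start_index, length] for every word in text whose similarity
# to term is at least min_sim. Offsets are character offsets into text.
# The similarity formula is an assumption, not Ferret's own scoring.
def fuzzy_locations(text, term, min_sim = 0.5)
  locations = []
  text.scan(/\w+/) do |word|
    start = Regexp.last_match.begin(0)
    dist = levenshtein(word.downcase, term.downcase)
    sim = 1.0 - dist.to_f / [word.size, term.size].min
    locations << [start, word.size] if sim >= min_sim
  end
  locations
end

p fuzzy_locations("Bob met bobby and rob.", "bob")
```

Each pair can be passed straight to `String#[]`, for example `text[start, length]`, to pull out the matched word.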

> Also for my application I will be searching for fairly large pieces
> of text ( many sentences long ) and doing fuzzy matching. I suppose
> what I am doing is very similar to trying to find matching phrases
> within an essay to catch people plagiarizing (that's not what I'm
> doing at all, but it's close enough in terms of methods).
>
> Are both of these possible with Ferret? Is there another technology I
> should look at for doing this? I will have a relatively small index
> size ( somewhere between 100 and 500 ) and so I'm not really
> concerned with speed issues.
>
At University we had an assignment to write a program that would find
similar documents with the purpose of catching people plagiarizing. I
used a running hash, something similar to this;

    require 'ferret'

    NUM_WORDS = 5

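    # Returns a sorted array of hashes, one per run of NUM_WORDS
    # consecutive words in the file.
    def hash_doc(filename)
      stk = Ferret::Analysis::StandardTokenizer.new("")
      words = []
      hashes = []
      File.open(filename) do |f|
        f.each do |line|
          stk.text = line
          while tk = stk.next()
            words << tk.text
            if words.size == NUM_WORDS
              # Hash the current window, then slide it forward one word.
              hashes << words.hash
              words.shift
            end
          end
        end
      end
      return hashes.sort!
    end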

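    # Returns the number of shared hashes divided by the average number of
    # hashes per document. Both arrays must be sorted ascending; they are
    # emptied (popped) as they are compared.
    def hash_cmp(hash1, hash2)
      same = 0
      size_avg = (hash1.size + hash2.size) / 2.0
      return 0.0 if size_avg.zero?
      h1 = hash1.pop
      h2 = hash2.pop
      # pop returns nil once a list is exhausted, so looping on h1 and h2
      # (rather than on empty?) also compares the last element of each list.
      while h1 and h2
        if h2 == h1
          same += 1
          h1 = hash1.pop
          h2 = hash2.pop
        elsif h1 > h2
          h1 = hash1.pop
        else
          h2 = hash2.pop
        end
      end
      return same / size_avg
    end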

    puts hash_cmp(hash_doc(ARGV[0]), hash_doc(ARGV[1]))

I'm not sure if this would work better for you than using a really
long phrase query.
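
The running hash above only says how similar two documents are. If the start index and length of each shared passage are needed as well, the same sliding window can carry character offsets along. This is a rough sketch under a few assumptions: it splits words with a plain `\w+` regex instead of Ferret's StandardTokenizer, it uses the word arrays themselves as Hash keys instead of their integer hashes, and `window_spans` and `shared_spans` are made-up names for illustration.

```ruby
# Maps each run of n consecutive words to the places it starts in text.
# Returns { [w1, ..., wn] => [[start_index, length], ...] }.
def window_spans(text, n = 5)
  tokens = []
  text.scan(/\w+/) do |w|
    tokens << [w.downcase, Regexp.last_match.begin(0), w.size]
  end
  spans = Hash.new { |h, k| h[k] = [] }
  tokens.each_cons(n) do |win|
    first_start = win.first[1]
    last_end = win.last[1] + win.last[2]
    spans[win.map(&:first)] << [first_start, last_end - first_start]
  end
  spans
end

# Returns the sorted [start_index, length] spans in doc whose n-word
# window also occurs somewhere in source.
def shared_spans(doc, source, n = 5)
  source_windows = window_spans(source, n)
  shared = window_spans(doc, n).select { |words, _| source_windows.key?(words) }
  shared.values.flatten(1).sort
end

doc = "He said the quick brown fox jumps over it"
source = "the quick brown fox jumps high"
shared_spans(doc, source).each do |start, length|
  puts "#{start}, #{length}: #{doc[start, length]}"
end
```

Overlapping windows produce overlapping spans, so a long copied passage shows up as a chain of spans that could be merged afterwards.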

Cheers,
Dave

Marcus Crafter

2006-Jun-13 02:29 UTC

[Ferret-talk] Location of match?

Hi David,

David Balmain wrote:
> A search result highlighter is coming in a future version of Ferret.
> This will enable you to find the position of the match in a document.
> I can't say when.

This would be awesome, and it is what I'm looking for too. Has there
been any progress on this since your last post that we might be able
to take a look at?

Can we help in any way?

Cheers,

Marcus

-- 
Posted via http://www.ruby-forum.com/.
