Hi John,
On 4/13/06, John Butler <john at likealightbulb.com> wrote:
> Is it possible with Ferret to find the location of the matches in a
> document? For example imagine I have 100 documents and I search with
> the phrase "bob~0.5" and that returns 3 matching documents. How can I
> then find all locations in a specific document where it matched
> "bob~0.5". What I need is something like an array that contains the
> start index and length for each match within a given document. Does
> this exist? Should I break up my matching document into subdocuments
> then search on that?
A search result highlighter is coming in a future version of Ferret.
This will enable you to find the position of the match in a document.
I can't say when.
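In the meantime, for exact (non-fuzzy) terms, a rough plain-Ruby sketch
along these lines could collect the start index and length of each match
in a document's text. It does not use Ferret at all, it will not handle
a fuzzy query like "bob~0.5", and the helper name match_positions is
made up purely for illustration:

```ruby
# Hypothetical helper, not part of Ferret: returns [start, length] pairs
# for every case-insensitive, whole-word occurrence of term in text.
def match_positions(text, term)
  positions = []
  pattern = /\b#{Regexp.escape(term)}\b/i
  text.scan(pattern) do
    m = Regexp.last_match          # MatchData for the current occurrence
    positions << [m.begin(0), m[0].length]
  end
  positions
end

p match_positions("Bob met bob; bobby did not.", "bob")
```

The offsets are character positions in the string you pass in, so they
only line up with the original document if you search the same text you
indexed.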
> Also for my application I will be searching for fairly large pieces
> of text ( many sentences long ) and doing fuzzy matching. I suppose
> what I am doing is very similar to trying to find matching phrases
> within an essay to catch people plagiarizing (that's not what I'm
> doing at all, but it's close enough in terms of methods).
>
> Are both of these possible with Ferret? Is there another technology I
> should look at for doing this? I will have a relatively small index
> size ( somewhere between 100 and 500 ) and so I'm not really
> concerned with speed issues.
>
At university we had an assignment to write a program that would find
similar documents, with the purpose of catching people plagiarizing. I
used a running hash over every window of NUM_WORDS consecutive words,
something similar to this:
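
require 'ferret'

NUM_WORDS = 5  # number of consecutive words hashed together

# Return the sorted hashes of every NUM_WORDS-word window in the file.
def hash_doc(filename)
  stk = Ferret::Analysis::StandardTokenizer.new("")
  words = []
  hashes = []
  File.open(filename) do |f|
    f.each do |line|
      stk.text = line
      while tk = stk.next
        words << tk.text
        if words.size == NUM_WORDS
          hashes << words.hash
          words.shift  # slide the window forward by one word
        end
      end
    end
  end
  hashes.sort!
end

# Walk two sorted hash lists from the top and count the hashes they
# share, relative to the average list size. Note that this empties
# both arrays.
def hash_cmp(hash1, hash2)
  same = 0
  size_avg = (hash1.size + hash2.size) / 2.0
  return 0.0 if size_avg.zero?  # neither document had NUM_WORDS words
  h1 = hash1.pop
  h2 = hash2.pop
  # pop returns nil once a list is exhausted, which ends the loop
  # without skipping the last pair of hashes
  while h1 && h2
    if h1 == h2
      same += 1
      h1 = hash1.pop
      h2 = hash2.pop
    elsif h1 > h2
      h1 = hash1.pop
    else
      h2 = hash2.pop
    end
  end
  same / size_avg
end

puts hash_cmp(hash_doc(ARGV[0]), hash_doc(ARGV[1]))
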
I'm not sure if this would work better for you than using a really
long phrase query.
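If you want to try the idea without Ferret installed, here is a rough
self-contained variant. It is only a sketch under a few assumptions: a
naive /\w+/ regex stands in for StandardTokenizer, the word windows are
stored as strings in a Set instead of as sorted integer hashes, and the
score is the Jaccard ratio (shared windows over all distinct windows)
rather than the merge-style count above. The names SHINGLE_SIZE,
shingles and similarity are just illustrative:

```ruby
require 'set'

SHINGLE_SIZE = 5  # words per window, same role as NUM_WORDS

# Every run of `size` consecutive words in the text, as a Set of strings.
def shingles(text, size = SHINGLE_SIZE)
  words = text.downcase.scan(/\w+/)
  words.each_cons(size).map { |window| window.join(" ") }.to_set
end

# Jaccard similarity of the two shingle sets: 0.0 (nothing shared)
# up to 1.0 (identical sets of windows).
def similarity(text1, text2)
  a = shingles(text1)
  b = shingles(text2)
  union = a | b
  return 0.0 if union.empty?  # both texts shorter than one window
  (a & b).size.to_f / union.size
end

puts similarity("the quick brown fox jumps over the lazy dog",
                "a quick brown fox jumps over the lazy cat")
```

With only a few hundred documents, comparing every pair this way should
be cheap enough.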
Cheers,
Dave
> Thanks so much for any help!
>
> -John