thr3ads.net - Ferret talk - [Ferret-talk] Comparing two documents in the index [May 2006]

Jeroen Bulters

2006-May-26 17:38 UTC

[Ferret-talk] Comparing two documents in the index

I want to compare two documents in the index (i.e. retrieve the cosine
similarity/score between the two documents' term vectors). Is this possible
using the standard Ferret functionality?

Thanks in advance,

Jeroen Bulters

-- 
Posted via http://www.ruby-forum.com/.

David Balmain

2006-May-26 23:55 UTC

On 5/27/06, Jeroen Bulters <jeroenbulters at gmail.com> wrote:
> I want to compare two documents in the index (i.e. retrieve the cosine
> similarity/score between two documents term-vector's). Is this possible
> using the standard Ferret functionality?

Hi Jeroen,

No problem. Make sure you store term vectors when you add the field, that is:

    doc.add_field(:field, "yada yada yada",
                  Field::Store::NO,                # or YES
                  Field::Index::TOKENIZED,   # or UNTOKENIZED
                  Field::TermVector::YES)     # or anything else but NO

Then you can retrieve the term vector from an index reader like so:

    term_vector = index_reader.get_term_vector(doc_num, :field)
    terms = term_vector.terms # array of terms in :field in document
    freqs = term_vector.freqs # array of corresponding frequencies
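
Since the two arrays are parallel, they can be zipped into a term => frequency hash, which is handier for comparing documents. A minimal sketch, using a hypothetical Struct as a stand-in for a real Ferret term vector (only `terms` and `freqs` come from the post above; `FakeTermVector` and `term_frequencies` are made-up names):

```ruby
# Stand-in for a Ferret term vector. Only #terms and #freqs are assumed,
# as described in the post above; FakeTermVector itself is hypothetical.
FakeTermVector = Struct.new(:terms, :freqs)

# Pair each term with its frequency in the document.
def term_frequencies(term_vector)
  Hash[term_vector.terms.zip(term_vector.freqs)]
end

tv = FakeTermVector.new(%w[ferret index yada], [1, 2, 3])
p term_frequencies(tv)
```

With a real index the same helper should work on whatever `get_term_vector` returns, provided it responds to `terms` and `freqs` as shown above.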

Hope that helps. Is that enough to get you going?

Cheers,
Dave

Jeroen Bulters

2006-May-27 13:09 UTC

David Balmain wrote:
>     doc.add_field(:field, "yada yada yada",
>                   Field::Store::NO,                # or YES
>                   Field::Index::TOKENIZED,   # or UNTOKENIZED
>                   Field::TermVector::YES)     # or anything else but NO
I got this far:
------ BEGIN CODE SNIPPET ------
# Read weblog data
weblogs = YAML::load(File.open("weblogs.yml"))

# Walk over weblogs and save all data.
print "--- Analyzing weblogs:\n"
weblogs.each do |weblog, id|
  content = ""
  print " * Indexing weblog #{weblog}/#{id} "
  # Load the appropriate file for parsing.
  weblogdata = YAML::load(File.open("./data/#{id}"))

  weblogdata[:posts].each do |_post_id, post|
    # Clean up the content by removing all UBB blocks. This will cut out
    # some content. I consider this loss a plus :D
    content = content + "\n\n" +
              post[:text].gsub(/\[[^\]]+\][^\[]+\[[^\]]+\]/i, "")
    #content.gsub!(/\[[^\]]+\][^\[]+\[[^\]]+\]/i, "")
  end

  # Create a new document
  doc = Document.new
  doc.add_field(:id, weblog, Field::Store::YES, Field::Index::TOKENIZED,
                Field::TermVector::NO)
  doc.add_field(:content, content, Field::Store::NO,
                Field::Index::TOKENIZED, Field::TermVector::YES)

  # And add to the index.
  index << doc
  index.flush

  print "done.\n"
end
------ END CODE SNIPPET ------

I index about 23,000 weblogs, with the weblog id as the document id and
the content stored with term vectors. Now I want to compare two weblogs.
So what you suggest is that I retrieve the term vectors for both documents
and calculate the dot product of the two vectors myself; or is there a nice
Ferret way to do this?

Thanks in advance,

Jeroen Bulters

David Balmain

2006-May-27 14:55 UTC

On 5/27/06, Jeroen Bulters <jeroenbulters at gmail.com> wrote:
> I Index about 23000 weblogs with their weblog id as the document id and
> the content by termvector. Now I want to compare two weblogs. So what
> you suggest is that I retrieve the term-vectors for both documents and
> calculate the dotproduct of the two vectors myself; or is there a nice
> Ferret-way to do this?

Until now I haven't really used the TermVectors so this probably isn't
the best way to do it but here goes (this is very rough):

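    # Cosine similarity between the term vectors of two documents.
    # :data is the field name; use whichever field was indexed with term
    # vectors (:content in the snippet above).
    def cosine_similarity(index_reader, doc1, doc2)
      tv1 = index_reader.get_term_vector(doc1, :data)
      tv2 = index_reader.get_term_vector(doc2, :data)
      freqs1 = tv1.freqs
      freqs2 = tv2.freqs

      # Map each term to [frequency in doc1, frequency in doc2].
      matrix = {}
      tv1.terms.each_with_index {|term, i| matrix[term] = [freqs1[i], 0]}
      tv2.terms.each_with_index {|term, i| (matrix[term] ||= [0, 0])[1] = freqs2[i]}

      dot_product = matrix.values.inject(0) {|dp, (a, b)| dp + a * b}
      lengths_product = Math.sqrt(freqs1.inject(0) {|sp, f| sp + f * f} *
                                  freqs2.inject(0) {|sp, f| sp + f * f})
      # An empty term vector would otherwise give 0 / 0.0, which is NaN.
      return 0.0 if lengths_product.zero?
      dot_product / lengths_product
    end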
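
To sanity-check the arithmetic without an index, the same formula can be run on plain term => frequency hashes. This is only a sketch; `cosine` and the sample data are made up for illustration and are not part of Ferret:

```ruby
# Cosine similarity between two term => frequency hashes.
# Returns 0.0 when either vector is empty, avoiding a NaN.
def cosine(freqs_a, freqs_b)
  # Only terms present in both documents contribute to the dot product.
  dot = freqs_a.sum { |term, f| f * freqs_b.fetch(term, 0) }
  norm_a = Math.sqrt(freqs_a.values.sum { |f| f * f })
  norm_b = Math.sqrt(freqs_b.values.sum { |f| f * f })
  return 0.0 if norm_a.zero? || norm_b.zero?
  dot / (norm_a * norm_b)
end

doc_a = { "ruby" => 2, "ferret" => 1 }
doc_b = { "ruby" => 1, "lucene" => 2 }

# dot product = 2 * 1 = 2, both lengths are sqrt(5), so the result is about 0.4
puts cosine(doc_a, doc_b)
```

Note that this works on raw term frequencies only; no idf weighting or length normalisation from the index is applied, so the number is not comparable to a search score.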

I'd be interested to hear how you go with this. If performance is poor
I can add something like this to the C code.

Hope this helps,
Dave

Jeroen Bulters

2006-May-27 15:40 UTC

David Balmain wrote:
> Until now I haven't really used the TermVectors so this probably isn't
> the best way to do it but here goes (this is very rough);

I'm going to try this out now. I'll also try extracting all term vectors
from doc1 and using them as a query on doc2 (using a BooleanQuery). They
use this kind of method in "Lucene in Action" (somewhere around page 190
if I recall correctly).

Thanks for your quick responses; I'll let you know how things work out.

Cheers,

Jeroen Bulters

David Balmain

2006-May-27 22:36 UTC

On 5/28/06, Jeroen Bulters <jeroenbulters at gmail.com> wrote:
> David Balmain wrote:
> > Until now I haven't really used the TermVectors so this probably isn't
> > the best way to do it but here goes (this is very rough);
>
> I'm going to try this out now. I'll also try extracting all term vectors
> from doc1 and using them as a query on doc2 (using a BooleanQuery). They
> use this kind of method in "Lucene in Action" (somewhere around page 190
> if I recall correctly).

If it's a "More Like This" query that you are trying to write, I
recommend you look at the Lucene code here:

    http://svn.apache.org/viewvc/lucene/java/branches/lucene_2_0/contrib/similarity/src/java/org/apache/lucene/search/similar/MoreLikeThis.java?revision=409698&view=markup

It's part of Lucene 2.0 now. I'll be adding MoreLikeThis queries in
the near future.

Cheers,
Dave

Jeroen Bulters

2006-May-28 12:37 UTC

Yes, it is a "More Like This" query, but I only want the relevance score
for document B given document A as the query (so weblog:B AND
all_terms_from_A).
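
A rough sketch of how such a query string could be assembled. It assumes a Lucene-style syntax where `+` marks a required clause, and the field names `id` and `content` from the indexing snippet earlier in the thread; `similarity_query` is a made-up helper, so check the query parser documentation before relying on the exact syntax:

```ruby
# Build a Lucene-style query string: document B's id is required, and
# the terms from A are optional clauses, each match adding to the score.
# Field names and query syntax are assumptions for illustration only.
# Terms containing double quotes would need escaping, which is skipped here.
def similarity_query(id_b, terms_a, id_field: "id", content_field: "content")
  clauses = terms_a.map { |term| "#{content_field}:\"#{term}\"" }
  "+#{id_field}:\"#{id_b}\" +(#{clauses.join(' ')})"
end

puts similarity_query("B", %w[ruby ferret])
# prints: +id:"B" +(content:"ruby" content:"ferret")
```

With thousands of distinct terms in A the query gets very long, which is one reason a MoreLikeThis-style approach keeps only the most informative terms.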

I'll look into it; thesis is due in 4 weeks so I've got loads of time :D

Cheers,

Jeroen Bulters

Jens Kraemer

2006-May-29 07:33 UTC

On Sun, May 28, 2006 at 07:36:25AM +0900, David Balmain wrote:
> On 5/28/06, Jeroen Bulters <jeroenbulters at gmail.com> wrote:
> > David Balmain wrote:
> > > Until now I haven't really used the TermVectors so this probably isn't
> > > the best way to do it but here goes (this is very rough);
> >
> > I'm going to try this out now. I'll also try extracting all term vectors
> > from doc1 and using them as a query on doc2 (using a BooleanQuery). They
> > use this kind of method in "Lucene in Action" (somewhere around page 190
> > if I recall correctly).
>
> If it's a "More Like This" query that you are trying to write, I
> recommend you look at the Lucene code here;
>
> http://svn.apache.org/viewvc/lucene/java/branches/lucene_2_0/contrib/similarity/src/java/org/apache/lucene/search/similar/MoreLikeThis.java?revision=409698&view=markup

Or you can check out the port of this that lives in acts_as_ferret :-)

http://projects.jkraemer.net/acts_as_ferret/browser/trunk/plugin/acts_as_ferret/lib/acts_as_ferret.rb
from line 525 to around line 720.

> It's part of Lucene 2.0 now. I'll be adding MoreLikeThis Queries in
> the near future.

Dave, that's a nice idea. Should I try to prepare a patch for this based
on what I did in acts_as_ferret? It would be Ruby-only, though. But as the
whole "more like this" thing is more or less about building a BooleanQuery,
I think speed is no issue here.
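
For readers who haven't looked at either implementation: as far as I understand it, the core of a "more like this" query is picking the most informative terms of the source document by tf * idf and OR-ing them together in a BooleanQuery. The term-selection step might look roughly like this. It is not a port of the Lucene or acts_as_ferret code; `interesting_terms`, its parameters, and the sample numbers are all hypothetical:

```ruby
# Pick the top-N terms of a document by tf * idf.
# term_freqs: term => frequency in the source document
# doc_freqs:  term => number of documents containing the term
#             (hypothetical input; it would come from the index)
# num_docs:   total number of documents in the index
def interesting_terms(term_freqs, doc_freqs, num_docs, max_terms = 25)
  scored = term_freqs.map do |term, tf|
    df = doc_freqs.fetch(term, 0)
    # Classic Lucene-style idf: rare terms get a higher weight.
    idf = Math.log(num_docs.to_f / (df + 1)) + 1.0
    [term, tf * idf]
  end
  # Highest score first, keep at most max_terms terms.
  scored.sort_by { |_term, score| -score }.first(max_terms).map(&:first)
end

term_freqs = { "the" => 10, "ferret" => 3, "cosine" => 2 }
doc_freqs  = { "the" => 1000, "ferret" => 10, "cosine" => 5 }

# "the" is frequent in this document but appears everywhere, so it ranks last.
p interesting_terms(term_freqs, doc_freqs, 1000, 2)
```

The selected terms would then become the optional clauses of the BooleanQuery.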

Jens


-- 
webit! Gesellschaft für neue Medien mbH          www.webit.de
Dipl.-Wirtschaftsingenieur Jens Krämer       kraemer at webit.de
Schnorrstraße 76                         Tel +49 351 46766  0
D-01069 Dresden                          Fax +49 351 46766 66

David Balmain

2006-May-29 07:54 UTC

On 5/29/06, Jens Kraemer <kraemer at webit.de> wrote:
> <snip>
>
> On Sun, May 28, 2006 at 07:36:25AM +0900, David Balmain wrote:
> > It's part of Lucene 2.0 now. I'll be adding MoreLikeThis Queries in
> > the near future.
>
> Dave, that's a nice idea. Should I try to prepare a patch for this based
> on what I did in acts_as_ferret ? Would be ruby-only, though. But as the
> whole more like this thing more or less is about building a BooleanQuery,
> I think speed is no issue here.

Hi Jens,

That'd be great but not just yet. I may be making a few adjustments to
the API in the coming week. I'll be sure to discuss possible changes
with you guys when the time comes.

Gotta run. Cheers,
Dave
