I want to compare two documents in the index (i.e. retrieve the cosine similarity/score between the two documents' term vectors). Is this possible using the standard Ferret functionality?

Thanks in advance,
Jeroen Bulters

--
Posted via http://www.ruby-forum.com/.
On 5/27/06, Jeroen Bulters <jeroenbulters at gmail.com> wrote:
> I want to compare two documents in the index (i.e. retrieve the cosine
> similarity/score between two documents term-vector's). Is this possible
> using the standard Ferret functionality?

Hi Jeroen,

No problem. Make sure you store term-vectors when you add the field. That is;

    doc.add_field(:field, "yada yada yada",
                  Field::Store::NO,          # or YES
                  Field::Index::TOKENIZED,   # or UNTOKENIZED
                  Field::TermVector::YES)    # or anything else but NO

Then you can retrieve the term vector from an index reader like so;

    term_vector = index_reader.get_term_vector(doc_num, :field)
    terms = term_vector.terms   # array of terms in :field in document
    freqs = term_vector.freqs   # array of corresponding frequencies

Hope that helps. Is that enough to get you going?

Cheers,
Dave
David Balmain wrote:
> doc.add_field(:field, "yada yada yada",
>               Field::Store::NO,          # or YES
>               Field::Index::TOKENIZED,   # or UNTOKENIZED
>               Field::TermVector::YES)    # or anything else but NO

I got this far:

------ BEGIN CODE SNIPPET ------
# Read weblog data
weblogs = YAML::load(File.open("weblogs.yml"))

# Walk over weblogs and save all data.
print "--- Analyzing weblogs:\n"
weblogs.each do |weblog, id|
  content = ""
  print " * Indexing weblog #{weblog}/#{id} "

  # Load the appropriate file for parsing.
  weblogdata = YAML::load(File.open("./data/#{id}"))
  weblogdata[:posts].each do |id, post|
    # Clean up content
    # by removing all UBB blocks. This will cut-out some content. I consider this
    # loss a plus :D
    content = content + "\n\n" + post[:text].gsub(/\[[^\]]+\][^\[]+\[[^\]]+\]/i, "")
    #content.gsub!(/\[[^\]]+\][^\[]+\[[^\]]+\]/i, "")
  end

  # Create a new document
  doc = Document.new
  doc.add_field(:id, weblog, Field::Store::YES, Field::Index::TOKENIZED, Field::TermVector::NO)
  doc.add_field(:content, content, Field::Store::NO, Field::Index::TOKENIZED, Field::TermVector::YES)

  # And add to the index.
  index << doc
  index.flush
  print "done.\n"
end
------ END CODE SNIPPET ------

I index about 23000 weblogs with their weblog id as the document id and the content by termvector. Now I want to compare two weblogs. So what you suggest is that I retrieve the term-vectors for both documents and calculate the dot product of the two vectors myself; or is there a nice Ferret-way to do this?

Thanks in advance,
Jeroen Bulters
On 5/27/06, Jeroen Bulters <jeroenbulters at gmail.com> wrote:
> I Index about 23000 weblogs with their weblog id as the document id and
> the content by termvector. Now I want to compare two weblogs. So what
> you suggest is that I retrieve the term-vectors for both documents and
> calculate the dotproduct of the two vectors myself; or is there a nice
> Ferret-way to do this?

Until now I haven't really used the TermVectors so this probably isn't the best way to do it but here goes (this is very rough);

    def cosine_similarity(index_reader, doc1, doc2)
      # :data is the name of the field that was indexed with term vectors
      tv1 = index_reader.get_term_vector(doc1, :data)
      terms1 = tv1.terms
      freqs1 = tv1.freqs
      # matrix maps each term to [frequency in doc1, frequency in doc2]
      matrix = {}
      terms1.size.times {|i| matrix[terms1[i]] = [freqs1[i], 0]}

      tv2 = index_reader.get_term_vector(doc2, :data)
      terms2 = tv2.terms
      freqs2 = tv2.freqs
      terms2.size.times {|i| (matrix[terms2[i]] ||= [0])[1] = freqs2[i]}

      dot_product = matrix.values.inject(0) {|dp, (a,b)| dp + a*b}
      lengths_product = Math.sqrt(freqs1.inject(0) {|sp, f| sp + f*f} *
                                  freqs2.inject(0) {|sp, f| sp + f*f})
      dot_product / lengths_product
    end

I'd be interested to hear how you go with this. If performance is poor I can add something like this to the C code.

Hope this helps,
Dave
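For anyone who wants to check the arithmetic without an index at hand, here is a minimal sketch of the same cosine computation on plain Ruby arrays. It assumes the terms and frequencies have already been pulled out of the term vectors as parallel arrays; the method name `cosine_from_term_freqs` is hypothetical and not part of Ferret, and the guard for empty vectors is an addition that the rough version in the thread does not have.

```ruby
# Cosine similarity between two term vectors, each given as parallel arrays
# of terms and raw frequencies. Hypothetical helper, not a Ferret API.
def cosine_from_term_freqs(terms1, freqs1, terms2, freqs2)
  vec1 = terms1.zip(freqs1).to_h
  vec2 = terms2.zip(freqs2).to_h

  # Only terms that occur in both documents contribute to the dot product.
  dot_product = vec1.sum { |term, freq| freq * vec2.fetch(term, 0) }

  norm1 = Math.sqrt(freqs1.sum { |f| f * f })
  norm2 = Math.sqrt(freqs2.sum { |f| f * f })

  # An empty term vector has zero length; treat the similarity as 0.0
  # instead of dividing by zero.
  return 0.0 if norm1.zero? || norm2.zero?

  dot_product / (norm1 * norm2)
end

# The two documents share only "dog": dot product 4, both lengths sqrt(5),
# so the result is approximately 0.8.
puts cosine_from_term_freqs(%w[cat dog], [1, 2], %w[dog fish], [2, 1])
```

Note that this works on raw term frequencies only, as the version above does. It does no idf weighting or length normalisation beyond the cosine itself, so the number will not match a Ferret search score.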
David Balmain wrote:
> Until now I haven't really used the TermVectors so this probably isn't
> the best way to do it but here goes (this is very rough);

I'm going to try this out now. I'll also try extracting all term vectors from doc1 and using them as a query on doc2 (using a BooleanQuery). They use this kind of method in "Lucene in Action" (somewhere around page 190 if I recall correctly).

Thanks for your quick responses; I'll let you know how things work out.

Cheers,
Jeroen Bulters
On 5/28/06, Jeroen Bulters <jeroenbulters at gmail.com> wrote:
> David Balmain wrote:
> > Until now I haven't really used the TermVectors so this probably isn't
> > the best way to do it but here goes (this is very rough);
>
> I'm going to try this out now. I'll also try extracting all term vectors
> from doc1 and using them as a query on doc2 (using a BooleanQuery). They
> use this kind of method in "Lucene in Action" (somewhere around page 190
> if I recall correctly).

If it's a "More Like This" query that you are trying to write, I recommend you look at the Lucene code here;

http://svn.apache.org/viewvc/lucene/java/branches/lucene_2_0/contrib/similarity/src/java/org/apache/lucene/search/similar/MoreLikeThis.java?revision=409698&view=markup

It's part of Lucene 2.0 now. I'll be adding MoreLikeThis Queries in the near future.

Cheers,
Dave
Yes, it is a more-like-this query, but I only want the relevance score for document B given document A as the query (so weblog:B AND all_terms_from_A).

I'll look into it; thesis is due in 4 weeks so I've got loads of time :D

Cheers,
Jeroen Bulters
On Sun, May 28, 2006 at 07:36:25AM +0900, David Balmain wrote:
> On 5/28/06, Jeroen Bulters <jeroenbulters at gmail.com> wrote:
> > David Balmain wrote:
> > > Until now I haven't really used the TermVectors so this probably isn't
> > > the best way to do it but here goes (this is very rough);
> >
> > I'm going to try this out now. I'll also try extracting all term vectors
> > from doc1 and using them as a query on doc2 (using a BooleanQuery). They
> > use this kind of method in "Lucene in Action" (somewhere around page 190
> > if I recall correctly).
>
> If it's a "More Like This" query that you are trying to write, I
> recommend you look at the Lucene code here;
>
> http://svn.apache.org/viewvc/lucene/java/branches/lucene_2_0/contrib/similarity/src/java/org/apache/lucene/search/similar/MoreLikeThis.java?revision=409698&view=markup

or you check out the port of this that lives in acts_as_ferret :-)

http://projects.jkraemer.net/acts_as_ferret/browser/trunk/plugin/acts_as_ferret/lib/acts_as_ferret.rb

from Line 525 till around 720.

> It's part of Lucene 2.0 now. I'll be adding MoreLikeThis Queries in
> the near future.

Dave, that's a nice idea. Should I try to prepare a patch for this based on what I did in acts_as_ferret? Would be ruby-only, though. But as the whole more like this thing more or less is about building a BooleanQuery, I think speed is no issue here.

Jens

--
webit! Gesellschaft für neue Medien mbH          www.webit.de
Dipl.-Wirtschaftsingenieur Jens Krämer       kraemer at webit.de
Schnorrstraße 76                         Tel +49 351 46766  0
D-01069 Dresden                          Fax +49 351 46766 66
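To make the "building a BooleanQuery" remark a little more concrete, here is a rough sketch of the term-selection step that a more-like-this query depends on: weight each term of the source document by tf * idf and keep the strongest ones, which would then become the clauses of the query. Everything here is an assumption for illustration. `top_terms` is a hypothetical helper working on plain hashes, the smoothed idf formula is one common choice rather than necessarily the one Ferret or acts_as_ferret uses, and turning the selected terms into an actual Ferret query is left out.

```ruby
# Pick the highest-weighted terms of a document by tf * idf.
# term_freqs: { term => frequency in the source document }
# doc_freqs:  { term => number of documents in the index containing the term }
# Hypothetical helper on plain Ruby values, not a Ferret API.
def top_terms(term_freqs, doc_freqs, num_docs, max_terms = 25)
  scored = term_freqs.map do |term, tf|
    df = doc_freqs.fetch(term, 0)
    # Smoothed inverse document frequency (an assumed, commonly used form):
    # rare terms get a high weight, very common terms a weight near 1.
    idf = Math.log(num_docs.to_f / (df + 1)) + 1.0
    [term, tf * idf]
  end

  # Highest score first, then keep only the term names.
  scored.sort_by { |_term, score| -score }.first(max_terms).map(&:first)
end

term_freqs = { "ferret" => 3, "the" => 5, "ruby" => 2 }
doc_freqs  = { "ferret" => 5, "the" => 1000, "ruby" => 50 }
p top_terms(term_freqs, doc_freqs, 1000, 2)
```

In this sample the very common word "the" is dropped in spite of its higher raw frequency, which is the point of the idf weighting. Restricting the resulting query to a single target document, as Jeroen describes, would be a separate required clause on the id field.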
On 5/29/06, Jens Kraemer <kraemer at webit.de> wrote:
> <snip>
>
> On Sun, May 28, 2006 at 07:36:25AM +0900, David Balmain wrote:
> > It's part of Lucene 2.0 now. I'll be adding MoreLikeThis Queries in
> > the near future.
>
> Dave, that's a nice idea. Should I try to prepare a patch for this based
> on what I did in acts_as_ferret ? Would be ruby-only, though. But as the
> whole more like this thing more or less is about building a BooleanQuery,
> I think speed is no issue here.

Hi Jens,

That'd be great but not just yet. I may be making a few adjustments to the API in the coming week. I'll be sure to discuss possible changes with you guys when the time comes. Gotta run.

Cheers,
Dave