Hi,

I have the following problem:

I created a fairly simple sample project to try out acts_as_ferret and present the results.

The test set is relatively simple: I have extracts from 6 Wikipedia articles about several topics, which are copied into a model that has two fields: title and text. This works quite well until I try to use #more_like_this, which returns all of the other articles, even if they have nothing to do with the active article. I debugged a bit and found out that the query built by #more_like_this is nothing more than "-id:<id of the active record>" (so the _result_ is correct for that query).

To try that out on the console, I used:

    entry = Entry.find(1)
    entry.more_like_this(:field_names => ['text'])

Either I'm doing something entirely wrong or there is a bug. ;) Before filing a ticket, I want to rule out the first case.

Ferret version is 0.11.4, aaf version is the current stable version (although trunk didn't work either).

I uploaded the demo project together with a dump of the database to:

Project: http://putstuff.putfile.com/95477/8752808
Dump: http://putstuff.putfile.com/95479/6169502

Thanks in advance.
Florian Gilcher

P.S.: There is another minor bug. Although #more_like_this does set a default option for :field_names (line #35), this option leads to a crash in #retrieve_terms. The default option is nil and #retrieve_terms thus tries to call #each on nil (line #113).
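The crash described in the P.S. (calling #each on a nil :field_names) can be illustrated outside the plugin. The following is only a sketch under assumptions: collect_terms, its arguments, and the fallback behaviour are hypothetical and are not the acts_as_ferret code; the sketch merely shows one way a nil option could be guarded before iterating.

```ruby
# Hypothetical stand-in for a term-collection step; NOT the plugin's code.
# field_names may be nil when the caller passes no :field_names option.
def collect_terms(document, field_names, default_fields)
  # Fall back to the default fields instead of calling #each on nil.
  fields = field_names.nil? ? default_fields : Array(field_names)

  # Count how often each word occurs in the selected fields.
  terms = Hash.new(0)
  fields.each do |field|
    document.fetch(field.to_sym, "").downcase.scan(/\w+/) { |word| terms[word] += 1 }
  end
  terms
end

doc = { :title => "Ferret", :text => "Ferret is a search library" }
p collect_terms(doc, nil, [:title, :text])      # nil falls back to both fields
p collect_terms(doc, ['text'], [:title, :text]) # explicit field list
```

Whether the real method should fall back to all indexed fields or raise a descriptive error is a design decision for the plugin; the sketch only shows the guard.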
Hi,

first of all, 6 documents is not really a corpus to judge the usability of more_like_this: by default it will only consider terms occurring in at least 5 documents to be of any relevance (:min_doc_freq option). So if you have very different documents where the only common words are filtered out as noise words, you'll end up without any terms to use for finding similar documents, which would lead to the query you mentioned.

However, more_like_this should indeed return an empty result set in this case ;-)

Besides that, you should store term vectors (give :term_vector => :yes for the fields you want to use more_like_this on in your call to acts_as_ferret); this will speed up the search for relevant terms.

Jens

On Tue, Jul 17, 2007 at 12:11:55PM +0200, Florian Gilcher wrote:
> [...]
--
Jens Krämer
webit! Gesellschaft für neue Medien mbH
kraemer at webit.de | www.webit.de
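The effect Jens describes, where a document-frequency threshold leaves no usable terms and the query collapses to "-id:<id>", can be reproduced with a small model. This is a simplified sketch and not the acts_as_ferret implementation: similar_query, the tokenizer, the "text:" field prefix and the min_term_freq default of 1 are assumptions for illustration, and noise-word filtering is left out. Only the min_doc_freq default of 5 comes from the message above.

```ruby
# Simplified model of term selection; NOT the acts_as_ferret implementation.
# corpus maps record ids to text. Returns a Ferret-like query string.
def similar_query(corpus, id, min_doc_freq: 5, min_term_freq: 1)
  tokenize = ->(text) { text.downcase.scan(/\w+/) }

  # Document frequency: in how many documents does each term occur?
  doc_freq = Hash.new(0)
  corpus.each_value do |text|
    tokenize.call(text).uniq.each { |term| doc_freq[term] += 1 }
  end

  # Term frequency within the source document.
  term_freq = Hash.new(0)
  tokenize.call(corpus.fetch(id)).each { |term| term_freq[term] += 1 }

  # Keep only terms that pass both thresholds.
  terms = term_freq.select do |term, count|
    count >= min_term_freq && doc_freq[term] >= min_doc_freq
  end.keys.sort

  # The source record itself is always excluded from the results.
  (terms.map { |term| "text:#{term}" } + ["-id:#{id}"]).join(" ")
end

corpus = {
  1 => "ferret indexes documents",
  2 => "ferret searches documents",
  3 => "cats sleep"
}
puts similar_query(corpus, 1)                  # prints "-id:1"
puts similar_query(corpus, 1, min_doc_freq: 2) # prints "text:documents text:ferret -id:1"
```

With three documents no term can occur in five of them, so only the exclusion clause remains and every other record matches. In the plugin the corresponding knob is the :min_doc_freq option mentioned above.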
Hi,

I am aware that the corpus is a bit small (but nicer for presentation purposes), but it surprised me that I found no way (even when playing with the parameters) to get at least one common word from the set. (It wasn't intended to be usable, just presentable.)

I will play around a bit more and add some documents.

Thanks for the hints.

Greetings
Florian Gilcher