All

I've been using Ferret for the past few days and have been wondering how best to integrate it with Rails. Initially I just set up the following in my environment.rb file:

    require 'ferret'
    include Ferret

    @@index = Index::Index.new()

However, this felt very unclean to me, so I decided the best way to do it would be to create a separate model like this:

    require 'singleton'
    require 'ferret'

    class CompanyIndex
      include Singleton
      include Ferret

      SEARCH_INDEX = "#{RAILS_ROOT}/db/index.db"

      def initialize
        @index = Index::Index.new(:path => SEARCH_INDEX)
      end

      def search
        # search code
      end

      def add
        # code to add to index
      end
    end

etc. I then access it using CompanyIndex.instance.

Now, is this the best way to do this? Are there cleaner ways of integrating it?

Also, since Ferret requires a lock on the index, am I correct in assuming that having multiple Rails dispatch threads running would not work?

Luke
David Balmain
2005-Oct-26 11:36 UTC
Re: RFC: How best to integrate Ferret with a Rails project
On 10/26/05, Luke Randall <luke.randall-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> etc. I then access it using CompanyIndex.instance.
>
> Now, is this the best way to do this? Are there cleaner ways of
> integrating it?

I'd be interested to see an answer to this myself. I like what you have so far.

> Also, since Ferret requires a lock on the index, am I correct in
> assuming that having multiple Rails dispatch threads running would not
> work?

Sort of. You can have as many dispatch threads as you like as long as only one is updating the index at a time. So one solution would be to have one dispatch thread that handles all updates, which probably won't fit easily into a Rails app. Another would be to only open the writer when you need it. In this case, it'd be a good idea, if possible, to batch your updates. I've added a flush method to the next version so that you can do this (thanks to Nick Stuart):

    def do_index_update(...)
      index.delete(id1)
      index.delete(id2)
      index << { :id => id1, :contents => "yada yada yada" }
      index << { :id => id2, :contents => "yada yada yada" }
      index << { :id => id3, :contents => "yada yada yada" }
      # make sure the writer is closed.
      index.flush() # <= coming in next version
    end

If you are looking for maximum speed and control, you'll want to use the Index::IndexReader, Index::IndexWriter and Search::IndexSearcher classes directly. They aren't documented very well yet, so you may want to have a look at the code for the Index::Index class to see how I handle them.

Also note that Index::Index is not threadsafe yet. I'm working on that right now and it should be out by the end of the week.

Hope that helps. Feedback is most welcome,
Dave

_______________________________________________
Rails mailing list
Rails-1W37MKcQCpIf0INCOvqR/iCwEArCW2h5@public.gmane.org
http://lists.rubyonrails.org/mailman/listinfo/rails
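One way to picture the "batch your updates, one writer at a time" advice is the rough sketch below. It is not Ferret code: FakeIndex is a stand-in written only so the example runs without the ferret gem, and BatchedUpdater is a hypothetical helper name. The only calls it assumes on the index are <<, delete and flush, as used in the message above. Note that a Mutex only serialises threads inside one process; separate dispatcher processes would still rely on the index's own lock.

```ruby
require 'thread'

# Stand-in for an index object. NOT Ferret's implementation; it only mimics
# the three calls used in the message above: <<, delete and flush.
class FakeIndex
  attr_reader :docs, :flush_count

  def initialize
    @docs = {}
    @flush_count = 0
  end

  # Add (or overwrite) a document keyed by its :id.
  def <<(doc)
    @docs[doc[:id]] = doc
    self
  end

  def delete(id)
    @docs.delete(id)
  end

  # Counts calls so the sketch can show one flush per batch.
  def flush
    @flush_count += 1
  end
end

# Hypothetical helper: lets only one thread in this process write at a time
# and flushes once per batch instead of once per document.
class BatchedUpdater
  def initialize(index)
    @index = index
    @lock = Mutex.new
  end

  # deletes: array of ids, adds: array of document hashes
  def update(deletes, adds)
    @lock.synchronize do
      begin
        deletes.each { |id| @index.delete(id) }
        adds.each { |doc| @index << doc }
      ensure
        # Flush even if one of the updates raised.
        @index.flush
      end
    end
  end
end

index = FakeIndex.new
updater = BatchedUpdater.new(index)
updater.update([], [{ :id => 1, :contents => "yada" }, { :id => 2, :contents => "yada" }])
updater.update([1], [{ :id => 3, :contents => "yada" }])
```

After the two calls the stand-in holds documents 2 and 3 and has been flushed twice, once per batch.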
Jan Prill
2005-Oct-26 12:11 UTC
Re: RFC: How best to integrate Ferret and ANN: howto on wiki
Hello list,

once again: /* outing myself as ruby and RoR newbie */

I've put up a howto in the misc howto section on the Rails wiki:
http://wiki.rubyonrails.com/rails/pages/HowToIntegrateFerretWithRails

Please consider this an attempt to give something back to the community, of the kind even a Rails newbie like me can provide. I had nearly finished it when Luke's thread started, so I'm looking forward to a 'best practice' way of integrating Ferret into Rails. Please keep the wiki page up to date as you proceed, if possible. Thanks again to Dave, Luke and the others pushing things forward on the important feature of full-text search in web development...

As you've surely realized, I'm not a native English speaker, so hopefully my mistakes will be ironed out over time by corrections to
http://wiki.rubyonrails.com/rails/pages/HowToIntegrateFerretWithRails

best regards
Jan Prill
David Balmain
2005-Oct-26 13:17 UTC
Re: RFC: How best to integrate Ferret and ANN: howto on wiki
Wow, thanks for doing that, Jan. That looks great. I don't have time to go through it all now, as I'm having some trouble with the threading, but here are a couple of things.

Once the next version of Ferret is out, you'll probably want to do an index.flush() where you have the index.optimize() at the end. This should solve the write lock problem, I think.

The other thing is that pagination in Ferret can be done by setting num_docs and first_doc. So if you have 10 results per page and you want to show page 4:

    index.search_each(conditions, {:num_docs => 10, :first_doc => 30}) do |doc, score|
      @records << index[doc]
    end

Not sure how this would work in Rails, but I hope it helps.

Regards,
Dave
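The page-4 example above works out as first_doc = (page - 1) * per_page, that is (4 - 1) * 10 = 30. A small sketch of that arithmetic follows; page_options is a hypothetical helper name, and the only thing taken from the message is the pair of option keys, :num_docs and :first_doc.

```ruby
# Hypothetical helper: turns a 1-based page number into the :num_docs and
# :first_doc options shown in the search_each call above.
def page_options(page, per_page = 10)
  page = 1 if page < 1 # treat missing or bogus page numbers as page 1
  { :num_docs => per_page, :first_doc => (page - 1) * per_page }
end

opts = page_options(4)
# opts[:num_docs] is 10 and opts[:first_doc] is 30
```

In a controller the page number would presumably come from something like params[:page].to_i, which is why values below 1 are clamped.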
Jan Prill
2005-Oct-26 13:25 UTC
Re: RFC: How best to integrate Ferret and ANN: howto on wiki
Hi, Dave,

indeed, going by 'first_doc' has to be the way to go. I tried that and had issues with it. I'll try again. For now I just wanted to get this little howto up and running. There are so many smart railers and rubyists around that I bet there will be a 'best ferret integration practice' in no time.

Hopefully you get all the help you need from these 'masters', since Ferret is the thing I've missed most in my first steps on Rails... I'm trying to do content-centric stuff (meaning loads of text), and you can't do that without a good full-text search...

regards
Jan
Luke Randall
2005-Oct-26 17:33 UTC
Re: RFC: How best to integrate Ferret with a Rails project
> into a rails app. Another would be to only open the writer when you need it.
> In this case, it'd be a good idea if possible to batch your updates.

Thanks, this will be most useful.

> If you are looking for maximum speed and control, you'll want to use the
> Index::IndexReader, Index::IndexWriter and Search::IndexSearcher classes
> directly.

Also, if cFerret is compiled, does it automatically use that? This being on a Linux box... I was just wondering because currently indexing takes about 30s per 100 records, which seems to me must be because it is using the Ruby implementation.

Sorry, one more (noob type) question. This relates more to how the indexer itself works. I've tried to Google around for this but haven't found anything...

I've got my own analyzer set up, using the Porter Stemmer. Now, when I search, does it matter that my query obviously isn't being stemmed? Does it just match it to the beginning of the word or what? Or does the query actually get stemmed as well? (I haven't looked at the query & search code that much). Sorry if this is a dumb question, but I've just been wondering about it.

Thanks in advance
Luke

PS Thanks for all the work that you've been doing. Ferret really came just in time for me, and is working great.
Erik Hatcher
2005-Oct-26 17:44 UTC
Re: RFC: How best to integrate Ferret with a Rails project
On 26 Oct 2005, at 13:33, Luke Randall wrote:
> Sorry, one more (noob type) question. This relates more to how the
> indexer itself works. I've tried to Google around for this but haven't
> found anything...
>
> I've got my own analyzer set up, using the Porter Stemmer. Now, when I
> search, does it matter that my query obviously isn't being stemmed?
> Does it just match it to the beginning of the word or what? Or does
> the query actually get stemmed as well? (I haven't looked at the query
> & search code that much). Sorry if this is a dumb question, but I've
> just been wondering about it.

It is very important that the terms (or tokens within a field) match from what was indexed to the query. Whether the same analyzer, or not, is a good question - one that doesn't have a definite answer, but at the very least they must be compatible. So you need to ensure a _compatible_ (likely the same) analyzer is used for parsing a query as was used during indexing.

More specifically - if you're stemming during indexing, you'll need to stem the terms of the query to match properly.

Erik
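To make the point about matching terms concrete, here is a toy sketch in plain Ruby. It does not use Ferret at all: toy_stem is a crude suffix stripper invented for this example (it is not the Porter algorithm), and analyze and build_index are likewise hypothetical names. It only shows that an index built from stemmed terms cannot be hit by an unstemmed query term.

```ruby
# Crude suffix stripper standing in for a real stemmer. NOT the Porter
# algorithm; it exists only to make the mismatch visible.
def toy_stem(word)
  word.downcase.sub(/(ing|ed|s)\z/, '')
end

# Toy analyzer: split on non-letters, then stem every token.
def analyze(text)
  text.scan(/[A-Za-z]+/).map { |w| toy_stem(w) }
end

# Toy inverted index: term => list of document ids containing it.
def build_index(docs)
  index = Hash.new { |h, k| h[k] = [] }
  docs.each do |id, text|
    analyze(text).uniq.each { |term| index[term] << id }
  end
  index
end

docs = { 1 => "Indexing documents", 2 => "The indexed document" }
index = build_index(docs)

# Query term used as typed: the index only holds stemmed terms, so no match.
raw_hits = index.fetch("indexing", [])

# Query term run through the same stemming as the documents: both match.
stemmed_hits = index.fetch(toy_stem("indexing"), [])
```

Here raw_hits comes back empty while stemmed_hits contains both document ids, which is the "compatible analyzer" requirement in miniature.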
Luke Randall
2005-Oct-26 18:35 UTC
Re: RFC: How best to integrate Ferret with a Rails project
> It is very important that the terms (or tokens within a field) match
> from what was indexed to the query. Whether the same analyzer, or
> not, is a good question - one that doesn't have a definite answer,
> but at the very least they must be compatible. So you need to ensure
> a _compatible_ (likely the same) analyzer is used for parsing a query
> as was used during indexing.
>
> More specifically - if you're stemming during indexing, you'll need
> to stem the terms of the query to match properly.
>
> Erik

Thanks a lot. As I'm sure is obvious, I am very new to information retrieval so I don't know much at all about it. However, that seemed likely to me, so I wanted to check with someone who knows.
David Balmain
2005-Oct-27 02:40 UTC
Re: RFC: How best to integrate Ferret with a Rails project
On 10/27/05, Luke Randall <luke.randall-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> > It is very important that the terms (or tokens within a field) match
> > from what was indexed to the query. Whether the same analyzer, or
> > not, is a good question - one that doesn't have a definite answer,
> > but at the very least they must be compatible. So you need to ensure
> > a _compatible_ (likely the same) analyzer is used for parsing a query
> > as was used during indexing.
> >
> > More specifically - if you're stemming during indexing, you'll need
> > to stem the terms of the query to match properly.
> >
> > Erik

Just to add to that: if you are using the Index::Index class, it will handle it all for you, i.e. the same analyzer is used in the indexer and the query parser. Otherwise, like Erik said, you'll need to make sure your analyzers are at least compatible.
David Balmain
2005-Oct-27 02:44 UTC
Re: RFC: How best to integrate Ferret with a Rails project
On 10/27/05, Luke Randall <luke.randall-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> Also, if cFerret is compiled, does it automatically use that? This
> being on a Linux box... I was just wondering because currently
> indexing takes about 30s per 100 records, which seems to me must be
> because it is using the Ruby implementation.

The C extensions included with Ferret are not cFerret. They are just some basic extensions which double the speed of the indexer. cFerret is a full rewrite of the indexer in C and is 100 times faster. Once I get the Ruby version stable, I'll start work on integrating cFerret.

Cheers,
Dave