thr3ads.net - Rails - RFC: How best to integrate Ferret with a Rails project [Oct 2005]

Luke Randall

2005-Oct-26 10:02 UTC

RFC: How best to integrate Ferret with a Rails project

All

I've been using Ferret for the past few days and have been wondering
how best to integrate it with Rails. Initially I just set up the
following in my environment.rb file:

require 'ferret'
include Ferret

@@index = Index::Index.new()

However, this felt very unclean to me, so I decided the best way to do
it would be to create a separate model like this:

require 'singleton'
require 'ferret'

class CompanyIndex
  include Singleton
  include Ferret

  SEARCH_INDEX = "#{RAILS_ROOT}/db/index.db"

  def initialize
    @index = Index::Index.new(:path => SEARCH_INDEX)
  end

  def search
  # search code
  end

  def add
  # code to add to index
  end
end

etc. I then access it using CompanyIndex.instance.

Now, is this the best way to do this? Are there cleaner ways of integrating it?

Also, since Ferret requires a lock on the index, am I correct in
assuming that having multiple Rails dispatch threads running would not
work?

Luke

David Balmain

2005-Oct-26 11:36 UTC

Re: RFC: How best to integrate Ferret with a Rails project

On 10/26/05, Luke Randall <luke.randall-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> Now, is this the best way to do this? Are there cleaner ways of
> integrating it?

I'd be interested to see an answer to this myself. I like what you have
so far.

> Also, since Ferret requires a lock on the index, am I correct in
> assuming that having multiple Rails dispatch threads running would not
> work?

Sort of. You can have as many dispatch threads as you like as long as only
one is updating the index at a time. So one solution would be to have one
dispatch thread that handles all updates, which probably won't work easily
into a Rails app. Another would be to only open the writer when you need it.
In this case, it'd be a good idea, if possible, to batch your updates. I've
added a flush method to the next version so that you can do this (thanks to
Nick Stuart):

def do_index_update(...)
  index.delete(id1)
  index.delete(id2)
  index << { :id => id1, :contents => "yada yada yada" }
  index << { :id => id2, :contents => "yada yada yada" }
  index << { :id => id3, :contents => "yada yada yada" }

  # make sure the writer is closed.
  index.flush() # <= coming in next version
end
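
The "only one updater at a time" rule with batched writes can be sketched as
below. This is a rough illustration, not Ferret code: FakeIndex is a stand-in
that only mimics the delete, << and flush calls used above, and BatchedUpdater
is a hypothetical wrapper name. It also assumes all writers live in one Ruby
process; a Mutex does nothing to coordinate separate dispatcher processes.

```ruby
require 'thread' # Mutex; already loaded on modern Rubies

# Stand-in for a Ferret index so the sketch runs on its own.
# Only delete, << and flush are mimicked; this is NOT the Ferret API.
class FakeIndex
  attr_reader :docs, :flushes

  def initialize
    @docs = {}
    @flushes = 0
  end

  def delete(id)
    @docs.delete(id)
  end

  def <<(doc)
    @docs[doc[:id]] = doc
    self
  end

  def flush
    @flushes += 1
  end
end

# Hypothetical wrapper: each batch of writes runs under one lock
# and ends with a single flush.
class BatchedUpdater
  def initialize(index)
    @index = index
    @lock = Mutex.new
  end

  def update(docs)
    @lock.synchronize do
      docs.each do |doc|
        @index.delete(doc[:id]) # drop any stale copy first
        @index << doc
      end
      @index.flush
    end
  end
end

index = FakeIndex.new
updater = BatchedUpdater.new(index)

# Four threads each submit a one-document batch.
threads = 4.times.map do |t|
  Thread.new do
    updater.update([{ :id => t, :contents => "yada yada yada" }])
  end
end
threads.each(&:join)

puts index.docs.size
puts index.flushes
```

Readers never take the lock in this sketch, which matches the point above
that any number of threads may search while a single one writes.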

If you are looking for maximum speed and control, you'll want to use the
Index::IndexReader, Index::IndexWriter and Search::IndexSearcher classes
directly. They aren't documented very well yet, so you may want to have a
look at the code for the Index::Index class to see how I handle them. Also
note that Index::Index is not threadsafe yet. I'm working on that right now
and it should be out by the end of the week.

Hope that helps. Feedback is most welcome,
Dave

_______________________________________________
Rails mailing list
Rails-1W37MKcQCpIf0INCOvqR/iCwEArCW2h5@public.gmane.org
http://lists.rubyonrails.org/mailman/listinfo/rails

Jan Prill

2005-Oct-26 12:11 UTC

Re: RFC: How best to integrate Ferret and ANN: howto on wiki

Hello list,

once again: /* outing myself as ruby and RoR newbie */

I've put up a howto in the misc howto section on the Rails wiki:
http://wiki.rubyonrails.com/rails/pages/HowToIntegrateFerretWithRails

Please consider this an attempt to give something back to the community,
of the kind even a Rails newbie like me can provide. I had nearly finished
it when Luke's thread started, so I'm looking forward to a 'best practice'
way of integrating Ferret into Rails. Please keep the wiki page up to date
as you proceed, if possible. Thanks again to Dave, Luke and the others
pushing things forward on the important feature of full-text search in web
development...

As you've surely realized, I'm not a native English speaker, so hopefully
time will weed out my mistakes as corrections are made to
http://wiki.rubyonrails.com/rails/pages/HowToIntegrateFerretWithRails

best regards
Jan Prill

David Balmain

2005-Oct-26 13:17 UTC

Re: RFC: How best to integrate Ferret and ANN: howto on wiki

Wow, thanks for doing that, Jan. That looks great. I don't have time to go
through it all now. I'm having some trouble with the threading. A couple of
things, though. Once the next version of Ferret is out, you'll probably want
to do an index.flush() where you have the index.optimize() at the end. This
should solve the write lock problem, I think. The other thing is that
pagination in Ferret can be done by setting num_docs and first_doc. So if you
have 10 results per page and you want to show page 4:

      index.search_each(conditions, {:num_docs => 10, :first_doc => 30}) do |doc, score|
        @records << index[doc]
      end

Not sure how this would work in Rails but I hope it helps.
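
The arithmetic behind the page 4 example is first_doc = (page - 1) * per_page,
with pages counted from 1. A small sketch of that calculation follows; the
helper names first_doc_for and page_options are made up for illustration and
are not part of Ferret. Only the :num_docs and :first_doc keys come from the
message above.

```ruby
# Offset of the first hit on a 1-based page.
def first_doc_for(page, per_page)
  raise ArgumentError, "page must be >= 1" if page < 1
  (page - 1) * per_page
end

# Builds the options hash in the shape passed to search_each above.
def page_options(page, per_page = 10)
  { :num_docs => per_page, :first_doc => first_doc_for(page, per_page) }
end

p page_options(4)
```

In a Rails action the page number would presumably come from the request
parameters, converted to an integer before being passed in.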

Regards,
Dave

Jan Prill

2005-Oct-26 13:25 UTC

Re: RFC: How best to integrate Ferret and ANN: howto on wiki

Hi Dave,

indeed, going by 'first_doc' has to be the way to go. I tried that and
had issues with it. I'll try again. For now I just wanted to get this
little howto up and running. There are so many smart railers and
rubyists around that I bet there will be a 'best Ferret integration
practice' in no time. Hopefully you get all the help you need from these
'masters', since Ferret is the thing I've missed most in my first steps
on Rails... I'm trying to do content-centric (meaning loads of text)
stuff, and you can't do that without a good full-text search...

regards
Jan

Luke Randall

2005-Oct-26 17:33 UTC

Re: RFC: How best to integrate Ferret with a Rails project

> into a rails app. Another would be to only open the writer when you need it.
> In this case, it'd be a good idea if possible to batch your updates.

Thanks, this will be most useful.

> If you are looking for maximum speed and control, you'll want to use the
> Index::IndexReader, Index::IndexWriter and Search::IndexSearcher classes
> directly.

Also, if cFerret is compiled, does it automatically use that? This
being on a Linux box... I was just wondering because currently
indexing takes about 30s per 100 records, which seems to me must be
because it is using the Ruby implementation.

Sorry, one more (noob type) question. This relates more to how the
indexer itself works. I've tried to Google around for this but haven't
found anything...

I've got my own analyzer set up, using the Porter Stemmer. Now, when I
search, does it matter that my query obviously isn't being stemmed?
Does it just match it to the beginning of the word or what? Or does
the query actually get stemmed as well? (I haven't looked at the query
& search code that much). Sorry if this is a dumb question, but I've
just been wondering about it.

Thanks in advance
Luke

PS Thanks for all the work that you've been doing. Ferret really came
just in time for me, and is working great.

Erik Hatcher

2005-Oct-26 17:44 UTC

Re: RFC: How best to integrate Ferret with a Rails project

On 26 Oct 2005, at 13:33, Luke Randall wrote:
> Sorry, one more (noob type) question. This relates more to how the
> indexer itself works. I've tried to Google around for this but haven't
> found anything...
>
> I've got my own analyzer set up, using the Porter Stemmer. Now, when I
> search, does it matter that my query obviously isn't being stemmed?
> Does it just match it to the beginning of the word or what? Or does
> the query actually get stemmed as well? (I haven't looked at the query
> & search code that much). Sorry if this is a dumb question, but I've
> just been wondering about it.

It is very important that the terms (or tokens within a field) match
from what was indexed to the query. Whether the same analyzer, or
not, is a good question - one that doesn't have a definite answer,
but at the very least they must be compatible. So you need to ensure
a _compatible_ (likely the same) analyzer is used for parsing a query
as was used during indexing.

More specifically - if you're stemming during indexing, you'll need
to stem the terms of the query to match properly.

     Erik
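
The point about compatible analyzers can be seen with a toy inverted index.
This is only a sketch: toy_stem is a crude suffix stripper standing in for a
real stemmer (it is not the Porter algorithm), and analyze and build_index
are made-up names with no connection to Ferret's classes. The same analyze
step is applied at index time and, for the second lookup, at query time.

```ruby
# Crude suffix stripper standing in for a real stemmer (NOT Porter).
def toy_stem(word)
  word.downcase.sub(/(ing|ed|s)\z/, '')
end

# The "analyzer": split into word tokens, then stem each one.
def analyze(text)
  text.scan(/\w+/).map { |w| toy_stem(w) }
end

# Inverted index: stemmed term => list of document ids.
def build_index(docs)
  index = Hash.new { |h, k| h[k] = [] }
  docs.each do |id, text|
    analyze(text).uniq.each { |term| index[term] << id }
  end
  index
end

docs = { 1 => "Indexing documents", 2 => "The index was searched" }
index = build_index(docs)

# Query term looked up as typed: the index only holds stems, so it misses.
raw_hits = index.fetch("indexing", [])

# Query run through the same analyzer first: it matches both documents.
stemmed_hits = analyze("indexing").flat_map { |t| index.fetch(t, []) }.uniq

p raw_hits
p stemmed_hits
```

The unstemmed lookup finds nothing because "indexing" was stored as "index",
while the analyzed query finds documents 1 and 2.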

Luke Randall

2005-Oct-26 18:35 UTC

Re: RFC: How best to integrate Ferret with a Rails project

> It is very important that the terms (or tokens within a field) match
> from what was indexed to the query.  Whether the same analyzer, or
> not, is a good question - one that doesn't have a definite answer,
> but at the very least they must be compatible.  So you need to ensure
> a _compatible_ (likely the same) analyzer is used for parsing a query
> as was used during indexing.
>
> More specifically - if you're stemming during indexing, you'll need
> to stem the terms of the query to match properly.
>
>      Erik

Thanks a lot. As I'm sure is obvious, I am very new to information
retrieval so I don't know much at all about it. However, that seemed
likely to me, so I wanted to check with someone who knows.

David Balmain

2005-Oct-27 02:40 UTC

Re: RFC: How best to integrate Ferret with a Rails project

On 10/27/05, Luke Randall <luke.randall-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> > It is very important that the terms (or tokens within a field) match
> > from what was indexed to the query. Whether the same analyzer, or
> > not, is a good question - one that doesn't have a definite answer,
> > but at the very least they must be compatible. So you need to ensure
> > a _compatible_ (likely the same) analyzer is used for parsing a query
> > as was used during indexing.
> >
> > More specifically - if you're stemming during indexing, you'll need
> > to stem the terms of the query to match properly.
> >
> > Erik

Just to add to that: if you are using the Index::Index class, it will handle
it all for you, i.e. the same analyzer is used in the indexer and the query
parser. Otherwise, like Erik said, you'll need to make sure your analyzers
are at least compatible.


David Balmain

2005-Oct-27 02:44 UTC

Re: RFC: How best to integrate Ferret with a Rails project

On 10/27/05, Luke Randall <luke.randall-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> Also, if cFerret is compiled, does it automatically use that? This
> being on a Linux box... I was just wondering because currently
> indexing takes about 30s per 100 records, which seems to me must be
> because it is using the Ruby implementation.

The C extensions included with Ferret are not cFerret. They are just some
basic extensions which double the speed of the indexer. cFerret is a full
rewrite of the indexer in C and is 100 times faster. Once I get the ruby
version stable, I'll start work on integrating cFerret.

Cheers,
Dave
