I've been running some multithreaded tests on Ferret. Using a single Ferret::Index::Index inside a DRb server, it definitely behaves for me as if all readers are locked out of the index when writing is going on in that index, not just during optimization -- at least when segment merging happens, which is when the writes take the longest and you can therefore least afford to lock out all reads.

This is very easy to notice when you add, say, your 100,000th document to the index, and that one write takes over 5 seconds to complete because it triggers a bunch of incremental segment-merging, and all queries to the index stall in the meantime. Or when you add your millionth document, which can stall all reads for over a minute. :-(

When I try to use an IndexReader in a separate process, things are even worse. The IndexReader doesn't see any updates to the index since it was created. Not too surprising, but if I try creating a new IndexReader for every query, and have the Index in the other writing process turn on auto_flush, then the reading process crashes after a few (generally fewer than 100) queries, in one of at least two different ways selected apparently at random:

Failure Mode #1:

    script/ferret_speedtest2_reader:30:in `initialize': IO Error occured at <except.c>:93 in xraise (IOError)
      Error occured in index.c:901 - sis_find_segments_file
      Error reading the segment infos. Store listing was


            from script/ferret_speedtest2_reader:30:in `new'
            from script/ferret_speedtest2_reader:30:in `run_test_query'

[Yes, there really are two blank lines after "Store listing was".]
Failure Mode #2:

    script/ferret_speedtest2_reader:30:in `initialize': IO Error occured at <except.c>:93 in xraise (IOError)
      Error occured in fs_store.c:127 - fs_each
      doing 'each' in /Users/scott/dev/ruby/timetracker/tmp/ferret_speedtest_index: <Too many open files>
            from script/ferret_speedtest2_reader:30:in `new'
            from script/ferret_speedtest2_reader:30:in `run_test_query'

Meanwhile, if I try eliminating this second failure mode by explicitly calling close on the IndexReader before I throw it away, the close immediately crashes with:

    script/ferret_speedtest2_reader:45: [BUG] Bus Error
    ruby 1.8.6 (2007-03-13) [i686-darwin8.8.5]

    Abort trap

Given the combination of problems above, I'm at a loss to understand how to use Ferret on a live website that requires reasonably fast turnaround between a user submitting data and the user being able to search over that data, unless either (1) the site only gets a few thousand new index entries per day and the site can be taken down for a few minutes daily to optimize the index, or (2) it's OK for the entire site to periodically stall on all queries for seconds or even minutes whenever segment-merging happens to kick in.

Do all Ferret users just suck it up and live with one of these limitations, or am I missing something and/or just getting "lucky" with the errors above?

For reference, the system being used here is a Mac running Leopard, although I doubt that matters...
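For context, the single-process setup described above boils down to something like the following sketch. An Array stands in for Ferret::Index::Index so this runs without the ferret gem, and the names (IndexFront, add_document, search) are illustrative rather than taken from the actual test scripts. Routing both writes and searches through one lock models the observed behavior: a slow, merge-triggering write stalls every query queued behind it.

```ruby
require 'drb/drb'

# Sketch of one DRb front object owning the index. The Array is a
# stand-in for Ferret::Index::Index; the single Mutex models Ferret's
# apparent internal behavior, where readers wait on the same lock that
# a long segment-merging write is holding.
class IndexFront
  def initialize
    @docs = []            # stand-in for the real index
    @lock = Mutex.new
  end

  def add_document(doc)
    @lock.synchronize { @docs << doc }
  end

  def search(term)
    @lock.synchronize { @docs.select { |d| d.include?(term) } }
  end
end
```

A server process would publish it with `DRb.start_service('druby://localhost:9999', IndexFront.new)` followed by `DRb.thread.join`, and each client would talk to it through `DRbObject.new_with_uri`.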
Scott,

> Do all Ferret users just suck it up and live with one of these
> limitations, or am I missing something and/or just getting "lucky"
> with the errors above?

These limitations you're talking about are known and will be fixed in the near future. The trick is to have one read-only and one write-only index, and this is currently being worked on. If you need a fix right now, you need to do it yourself, but you can take a look at omdb's code and how it's done there:

http://bugs.omdb.org/browser/branches/2007.1/lib/omdb/ferret/lib/util.rb
(see the switch code)

If you don't need a fix right now, I'm sure AAF will come up with a solution for that in the near future (aka probably not this year).

On a side note, for the "too many open files" error, see:

http://ferret.davebalmain.com/api/classes/Ferret/Index/IndexWriter.html

(use_compound_file -- you may have set this to false) or simply increase the number of open files. On omdb we're running with 32k :-)

    rails at homer.omdb.org ~ $ ulimit -n
    32768

Cheers
Ben
Hi Ben --

Thanks much for the quick and helpful reply! Unfortunately, the solution you're using on omdb looks suspect to me, for the same reason that Alex Neth brought up a few days ago on this list: to my knowledge there's no guarantee that rsync will produce a coherent snapshot of the source directory as it was at any one particular instant in time. In fact, I don't see how rsync could both always terminate in finite time and provide such a guarantee, except on exotic filesystems that provide, say, atomic snapshots with copy-on-write capabilities. (Sigh...sometimes I miss the Google File System.)

In which case you'd have to disable your site during the rsync in order to prevent corruption, which basically boils down to the "must take site offline daily for a few minutes to deal with this problem" limitation. I'm guessing the rsync is faster than an index optimization, so I guess this might at least cut down on the amount of time the site has to be down, but still...wah.

Am I a fool for wondering whether it might ultimately be less painful to try an index server that runs Lucene under a JRuby process?

On Nov 16, 2007 4:12 AM, Benjamin Krause <bk at benjaminkrause.com> wrote:
> These limitations you're talking about are known and will be fixed
> in the near future. The trick is to have one read-only and one
> write-only index, and this is currently being worked on.
> [...]

_______________________________________________
Ferret-talk mailing list
Ferret-talk at rubyforge.org
http://rubyforge.org/mailman/listinfo/ferret-talk
On Nov 16, 2007, at 3:35 PM, Scott Davies wrote:

> Am I a fool for wondering whether it might ultimately be less painful
> to try an index server that runs Lucene under a JRuby process?

Or, rather, an index server that runs Solr, accessed with the pure-Ruby solr-ruby API (which works with MRI or JRuby)? :)

	Erik
Scott,

we're using two directories for Ferret, not one. One index is the passive index: it is not used for searches, but new indexing requests will be added to it. So let's call it the indexing-index.

All mongrels will use the second directory; let's call it the searching-index. Both indexes are almost identical -- I'll explain the differences.

All our indexing requests are queued. So whenever you want to index something, it will be placed in the queue, and added to the indexing-index. After a certain number of queue items have been added to the index, we stop indexing. The queue will be halted: new requests can be added, but nothing will be added to the indexing-index.

Now we rsync the indexing-index to all machines -- remember, searching is still done in the searching-index, which is outdated, but we don't mind about that :)

After the rsync is complete, we switch both directories, so the indexing-index becomes the searching-index and vice versa. Actually we're just switching symlinks, so this will take almost no time. And even if one of the mongrels still has a filehandle to the old index open, nothing will happen; it is still using the outdated index, but the next request will use the new index. After that, the new indexing-index will be synced from the searching-index. As the searching-index is read-only, there is no risk of corrupting something during the sync.

Now we resume processing the queue, until we've added our certain amount of queue entries, or the queue is empty.

The downside is that the searching-index is outdated, but by no more than a couple of minutes (about 2 minutes on omdb). We haven't had one corrupted index since. There is no downtime whatsoever, and the rsync snapshot will always be coherent.

Cheers
Ben
Ben --

Thanks for the detailed explanation! Yes, that does make sense. If I understand it correctly, though, something won't show up in a search until at least one index switch happens after it's been submitted, which means we're talking about a minute or so on average (not just worst-case) from submission to search result, even if the switches are being done constantly (given that each switch takes about two minutes).

For my site, I'm really hoping that most content will show up within a second or so of its submission. That simply can't happen if I'm not updating the same index I'm doing searches with. I'd be OK with the turnaround *occasionally* being a minute -- say, while an index optimization or particularly large segment merge happens. But so far it looks to me like the choices with Ferret are either:

(1) The *average* time from submission to search result is on the order of minutes. However, searches are always reasonably fast. (Your approach.)

(2) The average time from submission to search result is less than a second. However, the *worst-case* times can be minutes, and now all *searches* stall over those minutes as well, which is Bad. If you don't get more than a few thousand submissions per day, you can at least schedule these outages as nightly index optimizations, but you'll have the outages one way or another. (All "same index used for reading + writing" approaches.)

I don't think either of these choices is very good for the particular site I have in mind (at least if I'm being optimistic enough about its chances of "taking off" to worry about the possibility of many thousands of submissions per day). Am I correct in my summary of the two choices with Ferret here, or have I missed something?

Anyhow, thanks again! If those two options are in fact what I have, I think I'll run some tests with Lucene/JRuby to see whether that provides a third option as far as performance goes, and report back what sort of issues come up.
(My guess is that it'll be moderately painful to set up and that the average throughput will be worse than Ferret's, but that an average submission-to-search-result turnaround time of a second or two will be achievable without the site necessarily going completely down for minutes every now and then. We'll see.)

-- Scott

On Nov 16, 2007 2:40 PM, Benjamin Krause <bk at benjaminkrause.com> wrote:
> we're using two directories for Ferret, not one. One
> index is the passive index: it is not used for searches,
> but new indexing requests will be added to it.
> [...]
Hmmm...I'd first heard of Solr only a couple of days ago, and I hadn't been aware of a Ruby API to it until you mentioned it. Interesting...thanks!

On Nov 16, 2007 1:13 PM, Erik Hatcher <erik at ehatchersolutions.com> wrote:
> Or, rather, an index server that runs Solr, accessed with the
> pure-Ruby solr-ruby API (which works with MRI or JRuby)? :)
> [...]
Hi!

On Fri, Nov 16, 2007 at 02:56:26AM -0800, Scott Davies wrote:

> I've been running some multithreaded tests on Ferret. Using a single
> Ferret::Index::Index inside a DRb server, it definitely behaves for me
> as if all readers are locked out of the index when writing is going on
> in that index, not just during optimization [...] This is very easy to
> notice when you add, say, your 100,000th document to the index, and
> that one write takes over 5 seconds to complete because it triggers a
> bunch of incremental segment-merging, and all queries to the index
> stall in the meantime. Or when you add your millionth document, which
> can stall all reads for over a minute. :-(

Don't get me wrong, but how often do you think you'll add your millionth document to the index? And even if you really do index a million documents per week -- I wouldn't exactly call it bad performance if one or two search requests *per week* take a minute to complete, while all others are completed in less than a second...

Having said that, the problem with blocking searches might be possible to solve by not using Ferret's Index class for searching/indexing, but using the lower level APIs (Searcher and IndexWriter) and doing manual synchronization (inside *one* process). I didn't feel the need to implement this for aaf (yet ;-), since I think it's already fast enough to not be the bottleneck in most real world usage scenarios (say -- typical Rails apps using aaf for full text search).

> When I try to use an IndexReader in a separate process, things are
> even worse. The IndexReader doesn't see any updates to the index
> since it was created.
> Not too surprising, but if I try creating a new
> IndexReader for every query, and have the Index in the other writing
> process turn on auto_flush, then the reading process crashes after a
> few (generally fewer than 100) queries, in one of at least two
> different ways selected apparently at random:

[..]

Stick to the one-process-per-index rule to be on the safe side.

> Given the combination of problems above, I'm at a loss to understand
> how to use Ferret on a live website that requires reasonably fast
> turnaround between a user submitting data and the user being able to
> search over that data, unless either (1) the site only gets a few
> thousand new index entries per day and the site can be taken down for
> a few minutes daily to optimize the index, or (2) it's OK for the
> entire site to periodically stall on all queries for seconds or even
> minutes whenever segment-merging happens to kick in.

I wouldn't set the limit at a few thousand new documents per day, and also optimizing daily is only useful if you're having lots of document deletions per day.

Cheers,
Jens

PS: If you happen to benchmark Solr against aaf's DRb server, be sure to let us know your findings :-)

--
Jens Krämer
http://www.jkraemer.net/ - Blog
http://www.omdb.org/ - The new free film database
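The manual synchronization Jens suggests can be sketched with plain-Ruby stand-ins (an Array instead of a real IndexWriter/Searcher pair; none of these names are from aaf or Ferret itself). The essential move is that queries read from a snapshot that only changes when the searcher is deliberately reopened, so they never wait on an in-progress write:

```ruby
# Single-process sketch: writes are serialized by a Mutex, and reads go
# against an immutable snapshot that is refreshed only on an explicit
# reopen, mirroring the "close writer, open fresh Searcher" cycle.
class SearchableIndex
  def initialize
    @docs = []        # what the IndexWriter would persist
    @snapshot = []    # what the current Searcher sees
    @lock = Mutex.new
  end

  # Writes are serialized; readers keep using the old snapshot.
  def add_document(doc)
    @lock.synchronize { @docs << doc }
  end

  # Equivalent of flushing the writer and opening a fresh Searcher:
  # only now do pending writes become visible to queries.
  def reopen_searcher!
    @lock.synchronize { @snapshot = @docs.dup }
  end

  # Queries never block on a long write; they read the snapshot.
  def search(term)
    @snapshot.select { |d| d.include?(term) }
  end
end
```

With the real classes, reopen_searcher! would correspond to flushing the IndexWriter and constructing a new Searcher over the store; the sketch only shows the locking discipline.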
On Nov 17, 2007, at 5:12 AM, Scott Davies wrote:

> Hmmm...I'd first heard of Solr only a couple of days ago, and I hadn't
> been aware of a Ruby API to it until you mentioned it.
> Interesting...thanks!

I've honestly given fairly little of my time to Ferret, though I have tinkered with it some and it is mighty fine! Believe you me, I don't want to steal any thunder from Ferret. And I've not compared/contrasted them much myself. Truth be told, I'm still a Java dude, and knowing that Lucene and Solr are in Java, excel at what they are designed to do, and already gulping the Apache cool-ade, I really dig Solr.

I've presented solr+ruby a couple of times now, once at RailsConf and then again a few weeks ago at rubyconf.

RailsConf: <http://www.ehatchersolutions.com/~erik/SolrOnRails.pdf>
rubyconf: <http://code4lib.org/files/solr-ruby.pdf>

acts_as_solr as it exists today is sub-optimal compared to acts_as_ferret. I'm quite admittedly not much into relational databases, so I have only tinkered in this area myself.

	Erik
Hi everyone!

This is a very interesting thread, because it raises the question as to whether Ferret is something you would want to use in a production environment -- or not.

I've been using Ferret in two applications and my experiences were quite disappointing. I chose Ferret because it's fast and it's got a Ruby API. Everything else about it is just annoying and potentially hazardous.

What worries me most is the fact that Ferret is effectively an abandoned project. The original author, who is the sole owner of the code, hasn't been posting to this list for about six months. He hasn't introduced any improvements in about the same period of time, and many bugs still remain unfixed. New bugs can't be submitted (let alone patches) because the project Trac is offline.

There is no other component in my applications which behaves as badly as Ferret. If you don't treat it _very_ carefully it will throw segfaults as if this was an established way of indicating an error condition.

The ActsAsFerret plugin _does_ treat Ferret quite carefully, and it's the only reason why many people are able to use Ferret at all. However, AAF is one approach, and for some applications it might not be the right one. Especially if you want to put multiple models in one index -- it's possible, but not really a flexible solution.

The most sensitive point of Ferret is concurrency, and many people actually use Ferret in distributed environments (which is usually a Rails app that scales across several machines). AAF introduces a DRb server to work around this problem, but with many concurrent read/write requests, performance quickly degrades.

With the advent of JRuby, a myriad of Java-based solutions is now accessible to Ruby developers, including many full-text indices. There are very mature solutions readily available for production use and many next-generation search engines currently in development.
For the next application that needs full text search, I'm most definitely not going to use Ferret. I agree with Erik and will give Solr a shot.

I would like to encourage everyone who is already using another full text index for Ruby/Rails to share his/her experiences on this list, because I have the feeling that many people would like to get rid of Ferret for exactly the same reasons I've pointed out above.

Andy

On 16.11.2007, at 22:13, Erik Hatcher wrote:
> Or, rather, an index server that runs Solr, accessed with the
> pure-Ruby solr-ruby API (which works with MRI or JRuby)? :)
> [...]
casey at nerdle.com
2007-Nov-18 15:24 UTC
[Ferret-talk] Multithreading / multiprocessing woes
Andy,

You asked about other full text indexes for Ruby/Rails. I am using both AAF/Ferret and Sphinx in my app.

I haven't had any problems with Ferret or acts_as_ferret so far. I am using the DRb server and it is being hit with 200-250,000 requests a day from dozens of clients (Mongrel instances). My index isn't huge -- it is about 600 MB.

I'm using Sphinx (http://www.sphinxsearch.com/) wherever I don't need realtime updates. A large portion of my site requires search indexes to be always up-to-date, but in many places I can live with an index that may be 5 minutes old. Sphinx trades realtime indexing for performance -- both search and indexing speed is blazingly fast. Sphinx comes with a server component that speaks a simple protocol, and there are several Rails plugins available.

Sphinx (and acts_as_sphinx or whatever plugin you choose) and acts_as_ferret are very different animals, but I'm very pleased with the combination.

Casey

On Sun, 18 Nov 2007, Andreas Korth wrote:
> This is a very interesting thread, because it raises the question as
> to whether Ferret is something you would want to use in a production
> environment -- or not.
> [...]
On Nov 18, 2007, at 7:05 AM, Andreas Korth wrote:

> What worries me most is the fact that Ferret is effectively an
> abandoned project. The original author, who is the sole owner of the
> code, hasn't been posting to this list for about six months. He hasn't
> introduced any improvements in about the same period of time and many
> bugs still remain unfixed.

I have a large fraction of the expertise needed to maintain the C part of the Ferret code base, FWIW. What I'm missing is significant Ruby expertise, which I wouldn't mind accumulating. :) If what's needed is C-level bug fixing, I can probably help out.

> New bugs can't be submitted (let alone
> patches) because the project Trac is offline.

I know it's been down before, but <http://ferret.davebalmain.com/trac> looks like it's up to me, now. Also, I see a commit from Dave bumping the version to 0.11.5 yesterday.

The C code base that I am currently working on, which has a foundation designed by Dave and me to be shared by multiple host languages, is going to wind up having Ruby bindings eventually. It will happen either as part of the Lucy project or independently. In the meantime, perhaps I can contribute to Ferret in a caretaker/troubleshooter role. Dave gave me commit access to the repository a while ago, and I just verified that I still have it.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
Hi!

On Sun, Nov 18, 2007 at 10:24:34AM -0500, casey at nerdle.com wrote:

> I haven't had any problems with Ferret or acts_as_ferret so far. I am
> using the DRb server and it is being hit with 200-250,000 requests a day
> from dozens of clients (Mongrel instances). My index isn't huge - it is
> about 600 MB.

Ah, glad to see somebody for whom everything just works standing up and telling the world :-)

On Sun, 18 Nov 2007, Andreas Korth wrote:

> What worries me most is the fact that Ferret is effectively an
> abandoned project. [...] New bugs can't be submitted (let alone
> patches) because the project Trac is offline.

Trac has been online again for days, and Ferret even got a new logo :-) I wouldn't call it abandoned, it's just stabilizing.

> There is no other component in my applications which behaves as badly
> as Ferret. If you don't treat it _very_ carefully it will throw
> segfaults as if this was an established way of indicating an error
> condition.
>
> The ActsAsFerret plugin _does_ treat Ferret quite carefully and it's
> the only reason why many people are able to use Ferret at all.
> However, AAF is one approach and for some applications it might not be
> the right one. Especially if you want to put multiple models in one
> index - it's possible, but not really a flexible solution.

Well, even if aaf doesn't fit your needs, you might at least have a look at it if you want to know how to treat your Ferret well :-) I admit it isn't always an easy library to deal with, but with a proper set of unit tests it's entirely possible and no headache at all. Imho.

> The most sensitive point of Ferret is concurrency and many people
> actually use Ferret in distributed environments (which is usually a
> Rails app that scales across several machines). AAF introduces a DRb
> server to work around this problem, but with many concurrent read/
> write requests, performance quickly degrades.

AAF's DRb server can handle some serious load as it is now, but for sure there's much room for improvement. However, I haven't received many complaints from people actually *having* this problem in real life applications yet. Most of the time this is brought up as some kind of 'what if' problem.

Somebody did a speed comparison of Solr and aaf/DRb a while back, where aaf was at least as fast as Solr, with its admittedly naive DRb server. I don't say this was a representative benchmark or anything, but it's the only numbers I know of... So please, from now on, anybody who feels like calling aaf's DRb server slow: *please* show us some numbers and the test process which led to them. Ideally you'd also show us the numbers of any solution you've found to be faster at solving the same problem. Thanks.

> With the advent of JRuby, a myriad of Java-based solutions is now
> accessible to Ruby developers, including many full-text indices. There
> are very mature solutions readily available for production use and
> many next-generation search engines currently in development.

For sure. I'm excited by these possibilities as well.

Cheers,
Jens

--
Jens Krämer
http://www.jkraemer.net/ - Blog
http://www.omdb.org/ - The new free film database
On 18.11.2007, at 18:51, Jens Kraemer wrote:

> Trac is online again for days, and Ferret even got a new logo :-) I
> wouldn't call it abandoned, it's just stabilizing.

Yes, I noticed that. I should have checked before posting. However, a project site that is frequently down for extended periods of time is not exactly building up trust :)

> AAF's DRb server can handle some serious load as it is now, but for
> sure there's much room for improvement. However I didn't receive many
> complaints from people actually *having* this problem in real life
> applications yet. Most of the time this is brought up as some kind of
> 'what if' problem.

My apologies for implying that AAF is part of the problem. It certainly isn't. I made the mistake of mixing up my concerns about Ferret with comments on AAF. What I actually meant to say is that AAF is one viable way to deal with some of Ferret's shortcomings. The fact that in the Rails community AAF is almost synonymous with Ferret speaks for your plugin, and I'm not in a position to question that.

> So please from now on, anybody feeling to blame aaf's DRb as slow,
> *please* show us some numbers and the test process which led to
> these numbers.

Again, I wasn't trying to blame AAF here. To be more precise: Ferret is pretty damn fast. The problem is its extremely sensitive API, which exposes problems from the C implementation to the Ruby developer. I don't know of any way to catch a segfault in Ruby, and even if I did, there's little I could do about it from Rubyland. Without transactional index updates, such behavior is intolerable, unless you can afford to rebuild your index several times a day.

This leaves us to build another Ruby API on top of Ferret's in order to compensate for these imperfections. I wrote a custom solution with a focus on reliability. But with all the infrastructure built around Ferret (DRb server, transactions, queuing), the overall indexing performance wasn't that great anymore: remote indexing with 10 concurrent clients was 8-9 times slower than local indexing. Maybe AAF is faster, but since the implementations are different, there's no point in comparing them directly.

Andy
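The queuing piece of such an infrastructure can be sketched with Ruby's built-in Queue. The names and the Array standing in for the index are illustrative, not from the actual custom solution: client threads return immediately, and a single background worker is the only thread that ever touches the writer.

```ruby
# Client threads enqueue documents; one worker thread drains the queue
# into the index, so writer access is never concurrent.
class QueuedIndexer
  def initialize(index)
    @index = index
    @queue = Queue.new
    @worker = Thread.new do
      while (doc = @queue.pop)   # nil is the shutdown sentinel
        @index << doc
      end
    end
  end

  def add(doc)
    @queue << doc                # returns immediately; indexing is async
  end

  def shutdown
    @queue << nil
    @worker.join
  end
end
```

The queue also gives a natural place to batch or pause indexing, at the cost of indexing becoming asynchronous.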
On 18.11.2007, at 16:24, casey at nerdle.com wrote:

> I'm using Sphinx (http://www.sphinxsearch.com/) wherever I don't need
> realtime updates. A large portion of my site requires search indexes to
> be always up-to-date, but in many places I can live with an index that
> may be 5 minutes old. Sphinx trades realtime indexing for performance -
> both search and indexing speed is blazingly fast. Sphinx comes with a
> server component that speaks a simple protocol and there are several
> rails plugins available.

Thanks, Casey. I'll take a look at Sphinx. Since I'm primarily concerned about index consistency and don't mind short delays either, it sounds like a pretty good alternative.

Cheers,
Andy
Great. For my own curiosity, and maybe people here share some of it:

Is it possible to write your own custom analyzers for Solr? If so, how easy is it? Can one do that in Ruby, or do I have to write it in Java?

I personally think that's one of the greatest things about Ferret. So far I haven't bothered looking into Sphinx or Solr precisely because, from a glance, I couldn't find a way to customize anything in detail like I can with Ferret. I assume there is a way...

Thing is, reading through the Ferret booklet (the one from O'Reilly), you get a glimpse of how easy it is to build custom solutions using it. So whereas it's kind of sad that the lead developer has been distant from the project in the last few months (?), I have to say, there's hardly anything that matches how easy it is to work with.

On Nov 18, 2007 8:29 PM, Erik Hatcher <erik at ehatchersolutions.com> wrote:

> On Nov 17, 2007, at 5:12 AM, Scott Davies wrote:
> > Hmmm...I'd first heard of Solr only a couple of days ago, and I hadn't
> > been aware of a Ruby API to it until you mentioned it.
> > Interesting...thanks!
>
> I've honestly given fairly little of my time to Ferret, though I have
> tinkered with it some and it is mighty fine!
>
> Believe you me, I don't want to steal any thunder from Ferret. And
> I've not compared/contrasted them much myself. Truth be told I'm
> still a Java dude, and knowing that Lucene and Solr are in Java, excel
> at what they are designed to do, and already gulping the Apache
> kool-aid, I really dig Solr.
>
> I've presented solr+ruby a couple of times now, once at RailsConf and
> then again a few weeks ago at rubyconf.
>
> RailsConf:
> <http://www.ehatchersolutions.com/~erik/SolrOnRails.pdf>
>
> rubyconf:
> <http://code4lib.org/files/solr-ruby.pdf>
>
> acts_as_solr as it exists today is sub-optimal compared to
> acts_as_ferret. I'm quite admittedly not much into relational
> databases, so I have only tinkered in this area myself.
> Erik
>
> _______________________________________________
> Ferret-talk mailing list
> Ferret-talk at rubyforge.org
> http://rubyforge.org/mailman/listinfo/ferret-talk
For the record, while Lucene is pretty well-behaved as far as I can tell, DRb running under JRuby is not. When hit with multiple request streams simultaneously, DRb under JRuby 1.0.2 very quickly falls over and stops responding to all queries. DRb under JRuby 1.1b1 *almost* works, but every now and then JRuby will freak out, and for a few requests things will fail in very strange ways. (Attempts to construct Java objects will fail with exceptions such as "undefined method `constructors' for nil:NilClass" or "undefined method `java_class' for Class:Class"; sometimes looking up a class will fail...)

On the plus side, I do get the impression that JRuby development is pretty active, and I see some concurrency bugs listed as high-priority for JRuby 1.1, some of which have already been patched in the trunk. My guess is that JRuby+Lucene+DRb will be a fine choice in a few months...it was actually pretty painless to set up, even with MRI Ruby RoR clients talking to a JRuby indexing server. (I have a simple metaprogramming hack that lets the client specify a sequence of code to execute on the server side, where the specification looks *almost* like normal Ruby code; this effectively lets me easily construct gnarly Lucene query trees in MRI Ruby clients that know nothing about Lucene or Java. I actually initially came up with this hack to work around Ferret's "query trees and filters don't marshal" issue.) JRuby's not ready for serious use in scenarios with concurrency just yet, though.

Meanwhile, I'm hoping to avoid Solr because it seems (1) kind of complicated for what I'd actually get out of it in my particular application, (2) not particularly well-documented given its size, and (3) likely to get in my way when I want to do anything low-level and gnarly with Lucene. I guess I'll continue limping along with Ferret for the moment and hope the concurrency issues get worked out soonish.
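Scott doesn't show his metaprogramming hack, but the general record-and-replay trick behind that kind of workaround can be sketched as follows. This is a hypothetical illustration, not his actual code: the client records chained method calls on a stand-in object, ships the (marshalable) recording across the wire, and the server replays the calls against the real, unmarshalable object (a Lucene query builder, in his case). Here a plain String stands in for the query object.

```ruby
# Hypothetical record-and-replay sketch: recordings marshal fine even when
# the objects they will eventually be applied to do not.
class Recorder
  attr_reader :calls

  def initialize
    @calls = []
  end

  # Capture any call as [name, args] and return self so calls chain,
  # making the client-side spec look *almost* like normal Ruby code.
  def method_missing(name, *args)
    @calls << [name, args]
    self
  end
end

# Server side: apply the recorded calls, in order, to the real object.
def replay(recording, target)
  recording.calls.inject(target) { |obj, (name, args)| obj.send(name, *args) }
end

# Client side: nothing real happens yet, we only record.
spec = Recorder.new.downcase.sub('world', 'ferret')

# The recording survives a Marshal round trip (as it would over DRb)...
spec = Marshal.load(Marshal.dump(spec))

# ...and the server replays it against the real object.
puts replay(spec, 'Hello WORLD')
```

The same shape works for query construction: the recording is just symbols and plain arguments, so it crosses the process boundary even though the finished query tree would not.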
Has anyone actually decided specifically to make Ferret bulletproof in the face of concurrency over the next few months, or is it probably just not going to happen? If it doesn't, I suspect Ferret will probably fall by the wayside as more Ruby people jump ship for Lucene-based solutions. Which would be a shame, because Ferret does hold a lot of promise...indexing is hard, and Ferret is *almost* a great solution. (Too bad the last 20% is usually 80% of the work...)

-- Scott

On Nov 18, 2007 4:45 PM, Julio Cesar Ody <julioody at gmail.com> wrote:

> Great. For my own curiosity, and maybe people here share some of it:
>
> Is it possible to write your own custom analyzers for Solr? If so, how
> easy is it? Can one do that in Ruby or do I have to write it in Java?
>
> I personally think that's one of the greatest things about Ferret. So
> far I haven't bothered looking into Sphinx or Solr precisely because,
> from a glance, I couldn't find a way to customize anything in detail
> like I can do with Ferret. I assume there is a way...
>
> Thing is, reading through the Ferret booklet (the one from O'Reilly),
> you get a glimpse of how easy it is to build custom solutions using
> it. So whereas it's kind of sad that the lead developer has been
> distant from the project in the last few months (?), I have to say,
> there's hardly anything that matches how easy it is to work with it.
>
> On Nov 18, 2007 8:29 PM, Erik Hatcher <erik at ehatchersolutions.com> wrote:
> >
> > On Nov 17, 2007, at 5:12 AM, Scott Davies wrote:
> > > Hmmm...I'd first heard of Solr only a couple of days ago, and I hadn't
> > > been aware of a Ruby API to it until you mentioned it.
> > > Interesting...thanks!
> >
> > I've honestly given fairly little of my time to Ferret, though I have
> > tinkered with it some and it is mighty fine!
> >
> > Believe you me, I don't want to steal any thunder from Ferret. And
> > I've not compared/contrasted them much myself.
> > Truth be told I'm still a Java dude, and knowing that Lucene and
> > Solr are in Java, excel at what they are designed to do, and already
> > gulping the Apache kool-aid, I really dig Solr.
> >
> > I've presented solr+ruby a couple of times now, once at RailsConf and
> > then again a few weeks ago at rubyconf.
> >
> > RailsConf:
> > <http://www.ehatchersolutions.com/~erik/SolrOnRails.pdf>
> >
> > rubyconf:
> > <http://code4lib.org/files/solr-ruby.pdf>
> >
> > acts_as_solr as it exists today is sub-optimal compared to
> > acts_as_ferret. I'm quite admittedly not much into relational
> > databases, so I have only tinkered in this area myself.
> >
> > Erik
On Nov 21, 2007, at 2:53 PM, Scott Davies wrote:

> My guess is that JRuby+Lucene+DRb will be a fine choice in a few
> months...

Definitely not a bad choice. However, I still implore you to give Solr another chance. More on that...

> Meanwhile, I'm hoping to avoid Solr because it seems (1) kind of
> complicated for what I'd actually get out of it in my particular
> application

How so? It's a "search server" with the same goals that I imagine you'd have for the JRuby+Lucene+DRb combination. It's not really complicated, especially with the solr-ruby library. Add documents, delete them, query for them. Leverage highlighting and more-like-this features, dismax querying, etc.

> , (2) not particularly well-documented given its size

Wow. Have you seen the Solr wiki? http://wiki.apache.org/solr - there are nooks and crannies documented on that wiki that go well beyond what I'd consider good documentation. By all means point me to areas that aren't documented that you need to know about (off list) and I'll get those taken care of.

> (3) likely to get in my way when I want to do anything low-level and
> gnarly with Lucene.

Maybe, but not much in your way. You'd have to wrap your low-level mojo inside some Solr API perhaps, but not even that if we're just talking about custom analyzers or a similarity implementation.

> Which would be a shame, because Ferret does hold a lot of
> promise...

Hear, hear! I definitely extend major kudos to Dave and the other Ferret contributors. Great stuff.

Erik
On Nov 21, 2007 12:24 PM, Erik Hatcher <erik at ehatchersolutions.com> wrote:

> How so? It's a "search server" with the same goals that I imagine
> you'd have for the JRuby+Lucene+DRb combination.

It's a bit more than I need right out of the gate, what with the caching, replication, faceted search, etc. Of course, that might not be a problem if it uses sensible configuration defaults I can safely ignore to start with.

> It's not really complicated, especially with the solr-ruby library.
> Add documents, delete them, query for them. Leverage highlighting
> and more-like-this features, dismax querying, etc.

My particular application does enough weird things that, for the most part, I'd prefer unfettered access to the low-level Lucene APIs. (For example, my application uses a lot of gnarly query trees involving filters and ranges, and I'm not sure whether those are easily transmitted through the Solr APIs. Then I have "run all of these queries against each of the documents in this specific set and tell me which document/query pairs match in one fell swoop" routines, in which case it might be a good idea to copy the documents into a temporary RAM index to run the queries against.)

> > , (2) not particularly well-documented given its size
>
> Wow. Have you seen the Solr wiki? http://wiki.apache.org/solr -
> there are nooks and crannies documented on that wiki that go well
> beyond what I'd consider good documentation.
>
> By all means point me to areas that aren't documented that you need
> to know (off list) and I'll get those taken care of.

Wikis are fine for looking up details when you already mostly know what you're doing, but they're not nearly as useful when you're in the earlier stages, trying to get the big "What does this system look like and how does it work?" picture and evaluate initial plans of attack. Ferret and Lucene both have entire *books* written about them that are *excellent* for those purposes.
(They're not free-as-in-beer, but they are well worth the cost.) By comparison, Solr has a very simple "here is how you get a straightforward app off the ground" tutorial that says little about how Solr is actually organized, and then you're basically left staring at a wiki page with a thousand bullet points and no clear path to big-picture enlightenment. And given the choice between (1) using a lower-level system that's been very well documented in a well-organized, explanatory fashion and (2) using a slightly higher-level system I still haven't acquired a mental "big picture" for, I generally find (1) more productive. This isn't a criticism of Solr's documentation nearly as much as a hearty "Book-style documentation is useful, and, holy crap, Ferret and Lucene actually HAVE IT. Woohoo!", plus an added bonus testament to my own laziness.

> > (3) likely to get in my way when I want to do anything low-level and
> > gnarly with Lucene.
>
> Maybe, but not much in your way. You'd have to wrap your low-level
> mojo inside some Solr API perhaps, but not even that if we're just
> talking about custom analyzers or a similarity implementation.

Yeah, my guess is that if I sit down and figure out how Solr is laid out, adding APIs to do what I want won't be too hard. Might still be kind of tedious implementing all the necessary marshaling, though.

-- Scott
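The "which document/query pairs match in one fell swoop" routine Scott mentions can be sketched in plain Ruby. This is a hypothetical stand-in, not his code: regexps play the role of real Lucene/Ferret queries, and a Hash plays the role of the temporary RAM index the documents would be copied into.

```ruby
# Hypothetical batch-matching sketch: copy a small, fixed set of documents
# into an in-memory structure, then run every query against every document
# and report the matching (doc, query) pairs in one pass.
def match_pairs(docs, queries)
  pairs = []
  docs.each do |doc_id, text|
    queries.each do |query_id, pattern|
      pairs << [doc_id, query_id] if text =~ pattern
    end
  end
  pairs
end

docs    = { 1 => 'ferret index merging', 2 => 'lucene ram directory' }
queries = { :a => /ferret/, :b => /ram\b/ }
puts match_pairs(docs, queries).inspect
```

With a real engine the inner loop would become a search over a RAM-resident index (e.g. Lucene's in-memory directory), but the batching shape, all queries against a fixed small document set, is the same.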