I've been running some multithreaded tests on Ferret. Using a single Ferret::Index::Index inside a DRb server, it definitely behaves for me as if all readers are locked out of the index when writing is going on in that index, not just during optimization -- at least when segment merging happens, which is when the writes take the longest and you can therefore least afford to lock out all reads.

This is very easy to notice when you add, say, your 100,000th document to the index, and that one write takes over 5 seconds to complete because it triggers a bunch of incremental segment-merging, and all queries to the index stall in the meantime. Or when you add your millionth document, which can stall all reads for over a minute. :-(

When I try to use an IndexReader in a separate process, things are even worse. The IndexReader doesn't see any updates to the index since it was created. Not too surprising, but if I try creating a new IndexReader for every query, and have the Index in the other writing process turn on auto_flush, then the reading process crashes after a few (generally fewer than 100) queries, in one of at least two different ways selected apparently at random:

Failure Mode #1:

    script/ferret_speedtest2_reader:30:in `initialize': IO Error occured at <except.c>:93 in xraise (IOError)
      Error occured in index.c:901 - sis_find_segments_file
      Error reading the segment infos. Store listing was


            from script/ferret_speedtest2_reader:30:in `new'
            from script/ferret_speedtest2_reader:30:in `run_test_query'

[Yes, there really are two blank lines after "Store listing was".]
Failure Mode #2:

    script/ferret_speedtest2_reader:30:in `initialize': IO Error occured at <except.c>:93 in xraise (IOError)
      Error occured in fs_store.c:127 - fs_each
      doing 'each' in /Users/scott/dev/ruby/timetracker/tmp/ferret_speedtest_index: <Too many open files>
            from script/ferret_speedtest2_reader:30:in `new'
            from script/ferret_speedtest2_reader:30:in `run_test_query'

Meanwhile, if I try eliminating this second failure mode by explicitly calling close on the IndexReader before I throw it away, the close immediately crashes with:

    script/ferret_speedtest2_reader:45: [BUG] Bus Error
    ruby 1.8.6 (2007-03-13) [i686-darwin8.8.5]

    Abort trap

Given the combination of problems above, I'm at a loss to understand how to use Ferret on a live website that requires reasonably fast turnaround between a user submitting data and the user being able to search over that data, unless either (1) the site only gets a few thousand new index entries per day and the site can be taken down for a few minutes daily to optimize the index, or (2) it's OK for the entire site to periodically stall on all queries for seconds or even minutes whenever segment-merging happens to kick in.

Do all Ferret users just suck it up and live with one of these limitations, or am I missing something and/or just getting "lucky" with the errors above?

For reference, the system being used here is a Mac running Leopard, although I doubt that matters...
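For context, the single-process setup described above boils down to something like the following sketch. An Array stands in for Ferret::Index::Index so this runs without the ferret gem, and the names (IndexFront, add_document, search) are illustrative rather than taken from the actual test scripts. Routing both writes and searches through one lock models the observed behavior: a slow, merge-triggering write stalls every query queued behind it.

```ruby
require 'drb/drb'

# Sketch of one DRb front object owning the index. The Array is a
# stand-in for Ferret::Index::Index; the single Mutex models Ferret's
# apparent internal behavior, where readers wait on the same lock that
# a long segment-merging write is holding.
class IndexFront
  def initialize
    @docs = []            # stand-in for the real index
    @lock = Mutex.new
  end

  def add_document(doc)
    @lock.synchronize { @docs << doc }
  end

  def search(term)
    @lock.synchronize { @docs.select { |d| d.include?(term) } }
  end
end
```

A server process would publish it with `DRb.start_service('druby://localhost:9999', IndexFront.new)` followed by `DRb.thread.join`, and each client would talk to it through `DRbObject.new_with_uri`.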
Scott,

> Do all Ferret users just suck it up and live with one of these
> limitations, or am I missing something and/or just getting "lucky"
> with the errors above?

These limitations you're talking about are known and will be fixed in the near future. The trick is to have one read-only and one write-only index, and this is currently being worked on. If you need a fix right now, you need to do it yourself, but you can take a look at omdb's code and how it's done there:

http://bugs.omdb.org/browser/branches/2007.1/lib/omdb/ferret/lib/util.rb
(see the switch code)

If you don't need a fix right now, I'm sure AAF will come up with a solution for that in the near future (aka probably not this year).

On a side note, for the "too many open files" error, see:

http://ferret.davebalmain.com/api/classes/Ferret/Index/IndexWriter.html

(use_compound_file -- you may have set this to false) or simply increase the number of open files. On omdb we're running with 32k :-)

    rails at homer.omdb.org ~ $ ulimit -n
    32768

Cheers
Ben
Hi Ben --

Thanks much for the quick and helpful reply! Unfortunately, the solution you're using on omdb looks suspect to me, for the same reason that Alex Neth brought up a few days ago on this list: to my knowledge there's no guarantee that rsync will produce a coherent snapshot of the source directory as it was at any one particular instant in time. In fact, I don't see how rsync could both always terminate in finite time and provide such a guarantee, except on exotic filesystems that provide, say, atomic snapshots with copy-on-write capabilities. (Sigh...sometimes I miss the Google File System.)

In which case you'd have to disable your site during the rsync in order to prevent corruption, which basically boils down to the "must take site offline daily for a few minutes to deal with this problem" limitation. I'm guessing the rsync is faster than an index optimization, so I guess this might at least cut down on the amount of time the site has to be down, but still...wah.

Am I a fool for wondering whether it might ultimately be less painful to try an index server that runs Lucene under a JRuby process?

On Nov 16, 2007 4:12 AM, Benjamin Krause <bk at benjaminkrause.com> wrote:
> These limitations you're talking about are known and will be fixed
> in the near future. The trick is to have one read-only and one
> write-only index, and this is currently being worked on.
> [...]

_______________________________________________
Ferret-talk mailing list
Ferret-talk at rubyforge.org
http://rubyforge.org/mailman/listinfo/ferret-talk
On Nov 16, 2007, at 3:35 PM, Scott Davies wrote:

> Am I a fool for wondering whether it might ultimately be less painful
> to try an index server that runs Lucene under a JRuby process?

Or, rather, an index server that runs Solr, accessed with the pure-Ruby solr-ruby API (which works with MRI or JRuby)? :)

	Erik
Scott,

we're using two directories for Ferret, not one. One index is the passive index: it is not used for searches, but new indexing requests will be added to it. So let's call it the indexing-index.

All mongrels will use the second directory; let's call it the searching-index. Both indexes are almost identical -- I'll explain the differences.

All our indexing requests are queued. So whenever you want to index something, it will be placed in the queue, and added to the indexing-index. After a certain number of queue items have been added to the index, we stop indexing. The queue will be halted: new requests can be added, but nothing will be added to the indexing-index.

Now we rsync the indexing-index to all machines -- remember, searching is still done in the searching-index, which is outdated, but we don't mind about that :)

After the rsync is complete, we switch both directories, so the indexing-index becomes the searching-index and vice versa. Actually we're just switching symlinks, so this will take almost no time. And even if one of the mongrels still has a filehandle to the old index open, nothing will happen; it is still using the outdated index, but the next request will use the new index. After that, the new indexing-index will be synced from the searching-index. As the searching-index is read-only, there is no risk of corrupting something during the sync.

Now we resume processing the queue, until we've added our certain amount of queue entries, or the queue is empty.

The downside is that the searching-index is outdated, but by no more than a couple of minutes (about 2 minutes on omdb). We haven't had one corrupted index since. There is no downtime whatsoever, and the rsync snapshot will always be coherent.

Cheers
Ben
Ben --

Thanks for the detailed explanation! Yes, that does make sense. If I understand it correctly, though, something won't show up in a search until at least one index switch happens after it's been submitted, which means we're talking about a minute or so on average (not just worst-case) from submission to search result, even if the switches are being done constantly (given that each switch takes about two minutes).

For my site, I'm really hoping that most content will show up within a second or so of its submission. That simply can't happen if I'm not updating the same index I'm doing searches with. I'd be OK with the turnaround *occasionally* being a minute -- say, while an index optimization or particularly large segment merge happens. But so far it looks to me like the choices with Ferret are either:

(1) The *average* time from submission to search result is on the order of minutes. However, searches are always reasonably fast. (Your approach.)

(2) The average time from submission to search result is less than a second. However, the *worst-case* times can be minutes, and now all *searches* stall over those minutes as well, which is Bad. If you don't get more than a few thousand submissions per day, you can at least schedule these outages as nightly index optimizations, but you'll have the outages one way or another. (All "same index used for reading + writing" approaches.)

I don't think either of these choices is very good for the particular site I have in mind (at least if I'm being optimistic enough about its chances of "taking off" to worry about the possibility of many thousands of submissions per day). Am I correct in my summary of the two choices with Ferret here, or have I missed something?

Anyhow, thanks again! If those two options are in fact what I have, I think I'll run some tests with Lucene/JRuby to see whether that provides a third option as far as performance goes, and report back what sort of issues come up.
(My guess is that it'll be moderately painful to set up and that the average throughput will be worse than Ferret's, but that an average submission-to-search-result turnaround time of a second or two will be achievable without the site necessarily going completely down for minutes every now and then. We'll see.)

-- Scott

On Nov 16, 2007 2:40 PM, Benjamin Krause <bk at benjaminkrause.com> wrote:
> we're using two directories for Ferret, not one. One
> index is the passive index: it is not used for searches,
> but new indexing requests will be added to it.
> [...]
Hmmm...I'd first heard of Solr only a couple of days ago, and I hadn't been aware of a Ruby API to it until you mentioned it. Interesting...thanks!

On Nov 16, 2007 1:13 PM, Erik Hatcher <erik at ehatchersolutions.com> wrote:
> Or, rather, an index server that runs Solr, accessed with the
> pure-Ruby solr-ruby API (which works with MRI or JRuby)? :)
> [...]
Hi!

On Fri, Nov 16, 2007 at 02:56:26AM -0800, Scott Davies wrote:

> I've been running some multithreaded tests on Ferret. Using a single
> Ferret::Index::Index inside a DRb server, it definitely behaves for me
> as if all readers are locked out of the index when writing is going on
> in that index, not just during optimization [...] This is very easy to
> notice when you add, say, your 100,000th document to the index, and
> that one write takes over 5 seconds to complete because it triggers a
> bunch of incremental segment-merging, and all queries to the index
> stall in the meantime. Or when you add your millionth document, which
> can stall all reads for over a minute. :-(

Don't get me wrong, but how often do you think you'll add your millionth document to the index? And even if you really do index a million documents per week -- I wouldn't exactly call it bad performance if one or two search requests *per week* take a minute to complete, while all others are completed in less than a second...

Having said that, the problem with blocking searches might be possible to solve by not using Ferret's Index class for searching/indexing, but using the lower level APIs (Searcher and IndexWriter) and doing manual synchronization (inside *one* process). I didn't feel the need to implement this for aaf (yet ;-), since I think it's already fast enough to not be the bottleneck in most real world usage scenarios (say -- typical Rails apps using aaf for full text search).

> When I try to use an IndexReader in a separate process, things are
> even worse. The IndexReader doesn't see any updates to the index
> since it was created.
> Not too surprising, but if I try creating a new
> IndexReader for every query, and have the Index in the other writing
> process turn on auto_flush, then the reading process crashes after a
> few (generally fewer than 100) queries, in one of at least two
> different ways selected apparently at random:

[..]

Stick to the one-process-per-index rule to be on the safe side.

> Given the combination of problems above, I'm at a loss to understand
> how to use Ferret on a live website that requires reasonably fast
> turnaround between a user submitting data and the user being able to
> search over that data, unless either (1) the site only gets a few
> thousand new index entries per day and the site can be taken down for
> a few minutes daily to optimize the index, or (2) it's OK for the
> entire site to periodically stall on all queries for seconds or even
> minutes whenever segment-merging happens to kick in.

I wouldn't set the limit at a few thousand new documents per day, and also optimizing daily is only useful if you're having lots of document deletions per day.

Cheers,
Jens

PS: If you happen to benchmark Solr against aaf's DRb server, be sure to let us know your findings :-)

--
Jens Krämer
http://www.jkraemer.net/ - Blog
http://www.omdb.org/ - The new free film database
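The manual synchronization Jens suggests can be sketched with plain-Ruby stand-ins (an Array instead of a real IndexWriter/Searcher pair; none of these names are from aaf or Ferret itself). The essential move is that queries read from a snapshot that only changes when the searcher is deliberately reopened, so they never wait on an in-progress write:

```ruby
# Single-process sketch: writes are serialized by a Mutex, and reads go
# against an immutable snapshot that is refreshed only on an explicit
# reopen, mirroring the "close writer, open fresh Searcher" cycle.
class SearchableIndex
  def initialize
    @docs = []        # what the IndexWriter would persist
    @snapshot = []    # what the current Searcher sees
    @lock = Mutex.new
  end

  # Writes are serialized; readers keep using the old snapshot.
  def add_document(doc)
    @lock.synchronize { @docs << doc }
  end

  # Equivalent of flushing the writer and opening a fresh Searcher:
  # only now do pending writes become visible to queries.
  def reopen_searcher!
    @lock.synchronize { @snapshot = @docs.dup }
  end

  # Queries never block on a long write; they read the snapshot.
  def search(term)
    @snapshot.select { |d| d.include?(term) }
  end
end
```

With the real classes, reopen_searcher! would correspond to flushing the IndexWriter and constructing a new Searcher over the store; the sketch only shows the locking discipline.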
On Nov 17, 2007, at 5:12 AM, Scott Davies wrote:

> Hmmm...I'd first heard of Solr only a couple of days ago, and I hadn't
> been aware of a Ruby API to it until you mentioned it.
> Interesting...thanks!

I've honestly given fairly little of my time to Ferret, though I have tinkered with it some and it is mighty fine! Believe you me, I don't want to steal any thunder from Ferret. And I've not compared/contrasted them much myself. Truth be told, I'm still a Java dude, and knowing that Lucene and Solr are in Java, excel at what they are designed to do, and already gulping the Apache cool-ade, I really dig Solr.

I've presented solr+ruby a couple of times now, once at RailsConf and then again a few weeks ago at rubyconf.

RailsConf: <http://www.ehatchersolutions.com/~erik/SolrOnRails.pdf>
rubyconf: <http://code4lib.org/files/solr-ruby.pdf>

acts_as_solr as it exists today is sub-optimal compared to acts_as_ferret. I'm quite admittedly not much into relational databases, so I have only tinkered in this area myself.

	Erik
Hi everyone!

This is a very interesting thread, because it raises the question as to whether Ferret is something you would want to use in a production environment -- or not.

I've been using Ferret in two applications and my experiences were quite disappointing. I chose Ferret because it's fast and it's got a Ruby API. Everything else about it is just annoying and potentially hazardous.

What worries me most is the fact that Ferret is effectively an abandoned project. The original author, who is the sole owner of the code, hasn't been posting to this list for about six months. He hasn't introduced any improvements in about the same period of time, and many bugs still remain unfixed. New bugs can't be submitted (let alone patches) because the project Trac is offline.

There is no other component in my applications which behaves as badly as Ferret. If you don't treat it _very_ carefully it will throw segfaults as if this was an established way of indicating an error condition.

The ActsAsFerret plugin _does_ treat Ferret quite carefully, and it's the only reason why many people are able to use Ferret at all. However, AAF is one approach, and for some applications it might not be the right one. Especially if you want to put multiple models in one index -- it's possible, but not really a flexible solution.

The most sensitive point of Ferret is concurrency, and many people actually use Ferret in distributed environments (which is usually a Rails app that scales across several machines). AAF introduces a DRb server to work around this problem, but with many concurrent read/write requests, performance quickly degrades.

With the advent of JRuby, a myriad of Java-based solutions is now accessible to Ruby developers, including many full-text indices. There are very mature solutions readily available for production use and many next-generation search engines currently in development.
For the next application that needs full text search, I'm most definitely not going to use Ferret. I agree with Erik and will give Solr a shot.

I would like to encourage everyone who is already using another full text index for Ruby/Rails to share his/her experiences on this list, because I have the feeling that many people would like to get rid of Ferret for exactly the same reasons I've pointed out above.

Andy

On 16.11.2007, at 22:13, Erik Hatcher wrote:
> Or, rather, an index server that runs Solr, accessed with the
> pure-Ruby solr-ruby API (which works with MRI or JRuby)? :)
> [...]
casey at nerdle.com
2007-Nov-18 15:24 UTC
[Ferret-talk] Multithreading / multiprocessing woes
Andy,

You asked about other full text indexes for Ruby/Rails. I am using both AAF/Ferret and Sphinx in my app.

I haven't had any problems with Ferret or acts_as_ferret so far. I am using the DRb server and it is being hit with 200-250,000 requests a day from dozens of clients (Mongrel instances). My index isn't huge -- it is about 600 MB.

I'm using Sphinx (http://www.sphinxsearch.com/) wherever I don't need realtime updates. A large portion of my site requires search indexes to be always up-to-date, but in many places I can live with an index that may be 5 minutes old. Sphinx trades realtime indexing for performance -- both search and indexing speed is blazingly fast. Sphinx comes with a server component that speaks a simple protocol, and there are several Rails plugins available.

Sphinx (and acts_as_sphinx or whatever plugin you choose) and acts_as_ferret are very different animals, but I'm very pleased with the combination.

Casey

On Sun, 18 Nov 2007, Andreas Korth wrote:
> This is a very interesting thread, because it raises the question as
> to whether Ferret is something you would want to use in a production
> environment -- or not.
> [...]
On Nov 18, 2007, at 7:05 AM, Andreas Korth wrote:

> What worries me most is the fact that Ferret is effectively an
> abandoned project. The original author, who is the sole owner of the
> code, hasn't been posting to this list for about six months. He hasn't
> introduced any improvements in about the same period of time and many
> bugs still remain unfixed.

I have a large fraction of the expertise needed to maintain the C part of the Ferret code base, FWIW. What I'm missing is significant Ruby expertise, which I wouldn't mind accumulating. :) If what's needed is C-level bug fixing, I can probably help out.

> New bugs can't be submitted (let alone
> patches) because the project Trac is offline.

I know it's been down before, but <http://ferret.davebalmain.com/trac> looks like it's up to me, now. Also, I see a commit from Dave bumping the version to 0.11.5 yesterday.

The C code base that I am currently working on, which has a foundation designed by Dave and me to be shared by multiple host languages, is going to wind up having Ruby bindings eventually. It will happen either as part of the Lucy project or independently. In the meantime, perhaps I can contribute to Ferret in a caretaker/troubleshooter role. Dave gave me commit access to the repository a while ago, and I just verified that I still have it.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
Hi!

On Sun, Nov 18, 2007 at 10:24:34AM -0500, casey at nerdle.com wrote:

> I haven't had any problems with Ferret or acts_as_ferret so far. I am
> using the DRb server and it is being hit with 200-250,000 requests a day
> from dozens of clients (Mongrel instances). My index isn't huge - it is
> about 600 MB.

Ah, glad to see somebody for whom everything just works standing up and telling the world :-)

On Sun, 18 Nov 2007, Andreas Korth wrote:

> What worries me most is the fact that Ferret is effectively an
> abandoned project. [...] New bugs can't be submitted (let alone
> patches) because the project Trac is offline.

Trac has been online again for days, and Ferret even got a new logo :-) I wouldn't call it abandoned, it's just stabilizing.

> There is no other component in my applications which behaves as badly
> as Ferret. If you don't treat it _very_ carefully it will throw
> segfaults as if this was an established way of indicating an error
> condition.
>
> The ActsAsFerret plugin _does_ treat Ferret quite carefully and it's
> the only reason why many people are able to use Ferret at all.
> However, AAF is one approach and for some applications it might not be
> the right one. Especially if you want to put multiple models in one
> index - it's possible, but not really a flexible solution.

Well, even if aaf doesn't fit your needs, you might at least have a look at it if you want to know how to treat your Ferret well :-) I admit it isn't always an easy library to deal with, but with a proper set of unit tests it's entirely possible and no headache at all. Imho.

> The most sensitive point of Ferret is concurrency and many people
> actually use Ferret in distributed environments (which is usually a
> Rails app that scales across several machines). AAF introduces a DRb
> server to work around this problem, but with many concurrent read/
> write requests, performance quickly degrades.

AAF's DRb server can handle some serious load as it is now, but for sure there's much room for improvement. However, I haven't received many complaints from people actually *having* this problem in real life applications yet. Most of the time this is brought up as some kind of 'what if' problem.

Somebody did a speed comparison of Solr and aaf/DRb a while back, where aaf was at least as fast as Solr, with its admittedly naive DRb server. I don't say this was a representative benchmark or anything, but it's the only numbers I know of... So please, from now on, anybody who feels like calling aaf's DRb server slow: *please* show us some numbers and the test process which led to them. Ideally you'd also show us the numbers of any solution you've found to be faster at solving the same problem. Thanks.

> With the advent of JRuby, a myriad of Java-based solutions is now
> accessible to Ruby developers, including many full-text indices. There
> are very mature solutions readily available for production use and
> many next-generation search engines currently in development.

For sure. I'm excited by these possibilities as well.

Cheers,
Jens

--
Jens Krämer
http://www.jkraemer.net/ - Blog
http://www.omdb.org/ - The new free film database
On 18.11.2007, at 18:51, Jens Kraemer wrote:

> Trac is online again for days, and Ferret even got a new logo :-) I
> wouldn't call it abandoned, it's just stabilizing.

Yes, I noticed that. I should have checked before posting. However, a project site that is frequently down for extended periods of time is not exactly building up trust :)

> AAF's DRb server can handle some serious load as it is now, but for
> sure there's much room for improvement. However I didn't receive many
> complaints from people actually *having* this problem in real life
> applications yet. Most of the time this is brought up as some kind of
> 'what if' problem.

My apologies for implying that AAF is part of the problem. It certainly isn't. I made the mistake of mixing up my concerns about Ferret with comments on AAF. What I actually meant to say is that AAF is one viable way to deal with some of Ferret's shortcomings. The fact that in the Rails community AAF is almost synonymous with Ferret speaks for your plugin, and I'm not in a position to question that.

> So please from now on, anybody feeling to blame aaf's DRb as slow,
> *please* show us some numbers and the test process which led to
> these numbers.

Again, I wasn't trying to blame AAF here. To be more precise: Ferret is pretty damn fast. The problem is its extremely sensitive API, which exposes problems from the C implementation to the Ruby developer. I don't know of any way to catch a segfault in Ruby, and even if I did, there's little I could do about it from Rubyland. Without transactional index updates, such behavior is intolerable, unless you can afford to rebuild your index several times a day.

This leaves us to build another Ruby API on top of Ferret's in order to compensate for these imperfections. I wrote a custom solution with a focus on reliability. But with all the infrastructure built around Ferret (DRb server, transactions, queuing), the overall indexing performance wasn't that great anymore: remote indexing with 10 concurrent clients was 8-9 times slower than local indexing. Maybe AAF is faster, but since the implementations are different, there's no point in comparing them directly.

Andy
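The queuing piece of such an infrastructure can be sketched with Ruby's built-in Queue. The names and the Array standing in for the index are illustrative, not from the actual custom solution: client threads return immediately, and a single background worker is the only thread that ever touches the writer.

```ruby
# Client threads enqueue documents; one worker thread drains the queue
# into the index, so writer access is never concurrent.
class QueuedIndexer
  def initialize(index)
    @index = index
    @queue = Queue.new
    @worker = Thread.new do
      while (doc = @queue.pop)   # nil is the shutdown sentinel
        @index << doc
      end
    end
  end

  def add(doc)
    @queue << doc                # returns immediately; indexing is async
  end

  def shutdown
    @queue << nil
    @worker.join
  end
end
```

The queue also gives a natural place to batch or pause indexing, at the cost of indexing becoming asynchronous.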
On 18.11.2007, at 16:24, casey at nerdle.com wrote:

> I'm using Sphinx (http://www.sphinxsearch.com/) wherever I don't need
> realtime updates. A large portion of my site requires search indexes to
> be always up-to-date, but in many places I can live with an index that
> may be 5 minutes old. Sphinx trades realtime indexing for performance -
> both search and indexing speed is blazingly fast. Sphinx comes with a
> server component that speaks a simple protocol and there are several
> rails plugins available.

Thanks, Casey. I'll take a look at Sphinx. Since I'm primarily concerned about index consistency and don't mind short delays either, it sounds like a pretty good alternative.

Cheers,
Andy
Great. For my own curiosity, and maybe people here share some of it:

Is it possible to write your own custom analyzers for Solr? If so, how easy is it? Can one do that in Ruby, or do I have to write it in Java?

I personally think that's one of the greatest things about Ferret. So far I haven't bothered looking into Sphinx or Solr precisely because, from a glance, I couldn't find a way to customize anything in detail like I can with Ferret. I assume there is a way...

Thing is, reading through the Ferret booklet (the one from O'Reilly), you get a glimpse of how easy it is to build custom solutions using it. So whereas it's kind of sad that the lead developer has been distant from the project in the last few months (?), I have to say, there's hardly anything that matches how easy it is to work with.

On Nov 18, 2007 8:29 PM, Erik Hatcher <erik at ehatchersolutions.com> wrote:

> On Nov 17, 2007, at 5:12 AM, Scott Davies wrote:
> > Hmmm...I'd first heard of Solr only a couple of days ago, and I hadn't
> > been aware of a Ruby API to it until you mentioned it.
> > Interesting...thanks!
>
> I've honestly given fairly little of my time to Ferret, though I have
> tinkered with it some and it is mighty fine!
>
> Believe you me, I don't want to steal any thunder from Ferret. And
> I've not compared/contrasted them much myself. Truth be told I'm
> still a Java dude, and knowing that Lucene and Solr are in Java, excel
> at what they are designed to do, and already gulping the Apache
> kool-aid, I really dig Solr.
>
> I've presented solr+ruby a couple of times now, once at RailsConf and
> then again a few weeks ago at rubyconf.
>
> RailsConf:
> <http://www.ehatchersolutions.com/~erik/SolrOnRails.pdf>
>
> rubyconf:
> <http://code4lib.org/files/solr-ruby.pdf>
>
> acts_as_solr as it exists today is sub-optimal compared to
> acts_as_ferret. I'm quite admittedly not much into relational
> databases, so I have only tinkered in this area myself.
> Erik
>
> _______________________________________________
> Ferret-talk mailing list
> Ferret-talk at rubyforge.org
> http://rubyforge.org/mailman/listinfo/ferret-talk
For the record, while Lucene is pretty well-behaved as far as I can tell, DRb running under JRuby is not. When hit with multiple request streams simultaneously, DRb under JRuby 1.0.2 very quickly falls over and stops responding to all queries. DRb under JRuby 1.1b1 *almost* works, but every now and then JRuby will freak out, and for a few requests things will fail in very strange ways. (Attempts to construct Java objects will fail with exceptions such as "undefined method `constructors' for nil:NilClass" or "undefined method `java_class' for Class:Class"; sometimes looking up a class will fail...)

On the plus side, I do get the impression that JRuby development is pretty active, and I see some concurrency bugs listed as high-priority for JRuby 1.1, some of which have already been patched in the trunk. My guess is that JRuby+Lucene+DRb will be a fine choice in a few months...it was actually pretty painless to set up, even with MRI Ruby RoR clients talking to a JRuby indexing server. (I have a simple metaprogramming hack that lets the client specify a sequence of code to execute on the server side, where the specification looks *almost* like normal Ruby code; this effectively lets me easily construct gnarly Lucene query trees in MRI Ruby clients that know nothing about Lucene or Java. I actually initially came up with this hack to work around Ferret's "query trees and filters don't marshal" issue.) JRuby's not ready for serious use in scenarios with concurrency just yet, though.

Meanwhile, I'm hoping to avoid Solr because it seems (1) kind of complicated for what I'd actually get out of it in my particular application, (2) not particularly well-documented given its size, and (3) likely to get in my way when I want to do anything low-level and gnarly with Lucene. I guess I'll continue limping along with Ferret for the moment and hope the concurrency issues get worked out soonish.
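Scott doesn't show his metaprogramming hack, but the general record-and-replay trick behind that kind of workaround can be sketched as follows. This is a hypothetical illustration, not his actual code: the client records chained method calls on a stand-in object, ships the (marshalable) recording across the wire, and the server replays the calls against the real, unmarshalable object (a Lucene query builder, in his case). Here a plain String stands in for the query object.

```ruby
# Hypothetical record-and-replay sketch: recordings marshal fine even when
# the objects they will eventually be applied to do not.
class Recorder
  attr_reader :calls

  def initialize
    @calls = []
  end

  # Capture any call as [name, args] and return self so calls chain,
  # making the client-side spec look *almost* like normal Ruby code.
  def method_missing(name, *args)
    @calls << [name, args]
    self
  end
end

# Server side: apply the recorded calls, in order, to the real object.
def replay(recording, target)
  recording.calls.inject(target) { |obj, (name, args)| obj.send(name, *args) }
end

# Client side: nothing real happens yet, we only record.
spec = Recorder.new.downcase.sub('world', 'ferret')

# The recording survives a Marshal round trip (as it would over DRb)...
spec = Marshal.load(Marshal.dump(spec))

# ...and the server replays it against the real object.
puts replay(spec, 'Hello WORLD')
```

The same shape works for query construction: the recording is just symbols and plain arguments, so it crosses the process boundary even though the finished query tree would not.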
Has anyone actually decided specifically to make Ferret bulletproof in the face of concurrency over the next few months, or is it probably just not going to happen? If it doesn't, I suspect Ferret will probably fall by the wayside as more Ruby people jump ship for Lucene-based solutions. Which would be a shame, because Ferret does hold a lot of promise...indexing is hard, and Ferret is *almost* a great solution. (Too bad the last 20% is usually 80% of the work...)

-- Scott

On Nov 18, 2007 4:45 PM, Julio Cesar Ody <julioody at gmail.com> wrote:

> Great. For my own curiosity, and maybe people here share some of it:
>
> Is it possible to write your own custom analyzers for Solr? If so, how
> easy is it? Can one do that in Ruby or do I have to write it in Java?
>
> I personally think that's one of the greatest things about Ferret. So
> far I haven't bothered looking into Sphinx or Solr precisely because,
> from a glance, I couldn't find a way to customize anything in detail
> like I can do with Ferret. I assume there is a way...
>
> Thing is, reading through the Ferret booklet (the one from O'Reilly),
> you get a glimpse of how easy it is to build custom solutions using
> it. So whereas it's kind of sad that the lead developer has been
> distant from the project in the last few months (?), I have to say,
> there's hardly anything that matches how easy it is to work with it.
>
> On Nov 18, 2007 8:29 PM, Erik Hatcher <erik at ehatchersolutions.com> wrote:
> >
> > On Nov 17, 2007, at 5:12 AM, Scott Davies wrote:
> > > Hmmm...I'd first heard of Solr only a couple of days ago, and I hadn't
> > > been aware of a Ruby API to it until you mentioned it.
> > > Interesting...thanks!
> >
> > I've honestly given fairly little of my time to Ferret, though I have
> > tinkered with it some and it is mighty fine!
> >
> > Believe you me, I don't want to steal any thunder from Ferret. And
> > I've not compared/contrasted them much myself.
> > Truth be told I'm still a Java dude, and knowing that Lucene and
> > Solr are in Java, excel at what they are designed to do, and already
> > gulping the Apache kool-aid, I really dig Solr.
> >
> > I've presented solr+ruby a couple of times now, once at RailsConf and
> > then again a few weeks ago at rubyconf.
> >
> > RailsConf:
> > <http://www.ehatchersolutions.com/~erik/SolrOnRails.pdf>
> >
> > rubyconf:
> > <http://code4lib.org/files/solr-ruby.pdf>
> >
> > acts_as_solr as it exists today is sub-optimal compared to
> > acts_as_ferret. I'm quite admittedly not much into relational
> > databases, so I have only tinkered in this area myself.
> >
> > Erik
On Nov 21, 2007, at 2:53 PM, Scott Davies wrote:

> My guess is that JRuby+Lucene+DRb will be a fine choice in a few
> months...

Definitely not a bad choice. However, I still implore you to give Solr another chance. More on that...

> Meanwhile, I'm hoping to avoid Solr because it seems (1) kind of
> complicated for what I'd actually get out of it in my particular
> application

How so? It's a "search server" with the same goals that I imagine you'd have for the JRuby+Lucene+DRb combination. It's not really complicated, especially with the solr-ruby library. Add documents, delete them, query for them. Leverage highlighting and more-like-this features, dismax querying, etc.

> , (2) not particularly well-documented given its size

Wow. Have you seen the Solr wiki? http://wiki.apache.org/solr - there are nooks and crannies documented on that wiki that go well beyond what I'd consider good documentation. By all means point me to areas that aren't documented that you need to know about (off list) and I'll get those taken care of.

> (3) likely to get in my way when I want to do anything low-level and
> gnarly with Lucene.

Maybe, but not much in your way. You'd have to wrap your low-level mojo inside some Solr API perhaps, but not even that if we're just talking about custom analyzers or a similarity implementation.

> Which would be a shame, because Ferret does hold a lot of
> promise...

Hear, hear! I definitely extend major kudos to Dave and the other Ferret contributors. Great stuff.

Erik
On Nov 21, 2007 12:24 PM, Erik Hatcher <erik at ehatchersolutions.com> wrote:

> How so? It's a "search server" with the same goals that I imagine
> you'd have for the JRuby+Lucene+DRb combination.

It's a bit more than I need right out of the gate, what with the caching, replication, faceted search, etc. Of course, that might not be a problem if it uses sensible configuration defaults I can safely ignore to start with.

> It's not really complicated, especially with the solr-ruby library.
> Add documents, delete them, query for them. Leverage highlighting
> and more-like-this features, dismax querying, etc.

My particular application does enough weird things that, for the most part, I'd prefer unfettered access to the low-level Lucene APIs. (For example, my application uses a lot of gnarly query trees involving filters and ranges, and I'm not sure whether those are easily transmitted through the Solr APIs. Then I have "run all of these queries against each of the documents in this specific set and tell me which document/query pairs match in one fell swoop" routines, in which case it might be a good idea to copy the documents into a temporary RAM index to run the queries against.)

> > , (2) not particularly well-documented given its size
>
> Wow. Have you seen the Solr wiki? http://wiki.apache.org/solr -
> there are nooks and crannies documented on that wiki that go well
> beyond what I'd consider good documentation.
>
> By all means point me to areas that aren't documented that you need
> to know (off list) and I'll get those taken care of.

Wikis are fine for looking up details when you already mostly know what you're doing, but they're not nearly as useful when you're in the earlier stages, trying to get the big "What does this system look like and how does it work?" picture and evaluate initial plans of attack. Ferret and Lucene both have entire *books* written about them that are *excellent* for those purposes.
(They're not free-as-in-beer, but they are well worth the cost.) By comparison, Solr has a very simple "here is how you get a straightforward app off the ground" tutorial that says little about how Solr is actually organized, and then you're basically left staring at a wiki page with a thousand bullet points and no clear path to big-picture enlightenment. And given the choice between (1) using a lower-level system that's been very well documented in a well-organized, explanatory fashion and (2) using a slightly higher-level system I still haven't acquired a mental "big picture" for, I generally find (1) more productive. This isn't a criticism of Solr's documentation nearly as much as a hearty "Book-style documentation is useful, and, holy crap, Ferret and Lucene actually HAVE IT. Woohoo!", plus an added bonus testament to my own laziness.

> > (3) likely to get in my way when I want to do anything low-level and
> > gnarly with Lucene.
>
> Maybe, but not much in your way. You'd have to wrap your low-level
> mojo inside some Solr API perhaps, but not even that if we're just
> talking about custom analyzers or a similarity implementation.

Yeah, my guess is that if I sit down and figure out how Solr is laid out, adding APIs to do what I want won't be too hard. Might still be kind of tedious implementing all the necessary marshaling, though.

-- Scott
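The "which document/query pairs match in one fell swoop" routine Scott mentions can be sketched in plain Ruby. This is a hypothetical stand-in, not his code: regexps play the role of real Lucene/Ferret queries, and a Hash plays the role of the temporary RAM index the documents would be copied into.

```ruby
# Hypothetical batch-matching sketch: copy a small, fixed set of documents
# into an in-memory structure, then run every query against every document
# and report the matching (doc, query) pairs in one pass.
def match_pairs(docs, queries)
  pairs = []
  docs.each do |doc_id, text|
    queries.each do |query_id, pattern|
      pairs << [doc_id, query_id] if text =~ pattern
    end
  end
  pairs
end

docs    = { 1 => 'ferret index merging', 2 => 'lucene ram directory' }
queries = { :a => /ferret/, :b => /ram\b/ }
puts match_pairs(docs, queries).inspect
```

With a real engine the inner loop would become a search over a RAM-resident index (e.g. Lucene's in-memory directory), but the batching shape, all queries against a fixed small document set, is the same.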