Hello. I'm the author of DataMapper (http://datamapper.org), and am trying to choose what Full-Text-Indexing engine/plugin I want to include by default. I was hoping you guys could help. :-)

Sphinx comes highly recommended, but without live index updates, it just doesn't seem practical for most of my work.

I'm most experienced with Solr, but the whole HTTP::Request and general complexity of it is off-putting.

I haven't used Ferret in an application yet, but I love what I see so far. The ability to have an in-process server in development, and the clean Ruby API are big wins for me. But I've heard a lot of scary things about corrupted indexes, even when using the DRb server. Is this just FUD? Are there any unresolved issues revolving around corrupted indexes? Can I afford to use Ferret in big applications for Fortune-500 clients? (I know that sounds... pompous really, but it's a genuine concern.)

Any advice you could offer would be greatly appreciated.

I've also read a few messages about serializing index requests/updates to Ferret through message-queues. Are there any decent guides/blog-posts on this topic?

Thanks, -Sam

We have several 3GB indexes with approximately 1 million documents in each of them. Here are some quick notes, feel free to reach out with other questions:

* No corruption problems that weren't our fault.
* There was an issue with large index files (> ~2GB) that was patched, but I'm honestly not sure if it is in the trunk, as the Ferret Trac/SVN is frequently MIA (which is a concern, of course).
* The code is clear and fairly easy to follow. AAF is very easy to follow.
* I've been very happy with the performance of the actual indexing/searching; however, you need to watch out for the processes that are actually doing the synchronization for writes. DRb is a bottleneck for us right now, though our volume isn't high enough that I'd call it a real problem yet.
* For moderately high-volume sites you'll want to consider batching index updates "offline", though for large indexes make sure that you have enough IO capacity to optimize the index. We host on EC2, and the $0.10/hour instances simply do not have anywhere near the IO capacity to optimize a large index without having _every other process_ waiting for IO. I haven't tested the larger instance types yet.
* We love how easy and efficient it is to combine many indexes into one. We index tens of thousands of websites in parallel and then combine 100 or so indexes into one index very quickly.
* The mailing list is great. Jens is on top of things, very receptive to new ideas and takes *very* good care of AAF. Haven't seen Dave Balmain in a while.

Overall we are happy. There are times when search accuracy questions come up, and frequently the problem is that we are not effectively parsing queries or using the right analyzer for the problem at hand, so RTFM (http://www.oreilly.com/catalog/9780596527853/).

That's all I can think of now...

Erik

On Nov 15, 2007, at 9:37 AM, Sam Smoot wrote:
> Hello. I'm the author of DataMapper (http://datamapper.org), and am
> trying to choose what Full-Text-Indexing engine/plugin I want to
> include by default. I was hoping you guys could help. :-)
>
> Sphinx comes highly recommended, but without live index updates, it
> just doesn't seem practical for most of my work.
>
> I'm most experienced with Solr, but the whole HTTP::Request and
> general complexity of it is off-putting.
>
> I haven't used Ferret in an application yet, but I love what I see so
> far. The ability to have an in-process server in development, and the
> clean Ruby API are big wins for me. But I've heard a lot of scary
> things about corrupted indexes, even when using the DRb server. Is
> this just FUD? Are there any unresolved issues revolving around
> corrupted indexes? Can I afford to use Ferret in big applications for
> Fortune-500 clients? (I know that sounds... pompous really, but it's a
> genuine concern.)
>
> Any advice you could offer would be greatly appreciated.
>
> I've also read a few messages about serializing index requests/updates
> to Ferret through message-queues. Are there any decent
> guides/blog-posts on this topic?
>
> Thanks, -Sam
> _______________________________________________
> Ferret-talk mailing list
> Ferret-talk at rubyforge.org
> http://rubyforge.org/mailman/listinfo/ferret-talk

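The batching of index updates that Erik mentions could look roughly like the sketch below. It is only an illustration in plain Ruby: `FakeIndex`, `BatchingIndexer`, `write_batch` and `batch_size` are hypothetical names, not part of Ferret or AAF, and the in-memory hash merely stands in for a real index so that the number of write sessions can be counted.

```ruby
# Hypothetical stand-in for a real index; it only records documents
# and counts how many write sessions were needed.
class FakeIndex
  attr_reader :docs, :commits

  def initialize
    @docs = {}
    @commits = 0
  end

  # Apply a whole batch of updates in a single write session.
  def write_batch(updates)
    updates.each { |id, fields| @docs[id] = fields }
    @commits += 1
  end
end

# Buffers updates and hands them to the index in batches
# instead of opening the index for every single change.
class BatchingIndexer
  def initialize(index, batch_size: 100)
    @index = index
    @batch_size = batch_size
    @pending = {}
  end

  # A later update to the same id replaces the earlier one, so each
  # document is written at most once per batch.
  def enqueue(id, fields)
    @pending[id] = fields
    flush if @pending.size >= @batch_size
  end

  # Write everything that is still buffered; a no-op when nothing is pending.
  def flush
    return if @pending.empty?
    @index.write_batch(@pending)
    @pending = {}
  end
end
```

In a real setup the `flush` call would probably be triggered by a timer or a cron job as well as by the batch size, so that a quiet site does not leave updates sitting in the buffer.
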
Hey ..

> I haven't used Ferret in an application yet, but I love what I see so
> far. The ability to have an in-process server in development, and the
> clean Ruby API are big wins for me. But I've heard a lot of scary
> things about corrupted indexes, even when using the DRb server. Is
> this just FUD? Are there any unresolved issues revolving around
> corrupted indexes? Can I afford to use Ferret in big applications for
> Fortune-500 clients? (I know that sounds... pompous really, but it's a
> genuine concern.)

We've been using Ferret on omdb.org for 14 months without any problems. There are a few things you might want to work around (Erik pointed some out). If you expect a huge amount of index updates, you need to think about a few infrastructural problems, because right now AAF does not allow you to cluster indexing servers. But I know there is a solution for that :)

If you just have a huge amount of search queries, there is no need to worry .. I would not suggest using AAF's ferret server for searching, though .. but it's quite easy to do the searching in each mongrel, so no concern here either.

I guess we need more information about the data you want to index to give more detailed advice.

> I've also read a few messages about serializing index requests/updates
> to Ferret through message-queues. Are there any decent
> guides/blog-posts on this topic?

Yes, that's currently being worked on .. so there will be some guides later on :)

Cheers
Ben

---
Benjamin Krause
http://www.omdb.org/
bk at benjaminkrause.com

Rails training "Advancing with Rails" with David A. Black
19.11.-22.11.2007, Berlin-Mitte
Details and registration: http://www.railsschulung.de

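Since no guide existed yet at this point in the thread, here is a minimal sketch of what serializing index updates through a queue can mean. It assumes an in-process `Queue` and a single worker thread; `SerializedIndexWriter` is a hypothetical name, and the plain Hash passed in stands in for a real index. A production setup would more likely use an external message queue, which this sketch does not attempt to model.

```ruby
# Funnels all index updates through one queue so that exactly one
# thread ever touches the index, no matter how many producers exist.
class SerializedIndexWriter
  def initialize(index)
    @index = index          # hypothetical stand-in: a plain Hash
    @queue = Queue.new      # thread-safe FIFO from Ruby core
    @worker = Thread.new { drain }
  end

  def add(id, text)
    @queue << [:add, id, text]
  end

  def delete(id)
    @queue << [:delete, id, nil]
  end

  # Ask the worker to stop and wait until everything queued so far
  # has been applied.
  def shutdown
    @queue << [:stop, nil, nil]
    @worker.join
  end

  private

  # The only code path that modifies the index; operations are
  # applied strictly in the order they were queued.
  def drain
    loop do
      op, id, text = @queue.pop
      case op
      when :add    then @index[id] = text
      when :delete then @index.delete(id)
      when :stop   then break
      end
    end
  end
end
```

The point of the design is that producers (for example web processes) never open the index for writing themselves; they only enqueue, which sidesteps the one-writer-at-a-time restriction.
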
On Nov 15, 2007, at 1:41 PM, Benjamin Krause wrote:
> I would not suggest using AAF's ferret server for searching,
> though .. but it's quite easy to do the searching in each mongrel, so
> no concern here either.

I'm confused... what does "searching" mean in this context? :)

John

John,

> On Nov 15, 2007, at 1:41 PM, Benjamin Krause wrote:
>> I would not suggest using AAF's ferret server for searching,
>> though .. but it's quite easy to do the searching in each mongrel, so
>> no concern here either.
>
> I'm confused... what does "searching" mean in this context? :)

If you're using AAF, you should use the ferret DRb server to index your objects. However, using the ferret server means that whenever someone searches (if you're using Model.find_by_contents), the search will be forwarded to the ferret server. The ferret server will process the search request and send the response back to the mongrel. This overhead isn't necessary, as the mongrel could use a local index to do the search. There is no need to bother the ferret server.

So, indexing (aka updating, creating, saving, whatever) should use the ferret server. Searching (using find_by_contents) will also use the ferret server if you're using standard AAF, even though it's not really necessary and could result in a bottleneck.

Don't get me wrong. It is totally fine to use standard AAF, unless you're having huge amounts of searches or live searches. I would not recommend using a custom ferret solution unless you expect a problem or already have one :)

Cheers
Ben

On Nov 15, 2007 9:37 AM, Sam Smoot <ssmoot at gmail.com> wrote:
> Hello. I'm the author of DataMapper (http://datamapper.org), and am
> trying to choose what Full-Text-Indexing engine/plugin I want to
> include by default. I was hoping you guys could help. :-)
>
> Sphinx comes highly recommended, but without live index updates, it
> just doesn't seem practical for most of my work.
>
> I'm most experienced with Solr, but the whole HTTP::Request and
> general complexity of it is off-putting.

For a different perspective: I'm in the middle of switching from Ferret to Solr. I like Ferret a lot, and still use it on several sites, but I had some problems with one large site:

1. the patches for large-index support are still in development;
2. each update to Ferret requires rebuilding the index;
3. Ferret doesn't yet support compressed indexes.

My other reason for switching is that Rails' ActiveRecord is not well-suited to storing large documents, which made acts_as_ferret less compelling.

I was nervous about tackling Solr, but I've found it quite easy to use, and the built-in caching and multithreading make it fast.

I think Ferret is adequate for most search tasks, but if (like me) you're building a dedicated search engine, Solr is currently a stronger candidate.

-Stuart Sierra

Hi!

On Fri, Nov 16, 2007 at 12:19:10PM -0500, Stuart Sierra wrote:
[..]
> For a different perspective: I'm in the middle of switching from
> Ferret to Solr. I like Ferret a lot, and still use it on several
> sites, but I had some problems with one large site:
>
> 1. the patches for large-index support are still in development;

Let's hope Dave reads this ;-) However, there are several sites I know of with index sizes > several GB, so they seem to be working well enough.

> 2. each update to Ferret requires rebuilding the index;

This for sure is annoying, but I'd consider it normal for a library that has developed that fast. I think Dave has had very good reasons for each of the changes he made to the index format. Plus I don't think *every* release had a new index format ;-)

> 3. Ferret doesn't yet support compressed indexes.

At least from the docs it looks like it does, see http://ferret.davebalmain.com/api/classes/Ferret/Index/FieldInfo.html . I never tried this out, however.

> My other reason for switching is that Rails' ActiveRecord is not
> well-suited to storing large documents, which made acts_as_ferret less
> compelling.

That's a good point, and we plan to make aaf independent from active_record in the future.

> I was nervous about tackling Solr, but I've found it quite easy to
> use, and the built-in caching and multithreading make it fast.

numbers, please :-)

> I think Ferret is adequate for most search tasks, but if (like me)
> you're building a dedicated search engine, Solr is currently a
> stronger candidate.

Well, as Solr uses Lucene internally, the mechanics and performance characteristics naturally can't be that different from Ferret. Maybe Ferret has a bug or two and non-working inter-process locking (which doesn't matter when you think about building a dedicated search server like Solr is, since it's only one process), but the general internal handling of the index is the same, i.e. you can also only have one Writer open to a Lucene index at a time, and Searchers won't see index changes until re-opened, too.

Having said that, if my application's main concern were search, I most probably wouldn't choose any pre-cooked solution like aaf or Solr, but build exactly the thing I need from scratch, basing it either on Lucene or Ferret. But maybe that's just me ;-)

Cheers,
Jens

--
Jens Krämer
http://www.jkraemer.net/ - Blog
http://www.omdb.org/ - The new free film database

On Nov 17, 2007, at 7:39 AM, Jens Kraemer wrote:
>> I think Ferret is adequate for most search tasks, but if (like me)
>> you're building a dedicated search engine, Solr is currently a
>> stronger candidate.
>
> Well, as Solr uses Lucene internally, the mechanics and performance
> characteristics naturally can't be that different from Ferret. Maybe
> Ferret has a bug or two and non-working inter-process locking (which
> doesn't matter when you think about building a dedicated search server
> like Solr is, since it's only one process), but the general internal
> handling of the index is the same, i.e. you can also only have one
> Writer open to a Lucene index at a time, and Searchers won't see index
> changes until re-opened, too.

That's all true. However, Solr manages all the IndexWriter/IndexSearcher stuff for you quite transparently (which I guess is comparable to Ferret + DRb, eh?). Because it is a single point of access to the index, it takes care of the single writer situation, and also handles warming IndexSearchers before coming online so that caches are built and a search on an updated index is as fast as it was before being updated.

> Having said that, if my application's main concern were search, I
> most probably wouldn't choose any pre-cooked solution like aaf or
> Solr,
> but build exactly the thing I need from scratch, basing it either on
> Lucene or Ferret. But maybe that's just me ;-)

You'd be reinventing a lot of wheels doing that, with IndexWriter synchronization, IndexSearcher warming, caching, and much more.

Erik

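To make the "searchers don't see changes until re-opened" and "warm before coming online" points concrete, here is a toy model in plain Ruby. It is not how Lucene, Solr or Ferret implement this; `SnapshotIndex`, `Searcher`, `SearcherManager` and the substring matching are all hypothetical simplifications chosen only to show the behaviour being discussed.

```ruby
# Toy index: one writer at a time, searchers read a frozen snapshot.
class SnapshotIndex
  def initialize
    @docs = {}
    @lock = Mutex.new   # enforces the single-writer rule
  end

  # Yields the live document hash to exactly one writer at a time.
  def write
    @lock.synchronize { yield @docs }
  end

  # A searcher sees the index as it was at the moment it was opened.
  def open_searcher
    @lock.synchronize { Searcher.new(@docs.dup.freeze) }
  end
end

class Searcher
  def initialize(docs)
    @docs = docs
    @cache = {}
  end

  # Naive substring match with a per-searcher result cache.
  def search(term)
    @cache[term] ||= @docs.select { |_id, text| text.include?(term) }.keys
  end

  # Run common queries up front so the cache is filled before use.
  def warm(terms)
    terms.each { |term| search(term) }
    self
  end
end

# Keeps serving the old searcher until a warmed replacement is ready.
class SearcherManager
  attr_reader :current

  def initialize(index, warm_terms)
    @index = index
    @warm_terms = warm_terms
    @current = index.open_searcher.warm(warm_terms)
  end

  def reopen
    fresh = @index.open_searcher.warm(@warm_terms)
    @current = fresh   # swap only after warming has finished
  end
end
```

Until `reopen` is called, `current` keeps answering from the old snapshot, which is the behaviour Jens describes; the warm-then-swap step is the part Erik says Solr handles for you.
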
On Nov 17, 2007 7:39 AM, Jens Kraemer <jk at jkraemer.net> wrote:
> > 3. Ferret doesn't yet support compressed indexes.
>
> At least from the docs it looks like it does, see
> http://ferret.davebalmain.com/api/classes/Ferret/Index/FieldInfo.html .
> I never tried this out, however.

Yes, it's in the API, but there's no code for it yet.

> > I was nervous about tackling Solr, but I've found it quite easy to
> > use, and the built-in caching and multithreading make it fast.
>
> numbers, please :-)

I make no claim that it's faster than Ferret, but it's fast enough.

> Having said that, if my application's main concern were search, I
> most probably wouldn't choose any pre-cooked solution like aaf or Solr,
> but build exactly the thing I need from scratch, basing it either on
> Lucene or Ferret. But maybe that's just me ;-)

I'd like to do that, but I lack sufficient time and skill. :) In the meantime, I'm hoping Solr will let me offer an open search API to my users without too much extra effort on my part. We'll see how it goes; I may end up back on Ferret at some point.

-Stuart