David Balmain
2006-Nov-04 02:25 UTC
[Rails] [ANN] Ferret 0.10.6 released (and some benchmarks)
Hey folks,

** Description **

Firstly, for those who don't know, Ferret is a full-text search library which makes adding search to your application a breeze. It's much faster than MySQL full-text search, as well as most other search libraries out there. It allows you to do Boolean (+ruby +rails -jewelry) and phrase queries ("the quick brown fox") as well as some more unusual queries like fuzzy queries (misspelling~ matches mispeling or misspellng), wildcard queries (Aus?ral*), range queries (date:<=20050601) and a lot more. Ferret also now offers query result highlighting and excerpting.

** Announcement **

This is the first Ferret announcement I've put up for a while, the reason being that the most recent releases of Ferret have been alpha releases. I completely rewrote Ferret from the ground up so that it no longer uses Lucene's file format, and I was able to gain some great performance improvements in the process.

On the topic of performance, it has recently been brought to my attention that some people are aware of Ferret but avoid it because they think it is slow. Just to put that myth to rest, here are the outputs for a simple benchmark, indexing the Reuters corpus available at:

http://www.daviddlewis.com/resources/testcollections/reuters21578/

First, Apache Lucene. (Yes, Java users, as you can see, I did warm up the JVM (with 6 repetitions of the test) and I used the options -server -Xmx500M -XX:CompileThreshold=100, so this is a fair test.)

---------------------------------------------------
1  Secs: 47.09  Docs: 19043
2  Secs: 46.46  Docs: 19043
3  Secs: 44.07  Docs: 19043
4  Secs: 45.92  Docs: 19043
5  Secs: 45.97  Docs: 19043
6  Secs: 47.06  Docs: 19043
---------------------------------------------------
Lucene 1.9-rc1-dev
JVM 1.5.0_06 (Sun Microsystems Inc.)
Linux 2.6.15-27-386 i386
Mean: 46.10 secs
Truncated mean (4 kept, 2 discarded): 46.35 secs
---------------------------------------------------

And now Ferret:

------------------------------------------------------------
0  Secs:  8.03  Docs: 19043
1  Secs: 10.15  Docs: 19043
2  Secs:  9.78  Docs: 19043
3  Secs: 10.31  Docs: 19043
4  Secs:  9.78  Docs: 19043
5  Secs: 10.13  Docs: 19043
------------------------------------------------------------
Mean 9.70 secs
Truncated Mean (4 kept, 2 discarded): 9.96 secs
------------------------------------------------------------

So as you can see, performance is no longer a problem. (Incidentally, the pure C version can index the Reuters corpus in under 3 seconds, an order of magnitude faster than Lucene.)

One new addition in the 0.10.* series of Ferret is a win32 gem, so all those Windows users out there can now get the super-speed searches too. There have also been a lot of other changes in the Ferret API. You may want to check out the documentation for a refresher:

http://ferret.davebalmain.com/api
http://ferret.davebalmain.com/api/files/TUTORIAL.html

** Now Accepting Donations **

Ferret has been a labour of love, but it has taken up a lot more of my life than I ever expected. At more than 50,000 lines of code, I believe it is one of the largest Ruby projects, especially for one with only a single developer. (The previous version, before the rewrite, had >70,000 LOC, so added together that is a lot of work.) I would love to keep pushing Ferret forward at the rate it has been going, but other things are going to have to start taking priority (like putting food on the table). If you find Ferret useful in your application and you aren't able to contribute to the development, please consider making a donation at the Ferret website:

http://ferret.davebalmain.com/trac

So where do I see Ferret going in the future? I'd really like to build an object database based on Ferret, with ActiveRecord and Og bindings. Why?

* Fixes the current DRY problems with Ferret.
  i.e., should you store data in the Ferret index to take advantage of highlighting, or build your own highlighter so that the data isn't stored in two places?

* Simplifies things. You'll be able to forget about IndexReaders, IndexWriters, file-locking, etcetera. Just create the database as you usually would and you have Ferret full-text search built in.

* Range queries just work. No need to pad numbers or format dates correctly.

* Sort just works. And it won't take forever to build the sort index (currently a problem on very large indexes).

* Performance, performance, performance. As people are often pointing out, the bottleneck in many applications falls in the data access layer. Mapping relational database schemas to Ruby objects (or objects in any OO language, for that matter) can be very expensive at run-time. A good object database should easily outperform even SQLite. (And I'm being very cautious here.)

Right now, I'd need to raise at least 5 figures before I'd consider this undertaking. Please send some encouragement my way if you would be interested in something like this. Otherwise, I'd appreciate any kind of contribution, whether financial or assistance with development.

In the meantime, I will continue to improve test coverage and Ferret documentation, fix bugs and help people on the Ferret mailing list.

Happy Ferreting.

Dave