Hi all,

This week I have worked with Rails and Ferret to test Ferret's (and Lucene's)
capabilities. I decided to make a mixin for ActiveRecord as it seemed the
simplest possible solution, and I ended up making this into a plugin.

For more info on Ferret see:
http://ferret.davebalmain.com/trac/

The plugin is functional but could easily be refined. Anyway, I want to share it
with you. Regard it as a basic solution. Most of the ideas and code are taken
from these sources:

Howtos and help on Ferret with Rails:
# http://wiki.rubyonrails.com/rails/pages/HowToIntegrateFerretWithRails
# http://article.gmane.org/gmane.comp.lang.ruby.rails/26859
# http://ferret.davebalmain.com/trac
# http://aslakhellesoy.com/articles/2005/11/18/using-ferret-with-activerecord
# http://rubyforge.org/pipermail/ferret-talk/2005-November/000014.html

Howtos on creating plugins:
# http://wiki.rubyonrails.com/rails/pages/HowToWriteAnActsAsFoxPlugin
# http://www.jamis.jamisbuck.org/articles/2005/10/11/plugging-into-rails
# http://lesscode.org/2005/10/27/rails-simplest-plugin-manager/
# http://wiki.rubyonrails.com/rails/pages/HowTosPlugins

The result is the acts_as_ferret mixin for ActiveRecord.

Use it as follows:
In any model.rb add acts_as_ferret

class Foo < ActiveRecord::Base
  acts_as_ferret
end

All CRUD operations will be performed on both ActiveRecord (as usual) and a
Ferret index for further searching.

The following method is available in your controllers:

ActiveRecord::find_by_contents(query) # query is a string representing your query

The plugin follows the usual plugin structure and consists of 2 files:

{RAILS_ROOT}/vendor/plugins/acts_as_ferret/init.rb
{RAILS_ROOT}/vendor/plugins/acts_as_ferret/lib/acts_as_ferret.rb

The Ferret DB is stored in:

{RAILS_ROOT}/db/index.db

Here follows the code:

# CODE for init.rb
require 'acts_as_ferret'
# END init.rb

# CODE for acts_as_ferret.rb
require 'active_record'
require 'ferret'

module FerretMixin #(was: Foo)
  module Acts #:nodoc:
    module ARFerret #:nodoc:

      def self.append_features(base)
        super
        base.extend(MacroMethods)
      end

      # declare the class level helper methods
      # which will load the relevant instance methods defined below when invoked

      module MacroMethods

        def acts_as_ferret
          extend FerretMixin::Acts::ARFerret::ClassMethods
          class_eval do
            include FerretMixin::Acts::ARFerret::ClassMethods

            after_create :ferret_create
            after_update :ferret_update
            after_destroy :ferret_destroy
          end
        end

      end

      module ClassMethods
        include Ferret

        INDEX_DIR = "#{RAILS_ROOT}/db/index.db"

        def self.reloadable?; false end

        # Finds instances by file contents.
        def find_by_contents(query, options = {})
          index_searcher ||= Search::IndexSearcher.new(INDEX_DIR)
          query_parser ||= QueryParser.new(index_searcher.reader.get_field_names.to_a)
          query = query_parser.parse(query)

          result = []
          index_searcher.search_each(query) do |doc, score|
            id = index_searcher.reader.get_document(doc)["id"]
            res = self.find(id)
            result << res if res
          end
          index_searcher.close()
          result
        end

        # private

        def ferret_create
          index ||= Index::Index.new(:key => :id,
                                     :path => INDEX_DIR,
                                     :create_if_missing => true,
                                     :default_field => "*")
          index << self.to_doc
          index.optimize()
          index.close()
        end

        def ferret_update
          # code to update index
          index ||= Index::Index.new(:key => :id,
                                     :path => INDEX_DIR,
                                     :create_if_missing => true,
                                     :default_field => "*")
          index.delete(self.id.to_s)
          index << self.to_doc
          index.optimize
          index.close()
        end

        def ferret_destroy
          # code to delete from index
          index ||= Index::Index.new(:key => :id,
                                     :path => INDEX_DIR,
                                     :create_if_missing => true,
                                     :default_field => "*")
          index_writer.delete(self.id.to_s)
          index_writer.optimize()
          index_writer.close()
        end

        def to_doc
          # Churn through the complete Active Record and add it to the Ferret document
          doc = Ferret::Document::Document.new
          self.attributes.each_pair do |key,val|
            doc << Ferret::Document::Field.new(key, val.to_s,
              Ferret::Document::Field::Store::YES,
              Ferret::Document::Field::Index::TOKENIZED)
          end
          doc
        end
      end
    end
  end
end

# reopen ActiveRecord and include all the above to make
# them available to all our models if they want it

ActiveRecord::Base.class_eval do
  include FerretMixin::Acts::ARFerret
end

# END acts_as_ferret.rb
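The wiring in the plugin (a macro that, when called in a class body, mixes in a finder and index callbacks) can be tried without Rails or Ferret. The following is only a minimal sketch under stated assumptions: a plain Hash stands in for the Ferret index, the Note class stands in for an ActiveRecord model, and every name in it (ActsAsSearchable, acts_as_searchable, find_ids_by_contents, FAKE_INDEX) is hypothetical. Unlike the posted code, it keeps the class-level finder and the instance-level callbacks in separate modules.

```ruby
# Stand-in for a Ferret index: a plain Hash keyed by record id (assumption).
FAKE_INDEX = {}

module ActsAsSearchable
  # Runs when the module is included and makes the macro available.
  def self.included(base)
    base.extend(MacroMethods)
  end

  module MacroMethods
    # Calling this in a class body opts that class in to indexing.
    def acts_as_searchable
      extend ClassMethods
      include InstanceMethods
    end
  end

  module ClassMethods
    # Returns the ids of all records whose indexed text contains the term.
    def find_ids_by_contents(term)
      FAKE_INDEX.select { |_id, text| text.include?(term) }.keys
    end
  end

  module InstanceMethods
    # Adds or replaces this record's entry in the index.
    def index_save
      FAKE_INDEX[id] = attributes.values.join(" ")
    end

    # Removes this record's entry from the index.
    def index_destroy
      FAKE_INDEX.delete(id)
    end
  end
end

# Hypothetical stand-in for an ActiveRecord model.
class Note
  include ActsAsSearchable
  acts_as_searchable

  attr_reader :id, :attributes

  def initialize(id, attributes)
    @id = id
    @attributes = attributes
  end

  # In Rails these calls would come from after_create/after_update/after_destroy.
  def save
    index_save
    self
  end

  def destroy
    index_destroy
    self
  end
end
```

Calling Note.new(1, "title" => "hello ferret").save and then Note.find_ids_by_contents("ferret") exercises the same path that the callbacks and find_by_contents take in the plugin.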
+1 great work
Very nice, Kasper. Thanks for sharing!

Cheers-
-Ezra

-Ezra Zygmuntowicz
Yakima Herald-Republic WebMaster
http://yakimaherald.com
509-577-7732
ezra-gdxLOakOTQ9oetBuM9ipNAC/G2K4zDHf@public.gmane.org
Thanks... one problem. I believe that I'm doing everything correctly, except I keep getting this error on any CRUD operation:

undefined local variable or method `document' for #<Region:0xb7124c50>

(where #<Region:....> is the name of my model)

Any ideas? The index is created and I've been able to test Ferret from a command line script just fine.
On 2-dec-2005, at 19:22, Kasper Weibel wrote:
> Hi all
>
> This week I have worked with Rails and Ferret to test Ferrets (and Lucenes)
> capabilities. I decided to make a mixin for ActiveRecord as it seemed the
> simplest possible solution and I ended up making this into a plugin.

I recently finished a simple search plugin, which works like this:

class Page < ActiveRecord::Base
  indexes_columns :title, :body, :into => 'somecolumn'
end

It's here: http://julik.textdriven.com/svn/tools/rails_plugins/simple_search/
(just finished the tests)

Maybe we can join the two plugins and get a nice search hook for AR searching?
Along the lines of

class Page < ActiveRecord::Base
  # if you pass a Ferret index it gets hooked instead of a column for LIKE
  indexes_columns :title, :body, :into => MainFerretIndex
end

Or even maintain named Ferret indexes if the user has Ferret and resort to LIKE
queries if he doesn't?

--
Julian 'Julik' Tarkhanov
me at julik.nl
James R <adamjroth@...> writes:
> Thanks... one problem. I believe that I'm doing everything correctly
> except I keep getting this error on any CRUD operation:

The following in acts_as_ferret.rb should be one line (almost at the end of the file):

# Churn through the complete Active Record and add it to the Ferret document

Take care with those line breaks :-)

Kasper
Hi Kasper,

Nice work. Do you mind if I put this on the Ferret Wiki? A few minor points.
And a disclaimer: I haven't had time to use Rails since I started working on
Ferret, so I could be wrong about a few things here.

I noticed in ferret_destroy you have index_writer. I think this is meant to be
just index. Also, where you have the lines:

  index.optimize()
  index.close()

I would replace these with:

  index.flush()

Optimizing the index every time is not necessary and can be quite slow for
large indexes. Also, if you close the index, the next time you try to use it
you should get an error. I'm not sure why it works for you. It might be a bug.
I'll have to check it out. Better to leave the index open. If you are
optimizing every time because you are really concerned about search speed, it
is better just to set the merge factor to 2, i.e.:

  index ||= Index::Index.new(:key => :id,
                             :path => INDEX_DIR,
                             :merge_factor => 2)

Remember that there is generally a trade-off between indexing speed and search
speed. Also note that I removed the :default_field and :create_if_missing
options. They were set to the defaults anyway.

Another thing: since you are setting the key to :id, there is no need to do the
delete when you do the update. This will happen automatically.

Lastly, and most importantly, I think this will only work if you apply it to
one model; otherwise you'll get conflicting ids from two different tables. To
make this available to more than one model, there are two solutions I can think
of. You could have a separate index directory for each model. Or you can set
the key like this:

  index ||= Index::Index.new(:key => [:id, :table],
                             :path => INDEX_DIR)

And your to_doc method would need to store the name of the table in the :table
field in the document.

I hope all this information helps. When I get some time to use Rails I'll post
my own code.

Cheers,
Dave

PS: I just released Ferret 0.3.0 so gem update and enjoy.
:)
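Dave's two key-related points (an update with a key replaces the old document without an explicit delete, and a composite [:id, :table] key avoids id collisions between models) can be illustrated without Ferret. This is only a sketch: KeyedIndex is a hypothetical in-memory stand-in that mimics the keyed behaviour he describes, not Ferret's Index class.

```ruby
# In-memory stand-in for an index whose key is the pair [table, id].
class KeyedIndex
  def initialize
    @docs = {}
  end

  # Adding a document whose key already exists replaces the old document,
  # so an update needs no explicit delete first.
  def <<(doc)
    @docs[[doc[:table], doc[:id]]] = doc
    self
  end

  # Removes the document with the given composite key, if present.
  def delete(table, id)
    @docs.delete([table, id])
  end

  # Looks up a document by its composite key.
  def fetch(table, id)
    @docs[[table, id]]
  end

  def size
    @docs.size
  end
end

index = KeyedIndex.new
index << { :table => "regions", :id => 1, :name => "North" }
index << { :table => "pages",   :id => 1, :title => "Home" }       # same id, other table
index << { :table => "regions", :id => 1, :name => "North-East" }  # update in place
```

With a key of :id alone, the second document would have overwritten the first; with the composite key both survive and the third call simply replaces the regions entry.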
David Balmain <dbalmain.ml@...> writes:
> Hi Kasper,
>
> Nice work. Do you mind if I put this on the Ferret Wiki?

Thanks David

This is really quality input! It's my first week with Ferret and I'm still
working my way into it. I hope I'll get time to reflect on your comments before
Monday.

Feel free to put it on the wiki!

Kasper
CC'ing ferret-talk also.

Nice work, Kasper! You've beaten me to it - this was something I was planning
on tackling in the near future.

I've got some additional feedback for you inlined below. Keep in mind that I'm
being highly detailed in my feedback, in order to help this extension become
the best it can be given Lucene best practices. Your work is a great start, and
I want to see this evolve. All comments below are constructive, not even
'criticism'. Thanks for getting this started!

On Dec 2, 2005, at 1:22 PM, Kasper Weibel wrote:
> The result is the acts_as_ferret Mixin for ActiveRecord.
>
> Use it as follows:
> In any model.rb add acts_as_ferret
>
> class Foo < ActiveRecord::Base
>   acts_as_ferret
> end

Ideally there will be many options desired besides just enabling a table to be
indexed fully. More on that in a moment.

> All CRUD operations will be performed on both ActiveRecord (as usual) and a
> ferret index for further searching.

The toughest issue to deal with here is transactions. Suppose a database
operation rolls back - then what happens to the index? It's out of sync. I
don't have any easy solutions though, and it is an issue that pops up regularly
in the Java Lucene community as well. There is quite a mismatch between a
relational database and a full-text index when it comes to how updates and
additions are handled. At the very least, a warning should be included
mentioning the transactional issue.

Another facility that is desirable with Lucene is the ability to rebuild the
entire index from scratch. Why? Perhaps you change the analyzer; you will then
need to re-index all documents to have them re-analyzed.

> The following method is available in your controllers:
>
> ActiveRecord::find_by_contents(query) # Query is a string representing you query

Dave mentioned this, but you're currently only indexing "id", not the table name.
Thus you could get documents matching the query from other tables, and get an
id that doesn't exist for the current table or one from a different table.
Table name needs to be considered somehow, either by building a separate index
for each table, or adding the table name as an indexed, untokenized field.

> The Ferret DB is stored in:
>
> {RAILS_ROOT}/db/index.db

Please consider NOT calling it a "DB". Ferret is Lucene. What it builds is an
"index", not a "database" in the traditional sense. I think it would be best to
avoid "db" terminology to prevent confusion.

> module ClassMethods
>   include Ferret
>
>   INDEX_DIR = "#{RAILS_ROOT}/db/index.db"

I'm not sure how to parameterize "acts_as" extensions, but making the index
location more configurable would be good.

> # Finds instances by file contents.
> def find_by_contents(query, options = {})
>   index_searcher ||= Search::IndexSearcher.new(INDEX_DIR)
>   query_parser ||= QueryParser.new(index_searcher.reader.get_field_names.to_a)
>   query = query_parser.parse(query)

QueryParser is only one (and often crude) way to formulate a Query. Ideally
there would be a couple of methods to search with: one that takes a
QueryParser-friendly expression like "foo AND bar NOT baz", and another that
takes a Query instance, allowing a developer to formulate sophisticated queries
via the Ferret query API rather than parsing an expression. There are many good
reasons for this, most importantly from a user interface perspective, where it
often makes more sense for the application to have separate fields that build
up a query rather than the one totally free-form Google-esque text box. Many
applications need full-text search, but not in a way that requires users to
know query expression operators like +/-/AND/OR.
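The "two ways in" idea above (an expression string for a search box, a prebuilt query object for application code) can be sketched in plain Ruby. Everything here is an assumption made for illustration: TermQuery and AndQuery are tiny stand-ins defined in the snippet, not Ferret's query classes, and parse_expression is a deliberately crude substitute for a real query parser.

```ruby
# Hypothetical stand-ins for query objects; these are not Ferret classes.
TermQuery = Struct.new(:field, :term) do
  # True when the document's field contains the term as a whole word.
  def matches?(doc)
    doc[field].to_s.downcase.split.include?(term.downcase)
  end
end

AndQuery = Struct.new(:clauses) do
  # True when every clause matches the document.
  def matches?(doc)
    clauses.all? { |clause| clause.matches?(doc) }
  end
end

# Very crude "parser": every word must appear in the default field.
def parse_expression(expression, default_field)
  AndQuery.new(expression.split.map { |word| TermQuery.new(default_field, word) })
end

# Accepts either an expression string or a prebuilt query object.
def search(docs, query, default_field = :body)
  query = parse_expression(query, default_field) if query.is_a?(String)
  docs.select { |doc| query.matches?(doc) }
end

docs = [
  { :id => 1, :title => "Ferret", :body => "foo bar" },
  { :id => 2, :title => "Rails",  :body => "foo baz" }
]

search(docs, "foo bar")                       # string form, for a search box
search(docs, TermQuery.new(:title, "rails"))  # object form, built by the application
```

The design choice is that the string form is just a convenience layered on the object form, so a form with separate inputs can build a query directly and never expose operators to the user.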
Back to the table name issue, here you'll want to wrap the query with a
BooleanQuery AND'd with a TermQuery for table:<table name> so that you're sure
the only hits returned will be for the current table.

>   result = []
>   index_searcher.search_each(query) do |doc, score|
>     id = index_searcher.reader.get_document(doc)["id"]
>     res = self.find(id)
>     result << res if res
>   end

Some handling of paging needs to be added here. It is unlikely that all hits
are needed, and accessing the Document for every hit will be an enormous
performance bottleneck with lots of data. It is very important to choose the
hits enumeration carefully. Doing a database query for every hit is also likely
to be a huge bottleneck. Perhaps doing a SQL "IN" query for all ids after
narrowing the set of hits (by page) is feasible, though I'm not sure what
limits exist on how many items you can have with an "IN" clause. I've not
delved into Ferret in much depth yet, but in Java Lucene a HitCollector would
possibly be a good way to handle this.

>   index_searcher.close()
>   result
> end

It is definitely unwise to close the IndexSearcher instance for every search -
leaving it open allows for field caches to warm up and speeds up successive
searches.

> # private
>
> def ferret_create
>   index ||= Index::Index.new(:key => :id,
>                              :path => INDEX_DIR,
>                              :create_if_missing => true,
>                              :default_field => "*")

Dave mentioned the key thing, and I'll reiterate the need to add the table name
to it.

>   index << self.to_doc
>   index.optimize()
>   index.close()
> end

Reiterating Dave, but just to be thorough, optimizing and closing an index is
not a good thing to do on every document operation as it can be slow. And
definitely heed his advice about using flush.
There does need to be a facility to optimize the index on demand, which
developers may choose to do as a nightly batch process, or periodically as the
index becomes segmented.

> def ferret_update
>   #code to update index
>   index ||= Index::Index.new(:key => :id,
>                              :path => INDEX_DIR,
>                              :create_if_missing => true,
>                              :default_field => "*")

I recommend centralizing the Index constructor, so as to not duplicate all of
those parameters and to allow them to be changed in one spot.

>   index.delete(self.id.to_s)
>   index << self.to_doc
>   index.optimize
>   index.close()
> end
>
> def ferret_destroy
>   # code to delete from index
>   index ||= Index::Index.new(:key => :id,
>                              :path => INDEX_DIR,
>                              :create_if_missing => true,
>                              :default_field => "*")
>   index_writer.delete(self.id.to_s)
>   index_writer.optimize()
>   index_writer.close()
> end

Again, the table name should be part of the key for all operations above.

> def to_doc
>   # Churn through the complete Active Record and add it to the Ferret document
>   doc = Ferret::Document::Document.new
>   self.attributes.each_pair do |key,val|
>     doc << Ferret::Document::Field.new(key, val.to_s,
>       Ferret::Document::Field::Store::YES,
>       Ferret::Document::Field::Index::TOKENIZED)
>   end
>   doc
> end

This to_doc is where a lot of fun can be had. There are many options that need
to be parameterized by the developer at the model level. For example, how a
field is indexed is crucial. You're storing and tokenizing every field,
including the "id" field. You definitely do not want to tokenize the "id"
field. Adding the table name is needed also, untokenized. Each field should
allow flexibility on how it is (or is not) indexed, including whether to
store/tokenize the field or not. Storing fields is unnecessary in the
ActiveRecord sense, since what you're returning from the search method are
records from the database, not documents from the index.
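As a rough illustration of the per-field flexibility described above, the sketch below builds plain field descriptions instead of Ferret fields. The :store and :tokenize option names, the defaults, and build_field_specs itself are all hypothetical; mapping them onto Ferret's Field::Store and Field::Index constants is left out on purpose.

```ruby
# Hypothetical per-field defaults: do not store, do tokenize.
DEFAULT_FIELD_OPTIONS = { :store => false, :tokenize => true }.freeze

# Builds one field description per attribute of a record.
# `overrides` maps attribute names to option hashes that replace the defaults.
def build_field_specs(table, attributes, overrides = {})
  specs = attributes.map do |name, value|
    options = DEFAULT_FIELD_OPTIONS.merge(overrides.fetch(name, {}))
    # The id is a lookup key, not text: always stored, never tokenized.
    options = { :store => true, :tokenize => false } if name == "id"
    { :name => name, :value => value.to_s }.merge(options)
  end
  # The table name goes in as an extra stored, untokenized field.
  specs << { :name => "table", :value => table, :store => true, :tokenize => false }
end

specs = build_field_specs(
  "pages",
  { "id" => 7, "title" => "Hello", "body" => "Some long text" },
  { "title" => { :store => true } }
)
```

Only "id" and "table" are forced; everything else follows the defaults unless the model overrides it, which keeps the common case short while leaving room for per-field choices.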
Making the analyzer controllable is necessary at a global level for the index,
and overridable on a per-field level too.

A common technique with Lucene when field-level searching granularity is not
relevant is to create an aggregate field, say "contents", where all text is
indexed. With Ferret, you could do this by iterating over all fields that
should be indexed/tokenized, using "contents" as the field name for all fields
of the record. Then searches would occur only against "contents". While Dave
likes the default field to be "*", I personally find distributing a query
expression across all fields tricky and error-prone, especially given that
different fields may be analyzed differently. Consider a query for "foo bar".
With two fields "title" and "body", how do you expand that query across all
fields? Not trivial. This is why I like the aggregate "contents" field
technique, which can work in conjunction with fields indexed individually also,
so a query for "foo bar" would search the "contents" field by default, but
someone could do "title:foo body:bar" to refine things.

I think this is enough, and perhaps too much(!), feedback for now :) Sorry if
it seems overly picky, but I think this is a very important addition to Rails
and ActiveRecord. The magic that is Lucene is very special, and I'm thrilled
that it has now entered the Ruby world. I want to help ensure that Ferret and
its integration into places like ActiveRecord go as smoothly as possible and
keep the outstanding reputation that Lucene has in the Java (and C# and Python,
etc) world. There are many ways to use Lucene inefficiently - I'll be here
doing what I can to help oversee that things are done in the best possible way.

Erik
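Erik's paging suggestion earlier in this message (narrow the hits to one page, then fetch that page with a single "IN" query rather than one query per hit) might look roughly like the sketch below. It is an assumption-laden illustration: the helper names are made up, the hit ids are a plain array rather than Ferret hits, and the SQL is only built, not executed.

```ruby
# Returns the ids belonging to one page of hits (pages start at 1).
def page_of_ids(hit_ids, page, per_page)
  return [] if page < 1
  hit_ids[(page - 1) * per_page, per_page] || []
end

# Builds one parameterized statement for a page of ids.
# Returns nil for an empty page, because "IN ()" is not valid SQL.
def in_clause_sql(table, ids)
  return nil if ids.empty?
  placeholders = (["?"] * ids.size).join(", ")
  ["SELECT * FROM #{table} WHERE id IN (#{placeholders})", *ids]
end

# An IN query does not return rows in relevance order, so put the
# fetched records back into the order of the hit ids.
def order_like_hits(records, ids)
  by_id = {}
  records.each { |record| by_id[record[:id]] = record }
  ids.map { |id| by_id[id] }.compact
end

hit_ids  = [5, 9, 2, 7, 1]              # ids in relevance order (made-up data)
page_ids = page_of_ids(hit_ids, 2, 2)   # second page of two hits
sql      = in_clause_sql("pages", page_ids)
```

The array returned by in_clause_sql has the statement first and the bind values after it; how it is handed to the database layer is outside the scope of this sketch.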
Thanks for the feedback, Erik. I've actually posted the acts_as_ferret code on
the Ferret wiki with a few improvements. But it's far from optimal. Please add
improvements or post your ideas here:

http://ferret.davebalmain.com/trac/wiki/FerretOnRails

Hopefully with Erik's feedback and a few Rails gurus looking over it we'll soon
have a really nice solution to Rails Ferret integration.

> While Dave likes the default field to
> be "*", I personally find distributing a query expression across all
> fields tricky and error-prone, especially given that different fields
> may be analyzed differently.

Just to defend my honour :-P I actually totally agree with Erik here. Think of
the default field "*" as like Rails scaffolding. It's handy to get you started,
but you'll have to put a bit of work and thought into it yourself to get the
most out of Ferret.

Cheers,
Dave
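The aggregate "contents" field that Erik proposes, and that Dave agrees is the better long-term default, can be sketched with plain hashes. This is an illustration only: to_contents_doc and parse_terms are hypothetical helpers, the document is a Hash rather than a Ferret document, and the "field:term" handling is far simpler than a real query parser.

```ruby
# Builds a document with the individual text fields plus one aggregate
# "contents" field holding all of their text.
def to_contents_doc(attributes, text_fields)
  doc = {}
  text_fields.each { |name| doc[name] = attributes[name].to_s }
  doc["contents"] = text_fields.map { |name| attributes[name].to_s }.join(" ")
  doc
end

# Splits an expression into [field, term] pairs. A bare word targets
# "contents"; "title:foo" targets the named field.
def parse_terms(expression)
  expression.split.map do |token|
    token.include?(":") ? token.split(":", 2) : ["contents", token]
  end
end

doc = to_contents_doc(
  { "id" => 3, "title" => "Foo", "body" => "bar baz" },
  ["title", "body"]
)
terms = parse_terms("title:foo bar")
```

A bare query therefore only ever touches one field, which sidesteps the question of how to expand "foo bar" across fields that may be analyzed differently.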
Great job on this, Kasper. I took a look at this a few days ago and started playing with it this weekend. I've taken a few of Erik's suggestions and started trying to implement them. I don't know if you've already started working on enhancing it, but I'd be very interested in contributing my changes. It'll probably be a few days before I can get back in and finish things up, though. (The Portland Ruby Brigade has their monthly meeting on Tuesday, so that's one night's work missed. ;~)

Here are the changes I've started working on:

1. Adding configuration

The notation I'm working on is something like this:

acts_as_ferret :index_dir => "#{RAILS_ROOT}/index/", :fields => {...}

I'm still playing with the configuration of the fields. I've also written it so that the default is to index all fields with the default settings. In addition, it should be possible to simply pass an array to the fields parameter and default the settings for Storable, etc.

2. Adding the ability to pass Query objects to the find_by_contents method.

I've been doing some refactoring along the way, too, and hope to add some unit tests eventually. One final suggestion: perhaps the name should be changed to acts_as_indexed?

Anyway, this is great work. I hope I can make worthwhile contributions to this.

--
Thomas Lockney
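One way the field configuration described here could be normalized, sketched in plain Ruby. The option keys :store and :tokenize and the helper normalize_fields are assumptions made up for this illustration; they are not taken from the plugin or from Ferret.

```ruby
# Hypothetical per-field defaults, used whenever a field gives no settings.
DEFAULT_FIELD_OPTIONS = { :store => true, :tokenize => true }.freeze

# Accepts nil (index every column), an Array of field names (defaults for
# each), or a Hash of field name => per-field overrides.
def normalize_fields(fields, all_columns)
  result = {}
  case fields
  when nil
    all_columns.each { |name| result[name.to_sym] = DEFAULT_FIELD_OPTIONS.dup }
  when Array
    fields.each { |name| result[name.to_sym] = DEFAULT_FIELD_OPTIONS.dup }
  when Hash
    fields.each { |name, opts| result[name.to_sym] = DEFAULT_FIELD_OPTIONS.merge(opts) }
  else
    raise ArgumentError, "fields must be nil, an Array or a Hash"
  end
  result
end

columns = %w[title body]
p normalize_fields(nil, columns)                                 # all columns, defaults
p normalize_fields([:title], columns)                            # one field, defaults
p normalize_fields({ :title => { :store => false } }, columns)   # one field, overridden
```

Normalizing once, when acts_as_ferret is called, would let the rest of the code work with a single Hash shape regardless of which notation the model used.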
Hi Thomas,

For additional ideas look here:

http://ferret.davebalmain.com/trac/wiki/FerretOnRails

And of course, please feel free to add your improvements.

Cheers,
Dave
On Dec 4, 2005, at 11:39 PM, Thomas Lockney wrote:
> (The Portland Ruby Brigade has their monthly meeting on Tuesday, so
> that's one night's work missed. ;~)

You Portland Rubyists really know how to party! I went to the event during OSCON in August - what a blast.

> 1. Adding configuration
>
> The notation I'm working on is something like this:
>
> acts_as_ferret :index_dir => "#{RAILS_ROOT}/index/", :fields => {...}

So you're thinking that each model may have its own index? I wasn't sure whether one index per model or a single index, globally configured through environment.rb and friends, made the most sense. Using one index would allow some clever things in the future, such as querying without the table name so that results come back with objects spanning multiple models. I'm leaning towards a single index, such that the :index_dir configuration would be done globally via environment.rb, not per model.

> 2. Adding the ability to pass Query objects to the find_by_contents
> method.

Cool. Maybe this should be renamed to find_by_ferret? If a String is passed in, it gets parsed (with the options hash allowing control over the parsing), and if a Query is passed in then it is used as-is.

> I've been doing some refactoring along the way, too, and hope to
> add some unit tests eventually. One final suggestion: perhaps the
> name should be changed to acts_as_indexed?

I like it being acts_as_ferret personally. "indexed" is overloaded within the relational database domain, so it could be construed as having to do with DB indexes.

> Anyway, this is great work. I hope I can make worthwhile
> contributions to this.

Thanks for your efforts! I'm glad to see this all coming together.

Erik
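The String-or-Query dispatch suggested for find_by_ferret could look roughly like this. It is a sketch in plain Ruby: TermQuery is a hypothetical stand-in for a real Ferret query object, parse_query is a toy parser that understands only "field:term" or a bare term, and the :default_field option is an assumption.

```ruby
# Hypothetical stand-in for a query object.
TermQuery = Struct.new(:field, :term)

# Toy parser: "title:foo" becomes TermQuery("title", "foo"); a bare "foo"
# is sent to the default field named in the options hash.
def parse_query(string, options = {})
  default_field = options.fetch(:default_field, "contents")
  field, term = string.include?(":") ? string.split(":", 2) : [default_field, string]
  TermQuery.new(field, term)
end

# Strings are parsed (the options control the parsing); query objects are
# used as-is.
def to_query(query, options = {})
  case query
  when String    then parse_query(query, options)
  when TermQuery then query
  else raise ArgumentError, "expected a String or a query object, got #{query.class}"
  end
end

p to_query("foo")
p to_query("title:foo")
p to_query("foo", :default_field => "body")
p to_query(TermQuery.new("body", "bar"))
```

Keeping the conversion in one small method means the finder itself only ever deals with query objects.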
Hi all

First of all I'd like to take the opportunity to thank you all for the great response. Personally I feel that this approach to Ferret/Rails integration is worth investigating further. People need quality search.

I think we should agree on where to put the input for this project. The page on David Balmain's wiki is a good start - thanks for that, David.

http://ferret.davebalmain.com/trac/wiki/FerretOnRails

I needed this code for a specific task at my job, and there are still many things to do to make it generally usable. I will comment on different people's input below.

Thanks to David for giving direct input on improving the quality of the code and for explaining index.flush() to me. It's good to have the author of Ferret giving direct input, as I'm not really sure where the pitfalls in the implementation are, speed- and quality-wise.

As both David and Erik Hatcher have pointed out, the current implementation will only index one model per application. My view on this issue is that I would like to have one index for all models as opposed to multiple index files; that is, ONE Ferret index per application.

I will also need to implement a method for rebuilding the index. This will come in handy in development mode and probably also in production.

Erik pointed out that there will be problems with transactions, and I must admit that I don't have any viable ideas for how to approach this issue. I have thought of turning transactions off for the SQL tables in question - if that's possible at all.

Erik also had problems with the name index.db. Instead I suggest index.frt

The current search method should be worked on. At the moment it fires quite a few SQL select statements. Pagination also needs to be implemented.

The to_doc method is one way to approach things when building the index. I have also thought about Erik's suggestion of an aggregate field, which sounds practical. There should be a way of configuring which fields go where.

I have had many ideas about what else to implement. One concern is that hard-core Lucene folks will probably not put up with the limitations of a specific implementation if it makes things difficult. One of the things I like about Active Record in Rails is the find_by_sql() method, which lets you do whatever you want on the SQL side. A similar approach could be implemented with Ferret: find_by_fql() - if there is such a term as Ferret Query Language.

Also, the many possibilities for fine tuning should not be forgotten in favour of simplicity. There should always be a way to make the configuration exactly as you would like it. I favour the configuration approach Thomas Lockney has suggested.

Lastly: I really appreciate your contributions, and I feel that with our combined efforts it will be possible to build a quality solution. In time acts_as_ferret could become the preferred choice for Ferret/Rails integration.

Kasper
Kasper Weibel Nielsen-Refs
2005-Dec-05 12:23 UTC
[Ferret-talk] [Rails] Re: ANN: acts_as_ferret
Erik Hatcher writes:
> On Dec 4, 2005, at 11:39 PM, Thomas Lockney wrote:
> > (The Portland Ruby Brigade has their monthly meeting on Tuesday, so
> > that's one night's work missed. ;~)
>
> You Portland Rubyists really know how to party! I went to the event
> during OSCON in August - what a blast.

Well, that was my first PRX.rb event since I had just moved here, so I can't take credit for all that...

> > The notation I'm working on is something like this:
> >
> > acts_as_ferret :index_dir => "#{RAILS_ROOT}/index/", :fields => {...}
>
> So you're thinking that each model may have its own index?

Actually, I guess I didn't indicate very well what was going to be optional configuration and what was fixed. I only put that there to indicate that you *could* have one index per model. I left out the part that would allow you to configure it globally. In fact, I tend to agree with you that one global index makes the most sense.

> > 2. Adding the ability to pass Query objects to the find_by_contents
> > method.
>
> Cool. Maybe this should be renamed to find_by_ferret?

Sounds reasonable to me.

> If a String is passed in, it gets parsed (with the options hash allowing
> control over the parsing), and if a Query is passed in then it is used
> as-is.

That's pretty much what I was aiming for.

> I like it being acts_as_ferret personally. "indexed" is overloaded
> within the relational database domain, so it could be construed as
> having to do with DB indexes.

Seems reasonable to me.

Thomas
Since it's been over a week and I've only had time to tinker here and there on my proposed changes to the acts_as_ferret plugin, I thought it was time to just post what I had so far and let others weigh in on it or take their own stab at making it more complete. I've posted my updated version along with some brief notes at the bottom of the Ferret wiki page here:

http://ferret.davebalmain.com/trac/wiki/FerretOnRails

I'm still actively working on this, but I've only been able to do it in fits and spurts so far. I apologize for the ugliness of some of the code; I'm still trying to figure out how to do all the dynamic "magic" necessary for this sort of thing.
Great work Thomas,

I just noticed two things in my quick glance. Firstly, you need to change Document::Field::Index::NO to Document::Field::Index::UNTOKENIZED for the :ferret_class and :id fields. My fault, as I made the same mistake in my code above. Also, I don't know if you meant to use symbols, but you shouldn't use ':' in a field name as it will throw off the query parser. Get rid of the '"' around :ferret_class and :id and you'll be fine.

I made both these changes on the wiki already.

One other change you may like to make is to allow Query objects to be passed to the find_by_contents method as well as Strings, but I'll leave that one up to you for the moment.

Hope that helps,
Dave
It's so great that people are working on this! Ferret is great and I look forward to seeing it better integrated with Rails.

Thomas -- I tried this code but experienced a few problems with it. I never got it to work, and gave up since it's not exactly what I need (the documents I'm storing in Ferret don't exactly match my model objects, but are a composite of them). Still, I have some feedback that might (or might not) be helpful.

In addition to what David mentioned, I noticed that you use the method class_variable_set in the method acts_as_ferret. This isn't available in Ruby 1.8.2. Moreover, I'm not sure why you're using it here, since the variable names are not dynamic. I just changed these to:

  @@fields_for_ferret = Array.new
  @@class_index_dir = configuration[:index_dir]

Also, I noticed that the indentation on the class method append_features was a bit off ... it looked like super was the beginning of a block. Just a minor thing.

Also, I'm confused about the name for the SingletonMethods module. What is the singleton that's being referred to here? This isn't a criticism -- I'm just confused, since it seems to me that these methods get added to your model classes and are available to each instance. Are they named such because each model has a single instance of the index?

Also, I was wondering -- since ferret_create is aliased as ferret_update, shouldn't it first call a delete before adding itself to the index? For example, something like:

  def ferret_create
    begin
      ferret_delete
    rescue
      nil
    end
    ferret_index << self.to_doc
  end
  alias :ferret_update :ferret_create

Also, a question for David -- is auto_flush => true supposed to remove the lock automatically after writes? I ask because I also tried the code that Kasper originally posted, and I kept getting locking errors unless I closed the index after updates (and I also wasn't quite able to get that code to work before giving up). I was running both a Web instance and trying to get at it with console, which is similar, I think, to what would happen with multiple FCGI processes.

Thanks to everyone for your efforts, especially David for Ferret itself!

Jen
jennyw wrote:
> Also, a question for David -- is auto_flush => true supposed to remove
> the lock automatically after writes? I ask because I also tried the
> code that Kasper originally posted, and I kept getting locking errors
> unless I closed the index after updates (and I also wasn't quite able
> to get that code to work before giving up). I was running both a Web
> instance and trying to get at it with console, which is similar, I
> think, to what would happen with multiple FCGI processes.

Oops! Never mind about the locking problem ... it turns out I had an older version of Ferret installed that probably didn't support auto_flush.

Jen
On 12/14/05, jennyw wrote:
> Also, I was wondering -- since ferret_create is aliased as
> ferret_update, shouldn't it first call a delete before adding itself to
> the index? For example, something like:
>
>   def ferret_create
>     begin
>       ferret_delete
>     rescue
>       nil
>     end
>     ferret_index << self.to_doc
>   end
>   alias :ferret_update :ferret_create

Hi Jenny,

Glad to hear you like Ferret. Note that I've added a key option to the index:

  @@index ||= Index::Index.new(:key => [:id, :ferret_class],

This ensures that the index is kept unique for these fields, i.e. every time I do an update the old document is automatically deleted. This only happens when you set the key option.

> Also, a question for David -- is auto_flush => true supposed to remove
> the lock automatically after writes?

Yes, that is the way it is supposed to work.

> I ask because I also tried the
> code that Kasper originally posted, and I kept getting locking errors
> unless I closed the index after updates (and I also wasn't quite able to
> get that code to work before giving up). I was running both a Web
> instance and trying to get at it with console, which is similar, I
> think, to what would happen with multiple FCGI processes.

Have you tried it with the latest version of Ferret? 3.0 had a few bugs but 3.1 should be fine. Let me know if you are still getting lock errors. :-)

Cheers,
Dave

> Thanks to everyone for your efforts, especially David for Ferret itself!
>
> Jen
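The effect of a key such as [:id, :ferret_class] can be modelled with a small in-memory class. This is a toy model of the behaviour described in the message, not Ferret's implementation; KeyedIndex and its methods are hypothetical names.

```ruby
# Toy in-memory index: documents are Hashes, and the values of the key
# fields together identify a document.
class KeyedIndex
  def initialize(key_fields)
    @key_fields = key_fields
    @docs = {}
  end

  # Adding a document whose key already exists replaces the old document,
  # so create and update can share one code path without an explicit delete.
  def <<(doc)
    @docs[key_for(doc)] = doc
    self
  end

  def delete(doc)
    @docs.delete(key_for(doc))
  end

  def size
    @docs.size
  end

  def docs
    @docs.values
  end

  private

  def key_for(doc)
    @key_fields.map { |field| doc.fetch(field) }
  end
end

index = KeyedIndex.new([:id, :ferret_class])
index << { :id => 1, :ferret_class => "Article", :title => "draft" }
index << { :id => 1, :ferret_class => "Article", :title => "final" }  # replaces the draft
index << { :id => 1, :ferret_class => "Author",  :title => "final" }  # different class, kept
puts index.size   # 2
```

Including the class name in the key is what lets several models share one index even when their ids collide.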
I am rewriting parts of the plugin (I'll contribute it around next week). I want a search method that takes some special arguments for Ferret plus the usual arguments for find, so that when the search is done it calls find with the found ids and the conditions/include etc., and returns what's needed. I am hesitating between ferret_search (no risk of being reimplemented by someone else) and search (very common, and maybe some day it could become a method in Rails itself). What is your opinion?

I was thinking of running the Ferret query first and then fetching the database entries (from MySQL, for example). But I can't really tell which would be faster (searching Ferret first or ActiveRecord first); it really depends on the use of conditions...
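One detail of the "search the index first, then call find with the found ids" approach is that the database returns rows in its own order, so the ranking from the index has to be restored afterwards. A plain-Ruby sketch, where order_by_rank is a hypothetical helper and the two data sets are stand-ins for the real index and database calls:

```ruby
# Reorders database rows to follow the ranking returned by the full-text
# search. Ids that the database did not return (for example because extra
# conditions filtered them out) are skipped.
def order_by_rank(records, ranked_ids)
  by_id = {}
  records.each { |record| by_id[record[:id]] = record }
  ranked_ids.map { |id| by_id[id] }.compact
end

# Phase 1 stand-in: ids from the index, best hit first.
ranked_ids = [42, 7, 19]

# Phase 2 stand-in: rows from the database in arbitrary order, after the
# find conditions removed id 7.
rows = [{ :id => 19, :title => "c" }, { :id => 42, :title => "a" }]

p order_by_rank(rows, ranked_ids).map { |row| row[:id] }   # [42, 19]
```

With this shape the database is hit once per search rather than once per hit, which also addresses the "quite a few SQL select statements" concern raised earlier in the thread.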
Thomas Lockney wrote:
> Since it's been over a week and I've only had time to tinker here and there on
> my proposed changes to the acts_as_ferret plugin, I thought it was time to just
> post what I had so far and let others weigh in on it or take their own stab at
> making it more complete. I've posted my updated version along with some brief
> notes at the bottom of the Ferret wiki page here:
> http://ferret.davebalmain.com/trac/wiki/FerretOnRails
>
> I'm still actively working on this, but I've only been able to do it in fits and
> spurts so far. I apologize for the ugliness of some of the code; I'm still
> trying to figure out how to do all the dynamic "magic" necessary for this sort
> of thing.

It's great that you guys are working on this. I have been following the developments with a fair amount of interest and am hoping to integrate some of this work with my own code on a project I am working on. A couple of questions:

Has anyone considered a universal search across multiple models yet? How would this work, considering that the code is currently per model?

What about indexing fields that are not contained in the model? For example: say I have an Article model with a belongs_to relationship to an Author model. I would like the author's name to be indexed along with the contents of the article in the Ferret document. I guess this may be more of a Ruby programming issue than a Ferret issue. It seems that the general practice is to keep track of the fields to be used/indexed/inspected as an array of symbols. In my notional article example that might be:

  [:title, :document]

I'd prefer it to look more like:

  [:title, :document, :author.name]

but ":author.name" is going to be problematic, is it not?

Any thoughts on these issues? Let me know if I have not been clear enough.

-F
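On the ":author.name" question: that expression is evaluated immediately as a method call on the Symbol itself, so it never reaches the model. One possible workaround, sketched in plain Ruby, is to describe an associated field as an Array of method names that are sent in turn. The Struct classes stand in for ActiveRecord models, and field_value, field_name and to_doc are hypothetical helpers.

```ruby
# Stand-ins for ActiveRecord models.
Author  = Struct.new(:name)
Article = Struct.new(:title, :document, :author)

# A path is either a single Symbol or an Array of Symbols. Each message is
# sent to the result of the previous one; a nil along the way yields nil.
def field_value(record, path)
  Array(path).inject(record) do |object, message|
    object.nil? ? nil : object.send(message)
  end
end

# [:author, :name] is indexed under the field name "author.name".
def field_name(path)
  Array(path).join(".")
end

def to_doc(record, paths)
  doc = {}
  paths.each { |path| doc[field_name(path)] = field_value(record, path).to_s }
  doc
end

article = Article.new("Ferret on Rails", "Full text here", Author.new("Kasper"))
p to_doc(article, [:title, :document, [:author, :name]])
```

A String such as "author.name" split on "." would work the same way; the Array form just avoids parsing.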
First post here! Here's my question:

I have several related Category objects that all belong_to a Job object. When a new Job object is to be created, a user will have to click through the CSS tabs that I have set up with link_to action methods. I do not want the data from the forms to be persisted until all the sections are complete and the user clicks "Create Project". I also want the controller to dynamically store/update each view's data in the session when any tab is arbitrarily selected.

For example, the form tabs resemble this:

  Art Details | Dev Details | Marketing Details

So when I am finished with "Art Details" and click on "Dev Details", I want to store that form data in a session - the same for the other tabs when a new view is selected by clicking on a new tab.

I considered using a pseudo-cart type of object to store the Project's "Details" objects and their associated attributes, but this doesn't really work for this model, because the child objects of Project will not know about their association or foreign keys until they are persisted. Moreover, it would seem logical to just store the post variables in some object, but then how would I restore those values in the fields if the user goes back to a previous tab?

Here's my object model:

  Project
   |__ ArtDetails belongs_to Project
   |__ DevDetails belongs_to Project
   |__ MarketingDetails belongs_to Project

Any suggestions? TIA!
On Dec 14, 2005, at 3:06 AM, Abdur-Rahman Advany wrote:
> I am rewriting parts of the plugin (I'll contribute it around next
> week). I want a search method that takes some special arguments for
> Ferret plus the usual arguments for find, so that when the search is
> done it calls find with the found ids and the conditions/include etc.,
> and returns what's needed. I am hesitating between ferret_search (no
> risk of being reimplemented by someone else) and search (very common,
> and maybe some day it could become a method in Rails itself). What is
> your opinion? I was thinking of running the Ferret query first and then
> fetching the database entries (from MySQL, for example). But I can't
> really tell which would be faster (searching Ferret first or
> ActiveRecord first); it really depends on the use of conditions...

My recommendation is to index the fields you want to use as search criteria into Ferret rather than trying to mix and match Ferret and ActiveRecord searches. Optimizing the two will be tricky - would it be quicker to search with Ferret and then pull from the DB, or to constrain the set by the DB first and then full-text search on those? My hunch is that no database will have better performance than a fully optimized Ferret could. It's certainly true of Java Lucene that it is as fast as, and usually faster than, a relational database for querying.

If you do go the route of searching with ActiveRecord first and using those results to constrain the Ferret search, consider using a Filter (not sure how that is implemented in Ferret, but in Java Lucene there are overloaded search methods that accept a Filter).

Erik
On 12/15/05, Erik Hatcher wrote:
> If you do go the route of searching with ActiveRecord first and using
> those results to constrain the Ferret search, consider using a Filter
> (not sure how that is implemented in Ferret, but in Java Lucene there
> are overloaded search methods that accept a Filter).

Filters are implemented in Ferret the same way as they are in Java. They're unit tested, but I haven't used them very much and I don't suspect many other people have yet either. But they're there if you need them. You pass a filter object as one of the options to any of the search methods.

Dave
On 14-dec-2005, at 19:48, Erik Hatcher wrote:
> If you do go the route of searching with ActiveRecord first and
> using those results to constrain the Ferret search, consider using
> a Filter (not sure how that is implemented in Ferret, but in Java
> Lucene there are overloaded search methods that accept a Filter).

Maybe someone can help me finish http://www.julik.nl/code/active-search/classes/ActiveSearch/FerretIndexer.html? I am sorting out the kinks but I keep stumbling upon

  RuntimeError: could not obtain lock:

and I should admit I am absolutely lost as to how to handle concurrency with Ferret.

--
Julian 'Julik' Tarkhanov
me at julik.nl
David Balmain writes:
> Also, I don't know if you meant to use symbols, but you shouldn't use
> ':' in a field name as it will throw off the query parser. Get rid
> of the '"' around :ferret_class and :id and you'll be fine.

Yeah, I realized this one a little while after I pasted it. I had them as strings and had reverted back to the ":" prefixed names in an attempt to see if that was causing a problem I was having. I guess I pasted it a little too soon.

> I made both these changes on the wiki already.

Great!

> One other change you may like to make is to allow Query objects to be
> passed to the find_by_contents method as well as Strings, but I'll
> leave that one up to you for the moment.

Yeah, that was the other thing I had started working on but didn't want to paste in yet. I had an implementation of it, but it was ugly, so I'm reworking it a bit and hope to have that in place over the weekend.

> Hope that helps,
> Dave

Thanks again for developing Ferret. I've been waiting for this ever since I first started playing with Ruby and saw Erik's registered (though, sadly, never completed) rlucene project.

Thomas
jennyw writes:
> It's so great that people are working on this! Ferret is great and I
> look forward to seeing it better integrated with Rails.
>
> Thomas -- I tried this code but experienced a few problems with it. I
> never got it to work, and gave up since it's not exactly what I need
> (the documents I'm storing in Ferret don't exactly match my model
> objects, but are a composite of them). Still, I have some feedback that
> might (or might not) be helpful.

As I (think I) mentioned in my note on the wiki, the code I put there definitely was buggy. I just wanted to put it out in case anyone else wanted to start taking a stab at it. I'll have a newer version sometime next week, I hope.

> In addition to what David mentioned, I noticed that you use the method
> class_variable_set in the method acts_as_ferret. This isn't available in
> Ruby 1.8.2. Moreover, I'm not sure why you're using it here, since the
> variable names are not dynamic. I just changed these to:
>
>   @@fields_for_ferret = Array.new
>   @@class_index_dir = configuration[:index_dir]

I'm not sure why I did that either. :-/ Guess I was just trying to get anything to work at that point. I'll implement your fix.

> Also, I noticed that the indentation on the class method append_features
> was a bit off ... it looked like super was the beginning of a block.
> Just a minor thing.

I fixed a few indentation problems when I added it to the wiki, but must have missed that one. Thanks.

> Also, I'm confused about the name for the SingletonMethods module. What
> is the singleton that's being referred to here?

I adopted that from the plugin howtos on the Rails wiki: http://wiki.rubyonrails.org/rails/pages/HowToWriteAnActsAsFoxPlugin
David Balmain writes:
> Also, I don't know if you meant to use symbols, but you shouldn't use
> ':' in a field name as it will throw off the query parser. Get rid
> of the '"' around :ferret_class and :id and you'll be fine.

Now that I think about it, I was confused for a bit about the keys defined and was having trouble doing lookups. It turned out to be a different problem, but in my search for a way to fix it, I changed those field names to match. I even tried just using symbols, but it seems that Ferret didn't like that too much (should symbols be an allowable option for a field name?).

Ferret's truly a great piece of work and the documentation is already quite good, but I think there's a lot more needed to make it fully accessible. Hopefully as more of us dig in, we can add to what's there. I guess that's a topic for the Ferret mailing list, though. ;~)

Thomas
On 12/15/05, Julian 'Julik' Tarkhanov wrote:
> Maybe someone can help me finish
> http://www.julik.nl/code/active-search/classes/ActiveSearch/FerretIndexer.html?
> I am sorting out the kinks but I keep stumbling upon
>
>   RuntimeError: could not obtain lock:
>
> and I should admit I am absolutely lost as to how to handle concurrency
> with Ferret.

Using the latest version of Ferret and setting :auto_flush => true should solve this problem. Have you tried that? It only works in Index::Index though, and it's not necessary for an IndexSearcher. If you use IndexWriter and IndexReader directly you have to handle it yourself.

> --
> Julian 'Julik' Tarkhanov
> me at julik.nl
On 15-dec-2005, at 5:59, David Balmain wrote:
> On 12/15/05, Julian 'Julik' Tarkhanov <listbox-RY+snkucC20@public.gmane.org> wrote:
>>
>> Maybe someone can help me finish
>> http://www.julik.nl/code/active-search/classes/ActiveSearch/FerretIndexer.html?
>> I am sorting out the kinks but I am stumbling upon
>>
>> RuntimeError: could not obtain lock:
>>
>> and I should admit I am absolutely lost in how to handle concurrency
>> with Ferret.
>
> Using the latest version of ferret and setting :auto_flush => true
> should solve this problem. Have you tried that? It only works in
> Index::Index though and it's not necessary for an IndexSearcher. If
> you use IndexWriter and IndexReader directly you have to handle it
> yourself.

David, thanks for the advice - I'll try that and report the results. Basically, it feels sort of _odd_ - doing this macro-style Ferret binding. Ferret is so vast and powerful that this would not be enough to make use of all of its features. Maybe you can send me some advice off-list on how I could expand the API of the FerretIndexer to give more access to the most needed Ferret features in a convenient way (without making it too big, because the whole idea of the plugin is a one-liner integration into a model, not a document cluster with 10 million entries in it).

If someone else wants to shed some light (or help with code) I would be glad to get some help. I am swamped now and won't be able to get to it until at least next week.

--
Julian 'Julik' Tarkhanov
me at julik.nl
Hi Julian,

I'm really busy porting everything in Ferret to C at the moment. Next year though I should have some time to play around with integrating it into Rails. Until then I'll try and be as helpful as possible to others trying to do the same thing. Good luck! :-)

Cheers,
Dave
Hello!

I have been following this thread carefully; Ferret just got a little easier to dive into. Kudos to you guys, and especially to the authors of Ferret! This was just what we needed here at our little webdev shop.

Now I have a problem you guys might know a solution to. I have managed to get the code from the wiki working, with a little bit of tweaking, but it does not seem to build queries correctly when it gets fed UTF-8 characters. Is this a fault on my side or a known issue with Ferret? I looked at the trac and it seemed it should support UTF-8. I must have overlooked something...

I didn't dare to touch the wiki, but here is a somewhat altered version of the plugin, and it should be fully functional. I added some small things, since we wanted a counter for the Paginator. I know that doing a full search just to count might not be the best way to count, so if anyone has a suggestion to improve this, please share! :)

Oh, and I added a rake task to rebuild the index, but it relies on INDEX_PATH being set in environment.rb.

Here it is:

# CODE for acts_as_ferret.rb
require 'active_record'
require 'ferret'

module FerretMixin
  module Acts #:nodoc:
    module ARFerret #:nodoc:

      def self.append_features(base)
        super
        base.extend(MacroMethods)
      end

      # declare the class level helper methods
      # which will load the relevant instance methods defined below when invoked

      module MacroMethods

        def acts_as_ferret
          extend FerretMixin::Acts::ARFerret::ClassMethods
          class_eval do
            include FerretMixin::Acts::ARFerret::ClassMethods

            after_create :ferret_create
            after_update :ferret_update
            after_destroy :ferret_destroy
          end
        end

      end

      module ClassMethods
        include Ferret
        INDEX_PATH = "#{RAILS_ROOT}/db/ferret"
        def self.reloadable?; false end

        # Finds instances by file contents.
        def find_by_ferret(query, options = {})
          @@index_searcher ||= Search::IndexSearcher.new(INDEX_PATH)
          @@query_parser ||= QueryParser.new(@@index_searcher.reader.get_field_names.to_a)
          query = @@query_parser.parse(query)
          result = []
          conditions = {}
          conditions[:num_docs] = options[:limit] unless options[:limit].blank?
          conditions[:first_doc] = options[:offset] unless options[:offset].blank?

          hits = @@index_searcher.search(query, conditions)
          hits.each do |hit, score|
            id = @@index_searcher.reader.get_document(hit)['id']
            result << self.find(id) unless id.nil?
          end
          return result
        end

        def count_by_ferret(query)
          @@index_searcher ||= Search::IndexSearcher.new(INDEX_PATH)
          @@query_parser ||= QueryParser.new(@@index_searcher.reader.get_field_names.to_a)
          query = @@query_parser.parse(query)
          return @@index_searcher.search(query).total_hits
        end

        # private

        def ferret_create
          # code to update or add to the index
          @@index ||= Index::Index.new(:path => INDEX_PATH, :auto_flush => true)
          @@index << self.to_doc
        end

        def ferret_update
          @@index ||= Index::Index.new(:path => INDEX_PATH, :auto_flush => true)
          @@index.query_delete("+id:#{self.id} +ferret_table:#{self.class.table_name}")
          @@index << self.to_doc
        end

        def ferret_destroy
          # code to delete from index
          @@index ||= Index::Index.new(:path => INDEX_PATH, :auto_flush => true)
          @@index.query_delete("+id:#{self.id} +ferret_table:#{self.class.table_name}")
        end

        def to_doc
          # Churn through the complete Active Record and add it to the Ferret document
          doc = Ferret::Document::Document.new
          doc << Ferret::Document::Field.new('ferret_table', self.class.table_name,
                   Ferret::Document::Field::Store::YES,
                   Ferret::Document::Field::Index::UNTOKENIZED)
          self.attributes.each_pair do |key,val|
            if key == 'id'
              doc << Ferret::Document::Field.new(key, val.to_s,
                       Ferret::Document::Field::Store::YES,
                       Ferret::Document::Field::Index::UNTOKENIZED)
            else
              doc << Ferret::Document::Field.new(key, val.to_s,
                       Ferret::Document::Field::Store::NO,
                       Ferret::Document::Field::Index::TOKENIZED)
            end
          end
          return doc
        end
      end
    end
  end
end

# reopen ActiveRecord and include all the above to make
# them available to all our models if they want it
ActiveRecord::Base.class_eval do
  include FerretMixin::Acts::ARFerret
end

# END acts_as_ferret.rb

RAKE TASK in /lib/tasks/indexer.rake

include FileUtils

desc "Perform ferret index"
task :indexer => :environment do
  if !File.exist?(INDEX_PATH)
    puts "Creating index dir in #{INDEX_PATH}"
    FileUtils.mkdir_p(INDEX_PATH)
  end

  classes = []
  Dir.glob(File.join(RAILS_ROOT,"app","models","*.rb")).each do |rbfile|
    bname = File.basename(rbfile,'.rb')
    classname = Inflector.camelize(bname)
    classes.push(classname)
  end
  classes.each do |class_obj|
    c = eval(class_obj)
    if c.respond_to?(:ferret_create)
      puts "REBUILDING #{c.name}"
      c.find_all.each{|cn|cn.save}
    end
  end
end
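The rake task above turns each file name under app/models into a class name and then re-saves the records of every class that has the Ferret callbacks. A minimal sketch of just that name-to-class step in plain Ruby, without Rails: `camelize_basename` and `indexable_classes` are hypothetical helper names, `Object.const_get` is swapped in for the `eval` call, and the two classes are stand-ins for real models.

```ruby
# Hypothetical helpers, not part of the plugin: they mirror the rake task's
# "file name -> class name -> class" steps without Rails' Inflector.

# "app/models/blog_post.rb" -> "BlogPost"
def camelize_basename(path)
  File.basename(path, '.rb').split('_').map { |part| part.capitalize }.join
end

# Resolve class names to constants and keep only those whose instances
# define the given callback method (ferret_create in the plugin).
def indexable_classes(class_names, callback = :ferret_create)
  class_names.map { |name| Object.const_get(name) }
             .select { |klass| klass.method_defined?(callback) }
end

# Stand-in models for the example.
class BlogPost
  def ferret_create; end
end

class AuditLog; end

names = ['app/models/blog_post.rb', 'app/models/audit_log.rb'].map do |path|
  camelize_basename(path)
end
puts indexable_classes(names).inspect
```

Looking the constant up by name avoids evaluating arbitrary strings, which is the only design change from the task as posted.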
To answer my own question...

This is a hack to get unicode to work, and relies on the unicode gem. Also, this, as opposed to my previous code listing, should work out of the box... except that the constant INDEX_PATH must be set beforehand, preferably in environment.rb.

# CODE for acts_as_ferret.rb
require 'active_record'
require 'ferret'
require 'unicode'

class UnicodeLowerCaseFilter < Ferret::Analysis::TokenFilter
  def next()
    t = @input.next()

    if (t == nil)
      return nil
    end

    t.term_text = Unicode::downcase(t.term_text)

    return t
  end
end

class SwedishTokenizer < Ferret::Analysis::RegExpTokenizer

  P = /[_\/.,-]/
  HASDIGIT = /\w*\d\w*/

  def token_re()
    %r([[:alpha:]ÅÖÄåöä]+(('[[:alpha:]ÅÖÄåöä]+)+
                       |\.([[:alpha:]ÅÖÄåöä]\.)+
                       |(@|\&)\w+([-.]\w+)*
                       )
      |\w+(([\-._]\w+)*\@\w+([-.]\w+)+
          |#{P}#{HASDIGIT}(#{P}\w+#{P}#{HASDIGIT})*(#{P}\w+)?
          |(\.\w+)+
          |
          )
    )x
  end
end

class SwedishAnalyzer < Ferret::Analysis::Analyzer
  def token_stream(field, string)
    return UnicodeLowerCaseFilter.new(SwedishTokenizer.new(string))
  end
end

module FerretMixin
  module Acts #:nodoc:
    module ARFerret #:nodoc:

      def self.append_features(base)
        super
        base.extend(MacroMethods)
      end

      # declare the class level helper methods
      # which will load the relevant instance methods defined below when invoked

      module MacroMethods

        def acts_as_ferret
          extend FerretMixin::Acts::ARFerret::ClassMethods
          class_eval do
            include FerretMixin::Acts::ARFerret::ClassMethods

            after_create :ferret_create
            after_update :ferret_update
            after_destroy :ferret_destroy
          end
        end

      end

      module ClassMethods
        include Ferret
        def self.reloadable?; false end

        # Finds instances by file contents.
        def find_by_ferret(query, options = {})
          index_searcher ||= Search::IndexSearcher.new(INDEX_PATH)
          query_parser ||= QueryParser.new(index_searcher.reader.get_field_names.to_a,
                             {:analyzer => SwedishAnalyzer.new()})
          query = query_parser.parse(query)
          result = []
          conditions = {}
          conditions[:num_docs] = options[:limit] unless options[:limit].blank?
          conditions[:first_doc] = options[:offset] unless options[:offset].blank?

          hits = index_searcher.search(query, conditions)
          hits.each do |hit, score|
            id = index_searcher.reader.get_document(hit)['id']
            result << self.find(id) unless id.nil?
          end
          return result
        end

        def count_by_ferret(query)
          index_searcher ||= Search::IndexSearcher.new(INDEX_PATH)
          query_parser ||= QueryParser.new(index_searcher.reader.get_field_names.to_a,
                             {:analyzer => SwedishAnalyzer.new()})
          query = query_parser.parse(query)
          return index_searcher.search(query).total_hits
        end

        # private

        def ferret_create
          # code to update or add to the index
          index ||= Index::Index.new(:key => [:id, :ferret_table],
                                     :path => INDEX_PATH,
                                     :auto_flush => true,
                                     :analyzer => SwedishAnalyzer.new())
          index << self.to_doc
        end

        def ferret_update
          index ||= Index::Index.new(:key => [:id, :ferret_table],
                                     :path => INDEX_PATH,
                                     :auto_flush => true,
                                     :analyzer => SwedishAnalyzer.new())
          index.query_delete("+id:#{self.id.to_s} +ferret_table:#{self.class.table_name}")
          index << self.to_doc
        end

        def ferret_destroy
          # code to delete from index
          index ||= Index::Index.new(:key => [:id, :ferret_table],
                                     :path => INDEX_PATH,
                                     :auto_flush => true,
                                     :analyzer => SwedishAnalyzer.new())
          index.query_delete("+id:#{self.id.to_s} +ferret_table:#{self.class.table_name}")
        end

        def to_doc
          # Churn through the complete Active Record and add it to the Ferret document
          doc = Ferret::Document::Document.new
          doc << Ferret::Document::Field.new('ferret_table', self.class.table_name,
                   Ferret::Document::Field::Store::YES,
                   Ferret::Document::Field::Index::UNTOKENIZED)
          self.attributes.each_pair do |key,val|
            if key == 'id'
              doc << Ferret::Document::Field.new("id", val.to_s,
                       Ferret::Document::Field::Store::YES,
                       Ferret::Document::Field::Index::UNTOKENIZED)
            else
              doc << Ferret::Document::Field.new(key, val.to_s,
                       Ferret::Document::Field::Store::NO,
                       Ferret::Document::Field::Index::TOKENIZED)
            end
          end
          return doc
        end
      end
    end
  end
end

# reopen ActiveRecord and include all the above to make
# them available to all our models if they want it
ActiveRecord::Base.class_eval do
  include FerretMixin::Acts::ARFerret
end

# END acts_as_ferret.rb

And the rake task:

include FileUtils

desc "Perform ferret index"
task :indexer => :environment do
  if !File.exist?(INDEX_PATH)
    puts "Creating index dir in #{INDEX_PATH}"
    FileUtils.mkdir_p(INDEX_PATH)
  end

  classes = []
  Dir.glob(File.join(RAILS_ROOT,"app","models","*.rb")).each do |rbfile|
    bname = File.basename(rbfile,'.rb')
    classname = Inflector.camelize(bname)
    classes.push(classname)
  end
  classes.each do |class_obj|
    c = eval(class_obj)
    if c.respond_to?(:ferret_create)
      puts "REBUILDING #{c.name}"
      c.find_all.each{|cn|cn.save}
    end
  end
end
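The lowercase filter in the listing above wraps a token stream and downcases each term through the unicode gem. For readers on a current Ruby (2.4 or later), where String#downcase applies Unicode case mapping on its own, the same wrapping idea can be sketched without the gem. `Token`, `LowerCaseStream` and `ArraySource` are hypothetical stand-ins written for this example, not Ferret classes.

```ruby
# Hypothetical stand-ins for a Ferret token and token filter; only the
# "wrap a stream, downcase each term" idea is taken from the post above.
Token = Struct.new(:term_text)

class LowerCaseStream
  def initialize(input)
    @input = input # anything that responds to #next and returns nil at the end
  end

  # Returns the next token with its text lowercased, or nil when exhausted.
  def next
    token = @input.next
    return nil if token.nil?
    token.term_text = token.term_text.downcase # Unicode-aware since Ruby 2.4
    token
  end
end

# A trivial source that hands out tokens one at a time.
class ArraySource
  def initialize(words)
    @tokens = words.map { |word| Token.new(word) }
  end

  def next
    @tokens.shift
  end
end

stream = LowerCaseStream.new(ArraySource.new(%w[Räksmörgås ÅÄÖ Ferret]))
terms = []
while (token = stream.next)
  terms << token.term_text
end
puts terms.inspect
```

On older Ruby versions, where downcase only touches ASCII, the gem-based filter from the post remains the way to get the same result.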
Hi Albert,

Perhaps you could do something like this in the find_by_ferret method and get rid of your count_by_ferret method. Just an idea.

  total_hits = hits.each do |hit, score|
    id = @@index_searcher.reader.get_document(hit)['id']
    result << self.find(id) unless id.nil?
  end
  return result, total_hits

Cheers,
Dave
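The idea in this reply is to hand back the page of results together with the total hit count from one call, so the paginator does not need a separate counting search. A minimal sketch of that return shape in plain Ruby, with an in-memory array standing in for the Ferret hits; `search_page` and its options are hypothetical, and no Ferret API is used.

```ruby
# Hypothetical sketch: one call returns both the requested page and the
# total number of matches, so a paginator does not need a second search.
def search_page(matching_ids, options = {})
  offset = options.fetch(:offset, 0)
  limit  = options.fetch(:limit, 10)
  page   = matching_ids[offset, limit] || [] # nil when offset is past the end
  [page, matching_ids.size]
end

all_ids = (1..25).to_a # pretend these ids matched the query
records, total_hits = search_page(all_ids, :limit => 10, :offset => 20)
puts "showing #{records.size} of #{total_hits}"
```

A caller can destructure the pair as shown, which keeps the existing method name while making the count available for page links.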
On 12/15/05, albert ramstedt <albert-fwIc/cu1KZxHQX+h2pknIQ@public.gmane.org> wrote:
> Hello!
>
> I have been following this thread carefully, ferret just got a little
> easier to dive into. Kudos to you guys, and especially to the authors of
> ferret! This was just what we needed here at our little webdev shop.
>
> Now I have a problem you guys might know a solution to. I have managed
> to get the code from the wiki working, with a little bit of tweaking,
> but it does not seem to build queries correctly when it gets fed with
> UTF-8 characters. Is this a fault on my side or a known issue with
> ferret? I looked at the trac but it seemed it should support UTF-8? I
> must have overlooked something...

The problem is that the analyzer doesn't understand UTF-8. You need to write an analyzer that matches the characters in your character set. Have a look at the analyzers and tokenizers included with Ferret. They're quite simple. Basically you just need to come up with a regular expression that matches what you consider tokens in your data. For example, the whitespace tokenizer uses /\S+/. The letter tokenizer uses /[[:alpha:]]+/. This is actually where the problem with UTF-8 handling is: [:alpha:] only matches the ASCII alphabet in the current Ruby regexp engine. That will change in Ruby 2.0.

HTH,
Dave
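The regular-expression approach described above can be tried in plain Ruby with String#scan. This is only a sketch of the idea, not Ferret's tokenizer classes, and `tokenize` is a hypothetical helper. It assumes a current Ruby version, where [[:alpha:]] does match non-ASCII letters in UTF-8 strings, unlike the engine discussed in this thread.

```ruby
# Hypothetical helper: split text into tokens using a token regexp,
# the same idea a regexp-based tokenizer is built on.
def tokenize(text, token_re)
  text.scan(token_re)
end

text = 'Räksmörgås är gott, 2 st!'

whitespace_tokens = tokenize(text, /\S+/)          # runs of non-whitespace
letter_tokens     = tokenize(text, /[[:alpha:]]+/) # runs of letters only

puts whitespace_tokens.inspect
puts letter_tokens.inspect
```

The whitespace pattern keeps punctuation and digits attached to tokens, while the letter pattern drops them, which is the trade-off to weigh when picking a token expression for an index.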
On 12/15/05, albert ramstedt <albert@delamednoll.se> wrote:
> To answer my own question...
>
> This is a hack to get unicode to work, and relies on the unicode gem.
> Also, this, as opposed to my previous code listing, should work out of
> the box... except that the constant INDEX_PATH must be set before,
> preferable in environment.rb
>
> # CODE for acts_as_ferret.rb
> require 'active_record'
> require 'ferret'
> require 'unicode'
>
> class UnicodeLowerCaseFilter < Ferret::Analysis::TokenFilter
>   def next()
>     t = @input.next()
>
>     if (t == nil)
>       return nil
>     end
>
>     t.term_text = Unicode::downcase(t.term_text)
>
>     return t
>   end
> end
>
> class SwedishTokenizer < Ferret::Analysis::RegExpTokenizer
>
>   P = /[_\/.,-]/
>   HASDIGIT = /\w*\d\w*/
>
>   def token_re()
>     %r([[:alpha:]ЕЦДецд]+(('[[:alpha:]ЕЦДецд]+)+
>                        |\.([[:alpha:]ЕЦДецд]\.)+
>                        |(@|\&)\w+([-.]\w+)*
>                        )
>       |\w+(([\-._]\w+)*\@\w+([-.]\w+)+
>           |#{P}#{HASDIGIT}(#{P}\w+#{P}#{HASDIGIT})*(#{P}\w+)?
>           |(\.\w+)+
>           |
>           )
>     )x
>   end
> end
>
> class SwedishAnalyzer < Ferret::Analysis::Analyzer
>   def token_stream(field, string)
>     return UnicodeLowerCaseFilter.new(SwedishTokenizer.new(string))
>   end
> end

Oh, very cool. Sorry, I just replied to your other email before I saw this. Do you mind if I put it on the Ferret Wiki in the howtos section? Even better if you could do it. ;-)

Thanks for posting this Albert. Hope my other code snippet helped.

Cheers,
Dave
albert ramstedt <albert@...> writes:
>
> To answer my own question...
>
> This is a hack to get unicode to work, and relies on the unicode gem.
> Also, this, as opposed to my previous code listing, should work out of
> the box... except that the constant INDEX_PATH must be set before,
> preferable in environment.rb

Nice to see this addition. I'm wondering whether this will work for other European languages besides Swedish though. Is there a way to make it more universal?

Thanks.
On 12/16/05, Fabien Franzen <fabienf-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> Nice to see this addition. I'm wondering whether this will work for other
> European languages besides Swedish though. Is there a way to make it
> more universal?

Hi Fabien,

As far as I know this will work for any European language, or any language for that matter. You just need to include the required characters in the regular expression. Once the data is split into tokens, Ferret doesn't care what the string looks like. You can even store binary data like images in a Ferret index if you want to. Now we just need people to add the necessary characters for all the different European languages. :-)

Dave
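The point about adding the required characters to the regular expression can be sketched as a small helper that builds a token pattern per language. This is an illustration in plain Ruby under the assumption that tokens are simple runs of letters; `token_regexp` is a hypothetical name and the character lists are examples, not complete alphabets.

```ruby
# Hypothetical helper: build a token regexp from ASCII letters plus a
# caller-supplied string of extra characters for the target language.
def token_regexp(extra_chars)
  Regexp.new("[A-Za-z#{Regexp.escape(extra_chars)}]+")
end

swedish = token_regexp('ÅÄÖåäö')
german  = token_regexp('ÄÖÜäöüß')

puts 'Räksmörgås är gott'.scan(swedish).inspect
puts 'Straße über Köln'.scan(german).inspect
```

Escaping the extra characters keeps a stray `-` or `]` in the list from changing the meaning of the character class.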
Hi David,

The problem is that I need that query for the paginator, i.e. I need the hit count before I do the actual search with the limit and offset, and since that query also translates the hits into model objects, it hits the database when it doesn't actually need to.

But I agree, my solution is not really that nice either.

Albert
Hi,

Of course you can add it to the wiki! The mail seems to have scrambled the UTF characters, so keep that in mind if you intend to use the Swedish tokenizer.

Albert
It's so cool!
I am just looking for the CJK solutions.
Here is the "JavaCC code for the Nutch lexical analyzer", included in the Nutch source code. Could anyone port it to Ferret?

===================================================
/**
 * Copyright 2005 The Apache Software Foundation
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

/** JavaCC code for the Nutch lexical analyzer. */

options {
  STATIC = false;
  USER_CHAR_STREAM = true;
  OPTIMIZE_TOKEN_MANAGER = true;
  UNICODE_INPUT = true;
  //DEBUG_TOKEN_MANAGER = true;
}

PARSER_BEGIN(NutchAnalysis)

package org.apache.nutch.analysis;

import org.apache.nutch.searcher.Query;
import org.apache.nutch.searcher.QueryFilters;
import org.apache.nutch.searcher.Query.Clause;

import org.apache.lucene.analysis.StopFilter;

import java.io.*;
import java.util.*;

/** The JavaCC-generated Nutch lexical analyzer and query parser. */
public class NutchAnalysis {

  private static final String[] STOP_WORDS = {
    "a", "and", "are", "as", "at", "be", "but", "by",
    "for", "if", "in", "into", "is", "it",
    "no", "not", "of", "on", "or", "s", "such",
    "t", "that", "the", "their", "then", "there", "these",
    "they", "this", "to", "was", "will", "with"
  };

  private static final Set STOP_SET = StopFilter.makeStopSet(STOP_WORDS);

  private String queryString;

  /** True iff word is a stop word.  Stop words are only removed from queries.
   * Every word is indexed.
   */
  public static boolean isStopWord(String word) {
    return STOP_SET.contains(word);
  }

  /** Construct a query parser for the text in a reader. */
  public static Query parseQuery(String queryString) throws IOException {
    NutchAnalysis parser =
      new NutchAnalysis(new FastCharStream(new StringReader(queryString)));
    parser.queryString = queryString;
    return parser.parse();
  }

  /** For debugging. */
  public static void main(String[] args) throws Exception {
    BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
    while (true) {
      System.out.print("Query: ");
      String line = in.readLine();
      System.out.println(parseQuery(line));
    }
  }

}

PARSER_END(NutchAnalysis)

TOKEN_MGR_DECLS : {

  /** Constructs a token manager for the provided Reader. */
  public NutchAnalysisTokenManager(Reader reader) {
    this(new FastCharStream(reader));
  }

}

TOKEN : {                                         // token regular expressions

  // basic word -- lowercase it
  <WORD: ((<LETTER>|<DIGIT>|<WORD_PUNCT>)+ | <IRREGULAR_WORD>)>
  { matchedToken.image = matchedToken.image.toLowerCase(); }

  // special handling for acronyms: U.S.A., I.B.M., etc: dots are removed
| <ACRONYM: <LETTER> "." (<LETTER> ".")+ >
    {                                             // remove dots
      for (int i = 0; i < image.length(); i++) {
        if (image.charAt(i) == '.')
          image.deleteCharAt(i--);
      }
      matchedToken.image = image.toString().toLowerCase();
    }

  // chinese, japanese and korean characters
| <SIGRAM: <CJK> >

  // irregular words
| <#IRREGULAR_WORD: (<C_PLUS_PLUS>|<C_SHARP>)>
| <#C_PLUS_PLUS: ("C"|"c") "++" >
| <#C_SHARP: ("C"|"c") "#" >

  // query syntax characters
| <PLUS: "+" >
| <MINUS: "-" >
| <QUOTE: "\"" >
| <COLON: ":" >
| <SLASH: "/" >
| <DOT: "." >
| <ATSIGN: "@" >
| <APOSTROPHE: "'" >

| <WHITE: ~[] >                                   // treat unrecognized chars
                                                  // as whitespace

  // primitive, non-token patterns

| <#WORD_PUNCT: ("_"|"&")>                        // allowed anywhere in words

| < #LETTER:                                      // alphabets
    [
        "\u0041"-"\u005a",
        "\u0061"-"\u007a",
        "\u00c0"-"\u00d6",
        "\u00d8"-"\u00f6",
        "\u00f8"-"\u00ff",
        "\u0100"-"\u1fff"
    ]
  >

| <#CJK:                                          // non-alphabets
    [
        "\u3040"-"\u318f",
        "\u3300"-"\u337f",
        "\u3400"-"\u3d2d",
        "\u4e00"-"\u9fff",
        "\uf900"-"\ufaff"
    ]
  >

| < #DIGIT:                                       // unicode digits
    [
        "\u0030"-"\u0039",
        "\u0660"-"\u0669",
        "\u06f0"-"\u06f9",
        "\u0966"-"\u096f",
        "\u09e6"-"\u09ef",
        "\u0a66"-"\u0a6f",
        "\u0ae6"-"\u0aef",
        "\u0b66"-"\u0b6f",
        "\u0be7"-"\u0bef",
        "\u0c66"-"\u0c6f",
        "\u0ce6"-"\u0cef",
        "\u0d66"-"\u0d6f",
        "\u0e50"-"\u0e59",
        "\u0ed0"-"\u0ed9",
        "\u1040"-"\u1049"
    ]
  >

}

/** Parse a query. */
Query parse() :
{
  Query query = new Query();
  ArrayList terms;
  Token token;
  String field;
  boolean stop;
  boolean prohibited;
}
{
  nonOpOrTerm()                                   // skip noise
  (
    { stop=true; prohibited=false; field = Clause.DEFAULT_FIELD; }

                                                  // optional + or - operator
    ( <PLUS> {stop=false;} | (<MINUS> { stop=false;prohibited=true; } ))?

                                                  // optional field spec.
    ( LOOKAHEAD(<WORD><COLON>(phrase(field)|compound(field)))
      token=<WORD> <COLON> { field = token.image; } )?

    ( terms=phrase(field) {stop=false;} |         // quoted terms or
      terms=compound(field))                      // single or compound term

    nonOpOrTerm()                                 // skip noise

    {
      String[] array = (String[])terms.toArray(new String[terms.size()]);

      if (stop
          && field == Clause.DEFAULT_FIELD
          && terms.size()==1
          && isStopWord(array[0])) {
        // ignore stop words only when single, unadorned terms in default field
      } else {
        if (prohibited)
          query.addProhibitedPhrase(array, field);
        else
          query.addRequiredPhrase(array, field);
      }
    }
  )*

  { return query; }

}

/** Parse an explicitly quoted phrase query.
 * Note that this may return a single term, a trivial phrase. */
ArrayList phrase(String field) :
{
  int start;
  int end;
  ArrayList result = new ArrayList();
  String term;
}
{
  <QUOTE>

  { start = token.endColumn; }

  (nonTerm())*                                    // skip noise
  ( term = term() { result.add(term); }           // parse a term
    (nonTerm())*)*                                // skip noise

  { end = token.endColumn; }

  (<QUOTE>|<EOF>)

  {
    if (QueryFilters.isRawField(field)) {
      result.clear();
      result.add(queryString.substring(start, end));
    }
    return result;
  }

}

/** Parse a compound term that is interpreted as an implicit phrase query.
 * Compounds are a sequence of terms separated by infix characters.  Note that
 * this may return a single term, a trivial compound. */
ArrayList compound(String field) :
{
  int start;
  ArrayList result = new ArrayList();
  String term;
}
{
  { start = token.endColumn; }

  term = term() { result.add(term); }
  ( LOOKAHEAD( (infix())+ term() )
    (infix())+
    term = term() { result.add(term); })*

  {
    if (QueryFilters.isRawField(field)) {
      result.clear();
      result.add(queryString.substring(start, token.endColumn));
    }
    return result;
  }

}

/** Parse a single term. */
String term() :
{
  Token token;
}
{
  ( token=<WORD> | token=<ACRONYM> | token=<SIGRAM>)

  { return token.image; }
}

/** Parse anything but a term or a quote. */
void nonTerm() :
{}
{
  <WHITE> | infix()
}

void nonTermOrEOF() :
{}
{
  nonTerm() | <EOF>
}

/** Parse anything but a term or an operator (plus or minus or quote). */
void nonOpOrTerm() :
{}
{
  (LOOKAHEAD(2) (<WHITE> | nonOpInfix() | ((<PLUS>|<MINUS>) nonTermOrEOF())))*
}

/** Characters which can be used to form compound terms. */
void infix() :
{}
{
  <PLUS> | <MINUS> | nonOpInfix()
}

/** Parse infix characters except plus and minus. */
void nonOpInfix() :
{}
{
  <COLON>|<SLASH>|<DOT>|<ATSIGN>|<APOSTROPHE>
}
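As a rough starting point for the port asked about above, the three term-producing token rules of the grammar (WORD, ACRONYM and SIGRAM) can be approximated in plain Ruby with StringScanner. This is a sketch under assumptions, not a Ferret analyzer and not a full port: the module NutchLikeLexer and its tokens method are hypothetical names, the query parser productions are left out, and JavaCC's longest-match rule is imitated by trying ACRONYM, then the irregular words, then WORD, then SIGRAM. The character ranges are copied from the LETTER and CJK definitions in the grammar; every non-ASCII DIGIT range listed there already falls inside the LETTER range \u0100-\u1fff, so only 0-9 is added for digits.

```ruby
require 'strscan'

# Hypothetical sketch of the Nutch token rules; not part of Ferret or Nutch.
module NutchLikeLexer
  # Character-class bodies taken from the grammar's LETTER and CJK ranges.
  LETTER = "A-Za-z\u00c0-\u00d6\u00d8-\u00f6\u00f8-\u00ff\u0100-\u1fff"
  CJK    = "\u3040-\u318f\u3300-\u337f\u3400-\u3d2d\u4e00-\u9fff\uf900-\ufaff"

  ACRONYM   = /[#{LETTER}]\.(?:[#{LETTER}]\.)+/  # U.S.A., I.B.M.
  IRREGULAR = /[Cc](?:\+\+|#)/                   # C++ and C#
  WORD      = /[#{LETTER}0-9_&]+/                # letters, digits, "_" and "&"
  SIGRAM    = /[#{CJK}]/                         # one CJK character per token

  # Returns an array of [kind, text] pairs for the term tokens in text.
  def self.tokens(text)
    scanner = StringScanner.new(text)
    result = []
    until scanner.eos?
      if (t = scanner.scan(ACRONYM))
        result << [:acronym, t.delete('.').downcase]  # dots removed, lowercased
      elsif (t = scanner.scan(IRREGULAR)) || (t = scanner.scan(WORD))
        result << [:word, t.downcase]
      elsif (t = scanner.scan(SIGRAM))
        result << [:sigram, t]
      else
        scanner.getch  # query syntax characters and anything else are skipped
      end
    end
    result
  end
end
```

Calling NutchLikeLexer.tokens on a query string yields one entry per term, with each CJK character emitted as its own token, which mirrors how the grammar treats CJK text as single-character terms.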
On Dec 16, 2005, at 12:14 AM, hui wrote:
> It's so cool!
> I am just looking for the CJK solutions,
> Here is "JavaCC code for the Nutch lexical analyzer."
> Included in Nutch source code, so could anyone port it into ferret?

There are several other Analyzers in Lucene that can deal with CJK (and actually Korean doesn't really fit with Chinese and Japanese). Lucene's StandardAnalyzer recognizes the CJK range just as the Nutch one does, and there are also these additional ones (in the cjk and cn directories):

<http://svn.apache.org/viewcvs.cgi/lucene/java/trunk/contrib/analyzers/src/java/org/apache/lucene/analysis/>

Erik