hi all,

I have two kinds of search needs:

- public documents on a site. I'd like this to be a separate service that uses XMLRPC or SOAP and doesn't need to be 'included' in a rails app. this should be a package, and it should be _really_ easy to use, like:

    ./rodeum --db path --index http://www.blahsite.com/

  and a simple CGI that will run under apache / lighttpd and respond to XMLRPC or SOAP requests.

- private indexing of certain fields in certain parts of my database schema. for example, I want to index all of the orders, and I want to allow a customer to search orders, but obviously only _their_ orders ;)

- there are other, simpler applications: searching articles in a db, etc... but you could argue that it's easier to just index all the public content.

The former should be pretty easy, since the requirements are very clear. The latter part I would expect to be a bit, but not much, harder ;)

Any pointers much appreciated...

thanks,

_alex

--
alex black, founder
the turing studio, inc.

510.666.0074
root-16h2cdTTKgpzNNFeSAH1EA@public.gmane.org
http://www.turingstudio.com

2600 10th street, suite 635
berkeley, ca 94710
On May 31, 2005, at 9:07 PM, Zed A. Shaw wrote:
> On Tue, 2005-05-31 at 17:18 -0700, alex black wrote:
>> - public documents on a site. I'd like this to be a separate service
>> that uses XMLRPC or SOAP and doesn't need to be 'included' in a rails
>> app. this should be a package, and it should be _really_ easy to use,
>> like:
>
> This is possible. I'd say that unless you need to access the index from
> other non-Ruby systems that you should try using the DRb system first.
> It's really easy and would work without installing anything.

I set up an IndexRepository as a facade for a remote DRb index service or local index lookup. I can use the local repository for development and test, then fire up ./script/index_service in production. Other RPC mechanisms take a lot more work; DRb is very easy, and likely faster.

>> - private indexing of certain fields in certain parts of my database
>> schema. for example, I want to index all of the orders, and I want to
>> allow a customer to search orders, but obviously only _their_ orders ;)
>> - there are other, simpler applications: searching articles in a db,
>> etc... but you could argue that it's easier to just index all the
>> public content.
>
> Yeah, this is a big requested feature. I have some ideas on this, but I
> was thinking the simplest approach is to use something like AR, dump the
> Object from AR as YAML records, and then run them through Ruby/Odeum to
> index them. I'd also use the built-in meta-data features of Ruby/Odeum
> to do some fields.

I prefer to store no extra content with the index: just give me a set of URIs and scores as search results. Then look up the URIs in the database, applying any conditions/ordering you like.

From what I've seen of Lucene, this is a non-issue. You'd just add a customer field to the index and match against it also. Perhaps Odeum can do this too.

> Go ahead and chat with me on it and then we can work on rolling anything
> you do into the project. I actually plan on the following two
> improvements to Ruby/Odeum:
>
> 1. A performance improvement by altering the result iteration API.

How can I page through search results? How can I do substring searches? These are the difficulties I've run into.

> 2. Using Linda to make cheap remote querying and clustering. This would
> be really neat to put behind your proposed web service type API.

Interesting. Auto-discovery would be nice, but that step isn't too hard with plain DRb. I'm not sure whether clustering is worth the shared disk and locking headaches.

I did a bit of integration work with Active Record that makes indexing simpler. You don't have to know you're using Ruby/Odeum at all, so swapping in Lucene (or Namazu or ..) should be a snap. Example:

    class Foo < ActiveRecord::Base
      index :name, :conditions => 'active=true'
      index :full, :on => [:name, :description], :conditions => 'active=true'
    end

    foos = Foo.search_by_name('baz')

The index class method defines the "search_by_#{index_name}" class method, which asks the index repository for an index named "#{model_name}-#{index_name}" and performs the search. Similarly, it defines after_save and after_destroy callbacks to keep the index in sync.

I'd like to extend this to an ActiveRecord::Index that easily spans multiple models.

Best,
jeremy
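For readers who want to see the shape of such an `index` macro, here is a minimal sketch. It is not Jeremy's code and uses neither Active Record nor Ruby/Odeum: `IndexRepository`, `Searchable` and `Article` are hypothetical names, the repository is a plain in-memory hash, `save`/`destroy` stand in for the after_save/after_destroy callbacks, and the `:conditions` option is left out.

```ruby
# Hypothetical in-memory stand-in for an index repository:
# index name => { record id => indexed text }.
class IndexRepository
  def initialize
    @indexes = Hash.new { |hash, name| hash[name] = {} }
  end

  def store(index_name, id, text)
    @indexes[index_name][id] = text.downcase
  end

  def remove(index_name, id)
    @indexes[index_name].delete(id)
  end

  # Ids of the documents that contain every word of the query.
  def search(index_name, query)
    words = query.downcase.split
    hits = @indexes[index_name].select do |_id, text|
      words.all? { |word| text.split.include?(word) }
    end
    hits.keys
  end
end

module Searchable
  # One shared repository for the sketch; it could just as well be a remote proxy.
  REPOSITORY = IndexRepository.new

  # index :name                   indexes the :name attribute
  # index :full, :on => [:a, :b]  indexes several attributes together
  def index(index_name, options = {})
    fields = Array(options[:on] || index_name)
    repo_name = "#{name}-#{index_name}" # e.g. "Article-name"
    index_definitions[repo_name] = fields

    define_singleton_method("search_by_#{index_name}") do |query|
      REPOSITORY.search(repo_name, query).map { |id| find(id) }
    end
  end

  def index_definitions
    @index_definitions ||= {}
  end
end

# Hypothetical model; a hash of records replaces the database table.
class Article
  extend Searchable
  attr_reader :id, :name, :description

  @records = {}
  class << self
    attr_reader :records

    def find(id)
      records[id]
    end
  end

  index :name
  index :full, :on => [:name, :description]

  def initialize(id, name, description)
    @id = id
    @name = name
    @description = description
  end

  # Plays the role of an after_save callback: store, then re-index.
  def save
    self.class.records[id] = self
    self.class.index_definitions.each do |repo_name, fields|
      text = fields.map { |field| public_send(field) }.join(' ')
      Searchable::REPOSITORY.store(repo_name, id, text)
    end
    self
  end

  # Plays the role of an after_destroy callback: delete, then drop from every index.
  def destroy
    self.class.records.delete(id)
    self.class.index_definitions.each_key do |repo_name|
      Searchable::REPOSITORY.remove(repo_name, id)
    end
  end
end

Article.new(1, 'baz widget', 'a small blue widget').save
Article.new(2, 'qux gadget', 'pairs well with a baz').save

by_name = Article.search_by_name('baz').map(&:id)
by_full = Article.search_by_full('baz').map(&:id)
```

Because the model only ever talks to the repository, swapping the hash for Ruby/Odeum, Lucene or a DRb proxy would not change the `search_by_*` callers, which is the point made above.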
Hi Alex,

This is surely possible, but not part of the Ruby/Odeum distribution yet. More below...

On Tue, 2005-05-31 at 17:18 -0700, alex black wrote:
> hi all,
>
> I have two kinds of search needs:
>
> - public documents on a site. I'd like this to be a separate service
> that uses XMLRPC or SOAP and doesn't need to be 'included' in a rails
> app. this should be a package, and it should be _really_ easy to use,
> like:

This is possible. I'd say that unless you need to access the index from other non-Ruby systems, you should try using the DRb system first. It's really easy and would work without installing anything.

> ./rodeum --db path --index http://www.blahsite.com/
>
> and a simple CGI that will run under apache / lighttpd and respond to
> XMLRPC or SOAP requests.

There are lots of C XML-RPC libraries if you mean a CGI written in that language for start-up. Otherwise SOAP or XML-RPC would work, and if you can do Ruby a quick start would be DRb. Hell, you could get DRb working in like 10 minutes. :-)

> - private indexing of certain fields in certain parts of my database
> schema. for example, I want to index all of the orders, and I want to
> allow a customer to search orders, but obviously only _their_ orders ;)
> - there are other, simpler applications: searching articles in a db,
> etc... but you could argue that it's easier to just index all the
> public content.

Yeah, this is a big requested feature. I have some ideas on this, but I was thinking the simplest approach is to use something like AR, dump the Object from AR as YAML records, and then run them through Ruby/Odeum to index them. I'd also use the built-in meta-data features of Ruby/Odeum to do some fields. I think the key here is to be consistent, and remove any overly consistent stuff like tags and such (like if you were to use XML for some reason).

> The former should be pretty easy, since the requirements are very
> clear. The latter part I would expect to be a bit, but not much, harder
> ;)

Go ahead and chat with me on it and then we can work on rolling anything you do into the project. I actually plan on the following two improvements to Ruby/Odeum:

1. A performance improvement by altering the result iteration API.

2. Using Linda to make cheap remote querying and clustering. This would be really neat to put behind your proposed web service type API.

Zed
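The YAML-dump idea for private order search can be sketched roughly as follows. This is only an illustration under stated assumptions: the order rows are made up, the inverted index is a plain Ruby hash rather than Ruby/Odeum, and `customer_id` plays the role of the meta-data field that restricts a customer to their own orders.

```ruby
require 'yaml'

# Hypothetical order rows, standing in for what an Active Record query would return.
orders = [
  { 'id' => 1, 'customer_id' => 10, 'notes' => 'Red bicycle with basket' },
  { 'id' => 2, 'customer_id' => 11, 'notes' => 'Blue bicycle, no basket' },
  { 'id' => 3, 'customer_id' => 10, 'notes' => 'Spare bicycle tyre' }
]

# Step 1: dump every record as its own YAML document.
yaml_records = orders.map(&:to_yaml)

# Step 2: feed the YAML records to an indexer. Here that is a toy inverted
# index (word => list of entries), not the Ruby/Odeum API.
index = Hash.new { |hash, word| hash[word] = [] }
yaml_records.each do |doc|
  record = YAML.safe_load(doc)
  entry = { uri: "order://#{record['id']}", customer_id: record['customer_id'] }
  record['notes'].downcase.scan(/\w+/).uniq.each { |word| index[word] << entry }
end

# Step 3: search, keeping only hits that belong to the asking customer.
def search_orders(index, word, customer_id)
  index.fetch(word.downcase, [])
       .select { |entry| entry[:customer_id] == customer_id }
       .map { |entry| entry[:uri] }
end

mine   = search_orders(index, 'bicycle', 10)
theirs = search_orders(index, 'bicycle', 11)
```

The index stores only a URI and the owner, so the full order can still be loaded from the database after the search, with whatever extra conditions apply.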
On May 31, 2005, at 11:53 PM, Jeremy Kemper wrote:
>> Go ahead and chat with me on it and then we can work on rolling anything
>> you do into the project. I actually plan on the following two
>> improvements to Ruby/Odeum:
>>
>> 1. A performance improvement by altering the result iteration API.

On the Lucene side of things, here are answers to your questions:

> How can I page through search results?

The first (and often only) approach needed is to simply search again with the same query and pick up at whatever item (page_num * page_size + 1) you'd like. This is the technique I use for http://www.lucenebook.com/search?query=lucene (scroll to the bottom to see the paging controls and their URLs).

If re-searching is not good enough (I've yet to see a system where that was the case though), you could keep Lucene's Hits object around on the server and use that to page through without searching again.

> How can I do substring searches?

In Lucene, there is a WildcardQuery. For example: http://www.lucenebook.com/search?query=ru*+OR+ru%3Fy

Note: Lucene's QueryParser does not allow a wildcard character as the first character, but the WildcardQuery API does. This is to prevent a potentially big, resource-consuming query from occurring.

> I did a bit of integration work with Active Record that makes
> indexing simpler. You don't have to know you're using Ruby/Odeum
> at all, so swapping in Lucene (or Namazu or ..) should be a snap.
> Example:
>
>     class Foo < ActiveRecord::Base
>       index :name, :conditions => 'active=true'
>       index :full, :on => [:name, :description], :conditions => 'active=true'
>     end
>
>     foos = Foo.search_by_name('baz')
>
> The index class method defines the "search_by_#{index_name}" class
> method which asks the index repository for an index named
> "#{model_name}-#{index_name}" and performs the search. Similarly, it
> defines after_save and after_destroy callbacks to keep the index in
> sync.

Very slick!

Erik
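Both answers can be illustrated in a few lines of Ruby. This is a sketch, not Lucene: the term list is made up, `wildcard_search` imitates the `*` (any run of characters) and `?` (exactly one character) wildcards with a regular expression, and `page_of` re-runs nothing itself but shows the slice you would take after searching again. `page_num` is assumed to be zero-based, so the first hit on a page is item number page_num * page_size + 1.

```ruby
# Hypothetical list of indexed terms, in ranked order.
TERMS = %w[rails rake ruby rudy run rugby rust].freeze

# Turn a wildcard pattern into an anchored Regexp:
# "*" matches any run of characters, "?" matches exactly one.
def wildcard_to_regexp(pattern)
  body = Regexp.escape(pattern).gsub('\*', '.*').gsub('\?', '.')
  Regexp.new("\\A#{body}\\z")
end

def wildcard_search(pattern)
  regexp = wildcard_to_regexp(pattern)
  TERMS.select { |term| regexp.match?(term) }
end

# Slice one page out of a fresh result list. The zero-based offset
# page_num * page_size corresponds to item number page_num * page_size + 1.
def page_of(results, page_num, page_size)
  results[page_num * page_size, page_size] || []
end

prefix_hits = wildcard_search('ru*')
single_hits = wildcard_search('ru?y')

first_page  = page_of(wildcard_search('ru*'), 0, 2)
second_page = page_of(wildcard_search('ru*'), 1, 2)
past_end    = page_of(wildcard_search('ru*'), 3, 2)
```

Each page request runs the same query again and discards everything outside the requested window, which is cheap as long as the search itself is fast.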
> This is possible. I'd say that unless you need to access the index from
> other non-Ruby systems that you should try using the DRb system first.
> It's really easy and would work without installing anything.

DRb system?

> There are lots of C XML-RPC libraries if you mean a CGI written in that
> language for start-up. Otherwise SOAP or XML-RPC would work, and if you
> can do Ruby a quick start would be DRb. Hell, you could get DRb working
> in like 10 minutes. :-)

Ok - what is DRb?

> 2. Using Linda to make cheap remote querying and clustering. This would
> be really neat to put behind your proposed web service type API.

Yep..

_a
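For context on the question: DRb (Distributed Ruby, the `drb` library that ships with Ruby) lets one Ruby process call methods on an object living in another process. A minimal sketch of an index service behind DRb might look like this; `IndexService` and its in-memory hash are hypothetical stand-ins for a real index, and server and client run in one process here only to keep the example self-contained.

```ruby
require 'drb/drb'

# Hypothetical index service; the hash stands in for a real index such as Ruby/Odeum.
class IndexService
  def initialize
    @docs = {}
  end

  def add(uri, text)
    @docs[uri] = text.downcase
    true
  end

  # URIs of the documents whose text contains the word.
  def search(word)
    @docs.select { |_uri, text| text.include?(word.downcase) }.keys
  end
end

# Server side: publish the object. Port 0 asks the OS for any free port.
DRb.start_service('druby://localhost:0', IndexService.new)
service_uri = DRb.uri

# Client side (normally a separate process, e.g. the Rails app):
# a proxy object forwards method calls to the service.
remote = DRbObject.new_with_uri(service_uri)
remote.add('http://www.blahsite.com/', 'Welcome to the public documents')
hits = remote.search('documents')

DRb.stop_service
```

In a real deployment the server half would live in its own long-running script that ends with `DRb.thread.join`, and clients would connect to a fixed, agreed-upon `druby://` URI.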