Hi,

I've been looking for some search engine code for Ruby or Rails in
particular. I need a full text search engine w/Google like operators
that iterates over a document store (*.html) in a single folder.
I don't need any Model stuff or DB interaction.

I've looked at
rubylucene : no released code at the forge yet.
RICE : version 0.1.1 - doesn't pass all the tests, and little documentation, esp sample code or how to integrate w/rails.
SearchGenerator : Only works with Models
SimpleSearch : might be what I am looking for, but haven't grokked the API

As for the last one, I've searched for an example w/rails with no luck so far. RICE may actually work, if I implement it, but the learning curve is steep with so little in the way of docs. The Lucene docs may be a help.

Any other recommendations? How does RuWiki implement search?

Thanks
Ed

On 9/26/05, Ed Howland <ed.howland-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> Hi,
>
> I've been looking for some search engine code for Ruby or Rails in
> particular. I need a full text search engine w/Google like operators
> that iterates over a document store (*.html) in a single folder.
> I don't need any Model stuff or DB interaction.

No operators, but it shouldn't be hard to extend this.

    regex = /search term/i  # placeholder: the pattern to look for

    # Read every HTML file in the current folder and report the matches.
    Dir["*.html"].each do |file_name|
      if File.read(file_name).match(regex)
        puts "file matched: #{ file_name }"
      end
    end

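One way the scan above might be extended with a few Google-like operators is sketched here. The operator syntax (a bare word must appear, a leading `-` excludes a term, double quotes mark a phrase) and the names `parse_query`, `matches?` and `search_dir` are assumptions made for illustration. They do not come from any of the libraries mentioned in this thread.

```ruby
# Hypothetical helper: split a query into required and excluded terms.
# Assumed syntax for this sketch:
#   word        must appear
#   -word       must not appear
#   "a phrase"  must appear verbatim
def parse_query(query)
  required = []
  excluded = []
  # Each token is an optional "-", then a quoted phrase or a bare word.
  query.scan(/(-?)(?:"([^"]+)"|(\S+))/) do |minus, phrase, word|
    term = (phrase || word).downcase
    (minus == "-" ? excluded : required) << term
  end
  [required, excluded]
end

# True when the text contains every required term and no excluded term.
# Matching is a case-insensitive substring test, so "java" also matches
# "javascript"; HTML tags are not stripped.
def matches?(text, required, excluded)
  haystack = text.downcase
  required.all? { |term| haystack.include?(term) } &&
    excluded.none? { |term| haystack.include?(term) }
end

# Scan every *.html file in dir and return the names of the matching files.
def search_dir(dir, query)
  required, excluded = parse_query(query)
  Dir[File.join(dir, "*.html")].sort.select do |file_name|
    matches?(File.read(file_name), required, excluded)
  end
end
```

Calling `search_dir(".", 'ruby "full text" -java')` would list the files in the current folder that mention "ruby" and the phrase "full text" but not "java". Like the original loop, this still reads every file on every query.
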
Hi !

Ed Howland said the following on 2005-09-26 23:33:
> Any other recommendations? How does RuWiki implement search?

What you want to look at is Ruby/Odeum. There was discussion on this list a while back - search the archives for Zed E. Shaw's comments on 2005-05-30.

Hope that helps !
François

On 9/26/05, Ed Howland <ed.howland-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> On 9/26/05, Joe Van Dyk <joevandyk-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> >
> > No operators, but it shouldn't be hard to extend this.
> >
> > Dir["*.html"].each do |file_name|
> >   if File.read(file_name).match(regex)
> >     puts "file matched: #{ file_name }"
> >   end
> > end
>
> Nice, but I have > 1000 HTML files ~10-50k each (probably a lot of tag
> weightiness, tho.) so I think this would be a burden in a heavily used
> web site.
>
> I'd prefer an inverted index. Following the YAGNI principle, I thought
> a simple one would suffice. Parse files, collect and count unique
> words in a set, serialize these to the file store. When querying, read
> only word sets matching search terms. Apply set operations as
> specified in the search phrase. Order the result collection by
> weighted word freqs. Display 10 results.
>
> In my case, I have the doc id so I can pull up each file and apply the
> regex pattern with some surrounding lines.

Personally, I'd write my own Search API. Make it dumb and slow (i.e. what I wrote) and then make it faster, trying to keep the API the same.

But yeah, some sort of hashing or index would work fine. Check out the instiki source. I believe its search method is much like mine (except its pages are kept in memory, so it's a little quicker).

Just as a data point though, I can search through 5000 files that each contain 5000 lines of crap in 0.5 seconds using my really stupid slow search method. Searching through 1000 files with 1000 lines of crap takes 0.035 seconds. So, if that's not fast enough, then optimize away!

Joe

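The inverted index Ed outlines in the quoted text above could look roughly like the following. This is a minimal sketch under several assumptions: the class name `TinyIndex` and its methods are invented for illustration, tags are stripped with a crude regex, the score is a plain sum of word counts instead of a real weighting, and the index lives in memory only (the serialization step Ed mentions is left out).

```ruby
require "set"

# Hypothetical in-memory inverted index: word => { doc_id => count }.
class TinyIndex
  def initialize
    @postings = Hash.new { |hash, word| hash[word] = Hash.new(0) }
  end

  # Crudely strip tags, then count every word of the document.
  def add(doc_id, html)
    text = html.gsub(/<[^>]*>/, " ").downcase
    text.scan(/[a-z0-9]+/) { |word| @postings[word][doc_id] += 1 }
  end

  # Return up to `limit` doc ids that contain every required term and
  # none of the excluded terms, highest total word count first.
  def search(required, excluded = [], limit = 10)
    required = required.map(&:downcase)
    return [] if required.empty?

    # Set intersection implements AND; set difference implements NOT.
    hits = required.map { |term| docs_for(term) }.reduce(:&)
    excluded.each { |term| hits -= docs_for(term.downcase) }

    scored = hits.map do |doc|
      [doc, required.sum { |term| @postings[term][doc] }]
    end
    # Ties are broken by doc id so the order is deterministic.
    scored.sort_by { |doc, score| [-score, doc] }.first(limit).map(&:first)
  end

  private

  # fetch avoids creating an empty entry for a word that was never indexed.
  def docs_for(word)
    Set.new(@postings.fetch(word, {}).keys)
  end
end
```

Usage would be along the lines of `index = TinyIndex.new`, one `index.add(file_name, File.read(file_name))` call per file, then `index.search(["ruby", "search"], ["java"])`. Only the posting lists for the query terms are touched at query time, which is the point of the design Ed describes.
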
On Sep 26, 2005, at 11:33 PM, Ed Howland wrote:
> Hi,
>
> I've been looking for some search engine code for Ruby or Rails in
> particular. I need a full text search engine w/Google like operators
> that iterates over a document store (*.html) in a single folder.
> I don't need any Model stuff or DB interaction.
>
> I've looked at
> rubylucene : no released code at the forge yet.

*sigh* - if only I could clone myself, RubyLucene would exist. There have been a few efforts towards a port of Lucene to Ruby, but nothing to show for it yet. There is a discussion list at ruby-dev-PPu3vs9EauNd/SJB6HiN2Ni2O/JbrIOy@public.gmane.org (use ruby-dev-subscribe-PPu3vs9EauNd/SJB6HiN2Ni2O/JbrIOy@public.gmane.org to subscribe).

I've been indexing content with Java Lucene, and exposing a Java-based search server interface via XML-RPC. This works extremely well with Ruby on Rails - very fast and pretty clean.

While the other suggestions of creating your own quick and easy search interface are certainly reasonable if your needs are simple, it won't be so trivial to scale them and support sophisticated boolean queries. So use Ruby/Odeum or Java Lucene remotely - those are my recommendations.

Erik

Erik Hatcher wrote:
> I've been indexing content with Java Lucene, and exposing a Java-based
> search server interface via XML-RPC. This works extremely well with
> Ruby on Rails - very fast and pretty clean.

I know there are lies, damn lies and benchmarks, but what sort of query rates do you get from that? And how fast does Lucene chew up new documents?

--
Alex

On Sep 27, 2005, at 8:33 AM, Alex Young wrote:
> Erik Hatcher wrote:
>
>> I've been indexing content with Java Lucene, and exposing a Java-based
>> search server interface via XML-RPC. This works extremely well with
>> Ruby on Rails - very fast and pretty clean.
>
> I know there are lies, damn lies and benchmarks, but what sort of query
> rates do you get from that? And how fast does Lucene chew up new
> documents?

I haven't pushed it, and my application will not have Google-like demands on it, so it performs well enough for my purposes :)

As for Lucene - it is FAST on indexing and searching. No worries there whatsoever. If you've got text, Lucene can index and search it in extremely powerful ways. Disclaimer: I'm co-author of "Lucene in Action", so I'm of course very biased. I've presented Lucene to numerous audiences and I've yet to have someone tell me it doesn't scale or cannot handle their data sets.

http://www.lucenebook.com

Erik

On Sep 27, 2005, at 8:17 AM, Erik Hatcher wrote:
> *sigh* - if only I could clone myself, RubyLucene would exist.
> There have been a few efforts towards a port of Lucene to Ruby, but
> nothing to show for it yet. There is a discussion list at
> ruby-dev-PPu3vs9EauNd/SJB6HiN2Ni2O/JbrIOy@public.gmane.org (use
> ruby-dev-subscribe-PPu3vs9EauNd/SJB6HiN2Ni2O/JbrIOy@public.gmane.org to
> subscribe).

Yep, Andi is a saint for getting pylucene working, gcj is a royal... entertainment.

I vote for cloning Erik, personally.

-Brian