hi all,

anyone know of a web spider in ruby?

If not, we're buildin' one I think... it will require well formed XHTML to make our job easy, and we'll use Ruby/Odeum :)

thanks,

_alex

--
alex black, founder
the turing studio, inc.
510.666.0074
root-16h2cdTTKgpzNNFeSAH1EA@public.gmane.org
http://www.turingstudio.com
2600 10th street, suite 635
berkeley, ca 94710
I am working on one that will be open-source. There should be a release in the next couple weeks.

Lucas Carlson
http://tech.rufy.com/
On Jun 2, 2005, at 7:49 PM, alex black wrote:
> anyone know of a web spider in ruby?
>
> If not, we're buildin' one I think... it will require well formed
> XHTML to make our job easy, and we'll use Ruby/Odeum :)

Well-formed XHTML, eh? You'll be limiting your spidering to a pretty small subset of the web.

Since Zed has already proven Ruby/Odeum to be slower than Lucene *grin*, you might want to consider using Nutch. It already scales to hundreds of millions of pages, and over the summer it will be benchmarked scaling to billions of pages using tens of machines (according to the main developer). Nutch uses Lucene under the covers.

Nutch exposes its searching capabilities through the OpenSearch web service (http://opensearch.a9.com/), so you could easily wrap a Ruby program around it.

Hey, I love Ruby as much as all of you, but let's not forget to be pragmatic about which technologies we pick. If a pure Ruby crawler/search engine can beat Nutch/Lucene, I'm humble enough to switch... but I won't be holding my breath.

Erik
I know there is a Perl port of Lucene (called Plucene IIRC) which might not be too hard to work with. Failing that, some kind soul could always choose to port it!

--
sam
http://www.magpiebrain.com/
There is a Ruby port in progress: rucene. It's on RubyForge, but as far as I know it has a ways to go before it's usable.

Erik, know what the current status is?

--
-Kate (masukomi-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org)
On Jun 5, 2005, at 8:19 PM, kate rhodes wrote:
> there is a ruby port in progress. rucene
> It's on Rubyforge but as far as i know has a ways to go before it's
> usable.
>
> Erik, know what the current status is?

Why, I do happen to know the status, since I was the creator of that particular RubyForge project myself and the sole committer there :)

I started it up a long while ago when I first got jazzed on Ruby, and created some low-level Lucene index I/O code (very minimal), including unit tests that showed it worked with indexes written from Java. After seeing how rapidly Java Lucene evolves, I lost interest in a pure Ruby port and saw that PyLucene with its GCJ/SWIG techniques was really the right way to go.

rucene was renamed to RubyLucene at RubyForge because the former name seemed to some (though it was not intended so) to be poking fun.

I'm encouraging anyone who has anything to contribute to a RubyLucene effort to join us at ruby-dev-PPu3vs9EauNd/SJB6HiN2Ni2O/JbrIOy@public.gmane.org - I personally don't have the time or the native C/make/GCJ/SWIG know-how to do it myself effectively.

Erik
> I am working on one that will be open-source. There should be a
> release in the next couple weeks.

Groovy ;)

_a
> Well-formed XHTML, eh? You'll be limiting your spidering to a pretty
> small subset of the web.

heh. Yes, I am. Sites I build for clients - all of my search needs are 'internal' - that is, within a single web application.

> Nutch exposes its searching capabilities through the OpenSearch web
> service (http://opensearch.a9.com/) so you could easily wrap a Ruby
> program around it.

Does Nutch have a SOAP API?

_a
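For the 'internal' case described above (well-formed XHTML within a single web application), the link-following step of a spider could be sketched roughly as below. This is only an illustration, not anyone's actual crawler: the method name `extract_internal_links` is made up, REXML is used as the strict XML parser, and "internal" is assumed to mean "same host as the page being parsed".

```ruby
require 'rexml/document'
require 'uri'

# Hypothetical helper: returns the absolute, same-host link URLs found in a
# well-formed XHTML page, or nil if the page does not parse as XML.
def extract_internal_links(xhtml, base_url)
  base_host = URI.parse(base_url).host
  doc = REXML::Document.new(xhtml) # raises on markup that is not well formed
  links = []

  # Walk every element in document order and pick out the anchors.
  doc.each_recursive do |el|
    next unless el.name == 'a'
    href = el.attributes['href']
    next if href.nil? || href.empty?

    begin
      target = URI.join(base_url, href) # resolve relative links
    rescue URI::Error
      next # skip hrefs that are not valid URIs
    end

    next unless target.host == base_host # assumption: internal == same host
    target.fragment = nil                # "#top" etc. is the same page
    links << target.to_s
  end

  links.uniq
rescue REXML::ParseException
  nil
end
```

A crawler loop would call this on each fetched page and queue whatever comes back; a nil result is the price of insisting on well-formed markup.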
On Jun 8, 2005, at 4:52 AM, alex black wrote:
> Does nutch have a SOAP api?

Not currently. But the OpenSearch API is basically RSS and could easily be consumed by Ruby. I have just done some Ruby-to-Kowari (www.kowari.org) communication over SOAP, so I know that sort of thing is even more trivial. But don't let the current lack of built-in SOAP support deter you. Perhaps simply nudging with a feature request and some e-mail list prodding would be enough to get it implemented? It'd be a pretty easy add-on to implement. (Maybe someone has already done it, even, but not contributed it back yet?)

Erik
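Since an OpenSearch response is basically RSS, consuming one from Ruby might look something like the sketch below. The sample feed is invented for illustration (it is not real Nutch output), fetching it over HTTP is left out because the endpoint URL depends on the installation, and `parse_search_results` and `SearchHit` are hypothetical names. The namespace URI is the one used by the OpenSearch RSS 1.0 extension as far as I know; adjust it if your server emits a different version.

```ruby
require 'rexml/document'

# Assumed namespace of the OpenSearch RSS 1.0 extension elements.
OPENSEARCH_NS = 'http://a9.com/-/spec/opensearchrss/1.0/'

# Invented sample response, standing in for what a search server would return.
SAMPLE_RSS = <<~XML
  <rss version="2.0" xmlns:openSearch="http://a9.com/-/spec/opensearchrss/1.0/">
    <channel>
      <title>Search results for: ruby</title>
      <openSearch:totalResults>2</openSearch:totalResults>
      <item>
        <title>First hit</title>
        <link>http://example.com/a</link>
        <description>Summary of the first hit</description>
      </item>
      <item>
        <title>Second hit</title>
        <link>http://example.com/b</link>
        <description>Summary of the second hit</description>
      </item>
    </channel>
  </rss>
XML

SearchHit = Struct.new(:title, :link, :description)

# Hypothetical helper: returns [total_results, hits] from an OpenSearch RSS
# document. total_results is nil when the feed omits the element.
def parse_search_results(rss_xml)
  doc = REXML::Document.new(rss_xml)

  total_el = REXML::XPath.first(doc, '/rss/channel/openSearch:totalResults',
                                'openSearch' => OPENSEARCH_NS)
  total = total_el ? total_el.text.to_i : nil

  hits = REXML::XPath.match(doc, '/rss/channel/item').map do |item|
    SearchHit.new(item.elements['title']&.text,
                  item.elements['link']&.text,
                  item.elements['description']&.text)
  end

  [total, hits]
end
```

Calling `parse_search_results(SAMPLE_RSS)` gives back the total count plus one `SearchHit` per item, which is about all a Rails view would need to render a results page.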
In case you didn't know, Brian McCallister has received the Ruby codefest grant to do bindings to Lucene. See http://kasparov.skife.org/blog-live/src/ruby/codefest-grant.writeback and http://kasparov.skife.org/blog/src/ruby/gcj-osx.html