Hi all,
On a topic related to Web spidering, I would like to solve the problem
of internally spidering and saving a Rails application from within
Rails.
The goal is to spider and save each resource in a site to a
filesystem-friendly structure:
/many/things/1?page=3 -> many_things_1_page_3.html
or /many/things/1?page=3 -> many/things/1_page_3.html
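The flattening scheme above could be done with a small helper; this is just a sketch (the name "flat_filename" and the underscore-joining rule are my assumptions, not an established API):

```ruby
require 'uri'

# Hypothetical helper: flatten a site-relative URL into a
# filesystem-friendly filename, e.g.
#   "/many/things/1?page=3" -> "many_things_1_page_3.html"
def flat_filename(url)
  uri = URI.parse(url)
  # Break the path into segments, dropping the leading empty segment.
  parts = uri.path.split('/').reject { |s| s.empty? }
  # Append query key/value pairs as further segments.
  if uri.query
    uri.query.split('&').each { |pair| parts.concat(pair.split('=')) }
  end
  parts.join('_') + '.html'
end

flat_filename('/many/things/1?page=3')  # => "many_things_1_page_3.html"
```

The nested variant (many/things/1_page_3.html) would only differ in joining all but the last path segment with "/" instead of "_".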
If an action issues a redirect, hash and follow the redirected
resource, serializing it if it hasn't been seen before.
Assuming that all links are generated and indexable through an API
such as "url_for", the links could be rewritten to a flat hierarchy,
and internal URLs could easily be hashed and queued for spidering the
next resources.
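The hash-and-follow loop could look something like the sketch below. The fetch callback is an assumption standing in for the open question that follows (in a real app it would issue an internal request and hand back a response-like hash with :status, :body, :location, and :links); the point here is just the visited-set and redirect-following bookkeeping:

```ruby
require 'set'

# Sketch of the spidering loop. `fetch` is a stand-in (an assumption)
# for however an internal request ends up being made; it takes a URL
# and returns a hash with :status, :body, :location (for redirects),
# and :links (internal URLs found in the page).
def spider(start_url, fetch)
  visited = Set.new
  queue   = [start_url]
  pages   = {}

  until queue.empty?
    url = queue.shift
    next if visited.include?(url)   # hash: skip already-seen resources
    visited << url

    response = fetch.call(url)
    if response[:status] == 302
      # Follow the redirect instead of serializing this URL.
      queue << response[:location]
    else
      pages[url] = response[:body]           # serialize the response
      queue.concat(response[:links] || [])   # queue internal links
    end
  end
  pages
end

# Usage with a stubbed-out site in place of real internal requests:
site = {
  '/'    => { :status => 200, :body => 'home', :links => ['/old', '/a'] },
  '/old' => { :status => 302, :location => '/a' },
  '/a'   => { :status => 200, :body => 'a', :links => [] }
}
spider('/', lambda { |url| site[url] })  # => {"/"=>"home", "/a"=>"a"}
```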
What are some ideas for simulating a web request to process the
internal URLs?
I've looked at modifying a TestRequest, passing instances to
ActionController::Base.process, and serializing the response.body.
Another option I see is to hook into the Routing module to find and
route each URL returned from "url_for", then call "render_component"
with the parameters found from the route, but it seems it would be
better to run each controller in its own request from start to
finish.
I'd love to hear your ideas and any foreseen complications, or
whether there are other tools that could easily accomplish these
goals.
Sean