Hi all,
On a topic related to Web spidering, I would like to solve the problem
of internally spidering and saving a Rails application from within
Rails.
The goal is to spider and save each resource in a site to a
filesystem-friendly structure:
/many/things/1?page=3 -> many_things_1_page_3.html
or /many/things/1?page=3 -> many/things/1_page_3.html
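The flattening scheme above could be done with a small helper; this is just a sketch (the name "flat_filename" and the underscore-joining rule are my assumptions, not an established API):

```ruby
require 'uri'

# Hypothetical helper: flatten a site-relative URL into a
# filesystem-friendly filename, e.g.
#   "/many/things/1?page=3" -> "many_things_1_page_3.html"
def flat_filename(url)
  uri = URI.parse(url)
  # Break the path into segments, dropping the leading empty segment.
  parts = uri.path.split('/').reject { |s| s.empty? }
  # Append query key/value pairs as further segments.
  if uri.query
    uri.query.split('&').each { |pair| parts.concat(pair.split('=')) }
  end
  parts.join('_') + '.html'
end

flat_filename('/many/things/1?page=3')  # => "many_things_1_page_3.html"
```

The nested variant (many/things/1_page_3.html) would only differ in joining all but the last path segment with "/" instead of "_".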
If an action issues a redirect, hash and follow the redirected
resource, serializing it if it hasn't been seen before.
Assuming that all links are generated and indexable through an API
such as "url_for", the links could be rewritten to a flat hierarchy,
and internal URLs could easily be hashed and queued for spidering the
next resources.
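The hash-and-follow loop could look something like the sketch below. The fetch callback is an assumption standing in for the open question that follows (in a real app it would issue an internal request and hand back a response-like hash with :status, :body, :location, and :links); the point here is just the visited-set and redirect-following bookkeeping:

```ruby
require 'set'

# Sketch of the spidering loop. `fetch` is a stand-in (an assumption)
# for however an internal request ends up being made; it takes a URL
# and returns a hash with :status, :body, :location (for redirects),
# and :links (internal URLs found in the page).
def spider(start_url, fetch)
  visited = Set.new
  queue   = [start_url]
  pages   = {}

  until queue.empty?
    url = queue.shift
    next if visited.include?(url)   # hash: skip already-seen resources
    visited << url

    response = fetch.call(url)
    if response[:status] == 302
      # Follow the redirect instead of serializing this URL.
      queue << response[:location]
    else
      pages[url] = response[:body]           # serialize the response
      queue.concat(response[:links] || [])   # queue internal links
    end
  end
  pages
end

# Usage with a stubbed-out site in place of real internal requests:
site = {
  '/'    => { :status => 200, :body => 'home', :links => ['/old', '/a'] },
  '/old' => { :status => 302, :location => '/a' },
  '/a'   => { :status => 200, :body => 'a', :links => [] }
}
spider('/', lambda { |url| site[url] })  # => {"/"=>"home", "/a"=>"a"}
```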
What are some ideas for simulating a web request to process the
internal URLs?
I've looked at modifying a TestRequest, passing instances to
ActionController::Base.process, and serializing the response.body.
Another option I see is to hook into the Routing module to find and
route each URL returned from "url_for", then call "render_component"
with the parameters found from the route, but it seems it would be
better to run each controller in its own request from start to
finish.
I'd love to hear your ideas and any foreseen complications, or
whether there are other tools that could easily accomplish these
goals.
Sean