Cc'ing to the list for archival purposes:
On Tue, Mar 25, 2008 at 7:55 PM, Brian Noguchi <brian.noguchi at gmail.com> wrote:
> Hi Hemant,
>
> I'm Brian Noguchi, a developer in the Bay Area. I have some questions about
> backgroundrb, and I found your contact info on a forum. I figured it's
> probably best to get answers straight from the source.
>
> First of all, thanks for your work on backgroundrb! I've heard nothing but
> great things about the newest version. I'm looking forward to incorporating
> it into my site.
>
> I had several questions regarding implementing some features on my site
> using backgroundrb. If you could help guide me in any way with any of
> these, that would be great!
>
> Background: I'm trying to write a series of web crawler tasks. This is my
> first time writing a robust web crawler.
>
> A new web crawler task is initiated whenever a user decides to track
> information from a new site. Upon initialization by the user, that web
> crawler is supposed to run using backgroundrb and then (1) save the
> information to the db and (2) periodically provide data back to the view
> either with xml or json that displays the contents of its crawl thus far.
Great!
>
> After the web crawler runs once, it is then scheduled to run periodically on
> a daily basis, saving information to the db but not generating any xml or
> json to send back to the view.
> The questions I have:
>
> Is the following the best/most scalable way to implement?...Each site I am
> crawling gets its own worker -- e.g., MyspaceWorker. Within each worker, I
> have a crawl method that uses concurrency to avoid latency when crawling one
> set of several web pages within one website. If 2 users decide to track
> two different sets of pages from a given website, then I declare two new
> instances of MyspaceWorker. And so on and so forth.
Having one worker for each website is OK. But "If 2 users decide to track
two different sets of pages from a given website, then I declare two new
instances of MyspaceWorker" is BAD, because if you have 100 users
tracking the same website, you will end up with 100 worker instances
running. That's where the thread_pool comes into the picture. Read below
for more details.
>
> How do I provide json or xml back to the view if I'm using workers? It's
> important for the UI to show crawled content periodically to show a more
> detailed progress "indicator" in the view.
Just generate the xml/json, save it as the status/result object of the
worker, and then query the data back from rails using the worker's
ask_status method.
>
> In one of your posts, you mention:
> " When you are processing too many tasks from rails, you should use
inbuilt
> thread pool, rather than firing new workers"
> ...We are planning to have 100s of web crawlers being initiated and thus
> periodically scheduled to run. I'm assuming I should use the inbuilt thread
> pool. But does this mean that the workers are running in parallel as
> threads no matter the worker type? Or that the instances of each worker are
> run in parallel for one given worker type? What exactly is being threaded?
> I'm not very familiar with the event model of network programming that you
> mention; I looked into it, but am a little confused when it comes to
> figuring out exactly how the workers and everything works from a network
> programming theory point of view. Any clarification with regards to this
> issue or direction to some resources would be greatly appreciated. I hope
> you'll have time to address some or all of these questions. Hopefully, they
> will not take too long to answer. Sorry for the inconvenience. Thanks for
> the plugin and the help. I hope I can get my ruby-fu to the point where I
> can contribute back to the community in a similar fashion in the future.
As I said earlier in this mail, having one worker for each user
tracking the same website is bad. What you need instead is: when 2 users
are tracking the same website but different pages, you can have 2 threads
within the same worker. Let me explain:
Say user x wants to track the /social_revolution page of myspace and user
y wants to track the /facebook_suks page of myspace. As you mentioned
earlier, tracking itself will be invoked from rails, hence, assuming
our myspace worker is already started, we will send the following data to
the worker:
MiddleMan.worker(:myspace_worker).crawl_page(:page_name => "social_revolution", :user_id => x)

Where crawl_page is a method inside MyspaceWorker:
class MyspaceWorker < BackgrounDRb::MetaWorker
  def create
    @crawled_data = {}
    @data_mutex = Mutex.new
  end

  def crawl_page(options = {})
    thread_pool.defer(options) { start_crawling(options) }
  end

  def start_crawling(options)
    # do your page crawling here
    result = magical_crawl(options[:page_name])
    # multiple threads may write to the same hash, so let's protect it with a mutex
    @data_mutex.synchronize do
      @crawled_data[options[:user_id]] = result
    end
    # save as the status hash, so that it can be accessed later from rails
    register_status(@crawled_data)
  end
end
Similarly for the other user. So what is being threaded here is your
crawling task, and now our worker can do concurrent crawling for many
users. You can get the generated xml/json back in rails using:

result = MiddleMan.worker(:myspace_worker).ask_status[:whatever_user_id]
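To push that into the view, a small controller action is enough. A minimal sketch (the controller name, action name and current_user are just placeholders for whatever your app uses; the key lookup assumes :user_id passed to crawl_page matches your user's id):

class CrawlsController < ApplicationController
  def crawl_progress
    # ask_status returns whatever hash the worker last registered via register_status
    status = MiddleMan.worker(:myspace_worker).ask_status || {}
    # current_user.id is assumed to be the same id you passed as :user_id to crawl_page
    render :json => (status[current_user.id] || {}).to_json
  end
end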
You mentioned that you also need to schedule crawling for periodic
execution. For that case, you should probably separate the actual crawling
into its own method and set that method to execute periodically.
There are a couple of finer points that need to be kept in mind:
1. Crawled xml/json can be huge, hence consider using memcache for
result storage. BackgrounDRb supports clustered memcache-based result
storage.
2. You may be using the same instance of MyspaceWorker for crawling pages
on demand as well as for scheduled crawling. Usually scheduled tasks
are not threaded. Say you have scheduled crawling of the pages
preferred by users x, y and z; then while the worker is processing that
task (that is, while it is doing actual crawling, not sitting idle
waiting for the schedule), it won't be able to respond to requests from
rails immediately. You can mitigate this easily by adding scheduled tasks
to the thread pool as well. What I mean is, say the method "scheduled_crawling"
initiates scheduled crawling of the pages preferred by all users of the
current website. You can schedule "scheduled_crawling" through the cron
scheduler or the normal add_periodic_timer method. Now, since you do not
want scheduled processing to block the rest of the request/response cycle,
you can do:
def scheduled_crawling
  pages = find_pages_to_crawl
  thread_pool.defer(pages) { |pages| crawl_without_xml_json(pages) }
end
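If you go the add_periodic_timer route, a minimal sketch is to set the timer in create (the 24-hour interval here is just an example, use whatever suits you):

def create
  @crawled_data = {}
  @data_mutex = Mutex.new
  # kick off the scheduled crawl once a day
  add_periodic_timer(24 * 60 * 60) { scheduled_crawling }
end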
All in all, it's a bit tricky, hence all the best.