Neil Mock
2008-Jun-10 23:56 UTC
[Backgroundrb-devel] adding results from threads to a collection and returning it
Forgive me if this has been addressed somewhere, but I have searched and can't come up with anything.

I am basically trying to distribute several web page scraping tasks among different threads and have the results from each added to an Array, which is ultimately returned by the backgroundrb worker. Here is an example of what I'm trying to do in a worker method:

  pages = Array.new

  pages_to_scrape.each do |url|
    thread_pool.defer(url) do |url|
      begin
        # model object performs the scraping
        page = ScrapedPage.new(url)
        pages << page
      rescue
        logger.info "page scrape failed"
      end
    end
  end

  return pages

From monitoring the backgroundrb logs, it appears that all of the pages are completed successfully in the threads. However, the array that is returned is empty. This is to be expected, I suppose, because the threads don't complete before the array is returned, but my question is: how can I make the worker wait to return the array only when all of the threads are complete?

Thanks!
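The blocking behaviour being asked for here can be sketched with plain Ruby threads: a Mutex guards the shared array, and Thread#join waits for every scrape before returning. This is a general-purpose illustration, not backgroundrb's API; it assumes ScrapedPage scrapes its URL on construction, as in the snippet above.

  require 'thread'

  # General sketch, not backgroundrb's API. Assumes ScrapedPage scrapes
  # its URL on construction, as in the original snippet.
  def scrape_all(pages_to_scrape)
    pages = []
    lock  = Mutex.new

    threads = pages_to_scrape.map do |url|
      Thread.new(url) do |u|
        begin
          page = ScrapedPage.new(u)          # assumed scraping model
          lock.synchronize { pages << page } # guard the shared array
        rescue => e
          puts "page scrape failed: #{e.message}"
        end
      end
    end

    threads.each(&:join) # block until every scrape has finished
    pages                # now safe to return: all threads are done
  end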
hemant
2008-Jun-11 03:35 UTC
[Backgroundrb-devel] adding results from threads to a collection and returning it
On Wed, Jun 11, 2008 at 5:26 AM, Neil Mock <neilmock at gmail.com> wrote:
> [snip code example]
>
> From monitoring the backgroundrb logs, it appears that all of the pages are
> completed successfully in the threads. However, the array that is returned
> is empty. This is to be expected, I suppose, because the threads don't
> complete before the array is returned, but my question is: how can I make
> the worker wait to return the array only when all of the threads are
> complete?

Actually, you are doing a couple of things wrong. First, you are accessing a variable created outside the thread_pool from inside the pool, and hence you have thread-unsafe code, which can cause anything from deadlocks to random crashes. Thread pools are for running concurrent tasks in the background without any reporting; they are a fire-and-forget kind of thing.

However, I am contemplating a change in the behaviour of thread pools which will perhaps enable what you want, so unless your need is dire, please don't use thread pools as in the above snippet.
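A minimal sketch of the fire-and-forget style described here: each deferred task is self-contained and persists its own result instead of appending to a shared array, so no mutable state crosses thread boundaries. It assumes ScrapedPage is an ActiveRecord-style model that scrapes on construction and can save itself.

  # Fire-and-forget sketch: nothing is returned from the worker method;
  # each task owns its result end to end. Assumes ScrapedPage is an
  # ActiveRecord-style model that scrapes on construction.
  def scrape_pages(args = nil)
    pages_to_scrape.each do |url|
      thread_pool.defer(url) do |u|
        begin
          ScrapedPage.new(u).save # persist inside the task itself
        rescue
          logger.info "page scrape failed for #{u}"
        end
      end
    end
    # results are read back from the database later, not returned here
  end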
hemant
2008-Jun-12 19:54 UTC
[Backgroundrb-devel] adding results from threads to a collection and returning it
On Wed, Jun 11, 2008 at 5:26 AM, Neil Mock <neilmock at gmail.com> wrote:
> [snip]

Neil,

I have a solution for you in the git version:

http://gnufied.org/2008/06/12/unthreaded-threads-of-hobbiton/
Stevie Clifton
2008-Jun-13 08:45 UTC
[Backgroundrb-devel] adding results from threads to a collection and returning it
Hey Hemant,

A couple of questions about fetch_parallely:

1) Does it operate in the same way as thread_pool.defer, where the number of concurrent threads is limited by :pool_size?

2) Why did you choose to introduce another method instead of providing a thread-safe register_status? (More out of curiosity than anything else -- in my code I've overridden register_status to use a Mutex, and am wondering what the benefit of fetch_parallely would be over this.)

Thanks!
stevie

On Thu, Jun 12, 2008 at 3:54 PM, hemant <gethemant at gmail.com> wrote:
> [snip]
>
> I have a solution for you in the git version:
>
> http://gnufied.org/2008/06/12/unthreaded-threads-of-hobbiton/
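The override Stevie mentions might look something like the sketch below. This is a hypothetical reconstruction of his code, not backgroundrb's implementation; it assumes register_status takes a single status argument and that create is the worker's initialization hook.

  # Hypothetical reconstruction of the Mutex-wrapped register_status
  # Stevie describes; the real backgroundrb method may differ.
  class ScraperWorker < BackgrounDRb::MetaWorker
    set_worker_name :scraper_worker

    def create(args = nil)
      @status_mutex = Mutex.new # assumed init hook for the worker
    end

    # Serialize concurrent calls arriving from thread_pool.defer blocks
    # so writes to the worker's status cannot interleave.
    def register_status(status)
      @status_mutex.synchronize { super }
    end
  end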
hemant
2008-Jun-13 10:33 UTC
[Backgroundrb-devel] adding results from threads to a collection and returning it
On Fri, Jun 13, 2008 at 2:15 PM, Stevie Clifton <stevie at slowbicycle.com> wrote:
> Hey Hemant,
>
> A couple of questions about fetch_parallely:
>
> 1) Does it operate in the same way as thread_pool.defer, where the
> number of concurrent threads is limited by :pool_size?
>
> 2) Why did you choose to introduce another method instead of
> providing a thread-safe register_status? (More out of curiosity than
> anything else -- in my code I've overridden register_status to use a
> Mutex, and am wondering what the benefit of fetch_parallely would be
> over this.)

register_status is going to invoke send_data at one point or another. Sure, I could probably protect the outbound_data instance variable with a mutex, but that is going to slow down the whole operation by a large margin, and it is simply not the point of an event-driven network programming library: it would mean taking the mutex on every single write. What a waste of time that would be!

On the other hand, if we can make sure that we retrieve results from the thread pool in a thread-safe manner and only then invoke send_data, everything is nice and dandy. fetch_parallely does exactly that. The name is a bit dubious; I didn't want to break existing functionality and at the same time wanted to add this. Let me know if you have a better name.
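The pattern hemant describes (pool threads hand results to the single event-loop thread, which alone performs the write) can be sketched in plain Ruby with the standard-library Queue. This is a general illustration of the idea, not fetch_parallely's actual implementation; a puts stands in for send_data.

  require 'thread'

  # General illustration, not backgroundrb's implementation: workers
  # push results onto a thread-safe Queue; only the single event-loop
  # thread pops them and writes, so send_data never needs a mutex.
  results = Queue.new

  workers = 5.times.map do |n|
    Thread.new(n) do |i|
      results << "page-#{i}" # Queue handles its own locking
    end
  end
  workers.each(&:join)

  # drain on the event-loop side; stands in for invoking send_data
  until results.empty?
    puts "send_data(#{results.pop.inspect})"
  end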