Hello, I am trying to build an application that will parse thousands of XML feeds continuously in the background. I really have no idea how to do this "correctly" with Rails. Here is my code:

    class Feed < ActiveRecord::Base
      def parse
        # Parsing using external lib (syndication gem)
      end
    end

    class FeedsController < ApplicationController
      def parse
        feed = Feed.find(params[:id])
        feed.parse
      end
    end

So for now, if I want to parse all my feeds forever, what I have to do is call http://myapp/feeds/1/parse, and then http://myapp/feeds/2/parse ... This is definitely not a good solution!

How can I use Backgroundrb to do this?

Thanks for your help!

--
Julien Genestoux
julien.genestoux at gmail.com
+1 (415) 254 7340
+33 (0)8 70 44 76 29
On Apr 22, 2008, at 7:36 PM, Julien Genestoux wrote:

> So for now, if I want to parse all my feeds forever, what I have to
> do is call http://myapp/feeds/1/parse, and then http://myapp/feeds/2/parse ...
> This is definitely not a good solution!
>
> How can I use Backgroundrb to do this?

1. Use the version of backgroundrb from subversion. The git one was having problems for me.
2. Follow these instructions: http://backgroundrb.rubyforge.org/
3. Then read this: http://backgroundrb.rubyforge.org/rails/index.html
4. Create a worker of your own. Schedule it according to http://backgroundrb.rubyforge.org/scheduling/index.html

adam (a 3 day old user of backgroundrb)
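Step 4 ("create a worker of your own") looks roughly like the sketch below. The `BackgrounDRb::MetaWorker` class here is a minimal stub so the example runs standalone; in a real Rails app the backgroundrb plugin supplies that base class, and the method names (`set_worker_name`, `create`, `add_periodic_timer`) follow the bdrb docs linked above.

```ruby
# Minimal stub of the base class so this sketch runs standalone;
# in a real app the backgroundrb plugin provides the real thing.
module BackgrounDRb
  class MetaWorker
    class << self
      attr_reader :worker_name
      def set_worker_name(name)
        @worker_name = name
      end
    end

    # The stub records (interval, block) pairs instead of scheduling for real.
    def add_periodic_timer(seconds, &block)
      (@timers ||= []) << [seconds, block]
    end

    attr_reader :timers
  end
end

class FeedWorker < BackgrounDRb::MetaWorker
  set_worker_name :feed_worker

  # bdrb calls create once when the worker process starts
  def create(args = nil)
    add_periodic_timer(60) { parse_feeds }
  end

  def parse_feeds
    # look up feeds that are due and parse each one
  end
end
```

The point of the shape: the worker registers a timer once at startup, and bdrb's reactor invokes `parse_feeds` on each tick, so no controller action ever has to drive the parsing.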
Thanks Adam for your help... I still have a few questions: should I have one worker for each feed that is called periodically (add_periodic_timer), or rather one single worker that calls every feed one by one?

What is the best solution, performance-wise?

Thanks again for your help!

Best

On 4/22/08, Adam Williams <adam at thewilliams.ws> wrote:
> 1. Use the version of backgroundrb from subversion. The git one was
> having problems for me. [...]
> 4. Create a worker of your own. Schedule it according to
> http://backgroundrb.rubyforge.org/scheduling/index.html

--
Julien Genestoux
julien.genestoux at gmail.com
http://www.ouvre-boite.com
+1 (415) 254 7340
+33 (0)8 70 44 76 29
On Apr 23, 2008, at 1:07 AM, Julien Genestoux wrote:

> I still have a few questions: should I have one worker for each feed
> that is called periodically (add_periodic_timer) or rather one single
> worker that calls every feed one by one?
>
> What is the best solution, performance-wise?

Good question... I don't suppose I know exactly. I would start by processing all the feeds in one worker invocation - that is what I have done for sending an unknown amount of email. It just seems wrong to me to invoke a worker for one email at a time.

The right answer likely lies in understanding the whole MasterWorker, Packet::Reactor/handler_instance.ask_work bits of the puzzle...

adam
Thanks Adam,

That sounded weird to me as well, to have one worker for each feed... However, if I only have one worker, that also means that I am parsing only one feed at any moment. An option, maybe, is to have a few workers (depending on the number of feeds) that parse feeds concurrently?

If I only have one worker, according to you, what should be the winning strategy to choose the "right" feed to parse? Obviously some feeds need to be parsed once every few minutes, while some others might not need to be parsed more than once every hour...

Any idea/tip on this?

On 4/23/08, Adam Williams <adam at thewilliams.ws> wrote:
> Good question... I don't suppose I know exactly. I would start by
> processing all the feeds in one worker invocation - that is what I
> have done for sending an unknown amount of email. [...]

--
Julien Genestoux
julien.genestoux at gmail.com
http://www.ouvre-boite.com
+1 (415) 254 7340
+33 (0)8 70 44 76 29
Hey Julien/Adam,

There was a great thread about a similar situation about 10 days ago. Check it out here: http://rubyforge.org/pipermail/backgroundrb-devel/2008-April/001681.html

Julien, you definitely don't want a worker for each feed, and you'll want to use thread_pool.defer, which will allow you to concurrently process as many feeds as you want (or as many as your system can handle). From what you've said, it sounds like you'll only need one worker coded up, but you'll probably set multiple periodic timers (e.g. one for hourly parsing of high-priority feeds, one for nightlies, etc.). The method you specify in the periodic timer should use thread_pool.defer to handle processing of multiple feeds at a time -- there's no reason to do them sequentially.

stevie

On Wed, Apr 23, 2008 at 10:30 AM, Julien Genestoux <julien.genestoux at gmail.com> wrote:
> That sounded weird to me as well, to have one worker for each feed...
> However, if I only have one worker, that also means that I am parsing
> only one feed at any moment. [...]
You can use the built-in thread pool to process more than one feed within the same worker. So within the worker, you'd do:

    def parse_feeds
      loop do
        feed = Feed.find_feed_to_process
        thread_pool.defer do
          feed.parse
        end
      end
    end

I think the default pool size is 20. You can control the size of the thread pool using a class-level method; as I recall it is

    pool_size x

Paul

On Wed, Apr 23, 2008 at 7:30 AM, Julien Genestoux <julien.genestoux at gmail.com> wrote:
> That sounded weird to me as well, to have one worker for each feed...
> However, if I only have one worker, that also means that I am parsing
> only one feed at any moment. [...]
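Paul's `thread_pool.defer` can be pictured as a fixed set of threads draining a shared job queue. Here is a standalone plain-Ruby sketch of that behavior; `TinyThreadPool` is a made-up stand-in for illustration, not bdrb's actual implementation.

```ruby
require 'thread'

# A fixed number of threads pull jobs off a shared queue;
# `defer` just enqueues and returns immediately, like bdrb's pool.
class TinyThreadPool
  def initialize(size = 20) # 20 mirrors the default pool size mentioned above
    @queue = Queue.new
    @threads = Array.new(size) do
      Thread.new do
        while (job = @queue.pop) # a nil job is the stop signal
          job.call
        end
      end
    end
  end

  def defer(&job)
    @queue << job
  end

  # Push one nil per thread as a stop signal, then wait for them to drain.
  def shutdown
    @threads.size.times { @queue << nil }
    @threads.each(&:join)
  end
end

# "Parse" ten feeds concurrently with five threads.
pool = TinyThreadPool.new(5)
results = Queue.new
10.times { |i| pool.defer { results << i } }
pool.shutdown
parsed = Array.new(10) { results.pop }.sort
```

The design point is the one Paul is making: the `loop`/`defer` combination decouples *finding* the next feed (one thread) from *parsing* it (up to pool-size threads at once).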
Thanks guys... that's a ton of info! I am definitely going to use the thread_pool... as soon as I can find the documentation ;D

1. For each feed, I define a "frequency" (every minute, every hour, every 30 minutes...) that is updated every time I parse the feed: if the parser returns a "new" element, I increase the frequency (from once per hour to once per 30 min.); if not, I decrease the frequency.

2. I also have a "last_update" field which remembers the time when the feed was last parsed.

3. With 1 & 2, I know how "late" I am to parse a feed... so when I choose my next feed to parse, I always choose the one that is the most "late".

I am not sure Stevie's approach of having multiple tasks for the worker applies here. Actually, I am not even scheduling my worker; I just launch it once, and parse_feeds runs forever (while true do... end).

Also, if I understand Paul's code well, his approach always keeps my worker busy, but doesn't take into account the "lateness" of my feeds.

My idea would be to add/remove workers according to "how late" I am in parsing feeds. If my latest feed is late by more than 10 min, I would add one worker... and if my latest feed is late by less than 5 minutes, I would remove one worker.

Does this approach make sense to you?

Thanks a lot for your help guys...

On 4/23/08, Paul Kmiec <paul.kmiec at appfolio.com> wrote:
> You can use the built-in thread pool to process more than one feed
> within the same worker. [...]
> I think the default pool size is 20.

--
Julien Genestoux
julien.genestoux at gmail.com
http://www.ouvre-boite.com
+1 (415) 254 7340
+33 (0)8 70 44 76 29
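Julien's lateness scheme (points 1-3) can be expressed as a small pure-Ruby sketch. The `Feed` struct and field names here are illustrative stand-ins, not his actual ActiveRecord model.

```ruby
# Illustrative stand-in for the Feed model: each feed has a parse
# frequency in minutes and a timestamp of its last parse.
Feed = Struct.new(:url, :frequency_minutes, :last_update) do
  # Seconds overdue; positive means the feed is "late".
  def lateness(now)
    now - (last_update + frequency_minutes * 60)
  end
end

# Pick the single most-late feed, as in point 3.
def most_late_feed(feeds, now)
  feeds.max_by { |f| f.lateness(now) }
end

now = Time.at(10_000)
feeds = [
  Feed.new('a', 60, Time.at(10_000 - 3600)), # exactly due (lateness 0)
  Feed.new('b', 1,  Time.at(10_000 - 600)),  # 9 minutes late
  Feed.new('c', 30, Time.at(10_000 - 60)),   # not due for another 29 minutes
]
```

With this shape, the adaptive-frequency idea from point 1 is just a matter of mutating `frequency_minutes` after each parse, and the selection in point 3 stays a one-liner.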
Hey Julien,

It sounds like you are planning on using one "long running" feed parsing loop with a do...while. This is exactly the sort of thing you want to avoid in new bdrb, especially if you know you want to do something at discrete time periods -- it totally goes against the twisted paradigm. After thinking about it for a bit, I would recommend setting just one periodic_timer for every minute, and then determining in your parse_feeds method which feeds need to be parsed. If I were you, I wouldn't use last_updated to determine when to parse your feeds -- it adds unnecessary complexity to your system. You can of course save that value for reference, but it's not necessary for your requirements.

In your db you could have a field for every feed called "interval" that would determine the minute intervals at which to parse the feeds. Then every minute when parse_feeds gets called, you could parse every feed with an interval of "1", and then determine based on the current minute in the hour whether or not to try to parse the 15, 30, or 60 minute feeds. And you'll of course want to use thread_pool.defer. So, using Paul's code as a starting point, something like this:

    def parse_feeds
      feeds = Feed.find_feeds_to_process
      feeds.each do |feed|
        thread_pool.defer do
          feed.parse
        end
      end
    end

    class Feed
      def self.find_feeds_to_process
        feeds = []
        [1, 15, 30, 60].each do |interval|
          feeds.concat(find_all_by_interval(interval)) if Time.now.min % interval == 0
        end
        feeds
      end

      def parse
        # parsing code
      end
    end

On my way home yesterday I thought of another sexy addition you could add to this. In the above code, you know that you'll be parsing _every_ feed in your db on the hour, which isn't a very efficient setup. If possible, you want to set it up so that you have an even parsing distribution throughout the hour, so you're not getting hammered. You could add a pretty simple heuristic that would give you a relatively even distribution across the hour by using a hash of the feed url. Along with the url and the interval, save an "offset" value like this example:

    feed = Feed.new
    feed.url = 'my_feed_url'
    feed.interval = 15
    feed.offset = feed.url.hash % 60
    feed.save

Then in find_feeds_to_process, you can do this (untested):

    # the select returns any feed whose interval offset matches
    # the current minute's offset for the same interval
    def self.find_feeds_to_process
      Feed.find(:all).select do |feed|
        [15, 30, 60].detect { |interval| feed.offset % interval == Time.now.min % interval }
      end
    end

Doing a Feed.find(:all) is probably not the best idea if you have a ton of records, so you might want to do multiple db finds to get the same results.

stevie

On Wed, Apr 23, 2008 at 5:46 PM, Julien Genestoux <julien.genestoux at gmail.com> wrote:
> Thanks guys... that's a ton of info! I am definitely going to use the
> thread_pool... as soon as I can find the documentation ;D [...]
> My idea would be to add/remove workers according to "how late" I am
> in parsing feeds. [...]
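Stevie's offset heuristic, extracted into a standalone sketch. One caveat worth adding: in modern Ruby, `String#hash` is seeded per process, so an offset derived from `url.hash` would change across worker restarts; a digest gives a stable value. The `Feed` struct is illustrative.

```ruby
require 'digest/md5'

# Illustrative stand-in for the Feed model.
Feed = Struct.new(:url, :interval, :offset)

# Stable replacement for `url.hash % 60`: String#hash is randomized
# per process in modern Ruby, but an MD5 digest never changes.
def offset_for(url)
  Digest::MD5.hexdigest(url).to_i(16) % 60
end

# A feed is due when its offset lines up with the current minute
# modulo its interval -- the same test as Stevie's `detect`.
def due?(feed, minute)
  feed.offset % feed.interval == minute % feed.interval
end

def feeds_to_process(feeds, minute)
  feeds.select { |f| due?(f, minute) }
end

feeds = [
  Feed.new('a', 15, 7),  # due at minutes 7, 22, 37, 52
  Feed.new('b', 30, 7),  # due at minutes 7 and 37
  Feed.new('c', 60, 40), # due at minute 40
]
```

Because the offsets are spread roughly uniformly over 0..59, the hourly feeds no longer all land on minute 0, which is exactly the "don't get hammered on the hour" property Stevie describes.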
Thanks a lot for this very helpful answer. I implemented a solution very similar to yours and it runs, but I have 2 big problems.

The first one is "throughput". If I have a periodic timer of 1 minute, I can only parse 20 (the number of threads) feeds per minute, which leads to 1200 per hour (since I want to parse a feed at least once every hour). The problem is that I really need to be able to parse at least 10 times this number of feeds... and probably closer to 100k! What if I increase the number of threads? Will I be able to parse more feeds?

The second one is actually a lot worse. I've had my system running for a little more than a day... without monitoring it, and well, this morning, everything was "down". I did a "ps aux" and here is what I got:

    USER PID   %CPU %MEM VSZ     RSS     TTY STAT START TIME  COMMAND
    root 21697 0.0  0.8  32524   15620   ?   D    Apr27 0:13  ruby /mnt/app/current/script/backgroundrb start -e production
    root 21698 0.0  0.2  32504   4736    ?   D    Apr27 0:08  ruby log_worker
    root 21699 1.1  90.5 2170872 1576364 ?   D    Apr27 25:58 ruby parser_worker

As you can see, my parser_worker is consuming a little over 1.5 GB of RAM: wayyyy too much ;) It seems that the vars are not destroyed in my worker? Any idea of what's wrong?

Thanks a lot once again for your help!

Best,

On 4/25/08, Stevie Clifton <stevie at slowbicycle.com> wrote:
> It sounds like you are planning on using one "long running" feed
> parsing loop with a do...while. This is exactly the sort of thing you
> want to avoid in new bdrb, especially if you know you want to do
> something at discrete time periods -- it totally goes against the
> twisted paradigm. [...]

--
Julien Genestoux
julien.genestoux at gmail.com
http://www.ouvre-boite.com
+1 (415) 254 7340
+33 (0)8 70 44 76 29
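One common cause of a long-running worker ballooning like this is loading every record (or keeping every parsed document reachable) at once. A hedged sketch of one mitigation, processing ids in fixed-size slices so only one batch of records is live at a time; the commented ActiveRecord call uses Rails 2-era syntax and is illustrative, not taken from Julien's code.

```ruby
# Process ids in fixed-size slices so the working set stays bounded;
# records from earlier batches become garbage-collectable once the
# block returns.
def each_feed_batch(feed_ids, batch_size = 100)
  feed_ids.each_slice(batch_size) do |batch|
    yield batch
    # In the worker, load and parse only this slice, e.g.:
    #   Feed.find(:all, :conditions => ["id IN (?)", batch]).each { |f| f.parse }
  end
end

sizes = []
each_feed_batch((1..250).to_a, 100) { |b| sizes << b.size }
```

This doesn't fix a genuine leak (e.g. a growing instance variable inside the worker), but it rules out the most common "vars are not destroyed" pattern: holding the entire result set for the whole run.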