Hello, For an internal project we were looking for a solution which extends ActiveRecord by an iterator function. Using find(:all) is no fun if you have to process some 10.000 records. Until now there have been two ways of dealing with that scenario: - write your logic a second times (e.g. use stored procedure) - bet on AR plugins which work around that limitation by fetching IDs and processing them in groups, or by using :offset and :limit Rewriting logic is something we wanted to avoid, and the plugins don''t fully respect transactional contexts. So we started to implement our own true iterator support for AR. Our project is on Oracle, but in the meantime we have added support for MySQL. Probably other adapters can be extended easily, too. We also tried JRuby 1.1.x, which is sometimes faster than Ruby 1.8.6, but a patch is needed to bring the Java part of the connection adapter into shape for a result set iteration. Okay, you''re about to ask: how does it work. Here we go: MyTable.find_each_by_sql("some SQL here") { |my_table| ... } MyTable.find_each(:all, :conditions => ..., :include => ...) { |my_table| ... } Attached you find the magic code which can be used as a plugin for Rails. When testing, please keep in mind that only Oracle and MySQL is fully supported. JDBC will take lots of RAM for large result sets until you have patched the JdbcConnectionAdapter. Some figures with JRuby: I''ve tested the code for an export of ~80.000 customer data records. Originally I couldn''t run the export with heap space less than 2 GB (JRuby 1.1.4 without extensive garbage collection). After having patched the connection adapter, it works with less than 128 MB heap space (JRuby 1.1). I''d be happy if our idea would be picked up and AR would get these iterator methods integrated. I''ve seen lots of people asking for exactly this behavior. It''s possible to implement, it''s easy to implement, and IMHO it doesn''t break the AR metaphor. If you like the idea and want to send feedback, please CC me. I''m not subscribed to the list. Regards, Andreas -- OTTO Software Partner GmbH, Freiberger Str. 35, 01067 Dresden, Germany Andreas Gungl (e-mail: Andreas.Gungl@osp-dd.de) Tel. +49-351-49723-140 - Fax: +49-351-49723-119 AG Dresden, HRB 2475 Geschäftsführer Burkhard Arrenberg, Jens Gruhl --~--~---------~--~----~------------~-------~--~----~ You received this message because you are subscribed to the Google Groups "Ruby on Rails: Core" group. To post to this group, send email to rubyonrails-core@googlegroups.com To unsubscribe from this group, send email to rubyonrails-core+unsubscribe@googlegroups.com For more options, visit this group at http://groups.google.com/group/rubyonrails-core?hl=en -~----------~----~----~----~------~----~------~--~---
On Sat, Nov 29, 2008 at 8:14 AM, Andreas Gungl <gungl@osp-dd.de> wrote:> > MyTable.find_each_by_sql("some SQL here") { |my_table| ... } > > MyTable.find_each(:all, :conditions => ..., > :include => ...) { |my_table| ... } > >Any chance of having these work by scope/proxy, instead? For instance: MyTable.all.each do |record| ... end Or, something like this, with named scopes: User.registered.customers.each do |user| ... end I think this already works, actually -- it just implicitly converts the AssociationProxy to an array. In other words, it slurps the entire result set. It would be nice if this behavior could be made to transparently (or optionally) do what you''ve implemented, rather than inventing a whole new interface -- for example, find_each doesn''t help when you want to use other things from Enumerable, like select/reject, inject, etc. --~--~---------~--~----~------------~-------~--~----~ You received this message because you are subscribed to the Google Groups "Ruby on Rails: Core" group. To post to this group, send email to rubyonrails-core@googlegroups.com To unsubscribe from this group, send email to rubyonrails-core+unsubscribe@googlegroups.com For more options, visit this group at http://groups.google.com/group/rubyonrails-core?hl=en -~----------~----~----~----~------~----~------~--~---
Hey Andreas, I just quickly went through your changes and it looks like a good start. A few things : - Could you please open a ticket at http://rails.lighthouseapp.com/projects/8994-ruby-on-rails/overview and upload a .diff file with the patch written for Rails ? Check http://rails.lighthouseapp.com/projects/8994-ruby-on-rails/overview for more help - Patch is missing tests - Those JDBC patches don''t really belong to Rails :) Also, it''ll really help if you could join the mailing list while we talk about your patch. Most of the people are likely to forget CCing you. But if you don''t want to join our lil group, that''s not a big deal :) Thanks for working on this. On Sat, Nov 29, 2008 at 3:14 PM, Andreas Gungl <gungl@osp-dd.de> wrote:> Hello, > > For an internal project we were looking for a solution which extends > ActiveRecord by an iterator function. Using find(:all) is no fun if you > have to process some 10.000 records. > > Until now there have been two ways of dealing with that scenario: > - write your logic a second times (e.g. use stored procedure) > - bet on AR plugins which work around that limitation by fetching IDs and > processing them in groups, or by using :offset and :limit > > Rewriting logic is something we wanted to avoid, and the plugins don''t fully > respect transactional contexts. So we started to implement our own true > iterator support for AR. Our project is on Oracle, but in the meantime we > have added support for MySQL. Probably other adapters can be extended > easily, too. We also tried JRuby 1.1.x, which is sometimes faster than Ruby > 1.8.6, but a patch is needed to bring the Java part of the connection > adapter into shape for a result set iteration. > > Okay, you''re about to ask: how does it work. Here we go: > > MyTable.find_each_by_sql("some SQL here") { |my_table| ... } > > MyTable.find_each(:all, :conditions => ..., > :include => ...) { |my_table| ... } > > Attached you find the magic code which can be used as a plugin for Rails. > When testing, please keep in mind that only Oracle and MySQL is fully > supported. JDBC will take lots of RAM for large result sets until you have > patched the JdbcConnectionAdapter. > > Some figures with JRuby: > I''ve tested the code for an export of ~80.000 customer data records. > Originally I couldn''t run the export with heap space less than 2 GB (JRuby > 1.1.4 without extensive garbage collection). After having patched the > connection adapter, it works with less than 128 MB heap space (JRuby 1.1). > > > I''d be happy if our idea would be picked up and AR would get these iterator > methods integrated. I''ve seen lots of people asking for exactly this > behavior. It''s possible to implement, it''s easy to implement, and IMHO it > doesn''t break the AR metaphor. > > If you like the idea and want to send feedback, please CC me. I''m not > subscribed to the list. > > Regards, > Andreas > -- > OTTO Software Partner GmbH, Freiberger Str. 35, 01067 Dresden, Germany > Andreas Gungl (e-mail: Andreas.Gungl@osp-dd.de) > Tel. +49-351-49723-140 - Fax: +49-351-49723-119 > > AG Dresden, HRB 2475 > Geschäftsführer Burkhard Arrenberg, Jens Gruhl > > > >-- Cheers! - Pratik http://m.onkey.org --~--~---------~--~----~------------~-------~--~----~ You received this message because you are subscribed to the Google Groups "Ruby on Rails: Core" group. To post to this group, send email to rubyonrails-core@googlegroups.com To unsubscribe from this group, send email to rubyonrails-core+unsubscribe@googlegroups.com For more options, visit this group at http://groups.google.com/group/rubyonrails-core?hl=en -~----------~----~----~----~------~----~------~--~---
David, I don''t think it would be possible to support the enumerable interface in this extension because it would negate the memory management benefits that are the reason for using such a technique. Looking at the code, it looks like Andreas instantiates one ruby object at a time from the result set, then yields that object to the block for work to be done on it and frees the object after the yield returns (it''s a little fancier than this with the Oracle extension, but in the MySQL version that''s about it). If you were to allow for things like select/ reject, inject, etc then you''re back in a situation where some potentially large subset (or all) of those ruby objects are going to be held in memory. As for named scopes, like you said, I think that just works. The named scopes just insert there additional magic into the final query that gets made (adds additional conditions, joins, etc). Cheers, Andrew On Dec 1, 1:47 pm, "David Masover" <d...@3mix.com> wrote:> On Sat, Nov 29, 2008 at 8:14 AM, Andreas Gungl <gu...@osp-dd.de> wrote: > > > MyTable.find_each_by_sql("some SQL here") { |my_table| ... } > > > MyTable.find_each(:all, :conditions => ..., > > :include => ...) { |my_table| ... } > > Any chance of having these work by scope/proxy, instead? For instance: > > MyTable.all.each do |record| > ... > end > > Or, something like this, with named scopes: > > User.registered.customers.each do |user| > ... > end > > I think this already works, actually -- it just implicitly converts the > AssociationProxy to an array. In other words, it slurps the entire result > set. It would be nice if this behavior could be made to transparently (or > optionally) do what you''ve implemented, rather than inventing a whole new > interface -- for example, find_each doesn''t help when you want to use other > things from Enumerable, like select/reject, inject, etc.--~--~---------~--~----~------------~-------~--~----~ You received this message because you are subscribed to the Google Groups "Ruby on Rails: Core" group. To post to this group, send email to rubyonrails-core@googlegroups.com To unsubscribe from this group, send email to rubyonrails-core+unsubscribe@googlegroups.com For more options, visit this group at http://groups.google.com/group/rubyonrails-core?hl=en -~----------~----~----~----~------~----~------~--~---
The one problem I see with this approach is that it does not allow for the eager-loading of polymorphic associations. In fact, as the code stands now I couldn''t get vanilla :include options to work, only :joins (but I think that''s just a little bug in the parsing of the options hash). The problem is, when Rails uses the eager include approach rather than the join approach it looks through the entire result set to see which associations need to be queried and then grabs those associations in separate queries. This is relatively efficient because the object graph is being built out for all objects at once, my guess is that doing the same thing one object at a time would be very difficult to do efficiently. As an alternative approach, the project I work on uses a home-grown version of what Andreas mentions as one of the two options -- we fetch IDs and process them in groups. The one thing I really like about that approach is all the same finder options still work as expected with no additional programming required. IMO losing eager includes of polymorphic associations is a non-starter. just my 2 cents :) -ac On Nov 29, 6:14 am, Andreas Gungl <gu...@osp-dd.de> wrote:> Hello, > > For an internal project we were looking for a solution which extends > ActiveRecord by an iterator function. Using find(:all) is no fun if you > have to process some 10.000 records. > > Until now there have been two ways of dealing with that scenario: > - write your logic a second times (e.g. use stored procedure) > - bet on AR plugins which work around that limitation by fetching IDs and > processing them in groups, or by using :offset and :limit > > Rewriting logic is something we wanted to avoid, and the plugins don''t fully > respect transactional contexts. So we started to implement our own true > iterator support for AR. Our project is on Oracle, but in the meantime we > have added support for MySQL. Probably other adapters can be extended > easily, too. We also tried JRuby 1.1.x, which is sometimes faster than Ruby > 1.8.6, but a patch is needed to bring the Java part of the connection > adapter into shape for a result set iteration. > > Okay, you''re about to ask: how does it work. Here we go: > > MyTable.find_each_by_sql("some SQL here") { |my_table| ... } > > MyTable.find_each(:all, :conditions => ..., > :include => ...) { |my_table| ... } > > Attached you find the magic code which can be used as a plugin for Rails. > When testing, please keep in mind that only Oracle and MySQL is fully > supported. JDBC will take lots of RAM for large result sets until you have > patched the JdbcConnectionAdapter. > > Some figures with JRuby: > I''ve tested the code for an export of ~80.000 customer data records. > Originally I couldn''t run the export with heap space less than 2 GB (JRuby > 1.1.4 without extensive garbage collection). After having patched the > connection adapter, it works with less than 128 MB heap space (JRuby 1.1). > > I''d be happy if our idea would be picked up and AR would get these iterator > methods integrated. I''ve seen lots of people asking for exactly this > behavior. It''s possible to implement, it''s easy to implement, and IMHO it > doesn''t break the AR metaphor. > > If you like the idea and want to send feedback, please CC me. I''m not > subscribed to the list. > > Regards, > Andreas > -- > OTTO Software Partner GmbH, Freiberger Str. 35, 01067 Dresden, Germany > Andreas Gungl (e-mail: Andreas.Gu...@osp-dd.de) > Tel. +49-351-49723-140 - Fax: +49-351-49723-119 > > AG Dresden, HRB 2475 > Geschäftsführer Burkhard Arrenberg, Jens Gruhl > > active_record_iterator.tgz > 9KViewDownload--~--~---------~--~----~------------~-------~--~----~ You received this message because you are subscribed to the Google Groups "Ruby on Rails: Core" group. To post to this group, send email to rubyonrails-core@googlegroups.com To unsubscribe from this group, send email to rubyonrails-core+unsubscribe@googlegroups.com For more options, visit this group at http://groups.google.com/group/rubyonrails-core?hl=en -~----------~----~----~----~------~----~------~--~---
On 2 Dec 2008, at 15:55, acechase wrote:> > David, > > I don''t think it would be possible to support the enumerable interface > in this extension because it would negate the memory management > benefits that are the reason for using such a technique. Looking at > the code, it looks like Andreas instantiates one ruby object at a time > from the result set, then yields that object to the block for work to > be done on it and frees the object after the yield returns (it''s a > little fancier than this with the Oracle extension, but in the MySQL > version that''s about it). If you were to allow for things like select/ > reject, inject, etc then you''re back in a situation where some > potentially large subset (or all) of those ruby objects are going to > be held in memory.I don''t thing that''s necessarily true. Consider for example (1..10000000).select {|x| x == 100000} This range contains ten million elements, but executing that statement doesn''t require having all of those in memory at any given time (keep an eye on ruby''s memory consumption while this runs) On the other hand this would (1..10000000).to_a.select {|x| x == 100000} Of course if the result set you are returning is large eg (1..10000000).to_a.select {|x| x % 2 == 0 } then that will take a big chunk of memory but that seems fair enough. Fred> > > As for named scopes, like you said, I think that just works. The named > scopes just insert there additional magic into the final query that > gets made (adds additional conditions, joins, etc). > > Cheers, > Andrew > > On Dec 1, 1:47 pm, "David Masover" <d...@3mix.com> wrote: >> On Sat, Nov 29, 2008 at 8:14 AM, Andreas Gungl <gu...@osp-dd.de> >> wrote: >> >>> MyTable.find_each_by_sql("some SQL here") { |my_table| ... } >> >>> MyTable.find_each(:all, :conditions => ..., >>> :include => ...) { |my_table| ... } >> >> Any chance of having these work by scope/proxy, instead? For >> instance: >> >> MyTable.all.each do |record| >> ... >> end >> >> Or, something like this, with named scopes: >> >> User.registered.customers.each do |user| >> ... >> end >> >> I think this already works, actually -- it just implicitly converts >> the >> AssociationProxy to an array. In other words, it slurps the entire >> result >> set. It would be nice if this behavior could be made to >> transparently (or >> optionally) do what you''ve implemented, rather than inventing a >> whole new >> interface -- for example, find_each doesn''t help when you want to >> use other >> things from Enumerable, like select/reject, inject, etc. > >--~--~---------~--~----~------------~-------~--~----~ You received this message because you are subscribed to the Google Groups "Ruby on Rails: Core" group. To post to this group, send email to rubyonrails-core@googlegroups.com To unsubscribe from this group, send email to rubyonrails-core+unsubscribe@googlegroups.com For more options, visit this group at http://groups.google.com/group/rubyonrails-core?hl=en -~----------~----~----~----~------~----~------~--~---
I understand -- it would be a case where you do not want to deliberately cache the resultset, and so, it would not be quite like an association proxy. But of course, each is all that''s needed for things like select/reject/inject. And inject, in particular, should be safe -- select/reject, any?, all?, and similar things will still benefit from not _holding_ all items in memory, even if all items will have to be iterated over anyway. In fact, it''s my understanding that this is the main reason for having an enumerable interface -- when using such an interface, you''re not required to know or care whether it''s a native array, the result of slurping the entire table, one select call per iteration, or anything in between. Of course, there are also the things like sort, uniq, and map that would ruin your whole day -- though map, at least, could be implemented in a lazier way, for this purpose. On Tue, Dec 2, 2008 at 9:55 AM, acechase <acechase@gmail.com> wrote:> > David, > > I don''t think it would be possible to support the enumerable interface > in this extension because it would negate the memory management > benefits that are the reason for using such a technique. Looking at > the code, it looks like Andreas instantiates one ruby object at a time > from the result set, then yields that object to the block for work to > be done on it and frees the object after the yield returns (it''s a > little fancier than this with the Oracle extension, but in the MySQL > version that''s about it). If you were to allow for things like select/ > reject, inject, etc then you''re back in a situation where some > potentially large subset (or all) of those ruby objects are going to > be held in memory. > > As for named scopes, like you said, I think that just works. The named > scopes just insert there additional magic into the final query that > gets made (adds additional conditions, joins, etc). > > Cheers, > Andrew > > On Dec 1, 1:47 pm, "David Masover" <d...@3mix.com> wrote: > > On Sat, Nov 29, 2008 at 8:14 AM, Andreas Gungl <gu...@osp-dd.de> wrote: > > > > > MyTable.find_each_by_sql("some SQL here") { |my_table| ... } > > > > > MyTable.find_each(:all, :conditions => ..., > > > :include => ...) { |my_table| ... } > > > > Any chance of having these work by scope/proxy, instead? For instance: > > > > MyTable.all.each do |record| > > ... > > end > > > > Or, something like this, with named scopes: > > > > User.registered.customers.each do |user| > > ... > > end > > > > I think this already works, actually -- it just implicitly converts the > > AssociationProxy to an array. In other words, it slurps the entire result > > set. It would be nice if this behavior could be made to transparently (or > > optionally) do what you''ve implemented, rather than inventing a whole new > > interface -- for example, find_each doesn''t help when you want to use > other > > things from Enumerable, like select/reject, inject, etc. > > >--~--~---------~--~----~------------~-------~--~----~ You received this message because you are subscribed to the Google Groups "Ruby on Rails: Core" group. To post to this group, send email to rubyonrails-core@googlegroups.com To unsubscribe from this group, send email to rubyonrails-core+unsubscribe@googlegroups.com For more options, visit this group at http://groups.google.com/group/rubyonrails-core?hl=en -~----------~----~----~----~------~----~------~--~---
Am Tuesday 02 December 2008 schrieb Pratik:> Hey Andreas, > > I just quickly went through your changes and it looks like a good > start. A few things : > > - Could you please open a ticket at > http://rails.lighthouseapp.com/projects/8994-ruby-on-rails/overview > and upload a .diff file with the patch written for Rails ? Check > http://rails.lighthouseapp.com/projects/8994-ruby-on-rails/overview > for more help > - Patch is missing tests > - Those JDBC patches don''t really belong to Rails :)Well, it''s just that people would wonder about the JDBC behavior. You risk an OOM killed process if you work without the patches for large sets of data. If the adapter maintainers picked up the changes, the adapter would work out of the box. Reading the other comments about the incompleteness of the #find_each part, I do wonder if I better focus on the first part for now. #find_each_by_sql is pretty straight forward. Probably the other part can be improved, let me see. In general I see the main benefit in the impressively decreased memory need when working with large result sets. It''s still a single transaction which is what I missed in the other plugins which work around the memory limitation by multiple selects.> Also, it''ll really help if you could join the mailing list while we > talk about your patch. Most of the people are likely to forget CCing > you. But if you don''t want to join our lil group, that''s not a big > deal :)You''re right. ;-) I hope to find some more time over the Xmas holidays to put some more effort into the current implementation. I certainly get back then.> Thanks for working on this. > > On Sat, Nov 29, 2008 at 3:14 PM, Andreas Gungl <gungl@osp-dd.de> wrote: > > Hello, > > > > For an internal project we were looking for a solution which extends > > ActiveRecord by an iterator function. Using find(:all) is no fun if you > > have to process some 10.000 records. > > > > Until now there have been two ways of dealing with that scenario: > > - write your logic a second times (e.g. use stored procedure) > > - bet on AR plugins which work around that limitation by fetching IDs > > and processing them in groups, or by using :offset and :limit > > > > Rewriting logic is something we wanted to avoid, and the plugins don''t > > fully respect transactional contexts. So we started to implement our > > own true iterator support for AR. Our project is on Oracle, but in the > > meantime we have added support for MySQL. Probably other adapters can > > be extended easily, too. We also tried JRuby 1.1.x, which is sometimes > > faster than Ruby 1.8.6, but a patch is needed to bring the Java part of > > the connection adapter into shape for a result set iteration. > > > > Okay, you''re about to ask: how does it work. Here we go: > > > > MyTable.find_each_by_sql("some SQL here") { |my_table| ... } > > > > MyTable.find_each(:all, :conditions => ..., > > > > :include => ...) { |my_table| ... } > > > > Attached you find the magic code which can be used as a plugin for > > Rails. When testing, please keep in mind that only Oracle and MySQL is > > fully supported. JDBC will take lots of RAM for large result sets until > > you have patched the JdbcConnectionAdapter. > > > > Some figures with JRuby: > > I''ve tested the code for an export of ~80.000 customer data records. > > Originally I couldn''t run the export with heap space less than 2 GB > > (JRuby 1.1.4 without extensive garbage collection). After having > > patched the connection adapter, it works with less than 128 MB heap > > space (JRuby 1.1). > > > > > > I''d be happy if our idea would be picked up and AR would get these > > iterator methods integrated. I''ve seen lots of people asking for > > exactly this behavior. It''s possible to implement, it''s easy to > > implement, and IMHO it doesn''t break the AR metaphor. > > > > If you like the idea and want to send feedback, please CC me. I''m not > > subscribed to the list. > > > > Regards, > > Andreas > > -- > > OTTO Software Partner GmbH, Freiberger Str. 35, 01067 Dresden, Germany > > Andreas Gungl (e-mail: Andreas.Gungl@osp-dd.de) > > Tel. +49-351-49723-140 - Fax: +49-351-49723-119 > > > > AG Dresden, HRB 2475 > > Geschäftsführer Burkhard Arrenberg, Jens Gruhl > > > >-- OTTO Software Partner GmbH, Freiberger Str. 35, 01067 Dresden, Germany Andreas Gungl (e-mail: Andreas.Gungl@osp-dd.de) Tel. +49-351-49723-140 - Fax: +49-351-49723-119 AG Dresden, HRB 2475 Geschäftsführer Burkhard Arrenberg, Jens Gruhl --~--~---------~--~----~------------~-------~--~----~ You received this message because you are subscribed to the Google Groups "Ruby on Rails: Core" group. To post to this group, send email to rubyonrails-core@googlegroups.com To unsubscribe from this group, send email to rubyonrails-core+unsubscribe@googlegroups.com For more options, visit this group at http://groups.google.com/group/rubyonrails-core?hl=en -~----------~----~----~----~------~----~------~--~---