Zach Dennis
2005-Nov-03 18:42 UTC
Processing large data sets w/rails, at blazing speeds, question
I've been doing a lot recently with large data sets, 300k records and upwards. One thing Rails doesn't have yet is support for temporary tables (or at least none that I'm aware of). I extended ActiveRecord::Base to support temporary tables with the file attached to this email. I only did as much as I needed to get my own work done, and I'm looking for input from others on how to make this better. I would like something like this to be put into Rails, because when processing large data sets it speeds things up enormously (an order of magnitude in the tests below).

Basic example of the code:

  # Account is an existing model.
  # This line creates a TempAccount model, which is backed by an
  # in-memory temporary table with the same structure as Account.
  model = Account.create_temporary_model

  # The next line is true
  Object.const_defined? model.name

  # ... process data here ...

  # kill the temporary model
  model.drop_temporary_model

One problem I've hit when doing large amounts of processing is that creating everything as an instance of the newly created model makes everything go very slow. If you build your own SQL statements and pass them to model.connection.execute, things are blazing fast. (This has been tested in development and production modes, running plain CGI.)

For example (based on the example above):

  rgx = /(\d+)\s(\w+)/
  uploaded_file.each_line do |line|
    line =~ rgx
    record = model.new
    record.accountno = $1
    record.account_name = $2
    record.save
  end

This was incredibly slow in my tests. Here were my results:

  2 records:    0.69 seconds
  100 records:  4.98 seconds
  1000 records: 41.88 seconds

(Imagine doing that for 300k-plus records.) Altering it to use code like the below had awesome results:

  rgx = /(\d+)\s(\w+)/
  arr = []
  uploaded_file.each_line do |line|
    line =~ rgx
    arr << "insert into #{model.table_name} " +
           "(accountno, account_name) " +
           "values('#{$1}', '#{$2}');"
  end
  arr.each { |sql| model.connection.execute(sql) }

The results were much faster:

  2 records:     0.35 seconds
  100 records:   0.73 seconds
  1000 records:  3.90 seconds
  5000 records:  21.91 seconds
  50000 records: 253.96 seconds

These are much better than before. If I instead write the SQL statements to a temporary file and make a system call to run the mysql client in batch mode, it goes even faster, but then I lose the ability to do things easily in Rails. The results for calling system() were:

  5000 records:  16.44 seconds
  50000 records: 141.21 seconds

This is all on a development box, which has 640MB of memory and a 1.7GHz Intel Celeron processor, with Rails 0.13.1, using Apache/CGI. I know running things on a faster server with FastCGI or SCGI will make things blaze like lightning, but I figure if I can improve Apache/CGI performance then Apache/FastCGI will just rock.

I want to add a method to ActiveRecord::Base, present only on models based on temporary tables, which optimizes its own insert calls, so I *don't* have to write my own "insert" statements by hand. I want to use the following code and achieve the same (or pretty darn close) results as I did when I wrote the SQL statements manually:

  rgx = /(\d+)\s(\w+)/
  uploaded_file.each_line do |line|
    line =~ rgx
    record = model.new
    record.accountno = $1
    record.account_name = $2
    record.save
  end

I know how to add it, but I don't know whether there are existing methods I should be reusing to help generate that optimization. Thoughts?
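For anyone who can't get at the attachment, the rough shape of the extension is something like the following. This is a minimal sketch, not the actual attached file, and it assumes MySQL (ENGINE=MEMORY for the in-memory table; the temp_ naming is just illustration):

  class ActiveRecord::Base
    # Sketch: create a Temp<Model> class backed by an in-memory
    # temporary table with the same columns as this model's table.
    def self.create_temporary_model
      temp_table = "temp_#{table_name}"

      # Copy the column structure (not the indexes) into a MEMORY
      # table; the WHERE 1=0 keeps it empty. MySQL 4.1+ syntax.
      connection.execute(
        "CREATE TEMPORARY TABLE #{temp_table} ENGINE=MEMORY " +
        "SELECT * FROM #{table_name} WHERE 1=0"
      )

      klass = Class.new(ActiveRecord::Base)
      klass.set_table_name temp_table
      # Registering the constant is what makes
      # Object.const_defined? model.name return true.
      Object.const_set("Temp#{name}", klass)
      klass
    end

    # Sketch: drop the temporary table and remove the generated class.
    # (Meant to be called only on the generated Temp class.)
    def self.drop_temporary_model
      connection.execute "DROP TABLE #{table_name}"
      Object.send(:remove_const, name.to_sym)
    end
  end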
Zach
Jamis Buck
2005-Nov-03 18:51 UTC
Re: Processing large data sets w/rails, at blazing speeds, question
On Nov 3, 2005, at 11:42 AM, Zach Dennis wrote:

> I want to add a method to ActiveRecord::Base, present only on
> models based on temporary tables, which optimizes its own insert
> calls, so I *don't* have to write my own "insert" statements by
> hand. I want to use the following code and achieve the same (or
> pretty darn close) results as I did when I wrote the SQL
> statements manually:
>
>   rgx = /(\d+)\s(\w+)/
>   uploaded_file.each_line do |line|
>     line =~ rgx
>     record = model.new
>     record.accountno = $1
>     record.account_name = $2
>     record.save
>   end
>
> I know how to add it, but I don't know whether there are existing
> methods I should be reusing to help generate that optimization.
> Thoughts?

Keep in mind that the creation of the new model and the generation of the SQL is part of the performance hit you were seeing (possibly a large part). It is just part of the price you pay for abstraction.

However, you could do something like:

  model.insert :accountno => $1, :account_name => $2

And have that basically do what you were doing with the SQL. This avoids the instantiation of a new model and encapsulates the bare-metal SQL nicely.

- Jamis
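A minimal sketch of how such a class-level insert might look (illustrative only, not an existing Rails method; it reuses connection.quote and connection.execute, which the adapters already provide):

  class ActiveRecord::Base
    # Sketch: build and run a bare INSERT from a hash of
    # column => value pairs, skipping model instantiation entirely.
    def self.insert(attrs)
      cols = attrs.keys
      vals = cols.map { |c| connection.quote(attrs[c]) }
      connection.execute(
        "INSERT INTO #{table_name} (#{cols.join(', ')}) " +
        "VALUES (#{vals.join(', ')})"
      )
    end
  end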
Zach Dennis
2005-Nov-03 19:34 UTC
Re: Processing large data sets w/rails, at blazing speeds, question
Jamis Buck wrote:

> On Nov 3, 2005, at 11:42 AM, Zach Dennis wrote:
>
>> I want to add a method to ActiveRecord::Base, present only on
>> models based on temporary tables, which optimizes its own insert
>> calls, so I *don't* have to write my own "insert" statements by
>> hand. [...]
>>
>> I know how to add it, but I don't know whether there are existing
>> methods I should be reusing to help generate that optimization.
>> Thoughts?
>
> Keep in mind that the creation of the new model and the generation
> of the SQL is part of the performance hit you were seeing (possibly
> a large part). It is just part of the price you pay for abstraction.
>
> However, you could do something like:
>
>   model.insert :accountno => $1, :account_name => $2
>
> And have that basically do what you were doing with the SQL. This
> avoids the instantiation of a new model and encapsulates the
> bare-metal SQL nicely.

Ok, should that be:

  model.connection.insert

Or is this an addition to 0.14.x? I am still on 0.13.1.

  irb(main):036:0> Account.respond_to? :insert
  => false
  irb(main):037:0> Account.connection.respond_to? :insert
  => true

Thanks Jamis!

Zach
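For reference, the adapter-level insert takes a raw SQL string rather than a hash, so the 0.13-era call would look something like this (made-up values):

  Account.connection.insert(
    "INSERT INTO accounts (accountno, account_name) VALUES ('1001', 'acme')"
  )

It returns the new row's id on MySQL.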
zdennis
2005-Nov-04 04:46 UTC
Re: Processing large data sets w/rails, at blazing speeds, question
Zach Dennis wrote:

> Jamis Buck wrote:
>
>> [...]
>>
>> However, you could do something like:
>>
>>   model.insert :accountno => $1, :account_name => $2
>>
>> And have that basically do what you were doing with the SQL. This
>> avoids the instantiation of a new model and encapsulates the
>> bare-metal SQL nicely.
>
> Ok, should that be:
>
>   model.connection.insert
>
> Or is this an addition to 0.14.x? I am still on 0.13.1.
>
> irb(main):036:0> Account.respond_to? :insert
> => false
> irb(main):037:0> Account.connection.respond_to? :insert
> => true
>
> Thanks Jamis!

Perhaps I misread your answer. Is what you posted functionality that should already work, or a suggestion of a clean way to implement what I want?

Zach
Kyle Maxwell
2005-Nov-04 04:51 UTC
Re: Processing large data sets w/rails, at blazing speeds, question
Just a quick FYI: you won't see a significant performance improvement from switching to FCGI/SCGI, because CGI only adds a small constant overhead (~1-2 seconds of process startup) to each request. It doesn't add anything proportional to the size of your data set, so removing it won't change how these large imports scale.

On 11/3/05, zdennis <zdennis-aRAREQmnvsAAvxtiuMwx3w@public.gmane.org> wrote:

> Zach Dennis wrote:
> > Jamis Buck wrote:
> > [...]
>
> Perhaps I misread your answer. Is what you posted functionality
> that should already work, or a suggestion of a clean way to
> implement what I want?
>
> Zach
Zach Dennis
2005-Nov-04 14:48 UTC
Re: Processing large data sets w/rails, at blazing speeds, question
Jamis Buck wrote:

> However, you could do something like:
>
>   model.insert :accountno => $1, :account_name => $2

Ok, I have incorporated this change into my ActiveRecord::Base extension. I have also added some other features. Example:

  # create an in-memory table which has the same structure as
  # ExistingTableModel
  model = ExistingTableModel.create_temporary_model

  # bring in our data from a file (whether on the hard drive or
  # uploaded by the user as a Tempfile or StringIO object)
  file.each_line do |line|
    line =~ /(\d+)\s(\w+)/
    model.insert :column1 => $1, :column2 => $2
  end

  # update our temporary model's data (I needed to reset ids to 0
  # in order to force a duplicate-record check on a field other
  # than my primary key)
  model.update :all, :id => 0

  # insert the data we just loaded into our temporary table into
  # our existing table. We are inserting all of the records, and
  # updating on duplicate records (which are found based on the
  # primary key or a unique index)
  model.insert_into ExistingTableModel, :select => :all,
                    :on_duplicate_key_update => { :column2 => :column2 }

  # drop our temporary table, we don't need it anymore
  model.drop

Using Jamis's suggestion I really like how "model.insert" turned out. This is benefiting me because I need to update large record sets from an uploaded file. It may be 100 records, 50k records, 300k records, or more, and the turnaround time for processing the information is now pretty decent.

Is anyone else interested in these changes? If so, is there a preferred way to package an extension to ActiveRecord::Base (do I submit a patch to dev.rubyonrails.com)? Also, if you have other suggestions for what might go along with this functionality, I'm all ears! thx,
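For anyone who wants to experiment before this lands anywhere, here is a guess at how the insert_into piece might be implemented. This is a hypothetical sketch, not Zach's actual extension; it assumes MySQL 4.1's INSERT ... SELECT ... ON DUPLICATE KEY UPDATE and ignores the :select option by always copying every column:

  class ActiveRecord::Base
    # Sketch: copy this (temporary) table's rows into target_model's
    # table, updating the named columns when a duplicate key is hit.
    def self.insert_into(target_model, options = {})
      sql = "INSERT INTO #{target_model.table_name} " +
            "SELECT * FROM #{table_name}"
      updates = options[:on_duplicate_key_update]
      if updates
        # VALUES(col) refers to the value the row would have inserted.
        pairs = updates.map { |col, src| "#{col} = VALUES(#{src})" }
        sql << " ON DUPLICATE KEY UPDATE #{pairs.join(', ')}"
      end
      connection.execute(sql)
    end
  end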
Jonathan Younger
2005-Nov-04 15:05 UTC
Re: Processing large data sets w/rails, at blazing speeds, question
I'm interested in your changes. Perhaps you could make it a plugin and/or submit it as a patch to Rails core. I'm working on a process right now that has to insert/update 100k rows for a data import, and man is it slow.

-Jonathan

On Nov 4, 2005, at 6:48 AM, Zach Dennis wrote:

> Is anyone else interested in these changes? If so, is there a
> preferred way to package an extension to ActiveRecord::Base (do I
> submit a patch to dev.rubyonrails.com)? Also, if you have other
> suggestions for what might go along with this functionality, I'm
> all ears! thx,