Zach Dennis
2005-Nov-03 18:42 UTC
Processing large data sets w/rails, at blazing speeds, question
I've been doing a lot recently with large data sets, 300k records and
upwards.
One thing Rails doesn't have yet is support for temporary tables (or at
least none that I'm aware of). I extended ActiveRecord::Base to support
temporary tables with the file attached to this email.
I only did as much as I needed to get my job done, and I'm looking for
input from others on how to make this better. I would like something
like this to be put into Rails, because when processing large data sets
it speeds things up enormously!
Basic example:

  # Account is an existing model. This line creates a TempAccount
  # model, which is an in-memory temporary table with the same
  # structure as Account:
  model = Account.create_temporary_model

  # The next line is true:
  Object.const_defined? model.name

  # ... process data here ...

  # Kill the temporary model:
  model.drop_temporary_model
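
(The attachment didn't survive into this archive. Purely to illustrate
the idea, and not Zach's actual code, an extension along these lines
might look roughly like the following; it assumes MySQL's CREATE
TEMPORARY TABLE ... LIKE syntax, available in 4.1+, and all the names
here are my own guesses:)

  class ActiveRecord::Base
    # Create a Temp<Name> model class backed by a TEMPORARY table
    # with the same structure as this model's table.
    def self.create_temporary_model
      temp_name  = "Temp#{name}"
      temp_table = "temp_#{table_name}"
      connection.execute(
        "CREATE TEMPORARY TABLE #{temp_table} LIKE #{table_name}")
      klass = Class.new(ActiveRecord::Base)
      klass.class_eval %{
        def self.table_name; "#{temp_table}"; end
        def self.drop_temporary_model
          connection.execute("DROP TABLE #{temp_table}")
          Object.send(:remove_const, :#{temp_name})
        end
      }
      Object.const_set(temp_name, klass)
    end
  end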
Now, one problem I've hit when doing large amounts of processing is
that instantiating every record as an instance of the newly created
model makes everything very slow. If you build your own SQL statements
and pass them to model.connection.execute, things are blazing fast.
(This has been tested in development and production modes, running
plain CGI.)
For example (building on the example above):
  rgx = /(\d+)\s(\w+)/
  uploaded_file.each_line do |line|
    line =~ rgx
    record = model.new
    record.accountno = $1
    record.account_name = $2
    record.save
  end
This was incredibly slow in my tests. Here were my results:
2 records: 0.69 seconds
100 records: 4.98 seconds
1000 records: 41.88 seconds
(imagine doing that for 300k plus records)
Now, altering that to use code like the following had awesome
results:
  rgx = /(\d+)\s(\w+)/
  arr = []
  uploaded_file.each_line do |line|
    line =~ rgx
    arr << "insert into #{model.table_name} " +
           "(accountno, account_name) " +
           "values ('#{$1}', '#{$2}');"
  end
  arr.each { |sql| model.connection.execute(sql) }
The results were much faster:
2 records: 0.35 seconds
100 records: 0.73 seconds
1000 records: 3.90 seconds
5000 records: 21.91 seconds
50000 records: 253.96 seconds
Now these are much better than before. If I instead write the SQL
statements to a temporary file and make a system call to run the mysql
client in batch mode, it goes even faster, but then I lose the ability
to do things easily in Rails. The results for calling system() were:
5000 records: 16.44 seconds
50000 records: 141.21 seconds
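
(For reference, the batch-mode variant was along these lines; a
sketch, where the user, password, and database variables are
placeholders for your own connection details:)

  require 'tempfile'

  # Dump the generated statements to a temp file and feed the whole
  # file to the mysql client in a single batch run.
  Tempfile.open('bulk_insert') do |f|
    arr.each { |sql| f.puts sql }
    f.flush
    system("mysql -u#{user} -p#{password} #{database} < #{f.path}")
  end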
This is all on a development box which has 640MB of memory and a
1.7GHz Intel Celeron processor, running Rails 0.13.1 under
Apache/CGI.
I know running things on a faster server with FastCGI or SCGI will make
things blaze like lightning, but I figure if I can improve Apache/CGI
performance then Apache/FastCGI will just rock.
I want to add a method to ActiveRecord::Base which is only added to
models based on temporary tables and which optimizes its own insert
calls, so I *don't* have to write my own "insert" statements by hand.
I want to use the following code and achieve the same (or pretty darn
close) results as when I wrote my own SQL statements:
  rgx = /(\d+)\s(\w+)/
  uploaded_file.each_line do |line|
    line =~ rgx
    record = model.new
    record.accountno = $1
    record.account_name = $2
    record.save
  end
I know how to add it, but I don't know if there are existing methods I
should be reusing to help generate that optimization. Thoughts?
Zach
Jamis Buck
2005-Nov-03 18:51 UTC
Re: Processing large data sets w/rails, at blazing speeds, question
On Nov 3, 2005, at 11:42 AM, Zach Dennis wrote:

> I want to add a method to ActiveRecord::Base which is only added to
> models based on temporary tables and which optimizes its own insert
> calls, so I *don't* have to write my own "insert" statements by
> hand. [...]
>
> I know how to add it, but I don't know if there are existing
> methods I should be reusing to help generate that optimization.
> Thoughts?

Keep in mind that the creation of the new model and the generation of
the SQL is part of the performance hit you were seeing (possibly a
large part). It is just part of the price you pay for abstraction.

However, you could do something like:

  model.insert :accountno => $1, :account_name => $2

And have that basically do what you were doing with the SQL. This
avoids the instantiation of a new model, and encapsulates the
bare-metal SQL nicely.

- Jamis
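
(A minimal sketch of what Jamis describes, for concreteness; untested,
and it leans on the adapter's quote method for escaping:)

  class ActiveRecord::Base
    # Build and execute a bare INSERT from a hash of attributes,
    # skipping model instantiation entirely.
    def self.insert(attributes)
      cols = attributes.keys
      vals = cols.map { |c| connection.quote(attributes[c]) }
      connection.execute(
        "INSERT INTO #{table_name} (#{cols.join(',')}) " +
        "VALUES (#{vals.join(',')})")
    end
  end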
Zach Dennis
2005-Nov-03 19:34 UTC
Re: Processing large data sets w/rails, at blazing speeds, question
Jamis Buck wrote:

> Keep in mind that the creation of the new model and the generation
> of the SQL is part of the performance hit you were seeing (possibly
> a large part). It is just part of the price you pay for abstraction.
>
> However, you could do something like:
>
>   model.insert :accountno => $1, :account_name => $2
>
> And have that basically do what you were doing with the SQL. This
> avoids the instantiation of a new model, and encapsulates the
> bare-metal SQL nicely.

OK, should that be:

  model.connection.insert

Or is this an addition to 0.14.x? I am still on 0.13.1.

  irb(main):036:0> Account.respond_to? :insert
  => false
  irb(main):037:0> Account.connection.respond_to? :insert
  => true

Thanks Jamis!

Zach
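
(The adapter method found above takes a raw SQL string rather than a
hash of attributes, so calling it directly would look something like
the following; the optional arguments it accepts vary by version:)

  model.connection.insert(
    "INSERT INTO #{model.table_name} (accountno, account_name) " +
    "VALUES ('#{$1}', '#{$2}')")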
zdennis
2005-Nov-04 04:46 UTC
Re: Processing large data sets w/rails, at blazing speeds, question
Zach Dennis wrote:

> Jamis Buck wrote:
>
>> However, you could do something like:
>>
>>   model.insert :accountno => $1, :account_name => $2
>>
>> And have that basically do what you were doing with the SQL. This
>> avoids the instantiation of a new model, and encapsulates the
>> bare-metal SQL nicely.
>
> OK, should that be:
>
>   model.connection.insert
>
> Or is this an addition to 0.14.x? I am still on 0.13.1.
>
> [...]
>
> Thanks Jamis!

Perhaps I misread your answer. Is what you posted functionality that
should work, or a suggestion of a clean way to implement what I want?

Zach
Kyle Maxwell
2005-Nov-04 04:51 UTC
Re: Processing large data sets w/rails, at blazing speeds, question
Just a quick FYI: you won't see a significant performance improvement
from switching to FCGI/SCGI, because CGI only adds a small constant
(~1-2 seconds) to each request. It isn't adding anything O(n), i.e. a
cost proportional to the size of the job, such as 10%.

On 11/3/05, zdennis <zdennis-aRAREQmnvsAAvxtiuMwx3w@public.gmane.org> wrote:

> [...]
>
> Perhaps I misread your answer. Is what you posted functionality
> that should work, or a suggestion of a clean way to implement what
> I want?
>
> Zach
Zach Dennis
2005-Nov-04 14:48 UTC
Re: Processing large data sets w/rails, at blazing speeds, question
Jamis Buck wrote:

> However, you could do something like:
>
>   model.insert :accountno => $1, :account_name => $2

OK, I have incorporated this change into my ActiveRecord::Base
extension. I have also added some other features, for example:

  # Create an in-memory table which has the same structure as
  # ExistingTableModel:
  model = ExistingTableModel.create_temporary_model

  # Bring in our data from a file (whether on the hard drive or
  # uploaded by the user as a Tempfile or StringIO object):
  file.each_line do |line|
    fields = line.scan(/(\d+)(\w+)/)
    model.insert :column1 => $1, :column2 => $2
  end

  # Update our temporary model's data. (I needed to reset ids to 0
  # in order to force a duplicate-record check on a field other than
  # my primary key.)
  model.update :all, :id => 0

  # Insert the data we just loaded into our temporary table into our
  # existing table. We insert all records, updating on duplicate
  # records (which are found via the primary key or a unique index):
  model.insert_into ExistingTableModel, :select => :all,
    :on_duplicate_key_update => { :column2 => :column2 }

  # Drop our temporary table; we don't need it anymore.
  model.drop

Using Jamis's suggestion I really like how "model.insert" turned out.
This is benefiting me because I need to update large recordsets from
an uploaded file. It may be 100 records, 50k records, 300k records, or
more, and the turnaround time for processing the information is pretty
decent.

Is anyone else interested in these changes? If so, is there a
preferred way to package an extension to ActiveRecord::Base (do I
submit a patch to dev.rubyonrails.com)? Also, if you have other
suggestions for what might go along with this functionality, I'm all
ears! thx,

Zach
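
(For the curious: the insert_into call above maps naturally onto
MySQL's INSERT ... SELECT ... ON DUPLICATE KEY UPDATE, available in
4.1+. A rough reconstruction of what its core could generate, not the
actual code from the extension:)

  # Hypothetical core of insert_into, ignoring the :select option:
  def self.insert_into(target, options = {})
    updates = options[:on_duplicate_key_update].map { |col, src|
      "#{col} = VALUES(#{src})"
    }.join(', ')
    connection.execute(
      "INSERT INTO #{target.table_name} " +
      "SELECT * FROM #{table_name} " +
      "ON DUPLICATE KEY UPDATE #{updates}")
  end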
Jonathan Younger
2005-Nov-04 15:05 UTC
Re: Processing large data sets w/rails, at blazing speeds, question
I'm interested in your changes. Perhaps you could make it a plugin
and/or submit it as a patch to Rails core. I'm working on a process
right now that has to insert/update 100k rows for a data import, and
man is it slow.

-Jonathan

On Nov 4, 2005, at 6:48 AM, Zach Dennis wrote:

> Is anyone else interested in these changes? If so, is there a
> preferred way to package an extension to ActiveRecord::Base (do I
> submit a patch to dev.rubyonrails.com)? Also, if you have other
> suggestions for what might go along with this functionality, I'm
> all ears! thx,