Zach Dennis
2005-Nov-03 18:42 UTC
Processing large data sets w/rails, at blazing speeds, question
I've been doing a lot recently with large data sets, 300k records and upwards. One thing Rails doesn't have yet is support for temporary tables (or at least none that I'm aware of). I extended ActiveRecord::Base to support temporary tables with the file attached to this email. I only did as much as I needed to get my own work done, and I'm looking for input from others on how to make this better. I would like something like this to be put into Rails, because when processing large data sets it speeds things up enormously (an order of magnitude in the tests below).

Basic example of the code:

  # Account is an existing model.
  # This line creates a TempAccount model, which is backed by an
  # in-memory temporary table with the same structure as Account.
  model = Account.create_temporary_model

  # The next line is true
  Object.const_defined? model.name

  # ... process data here ...

  # kill the temporary model
  model.drop_temporary_model

One problem I've hit when doing large amounts of processing is that creating everything as an instance of the newly created model makes everything go very slow. If you build your own SQL statements and pass them to model.connection.execute, things are blazing fast. (This has been tested in development and production modes, running plain CGI.)

For example (based on the example above):

  rgx = /(\d+)\s(\w+)/
  uploaded_file.each_line do |line|
    line =~ rgx
    record = model.new
    record.accountno = $1
    record.account_name = $2
    record.save
  end

This was incredibly slow in my tests. Here were my results:

  2 records:    0.69 seconds
  100 records:  4.98 seconds
  1000 records: 41.88 seconds

(Imagine doing that for 300k-plus records.) Altering it to use code like the below had awesome results:

  rgx = /(\d+)\s(\w+)/
  arr = []
  uploaded_file.each_line do |line|
    line =~ rgx
    arr << "insert into #{model.table_name} " +
           "(accountno, account_name) " +
           "values('#{$1}', '#{$2}');"
  end
  arr.each { |sql| model.connection.execute(sql) }

The results were much faster:

  2 records:     0.35 seconds
  100 records:   0.73 seconds
  1000 records:  3.90 seconds
  5000 records:  21.91 seconds
  50000 records: 253.96 seconds

These are much better than before. If I instead write the SQL statements to a temporary file and make a system call to run the mysql client in batch mode, it goes even faster, but then I lose the ability to do things easily in Rails. The results for calling system() were:

  5000 records:  16.44 seconds
  50000 records: 141.21 seconds

This is all on a development box, which has 640MB of memory and a 1.7GHz Intel Celeron processor, with Rails 0.13.1, using Apache/CGI. I know running things on a faster server with FastCGI or SCGI will make things blaze like lightning, but I figure if I can improve Apache/CGI performance then Apache/FastCGI will just rock.

I want to add a method to ActiveRecord::Base, present only on models based on temporary tables, which optimizes its own insert calls, so I *don't* have to write my own "insert" statements by hand. I want to use the following code and achieve the same (or pretty darn close) results as I did when I wrote the SQL statements manually:

  rgx = /(\d+)\s(\w+)/
  uploaded_file.each_line do |line|
    line =~ rgx
    record = model.new
    record.accountno = $1
    record.account_name = $2
    record.save
  end

I know how to add it, but I don't know whether there are existing methods I should be reusing to help generate that optimization. Thoughts?
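For anyone who can't get at the attachment, the rough shape of the extension is something like the following. This is a minimal sketch, not the actual attached file, and it assumes MySQL (ENGINE=MEMORY for the in-memory table; the temp_ naming is just illustration):

  class ActiveRecord::Base
    # Sketch: create a Temp<Model> class backed by an in-memory
    # temporary table with the same columns as this model's table.
    def self.create_temporary_model
      temp_table = "temp_#{table_name}"

      # Copy the column structure (not the indexes) into a MEMORY
      # table; the WHERE 1=0 keeps it empty. MySQL 4.1+ syntax.
      connection.execute(
        "CREATE TEMPORARY TABLE #{temp_table} ENGINE=MEMORY " +
        "SELECT * FROM #{table_name} WHERE 1=0"
      )

      klass = Class.new(ActiveRecord::Base)
      klass.set_table_name temp_table
      # Registering the constant is what makes
      # Object.const_defined? model.name return true.
      Object.const_set("Temp#{name}", klass)
      klass
    end

    # Sketch: drop the temporary table and remove the generated class.
    # (Meant to be called only on the generated Temp class.)
    def self.drop_temporary_model
      connection.execute "DROP TABLE #{table_name}"
      Object.send(:remove_const, name.to_sym)
    end
  end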
Zach
Jamis Buck
2005-Nov-03 18:51 UTC
Re: Processing large data sets w/rails, at blazing speeds, question
On Nov 3, 2005, at 11:42 AM, Zach Dennis wrote:

> I want to add a method to ActiveRecord::Base, present only on
> models based on temporary tables, which optimizes its own insert
> calls, so I *don't* have to write my own "insert" statements by
> hand. I want to use the following code and achieve the same (or
> pretty darn close) results as I did when I wrote the SQL
> statements manually:
>
>   rgx = /(\d+)\s(\w+)/
>   uploaded_file.each_line do |line|
>     line =~ rgx
>     record = model.new
>     record.accountno = $1
>     record.account_name = $2
>     record.save
>   end
>
> I know how to add it, but I don't know whether there are existing
> methods I should be reusing to help generate that optimization.
> Thoughts?

Keep in mind that the creation of the new model and the generation of the SQL is part of the performance hit you were seeing (possibly a large part). It is just part of the price you pay for abstraction.

However, you could do something like:

  model.insert :accountno => $1, :account_name => $2

And have that basically do what you were doing with the SQL. This avoids the instantiation of a new model and encapsulates the bare-metal SQL nicely.

- Jamis
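A minimal sketch of how such a class-level insert might look (illustrative only, not an existing Rails method; it reuses connection.quote and connection.execute, which the adapters already provide):

  class ActiveRecord::Base
    # Sketch: build and run a bare INSERT from a hash of
    # column => value pairs, skipping model instantiation entirely.
    def self.insert(attrs)
      cols = attrs.keys
      vals = cols.map { |c| connection.quote(attrs[c]) }
      connection.execute(
        "INSERT INTO #{table_name} (#{cols.join(', ')}) " +
        "VALUES (#{vals.join(', ')})"
      )
    end
  end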
Zach Dennis
2005-Nov-03 19:34 UTC
Re: Processing large data sets w/rails, at blazing speeds, question
Jamis Buck wrote:

> On Nov 3, 2005, at 11:42 AM, Zach Dennis wrote:
>
>> I want to add a method to ActiveRecord::Base, present only on
>> models based on temporary tables, which optimizes its own insert
>> calls, so I *don't* have to write my own "insert" statements by
>> hand. [...]
>>
>> I know how to add it, but I don't know whether there are existing
>> methods I should be reusing to help generate that optimization.
>> Thoughts?
>
> Keep in mind that the creation of the new model and the generation
> of the SQL is part of the performance hit you were seeing (possibly
> a large part). It is just part of the price you pay for abstraction.
>
> However, you could do something like:
>
>   model.insert :accountno => $1, :account_name => $2
>
> And have that basically do what you were doing with the SQL. This
> avoids the instantiation of a new model and encapsulates the
> bare-metal SQL nicely.

Ok, should that be:

  model.connection.insert

Or is this an addition to 0.14.x? I am still on 0.13.1.

  irb(main):036:0> Account.respond_to? :insert
  => false
  irb(main):037:0> Account.connection.respond_to? :insert
  => true

Thanks Jamis!

Zach
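For reference, the adapter-level insert takes a raw SQL string rather than a hash, so the 0.13-era call would look something like this (made-up values):

  Account.connection.insert(
    "INSERT INTO accounts (accountno, account_name) VALUES ('1001', 'acme')"
  )

It returns the new row's id on MySQL.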
zdennis
2005-Nov-04 04:46 UTC
Re: Processing large data sets w/rails, at blazing speeds, question
Zach Dennis wrote:

> Jamis Buck wrote:
>
>> [...]
>>
>> However, you could do something like:
>>
>>   model.insert :accountno => $1, :account_name => $2
>>
>> And have that basically do what you were doing with the SQL. This
>> avoids the instantiation of a new model and encapsulates the
>> bare-metal SQL nicely.
>
> Ok, should that be:
>
>   model.connection.insert
>
> Or is this an addition to 0.14.x? I am still on 0.13.1.
>
> irb(main):036:0> Account.respond_to? :insert
> => false
> irb(main):037:0> Account.connection.respond_to? :insert
> => true
>
> Thanks Jamis!

Perhaps I misread your answer. Is what you posted functionality that should already work, or a suggestion of a clean way to implement what I want?

Zach
Kyle Maxwell
2005-Nov-04 04:51 UTC
Re: Processing large data sets w/rails, at blazing speeds, question
Just a quick FYI: you won't see a significant performance improvement from switching to FCGI/SCGI, because CGI only adds a small constant overhead (~1-2 seconds of process startup) to each request. It doesn't add anything proportional to the size of your data set, so removing it won't change how these large imports scale.

On 11/3/05, zdennis <zdennis-aRAREQmnvsAAvxtiuMwx3w@public.gmane.org> wrote:

> Zach Dennis wrote:
> > Jamis Buck wrote:
> > [...]
>
> Perhaps I misread your answer. Is what you posted functionality
> that should already work, or a suggestion of a clean way to
> implement what I want?
>
> Zach
Zach Dennis
2005-Nov-04 14:48 UTC
Re: Processing large data sets w/rails, at blazing speeds, question
Jamis Buck wrote:

> However, you could do something like:
>
>   model.insert :accountno => $1, :account_name => $2

Ok, I have incorporated this change into my ActiveRecord::Base extension. I have also added some other features. Example:

  # create an in-memory table which has the same structure as
  # ExistingTableModel
  model = ExistingTableModel.create_temporary_model

  # bring in our data from a file (whether on the hard drive or
  # uploaded by the user as a Tempfile or StringIO object)
  file.each_line do |line|
    line =~ /(\d+)\s(\w+)/
    model.insert :column1 => $1, :column2 => $2
  end

  # update our temporary model's data (I needed to reset ids to 0
  # in order to force a duplicate-record check on a field other
  # than my primary key)
  model.update :all, :id => 0

  # insert the data we just loaded into our temporary table into
  # our existing table. We are inserting all of the records, and
  # updating on duplicate records (which are found based on the
  # primary key or a unique index)
  model.insert_into ExistingTableModel, :select => :all,
                    :on_duplicate_key_update => { :column2 => :column2 }

  # drop our temporary table, we don't need it anymore
  model.drop

Using Jamis's suggestion I really like how "model.insert" turned out. This is benefiting me because I need to update large record sets from an uploaded file. It may be 100 records, 50k records, 300k records, or more, and the turnaround time for processing the information is now pretty decent.

Is anyone else interested in these changes? If so, is there a preferred way to package an extension to ActiveRecord::Base (do I submit a patch to dev.rubyonrails.com)? Also, if you have other suggestions for what might go along with this functionality, I'm all ears! thx,
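For anyone who wants to experiment before this lands anywhere, here is a guess at how the insert_into piece might be implemented. This is a hypothetical sketch, not Zach's actual extension; it assumes MySQL 4.1's INSERT ... SELECT ... ON DUPLICATE KEY UPDATE and ignores the :select option by always copying every column:

  class ActiveRecord::Base
    # Sketch: copy this (temporary) table's rows into target_model's
    # table, updating the named columns when a duplicate key is hit.
    def self.insert_into(target_model, options = {})
      sql = "INSERT INTO #{target_model.table_name} " +
            "SELECT * FROM #{table_name}"
      updates = options[:on_duplicate_key_update]
      if updates
        # VALUES(col) refers to the value the row would have inserted.
        pairs = updates.map { |col, src| "#{col} = VALUES(#{src})" }
        sql << " ON DUPLICATE KEY UPDATE #{pairs.join(', ')}"
      end
      connection.execute(sql)
    end
  end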
Jonathan Younger
2005-Nov-04 15:05 UTC
Re: Processing large data sets w/rails, at blazing speeds, question
I'm interested in your changes. Perhaps you could make it a plugin and/or submit it as a patch to Rails core. I'm working on a process right now that has to insert/update 100k rows for a data import, and man is it slow.

-Jonathan

On Nov 4, 2005, at 6:48 AM, Zach Dennis wrote:

> Is anyone else interested in these changes? If so, is there a
> preferred way to package an extension to ActiveRecord::Base (do I
> submit a patch to dev.rubyonrails.com)? Also, if you have other
> suggestions for what might go along with this functionality, I'm
> all ears! thx,