I want to load about 14000 words (a subset of /usr/share/dict/words) into a MySQL table in a migration:

  class CreateWords < ActiveRecord::Migration
    def self.up
      create_table :words, :force => true do |t|
        t.column :word, :string
      end
      say_with_time 'loading words...' do
        words = File.join(RAILS_ROOT, 'db', 'words')
        Word.transaction do
          File.foreach(words) do |line|
            line.chomp!
            Word.create!(:word => line)
          end
        end
      end
    end

    def self.down
      drop_table :words
    end
  end

That takes about 45 seconds, and a one-liner shows the time needed by the foreach loop itself is negligible. The analogous SQL file with INSERTs takes about 3 seconds to load when passed directly to the mysql client.

I have no problem with those 45 seconds, but I would like to know why that is. Do you know why there is so much difference? Is it the driver?

-- fxn
On 5/15/06, Xavier Noria <fxn@hashref.com> wrote:

> I want to load about 14000 words (a subset of /usr/share/dict/words)
> into a MySQL table in a migration:
>
> [...]
>
> That takes about 45 seconds, and a one-liner shows the time needed by
> the foreach loop itself is negligible. The analogous SQL file with
> inserts takes about 3 seconds to load passed directly to the mysql
> client.
>
> I have no problem with those 45 seconds, but I would like to know why
> that is. Do you know why there is so much difference? Is it the
> driver?

The ActiveRecord approach is doing far more work than simple SQL insert statements. As well as the (relatively) negligible for loop and file access, you're creating a fully fledged model object (composed of many other underlying objects) for each of the 14000 words, not to mention the overhead of running 14000 db statements through Ruby. This is the main reason it's so much slower, not any inefficiency in the driver.

Tom
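Tom's point about per-row object construction can be illustrated without a database at all. The following toy benchmark is a hypothetical sketch, not from the thread: it uses a bare Struct as a stand-in for a model object (a real ActiveRecord object does far more work per row than this):

```ruby
require 'benchmark'

# Toy illustration, no database involved: building even a bare Struct per
# row costs more than touching the raw value, and ActiveRecord's create!
# does far more per row than this allocation.
Row = Struct.new(:word)
words = Array.new(14_000) { |i| "word#{i}" }

plain   = Benchmark.realtime { words.each { |w| w.length } }
structs = Benchmark.realtime { words.each { |w| Row.new(w) } }

puts format("plain strings:  %.4fs", plain)
puts format("struct per row: %.4fs", structs)
```

The absolute numbers are machine-dependent; the point is only that per-row allocation adds a cost that raw SQL never pays.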
On May 15, 2006, at 17:55, Tom Ward wrote:

> The ActiveRecord approach is doing far more work than simple SQL insert
> statements. As well as the (relatively) negligible for loop and file
> access, you're creating a fully fledged model object (composed of many
> other underlying objects) for each of 14000 words, not to mention the
> overhead of running 14000 db statements through ruby. This is the
> main reason it's so much slower, not any inefficiencies in the driver.

OK, I have ruled out the cost of running 14000 db statements through Ruby: this simple script

  require 'mysql'

  my = Mysql.connect(...)
  my.autocommit(false)
  st = my.prepare("insert into words (word) values (?)")
  File.foreach('words') do |line|
    line.chomp!
    st.execute(line)
  end
  my.commit

runs at the same speed as the SQL file. We seem to conclude that the overhead comes from AR alone then.

-- fxn
>> The ActiveRecord approach is doing far more work than simple SQL insert
>> statements. As well as the (relatively) negligible for loop and file
>> access, you're creating a fully fledged model object (composed of many
>> other underlying objects) for each of 14000 words, not to mention the
>> overhead of running 14000 db statements through ruby. This is the
>> main reason it's so much slower, not any inefficiencies in the driver.
>
> OK, I have ruled out the cost of running 14000 db statements through
> Ruby: this simple script
>
>   require 'mysql'
>
>   my = Mysql.connect(...)
>   my.autocommit(false)
>   st = my.prepare("insert into words (word) values (?)")
>   File.foreach('words') do |line|
>     line.chomp!
>     st.execute(line)
>   end
>   my.commit
>
> runs at the same speed as the SQL file. We seem to conclude that the
> overhead comes from AR alone then.

Just for kicks, remove your autocommit/commit lines and see what happens if you commit *per* insert. My understanding is this is what Active Record does....

-philip
Generally OR mapping tools such as ActiveRecord (or Hibernate, OJB, etc. in Java) shouldn't be used for batch transactions such as data migration or other large-scale data processing. Direct SQL would be much more efficient and logical. These tools should be used as they were intended: as an abstraction of a relational database into a data model for interactive usage. If you're thinking there's something wrong with ActiveRecord, there's not; it's just not meant for the kind of task you described below.

Best tool for the job, not the same tool for every job :)

Xavier Noria wrote:

> On May 15, 2006, at 17:55, Tom Ward wrote:
>
>> The ActiveRecord approach is doing far more work than simple SQL insert
>> statements. As well as the (relatively) negligible for loop and file
>> access, you're creating a fully fledged model object (composed of many
>> other underlying objects) for each of 14000 words, not to mention the
>> overhead of running 14000 db statements through ruby. This is the
>> main reason it's so much slower, not any inefficiencies in the driver.
>
> OK, I have ruled out the cost of running 14000 db statements through
> Ruby: this simple script
>
> [...]
>
> runs at the same speed as the SQL file. We seem to conclude that the
> overhead comes from AR alone then.
>
> -- fxn

--
Sau Sheong
http://blog.saush.com
http://www.projectible.com
http://jaccal.sourceforge.net
On May 15, 2006, at 18:28, Philip Hallstrom wrote:

> Just for kicks, remove your autocommit/commit lines and see what
> happens if you commit *per* insert. My understanding is this is
> what active record does....

Good. It takes about 10 seconds (versus about 3 with autocommit off). In the migration, if the loading is not wrapped in a transaction it takes about 80 seconds instead of 45. I use a transaction to get autocommit off as a side effect, a trick I learned the hard way loading data into SQLite :-).

-- fxn
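A middle ground between committing per insert and one giant transaction is to commit in batches. A minimal sketch of just the grouping logic (a hypothetical helper, not an AR API; in the migration each yielded batch would be wrapped in its own `Word.transaction`):

```ruby
# Hypothetical helper: group rows so each yielded batch can be wrapped
# in a single transaction, committing once per batch instead of once
# per row or once for all 14000 rows.
def each_batch(rows, batch_size = 1000)
  rows.each_slice(batch_size) { |batch| yield batch }
end

batch_sizes = []
each_batch((1..14_000).to_a, 5_000) { |b| batch_sizes << b.size }
# 14000 rows in batches of 5000 -> batches of 5000, 5000, 4000
```

Batching keeps any one transaction from growing huge while still avoiding a commit per statement.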
On May 15, 2006, at 18:31, Chang Sau Sheong wrote:

> Generally OR mapping tools such as ActiveRecord (or Hibernate, OJB, etc.
> in Java) shouldn't be used for batch transactions such as data migration
> or other large-scale data processing. Direct SQL would be much more
> efficient and logical. These tools should be used as they were intended:
> as an abstraction of a relational database into a data model for
> interactive usage. If you're thinking there's something wrong with
> ActiveRecord, there's not; it's just not meant for the kind of task
> you described below.
>
> Best tool for the job, not the same tool for every job :)

Yeah, yeah, no problem with AR whatsoever. Just technical curiosity that may lead to a more appropriate way of doing this in any case.

How do you load data like that in a way that feels natural working with migrations? I mean, linked to the migration itself somehow, portable, etc.?

-- fxn
This is from personal experience and might or might not be applicable to your case, but... most data migrations that I've done are one-shots, meaning it's a migration from one system (mostly legacy) to another. Nothing is really reusable because they are not likely to migrate again :) The old system is decommissioned and that's the end of the story (putting it simply of course -- usually they keep the data 'just in case' for a while).

But if you're talking about product migration, you will want to keep the code, in which case I would suggest that you keep it close to the data, that is, in the DB itself. For DBs with stored procedures, it might be worthwhile to write migration stored procedures; otherwise, some shell scripts (or Ruby scripts, which I'll fully recommend :) ) to execute some predefined SQL statements.

Just my 2 cents.

Xavier Noria wrote:

> On May 15, 2006, at 18:31, Chang Sau Sheong wrote:
>
>> [...]
>>
>> Best tool for the job, not the same tool for every job :)
>
> Yeah, yeah, no problem with AR whatsoever. Just technical curiosity
> that may lead to a more appropriate way of doing this in any case.
>
> How do you load data like that in a way that feels natural working
> with migrations? I mean, linked to the migration itself somehow,
> portable, etc.?
>
> -- fxn

--
Sau Sheong
http://blog.saush.com
http://www.projectible.com
http://jaccal.sourceforge.net
On May 15, 2006, at 18:47, Chang Sau Sheong wrote:

> This is from personal experience and might or might not be applicable
> to your case, but... most data migrations that I've done are one-shots,
> meaning it's a migration from one system (mostly legacy) to another.
> Nothing is really reusable because they are not likely to migrate again
> :) The old system is decommissioned and that's the end of the story
> (putting it simply of course -- usually they keep the data 'just in
> case' for a while).
>
> But if you're talking about product migration, you will want to keep
> the code, in which case I would suggest that you keep it close to the
> data, that is, in the DB itself. For DBs with stored procedures, it
> might be worthwhile to write migration stored procedures; otherwise,
> some shell scripts (or Ruby scripts, which I'll fully recommend :) )
> to execute some predefined SQL statements.

I see, thank you. I was talking indeed about migrations as in Rails jargon, what you run with rake db:migrate. Switching to execute() we go from ~45 s down to ~18 s, better! I wonder whether that can be improved yet further.

  say_with_time 'loading words...' do
    words = File.join(RAILS_ROOT, 'db', 'words')
    Word.transaction do
      suppress_messages do
        File.foreach(words) do |line|
          line.chomp!
          execute "insert into words (word) values ('#{line}')"
        end
      end
    end
  end

-- fxn
On May 15, 2006, at 10:06 AM, Xavier Noria wrote:

> I was talking indeed about migrations as in Rails jargon, what you
> run with rake db:migrate. Switching to execute() we go from ~45 s
> down to ~18 s, better! I wonder whether that can be yet improved.

The timing is faster, but this is not an improvement: you've just thrown quoting and type casting out the window.

Someone posted recently about having Active Record generate rather than execute SQL. This way, generating the batch is still costly, but executing its result is as fast as you'd expect.

Best,
jeremy
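Jeremy's suggestion — generate quoted SQL up front, then execute the result as one batch — might look like the following sketch. The quoting function here is a hypothetical stand-in; inside a migration you would use the adapter's own `Word.connection.quote`, which also handles type casting:

```ruby
# Hypothetical stand-in for the adapter's quote method (a real adapter
# also handles type casting and encoding, not just apostrophes).
def quote_sql_string(value)
  "'" + value.gsub("'", "''") + "'"
end

# Build one multi-row INSERT (MySQL syntax) so the whole batch
# executes as a single statement.
def batch_insert_sql(table, column, values)
  rows = values.map { |v| "(#{quote_sql_string(v)})" }.join(", ")
  "insert into #{table} (#{column}) values #{rows}"
end

sql = batch_insert_sql("words", "word", ["cat", "o'clock"])
# => insert into words (word) values ('cat'), ('o''clock')
```

Generating the string is still Ruby-speed work, but executing it is a single round trip, which is where the original 45 seconds were going.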
On 5/15/06, Xavier Noria <fxn@hashref.com> wrote:

> I was talking indeed about migrations as in Rails jargon, what you
> run with rake db:migrate. Switching to execute() we go from ~45 s
> down to ~18 s, better! I wonder whether that can be yet improved.

I had to solve a similar problem. First, I concatenated all the 'words' files I could find into one, not worrying about duplicates. Then, using a migration, I create the 'word' table, load the data into a temp table, then move it into the real table with a select that orders the words. Then I 'optimize' and 'analyze' the table in MySQL to speed up access to the records. It takes ~17s on my Athlon XP 1800 Linux box, and I'm left with a table that includes 335,171 words.

Here's my migration:

  class WordTable < ActiveRecord::Migration
    class Word < ActiveRecord::Base; end

    def self.up
      # Table: word
      create_table :word do |t|
        t.primary_key :id
        t.column :word, :string, :limit => 128, :null => false
      end

      # Load words the slow way
      # load_words_ar

      # Load words the fast way
      load_words_sql
    end

    def self.load_words_ar
      f = File.new('/path/words.txt')
      begin
        while (line = f.readline)
          line.chomp!
          w = Word.new
          w.word = line
          w.save
        end
      rescue EOFError
        f.close
      end
    end

    def self.load_words_sql
      db = Word.connection.current_database
      %x{mysql #{db} < 'db/sql/load_words.sql'}
      # %x{} does not raise on a nonzero exit status, so check $? explicitly
      puts "Execution of 'db/sql/load_words.sql' failed." unless $?.success?
    end

    def self.down
      drop_table :word
    end
  end

And here's the SQL that is referenced above:

  create temporary table t_word (
    word varchar(128) not null,
    id integer auto_increment not null,
    primary key(id),
    key x_word (word)
  );

  load data infile '/path/words.txt' replace into table t_word;

  delete from word;
  insert into word select id, word from t_word order by t_word.id;
  drop table t_word;

  alter table word add key x_word (word);
  optimize table word;
  analyze table word;

The temporary table stuff isn't really necessary.
The point is that the 'load data infile' command in MySQL is the fastest way to get data into your tables that I know of.

Daniel Butler
http://www.yup.com
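If the input file is generated from Ruby rather than taken verbatim, LOAD DATA INFILE's default format is simple: tab-separated columns, one row per line, so a single-column load is just one value per line. A minimal sketch (hypothetical helper, writing to a Tempfile for illustration):

```ruby
require 'tempfile'

# Hypothetical helper: write one value per line, which is the default
# input format LOAD DATA INFILE expects for a single-column load.
def write_load_file(values, io)
  values.each { |v| io.puts(v) }
  io.flush
end

file = Tempfile.new('words')
write_load_file(%w[alpha beta gamma], file)
# file now contains "alpha\nbeta\ngamma\n"
```

The migration would then point LOAD DATA INFILE at `file.path` (subject to MySQL's file-access permissions for the server process).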
On May 15, 2006, at 19:33, Jeremy Kemper wrote:

> On May 15, 2006, at 10:06 AM, Xavier Noria wrote:
>
>> I was talking indeed about migrations as in Rails jargon, what you
>> run with rake db:migrate. Switching to execute() we go from ~45 s
>> down to ~18 s, better! I wonder whether that can be yet improved.
>
> The timing is faster, but this is not an improvement: you've just
> thrown quoting and type casting out the window.

Yeah, but that is not a fault of the approach itself; AFAICT it is a limitation of execute(), which does not accept placeholders.

> Someone posted recently about having Active Record generate rather
> than execute SQL. This way, generating the batch is still costly
> but executing its result is as fast as you'd expect.

Interesting, but how do you execute the generated SQL from within the migration in a portable way?

-- fxn
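One answer to the portability question is to make the quoting pluggable: generate the statements through a callable, and in the migration pass the adapter's own quoting so the output matches whatever database is configured. A sketch (the generator and the naive quoter are hypothetical; `connection.quote` is the real AR adapter method you would pass instead):

```ruby
# Hypothetical sketch: generate INSERT statements through a pluggable
# quoter. In a migration you would pass Word.connection.method(:quote)
# so quoting stays portable across adapters, then run each statement
# through execute.
def generate_inserts(words, quoter)
  words.map { |w| "insert into words (word) values (#{quoter.call(w)})" }
end

# Naive quoter standing in for the adapter's quote method.
naive_quoter = ->(v) { "'" + v.gsub("'", "''") + "'" }
stmts = generate_inserts(%w[alpha beta], naive_quoter)
# => ["insert into words (word) values ('alpha')",
#     "insert into words (word) values ('beta')"]
```

This keeps the speed of execute() while restoring the quoting that the string-interpolated version threw away.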