Stack Overflow and various forums are rife with questions about how to efficiently update or insert ActiveRecords into a table. (In fact, one of my earliest questions to this forum was on this very topic.) My recent work regularly calls for an efficient "update or insert" method for tens of thousands of records. After a fair amount of googling, I don''t see a generalized solution, so I''ve decided to design and implement the function for real. In this note, I''m proposing a design specifically for critique and feedback. (First question: are there other forums where I should be posing this question?) A typical use case might look like this: ====class MyModel < ActiveRecord::Base with_update_or_insert def self.bulk_load(csv_file) # generate a lot of records records = CSV.table(csv_file).map {|row| MyModel.new(row.to_hash)} # :name and :group form a compound key for determining uniqueness update_or_insert(records, :key => [:name, :group]) end end ==== The method itself is defined as follows: def self.update_or_insert(array_of_records, options = {}) ... end The options: :key determines what column or columns determine uniqueness of each record. May be a symbol that names a single column or an array of symbols that names a compound key. If omitted or blank, records are inserted without checking whether they are already present in the db. :on_duplicate determines what happens when a duplicate is present in the db. Recognized values are :update (the default), :ignore, :error. For :update, the existing record is updated. For :ignore, the existing record is untouched. For :error, the method raises an error. :batch_size determines the maximum number of records to be processed in one transaction. If omitted, defaults to a value specific to the current database adaptor. Regardless of the value given, the current database adaptor may impose a smaller value. Comments and open issues: C1: I''m assuming that some parts of update_or_insert will dispatch to database specific code. I''ve got a pretty good sense of what PostgreSQL, MySQL and SQLite need to implement this. I have no clue what the other database adaptors (e.g. MongoDB) will require. C2: Type conversion might be problematic. Different backends require different type-specific quoting. (But ActiveRecord appears to know how to do this -- I should be able to use its code.) C3: I''ve assuming that array_of_records is an array of unsaved ActiveRecords. But perhaps it should accept hash-like objects to make it independent of ActiveRecord. If that''s the case, we lose the per-column type information (see C2 above). C4: I''m assuming that for efficiency, ActiveRecord validations will NOT be called, nor will the in-memory ActiveRecords be updated. In effect, the ActiveRecords are simply in-memory structures to hold the data. C5: duplicate records in array_of_records may be problematic for some dbs. The most general ANSI compliant implementations of UPSERT essentially makes two db calls: an UPDATE for records that are already present and an INSERT for records that are new. But if the array_of_records itself contains duplicate records, the INSERT operation will attempt to insert them twice. If there is a unique index on the table, the db will raise an error, otherwise you''ll end up with duplicate copies, which could be unexpected. Can you tolerate this as a restriction? Or should update_or_insert do an extra operation on the array_of_records to detect duplicates? C6: How should this be packaged? (Hint: I''ve never created a gem before.) I''d like to start with support for one DB (likely PostgreSQL), but provide a framework to make it easy to support other adaptors. C7: What value, if any, should update_or_insert return? Have your eyes glazed over yet? Comments are warmly welcomed... - ff -- Posted via http://www.ruby-forum.com/. -- You received this message because you are subscribed to the Google Groups "Ruby on Rails: Talk" group. To post to this group, send email to rubyonrails-talk-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org To unsubscribe from this group, send email to rubyonrails-talk+unsubscribe-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org For more options, visit this group at http://groups.google.com/group/rubyonrails-talk?hl=en.
Seamus Abshere
2012-Jul-05 16:08 UTC
Re: design of bulk UPSERT (aka MERGE): update_or_insert
hey ff, You might want to take a look at https://github.com/seamusabshere/upsert - released same week you posted this. Best, Seamus -- Posted via http://www.ruby-forum.com/. -- You received this message because you are subscribed to the Google Groups "Ruby on Rails: Talk" group. To post to this group, send email to rubyonrails-talk-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org To unsubscribe from this group, send email to rubyonrails-talk+unsubscribe-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org For more options, visit this group at http://groups.google.com/group/rubyonrails-talk?hl=en-US.