How are others doing duplicate record detection? I'm not finding very many solutions, or methods online. I found one called SimString, but not much else. I was wondering how others are detecting duplicates. Similar to suggested items, I would like to show records that may match or have similar attributes.

--
You received this message because you are subscribed to the Google Groups "Ruby on Rails: Talk" group.
To post to this group, send email to rubyonrails-talk-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To unsubscribe from this group, send email to rubyonrails-talk+unsubscribe-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
For more options, visit this group at http://groups.google.com/group/rubyonrails-talk?hl=en.
validates_uniqueness_of might be of help?

On Jul 10, 6:21 pm, Justin Stanczak wrote:
> How are others doing duplicate record detection? I'm not finding very many
> solutions, or methods online.
That would help slow the duplication, but if someone fills out a form and submits fname, lname, ss#, and they typo the ss# I would have a duplicate. I would like to display to admin users that this record has a related link, or is similar. Similar to how Google finds duplicates in your contacts and merges them.

On Mon, Jul 11, 2011 at 12:25 PM, pepe wrote:
> validates_uniqueness_of might be of help?
On Mon, Jul 11, 2011 at 9:42 AM, Justin Stanczak wrote:
> That would help slow the duplication, but if someone fills out a form and
> submits fname, lname, ss#, and they typo the ss# I would have a duplicate.

"slow the duplication"?? No, ensuring that the SSNs *are unique* via validations and unique indexes would prevent duplicates, period.

What is it about that you don't like?

--
Hassan Schroeder ------------------------ hassan.schroeder-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org
http://about.me/hassanschroeder
twitter: @hassan
I don't believe you're going to find a magic formula for what you're suggesting. What you suggest could happen with SSNs could just as well happen with first or last names. What if somebody misspells Smith as Smit, for example? But worse yet, what if it is not a misspelling and the Smit is actually Smit? The same is true for SSNs: switching the last 2 digits does not mean it was a "misspell"; it could just be that 2 different people have the same name and very similar SSNs. You have to draw a line somewhere, I think.

You could use auto-complete fields and then offer options based on records found with the 'LIKE' operator in the where clause, using the information currently being entered. That might help but I think you'll find it's not worth the effort.

On Jul 11, 12:42 pm, Justin Stanczak wrote:
> That would help slow the duplication, but if someone fills out a form and
> submits fname, lname, ss#, and they typo the ss# I would have a duplicate. I
> would like to display to admin users that this record has a related link, or
> is similar.
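As a rough sketch of the LIKE-based auto-complete lookup suggested above: the helper below only builds a prefix pattern from whatever the user has typed so far, escaping the LIKE wildcards first. The helper name `like_prefix_pattern`, the `Person` model, and the `lname` column are assumptions for illustration, and the escaping assumes the database treats backslash as the LIKE escape character (some databases need an explicit ESCAPE clause).

```ruby
# Escape LIKE wildcards (% and _) and the backslash itself in user input,
# then append % so the pattern matches anything starting with the input.
# Assumption: backslash is the LIKE escape character in the target database.
def like_prefix_pattern(term)
  term.gsub(/[\\%_]/) { |ch| "\\#{ch}" } + "%"
end

puts like_prefix_pattern("Smi")
puts like_prefix_pattern("50%_off")

# Hypothetical use inside a Rails app with a Person model:
#   Person.where("lname LIKE ?", like_prefix_pattern(params[:term])).limit(10)
```

A prefix match like this only catches entries that start the same way, so it would not surface a record whose first characters were mistyped.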
It would stop duplication of that unique string, but not fix a typo like transposed numbers. Besides, that was a simple example, not meant to be challenged. It's the process of detecting duplicates that I'm looking for. I know how to validate and key tables. Maybe another example is loading a million records from an external source, where you need to find duplicates. I'm just asking if anyone has seen APIs or Ruby utilities that perform this function. SimString, for example, seems to compare how close two strings are to matching.

On Mon, Jul 11, 2011 at 4:02 PM, Hassan Schroeder wrote:
> "slow the duplication"?? No, ensuring that the SSNs *are unique* via
> validations and unique indexes would prevent duplicates, period.
>
> What is it about that you don't like?
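To make the transposed-numbers case concrete, here is a minimal sketch in plain Ruby that flags two values as suspicious when they differ only by one swap of neighbouring characters. The method name `adjacent_transposition?` is made up for this example; it is not part of SimString or of any gem mentioned in this thread.

```ruby
# Returns true when a and b have the same length and differ only by
# one swap of two adjacent characters, e.g. "...6789" vs "...6798".
def adjacent_transposition?(a, b)
  return false unless a.length == b.length

  # Indexes at which the two strings disagree.
  diffs = (0...a.length).select { |i| a[i] != b[i] }
  return false unless diffs.length == 2

  i, j = diffs
  j == i + 1 && a[i] == b[j] && a[j] == b[i]
end

puts adjacent_transposition?("123-45-6789", "123-45-6798")  # true
puts adjacent_transposition?("123-45-6789", "123-45-6780")  # false
```

A hit here only means "worth showing to an admin"; as noted earlier in the thread, two different people can legitimately have values this close.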
Yes, this is all very true. I was thinking if a comparison was done on multiple attributes that would help with just one name being wrong. I'm not looking for magic, just wondering how others find duplicated records. I could see this being used to detect data that links or is similar in nature.

Found this as well. http://en.wikipedia.org/wiki/User:Ipeirotis/Duplicate_Record_Detection

On Mon, Jul 11, 2011 at 4:17 PM, pepe wrote:
> I don't believe you're going to find a magic formula for what you're
> suggesting.
> [...]
> You have to draw a line somewhere, I think.
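The idea of comparing multiple attributes so that one wrong field does not hide a match could look something like the sketch below. The weights, the `match_score` name, and the choice of exact comparison after normalization are all assumptions for illustration; the field names fname, lname, and ssn are the ones used earlier in the thread.

```ruby
# Hypothetical weights: how much each attribute counts toward a match.
WEIGHTS = { fname: 1.0, lname: 2.0, ssn: 3.0 }.freeze

# Normalize a value so case and surrounding whitespace do not matter.
def normalize(value)
  value.to_s.strip.downcase
end

# Weighted share of attributes on which the two records agree (0.0 to 1.0).
def match_score(a, b, weights = WEIGHTS)
  agreed = weights.sum do |field, weight|
    normalize(a[field]) == normalize(b[field]) ? weight : 0.0
  end
  agreed / weights.values.sum
end

typed  = { fname: "Justin",  lname: "Smith", ssn: "123-45-6798" }
stored = { fname: "justin ", lname: "Smith", ssn: "123-45-6789" }

puts match_score(typed, stored)  # 0.5
```

Here the mistyped ssn costs half of the total weight, but both names still agree, so the pair scores 0.5. Where to set the cutoff for showing a pair to an admin is a judgment call, not something this sketch decides.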
Whenever I have worked on similar projects it ended up being the customer's idea of what a "close approximation" was that made a possible duplicate. It was usually something like:

- same birth date
- same last name
- same city (optional)
- same state (optional)

Since there is no such thing as a tried and true method for deciding what a duplicate record is, I believe you'll just need to do some manual work. My advice would be to ask your customer/boss what the rules are.

On Jul 11, 4:29 pm, Justin Stanczak wrote:
> Yes, this is all very true. I was thinking if a comparison was done on
> multiple attributes that would help with just one name being wrong. I'm not
> looking for magic, just wondering how others find duplicated records.
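Rules like the ones listed above (same birth date, same last name, optionally same city or state) can be sketched as a grouping step: build a key from the agreed fields and report every group with more than one record. The sample data, field names, and method names below are assumptions for illustration, not anything from a real customer's rule set.

```ruby
require "date"

# Sample records; the field names are assumptions for illustration.
people = [
  { id: 1, last_name: "Smith",  birth_date: Date.new(1980, 5, 1), state: "IN" },
  { id: 2, last_name: "smith ", birth_date: Date.new(1980, 5, 1), state: "IN" },
  { id: 3, last_name: "Smit",   birth_date: Date.new(1980, 5, 1), state: "IN" },
  { id: 4, last_name: "Jones",  birth_date: Date.new(1975, 1, 9), state: "OH" }
]

# Build the comparison key from the fields the rules name.
# Strings are trimmed and lowercased; other values are used as-is.
def duplicate_key(record, fields)
  fields.map do |field|
    value = record[field]
    value.is_a?(String) ? value.strip.downcase : value
  end
end

# Group records by key and keep only groups with more than one member.
def possible_duplicates(records, fields)
  records.group_by { |record| duplicate_key(record, fields) }
         .values
         .select { |group| group.size > 1 }
end

groups = possible_duplicates(people, [:birth_date, :last_name, :state])
puts groups.map { |group| group.map { |record| record[:id] } }.inspect  # [[1, 2]]
```

Because the key is an exact match after normalization, "Smit" (id 3) is not grouped with "Smith". Catching that case would need a fuzzy comparison on top of the rules.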
On Mon, Jul 11, 2011 at 1:56 PM, pepe wrote:
> Whenever I have worked on similar projects it ended up being the
> customer's idea of what a "close approximation" was that made a
> possible duplicate.

Exactly -- if they're not *identical* they're not "duplicates".

On the other hand if you define "similarity" to some degree you can use e.g. the Levenshtein gem to measure how "different" 2 given fields are.

    > Levenshtein.distance("Hassan Schroeder", "Hassan A. Schroeder")
    => 3

HTH!

--
Hassan Schroeder
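For anyone who wants to see what such a distance measures without installing a gem, the following is a plain-Ruby sketch of the standard dynamic-programming edit distance (insertions, deletions, substitutions). It is not the Levenshtein gem's implementation, and the method name `levenshtein` is just a label for this example.

```ruby
# Classic dynamic-programming edit distance, counting single-character
# insertions, deletions, and substitutions. Only the previous row of the
# table is kept, so memory use is proportional to the length of b.
def levenshtein(a, b)
  prev = (0..b.length).to_a
  a.each_char.with_index(1) do |char_a, i|
    curr = [i]
    b.each_char.with_index(1) do |char_b, j|
      cost = char_a == char_b ? 0 : 1
      curr << [prev[j] + 1,            # delete char_a
               curr[j - 1] + 1,        # insert char_b
               prev[j - 1] + cost].min # substitute or keep
    end
    prev = curr
  end
  prev.last
end

puts levenshtein("Hassan Schroeder", "Hassan A. Schroeder")  # 3
puts levenshtein("Smith", "Smit")                            # 1
```

Note that a swap of two adjacent digits counts as 2 under this measure, since it is treated as two substitutions.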
Yes, this is very nice.

On Mon, Jul 11, 2011 at 5:04 PM, Hassan Schroeder wrote:
> On the other hand if you define "similarity" to some degree you can use
> e.g. the Levenshtein gem to measure how "different" 2 given fields are.