I have a Rails app with a contact list that needs to interface with Outlook and at least one other external data source. My fear is that as my client throws CSVs at my Web app, I will have subtly different items referring to the same contact.

I've come up with several alternate approaches to this but thought I'd ask if anyone else has already faced this problem. FWIW, here were two approaches I felt might work:

1. Tag contacts that have already been sync'ed with Outlook. Strangely, Outlook does not provide any unique identifier with its contact information so this would have to be done in some custom field. Ack!

2. Use a proximity or fuzzy match to determine whether the same contact is being updated. So, "Sam Smith" and "Sammy Smith" might be the same person, but "Sam Jones" would not be. The user could then manually resolve possible duplicates.

Regarding (2), ferret seems like a good way to get a Levenshtein distance for my existing data, as the data can be indexed as added, economizing on the matching hassle later.

Anyone have any thoughts or experience with this?

Thanks

--
View this message in context: http://www.nabble.com/-OT--finding-fuzzy-duplicate-data-tf2633426.html#a7350065
Sent from the RubyOnRails Users mailing list archive at Nabble.com.

--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the Google Groups "Ruby on Rails: Talk" group.
To post to this group, send email to rubyonrails-talk-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org
To unsubscribe from this group, send email to rubyonrails-talk-unsubscribe-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org
For more options, visit this group at http://groups.google.com/group/rubyonrails-talk?hl=en
-~----------~----~----~----~------~----~------~--~---
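
A minimal sketch of the fuzzy match in approach (2), assuming plain Ruby with no gems. It computes a Levenshtein (edit) distance and flags a pair of names for manual review when the distance is small relative to the longer name. The helper name `possible_duplicate?` and the 0.25 ratio are assumptions made for illustration; they are not part of ferret or any other library, and the threshold would need tuning against real data.

```ruby
# Dynamic-programming Levenshtein distance, case-insensitive.
# Only two rows of the matrix are kept in memory at a time.
def levenshtein(a, b)
  a = a.downcase
  b = b.downcase
  return b.length if a.empty?
  return a.length if b.empty?

  prev = (0..b.length).to_a
  a.each_char.with_index(1) do |ca, i|
    curr = [i]
    b.each_char.with_index(1) do |cb, j|
      cost = ca == cb ? 0 : 1
      curr << [prev[j] + 1,          # deletion
               curr[j - 1] + 1,      # insertion
               prev[j - 1] + cost].min # substitution
    end
    prev = curr
  end
  prev.last
end

# Hypothetical helper: true when the two names are close enough that a
# human should look at them. max_ratio is an arbitrary starting point.
def possible_duplicate?(name_a, name_b, max_ratio = 0.25)
  longest = [name_a.length, name_b.length].max
  return true if longest.zero?
  levenshtein(name_a, name_b).to_f / longest <= max_ratio
end

puts levenshtein("Sam Smith", "Sammy Smith")         # 2
puts possible_duplicate?("Sam Smith", "Sammy Smith") # true
puts possible_duplicate?("Sam Smith", "Sam Jones")   # false
```

Dividing by the longer name's length keeps short names from matching too eagerly: two edits matter more in a five-letter name than in a twenty-letter one.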
Steve Ross wrote:
> Anyone have any thoughts or experience with this?

The only other thought that comes to mind is maybe using a Soundex algorithm (http://www.creativyst.com/Doc/Articles/SoundEx1/SoundEx1.htm) in combination with your other ideas.

c.

--
Posted via http://www.ruby-forum.com/.
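
For reference, a small sketch of the Soundex idea, assuming the common American Soundex rules (keep the first letter, encode the following consonants as digits, collapse runs of the same digit, treat h and w as transparent, pad to four characters). It is pure Ruby written for illustration, not the code from the linked article, and it only handles ASCII letters.

```ruby
# Consonant-to-digit table for American Soundex. Vowels, y, h and w
# have no entry.
SOUNDEX_CODES = {
  "b" => "1", "f" => "1", "p" => "1", "v" => "1",
  "c" => "2", "g" => "2", "j" => "2", "k" => "2",
  "q" => "2", "s" => "2", "x" => "2", "z" => "2",
  "d" => "3", "t" => "3",
  "l" => "4",
  "m" => "5", "n" => "5",
  "r" => "6"
}.freeze

def soundex(word)
  letters = word.downcase.gsub(/[^a-z]/, "")
  return "" if letters.empty?

  result = letters[0].upcase
  last_code = SOUNDEX_CODES[letters[0]]
  letters[1..-1].each_char do |ch|
    next if ch == "h" || ch == "w" # h/w do not break a run of equal codes
    code = SOUNDEX_CODES[ch]
    result << code if code && code != last_code
    last_code = code               # a vowel sets this to nil, ending the run
  end
  result[0, 4].ljust(4, "0")
end

puts soundex("Sam")   # S500
puts soundex("Sammy") # S500
puts soundex("Smith") # S530
puts soundex("Jones") # J520
```

Applied per word, "Sam" and "Sammy" get the same code while "Smith" and "Jones" do not, which lines up with the example in the original question.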
> Regarding (2), ferret seems like a good way to get a Levenshtein distance
> for my existing data, as the data can be indexed as added, economizing on
> the matching hassle later.
>
> Anyone have any thoughts or experience with this?

There's also metaphone (http://en.wikipedia.org/wiki/Metaphone) which is supposed to be better than soundex. Not sure how Levenshtein fits in.

I also don't remember how well any of them do with *names* as opposed to similar sounding, but *normal* words.

Couldn't you match the phone numbers up? Odds are if home, business, and cell all match it's the same person...

-philip
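
The phone-number idea can be sketched as follows, assuming each contact is a hash with `:home`, `:business` and `:cell` keys (hypothetical field names, not taken from any real schema). Numbers are reduced to digits first so that "(555) 123-4567" and "555.123.4567" compare equal.

```ruby
# Hypothetical field names; adjust to whatever the contact model uses.
PHONE_FIELDS = [:home, :business, :cell].freeze

# Strip everything except digits so formatting differences disappear.
def normalize_phone(raw)
  raw.to_s.gsub(/\D/, "")
end

# True when at least one phone field is filled in and every filled-in
# field agrees between the two contacts.
def same_phones?(a, b)
  pairs = PHONE_FIELDS.map { |f| [normalize_phone(a[f]), normalize_phone(b[f])] }
  filled = pairs.reject { |x, y| x.empty? && y.empty? }
  !filled.empty? && filled.all? { |x, y| x == y }
end

outlook = { home: "(555) 123-4567", business: "555-987-6543", cell: nil }
csv     = { home: "555.123.4567",   business: "555 987 6543", cell: "" }
puts same_phones?(outlook, csv) # true
```

This sketch does not handle country codes or extensions ("+1 555..." will not match "555..."), so it is a starting point only.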
It looks like the text gem has a lot of good implementations of these algorithms. I'm going to throw a bunch of test data at this and see what happens. Metaphone, soundex and Levenshtein all seem promising...

I'm concerned that if the change were, say, a change of address or a change of phone number, the test mentioned below would fail erroneously, allowing a duplicate into the database. However, names often stay the same. Eeeek, except when people change them because of marriage or personal preference. Hmmmmm.

Am I overthinking this?

Thanks,

Steve

Philip Hallstrom-8 wrote:
> There's also metaphone (http://en.wikipedia.org/wiki/Metaphone) which is
> supposed to be better than soundex. Not sure how levenshtein fits in.
>
> I also don't remember how well any of them do with *names* as opposed to
> similar sounding, but *normal* words.
>
> Couldn't you match the phone numbers up? Odds are if home,business,cell
> all match it's the same person...
>
> -philip

--
View this message in context: http://www.nabble.com/-OT--finding-fuzzy-duplicate-data-tf2633426.html#a7352074
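
One way to hedge against a single field changing (a new phone number, a new surname) is to compare several fields and flag a pair for manual review when most of them still agree. The sketch below assumes contacts are hashes with `:name`, `:phone` and `:address` keys and uses exact comparison after crude normalization; the field names, the helper names and the two-of-three threshold are all illustrative assumptions.

```ruby
# Hypothetical contact fields; adjust to the real schema.
FIELDS = [:name, :phone, :address].freeze

# Crude normalization: lowercase and keep only letters and digits.
def normalize(value)
  value.to_s.downcase.gsub(/[^a-z0-9]/, "")
end

# Fields that are non-empty and equal after normalization.
def matching_fields(a, b)
  FIELDS.select do |f|
    x = normalize(a[f])
    !x.empty? && x == normalize(b[f])
  end
end

# Flag the pair for a human to resolve when enough fields agree.
def review_candidate?(a, b, min_matches = 2)
  matching_fields(a, b).size >= min_matches
end

old_row = { name: "Sam Smith", phone: "555-1234", address: "1 Main St." }
moved   = { name: "Sam Smith", phone: "555-9999", address: "1 main st" }
puts review_candidate?(old_row, moved) # true (name and address agree)
```

The exact comparison on `:name` could be swapped for a fuzzy or phonetic comparison; the point here is only that no single field has to match for a pair to reach the review queue.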