Jeff Johnson
2014-Jan-17 04:38 UTC
[R] Any recommendations for reusable profiling of name fields?
Hi,

I'm pretty new to R and am trying to develop a reusable set of scripts that I can use to profile various data types and common fields in our database. I know that what I'm asking is a can of worms, so please bear with me. :)

For example, we store a person's first name, last name, phone number, email address, last gift amount, gift date, etc., as well as integer-type data. I'm wondering if there's a "best practice" for validating a field that holds, for example, first name or last name. A couple of things I've come up with are:

1) Count of characters (nchar) in the first (or last) name field
2) Number of unique tokens
3) Patterns (converting alpha to A and numeric to N), counting the frequency of each unique pattern that results. I suppose I could make lower case alpha 'a' and upper case 'A' to be more specific.
4) Min and max name (helps identify those with leading spaces or numbers)

Does anyone have more suggestions for techniques that are common or that you'd recommend for name fields? Ultimately, I'm looking to develop a common set of profiles for various data types, so if there's a white paper (I've googled, but not found any that hit the mark yet) I'd love to see it. Perhaps there's even a package for this type of thing?

Thanks much!

-- Jeff
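
For concreteness, the four checks listed above could be sketched in base R roughly as follows. This is only a minimal sketch under stated assumptions, not an established best practice: `first_name` is made-up sample data, `name_pattern` and `profile_names` are hypothetical helper names, and "unique tokens" is interpreted here as distinct non-missing values.

```r
# Made-up sample data (an assumption); replace with the real name column.
first_name <- c("Jeff", " Ann", "mary-jo", "O'Neil", "J3ff", "Jeff", NA, "")

# Hypothetical helper: map each value to a character pattern.
# Uppercase letters become "A", lowercase "a", digits "N";
# everything else (spaces, punctuation) is left as-is.
# The order matters: digits are replaced last so the inserted "N"
# is not re-mapped by the uppercase rule.
name_pattern <- function(x) {
  x <- gsub("[[:upper:]]", "A", x)
  x <- gsub("[[:lower:]]", "a", x)
  gsub("[[:digit:]]", "N", x)
}

# Hypothetical helper: collect the four profiles (plus two extras)
# for one character field.
profile_names <- function(x) {
  x <- as.character(x)
  present <- x[!is.na(x)]  # profile only the non-missing values
  list(
    n_missing      = sum(is.na(x)),
    # 1) distribution of character counts
    length_counts  = table(nchar(present)),
    # 2) number of distinct values
    n_unique       = length(unique(present)),
    # 3) frequency of each pattern, most common first
    pattern_counts = sort(table(name_pattern(present)), decreasing = TRUE),
    # 4) smallest and largest value in the current locale's sort order
    min_value      = if (length(present)) min(present) else NA_character_,
    max_value      = if (length(present)) max(present) else NA_character_,
    # extra: values with leading or trailing whitespace
    n_padded       = sum(present != trimws(present))
  )
}

profile <- profile_names(first_name)
str(profile)
```

Rare entries at the tail of `pattern_counts` (for example a pattern containing "N" in a name field) are usually the ones worth inspecting by hand. Note that `min()` and `max()` on character vectors depend on the locale's collation, so which values surface first may differ between systems.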