Dear colleagues,

This may be a question with a really obvious answer, but I can't find it. I have access to a large file with real medical record identifiers (mixed strings of characters and numbers) in it. These represent medical events for many thousands of people. It's important to be able to link events for the same people.

It's much more important that the real record numbers are strongly obscured. I'm interested in some kind of strong one-way hash function to which I can feed the real numbers and get back unique codes for each record identifier fed in. I can do this on the health service system, and I have to do this before making further use of the data!

There is the 'digest' function, in the digest package, but this seems to work on the whole vector of IDs, producing, in my case, a vector with 60,000 identical entries.

H.Out$P_ID = digest(H.In$MRNr,serialize=FALSE, algo='md5')

I could do this in Perl, but I'd have to do quite a bit of work to get it installed.

Any quick suggestions?
Anthony Staines
--
Anthony Staines, Professor of Health Systems Research,
School of Nursing, Dublin City University, Dublin 9, Ireland.
Tel:- +353 1 700 7807. Mobile:- +353 86 606 9713
Seeliger.Curt at epamail.epa.gov
2011-Jan-05 21:42 UTC
[R] Advice on obscuring unique IDs in R
Dr. Anthony wrote on 01/05/2011 01:19:49 PM:
> This may be a question with a really obvious answer, but I
> can't find it. I have access to a large file with real
> medical record identifiers (mixed strings of characters and
> numbers) in it. ...

It's not that trivial of a question, or more organizations would have gotten it right.

I bet a method (or two) for obscuring PII is recommended by your university or department. When that method has been determined, the requisite R package will probably be easy to find, and down the road you'll dodge the bullet of "I thought it would work" by not guessing at a method.

cur
--
Curt Seeliger, Data Ranger
Raytheon Information Services - Contractor to ORD
seeliger.curt@epa.gov
541/754-4638
On Jan 5, 2011, at 3:19 PM, Anthony Staines wrote:
> Dear colleagues,
>
> This may be a question with a really obvious answer, but I
> can't find it. I have access to a large file with real
> medical record identifiers (mixed strings of characters and
> numbers) in it. These represent medical events for many
> thousands of people. It's important to be able to link
> events for the same people.
>
> It's much more important that the real record numbers are
> strongly obscured. I'm interested in some kind of strong
> one-way hash function to which I can feed the real numbers
> and get back unique codes for each record identifier fed
> in. I can do this on the health service system, and I have
> to do this before making further use of the data!
>
> There is the 'digest' function, in the digest package, but
> this seems to work on the whole vector of IDs, producing, in
> my case, a vector with 60,000 identical entries.
>
> H.Out$P_ID = digest(H.In$MRNr,serialize=FALSE, algo='md5')
>
> I could do this in Perl, but I'd have to do quite a bit of
> work to get it installed.
>
> Any quick suggestions?
> Anthony Staines

Try using sapply():

L <- replicate(60000, paste(sample(letters, 10, replace = TRUE), collapse = ""))

> str(L)
 chr [1:60000] "dfederergw" "nwphehurvb" "avzmvltrhn" ...

> head(L)
[1] "dfederergw" "nwphehurvb" "avzmvltrhn" "ecmeiasmbk" "kmlcxydygl"
[6] "wpftnyrzwe"

# Use sapply() to run digest() over each element of L
> system.time(L.Digest <- sapply(L, digest))
   user  system elapsed
  6.920   0.031   7.361

> str(L.Digest)
 Named chr [1:60000] "6d5861904ee004d251504cb0f731a69a" ...
 - attr(*, "names")= chr [1:60000] "dfederergw" "nwphehurvb" "avzmvltrhn" "ecmeiasmbk" ...

> head(L.Digest)
                        dfederergw                         nwphehurvb
"6d5861904ee004d251504cb0f731a69a" "bf8ee61f69c83468988cad681a9f7ad0"
                        avzmvltrhn                         ecmeiasmbk
"ba1c66af41359cf1a3f5e91f22c6dfe5" "95ca2deaa6c1118852c9ffed71994a7f"
                        kmlcxydygl                         wpftnyrzwe
"f3647a7937a2c484123ef33bb52a27ac" "e84f17180703e4805493d88a760be682"

HTH,

Marc Schwartz
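The per-element idea above can also be sketched with vapply(), hashing each distinct identifier once and mapping the codes back onto the full vector, so repeated identifiers always receive the same code. This is only a minimal sketch under stated assumptions: the helper name obscure_ids and its salt argument are hypothetical and do not come from the thread, and the choice of SHA-256 with a secret prefix is an illustration, not a recommendation that replaces whatever method your institution prescribes. serialize = FALSE makes digest() hash the string itself rather than R's serialized representation of it.

```r
library(digest)

# Hypothetical helper (not from the thread): one-way codes for a vector of IDs.
# 'salt' is an assumed secret string prepended to each ID before hashing.
obscure_ids <- function(ids, salt = "", algo = "sha256") {
  ids <- as.character(ids)
  uniq <- unique(ids)                      # hash each distinct ID only once
  codes <- vapply(
    uniq,
    function(id) digest(paste0(salt, id), algo = algo, serialize = FALSE),
    character(1),
    USE.NAMES = FALSE                      # avoid carrying the real IDs as names
  )
  codes[match(ids, uniq)]                  # map codes back to the original order
}

# Made-up identifiers, for illustration only
ids <- c("A123", "B456", "A123", "C789")
codes <- obscure_ids(ids, salt = "replace-with-a-secret")
```

Note that sapply() as used above keeps the original strings as the names of the result, so the real identifiers would travel with the hashes unless the names are dropped; USE.NAMES = FALSE avoids that here.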
On Wed, Jan 05, 2011 at 09:19:49PM +0000, Anthony Staines wrote:
> Dear colleagues,
>
> This may be a question with a really obvious answer, but I
> can't find it. I have access to a large file with real
> medical record identifiers (mixed strings of characters and
> numbers) in it. These represent medical events for many
> thousands of people. It's important to be able to link
> events for the same people.
>
> It's much more important that the real record numbers are
> strongly obscured. I'm interested in some kind of strong
> one-way hash function to which I can feed the real numbers
> and get back unique codes for each record identifier fed
> in. I can do this on the health service system, and I have
> to do this before making further use of the data!

Producing unique integer codes for character values may be done using a factor, for example

s <- c("cd", "bc", "ab", "bc", "ab")
f <- factor(s)
as.integer(f) # [1] 3 2 1 2 1
levels(f) # [1] "ab" "bc" "cd"

If the codes should be ordered by the first occurrence in the data, then use

f <- factor(s, levels=unique(s))
as.integer(f) # [1] 1 2 3 2 3
levels(f) # [1] "cd" "bc" "ab"

This does not perform any approximate matching. The codes are assigned based on exact equality. If approximate matching is required, then an example of the identifiers would be helpful. Filtering out different types of delimiters may be done as a preprocessing step, for example, using gsub()

s <- c("ab cd", "ab cd", "a b cd")
gsub(" ", "", s) # [1] "abcd" "abcd" "abcd"

where a general regular expression may also be used.

Petr Savicky.
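One property of the factor approach worth keeping in mind: with factor(s) the integer codes follow the sorted order of the original identifiers, and with levels=unique(s) they follow the order of first appearance in the file. If neither ordering should be visible in the codes, a possible variation is to shuffle the distinct values before numbering them. The following is a rough sketch of that idea, not something proposed in the thread; the variable name lookup is made up for the example, and set.seed() is there only to make the illustration repeatable.

```r
set.seed(1)                      # only for a repeatable illustration

s <- c("cd", "bc", "ab", "bc", "ab")

# Shuffle the distinct identifiers, then number them by shuffled position.
lookup <- sample(unique(s))      # random permutation of the distinct values
codes <- match(s, lookup)        # equal identifiers get equal integer codes
```

The lookup vector is the only link between codes and real identifiers, so it would need to be either discarded or stored as securely as the identifiers themselves. Codes produced this way are consistent only within one run, so linking a later file would require reusing the same lookup.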