thr3ads.net - R help - [R] Advice on obscuring unique IDs in R [Jan 2011]

Anthony Staines

2011-Jan-05 21:19 UTC

[R] Advice on obscuring unique IDs in R

Dear colleagues,

This may be a question with a really obvious answer, but I
can't find it. I have access to a large file with real
medical record identifiers (mixed strings of characters and
numbers) in it. These represent medical events for many
thousands of people. It's important to be able to link
events for the same people.

It's much more important that the real record numbers are
strongly obscured. I'm interested in some kind of strong
one-way hash function to which I can feed the real numbers
and get back unique codes for each record  identifier fed
in. I can do this on the health service system, and I have
to do this before making further use of the data!

There is the 'digest' function, in the digest package, but
this seems to work on the whole vector of IDs, producing, in
my case, a vector with 60,000 identical entries.

H.Out$P_ID = digest(H.In$MRNr,serialize=FALSE, algo='md5')

I could do this in Perl, but I'd have to do quite a bit of
work to get it installed.

Any quick suggestions?
Anthony Staines
-- 
Anthony Staines, Professor of Health Systems Research,
School of Nursing, Dublin City University, Dublin 9,Ireland.
Tel:- +353 1 700 7807. Mobile:- +353 86 606 9713

Seeliger.Curt at epamail.epa.gov

2011-Jan-05 21:42 UTC

Dr. Anthony wrote on 01/05/2011 01:19:49 PM:
> This may be a question with a really obvious answer, but I
> can't find it. I have access to a large file with real
> medical record identifiers (mixed strings of characters and
> numbers) in it. ...
It's not that trivial of a question, or more organizations would have
gotten it right.  I bet a method (or two) for obscuring PII is recommended
by your university or department.  When that method has been determined,
the requisite R package will probably be easy to find, and down the road
you'll dodge the bullet of "I thought it would work" by not guessing at a
method.

cur
-- 
Curt Seeliger, Data Ranger
Raytheon Information Services - Contractor to ORD
seeliger.curt@epa.gov
541/754-4638



Marc Schwartz

2011-Jan-05 21:43 UTC

On Jan 5, 2011, at 3:19 PM, Anthony Staines wrote:
> Dear colleagues,
> 
> This may be a question with a really obvious answer, but I
> can't find it. I have access to a large file with real
> medical record identifiers (mixed strings of characters and
> numbers) in it. These represent medical events for many
> thousands of people. It's important to be able to link
> events for the same people.
> 
> It's much more important that the real record numbers are
> strongly obscured. I'm interested in some kind of strong
> one-way hash function to which I can feed the real numbers
> and get back unique codes for each record  identifier fed
> in. I can do this on the health service system, and I have
> to do this before making further use of the data!
> 
> There is the 'digest' function, in the digest package, but
> this seems to work on the whole vector of IDs, producing, in
> my case, a vector with 60,000 identical entries.
> 
> H.Out$P_ID = digest(H.In$MRNr,serialize=FALSE, algo='md5')
> 
> I could do this in Perl, but I'd have to do quite a bit of
> work to get it installed.
> 
> Any quick suggestions?
> Anthony Staines

Try using sapply():


> L <- replicate(60000, paste(sample(letters, 10, replace = TRUE), collapse = ""))
> str(L)
 chr [1:60000] "dfederergw" "nwphehurvb" "avzmvltrhn" ...
> head(L)
[1] "dfederergw" "nwphehurvb" "avzmvltrhn" "ecmeiasmbk" "kmlcxydygl"
[6] "wpftnyrzwe"


# Use sapply() to run digest() over each element of L
> system.time(L.Digest <- sapply(L, digest))
   user  system elapsed 
  6.920   0.031   7.361 

> str(L.Digest)
 Named chr [1:60000] "6d5861904ee004d251504cb0f731a69a" ...
 - attr(*, "names")= chr [1:60000] "dfederergw" "nwphehurvb" "avzmvltrhn" "ecmeiasmbk" ...

> head(L.Digest)
                        dfederergw                         nwphehurvb 
"6d5861904ee004d251504cb0f731a69a" "bf8ee61f69c83468988cad681a9f7ad0" 
                        avzmvltrhn                         ecmeiasmbk 
"ba1c66af41359cf1a3f5e91f22c6dfe5" "95ca2deaa6c1118852c9ffed71994a7f" 
                        kmlcxydygl                         wpftnyrzwe 
"f3647a7937a2c484123ef33bb52a27ac" "e84f17180703e4805493d88a760be682" 


HTH,

Marc Schwartz
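
A note on the sapply() approach for anyone applying it to real identifiers: sapply() on a character vector attaches the original strings as names of the result, so the hashed vector still carries the real IDs unless the names are dropped. Also, digest() by default hashes the serialized R object rather than the raw string, and an unsalted hash of a small identifier space can be reversed by hashing every candidate. Below is a minimal sketch of one way around these points, assuming the digest package is installed. The sample identifiers, the salt value and the helper name hash_one are made up for illustration and are not from the thread; whether a salted SHA-256 is acceptable is for the local data protection policy to decide.

```r
library(digest)

# Hypothetical stand-in for H.In$MRNr; the real identifiers are not shown
# in the thread.
mrn <- c("A1001", "B2002", "A1001", "C3003")

# Assumed secret value, to be kept on the health service system and never
# released with the data.
salt <- "replace-with-a-long-random-secret"

# Hash one identifier: serialize = FALSE hashes the string itself rather
# than its serialized R representation.
hash_one <- function(id) {
  digest(paste0(salt, id), algo = "sha256", serialize = FALSE)
}

# vapply() applies the hash element by element; USE.NAMES = FALSE stops the
# original identifiers from being attached as names of the result.
p_id <- vapply(mrn, hash_one, character(1), USE.NAMES = FALSE)
```

Equal identifiers give equal codes, so events for the same person can still be linked, and the result has no names attribute.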

Petr Savicky

2011-Jan-05 22:01 UTC

On Wed, Jan 05, 2011 at 09:19:49PM +0000, Anthony Staines wrote:
> Dear colleagues,
> 
> This may be a question with a really obvious answer, but I
> can't find it. I have access to a large file with real
> medical record identifiers (mixed strings of characters and
> numbers) in it. These represent medical events for many
> thousands of people. It's important to be able to link
> events for the same people.
> 
> It's much more important that the real record numbers are
> strongly obscured. I'm interested in some kind of strong
> one-way hash function to which I can feed the real numbers
> and get back unique codes for each record  identifier fed
> in. I can do this on the health service system, and I have
> to do this before making further use of the data!
Producing unique integer codes for character values may be
done using a factor, for example

  s <- c("cd", "bc", "ab", "bc", "ab")
  f <- factor(s)
  as.integer(f) # [1] 3 2 1 2 1
  levels(f) # [1] "ab" "bc" "cd"

If the codes should be ordered by the first occurrence in the
data, then use

  f <- factor(s, levels=unique(s))
  as.integer(f) # [1] 1 2 3 2 3
  levels(f) # [1] "cd" "bc" "ab"

This does not perform any approximate matching. The codes are
assigned based on exact equality. If an approximate matching
is required, then an example of the identifiers would be helpful.
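
Note that the levels of the factor are the original identifiers, so only the integer codes should leave the secure system. A minimal sketch of how this might look on a data frame follows. The column names MRNr and P_ID are taken from the original question; the sample data, the event column and the key table are hypothetical. The codes are randomly permuted here, an added assumption, so that they reflect neither sort order nor file order.

```r
# Hypothetical stand-in for the real data.
H.In <- data.frame(MRNr = c("X9", "A7", "X9", "Q3"),
                   event = 1:4,
                   stringsAsFactors = FALSE)

ids <- unique(H.In$MRNr)

# Lookup table from real identifier to code; this stays on the secure system.
key <- data.frame(MRNr = ids,
                  P_ID = sample(length(ids)),
                  stringsAsFactors = FALSE)

# Output data carries only the code, matched by exact equality.
H.Out <- data.frame(P_ID = key$P_ID[match(H.In$MRNr, key$MRNr)],
                    event = H.In$event)
```

Unlike a hash, this mapping cannot be recomputed from the identifiers alone, so the key table has to be kept if new records are to be linked later.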

Filtering out different types of delimiters may be done as
a preprocessing step, for example, using gsub()

  s <- c("ab cd", "ab  cd", "a b cd")
  gsub(" ", "", s) # [1] "abcd" "abcd" "abcd"

where a general regular expression may also be used.
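
As a sketch of the general regular expression case, the following removes every character that is not a letter or a digit before the codes are assigned. The sample identifiers are made up, and folding to upper case is an extra assumption that only makes sense if case carries no meaning in the real record numbers.

```r
# Hypothetical identifiers that differ only in delimiters and case.
s <- c("AB-12 34", "ab 1234", "CD/5678", "ab1234")

# Drop everything except letters and digits, then unify case.
clean <- toupper(gsub("[^[:alnum:]]", "", s))
clean # [1] "AB1234" "AB1234" "CD5678" "AB1234"

# Integer codes numbered by first occurrence, as above.
code <- as.integer(factor(clean, levels = unique(clean)))
code # [1] 1 1 2 1
```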

Petr Savicky.
