Anders Alexandersson
2016-Jan-28 15:18 UTC
[R] How to use compare.linkage in RecordLinkage package -- unexpected output
I am using the compare.linkage function in the RecordLinkage package and getting a result I know is wrong, so I must be misunderstanding something. I am using R 3.2.3 for x64 Windows. I am very familiar with Stata but not so much with R.

I can create record pairs from the blocking fields, but all pairs have unknown status (NA). I cannot create matches or non-matches. I want a simple working example of how to link datasets using the RecordLinkage package. It seems that the manual and the R Journal Vol. 2/2 article only show how to de-duplicate a single dataset using the compare.dedup function, not how to link two datasets using the compare.linkage function. I can reproduce the examples in the R Journal article, so my R installation is fine.

The example datasets in the manual have 500 and 10000 observations on 7 variables, but 1 observation and 2 variables are enough to show the problem. My first comparison pattern looks like this:

  id1  id2 fname_c1 bm is_match
1  17  343        1  1       NA

Instead, I want and expect a comparison pattern that looks like this:

  id1  id2 fname_c1 bm is_match
1  17  343        1  1        1

My blocking variable is fname_c1, the first component of the first name. My matching variable is bm, the birth month. My understanding is that row 1 in my example output is the first pair of records where fname_c1 matched in the underlying datasets. I want and expect is_match to be 1 when the matching variable agrees between the two linkage datasets (bm=1 in the comparison pattern), as in the example.

For more details, this is what I typed and the R output:

> library(RecordLinkage)
> data(RLdata500)
> data(RLdata10000)
> RLdata500[17, ]
    fname_c1 fname_c2 lname_c1 lname_c2   by bm bd
17 ALEXANDER     <NA>  MUELLER     <NA> 1974  9  9
> RLdata10000[343, ]
     fname_c1 fname_c2 lname_c1 lname_c2   by bm bd
343 ALEXANDER     <NA>  BAUMANN     <NA> 1957  9  7
> rpairs <- compare.linkage(RLdata500,RLdata10000,blockfld=c(1),exclude=c(2:5,7))
> rpairs$pairs[c(1:2), ] # Why is_match=NA? (should be 1)
  id1  id2 fname_c1 bm is_match
1  17  343        1  1       NA
2  17 2385        1  0       NA
> rpairs <- epiWeights(rpairs) # (Weight calculation)
> summary(rpairs) # (0 matches in Linkage Dataset)

Linkage Data Set

500 records in data set 1
10000 records in data set 2
47890 record pairs

0 matches
0 non-matches
47890 pairs with unknown status

Weight distribution:
[omitted here to save space]

References:
1. Manual for Package 'RecordLinkage' (available online at cran.r-project.org/web/packages/RecordLinkage/RecordLinkage.pdf)
2. R Journal article "The RecordLinkage Package: Detecting Errors in Data" (available online in PDF at journal.r-project.org/archive/2010-2/RJournal_2010-2_Sariyar+Borg.pdf)

I saw something in the manual and the R Journal article about an identity argument for true match results, but I guess I only need that for reference ("gold standard") datasets. The value of bm is non-missing in both underlying records for my example (both have bm=9, hence bm=1 in the comparison pattern), so that is not why the result is NA.

What am I missing? How does one link two simple datasets using compare.linkage?
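
For anyone comparing notes, here is a minimal sketch of the two routes as I understand them from the manual: supplying the identity1/identity2 arguments so that is_match is filled in, and classifying pairs by weight with epiClassify when the true status is unknown. The split of RLdata500 into odd and even rows (so that identity.RLdata500 applies to both halves) and the threshold of 0.5 are my own illustrative assumptions, not taken from the manual or the article.

```r
library(RecordLinkage)
data(RLdata500)  # also loads identity.RLdata500 (true identity of each record)

# Assumption for illustration: treat odd and even rows as two separate
# data sets, so the known identities can be passed for both sides.
idx1 <- seq(1, 500, by = 2)
idx2 <- seq(2, 500, by = 2)
d1 <- RLdata500[idx1, ]
d2 <- RLdata500[idx2, ]

# Same blocking/exclusion as in my call above, plus the identity vectors.
# A pair is flagged as a true match when identity1[i] == identity2[j].
rpairs <- compare.linkage(d1, d2, blockfld = c(1), exclude = c(2:5, 7),
                          identity1 = identity.RLdata500[idx1],
                          identity2 = identity.RLdata500[idx2])
table(rpairs$pairs$is_match, useNA = "ifany")  # no longer all NA

# Without gold-standard identities, is_match stays NA; links are then
# predicted from the weights instead. The threshold is arbitrary here.
rpairs <- epiWeights(rpairs)
result <- epiClassify(rpairs, threshold.upper = 0.5)
summary(result)
```

If this reading is right, is_match only records known true status, and the predicted links would be found in the classification result rather than in the is_match column.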