Anders Alexandersson
2016-Jan-28 20:01 UTC
[R] How to use compare.linkage in RecordLinkage package? -- more details but problem remains
How does one link two datasets using the compare.linkage function in the RecordLinkage package? This is a follow-up to my original posting earlier today: stat.ethz.ch/pipermail/r-help/2016-January/435736.html

I suggested then that I should perhaps have added the identity argument. But if I add the identity arguments (identity1 and identity2), I unexpectedly get 5 matches, 47885 non-matches and 0 pairs with unknown status. For example, I get a match for pair 4256, which is unexpected because the matching variable bm does not agree: it is 0 in the result pair (bm is 1 for BERND JUNG and 4 for BERND MUELLER). Also, is_match for pair 1 changes from unknown (NA) to non-match (0), which is unexpected since the matching variable bm agrees (bm=1 in the result pair).

Here are the major new R commands that I ran and the output:

> rpairs <- compare.linkage(RLdata500, RLdata10000, blockfld=c(1), identity1=identity.RLdata500, identity2=identity.RLdata10000, exclude=c(2:5,7))
> subset(rpairs$pairs, is_match=="1") # Why these 5 matches?
      id1  id2 fname_c1 bm is_match
4256   59 1394        1  0        1
5811  174 3684        1  0        1
14699 139 4199        1  0        1
16453  92 4580        1  0        1
21840  73  737        1  0        1
> RLdata500[c(17, 59), ] # first obs, and first matching obs
    fname_c1 fname_c2 lname_c1 lname_c2   by bm bd
17 ALEXANDER     <NA>  MUELLER     <NA> 1974  9  9
59     BERND     <NA>     JUNG    KLEIN 1935  1 14
> RLdata10000[c(343, 1394), ] # first obs, and first matching obs
      fname_c1 fname_c2 lname_c1 lname_c2   by bm bd
343  ALEXANDER     <NA>  BAUMANN     <NA> 1957  9  7
1394     BERND     <NA>  MUELLER     <NA> 1942  4  4
> rpairs$pairs[1:2, ] # list first 2 obs
  id1  id2 fname_c1 bm is_match
1  17  343        1  1        0
2  17 2385        1  0        0

What am I missing? How does one probabilistically link two datasets using the compare.linkage function in the RecordLinkage package?

Anders Alexandersson
andersalex at gmail.com
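One possible reading of the output above is that is_match is not computed from the agreement pattern (fname_c1, bm) at all. As far as I understand the package documentation, it is the known true match status, set by comparing identity1 for the first record with identity2 for the second. The base R sketch below only mimics that assumed rule on made-up toy data. The identity values and the pairs data frame are hypothetical and are not taken from RLdata500 or RLdata10000.

```r
# Toy stand-ins for the identity vectors (values are invented for illustration).
identity1 <- c(101, 102, 103)   # true-entity IDs for records in dataset 1
identity2 <- c(201, 102, 305)   # true-entity IDs for records in dataset 2

# Toy candidate pairs with an agreement pattern on a single field, bm
# (1 = the field agrees between the two records, 0 = it disagrees).
pairs <- data.frame(id1 = c(1, 2, 3),
                    id2 = c(1, 2, 3),
                    bm  = c(1, 0, 1))

# Assumed rule: a pair is a true match if and only if the two identity
# values are equal, regardless of whether the compared fields agree.
pairs$is_match <- as.integer(identity1[pairs$id1] == identity2[pairs$id2])

print(pairs)
```

In this toy data the second pair is flagged as a match although bm disagrees, and the first pair is a non-match although bm agrees, which is the same shape as pairs 4256 and 1 above. If that reading is right, comparing identity.RLdata500[59] with identity.RLdata10000[1394] should show equal values, and if the two identity vectors come from unrelated numbering schemes, such equalities would be coincidental rather than real matches. The probabilistic classification itself would then be a separate step on the rpairs object, for instance with the package's emWeights and emClassify functions, which are not shown here.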