Colleagues, I am trying to de-duplicate a large (long) database (approx 1mil records) of diagnostic tests. Individuals in the database can have up-to 25 observations, but most will have only one. IDs for de-duplication (names, sex, lab number...) are patchy. In a first step, I am using Andreas Borg's excellent record linkage package (), that leaves me with a list of 'pairs' looking very much like this: id1<-c(4,17,9,1,1,1,3,3,6,15,1,1,1,1,3,3,3,3,4,4,4,5,5,12,9,9,10,10) id2<-c(8,18,10,3,6,7,6,7,7,16,4,5,12,18,4,5,12,18,5,12,18,12,18,18,15,16,15,16) id<-data.frame(cbind(id1,id2)) where a pair means that the records belong to the same individual (e.g., record 4 and record 8; 17 and 18...). My problem now is to get a list with all records that belong to the same person (in the example, obervations 1,3,4,5,6,7,8,12, 17 and 18 are all from the same person). The problem is to find the link between 1 and 8 (only through 1 and 4 and 4 and 8) and the link between 1 and 17 (through 18). I can do it in my head, but I am missing the code that would work its way through too many records. Any clever ideas? (using R 2.10.1 on Windows XP) Thanks, Christian -- View this message in context: http://r.789695.n4.nabble.com/deduplication-tp2241637p2241637.html Sent from the R help mailing list archive at Nabble.com.
Maybe something like the following will get you started: library("igraph") g <- graph.data.frame(id, directed=FALSE) neighborhood(g, +Inf) There is perhaps a more efficient way, but I hope this helps a little. Allan. On 03/06/10 14:14, Epi-schnier wrote:> Colleagues, > > I am trying to de-duplicate a large (long) database (approx 1mil records) of > diagnostic tests. Individuals in the database can have up-to 25 > observations, but most will have only one. IDs for de-duplication (names, > sex, lab number...) are patchy. In a first step, I am using Andreas Borg's > excellent record linkage package (), that leaves me with a list of 'pairs' > looking very much like this: > id1<-c(4,17,9,1,1,1,3,3,6,15,1,1,1,1,3,3,3,3,4,4,4,5,5,12,9,9,10,10) > id2<-c(8,18,10,3,6,7,6,7,7,16,4,5,12,18,4,5,12,18,5,12,18,12,18,18,15,16,15,16) > id<-data.frame(cbind(id1,id2)) > where a pair means that the records belong to the same individual (e.g., > record 4 and record 8; 17 and 18...). My problem now is to get a list with > all records that belong to the same person (in the example, obervations > 1,3,4,5,6,7,8,12, 17 and 18 are all from the same person). The problem is to > find the link between 1 and 8 (only through 1 and 4 and 4 and 8) and the > link between 1 and 17 (through 18). I can do it in my head, but I am missing > the code that would work its way through too many records. > > Any clever ideas? > (using R 2.10.1 on Windows XP) > > Thanks, > > Christian > > >
Please try this ## Import data id1<-c(4,17,9,1,1,1,3,3,6,15,1,1,1,1,3,3,3,3,4,4,4,5,5,12,9,9,10,10) id2<-c(8,18,10,3,6,7,6,7,7,16,4,5,12,18,4,5,12,18,5,12,18,12,18,18,15,16,15,16) id<-data.frame(id1 = id1, id2 = id2) ## Create same structure table id <- id0 <- unique(id) leng <- nrow(id) n <- 0 repeat { if (n == leng) {break} n <- 0 id <- id[order(-id$id1, -id$id2),] for (i in 1:leng) { if (id$id1[i] == id$id2[i]) { n <- n+1 next } smal <- min(id[i,]) larg <- max(id[i,]) id$id2[which(id$id2 == larg)] <- smal id$id1[which(id$id1 == larg)] <- smal }} ## Create results tab <- table(as.matrix(id0), as.matrix(id[order(as.numeric(rownames(id))),])) res <- list() for (i in 1:ncol(tab)) { res[[i]] <- rownames(tab[(tab[,i] != 0),])} res ----- A R learner. -- View this message in context: http://r.789695.n4.nabble.com/deduplication-tp2241637p2242921.html Sent from the R help mailing list archive at Nabble.com.