Steven Lubitz
2009-Mar-07 05:02 UTC
[R] merge data frames with same column names of different lengths and missing values
Hello, I'm switching over from SAS to R and am having trouble merging data frames. The data frames have several columns with the same name, and each has a different number of rows. Some of the values are missing from cells with the same column names in each data frame. I had hoped that when I merged the dataframes, every column with the same name would be merged, with the value in a complete cell overwriting the value in an empty cell from the other data frame. I cannot seem to achieve this result, though I've tried several merge adaptations: x <- data.frame(item1=c(NA,NA,3,4,5), item2=c(1,NA,NA,4,5), id=1:5) y <- data.frame(item1=c(NA,2,NA,4,5,6), item2=c(NA,NA,3,4,5,NA), id=1:6) merge(x,y,by="id") #I lose observations here (n=1 in this example), and my items are duplicated - I do not want this result id item1.x item2.x item1.y item2.y 1 1 NA 1 NA NA 2 2 NA NA 2 NA 3 3 3 NA NA 3 4 4 4 4 4 4 5 5 5 5 5 5 merge(x,y,by=c("id","item1","item2")) #again I lose observations (n=4 here) and do not want this result id item1 item2 1 4 4 4 2 5 5 5 merge(x,y,by=c("id","item1","item2"),all.x=T,all.y=T) #my rows are duplicated and the NA values are retained - I instead want one row per ID id item1 item2 1 1 NA 1 2 1 NA NA 3 2 2 NA 4 2 NA NA 5 3 3 NA 6 3 NA 3 7 4 4 4 8 5 5 5 9 6 6 NA In reality I have multiple data frames with numerous columns, all with this problem. I can do the merge seamlessly in SAS, but am trying to learn and stick with R for my analyses. Any help would be greatly appreciated. Steve Lubitz Cardiovascular Research Fellow, Brigham and Women's Hospital and Massachusetts General Hospital
Dieter Menne
2009-Mar-07 08:56 UTC
[R] merge data frames with same column names of different lengths and missing values
Steven Lubitz <slubitz1 <at> yahoo.com> writes:> > x <- data.frame(item1=c(NA,NA,3,4,5), item2=c(1,NA,NA,4,5), id=1:5) > y <- data.frame(item1=c(NA,2,NA,4,5,6), item2=c(NA,NA,3,4,5,NA), id=1:6) >....> merge(x,y,by=c("id","item1","item2"),all.x=T,all.y=T) #my rows are duplicatedand the NA values are> retained - I instead want one row per ID > id item1 item2 > 1 1 NA 1 > 2 1 NA NA > 3 2 2 NA > 4 2 NA NA > 5 3 3 NA > 6 3 NA 3 > 7 4 4 4 > 8 5 5 5 > 9 6 6 NA >I think you only got the wrong (too complex) function. Try rbind(x,y) Dieter
Domenico Vistocco
2009-Mar-07 10:24 UTC
[R] merge data frames with same column names of different lengths and missing values
Steven Lubitz wrote:> Hello, I'm switching over from SAS to R and am having trouble merging data frames. The data frames have several columns with the same name, and each has a different number of rows. Some of the values are missing from cells with the same column names in each data frame. I had hoped that when I merged the dataframes, every column with the same name would be merged, with the value in a complete cell overwriting the value in an empty cell from the other data frame. I cannot seem to achieve this result, though I've tried several merge adaptations: > > x <- data.frame(item1=c(NA,NA,3,4,5), item2=c(1,NA,NA,4,5), id=1:5) > y <- data.frame(item1=c(NA,2,NA,4,5,6), item2=c(NA,NA,3,4,5,NA), id=1:6) > > > merge(x,y,by="id") #I lose observations here (n=1 in this example), and my items are duplicated - I do not want this result > id item1.x item2.x item1.y item2.y > 1 1 NA 1 NA NA > 2 2 NA NA 2 NA > 3 3 3 NA NA 3 > 4 4 4 4 4 4 > 5 5 5 5 5 5 > > > merge(x,y,by=c("id","item1","item2")) #again I lose observations (n=4 here) and do not want this result > id item1 item2 > 1 4 4 4 > 2 5 5 5 > > > merge(x,y,by=c("id","item1","item2"),all.x=T,all.y=T) #my rows are duplicated and the NA values are retained - I instead want one row per ID > id item1 item2 > 1 1 NA 1 > 2 1 NA NA > 3 2 2 NA > 4 2 NA NA > 5 3 3 NA > 6 3 NA 3 > 7 4 4 4 > 8 5 5 5 > 9 6 6 NA >You should obtain the desired solution using: merge(y, x, by=c("id","item1","item2"), all=TRUE) In database terminology all=TRUE corresponds to the full outer join, all.x to the left outer join and all.y to the right outer join. Ciao, domenico> In reality I have multiple data frames with numerous columns, all with this problem. I can do the merge seamlessly in SAS, but am trying to learn and stick with R for my analyses. Any help would be greatly appreciated. > > Steve Lubitz > Cardiovascular Research Fellow, Brigham and Women's Hospital and Massachusetts General Hospital > > ______________________________________________ > R-help at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. > >
Jun Shen
2009-Mar-07 21:52 UTC
[R] merge data frames with same column names of different lengths and missing values
Steve, I don't know if R has such a function to perform the task you were asking. I wrote one myself. Try the following to see if it works for you. The new function "merge.new" has one additional argument col.ID, which is the column number of ID column. To use your x, y as examples, type: merge.new(x,y,all=TRUE,col.ID=3) ################################################# merge.new<-function(...,col.ID){ inter<-merge(...) inter<-inter[order(inter[col.ID]),] #merged data sorted by ID #total columns and rows for the target dataframe total.row<-length(unique(inter[[col.ID]])) total.col<-dim(inter)[2] row.ID<-unique(inter[[col.ID]]) target<-matrix(NA,total.row,total.col) target<-as.data.frame(target) names(target)<-names(inter) for (i in 1:total.row){ inter.part<-inter[inter[col.ID]==row.ID[i],] #select all rows with the same ID for (j in 1:total.col){ if (is.na(inter.part[1,j])){ if(is.na(inter.part[2,j])) {target[i,j]=NA} else {target[i,j]=inter.part[2,j]} } else {target[i,j]=inter.part[1,j]} } } print(paste("total rows=",total.row)) print(paste("total columns=",total.col)) return(target) } ################################################# -- Jun Shen PhD PK/PD Scientist BioPharma Services Millipore Corporation 15 Research Park Dr. St Charles, MO 63304 Direct: 636-720-1589 On Fri, Mar 6, 2009 at 11:02 PM, Steven Lubitz <slubitz1@yahoo.com> wrote:> > Hello, I'm switching over from SAS to R and am having trouble merging data > frames. The data frames have several columns with the same name, and each > has a different number of rows. Some of the values are missing from cells > with the same column names in each data frame. I had hoped that when I > merged the dataframes, every column with the same name would be merged, with > the value in a complete cell overwriting the value in an empty cell from the > other data frame. I cannot seem to achieve this result, though I've tried > several merge adaptations: > > x <- data.frame(item1=c(NA,NA,3,4,5), item2=c(1,NA,NA,4,5), id=1:5) > y <- data.frame(item1=c(NA,2,NA,4,5,6), item2=c(NA,NA,3,4,5,NA), id=1:6) > > > merge(x,y,by="id") #I lose observations here (n=1 in this example), and my > items are duplicated - I do not want this result > id item1.x item2.x item1.y item2.y > 1 1 NA 1 NA NA > 2 2 NA NA 2 NA > 3 3 3 NA NA 3 > 4 4 4 4 4 4 > 5 5 5 5 5 5 > > > merge(x,y,by=c("id","item1","item2")) #again I lose observations (n=4 here) > and do not want this result > id item1 item2 > 1 4 4 4 > 2 5 5 5 > > > merge(x,y,by=c("id","item1","item2"),all.x=T,all.y=T) #my rows are > duplicated and the NA values are retained - I instead want one row per ID > id item1 item2 > 1 1 NA 1 > 2 1 NA NA > 3 2 2 NA > 4 2 NA NA > 5 3 3 NA > 6 3 NA 3 > 7 4 4 4 > 8 5 5 5 > 9 6 6 NA > > In reality I have multiple data frames with numerous columns, all with this > problem. I can do the merge seamlessly in SAS, but am trying to learn and > stick with R for my analyses. Any help would be greatly appreciated. > > Steve Lubitz > Cardiovascular Research Fellow, Brigham and Women's Hospital and > Massachusetts General Hospital > > ______________________________________________ > R-help@r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide > http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. >[[alternative HTML version deleted]]
Steven Lubitz
2009-Mar-07 22:05 UTC
[R] merge data frames with same column names of different lengths and missing values
Subject: Re: [R] merge data frames with same column names of different lengths and missing values To: "Phil Spector" <spector@stat.berkeley.edu> Date: Saturday, March 7, 2009, 5:01 PM Phil, Thank you - this is very helpful. However I realized that with my real data sets (not the example I have here), I also have different numbers of columns in each data frame. rbind doesn't seem to like this. Here's a modified example: x <- data.frame(item1=c(NA,NA,3,4,5), item2=c(1,NA,NA,4,5), item3=c(NA,2,NA,4,NA), id=1:5) y <- data.frame(item1=c(NA,2,NA,4,5,6), item2=c(NA,NA,3,4,5,NA), id=1:6) rbind(x,y) Error in rbind(deparse.level, ...) : numbers of columns of arguments do not match Any ideas? Thanks, Steve --- On Sat, 3/7/09, Phil Spector <spector@stat.berkeley.edu> wrote: From: Phil Spector <spector@stat.berkeley.edu> Subject: Re: [R] merge data frames with same column names of different lengths and missing values To: "Steven Lubitz" <slubitz1@yahoo.com> Date: Saturday, March 7, 2009, 1:56 AM Steven - I believe this gives the output that you desire:> xy = rbind(x,y) >aggregate(subset(xy,select=-id),xy['id'],function(x)rev(x[!is.na(x)])[1]) id item1 item2 1 1 NA 1 2 2 2 NA 3 3 3 3 4 4 4 4 5 5 5 5 6 6 6 NA But I think what merge x y; by id; would give you is> aggregate(subset(xy,select=-id),xy['id'],function(x)x[length(x)])id item1 item2 1 1 NA NA 2 2 2 NA 3 3 NA 3 4 4 4 4 5 5 5 5 6 6 6 NA - Phil Spector Statistical Computing Facility Department of Statistics UC Berkeley spector@stat.berkeley.edu On Fri, 6 Mar 2009, Steven Lubitz wrote:> > Hello, I'm switching over from SAS to R and am having trouble mergingdata frames. The data frames have several columns with the same name, and each has a different number of rows. Some of the values are missing from cells with the same column names in each data frame. I had hoped that when I merged the dataframes, every column with the same name would be merged, with the value in a complete cell overwriting the value in an empty cell from the other data frame. I cannot seem to achieve this result, though I've tried several merge adaptations:> > x <- data.frame(item1=c(NA,NA,3,4,5), item2=c(1,NA,NA,4,5), id=1:5) > y <- data.frame(item1=c(NA,2,NA,4,5,6), item2=c(NA,NA,3,4,5,NA),id=1:6)> > > merge(x,y,by="id") #I lose observations here (n=1 in thisexample), and my items are duplicated - I do not want this result> id item1.x item2.x item1.y item2.y > 1 1 NA 1 NA NA > 2 2 NA NA 2 NA > 3 3 3 NA NA 3 > 4 4 4 4 4 4 > 5 5 5 5 5 5 > > > merge(x,y,by=c("id","item1","item2")) #againI lose observations (n=4 here) and do not want this result> id item1 item2 > 1 4 4 4 > 2 5 5 5 > > >merge(x,y,by=c("id","item1","item2"),all.x=T,all.y=T) #my rows are duplicated and the NA values are retained - I instead want one row per ID> id item1 item2 > 1 1 NA 1 > 2 1 NA NA > 3 2 2 NA > 4 2 NA NA > 5 3 3 NA > 6 3 NA 3 > 7 4 4 4 > 8 5 5 5 > 9 6 6 NA > > In reality Ihave multiple data frames with numerous columns, all with this problem. I can do the merge seamlessly in SAS, but am trying to learn and stick with R for my analyses. Any help would be greatly appreciated.> > Steve Lubitz > Cardiovascular Research Fellow, Brigham and Women's Hospital andMassachusetts General Hospital> > ______________________________________________ > R-help@r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guidehttp://www.R-project.org/posting-guide.html> and provide commented, minimal, self-contained, reproducible code. >[[alternative HTML version deleted]]