Denis Chabot
2006-Nov-08 02:30 UTC
[R] combining dataframes with different numbers of columns
Dear list members, I have to combine dataframes together. However they contain different numbers of variables. It is possible that all the variables in the dataframe with fewer variables are contained in the dataframe with more variables, though it is not always the case. There are key variables identifying observations. These could be used in a merge statement, although this won't quite work for me (see below). I was hoping to find a way to combine dataframes where I needed only to ensure the key variables were present. The total number of variables in the final dataframe would be the total number of different variables in both initial dataframes. Variables that were absent in one dataframe would automatically get missing values in the joint dataframe. Here is a simple example. The initial dataframes are a and b. All variables in b are also in a. a <- data.frame(X=seq(1,10), matrix(runif(100, 0,15), ncol=10)) b <- data.frame(X=seq(16,20), X4=runif(5,0,15)) A merge does not work because the common variable X4 becomes 2 variables, X4.x and X4.y. c <- merge(a,b,by="X", all=T) This can be fixed but it requires several steps (although my solution is probably not optimal): names(c)[5] <- "X4" c$X4[is.na(c$X4)] <- c$X4.y[is.na(c$X4)] c <- c[,1:11] One quickly becomes tired with this solution with my real-life dataframes where different columns would require "repair" from one case to the next. I think I still prefer making the narrower dataframe like the wider one: b2 <- upData(b, X1=NA, X2=NA, X3=NA, X5=NA, X6=NA, X7=NA, X8=NA, X9=NA, X10=NA) b2 <- b2[,c(1, 3:5, 2, 6:11)] d <- rbind(a, b2) But again this requires quite a bit of fine-tuning from one case to the next in my real-life dataframes. I suspect R has a neat way to do this and I just cannot come up with the proper search term to find help on my own. Or this can be automated: can one compare variable lists from 2 dataframes and add missing variables in the "narrower" dataframe? Ideally, the solution would be able to handle the situation where the narrower dataframe contains one or more variables that are absent from the wider one. If this was the case, I'd like the new variable to be present in the combined dataframe, with missing values given to the observations from the wider dataframe. Thanks in advance, Denis Chabot
hadley wickham
2006-Nov-08 02:46 UTC
[R] combining dataframes with different numbers of columns
On 11/7/06, Denis Chabot <chabotd at globetrotter.net> wrote:> Dear list members, > > I have to combine dataframes together. However they contain different > numbers of variables. It is possible that all the variables in the > dataframe with fewer variables are contained in the dataframe with > more variables, though it is not always the case. > > There are key variables identifying observations. These could be used > in a merge statement, although this won't quite work for me (see below). > > I was hoping to find a way to combine dataframes where I needed only > to ensure the key variables were present. The total number of > variables in the final dataframe would be the total number of > different variables in both initial dataframes. Variables that were > absent in one dataframe would automatically get missing values in the > joint dataframe.Have a look at rbind.fill in the reshape package. library(reshape) rbind.fill(data.frame(a=1), data.frame(b=2)) rbind.fill(data.frame(a=1), data.frame(a=2, b=2)) Hadley
Stephen D. Weigand
2006-Nov-08 02:51 UTC
[R] combining dataframes with different numbers of columns
Denis, On Nov 7, 2006, at 8:30 PM, Denis Chabot wrote:> Dear list members, > > I have to combine dataframes together. However they contain different > numbers of variables. It is possible that all the variables in the > dataframe with fewer variables are contained in the dataframe with > more variables, though it is not always the case. > > There are key variables identifying observations. These could be used > in a merge statement, although this won't quite work for me (see > below). > > I was hoping to find a way to combine dataframes where I needed only > to ensure the key variables were present. The total number of > variables in the final dataframe would be the total number of > different variables in both initial dataframes. Variables that were > absent in one dataframe would automatically get missing values in the > joint dataframe. > > Here is a simple example. The initial dataframes are a and b. All > variables in b are also in a. > > a <- data.frame(X=seq(1,10), matrix(runif(100, 0,15), ncol=10)) > b <- data.frame(X=seq(16,20), X4=runif(5,0,15)) > > A merge does not work because the common variable X4 becomes 2 > variables, X4.x and X4.y. > > c <- merge(a,b,by="X", all=T)[snipped]> Thanks in advance, > > Denis Chabot >Will merge(a, b, by = intersect(names(a), names(b)), all = TRUE) do what you want? (Note the 'by' argument uses the default so it can be left out.) Hope this helps, Stephen Stephen Weigand Rochester, Minnesota, USA