I'm wondering about the behavior of the merge function when using factors as by variables. I know that when you combine two factors using c() the results can be odd, as in:

c(factor(1:5),factor(6:10))

which prints:

[1] 1 2 3 4 5 1 2 3 4 5

I presume this is because factors are actually stored as integers, with 6,7,8,9,10 stored internally as 1,2,3,4,5.

This concerns me somewhat, as I often merge data frames using factors as the by variables. From what I can tell, the merge function creates matches based on factor labels (i.e. the result of as.character(factor_var)) and not the internally stored integers, but I'm wondering if there are particular lurking problems that I should be aware of? I'm especially curious as to how R recalculates the levels of the by variables in outer joins where not every observation is matched, as in:

df1<-data.frame(a=factor(c("a","b")),b=1:2)
df2<-data.frame(a=factor(c("b","c")),c=2:3)
df3<-merge(df1,df2,by="a",all=T)

Many thanks!
H Roark wrote:
> I'm wondering about the behavior of the merge function when using factors as by variables. I know that when you combine two factors using c() the results can be odd, as in:
>
> c(factor(1:5),factor(6:10))
>
> which prints: [1] 1 2 3 4 5 1 2 3 4 5
>
> I presume this is because factors are actually stored as integers, with 6,7,8,9,10 stored internally as 1,2,3,4,5.
>
> This concerns me somewhat, as I often merge data frames using factors as the by variables. From what I can tell, the merge function creates matches based on factor labels (i.e. the result of as.character(factor_var)) and not the internally stored integers, but I'm wondering if there are particular lurking problems that I should be aware of? I'm especially curious as to how R recalculates the levels of the by variables in outer joins where not every observation is matched, as in:
>
> df1<-data.frame(a=factor(c("a","b")),b=1:2)
> df2<-data.frame(a=factor(c("b","c")),c=2:3)
> df3<-merge(df1,df2,by="a",all=T)

As far as I know, there is no reason to be concerned when using merge as you do. The magic that merge performs on the factor levels is actually done by rbind, so you should read ?rbind, particularly the section "Data frame methods". You can also study the code of rbind.data.frame (in the base package) to see what it's actually doing.

--Erik
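For what it's worth, the example from the question can be checked directly. The snippet below is only a small sketch built from the df1/df2 data frames in the original post (written with all = TRUE rather than all = T); it inspects the merged result rather than describing how merge is implemented internally.

```r
# The two data frames from the question; "b" is the only shared key.
df1 <- data.frame(a = factor(c("a", "b")), b = 1:2)
df2 <- data.frame(a = factor(c("b", "c")), c = 2:3)

# Internal integer codes differ between the two factors:
# "b" is code 2 in df1$a but code 1 in df2$a.
as.integer(df1$a)
as.integer(df2$a)

# Outer join on the factor column.
df3 <- merge(df1, df2, by = "a", all = TRUE)
print(df3)

# The by column is still a factor, and its levels cover both inputs.
is.factor(df3$a)
levels(df3$a)  # "a" "b" "c"

# The row for "b" pairs b = 2 (from df1) with c = 2 (from df2),
# i.e. rows were matched on the label "b", not on the integer codes.
df3[df3$a == "b", ]
```

If you are worried about a particular merge, comparing levels() and as.character() of the by column before and after, as above, is a quick sanity check.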