Rock Ouimet
2009-May-08 21:28 UTC
[R] Merging two data frames with 3 common variables makes duplicated rows
I am new to R (ex SAS user) , and I cannot merge two data frames without getting duplicated rows in the results. How to avoid this happening without using the unique() function? 1. First data frame is called "tmv" with 6 variables and 239 rows:> tmv[1:10,]temps nom prenom sexe dist style 1 01:59:36 Cyr Steve H 45 free 2 02:09:55 Gosselin Erick H 45 free 3 02:12:18 Desfosses Sacha H 45 free 4 02:12:23 Lapointe Sebastien H 45 free 5 02:12:52 Labrie Michel H 45 free 6 02:12:54 Leblanc Michel H 45 free 7 02:13:02 Thibeault Sylvain H 45 free 8 02:13:49 Martel Stephane H 45 free 9 02:14:03 Lavoie Jean-Philippe H 45 free 10 02:14:05 Boivin Jean-Claude H 45 free Its structure is:> str(tmv)'data.frame': 239 obs. of 6 variables: $ temps :Class 'times' atomic [1:239] 0.0831 0.0902 0.0919 0.0919 0.0923 ... .. ..- attr(*, "format")= chr "h:m:s" $ nom : Factor w/ 167 levels "Aubut","Audy",..: 45 84 55 105 98 110 158 117 109 22 ... $ prenom: Factor w/ 135 levels "Alain","Alexandre",..: 128 33 121 122 93 93 130 126 63 59 ... $ sexe : Factor w/ 2 levels "F","H": 2 2 2 2 2 2 2 2 2 2 ... $ dist : int 45 45 45 45 45 45 45 45 45 45 ... $ style : Factor w/ 2 levels "clas","free": 2 2 2 2 2 2 2 2 2 2 ... 2. The second data frame is called "meil2" with 4 variables and 16 rows;> meil2[1:10,]dist sexe style meil 1 38 F clas 02:43:17 2 38 F free 02:24:46 3 38 H clas 02:37:36 4 38 H free 01:59:35 5 45 F clas 03:46:15 6 45 F free 02:20:15 7 45 H clas 02:30:07 8 45 H free 01:59:36 9 38 F clas 02:43:17 10 38 F free 02:24:46 Its structure is:> str(tmv)'data.frame': 239 obs. of 6 variables: $ temps :Class 'times' atomic [1:239] 0.0831 0.0902 0.0919 0.0919 0.0923 ... .. ..- attr(*, "format")= chr "h:m:s" $ nom : Factor w/ 167 levels "Aubut","Audy",..: 45 84 55 105 98 110 158 117 109 22 ... $ prenom: Factor w/ 135 levels "Alain","Alexandre",..: 128 33 121 122 93 93 130 126 63 59 ... $ sexe : Factor w/ 2 levels "F","H": 2 2 2 2 2 2 2 2 2 2 ... $ dist : int 45 45 45 45 45 45 45 45 45 45 ... $ style : Factor w/ 2 levels "clas","free": 2 2 2 2 2 2 2 2 2 2 ... Note that the two data frames have "sexe", "dist", and "style" as common variables, and of the same class (Factor) and number of levels. When merging the two data frames into "tmv3", the merging is fine, but all the rows get duplicated:> tmv3 <- merge(tmv, meil2, sort=TRUE, by=c("sexe", "dist", "style"))> tmv3[1:10,]sexe dist style temps nom prenom meil 1 F 38 clas 02:49:15 Boucher Marie-Amelie 02:43:17 2 F 38 clas 02:49:15 Boucher Marie-Amelie 02:43:17 3 F 38 clas 03:24:05 Vachon Guylaine 02:43:17 4 F 38 clas 03:24:05 Vachon Guylaine 02:43:17 5 F 38 clas 03:13:11 Villeneuve Rejean 02:43:17 6 F 38 clas 03:13:11 Villeneuve Rejean 02:43:17 7 F 38 clas 03:37:54 Stevens Julie 02:43:17 8 F 38 clas 03:37:54 Stevens Julie 02:43:17 9 F 38 clas 03:53:03 Cote Marthe 02:43:17 10 F 38 clas 03:53:03 Cote Marthe 02:43:17 Can anyone explain this behavior from R ? $version.string [1] "R version 2.8.1 (2008-12-22)" Rock [[alternative HTML version deleted]]
Thomas Lumley
2009-May-08 22:06 UTC
[R] Merging two data frames with 3 common variables makes duplicated rows
On Fri, 8 May 2009, Rock Ouimet wrote:> I am new to R (ex SAS user) , and I cannot merge two data frames without > getting duplicated rows in the results. How to avoid this happening without > using the unique() function? > > 1. First data frame is called "tmv" with 6 variables and 239 rows: > >> tmv[1:10,] > temps nom prenom sexe dist style > 1 01:59:36 Cyr Steve H 45 free > 2 02:09:55 Gosselin Erick H 45 free > 3 02:12:18 Desfosses Sacha H 45 free > 4 02:12:23 Lapointe Sebastien H 45 free > 5 02:12:52 Labrie Michel H 45 free > 6 02:12:54 Leblanc Michel H 45 free > 7 02:13:02 Thibeault Sylvain H 45 free > 8 02:13:49 Martel Stephane H 45 free > 9 02:14:03 Lavoie Jean-Philippe H 45 free > 10 02:14:05 Boivin Jean-Claude H 45 free > > Its structure is: >> str(tmv) > 'data.frame': 239 obs. of 6 variables: > $ temps :Class 'times' atomic [1:239] 0.0831 0.0902 0.0919 0.0919 0.0923 > ... > .. ..- attr(*, "format")= chr "h:m:s" > $ nom : Factor w/ 167 levels "Aubut","Audy",..: 45 84 55 105 98 110 158 > 117 109 22 ... > $ prenom: Factor w/ 135 levels "Alain","Alexandre",..: 128 33 121 122 93 93 > 130 126 63 59 ... > $ sexe : Factor w/ 2 levels "F","H": 2 2 2 2 2 2 2 2 2 2 ... > $ dist : int 45 45 45 45 45 45 45 45 45 45 ... > $ style : Factor w/ 2 levels "clas","free": 2 2 2 2 2 2 2 2 2 2 ... > > > 2. The second data frame is called "meil2" with 4 variables and 16 rows; >> meil2[1:10,] > dist sexe style meil > 1 38 F clas 02:43:17 > 2 38 F free 02:24:46 > 3 38 H clas 02:37:36 > 4 38 H free 01:59:35 > 5 45 F clas 03:46:15 > 6 45 F free 02:20:15 > 7 45 H clas 02:30:07 > 8 45 H free 01:59:36 > 9 38 F clas 02:43:17 > 10 38 F free 02:24:46Lines 9 and 1 appear to be the same in meil2, as do 2 and 10. If the 16 rows consist of two repeats of 8 rows that would explain why you are getting two copies of each individual in the output. unique(meil2) would have just the distinct rows. -thomas Thomas Lumley Assoc. Professor, Biostatistics tlumley at u.washington.edu University of Washington, Seattle