thr3ads.net - R help - [R] Merging two data frames with 3 common variables makes duplicated rows [May 2009]

If this information is useful, please help other people find it:
Share via:

Rock Ouimet

2009-May-08 21:28 UTC

[R] Merging two data frames with 3 common variables makes duplicated rows

I am new to R (ex SAS user) , and I cannot merge two data frames without
getting duplicated rows in the results. How to avoid this happening without
using the unique() function?

1. First data frame is called "tmv" with 6 variables and 239 rows:
> tmv[1:10,]      temps       nom        prenom sexe dist style
1  01:59:36       Cyr         Steve    H   45  free
2  02:09:55  Gosselin         Erick    H   45  free
3  02:12:18 Desfosses         Sacha    H   45  free
4  02:12:23  Lapointe     Sebastien    H   45  free
5  02:12:52    Labrie        Michel    H   45  free
6  02:12:54   Leblanc        Michel    H   45  free
7  02:13:02 Thibeault       Sylvain    H   45  free
8  02:13:49    Martel      Stephane    H   45  free
9  02:14:03    Lavoie Jean-Philippe    H   45  free
10 02:14:05    Boivin   Jean-Claude    H   45  free

Its structure is:> str(tmv)'data.frame':   239 obs. of  6 variables:
 $ temps :Class 'times'  atomic [1:239] 0.0831 0.0902 0.0919 0.0919
0.0923
...
  .. ..- attr(*, "format")= chr "h:m:s"
 $ nom   : Factor w/ 167 levels "Aubut","Audy",..: 45 84 55
105 98 110 158
117 109 22 ...
 $ prenom: Factor w/ 135 levels "Alain","Alexandre",..: 128
33 121 122 93 93
130 126 63 59 ...
 $ sexe  : Factor w/ 2 levels "F","H": 2 2 2 2 2 2 2 2 2 2
...
 $ dist  : int  45 45 45 45 45 45 45 45 45 45 ...
 $ style : Factor w/ 2 levels "clas","free": 2 2 2 2 2 2 2 2
2 2 ...


2. The second data frame is called "meil2" with 4 variables and 16
rows;> meil2[1:10,]   dist sexe style     meil
1    38    F  clas 02:43:17
2    38    F  free 02:24:46
3    38    H  clas 02:37:36
4    38    H  free 01:59:35
5    45    F  clas 03:46:15
6    45    F  free 02:20:15
7    45    H  clas 02:30:07
8    45    H  free 01:59:36
9    38    F  clas 02:43:17
10   38    F  free 02:24:46

Its structure is:> str(tmv)'data.frame':   239 obs. of  6 variables:
 $ temps :Class 'times'  atomic [1:239] 0.0831 0.0902 0.0919 0.0919
0.0923
...
  .. ..- attr(*, "format")= chr "h:m:s"
 $ nom   : Factor w/ 167 levels "Aubut","Audy",..: 45 84 55
105 98 110 158
117 109 22 ...
 $ prenom: Factor w/ 135 levels "Alain","Alexandre",..: 128
33 121 122 93 93
130 126 63 59 ...
 $ sexe  : Factor w/ 2 levels "F","H": 2 2 2 2 2 2 2 2 2 2
...
 $ dist  : int  45 45 45 45 45 45 45 45 45 45 ...
 $ style : Factor w/ 2 levels "clas","free": 2 2 2 2 2 2 2 2
2 2 ...

Note that the two data frames have "sexe", "dist", and
"style" as common
variables, and of the same class (Factor) and number of levels.

When merging the two data frames into "tmv3", the merging is fine, but
all
the rows get duplicated:> tmv3 <- merge(tmv, meil2, sort=TRUE, by=c("sexe",
"dist", "style")
)> tmv3[1:10,]   sexe dist style    temps        nom       prenom     meil
1     F   38  clas 02:49:15    Boucher Marie-Amelie 02:43:17
2     F   38  clas 02:49:15    Boucher Marie-Amelie 02:43:17
3     F   38  clas 03:24:05     Vachon     Guylaine 02:43:17
4     F   38  clas 03:24:05     Vachon     Guylaine 02:43:17
5     F   38  clas 03:13:11 Villeneuve       Rejean 02:43:17
6     F   38  clas 03:13:11 Villeneuve       Rejean 02:43:17
7     F   38  clas 03:37:54    Stevens        Julie 02:43:17
8     F   38  clas 03:37:54    Stevens        Julie 02:43:17
9     F   38  clas 03:53:03       Cote       Marthe 02:43:17
10    F   38  clas 03:53:03       Cote       Marthe 02:43:17

Can anyone explain this behavior from R ?
$version.string
[1] "R version 2.8.1 (2008-12-22)"

Rock

	[[alternative HTML version deleted]]

Thomas Lumley

2009-May-08 22:06 UTC

head link

[R] Merging two data frames with 3 common variables makes duplicated rows

On Fri, 8 May 2009, Rock Ouimet wrote:
> I am new to R (ex SAS user) , and I cannot merge two data frames without
> getting duplicated rows in the results. How to avoid this happening without
> using the unique() function?
>
> 1. First data frame is called "tmv" with 6 variables and 239
rows:
>
>> tmv[1:10,]
>      temps       nom        prenom sexe dist style
> 1  01:59:36       Cyr         Steve    H   45  free
> 2  02:09:55  Gosselin         Erick    H   45  free
> 3  02:12:18 Desfosses         Sacha    H   45  free
> 4  02:12:23  Lapointe     Sebastien    H   45  free
> 5  02:12:52    Labrie        Michel    H   45  free
> 6  02:12:54   Leblanc        Michel    H   45  free
> 7  02:13:02 Thibeault       Sylvain    H   45  free
> 8  02:13:49    Martel      Stephane    H   45  free
> 9  02:14:03    Lavoie Jean-Philippe    H   45  free
> 10 02:14:05    Boivin   Jean-Claude    H   45  free
>
> Its structure is:
>> str(tmv)
> 'data.frame':   239 obs. of  6 variables:
> $ temps :Class 'times'  atomic [1:239] 0.0831 0.0902 0.0919 0.0919
0.0923
> ...
>  .. ..- attr(*, "format")= chr "h:m:s"
> $ nom   : Factor w/ 167 levels "Aubut","Audy",..: 45 84
55 105 98 110 158
> 117 109 22 ...
> $ prenom: Factor w/ 135 levels "Alain","Alexandre",..:
128 33 121 122 93 93
> 130 126 63 59 ...
> $ sexe  : Factor w/ 2 levels "F","H": 2 2 2 2 2 2 2 2 2
2 ...
> $ dist  : int  45 45 45 45 45 45 45 45 45 45 ...
> $ style : Factor w/ 2 levels "clas","free": 2 2 2 2 2 2
2 2 2 2 ...
>
>
> 2. The second data frame is called "meil2" with 4 variables and
16 rows;
>> meil2[1:10,]
>   dist sexe style     meil
> 1    38    F  clas 02:43:17
> 2    38    F  free 02:24:46
> 3    38    H  clas 02:37:36
> 4    38    H  free 01:59:35
> 5    45    F  clas 03:46:15
> 6    45    F  free 02:20:15
> 7    45    H  clas 02:30:07
> 8    45    H  free 01:59:36
> 9    38    F  clas 02:43:17
> 10   38    F  free 02:24:46

Lines 9 and 1 appear to be the same in meil2, as do 2 and 10.  If the 16 rows
consist of two repeats of 8 rows that would explain why you are getting two
copies of each individual in the output. unique(meil2) would have just the
distinct rows.

      -thomas

Thomas Lumley			Assoc. Professor, Biostatistics
tlumley at u.washington.edu	University of Washington, Seattle

Possibly Parallel Threads

Search for more maybe matching threads

R help - May 2009 - Merging two data frames with 3 common variables makes duplicated rows

[R] Merging two data frames with 3 common variables makes duplicated rows

[R] Merging two data frames with 3 common variables makes duplicated rows

Possibly Parallel Threads