thr3ads.net - R help - [R] merge: right set overwrite left set [Jul 2015]

If this information is useful, please help other people find it:
Share via:

aldi

2015-Jul-12 14:45 UTC

[R] merge: right set overwrite left set

Hi,
I have two sets of data x.HHu and y.HHo, rows are IDs and columns are 
individuals. I do not know in advance indv or HHid, both of them will be 
captured from the data. As the y.HHo set updates, y.HHo set has better 
information then x.HHu set. Thus I want a merge where right set 
overwrites left set info based on HHid, i.e. to overwrite x.HHu set with 
y.HHo set but keep any extra info from the x.HHu set that is not present 
in y.HHo set.
HHids will be complete based on z.map, with the corresponding positions.
I am having trouble with the part after this line: ### 
============================================+++++++++++++++++++++++++++
I am thinking that I am creating new columns "position"
"indv1" and
"indv2", but R is interpreting them as row information.
See the expected final table at the end. HHid is common, indv3 is from 
x.HHu, and the rest position and indv1 and indv2 are from y.HHo
Any suggestions are appreciated.
Thank you in advance,
Aldi

x.HHu<- data.frame(
            HHid = c( 'HH1', 'HH2', 'HH3',
'HH4', 'HH5', 'HH10')
          , indv1 = c( 2, 0, 2 , 0, 2, 0)
          , indv2 = c( 0, NA, 2, 2, 2, 2)
          , ind3 = c( 0, 0, 0, 0, 0, 0)
          )
### the HHo data will be the top set to overwrite any HHu data, when 
they exist, thinking that HHo are better than HHu results
### when they are available

y.HHo<-data.frame(HHid=c('HH1', 'HH2','HH5',
'HH3', 'HH10')
          , indv1 = c(2, 0, 2, 0, NA)
          , indv2 = c(0, 2, 2, 1, 2)
          )

z.map<-data.frame(HHid = c('HH1', 'HH2', 'HH3',
'HH4', 'HH5',
'HH6','HH8', 'HH7', 'HH9', 'HH10',
'HH11')
                 , position= c(10,20,30,42,55,66,81,75,92,101,111)
                 )
### see objects
x.HHu
y.HHo
z.map
### now sort the map by position, this sorted map will be used to sort 
finally all data
z.map<-z.map[with(z.map, order(position)), ]
z.map

### First I introduce position to both sets so I can sort them in 
advance by position.
x.HHu.map <-merge( z.map, x.HHu, by='HHid', all=T)
x.HHu.map<-x.HHu.map[with(x.HHu.map, order(position)), ]
x.HHu.map

y.HHo.map <-merge( z.map, y.HHo, by='HHid', all= T)
y.HHo.map<-y.HHo.map[with(y.HHo.map, order(position)), ]
y.HHo.map

### now merge HHu  and HHo  with the hope to overwrite the HHu set with 
HHo wherever they overlap by column names.
zzz <- merge(x.HHu.map, y.HHo.map, by='HHid', all=T)
zzz
### find common variable names in two sets

commonNames <- names(x.impu.map)[which(colnames(x.impu.map) %in% 
colnames(y.geno.map))]

## remove HHid wich is common for x and y, but work with the rest of columns
commonNames<-commonNames[-c(1)]

### ============================================+++++++++++++++++++++++++++
for(i in 1:length(commonNames)){

print(commonNames[i])
zzz$commonNames[i] <- NA

print(paste("zzz","$",commonNames[i],".y",sep=""))

zzz$commonNames[i] <-
zzz[,paste(commonNames[i],".y",sep="")]

### paste(zzz$commonNames[i],".x",sep='') <- NULL;
### paste(zzz$commonNames[i],".y",sep='') <- NULL;

}
zzz

The final expected set has to be: HHid is common, indv3 is from x.HHu, 
and the rest position and indv1 and indv2 are from y.HHo
    HHid     position     ind3  indv1 indv2
1   HH1         10          0     2       0
2  HH10        101          0    NA       2
3  HH11        111         NA    NA      NA
4   HH2         20          0     0       2
5   HH3         30          0     0       1
6   HH4         42          0    NA      NA
7   HH5         55          0     2       2
8   HH6         66         NA    NA      NA
9   HH7         75         NA    NA      NA
10  HH8         81         NA    NA      NA
11  HH9         92         NA    NA      NA

-- 


	[[alternative HTML version deleted]]

Ista Zahn

2015-Jul-12 17:06 UTC

head link

[R] merge: right set overwrite left set

I think this does what you want:

## find idiv coloumns in x.HHu.map that don't exist in y.HHo.map
x.HHu.map <- x.HHu.map[
    c("HHid",
      "position",
      names(x.HHu.map)[
               !names(x.HHu.map)
               %in% names(y.HHo.map)]
      )]
## merge, adding extra column from x.HHu.map
zzz <- merge(y.HHo.map, x.HHu.map, by=c('HHid', 'position'),
all=T)
## order by HHid
zzz <- zzz[order(zzz$HHid),]

Best,
Ista

On Sun, Jul 12, 2015 at 10:45 AM, aldi <aldi at dsgmail.wustl.edu>
wrote:> Hi,
> I have two sets of data x.HHu and y.HHo, rows are IDs and columns are
> individuals. I do not know in advance indv or HHid, both of them will be
> captured from the data. As the y.HHo set updates, y.HHo set has better
> information then x.HHu set. Thus I want a merge where right set
> overwrites left set info based on HHid, i.e. to overwrite x.HHu set with
> y.HHo set but keep any extra info from the x.HHu set that is not present
> in y.HHo set.
> HHids will be complete based on z.map, with the corresponding positions.
> I am having trouble with the part after this line: ###
> ============================================+++++++++++++++++++++++++++
> I am thinking that I am creating new columns "position"
"indv1" and
> "indv2", but R is interpreting them as row information.
> See the expected final table at the end. HHid is common, indv3 is from
> x.HHu, and the rest position and indv1 and indv2 are from y.HHo
> Any suggestions are appreciated.
> Thank you in advance,
> Aldi
>
> x.HHu<- data.frame(
>             HHid = c( 'HH1', 'HH2', 'HH3',
'HH4', 'HH5', 'HH10')
>           , indv1 = c( 2, 0, 2 , 0, 2, 0)
>           , indv2 = c( 0, NA, 2, 2, 2, 2)
>           , ind3 = c( 0, 0, 0, 0, 0, 0)
>           )
> ### the HHo data will be the top set to overwrite any HHu data, when
> they exist, thinking that HHo are better than HHu results
> ### when they are available
>
> y.HHo<-data.frame(HHid=c('HH1', 'HH2','HH5',
'HH3', 'HH10')
>           , indv1 = c(2, 0, 2, 0, NA)
>           , indv2 = c(0, 2, 2, 1, 2)
>           )
>
> z.map<-data.frame(HHid = c('HH1', 'HH2', 'HH3',
'HH4', 'HH5',
> 'HH6','HH8', 'HH7', 'HH9', 'HH10',
'HH11')
>                  , position= c(10,20,30,42,55,66,81,75,92,101,111)
>                  )
> ### see objects
> x.HHu
> y.HHo
> z.map
> ### now sort the map by position, this sorted map will be used to sort
> finally all data
> z.map<-z.map[with(z.map, order(position)), ]
> z.map
>
> ### First I introduce position to both sets so I can sort them in
> advance by position.
> x.HHu.map <-merge( z.map, x.HHu, by='HHid', all=T)
> x.HHu.map<-x.HHu.map[with(x.HHu.map, order(position)), ]
> x.HHu.map
>
> y.HHo.map <-merge( z.map, y.HHo, by='HHid', all= T)
> y.HHo.map<-y.HHo.map[with(y.HHo.map, order(position)), ]
> y.HHo.map
>
> ### now merge HHu  and HHo  with the hope to overwrite the HHu set with
> HHo wherever they overlap by column names.
> zzz <- merge(x.HHu.map, y.HHo.map, by='HHid', all=T)
> zzz
> ### find common variable names in two sets
>
> commonNames <- names(x.impu.map)[which(colnames(x.impu.map) %in%
> colnames(y.geno.map))]
>
> ## remove HHid wich is common for x and y, but work with the rest of
columns
> commonNames<-commonNames[-c(1)]
>
> ### ============================================+++++++++++++++++++++++++++
> for(i in 1:length(commonNames)){
>
> print(commonNames[i])
> zzz$commonNames[i] <- NA
>
>
print(paste("zzz","$",commonNames[i],".y",sep=""))
>
> zzz$commonNames[i] <-
zzz[,paste(commonNames[i],".y",sep="")]
>
> ### paste(zzz$commonNames[i],".x",sep='') <- NULL;
> ### paste(zzz$commonNames[i],".y",sep='') <- NULL;
>
> }
> zzz
>
> The final expected set has to be: HHid is common, indv3 is from x.HHu,
> and the rest position and indv1 and indv2 are from y.HHo
>     HHid     position     ind3  indv1 indv2
> 1   HH1         10          0     2       0
> 2  HH10        101          0    NA       2
> 3  HH11        111         NA    NA      NA
> 4   HH2         20          0     0       2
> 5   HH3         30          0     0       1
> 6   HH4         42          0    NA      NA
> 7   HH5         55          0     2       2
> 8   HH6         66         NA    NA      NA
> 9   HH7         75         NA    NA      NA
> 10  HH8         81         NA    NA      NA
> 11  HH9         92         NA    NA      NA
>
> --
>
>
>         [[alternative HTML version deleted]]
>
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

Jeff Newmiller

2015-Jul-12 18:35 UTC

head link

[R] merge: right set overwrite left set

I get confused by your use of the position map table. If I follow your 
description toward your desired result, I take a different route that 
makes sense to me. Perhaps it will make sense to you as well. The key idea 
is to make individual comparisons of the values for each combination of 
HHid and indv, regardless of where they are in the original data frames.

Below are two different syntactic representations of my understanding of 
your problem. They differ only in the approach taken to strip away syntax 
clutter. I start by making the HHid identifiers character values in the 
original data frames, because their respective factor levels in the two 
data frames would not necessarily correspond.

x.HHu <- data.frame(
     HHid = c( 'HH1', 'HH2', 'HH3', 'HH4',
'HH5', 'HH10' )
   , indv1 = c( 2, 0, 2 , 0, 2, 0 )
   , indv2 = c( 0, NA, 2, 2, 2, 2 )
   , ind3 = c( 0, 0, 0, 0, 0, 0 )
   , stringsAsFactors = FALSE # avoid creating HHid as a factor
)
y.HHo <- data.frame(
     HHid=c( 'HH1', 'HH2','HH5', 'HH3',
'HH10' )
   , indv1 = c( 2, 0, 2, 0, NA )
   , indv2 = c( 0, 2, 2, 1, 2 )
   , stringsAsFactors = FALSE
)
z.map <- data.frame(
     HHid = c( 'HH1', 'HH2', 'HH3', 'HH4',
'HH5'
             , 'HH6','HH8', 'HH7', 'HH9',
'HH10', 'HH11' )
   , position= c( 10, 20, 30, 42, 55, 66, 81, 75, 92, 101, 111 )
   , stringsAsFactors = FALSE
)


# reshape2 solution
library(reshape2)

x.HHu.long <- melt( x.HHu, "HHid", variable.name = "indv"
)
x.HHu.long$indv <- as.character( x.HHu.long$indv )
y.HHo.long <- melt( y.HHo, "HHid", variable.name = "indv"
)
y.HHo.long$indv <- as.character( y.HHo.long$indv )
xy.HH.long <- merge( x.HHu.long
                    , y.HHo.long
                    , by = c( "HHid", "indv" )
                    , all = TRUE )
xy.HH.long$value <- with( xy.HH.long
                         , ifelse( is.na( value.y )
                                 , value.x
                                 , value.y ) )
xy.HH0 <- dcast( xy.HH.long, HHid ~ indv )
xy.HH <- merge( xy.HH0, z.map, all=TRUE )
xy.HH <- xy.HH[ order( xy.HH$HHid ), ]
# compare xy.HH with zzz ... I think there is an error in zzz for 
# HH10/indv1, because NA should not be considered more informative
# than 0...

# tidyr/dplyr solution
library(tidyr)
library(dplyr)

# define a common processing sequence
lengthen <- (   . # period is placeholder for data frame
             %>% gather( indv, value, -HHid )
             %>% mutate( indv = as.character( indv ) )
             )
x.HHu.dlong <- x.HHu %>% lengthen
y.HHo.dlong <- y.HHo %>% lengthen
xy.HH.d <- (   x.HHu.dlong
            %>% full_join( y.HHo.dlong, by= c( "HHid",
"indv" ) )
            %>% transmute( HHid = HHid
                         , indv = indv
                         , value = ifelse( is.na( value.y )
                                         , value.x
                                         , value.y )
                         )
            %>% spread( indv, value )
            %>% full_join( z.map, by="HHid" )
            %>% arrange( HHid )
            %>% as.data.frame
            )

On Sun, 12 Jul 2015, aldi wrote:
> Hi,
> I have two sets of data x.HHu and y.HHo, rows are IDs and columns are
> individuals. I do not know in advance indv or HHid, both of them will be
> captured from the data. As the y.HHo set updates, y.HHo set has better
> information then x.HHu set. Thus I want a merge where right set
> overwrites left set info based on HHid, i.e. to overwrite x.HHu set with
> y.HHo set but keep any extra info from the x.HHu set that is not present
> in y.HHo set.
> HHids will be complete based on z.map, with the corresponding positions.
> I am having trouble with the part after this line: ###
> ============================================+++++++++++++++++++++++++++
> I am thinking that I am creating new columns "position"
"indv1" and
> "indv2", but R is interpreting them as row information.
> See the expected final table at the end. HHid is common, indv3 is from
> x.HHu, and the rest position and indv1 and indv2 are from y.HHo
> Any suggestions are appreciated.
> Thank you in advance,
> Aldi
>
> x.HHu<- data.frame(
>            HHid = c( 'HH1', 'HH2', 'HH3',
'HH4', 'HH5', 'HH10')
>          , indv1 = c( 2, 0, 2 , 0, 2, 0)
>          , indv2 = c( 0, NA, 2, 2, 2, 2)
>          , ind3 = c( 0, 0, 0, 0, 0, 0)
>          )
> ### the HHo data will be the top set to overwrite any HHu data, when
> they exist, thinking that HHo are better than HHu results
> ### when they are available
>
> y.HHo<-data.frame(HHid=c('HH1', 'HH2','HH5',
'HH3', 'HH10')
>          , indv1 = c(2, 0, 2, 0, NA)
>          , indv2 = c(0, 2, 2, 1, 2)
>          )
>
> z.map<-data.frame(HHid = c('HH1', 'HH2', 'HH3',
'HH4', 'HH5',
> 'HH6','HH8', 'HH7', 'HH9', 'HH10',
'HH11')
>                 , position= c(10,20,30,42,55,66,81,75,92,101,111)
>                 )
> ### see objects
> x.HHu
> y.HHo
> z.map
> ### now sort the map by position, this sorted map will be used to sort
> finally all data
> z.map<-z.map[with(z.map, order(position)), ]
> z.map
>
> ### First I introduce position to both sets so I can sort them in
> advance by position.
> x.HHu.map <-merge( z.map, x.HHu, by='HHid', all=T)
> x.HHu.map<-x.HHu.map[with(x.HHu.map, order(position)), ]
> x.HHu.map
>
> y.HHo.map <-merge( z.map, y.HHo, by='HHid', all= T)
> y.HHo.map<-y.HHo.map[with(y.HHo.map, order(position)), ]
> y.HHo.map
>
> ### now merge HHu  and HHo  with the hope to overwrite the HHu set with
> HHo wherever they overlap by column names.
> zzz <- merge(x.HHu.map, y.HHo.map, by='HHid', all=T)
> zzz
> ### find common variable names in two sets
>
> commonNames <- names(x.impu.map)[which(colnames(x.impu.map) %in%
> colnames(y.geno.map))]
>
> ## remove HHid wich is common for x and y, but work with the rest of
columns
> commonNames<-commonNames[-c(1)]
>
> ### ============================================+++++++++++++++++++++++++++
> for(i in 1:length(commonNames)){
>
> print(commonNames[i])
> zzz$commonNames[i] <- NA
>
>
print(paste("zzz","$",commonNames[i],".y",sep=""))
>
> zzz$commonNames[i] <-
zzz[,paste(commonNames[i],".y",sep="")]
>
> ### paste(zzz$commonNames[i],".x",sep='') <- NULL;
> ### paste(zzz$commonNames[i],".y",sep='') <- NULL;
>
> }
> zzz
>
> The final expected set has to be: HHid is common, indv3 is from x.HHu,
> and the rest position and indv1 and indv2 are from y.HHo
>    HHid     position     ind3  indv1 indv2
> 1   HH1         10          0     2       0
> 2  HH10        101          0    NA       2
> 3  HH11        111         NA    NA      NA
> 4   HH2         20          0     0       2
> 5   HH3         30          0     0       1
> 6   HH4         42          0    NA      NA
> 7   HH5         55          0     2       2
> 8   HH6         66         NA    NA      NA
> 9   HH7         75         NA    NA      NA
> 10  HH8         81         NA    NA      NA
> 11  HH9         92         NA    NA      NA
>
> -- 
>
>
> 	[[alternative HTML version deleted]]
>
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
---------------------------------------------------------------------------
Jeff Newmiller                        The     .....       .....  Go Live...
DCN:<jdnewmil at dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live
Go...
                                       Live:   OO#.. Dead: OO#..  Playing
Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
/Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k

aldi

2015-Jul-13 11:23 UTC

head link

[R] merge: right set overwrite left set

Thank you Ista,
Your solution is smart, by sub-setting from x.HHu.map data only
"HHid",
"position" as indices (because they are unique) for the merge, and any
extra columns in x.HHu.map that are not present in y.HHo,map, thus when 
the merge is done with option all=T, will work among the two sets of 
data, without creating .x and .y when the variables are in common in two 
sets.

With best wishes,
Aldi

 > ## find indv columns in x.HHu.map that don't exist in y.HHo.map
 > x.HHu.map <- x.HHu.map[
+     c("HHid",
+       "position",
+       names(x.HHu.map)[
+                !names(x.HHu.map)
+                %in% names(y.HHo.map)]
+       )]
 > ## merge, adding extra column from x.HHu.map
 > zzz <- merge(y.HHo.map, x.HHu.map, by=c('HHid',
'position'), all=T)
 > ## order by HHid
 > zzz <- zzz[order(zzz$HHid),]
 > zzz
    HHid position indv1 indv2 ind3
1   HH1       10     2     0    0
2  HH10      101    NA     2    0
3  HH11      111    NA    NA   NA
4   HH2       20     0     2    0
5   HH3       30     0     1    0
6   HH4       42    NA    NA    0
7   HH5       55     2     2    0
8   HH6       66    NA    NA   NA
9   HH7       75    NA    NA   NA
10  HH8       81    NA    NA   NA
11  HH9       92    NA    NA   NA


On 7/12/2015 12:06 PM, Ista Zahn wrote:> I think this does what you want:
>
> ## find idiv coloumns in x.HHu.map that don't exist in y.HHo.map
> x.HHu.map <- x.HHu.map[
>      c("HHid",
>        "position",
>        names(x.HHu.map)[
>                 !names(x.HHu.map)
>                 %in% names(y.HHo.map)]
>        )]
> ## merge, adding extra column from x.HHu.map
> zzz <- merge(y.HHo.map, x.HHu.map, by=c('HHid',
'position'), all=T)
> ## order by HHid
> zzz <- zzz[order(zzz$HHid),]
>
> Best,
> Ista
>
> On Sun, Jul 12, 2015 at 10:45 AM, aldi <aldi at dsgmail.wustl.edu>
wrote:
>> Hi,
>> I have two sets of data x.HHu and y.HHo, rows are IDs and columns are
>> individuals. I do not know in advance indv or HHid, both of them will
be
>> captured from the data. As the y.HHo set updates, y.HHo set has better
>> information then x.HHu set. Thus I want a merge where right set
>> overwrites left set info based on HHid, i.e. to overwrite x.HHu set
with
>> y.HHo set but keep any extra info from the x.HHu set that is not
present
>> in y.HHo set.
>> HHids will be complete based on z.map, with the corresponding
positions.
>> I am having trouble with the part after this line: ###
>> ============================================+++++++++++++++++++++++++++
>> I am thinking that I am creating new columns "position"
"indv1" and
>> "indv2", but R is interpreting them as row information.
>> See the expected final table at the end. HHid is common, indv3 is from
>> x.HHu, and the rest position and indv1 and indv2 are from y.HHo
>> Any suggestions are appreciated.
>> Thank you in advance,
>> Aldi
>>
>> x.HHu<- data.frame(
>>              HHid = c( 'HH1', 'HH2', 'HH3',
'HH4', 'HH5', 'HH10')
>>            , indv1 = c( 2, 0, 2 , 0, 2, 0)
>>            , indv2 = c( 0, NA, 2, 2, 2, 2)
>>            , ind3 = c( 0, 0, 0, 0, 0, 0)
>>            )
>> ### the HHo data will be the top set to overwrite any HHu data, when
>> they exist, thinking that HHo are better than HHu results
>> ### when they are available
>>
>> y.HHo<-data.frame(HHid=c('HH1', 'HH2','HH5',
'HH3', 'HH10')
>>            , indv1 = c(2, 0, 2, 0, NA)
>>            , indv2 = c(0, 2, 2, 1, 2)
>>            )
>>
>> z.map<-data.frame(HHid = c('HH1', 'HH2',
'HH3', 'HH4', 'HH5',
>> 'HH6','HH8', 'HH7', 'HH9',
'HH10', 'HH11')
>>                   , position= c(10,20,30,42,55,66,81,75,92,101,111)
>>                   )
>> ### see objects
>> x.HHu
>> y.HHo
>> z.map
>> ### now sort the map by position, this sorted map will be used to sort
>> finally all data
>> z.map<-z.map[with(z.map, order(position)), ]
>> z.map
>>
>> ### First I introduce position to both sets so I can sort them in
>> advance by position.
>> x.HHu.map <-merge( z.map, x.HHu, by='HHid', all=T)
>> x.HHu.map<-x.HHu.map[with(x.HHu.map, order(position)), ]
>> x.HHu.map
>>
>> y.HHo.map <-merge( z.map, y.HHo, by='HHid', all= T)
>> y.HHo.map<-y.HHo.map[with(y.HHo.map, order(position)), ]
>> y.HHo.map
>>
>> ### now merge HHu  and HHo  with the hope to overwrite the HHu set with
>> HHo wherever they overlap by column names.
>> zzz <- merge(x.HHu.map, y.HHo.map, by='HHid', all=T)
>> zzz
>> ### find common variable names in two sets
>>
>> commonNames <- names(x.impu.map)[which(colnames(x.impu.map) %in%
>> colnames(y.geno.map))]
>>
>> ## remove HHid wich is common for x and y, but work with the rest of
columns
>> commonNames<-commonNames[-c(1)]
>>
>> ###
============================================+++++++++++++++++++++++++++
>> for(i in 1:length(commonNames)){
>>
>> print(commonNames[i])
>> zzz$commonNames[i] <- NA
>>
>>
print(paste("zzz","$",commonNames[i],".y",sep=""))
>>
>> zzz$commonNames[i] <-
zzz[,paste(commonNames[i],".y",sep="")]
>>
>> ### paste(zzz$commonNames[i],".x",sep='') <- NULL;
>> ### paste(zzz$commonNames[i],".y",sep='') <- NULL;
>>
>> }
>> zzz
>>
>> The final expected set has to be: HHid is common, indv3 is from x.HHu,
>> and the rest position and indv1 and indv2 are from y.HHo
>>      HHid     position     ind3  indv1 indv2
>> 1   HH1         10          0     2       0
>> 2  HH10        101          0    NA       2
>> 3  HH11        111         NA    NA      NA
>> 4   HH2         20          0     0       2
>> 5   HH3         30          0     0       1
>> 6   HH4         42          0    NA      NA
>> 7   HH5         55          0     2       2
>> 8   HH6         66         NA    NA      NA
>> 9   HH7         75         NA    NA      NA
>> 10  HH8         81         NA    NA      NA
>> 11  HH9         92         NA    NA      NA
>>
>> --
>>
>>

	[[alternative HTML version deleted]]

aldi

2015-Jul-13 12:14 UTC

head link

[R] merge: right set overwrite left set

Thank you Jeff,
Your solutions have two great aspects: a) you provide a different 
approach by using reshape2 syntax / tidyr, and b) the concern that it is 
better to update x.HHu.map with y.HHo.map, without overwriting x.HHu.map 
with NA from y.HHo.map, thus keeping intact the old value(s). That is 
the ideal situation, but I do not know exactly if the old value is 
correct, which if it is not correct then it may create strata in the 
data. Therefore I prefer better to overwrite any old column with the new 
column when it exits, even with NA. In addition, you introduce the 
safety stringAsFactors=F. The map is only for getting "position" based
on HHid, and at the end I plan sorting final set based on position.

While the reshape2 worked with no problems from the start, the other 
solution with
# tidyr/dplyr solution
library(tidyr)
library(dplyr)
did not work correct in the start (it stopped R working and OS says R 
have to close) in windows 8.1 with R 3.1.2, quite possible because 
called libraries were compiled under R 3.1.3.
When installed the new version of R: 3.2.1, the second solution worked 
also with no problem.
For whoever has written the libraries tidyr and dplyr, they produced 
warnings for function name conflicts with base and stats:
Attaching package: ?dplyr?
The following objects are masked from ?package:stats?:
     filter, lag
The following objects are masked from ?package:base?:
     intersect, setdiff, setequal, union

Great solutions!
Thank you,
Aldi

 > x.HHu <- data.frame(
+     HHid = c( 'HH1', 'HH2', 'HH3', 'HH4',
'HH5', 'HH10' )
+   , indv1 = c( 2, 0, 2 , 0, 2, 0 )
+   , indv2 = c( 0, NA, 2, 2, 2, 2 )
+   , ind3 = c( 0, 0, 0, 0, 0, 0 )
+   , stringsAsFactors = FALSE # avoid creating HHid as a factor
+ )
 > y.HHo <- data.frame(
+     HHid=c( 'HH1', 'HH2','HH5', 'HH3',
'HH10' )
+   , indv1 = c( 2, 0, 2, 0, NA )
+   , indv2 = c( 0, 2, 2, 1, 2 )
+   , stringsAsFactors = FALSE
+ )
 > z.map <- data.frame(
+     HHid = c( 'HH1', 'HH2', 'HH3', 'HH4',
'HH5'
+             , 'HH6','HH8', 'HH7', 'HH9',
'HH10', 'HH11' )
+   , position= c( 10, 20, 30, 42, 55, 66, 81, 75, 92, 101, 111 )
+   , stringsAsFactors = FALSE
+ )
 > # reshape2 solution
 > library(reshape2)
 >
 > x.HHu.long <- melt( x.HHu, "HHid", variable.name =
"indv" )
 > x.HHu.long$indv <- as.character( x.HHu.long$indv )
 > y.HHo.long <- melt( y.HHo, "HHid", variable.name =
"indv" )
 > y.HHo.long$indv <- as.character( y.HHo.long$indv )
 > xy.HH.long <- merge( x.HHu.long
+                    , y.HHo.long
+                    , by = c( "HHid", "indv" )
+                    , all = TRUE )
 > xy.HH.long$value <- with( xy.HH.long
+                         , ifelse( is.na( value.y )
+                                 , value.x
+                                 , value.y ) )
 > xy.HH0 <- dcast( xy.HH.long, HHid ~ indv )
 > xy.HH <- merge( xy.HH0, z.map, all=TRUE )
 > xy.HH <- xy.HH[ order( xy.HH$HHid ), ]
 > # compare xy.HH with zzz ... I think there is an error in zzz for # 
HH10/indv1, because NA should not be considered more informative
 > # than 0...
### solution with reshape2
 > xy.HH
    HHid ind3 indv1 indv2 position
1   HH1    0     2     0       10
2  HH10    0     0     2      101
3  HH11   NA    NA    NA      111
4   HH2    0     0     2       20
5   HH3    0     0     1       30
6   HH4    0     0     2       42
7   HH5    0     2     2       55
8   HH6   NA    NA    NA       66
9   HH7   NA    NA    NA       75
10  HH8   NA    NA    NA       81
11  HH9   NA    NA    NA       92

### Solution with tidyr:
 > xy.HH.d
    HHid ind3 indv1 indv2 position
1   HH1    0     2     0       10
2  HH10    0     0     2      101
3  HH11   NA    NA    NA      111
4   HH2    0     0     2       20
5   HH3    0     0     1       30
6   HH4    0     0     2       42
7   HH5    0     2     2       55
8   HH6   NA    NA    NA       66
9   HH7   NA    NA    NA       75
10  HH8   NA    NA    NA       81
11  HH9   NA    NA    NA       92

On 7/12/2015 1:35 PM, Jeff Newmiller wrote:> I get confused by your use of the position map table. If I follow your 
> description toward your desired result, I take a different route that 
> makes sense to me. Perhaps it will make sense to you as well. The key 
> idea is to make individual comparisons of the values for each 
> combination of HHid and indv, regardless of where they are in the 
> original data frames.
>
> Below are two different syntactic representations of my understanding 
> of your problem. They differ only in the approach taken to strip away 
> syntax clutter. I start by making the HHid identifiers character 
> values in the original data frames, because their respective factor 
> levels in the two data frames would not necessarily correspond.
>
> x.HHu <- data.frame(
>     HHid = c( 'HH1', 'HH2', 'HH3', 'HH4',
'HH5', 'HH10' )
>   , indv1 = c( 2, 0, 2 , 0, 2, 0 )
>   , indv2 = c( 0, NA, 2, 2, 2, 2 )
>   , ind3 = c( 0, 0, 0, 0, 0, 0 )
>   , stringsAsFactors = FALSE # avoid creating HHid as a factor
> )
> y.HHo <- data.frame(
>     HHid=c( 'HH1', 'HH2','HH5', 'HH3',
'HH10' )
>   , indv1 = c( 2, 0, 2, 0, NA )
>   , indv2 = c( 0, 2, 2, 1, 2 )
>   , stringsAsFactors = FALSE
> )
> z.map <- data.frame(
>     HHid = c( 'HH1', 'HH2', 'HH3', 'HH4',
'HH5'
>             , 'HH6','HH8', 'HH7', 'HH9',
'HH10', 'HH11' )
>   , position= c( 10, 20, 30, 42, 55, 66, 81, 75, 92, 101, 111 )
>   , stringsAsFactors = FALSE
> )
>
>
> # reshape2 solution
> library(reshape2)
>
> x.HHu.long <- melt( x.HHu, "HHid", variable.name =
"indv" )
> x.HHu.long$indv <- as.character( x.HHu.long$indv )
> y.HHo.long <- melt( y.HHo, "HHid", variable.name =
"indv" )
> y.HHo.long$indv <- as.character( y.HHo.long$indv )
> xy.HH.long <- merge( x.HHu.long
>                    , y.HHo.long
>                    , by = c( "HHid", "indv" )
>                    , all = TRUE )
> xy.HH.long$value <- with( xy.HH.long
>                         , ifelse( is.na( value.y )
>                                 , value.x
>                                 , value.y ) )
> xy.HH0 <- dcast( xy.HH.long, HHid ~ indv )
> xy.HH <- merge( xy.HH0, z.map, all=TRUE )
> xy.HH <- xy.HH[ order( xy.HH$HHid ), ]
> # compare xy.HH with zzz ... I think there is an error in zzz for # 
> HH10/indv1, because NA should not be considered more informative
> # than 0...
>
> # tidyr/dplyr solution
> library(tidyr)
> library(dplyr)
>
> # define a common processing sequence
> lengthen <- (   . # period is placeholder for data frame
>             %>% gather( indv, value, -HHid )
>             %>% mutate( indv = as.character( indv ) )
>             )
> x.HHu.dlong <- x.HHu %>% lengthen
> y.HHo.dlong <- y.HHo %>% lengthen
> xy.HH.d <- (   x.HHu.dlong
>            %>% full_join( y.HHo.dlong, by= c( "HHid",
"indv" ) )
>            %>% transmute( HHid = HHid
>                         , indv = indv
>                         , value = ifelse( is.na( value.y )
>                                         , value.x
>                                         , value.y )
>                         )
>            %>% spread( indv, value )
>            %>% full_join( z.map, by="HHid" )
>            %>% arrange( HHid )
>            %>% as.data.frame
>            )
>
> On Sun, 12 Jul 2015, aldi wrote:
>
>> Hi,
>> I have two sets of data x.HHu and y.HHo, rows are IDs and columns are
>> individuals. I do not know in advance indv or HHid, both of them will
be
>> captured from the data. As the y.HHo set updates, y.HHo set has better
>> information then x.HHu set. Thus I want a merge where right set
>> overwrites left set info based on HHid, i.e. to overwrite x.HHu set
with
>> y.HHo set but keep any extra info from the x.HHu set that is not
present
>> in y.HHo set.
>> HHids will be complete based on z.map, with the corresponding
positions.
>> I am having trouble with the part after this line: ###
>> ============================================+++++++++++++++++++++++++++
>> I am thinking that I am creating new columns "position"
"indv1" and
>> "indv2", but R is interpreting them as row information.
>> See the expected final table at the end. HHid is common, indv3 is from
>> x.HHu, and the rest position and indv1 and indv2 are from y.HHo
>> Any suggestions are appreciated.
>> Thank you in advance,
>> Aldi
>>
>> x.HHu<- data.frame(
>>            HHid = c( 'HH1', 'HH2', 'HH3',
'HH4', 'HH5', 'HH10')
>>          , indv1 = c( 2, 0, 2 , 0, 2, 0)
>>          , indv2 = c( 0, NA, 2, 2, 2, 2)
>>          , ind3 = c( 0, 0, 0, 0, 0, 0)
>>          )
>> ### the HHo data will be the top set to overwrite any HHu data, when
>> they exist, thinking that HHo are better than HHu results
>> ### when they are available
>>
>> y.HHo<-data.frame(HHid=c('HH1', 'HH2','HH5',
'HH3', 'HH10')
>>          , indv1 = c(2, 0, 2, 0, NA)
>>          , indv2 = c(0, 2, 2, 1, 2)
>>          )
>>
>> z.map<-data.frame(HHid = c('HH1', 'HH2',
'HH3', 'HH4', 'HH5',
>> 'HH6','HH8', 'HH7', 'HH9',
'HH10', 'HH11')
>>                 , position= c(10,20,30,42,55,66,81,75,92,101,111)
>>                 )
>> ### see objects
>> x.HHu
>> y.HHo
>> z.map
>> ### now sort the map by position, this sorted map will be used to sort
>> finally all data
>> z.map<-z.map[with(z.map, order(position)), ]
>> z.map
>>
>> ### First I introduce position to both sets so I can sort them in
>> advance by position.
>> x.HHu.map <-merge( z.map, x.HHu, by='HHid', all=T)
>> x.HHu.map<-x.HHu.map[with(x.HHu.map, order(position)), ]
>> x.HHu.map
>>
>> y.HHo.map <-merge( z.map, y.HHo, by='HHid', all= T)
>> y.HHo.map<-y.HHo.map[with(y.HHo.map, order(position)), ]
>> y.HHo.map
>>
>> ### now merge HHu  and HHo  with the hope to overwrite the HHu set with
>> HHo wherever they overlap by column names.
>> zzz <- merge(x.HHu.map, y.HHo.map, by='HHid', all=T)
>> zzz
>> ### find common variable names in two sets
>>
>> commonNames <- names(x.impu.map)[which(colnames(x.impu.map) %in%
>> colnames(y.geno.map))]
>>
>> ## remove HHid wich is common for x and y, but work with the rest of 
>> columns
>> commonNames<-commonNames[-c(1)]
>>
>> ### 
>> ============================================+++++++++++++++++++++++++++
>> for(i in 1:length(commonNames)){
>>
>> print(commonNames[i])
>> zzz$commonNames[i] <- NA
>>
>>
print(paste("zzz","$",commonNames[i],".y",sep=""))
>>
>> zzz$commonNames[i] <-
zzz[,paste(commonNames[i],".y",sep="")]
>>
>> ### paste(zzz$commonNames[i],".x",sep='') <- NULL;
>> ### paste(zzz$commonNames[i],".y",sep='') <- NULL;
>>
>> }
>> zzz
>>
>> The final expected set has to be: HHid is common, indv3 is from x.HHu,
>> and the rest position and indv1 and indv2 are from y.HHo
>>    HHid     position     ind3  indv1 indv2
>> 1   HH1         10          0     2       0
>> 2  HH10        101          0    NA       2
>> 3  HH11        111         NA    NA      NA
>> 4   HH2         20          0     0       2
>> 5   HH3         30          0     0       1
>> 6   HH4         42          0    NA      NA
>> 7   HH5         55          0     2       2
>> 8   HH6         66         NA    NA      NA
>> 9   HH7         75         NA    NA      NA
>> 10  HH8         81         NA    NA      NA
>> 11  HH9         92         NA    NA      NA
>>
>> -- 
>>
	[[alternative HTML version deleted]]

R help - Jul 2015 - merge: right set overwrite left set

[R] merge: right set overwrite left set

[R] merge: right set overwrite left set

[R] merge: right set overwrite left set

[R] merge: right set overwrite left set

[R] merge: right set overwrite left set