RKinzer
2012-Feb-06 20:29 UTC
[R] Duplicate rows when I combine two data.frames with merge!
Hello all, First I have done extensive searches on this forum and others and nothing seems to work. So I decided to post thinking someone could point me to the write post or give me some help. I have drawn a 100 samples from a fictitious population (N=1000), and then randomly selected 25% of the 100 samples. I would like to now merge the data.frame from the 100 samples with the data.frame for the 25 individuals from the sample. When I do this with the following code I get duplicate rows, when I should have at most is 100. x<-mapply(rnorm,1000,c(54,78,89),c(3.5,5.5,5.9)) #sets up 1000 random numbers for age 3,4,5 x.3<-sample(x[,1],60) #randomly selects 60 lengths from age 3 x.4<-sample(x[,2],740) x.5<-sample(x[,3],200) length<-c(x.3,x.4,x.5) length<-round(length,digits=0) #rounds lengths to whole number age3<-rep(3,60) age4<-rep(4,740) age5<-rep(5,200) age<-c(age3,age4,age5) #combines ages into one vector unique<-1:1000 #gives each fish a unique id pop<-data.frame(unique,length,age) pop<-pop[sample(1:1000,size=1000,replace=FALSE),] #randomized the order of pop c.one<-pop[sample(1:1000,size=100,replace=TRUE),] a.one.qtr<-c.one[sample(1:100,size=25,replace=TRUE),] merge<-merge(c.one,a.one.qtr,by="unique",all=TRUE) What I would ultimately like to have is one row for all 100 in the sample and three columns (unique, length, age). And then some way to identify the 25 individual selected rows. Thank you upfront for any help. I have been stuck for days. Ryan -- View this message in context: http://r.789695.n4.nabble.com/Duplicate-rows-when-I-combine-two-data-frames-with-merge-tp4362685p4362685.html Sent from the R help mailing list archive at Nabble.com.
Sarah Goslee
2012-Feb-06 20:49 UTC
[R] Duplicate rows when I combine two data.frames with merge!
Hi, Why do you need to merge them? c.one contains what I think you want, and then you want to randomly select 25 rows from that without replacement: c.one <- cbind(c.one, a.qtr = sample(c(rep(TRUE, 25), rep(FALSE, 75))))> head(c.one)unique length age a.qtr 649 649 71 4 TRUE 200 200 79 4 FALSE 410 410 82 4 TRUE 620 620 73 4 FALSE 723 723 81 4 FALSE 855 855 96 5 FALSE If what you mean by the subject line is that there are duplicate rows in the merged data frame, of course there are, because there are duplicated rows in c.one, because you selected sampling with replacement when selecting rows from pop to make c.one. Something to be VERY careful of: length, merge and unique are all base functions, and shouldn't be used as variable names. After you named something merge, what happens when you try to use merge()? If I'm misunderstanding the question, then please try to explain more clearly what you are looking for. Sarah On Mon, Feb 6, 2012 at 3:29 PM, RKinzer <ryank at nezperce.org> wrote:> Hello all, > > First I have done extensive searches on this forum and others and nothing > seems to work. ?So I decided to post thinking someone could point me to the > write post or give me some help. > > I have drawn a 100 samples from a fictitious population (N=1000), and then > randomly selected 25% of the 100 samples. ?I would like to now merge the > data.frame from the 100 samples with the data.frame for the 25 individuals > from the sample. ?When I do this with the following code I get duplicate > rows, when I should have at most is 100. > > x<-mapply(rnorm,1000,c(54,78,89),c(3.5,5.5,5.9)) ?#sets up 1000 random > numbers for age 3,4,5 > x.3<-sample(x[,1],60) ?#randomly selects 60 lengths from age 3 > x.4<-sample(x[,2],740) > x.5<-sample(x[,3],200) > length<-c(x.3,x.4,x.5) > length<-round(length,digits=0) ?#rounds lengths to whole number > age3<-rep(3,60) > age4<-rep(4,740) > age5<-rep(5,200) > age<-c(age3,age4,age5) ?#combines ages into one vector > unique<-1:1000 ?#gives each fish a unique id > pop<-data.frame(unique,length,age) > pop<-pop[sample(1:1000,size=1000,replace=FALSE),] ?#randomized the order of > pop > c.one<-pop[sample(1:1000,size=100,replace=TRUE),] > a.one.qtr<-c.one[sample(1:100,size=25,replace=TRUE),] > merge<-merge(c.one,a.one.qtr,by="unique",all=TRUE) > > What I would ultimately like to have is one row for all 100 in the sample > and three columns (unique, length, age). ?And then some way to identify the > 25 individual selected rows. > > Thank you upfront for any help. ?I have been stuck for days. > > Ryan > > > > -- > View this message in context: http://r.789695.n4.nabble.com/Duplicate-rows-when-I-combine-two-data-frames-with-merge-tp4362685p4362685.html > Sent from the R help mailing list archive at Nabble.com. > > ______________________________________________ > R-help at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code.-- Sarah Goslee http://www.stringpage.com http://www.sarahgoslee.com http://www.functionaldiversity.org
Ista Zahn
2012-Feb-06 21:07 UTC
[R] Duplicate rows when I combine two data.frames with merge!
Hi Ryan, You're getting this result because five of the rows in a.one.qtr are duplicates, as a results of sampling with replacement: length(which(duplicated(a.one.qtr))) [1] 5 The relevant section of the documentation (see ?merge) reads "The rows in the two data frames that match on the specified columns are extracted, and joined together. *If there is more than one match, all possible matches contribute one row each*" (emphasis mine). Best, Ista On Mon, Feb 6, 2012 at 3:29 PM, RKinzer <ryank at nezperce.org> wrote:> Hello all, > > First I have done extensive searches on this forum and others and nothing > seems to work. ?So I decided to post thinking someone could point me to the > write post or give me some help. > > I have drawn a 100 samples from a fictitious population (N=1000), and then > randomly selected 25% of the 100 samples. ?I would like to now merge the > data.frame from the 100 samples with the data.frame for the 25 individuals > from the sample. ?When I do this with the following code I get duplicate > rows, when I should have at most is 100. > > x<-mapply(rnorm,1000,c(54,78,89),c(3.5,5.5,5.9)) ?#sets up 1000 random > numbers for age 3,4,5 > x.3<-sample(x[,1],60) ?#randomly selects 60 lengths from age 3 > x.4<-sample(x[,2],740) > x.5<-sample(x[,3],200) > length<-c(x.3,x.4,x.5) > length<-round(length,digits=0) ?#rounds lengths to whole number > age3<-rep(3,60) > age4<-rep(4,740) > age5<-rep(5,200) > age<-c(age3,age4,age5) ?#combines ages into one vector > unique<-1:1000 ?#gives each fish a unique id > pop<-data.frame(unique,length,age) > pop<-pop[sample(1:1000,size=1000,replace=FALSE),] ?#randomized the order of > pop > c.one<-pop[sample(1:1000,size=100,replace=TRUE),] > a.one.qtr<-c.one[sample(1:100,size=25,replace=TRUE),] > merge<-merge(c.one,a.one.qtr,by="unique",all=TRUE) > > What I would ultimately like to have is one row for all 100 in the sample > and three columns (unique, length, age). ?And then some way to identify the > 25 individual selected rows. > > Thank you upfront for any help. ?I have been stuck for days. > > Ryan > > > > -- > View this message in context: http://r.789695.n4.nabble.com/Duplicate-rows-when-I-combine-two-data-frames-with-merge-tp4362685p4362685.html > Sent from the R help mailing list archive at Nabble.com. > > ______________________________________________ > R-help at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code.
Petr Savicky
2012-Feb-06 21:26 UTC
[R] Duplicate rows when I combine two data.frames with merge!
On Mon, Feb 06, 2012 at 12:29:53PM -0800, RKinzer wrote:> Hello all, > > First I have done extensive searches on this forum and others and nothing > seems to work. So I decided to post thinking someone could point me to the > write post or give me some help. > > I have drawn a 100 samples from a fictitious population (N=1000), and then > randomly selected 25% of the 100 samples. I would like to now merge the > data.frame from the 100 samples with the data.frame for the 25 individuals > from the sample. When I do this with the following code I get duplicate > rows, when I should have at most is 100. > > x<-mapply(rnorm,1000,c(54,78,89),c(3.5,5.5,5.9)) #sets up 1000 random > numbers for age 3,4,5 > x.3<-sample(x[,1],60) #randomly selects 60 lengths from age 3 > x.4<-sample(x[,2],740) > x.5<-sample(x[,3],200) > length<-c(x.3,x.4,x.5) > length<-round(length,digits=0) #rounds lengths to whole number > age3<-rep(3,60) > age4<-rep(4,740) > age5<-rep(5,200) > age<-c(age3,age4,age5) #combines ages into one vector > unique<-1:1000 #gives each fish a unique id > pop<-data.frame(unique,length,age) > pop<-pop[sample(1:1000,size=1000,replace=FALSE),] #randomized the order of > pop > c.one<-pop[sample(1:1000,size=100,replace=TRUE),] > a.one.qtr<-c.one[sample(1:100,size=25,replace=TRUE),] > merge<-merge(c.one,a.one.qtr,by="unique",all=TRUE) > > What I would ultimately like to have is one row for all 100 in the sample > and three columns (unique, length, age). And then some way to identify the > 25 individual selected rows.The function merge() here includes additional columns, which contain in the rows from a.one.qtr copies of the columns length and age. So, the same values appear twice in the row. I am not sure, whether this is intended. Another representation of the subsample a.one.qtr may be done by adding a column to c.one, which specifies, how many times was the row selected to a.one.qtr. For example as follows. a.one.qtr2 <- sample(1:100,size=25,replace=TRUE) c.one2 <- cbind(c.one, selected=tabulate(a.one.qtr2, nbins=100)) # a random result may look like c.one2 unique length age selected 657 657 81 4 0 488 488 78 4 1 886 886 85 5 0 448 448 82 4 0 292 292 80 4 0 431 431 78 4 0 683 683 82 4 0 32 32 56 3 2 740 740 80 4 0 519 519 81 4 1 986 986 88 5 0 437 437 84 4 0 247 247 88 4 0 122 122 73 4 0 ... The sum of the column "selected" is 25. Hope this helps. Petr Savicky.