Dimitri Liakhovitski
2015-Dec-22 20:34 UTC
[R] Trying to avoid the loop while merging two data frames
I know I am overwriting. merge doesn't solve it because each version in mydata is given to more than one id. Hence, I thought I can't merge by version. I am not sure how to answer the question about "the problem". I described the current state and the desired state. If possible, I'd like to get from the current state to the desired state faster than when using a loop. On Tue, Dec 22, 2015 at 2:26 PM, jim holtman <jholtman at gmail.com> wrote:> You seem to be saving 'myid' and then overwriting it with the last > statement: > > result[[i]] <- result[[i]][c(5, 1:4)] > > Why doesn't 'merge' work for you? I tried it on your data, and seem to get > back the same number of rows; may not be in the same order, but the content > looks the same, and it does have 'myid' on it. > > > Jim Holtman > Data Munger Guru > > What is the problem that you are trying to solve? > Tell me what you want to do, not how you want to do it. > > On Tue, Dec 22, 2015 at 12:27 PM, Dimitri Liakhovitski > <dimitri.liakhovitski at gmail.com> wrote: >> >> Hello! >> I have a solution for my task that is based on a loop. However, it's >> too slow for my real-life problem that is much larger in scope. >> However, I cannot use merge. Any advice on how to do it faster? >> Thanks a lot for any hint on how to speed it up! >> >> # I have 'mydata' data frame: >> set.seed(123) >> mydata <- data.frame(myid = 1001:1100, >> version = sample(1:20, 100, replace = T)) >> head(mydata) >> table(mydata$version) >> >> # I have 'myinfo' data frame that contains information for each 'version': >> set.seed(12) >> myinfo <- data.frame(version = sort(rep(1:20, 30)), a = rnorm(60), b >> rnorm(60), >> c = rnorm(60), d = rnorm(60)) >> head(myinfo, 40) >> >> ### MY SOLUTION WITH A LOOP: >> ### Looping through each id of mydata and grabbing >> ### all columns from 'myinfo' for the corresponding 'version': >> >> # 1. Creating placeholder list for the results: >> result <- split(mydata[c("myid", "version")], f = list(mydata$myid)) >> length(result) >> (result)[1:3] >> >> >> # 2. Looping through each element of 'result': >> for(i in 1:length(result)){ >> id <- result[[i]]$myid >> result[[i]] <- myinfo[myinfo$version == result[[i]]$version, ] >> result[[i]]$myid <- id >> result[[i]] <- result[[i]][c(5, 1:4)] >> } >> result <- do.call(rbind, result) >> head(result) # This is the desired result >> >> -- >> Dimitri Liakhovitski >> >> ______________________________________________ >> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see >> https://stat.ethz.ch/mailman/listinfo/r-help >> PLEASE do read the posting guide >> http://www.R-project.org/posting-guide.html >> and provide commented, minimal, self-contained, reproducible code. > >-- Dimitri Liakhovitski
Dimitri Liakhovitski
2015-Dec-22 20:50 UTC
[R] Trying to avoid the loop while merging two data frames
You are right, guys, merge is working. Somehow I was under the erroneous impression that because the second data frame (myinfo) contains no column 'myid' merge will not work. Below is the cleaner code and comparison: ######################################### ### Example with smaller data frames ######################################### set.seed(123) mydata <- data.frame(myid = 1001:1020, version = sample(1:10, 20, replace = T)) head(mydata) table(mydata$version) set.seed(12) myinfo <- data.frame(version = sort(rep(1:10, 5)), a = rnorm(50), b rnorm(50), c = rnorm(50), d = rnorm(50)) head(myinfo, 40) table(myinfo$version) ###---------------------------------------- ### METHOD 1 - Looping through each id of mydata and grabbing ### all columns of myinfo for the corresponding 'version': # Create placeholder list for the results: result <- split(mydata[c("myid", "version")], f = list(mydata$myid)) length(result) (result)[1:3] # Looping through each element of 'result': for(i in 1:length(result)){ id <- result[[i]]$myid result[[i]] <- myinfo[myinfo$version == result[[i]]$version, ] result[[i]]$myid <- id result[[i]] <- result[[i]][c(6, 1:5)] } result <- do.call(rbind, result) result.order <- arrange(result, myid, version, a, b, c, d) head(result.order) # This is the desired result ###---------------------------------------- ### METHOD 2 - merge my.merge <- merge(myinfo, mydata, by="version") names(my.merge) result2 <- my.merge[,c("myid", "version", "a", "b", "c", "d")] names(result2) result2.order <- arrange(result2, myid, version, a, b, c, d) dim(result2.order) head(result2.order) # Same result? all.equal(result.order, result2.order) On Tue, Dec 22, 2015 at 3:34 PM, Dimitri Liakhovitski <dimitri.liakhovitski at gmail.com> wrote:> I know I am overwriting. > merge doesn't solve it because each version in mydata is given to more > than one id. Hence, I thought I can't merge by version. > I am not sure how to answer the question about "the problem". > I described the current state and the desired state. If possible, I'd > like to get from the current state to the desired state faster than > when using a loop. > > On Tue, Dec 22, 2015 at 2:26 PM, jim holtman <jholtman at gmail.com> wrote: >> You seem to be saving 'myid' and then overwriting it with the last >> statement: >> >> result[[i]] <- result[[i]][c(5, 1:4)] >> >> Why doesn't 'merge' work for you? I tried it on your data, and seem to get >> back the same number of rows; may not be in the same order, but the content >> looks the same, and it does have 'myid' on it. >> >> >> Jim Holtman >> Data Munger Guru >> >> What is the problem that you are trying to solve? >> Tell me what you want to do, not how you want to do it. >> >> On Tue, Dec 22, 2015 at 12:27 PM, Dimitri Liakhovitski >> <dimitri.liakhovitski at gmail.com> wrote: >>> >>> Hello! >>> I have a solution for my task that is based on a loop. However, it's >>> too slow for my real-life problem that is much larger in scope. >>> However, I cannot use merge. Any advice on how to do it faster? >>> Thanks a lot for any hint on how to speed it up! >>> >>> # I have 'mydata' data frame: >>> set.seed(123) >>> mydata <- data.frame(myid = 1001:1100, >>> version = sample(1:20, 100, replace = T)) >>> head(mydata) >>> table(mydata$version) >>> >>> # I have 'myinfo' data frame that contains information for each 'version': >>> set.seed(12) >>> myinfo <- data.frame(version = sort(rep(1:20, 30)), a = rnorm(60), b >>> rnorm(60), >>> c = rnorm(60), d = rnorm(60)) >>> head(myinfo, 40) >>> >>> ### MY SOLUTION WITH A LOOP: >>> ### Looping through each id of mydata and grabbing >>> ### all columns from 'myinfo' for the corresponding 'version': >>> >>> # 1. Creating placeholder list for the results: >>> result <- split(mydata[c("myid", "version")], f = list(mydata$myid)) >>> length(result) >>> (result)[1:3] >>> >>> >>> # 2. Looping through each element of 'result': >>> for(i in 1:length(result)){ >>> id <- result[[i]]$myid >>> result[[i]] <- myinfo[myinfo$version == result[[i]]$version, ] >>> result[[i]]$myid <- id >>> result[[i]] <- result[[i]][c(5, 1:4)] >>> } >>> result <- do.call(rbind, result) >>> head(result) # This is the desired result >>> >>> -- >>> Dimitri Liakhovitski >>> >>> ______________________________________________ >>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see >>> https://stat.ethz.ch/mailman/listinfo/r-help >>> PLEASE do read the posting guide >>> http://www.R-project.org/posting-guide.html >>> and provide commented, minimal, self-contained, reproducible code. >> >> > > > > -- > Dimitri Liakhovitski-- Dimitri Liakhovitski
Dimitri Liakhovitski
2015-Dec-22 20:56 UTC
[R] Trying to avoid the loop while merging two data frames
Actually, the correct merge line should be: my.merge <- merge(myinfo, mydata, by="version", all.x = T, all.y = F) On Tue, Dec 22, 2015 at 3:50 PM, Dimitri Liakhovitski <dimitri.liakhovitski at gmail.com> wrote:> You are right, guys, merge is working. Somehow I was under the > erroneous impression that because the second data frame (myinfo) > contains no column 'myid' merge will not work. > Below is the cleaner code and comparison: > > ######################################### > ### Example with smaller data frames > ######################################### > > set.seed(123) > mydata <- data.frame(myid = 1001:1020, > version = sample(1:10, 20, replace = T)) > head(mydata) > table(mydata$version) > > set.seed(12) > myinfo <- data.frame(version = sort(rep(1:10, 5)), a = rnorm(50), b > rnorm(50), c = rnorm(50), d = rnorm(50)) > head(myinfo, 40) > table(myinfo$version) > > ###---------------------------------------- > ### METHOD 1 - Looping through each id of mydata and grabbing > ### all columns of myinfo for the corresponding 'version': > > > # Create placeholder list for the results: > result <- split(mydata[c("myid", "version")], f = list(mydata$myid)) > length(result) > (result)[1:3] > > > # Looping through each element of 'result': > for(i in 1:length(result)){ > id <- result[[i]]$myid > result[[i]] <- myinfo[myinfo$version == result[[i]]$version, ] > result[[i]]$myid <- id > result[[i]] <- result[[i]][c(6, 1:5)] > } > result <- do.call(rbind, result) > result.order <- arrange(result, myid, version, a, b, c, d) > head(result.order) # This is the desired result > > ###---------------------------------------- > ### METHOD 2 - merge > > my.merge <- merge(myinfo, mydata, by="version") > names(my.merge) > result2 <- my.merge[,c("myid", "version", "a", "b", "c", "d")] > names(result2) > result2.order <- arrange(result2, myid, version, a, b, c, d) > dim(result2.order) > head(result2.order) > > # Same result? > all.equal(result.order, result2.order) > > On Tue, Dec 22, 2015 at 3:34 PM, Dimitri Liakhovitski > <dimitri.liakhovitski at gmail.com> wrote: >> I know I am overwriting. >> merge doesn't solve it because each version in mydata is given to more >> than one id. Hence, I thought I can't merge by version. >> I am not sure how to answer the question about "the problem". >> I described the current state and the desired state. If possible, I'd >> like to get from the current state to the desired state faster than >> when using a loop. >> >> On Tue, Dec 22, 2015 at 2:26 PM, jim holtman <jholtman at gmail.com> wrote: >>> You seem to be saving 'myid' and then overwriting it with the last >>> statement: >>> >>> result[[i]] <- result[[i]][c(5, 1:4)] >>> >>> Why doesn't 'merge' work for you? I tried it on your data, and seem to get >>> back the same number of rows; may not be in the same order, but the content >>> looks the same, and it does have 'myid' on it. >>> >>> >>> Jim Holtman >>> Data Munger Guru >>> >>> What is the problem that you are trying to solve? >>> Tell me what you want to do, not how you want to do it. >>> >>> On Tue, Dec 22, 2015 at 12:27 PM, Dimitri Liakhovitski >>> <dimitri.liakhovitski at gmail.com> wrote: >>>> >>>> Hello! >>>> I have a solution for my task that is based on a loop. However, it's >>>> too slow for my real-life problem that is much larger in scope. >>>> However, I cannot use merge. Any advice on how to do it faster? >>>> Thanks a lot for any hint on how to speed it up! >>>> >>>> # I have 'mydata' data frame: >>>> set.seed(123) >>>> mydata <- data.frame(myid = 1001:1100, >>>> version = sample(1:20, 100, replace = T)) >>>> head(mydata) >>>> table(mydata$version) >>>> >>>> # I have 'myinfo' data frame that contains information for each 'version': >>>> set.seed(12) >>>> myinfo <- data.frame(version = sort(rep(1:20, 30)), a = rnorm(60), b >>>> rnorm(60), >>>> c = rnorm(60), d = rnorm(60)) >>>> head(myinfo, 40) >>>> >>>> ### MY SOLUTION WITH A LOOP: >>>> ### Looping through each id of mydata and grabbing >>>> ### all columns from 'myinfo' for the corresponding 'version': >>>> >>>> # 1. Creating placeholder list for the results: >>>> result <- split(mydata[c("myid", "version")], f = list(mydata$myid)) >>>> length(result) >>>> (result)[1:3] >>>> >>>> >>>> # 2. Looping through each element of 'result': >>>> for(i in 1:length(result)){ >>>> id <- result[[i]]$myid >>>> result[[i]] <- myinfo[myinfo$version == result[[i]]$version, ] >>>> result[[i]]$myid <- id >>>> result[[i]] <- result[[i]][c(5, 1:4)] >>>> } >>>> result <- do.call(rbind, result) >>>> head(result) # This is the desired result >>>> >>>> -- >>>> Dimitri Liakhovitski >>>> >>>> ______________________________________________ >>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see >>>> https://stat.ethz.ch/mailman/listinfo/r-help >>>> PLEASE do read the posting guide >>>> http://www.R-project.org/posting-guide.html >>>> and provide commented, minimal, self-contained, reproducible code. >>> >>> >> >> >> >> -- >> Dimitri Liakhovitski > > > > -- > Dimitri Liakhovitski-- Dimitri Liakhovitski