Dimitri Liakhovitski
2015-Dec-22 17:27 UTC
[R] Trying to avoid the loop while merging two data frames
Hello! I have a solution for my task that is based on a loop. However, it's too slow for my real-life problem that is much larger in scope. However, I cannot use merge. Any advice on how to do it faster? Thanks a lot for any hint on how to speed it up! # I have 'mydata' data frame: set.seed(123) mydata <- data.frame(myid = 1001:1100, version = sample(1:20, 100, replace = T)) head(mydata) table(mydata$version) # I have 'myinfo' data frame that contains information for each 'version': set.seed(12) myinfo <- data.frame(version = sort(rep(1:20, 30)), a = rnorm(60), b rnorm(60), c = rnorm(60), d = rnorm(60)) head(myinfo, 40) ### MY SOLUTION WITH A LOOP: ### Looping through each id of mydata and grabbing ### all columns from 'myinfo' for the corresponding 'version': # 1. Creating placeholder list for the results: result <- split(mydata[c("myid", "version")], f = list(mydata$myid)) length(result) (result)[1:3] # 2. Looping through each element of 'result': for(i in 1:length(result)){ id <- result[[i]]$myid result[[i]] <- myinfo[myinfo$version == result[[i]]$version, ] result[[i]]$myid <- id result[[i]] <- result[[i]][c(5, 1:4)] } result <- do.call(rbind, result) head(result) # This is the desired result -- Dimitri Liakhovitski
jim holtman
2015-Dec-22 19:26 UTC
[R] Trying to avoid the loop while merging two data frames
You seem to be saving 'myid' and then overwriting it with the last statement: result[[i]] <- result[[i]][c(5, 1:4)] Why doesn't 'merge' work for you? I tried it on your data, and seem to get back the same number of rows; may not be in the same order, but the content looks the same, and it does have 'myid' on it. Jim Holtman Data Munger Guru What is the problem that you are trying to solve? Tell me what you want to do, not how you want to do it. On Tue, Dec 22, 2015 at 12:27 PM, Dimitri Liakhovitski < dimitri.liakhovitski at gmail.com> wrote:> Hello! > I have a solution for my task that is based on a loop. However, it's > too slow for my real-life problem that is much larger in scope. > However, I cannot use merge. Any advice on how to do it faster? > Thanks a lot for any hint on how to speed it up! > > # I have 'mydata' data frame: > set.seed(123) > mydata <- data.frame(myid = 1001:1100, > version = sample(1:20, 100, replace = T)) > head(mydata) > table(mydata$version) > > # I have 'myinfo' data frame that contains information for each 'version': > set.seed(12) > myinfo <- data.frame(version = sort(rep(1:20, 30)), a = rnorm(60), b > rnorm(60), > c = rnorm(60), d = rnorm(60)) > head(myinfo, 40) > > ### MY SOLUTION WITH A LOOP: > ### Looping through each id of mydata and grabbing > ### all columns from 'myinfo' for the corresponding 'version': > > # 1. Creating placeholder list for the results: > result <- split(mydata[c("myid", "version")], f = list(mydata$myid)) > length(result) > (result)[1:3] > > > # 2. Looping through each element of 'result': > for(i in 1:length(result)){ > id <- result[[i]]$myid > result[[i]] <- myinfo[myinfo$version == result[[i]]$version, ] > result[[i]]$myid <- id > result[[i]] <- result[[i]][c(5, 1:4)] > } > result <- do.call(rbind, result) > head(result) # This is the desired result > > -- > Dimitri Liakhovitski > > ______________________________________________ > R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide > http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. >[[alternative HTML version deleted]]
Dimitri Liakhovitski
2015-Dec-22 20:34 UTC
[R] Trying to avoid the loop while merging two data frames
I know I am overwriting. merge doesn't solve it because each version in mydata is given to more than one id. Hence, I thought I can't merge by version. I am not sure how to answer the question about "the problem". I described the current state and the desired state. If possible, I'd like to get from the current state to the desired state faster than when using a loop. On Tue, Dec 22, 2015 at 2:26 PM, jim holtman <jholtman at gmail.com> wrote:> You seem to be saving 'myid' and then overwriting it with the last > statement: > > result[[i]] <- result[[i]][c(5, 1:4)] > > Why doesn't 'merge' work for you? I tried it on your data, and seem to get > back the same number of rows; may not be in the same order, but the content > looks the same, and it does have 'myid' on it. > > > Jim Holtman > Data Munger Guru > > What is the problem that you are trying to solve? > Tell me what you want to do, not how you want to do it. > > On Tue, Dec 22, 2015 at 12:27 PM, Dimitri Liakhovitski > <dimitri.liakhovitski at gmail.com> wrote: >> >> Hello! >> I have a solution for my task that is based on a loop. However, it's >> too slow for my real-life problem that is much larger in scope. >> However, I cannot use merge. Any advice on how to do it faster? >> Thanks a lot for any hint on how to speed it up! >> >> # I have 'mydata' data frame: >> set.seed(123) >> mydata <- data.frame(myid = 1001:1100, >> version = sample(1:20, 100, replace = T)) >> head(mydata) >> table(mydata$version) >> >> # I have 'myinfo' data frame that contains information for each 'version': >> set.seed(12) >> myinfo <- data.frame(version = sort(rep(1:20, 30)), a = rnorm(60), b >> rnorm(60), >> c = rnorm(60), d = rnorm(60)) >> head(myinfo, 40) >> >> ### MY SOLUTION WITH A LOOP: >> ### Looping through each id of mydata and grabbing >> ### all columns from 'myinfo' for the corresponding 'version': >> >> # 1. Creating placeholder list for the results: >> result <- split(mydata[c("myid", "version")], f = list(mydata$myid)) >> length(result) >> (result)[1:3] >> >> >> # 2. Looping through each element of 'result': >> for(i in 1:length(result)){ >> id <- result[[i]]$myid >> result[[i]] <- myinfo[myinfo$version == result[[i]]$version, ] >> result[[i]]$myid <- id >> result[[i]] <- result[[i]][c(5, 1:4)] >> } >> result <- do.call(rbind, result) >> head(result) # This is the desired result >> >> -- >> Dimitri Liakhovitski >> >> ______________________________________________ >> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see >> https://stat.ethz.ch/mailman/listinfo/r-help >> PLEASE do read the posting guide >> http://www.R-project.org/posting-guide.html >> and provide commented, minimal, self-contained, reproducible code. > >-- Dimitri Liakhovitski