Dimitri Liakhovitski
2015-Dec-22 20:34 UTC
[R] Trying to avoid the loop while merging two data frames
I know I am overwriting.
merge doesn't solve it because each version in mydata is given to more
than one id. Hence, I thought I can't merge by version.
I am not sure how to answer the question about "the problem".
I described the current state and the desired state. If possible, I'd
like to get from the current state to the desired state faster than
when using a loop.

On Tue, Dec 22, 2015 at 2:26 PM, jim holtman <jholtman at gmail.com> wrote:
> You seem to be saving 'myid' and then overwriting it with the last
> statement:
>
> result[[i]] <- result[[i]][c(5, 1:4)]
>
> Why doesn't 'merge' work for you? I tried it on your data, and seem to get
> back the same number of rows; may not be in the same order, but the content
> looks the same, and it does have 'myid' on it.
>
>
> Jim Holtman
> Data Munger Guru
>
> What is the problem that you are trying to solve?
> Tell me what you want to do, not how you want to do it.
>
> On Tue, Dec 22, 2015 at 12:27 PM, Dimitri Liakhovitski
> <dimitri.liakhovitski at gmail.com> wrote:
>>
>> Hello!
>> I have a solution for my task that is based on a loop. However, it's
>> too slow for my real-life problem that is much larger in scope.
>> However, I cannot use merge. Any advice on how to do it faster?
>> Thanks a lot for any hint on how to speed it up!
>>
>> # I have 'mydata' data frame:
>> set.seed(123)
>> mydata <- data.frame(myid = 1001:1100,
>>                      version = sample(1:20, 100, replace = T))
>> head(mydata)
>> table(mydata$version)
>>
>> # I have 'myinfo' data frame that contains information for each 'version':
>> set.seed(12)
>> myinfo <- data.frame(version = sort(rep(1:20, 30)), a = rnorm(60), b = rnorm(60),
>>                      c = rnorm(60), d = rnorm(60))
>> head(myinfo, 40)
>>
>> ### MY SOLUTION WITH A LOOP:
>> ### Looping through each id of mydata and grabbing
>> ### all columns from 'myinfo' for the corresponding 'version':
>>
>> # 1. Creating placeholder list for the results:
>> result <- split(mydata[c("myid", "version")], f = list(mydata$myid))
>> length(result)
>> (result)[1:3]
>>
>>
>> # 2. Looping through each element of 'result':
>> for(i in 1:length(result)){
>>   id <- result[[i]]$myid
>>   result[[i]] <- myinfo[myinfo$version == result[[i]]$version, ]
>>   result[[i]]$myid <- id
>>   result[[i]] <- result[[i]][c(5, 1:4)]
>> }
>> result <- do.call(rbind, result)
>> head(result) # This is the desired result
>>
>> --
>> Dimitri Liakhovitski
>>
>> ______________________________________________
>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>
>

--
Dimitri Liakhovitski
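The worry that a key shared by several ids rules out merge can be checked on a toy case. The sketch below uses made-up data frames ('ids' and 'info' are hypothetical names and values, not the poster's data) and shows that merge() pairs every matching row on one side with every matching row on the other, so repeated 'version' values on both sides are handled.

```r
# Toy data (hypothetical values): ids 1001 and 1002 share version 1,
# and version 1 has two rows of information.
ids  <- data.frame(myid = c(1001, 1002, 1003), version = c(1, 1, 2))
info <- data.frame(version = c(1, 1, 2), a = c(0.1, 0.2, 0.3))

# merge() returns one row per matching (info row, ids row) pair.
out <- merge(info, ids, by = "version")

# Put 'myid' first and sort so the per-id blocks are easy to read.
out <- out[order(out$myid, out$a), c("myid", "version", "a")]
rownames(out) <- NULL
print(out)
```

Each id that carries version 1 receives both version-1 rows, which is the same pairing the loop builds one id at a time.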
Dimitri Liakhovitski
2015-Dec-22 20:50 UTC
[R] Trying to avoid the loop while merging two data frames
You are right, guys, merge is working. Somehow I was under the
erroneous impression that because the second data frame (myinfo)
contains no column 'myid', merge would not work.
Below is the cleaner code and comparison:
#########################################
### Example with smaller data frames
#########################################
set.seed(123)
mydata <- data.frame(myid = 1001:1020,
                     version = sample(1:10, 20, replace = TRUE))
head(mydata)
table(mydata$version)
set.seed(12)
myinfo <- data.frame(version = sort(rep(1:10, 5)), a = rnorm(50),
                     b = rnorm(50), c = rnorm(50), d = rnorm(50))
head(myinfo, 40)
table(myinfo$version)
###----------------------------------------
### METHOD 1 - Looping through each id of mydata and grabbing
### all columns of myinfo for the corresponding 'version':
# Create placeholder list for the results:
result <- split(mydata[c("myid", "version")], f = list(mydata$myid))
length(result)
(result)[1:3]
# Looping through each element of 'result':
for(i in seq_along(result)){
  id <- result[[i]]$myid
  result[[i]] <- myinfo[myinfo$version == result[[i]]$version, ]
  result[[i]]$myid <- id
  result[[i]] <- result[[i]][c(6, 1:5)]  # move 'myid' (column 6) to the front
}
result <- do.call(rbind, result)
# arrange() is not base R; dplyr is assumed here (plyr also provides arrange())
library(dplyr)
result.order <- arrange(result, myid, version, a, b, c, d)
head(result.order) # This is the desired result
###----------------------------------------
### METHOD 2 - merge
my.merge <- merge(myinfo, mydata, by="version")
names(my.merge)
result2 <- my.merge[, c("myid", "version", "a", "b", "c", "d")]
names(result2)
result2.order <- arrange(result2, myid, version, a, b, c, d)
dim(result2.order)
head(result2.order)
# Same result?
all.equal(result.order, result2.order)
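The comparison above relies on arrange(), which is not part of base R (it is provided by add-on packages such as dplyr or plyr). As a hedged alternative, the sketch below repeats the check with base R only. The helper 'sort_rows' and the names 'res_index' and 'res_merge' are introduced here for illustration, and the index-based construction is a compact restatement of Method 1 rather than the loop itself.

```r
set.seed(123)
mydata <- data.frame(myid = 1001:1020,
                     version = sample(1:10, 20, replace = TRUE))
set.seed(12)
myinfo <- data.frame(version = sort(rep(1:10, 5)), a = rnorm(50),
                     b = rnorm(50), c = rnorm(50), d = rnorm(50))

# Method 1 restated: for each id, collect the myinfo rows with its version.
rows_per_id <- lapply(mydata$version, function(v) which(myinfo$version == v))
res_index <- data.frame(myid = rep(mydata$myid, lengths(rows_per_id)),
                        myinfo[unlist(rows_per_id), ],
                        row.names = NULL)

# Method 2: merge on the shared 'version' column, then reorder the columns.
res_merge <- merge(myinfo, mydata, by = "version")
res_merge <- res_merge[, c("myid", "version", "a", "b", "c", "d")]

# Hypothetical helper: sort by all columns and drop row names so that
# only the content is compared.
sort_rows <- function(df) {
  df <- df[do.call(order, unname(as.list(df))), ]
  rownames(df) <- NULL
  df
}

same_content <- isTRUE(all.equal(sort_rows(res_index), sort_rows(res_merge)))
print(same_content)
```

Sorting on every column before comparing removes any dependence on the row order that merge() happens to return.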
--
Dimitri Liakhovitski
Dimitri Liakhovitski
2015-Dec-22 20:56 UTC
[R] Trying to avoid the loop while merging two data frames
Actually, the correct merge line should be:

my.merge <- merge(myinfo, mydata, by="version", all.x = T, all.y = F)

--
Dimitri Liakhovitski
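Whether all.x = T changes anything depends on the data: it only matters when some 'version' in myinfo is assigned to no id in mydata. The sketch below uses made-up data frames ('ids' and 'info' are hypothetical) to show the difference. With the default settings, merge() drops unmatched rows; with all.x = TRUE it keeps them and fills 'myid' with NA. The loop in this thread iterates over ids only, so it would not produce rows for an unused version.

```r
# Toy data (hypothetical values): version 2 exists in 'info' but no id uses it.
ids  <- data.frame(myid = c(1001, 1002), version = c(1, 1))
info <- data.frame(version = c(1, 2), a = c(0.1, 0.2))

# Default merge: only versions present in both data frames are kept.
inner <- merge(info, ids, by = "version")

# all.x = TRUE: every row of 'info' is kept; unmatched rows get NA for 'myid'.
left <- merge(info, ids, by = "version", all.x = TRUE, all.y = FALSE)

print(inner)
print(left)
```

If the extra rows are not wanted, the default call (or dropping rows where 'myid' is NA afterwards) keeps the output aligned with the loop.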