thr3ads.net - R help - [R] Trying to avoid the loop while merging two data frames [Dec 2015]

Dimitri Liakhovitski

2015-Dec-22 20:34 UTC

[R] Trying to avoid the loop while merging two data frames

I know I am overwriting.
merge doesn't solve it because each version in mydata is assigned to more
than one id. Hence, I thought I couldn't merge by version.
I am not sure how to answer the question about "the problem".
I described the current state and the desired state. If possible, I'd
like to get from the current state to the desired state faster than
when using a loop.
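
A small sketch of why sharing a version across several ids does not block merge(): the key only decides which rows pair up, and merge() returns every matching pair. The toy frames `ids` and `info` below are made up for illustration and are not the thread's data.

```r
# Hypothetical toy data: two ids share version 1, and each version has two info rows.
ids  <- data.frame(myid = c(1001, 1002, 1003), version = c(1, 1, 2))
info <- data.frame(version = c(1, 1, 2, 2), a = c(0.1, 0.2, 0.3, 0.4))

# merge() pairs every row of 'ids' with every row of 'info' that has the same version.
joined <- merge(ids, info, by = "version")

# Each id picks up both info rows for its version.
nrow(joined)
table(joined$myid)
```

Because the join repeats the info rows once per id, no per-id loop is needed to get one block of rows for each id.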

On Tue, Dec 22, 2015 at 2:26 PM, jim holtman <jholtman at gmail.com> wrote:
> You seem to be saving 'myid' and then overwriting it with the last
> statement:
>
>  result[[i]] <- result[[i]][c(5, 1:4)]
>
> Why doesn't 'merge' work for you?  I tried it on your data, and seem to get
> back the same number of rows; may not be in the same order, but the content
> looks the same, and it does have 'myid' on it.
>
>
> Jim Holtman
> Data Munger Guru
>
> What is the problem that you are trying to solve?
> Tell me what you want to do, not how you want to do it.
>
> On Tue, Dec 22, 2015 at 12:27 PM, Dimitri Liakhovitski
> <dimitri.liakhovitski at gmail.com> wrote:
>>
>> Hello!
>> I have a solution for my task that is based on a loop. However, it's
>> too slow for my real-life problem that is much larger in scope.
>> However, I cannot use merge. Any advice on how to do it faster?
>> Thanks a lot for any hint on how to speed it up!
>>
>> # I have 'mydata' data frame:
>> set.seed(123)
>> mydata <- data.frame(myid = 1001:1100,
>>                      version = sample(1:20, 100, replace = T))
>> head(mydata)
>> table(mydata$version)
>>
>> # I have 'myinfo' data frame that contains information for each 'version':
>> set.seed(12)
>> myinfo <- data.frame(version = sort(rep(1:20, 30)), a = rnorm(60),
>>                      b = rnorm(60), c = rnorm(60), d = rnorm(60))
>> head(myinfo, 40)
>>
>> ### MY SOLUTION WITH A LOOP:
>> ### Looping through each id of mydata and grabbing
>> ### all columns from 'myinfo' for the corresponding 'version':
>>
>> # 1. Creating placeholder list for the results:
>> result <- split(mydata[c("myid", "version")], f = list(mydata$myid))
>> length(result)
>> (result)[1:3]
>>
>>
>> # 2. Looping through each element of 'result':
>> for(i in 1:length(result)){
>>       id <- result[[i]]$myid
>>       result[[i]] <- myinfo[myinfo$version == result[[i]]$version, ]
>>       result[[i]]$myid <- id
>>       result[[i]] <- result[[i]][c(5, 1:4)]
>> }
>> result <- do.call(rbind, result)
>> head(result) # This is the desired result
>>
>> --
>> Dimitri Liakhovitski
>>
>> ______________________________________________
>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>
>


-- 
Dimitri Liakhovitski

Dimitri Liakhovitski

2015-Dec-22 20:50 UTC

You are right, guys, merge is working. Somehow I was under the
erroneous impression that because the second data frame (myinfo)
contains no column 'myid', merge would not work.
Below is the cleaner code and comparison:

#########################################
### Example with smaller data frames
#########################################

set.seed(123)
mydata <- data.frame(myid = 1001:1020,
                     version = sample(1:10, 20, replace = T))
head(mydata)
table(mydata$version)

set.seed(12)
myinfo <- data.frame(version = sort(rep(1:10, 5)), a = rnorm(50),
                     b = rnorm(50), c = rnorm(50), d = rnorm(50))
head(myinfo, 40)
table(myinfo$version)

###----------------------------------------
### METHOD 1 - Looping through each id of mydata and grabbing
### all columns of myinfo for the corresponding 'version':


# Create placeholder list for the results:
result <- split(mydata[c("myid", "version")], f = list(mydata$myid))
length(result)
(result)[1:3]


# Looping through each element of 'result':
for(i in 1:length(result)){
      id <- result[[i]]$myid
      result[[i]] <- myinfo[myinfo$version == result[[i]]$version, ]
      result[[i]]$myid <- id
      result[[i]] <- result[[i]][c(6, 1:5)]
}
result <- do.call(rbind, result)
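library(plyr)  # assumption: arrange() is not base R and the post does not say which package was loaded; dplyr also exports one
result.order <- arrange(result,  myid, version, a, b, c, d)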
head(result.order) # This is the desired result
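
The column reorder is where the earlier version of the loop lost 'myid': myinfo has five columns, so after 'myid' is appended it sits in position 6, and the index c(5, 1:4) never selects it. A minimal sketch on a made-up one-row frame (`x` and `by_name` are illustrative names, not from the thread):

```r
# Hypothetical one-row stand-in for a loop element after 'myid' has been appended.
x <- data.frame(version = 1, a = 0.1, b = 0.2, c = 0.3, d = 0.4)
x$myid <- 1001          # 'myid' becomes column 6

names(x[c(5, 1:4)])     # old index: starts with 'd', 'myid' is not selected
names(x[c(6, 1:5)])     # new index: 'myid' first, all six columns kept

# Selecting by name avoids counting columns by hand.
by_name <- x[c("myid", setdiff(names(x), "myid"))]
names(by_name)
```

Selecting by name keeps working if columns are later added to or removed from myinfo, which a hard-coded position vector does not.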

###----------------------------------------
### METHOD 2 - merge

my.merge <- merge(myinfo, mydata, by="version")
names(my.merge)
result2 <- my.merge[, c("myid", "version", "a", "b", "c", "d")]
names(result2)
result2.order <- arrange(result2, myid, version, a, b, c, d)
dim(result2.order)
head(result2.order)

# Same result?
all.equal(result.order, result2.order)
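
For readers without arrange() available, the same check can be sketched in base R alone. The helper `sort_rows` and the names `cols`, `by_loop` and `by_merge` are introduced here and are not part of the thread; the per-id lookup is written with lapply() only to keep the sketch short, and is meant to mirror METHOD 1 rather than replace it.

```r
set.seed(123)
mydata <- data.frame(myid = 1001:1020,
                     version = sample(1:10, 20, replace = TRUE))
set.seed(12)
myinfo <- data.frame(version = sort(rep(1:10, 5)), a = rnorm(50),
                     b = rnorm(50), c = rnorm(50), d = rnorm(50))

# Hypothetical helper: order rows by all columns and reset row names,
# so that all.equal() compares values rather than row labels.
sort_rows <- function(df) {
  df <- df[do.call(order, unname(as.list(df))), , drop = FALSE]
  rownames(df) <- NULL
  df
}

cols <- c("myid", "version", "a", "b", "c", "d")

# Per-id lookup, equivalent in intent to the loop in METHOD 1.
by_loop <- do.call(rbind, lapply(seq_len(nrow(mydata)), function(i) {
  rows <- myinfo[myinfo$version == mydata$version[i], ]
  rows$myid <- mydata$myid[i]
  rows[cols]
}))

# Single merge call, as in METHOD 2.
by_merge <- merge(myinfo, mydata, by = "version")[cols]

all.equal(sort_rows(by_loop), sort_rows(by_merge))
```

Wrapping either approach in system.time() on the full-size data is one way to see how much the single merge() call saves; no timings are claimed here.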


-- 
Dimitri Liakhovitski

Dimitri Liakhovitski

2015-Dec-22 20:56 UTC

Actually, the correct merge line should be:
my.merge <- merge(myinfo, mydata, by = "version", all.x = TRUE, all.y = FALSE)
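
To see what the extra arguments change, here is a sketch on made-up frames (`ids` and `info` are hypothetical, not the thread's data). In merge(x, y, ...), all.x = TRUE keeps rows of x that have no match in y and fills the y columns with NA. With myinfo as x, any version that no id uses would therefore stay in the result with myid set to NA, which the loop would not produce because it only visits versions attached to an id. When every version in myinfo is used by at least one id, the two calls return the same rows.

```r
# Hypothetical toy data: version 3 has an info row but no id uses it.
ids  <- data.frame(myid = c(1001, 1002), version = c(1, 2))
info <- data.frame(version = c(1, 2, 3), a = c(0.1, 0.2, 0.3))

# Default behaviour: only versions present in both frames.
inner <- merge(info, ids, by = "version")

# Keep every row of 'info', matched or not.
left <- merge(info, ids, by = "version", all.x = TRUE, all.y = FALSE)

nrow(inner)               # rows with a matching id
nrow(left)                # also includes the unmatched version 3
left[is.na(left$myid), ]  # the extra row, with myid filled in as NA
```

Which variant is wanted depends on whether unused versions should appear in the output at all.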


-- 
Dimitri Liakhovitski
