Forgot to send it back to the list.
Jim Holtman
Data Munger Guru
What is the problem that you are trying to solve?
Tell me what you want to do, not how you want to do it.
---------- Forwarded message ----------
From: jim holtman <jholtman at gmail.com>
Date: Fri, Aug 30, 2013 at 8:10 AM
Subject: Re: ddply for comparing simulation results
To: john doe <anon.r.user at gmail.com>
Try the 'data.table' package. It gives the answer in less than a second.
> # 1 million leads, half of which were simulated, half of which were not
> id=1:1000000
> isSimulated = c(rep(0,500000), rep(1, 500000))
> userId=sample(1:100000, 1000000, replace=T)
> df_leads=data.frame(id, isSimulated, userId)
> require(data.table)
Loading required package: data.table
data.table 1.8.8 For help type: help("data.table")
> system.time({
+ df_leads <- data.table(df_leads)
+ df_leads_sum <- df_leads[
+ , list(count = .N)
+ , keyby = c('isSimulated', 'userId')
+ ]
+ })
   user  system elapsed
   0.75    0.01    0.76
> head(df_leads_sum)
   isSimulated userId count
1:           0      1     5
2:           0      2     9
3:           0      3     5
4:           0      4     4
5:           0      5     3
6:           0      6     7
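If data.table is not available, the same grouped count can be sketched in base R with 'aggregate'. This is only an illustration on a tiny made-up data set (the name 'leads' and its values are hypothetical, not from the example above), and I have not timed it against the 1 million row case, so treat it as a way to check the logic rather than a performance fix.

```r
# Tiny illustrative data set (hypothetical values, not the 1M-row example)
leads <- data.frame(
  id          = 1:8,
  isSimulated = c(0, 0, 0, 0, 1, 1, 1, 1),
  userId      = c(1, 1, 2, 3, 1, 2, 2, 4)
)

# Count rows per (isSimulated, userId) pair; length() of the id column
# within each group is the number of leads in that group
leads_sum <- aggregate(id ~ isSimulated + userId, data = leads, FUN = length)
names(leads_sum)[names(leads_sum) == "id"] <- "count"

# Sort by isSimulated, then userId, similar to what keyby does in data.table
leads_sum <- leads_sum[order(leads_sum$isSimulated, leads_sum$userId), ]
print(leads_sum)
```

Only combinations that actually occur get a row, so a user who appears in just one group has a single row here, which is the same shape as the data.table result.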
You can use 'setdiff' to find userIDs that are missing from one group
or the other:
> #see which userIDs are missing between the groups
> not_in <- setdiff(df_leads_sum$userId[df_leads_sum$isSimulated == 0]
+ , df_leads_sum$userId[df_leads_sum$isSimulated == 1]
+ )
> str(not_in)
 int [1:697] 59 100 204 584 656 828 840 999 1012 1046 ...
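Note that 'setdiff' is one-directional: the call above lists userIds that occur with isSimulated == 0 but not with isSimulated == 1. For the histogram of change asked about in Question 2, one possible approach (a sketch, not something I ran on the full data) is a full outer join of the two groups with base R 'merge', treating a missing userId as zero leads. The small 'leads_sum' below and the column names 'count_prod', 'count_sim' and 'change' are made up for illustration.

```r
# Small stand-in for the summarised counts (hypothetical values)
leads_sum <- data.frame(
  isSimulated = c(0, 0, 0, 1, 1, 1),
  userId      = c(1, 2, 3, 1, 2, 4),
  count       = c(2, 1, 1, 1, 2, 1)
)

# Split into production and simulated counts
prod <- leads_sum[leads_sum$isSimulated == 0, c("userId", "count")]
sim  <- leads_sum[leads_sum$isSimulated == 1, c("userId", "count")]

# all = TRUE makes this a full outer join, so users that appear in only
# one group are kept, with NA for the count in the other group
both <- merge(prod, sim, by = "userId", all = TRUE,
              suffixes = c("_prod", "_sim"))

# A user absent from a group received zero leads in that group
both$count_prod[is.na(both$count_prod)] <- 0
both$count_sim[is.na(both$count_sim)]   <- 0

# Change in leads per user: simulated minus production
both$change <- both$count_sim - both$count_prod
print(both)

# hist(both$change) would then plot the distribution of the change
```

Here user 3 appears only in production and user 4 only in the simulation; both are kept and get a count of 0 on the side where they are missing.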
On Thu, Aug 29, 2013 at 11:33 PM, john doe <anon.r.user at gmail.com> wrote:
> I am trying to use R and plyr to compare the effectiveness of various
> algorithms for online advertising. At the core, I am simply counting when a
> user receives a lead: this is measured with the userId column. Leads that
> were sent in production have a 0 in the isSimulated column, and leads that
> were sent in our simulation have isSimulated=1. I have two questions: one
> about performance and one about how to use plyr to get the data in a form
> that I want.
>
> Here is an example of my code:
>
> # 1 million leads, half of which were simulated, half of which were not
> id=1:1000000
> isSimulated = c(rep(0,500000), rep(1, 500000))
> userId=sample(1:100000, 1000000, replace=T)
> df_leads=data.frame(id, isSimulated, userId)
>
> # split by simulated and userid, and then sum
> system.time(df_leads_sum <- ddply(df_leads, .(isSimulated, userId), nrow))
> user system elapsed
> 38.167 0.212 38.386
>
> The above call to ddply is great because it allows me to create histograms
> of how many people receive just a few leads, or a lot of leads, both in
> production and in the simulator.
>
> Question 1: The above ddply call takes a while to execute. With production
> data it takes several minutes in R, but only a few seconds in MySQL. Is
> there a way to improve the performance of the above call?
>
> Question 2: What I would really like to do is create a histogram which
> measures the distribution of change in leads between non-simulated and
> simulated data. A complicating fact is that some users might only appear in
> simulated or non-simulated data, so I need to correctly handle the absence
> of a userId. (In production, users are actually guaranteed to appear in
> production - but the crux of the problem is the same: userIds might be
> missing in one of the splits). Can someone help me with this? I've read
> the documentation a few times, and think that the summarize function might
> be able to help, but I'm not quite sure how to do this.
>
> Thanks.
>