thr3ads.net - R help - [R] binning runtimes [Oct 2011]

Giovanni Azua

2011-Oct-24 09:01 UTC

[R] binning runtimes

Hello,

Suppose I have the dataset shown below. The amount of observations is too
massive to get a nice geom_point and smoother on top. What I would like to do is
to bin the data first. The data is indexed by Time (minutes from 1 to 120 i.e.
two hours of System benchmarking).

Option 1) group the data by Time i.e. minute 1, minute 2, etc and within each
group create bins of N consecutive observations and average them into one
observation, the bins become the new data points to use for the geom_point plot.
How can I do this? Shingle? how to do that?

Option 2)  Another option is to again group by Time i.e. minute 1, minute 2, etc
and within each group draw a random observation to be the representative for the
corresponding bin. I could not clearly see how to use Random.

> dfs <- subset(df, Partitioning == "Sharding")
> head(dfs)
  Time Partitioning Workload Runtime
1    1     Sharding    Query    3301
2    1     Sharding    Query    3268
3    1     Sharding    Query    2878
4    1     Sharding    Query    2819
5    1     Sharding    Query    3310
6    1     Sharding    Query    3428
> str(dfs)
'data.frame':	102384 obs. of  4 variables:
 $ Time        : int  1 1 1 1 1 1 1 1 1 1 ...
 $ Partitioning: Factor w/ 2 levels "Replication",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ Workload    : Factor w/ 2 levels "Query","Refresh": 1 1 1 1 1 1 1 1 1 1 ...
 $ Runtime     : int  3301 3268 2878 2819 3310 3428 2837 2954 2902 2936 ...

Many thanks in advance,
Best regards,
Giovanni

Dennis Murphy

2011-Oct-24 12:45 UTC

[R] binning runtimes

Hi:

On Mon, Oct 24, 2011 at 2:01 AM, Giovanni Azua <bravegag at gmail.com> wrote:
> Hello,
>
> Suppose I have the dataset shown below. The amount of observations is too
> massive to get a nice geom_point and smoother on top. What I would like to do is
> to bin the data first. The data is indexed by Time (minutes from 1 to 120 i.e.
> two hours of System benchmarking).
>
> Option 1) group the data by Time i.e. minute 1, minute 2, etc and within
> each group create bins of N consecutive observations and average them into one
> observation, the bins become the new data points to use for the geom_point plot.
> How can I do this? Shingle? how to do that?

If necessary, create a variable for minute; if Time already represents
minutes, you shouldn't need to do anything. To average Runtime by one
or more factors, there are many ways to do it: aggregate() in base R,
ddply() in plyr, summaryBy() in the doBy package or data.table. For
example, with aggregate() [R-2.11.0 or later], you could do (assuming
Time is in minutes; otherwise substitute the minute variable instead)

aggregate(Runtime ~ Time + Partitioning, data = dfs, FUN = mean)
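
The aggregate() call above yields one mean per minute. If bins of N consecutive observations within each minute are wanted, as Option 1 describes, a bin index can be built first. This is only a sketch: the toy data frame, the bin size N = 4, and the column name Bin are assumptions made for illustration, and it presumes the rows are already in observation order within each minute.

```r
# Toy stand-in for dfs (hypothetical values, 10 observations per minute)
set.seed(1)
dfs <- data.frame(
  Time = rep(1:3, each = 10),
  Partitioning = factor("Sharding"),
  Runtime = sample(2800:3500, 30, replace = TRUE)
)

N <- 4  # assumed bin size: number of consecutive observations per bin

# Within each minute, label rows 1..N as bin 1, rows N+1..2N as bin 2, etc.
# The last bin of a minute may hold fewer than N observations.
dfs$Bin <- ave(seq_along(dfs$Time), dfs$Time,
               FUN = function(i) (seq_along(i) - 1) %/% N + 1)

# Average Runtime within each (Time, Partitioning, Bin) cell
binned <- aggregate(Runtime ~ Time + Partitioning + Bin, data = dfs, FUN = mean)
binned <- binned[order(binned$Time, binned$Bin), ]
```

The resulting binned data frame has one row per bin and could be passed to ggplot() in place of the raw data; a larger N gives fewer, smoother points.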

>
> Option 2) Another option is to again group by Time i.e. minute 1, minute
> 2, etc and within each group draw a random observation to be the representative
> for the corresponding bin. I could not clearly see how to use Random.

# Example:
# sampfun() samples one row of a data frame at random
sampfun <- function(d) d[sample(seq_len(nrow(d)), 1), ]
library('plyr')
ddply(dfs, .(Time, Partitioning), sampfun)
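
For comparison, the same one-random-row-per-group draw can be sketched in base R without plyr. The toy data frame and the names idx and picked are assumptions for illustration only.

```r
# Toy stand-in for dfs (hypothetical values, 5 observations per minute)
set.seed(42)
dfs <- data.frame(
  Time = rep(1:3, each = 5),
  Partitioning = factor("Sharding"),
  Runtime = 3001:3015
)

# Split the row numbers by group, then pick one row number per group.
# sample.int() on the group size avoids the sample(x, 1) pitfall when a
# group holds a single row number greater than 1.
groups <- split(seq_len(nrow(dfs)), list(dfs$Time, dfs$Partitioning), drop = TRUE)
idx <- vapply(groups, function(i) i[sample.int(length(i), 1)], integer(1))

picked <- dfs[idx, ]
```

Calling set.seed() beforehand makes the draw reproducible; without it, each run selects a different representative per minute.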


HTH,
Dennis
>
>> dfs <- subset(df, Partitioning == "Sharding")
>> head(dfs)
>   Time Partitioning Workload Runtime
> 1    1     Sharding    Query    3301
> 2    1     Sharding    Query    3268
> 3    1     Sharding    Query    2878
> 4    1     Sharding    Query    2819
> 5    1     Sharding    Query    3310
> 6    1     Sharding    Query    3428
>> str(dfs)
> 'data.frame':   102384 obs. of  4 variables:
>  $ Time        : int  1 1 1 1 1 1 1 1 1 1 ...
>  $ Partitioning: Factor w/ 2 levels "Replication",..: 2 2 2 2 2 2 2 2 2 2 ...
>  $ Workload    : Factor w/ 2 levels "Query","Refresh": 1 1 1 1 1 1 1 1 1 1 ...
>  $ Runtime     : int  3301 3268 2878 2819 3310 3428 2837 2954 2902 2936 ...
>
> Many thanks in advance,
> Best regards,
> Giovanni
