thr3ads.net - R help - [R] code review: is it too much to ask? [Oct 2011]

Giovanni Azua
2011-Oct-23 21:03 UTC
[R] code review: is it too much to ask?

Hello all,

I really appreciate how helpful the people on this list are. Would it be too
much to ask to send a small script to have it peer-reviewed, to make sure I am
not making blatant mistakes? The script takes an experiment.dat as input and
plots the system throughput using ggplot2. It works now ... [sigh] but I have
this nasty feeling that I might be doing something wrong :). Changing
"samples", i.e. the number of samples per group, produces arbitrarily
different results; I basically increased it (up to 9) until there were no
strongly deterministic periodicities. This is not a full-fledged experiment but
just a preliminary report that will show I have implemented a healthy system.
Proper experimental analysis comes after varying factors according to the 2^k*r
experimental design, etc.

Some key points I would like to find out:
- whether aggregation breaks the natural order of the measurements, i.e. if
there are 20 runtimes taken in that order, and I make groups of 10 measurements
(to compute statistics on them), the first group must contain the first 10
runtimes and the second group must contain the second 10 runtimes. I am not sure
if the choice of aggregation etc. respects this.
- whether it is best to fill the bins by time intervals or by number of
observations.
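
The two binning strategies in the second point can be compared on a toy series. The sketch below is illustrative only: the vectors `time` and `runtime`, the bin size `k`, and the window `width` are made-up values, not taken from experiment.dat, and it assumes the rows are already in measurement order.

```r
# Hypothetical toy data: six runtimes at irregular timestamps,
# already sorted in measurement order.
time    <- c(1, 2, 3, 4, 11, 12)
runtime <- c(5, 6, 7, 8, 9, 10)

# Binning by number of observations: consecutive rows share a bin,
# so the first k rows always form the first group.
k <- 3
bin_by_count <- (seq_along(runtime) - 1) %/% k + 1

# Binning by time interval: rows share a bin when their timestamps
# fall into the same window of fixed width.
width <- 5
bin_by_time <- floor(time / width) + 1

# tapply() orders its result by bin label. The labels are increasing
# numbers, so the groups come out in measurement order.
mean_by_count <- tapply(runtime, bin_by_count, mean)
mean_by_time  <- tapply(runtime, bin_by_time, mean)
```

With time-based bins the number of observations per bin can vary, and a bin can be empty (bin 2 here), which changes N and therefore the standard error of each group. Count-based bins keep N fixed but cover time windows of unequal length.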

Your help will be greatly appreciated!

I have the data too and the plots look very nice, but it is a 4 MB file.

TIA
Best regards,
Giovanni

# ========================================================================================
# Advanced Systems Lab
# Milestone 1
# Author: Giovanni Azua
# Date: 22 October 2011
# ========================================================================================
rm(list=ls())                                                        # clear workspace

library(boot)                                                        # use boot library
library(ggplot2)                                                     # use ggplot2 library
library(doBy)                                                        # use doBy library

# ========================================================================================
# ETL Step
# ========================================================================================
data_file <- file("/Users/bravegag/code/asl11/trunk/report/experiment.dat")
df <- read.table(data_file)                                          # reads the data as data frame
class(df)                                                            # show the class; read.table returns a 'data.frame'
names(df)                                                            # data is prepared correctly in Python
str(df)
head(df)

names(df)[names(df)=="V1"] <- "Time"                                 # change column names
names(df)[names(df)=="V2"] <- "Partitioning"
names(df)[names(df)=="V3"] <- "Workload"
names(df)[names(df)=="V4"] <- "Runtime"
str(df)
head(df)

# ========================================================================================
# Define utility functions
# ========================================================================================
se <- function(x) sqrt(var(x)/length(x))                             # standard error of the mean
sst <- function(x) sum((x-mean(x))^2)                                # total sum of squares: square each deviation, then sum

## ************************************ COPIED FROM ********************************************
## http://wiki.stdout.org/rcookbook/Graphs/Plotting%20means%20and%20error%20bars%20%28ggplot2%29
## *********************************************************************************************
## Summarizes data.
## Gives count, mean, standard deviation, standard error of the mean, and confidence interval (default 95%).
## If there are within-subject variables, calculate adjusted values using method from Morey (2008).
##   data: a data frame.
##   measurevar: the name of a column that contains the variable to be summarized
##   groupvars: a vector containing names of columns that contain grouping variables
##   na.rm: a boolean that indicates whether to ignore NA's
##   conf.interval: the percent range of the confidence interval (default is 95%)
summarySE <- function(data=NULL, measurevar, groupvars=NULL, na.rm=FALSE, conf.interval=.95) {
    require(doBy)

    # New version of length which can handle NA's: if na.rm==T, don't count them
    length2 <- function (x, na.rm=FALSE) {
        if (na.rm) sum(!is.na(x))
        else       length(x)
    }

    # Collapse the data
    formula <- as.formula(paste(measurevar, paste(groupvars, collapse=" + "), sep=" ~ "))
    datac <- summaryBy(formula, data=data, FUN=c(length2,mean,sd), na.rm=na.rm)

    # Rename columns
    names(datac)[ names(datac) == paste(measurevar, ".mean", sep="") ] <- measurevar
    names(datac)[ names(datac) == paste(measurevar, ".sd", sep="") ] <- "sd"
    names(datac)[ names(datac) == paste(measurevar, ".length2", sep="") ] <- "N"

    datac$se <- datac$sd / sqrt(datac$N)  # Calculate standard error of the mean
    
    # Confidence interval multiplier for standard error
    # Calculate t-statistic for confidence interval: 
    # e.g., if conf.interval is .95, use .975 (above/below), and use df=N-1
    ciMult <- qt(conf.interval/2 + .5, datac$N-1)
    datac$ci <- datac$se * ciMult
    
    return(datac)
}

# ========================================================================================
# Prepare the Throughput data
# ========================================================================================
throughput <- aggregate(x=df$Runtime, by=list(df$Time,df$Partitioning), FUN=length)
head(throughput)
names(throughput)[names(throughput)=="Group.1"] <- "Time"            # change column names
names(throughput)[names(throughput)=="Group.2"] <- "Partitioning"
names(throughput)[names(throughput)=="x"] <- "Y"
head(throughput)

samples = 9
throughput$Time_group <- floor(throughput$Time/samples) + 1          # generate Time groups of "samples"

dfc <- summarySE(throughput, measurevar="Y", groupvars=c("Time_group", "Partitioning"))
last <- nrow(dfc)                                                    # number of summary rows
dfc <- dfc[c(-1,-2,-(last-1),-last),]                                # drop the first two and last two rows
dfc$Time <- dfc$Time_group - min(dfc$Time_group) + 1                 # re-base Time_group so the x axis starts at 1
head(dfc)

# mu + se error bar
ggplot(dfc, aes(x=Time, y=Y, colour=Partitioning, group=Partitioning)) +
    geom_point(fill="white", size=3) +
    geom_line() + geom_errorbar(aes(ymin=Y-se, ymax=Y+se), width=.5) +
    theme_bw() +
    xlab("Minutes") + ylab("Throughput (Requests per Minute)") +
    scale_y_continuous(breaks=seq(0,max(dfc$Y + dfc$se), 50), limits=c(0, max(dfc$Y + dfc$se))) +
    ggtitle("System Throughput\n2x Clients 2x Middlewares 2x Databases") +
    scale_x_continuous(breaks=0:length(dfc$Y), labels=as.character(0:length(dfc$Y)*samples))
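
As a cross-check on the error bars, the per-group count, mean, standard error, and confidence-interval half-width can be recomputed with base R alone. This is a minimal sketch on made-up numbers: the data frame `toy`, its values, and the helper `summarise_group` are hypothetical. It mirrors the formulas used in summarySE (se = sd/sqrt(N), ci = se * qt(conf.interval/2 + .5, N-1)) and is not a replacement for it.

```r
# Hypothetical toy data: two groups of four throughput counts each.
toy <- data.frame(
  Partitioning = rep(c("A", "B"), each = 4),
  Y            = c(10, 12, 14, 16, 20, 20, 22, 26)
)

# Per-group summary using only base R (hypothetical helper).
summarise_group <- function(y, conf.interval = 0.95) {
  n  <- length(y)
  se <- sd(y) / sqrt(n)                          # standard error of the mean
  ci <- se * qt(conf.interval / 2 + 0.5, n - 1)  # CI half-width from the t distribution
  c(N = n, mean = mean(y), sd = sd(y), se = se, ci = ci)
}

# split() keeps the rows of each group in their original order;
# the result has one row per group, in sorted group order.
check <- do.call(rbind, lapply(split(toy$Y, toy$Partitioning), summarise_group))
print(check)
```

If the `se` and `ci` columns computed this way disagree with what summarySE reports for the same groups, the grouping formula or the column renaming is the first place to look.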

# ========================================================================================
# Prepare the Response Time data
# ========================================================================================


