Hello all,
I really appreciate how helpful the people on this list are. Would it be too
much to ask to send a small script to have it peer-reviewed, to make sure I am
not making blatant mistakes? The script takes an experiment.dat as input and
plots the system Throughput using ggplot2. It works now ... [sigh] but I have
this nasty feeling that I might be doing something wrong :). Changing
"samples", i.e. the number of samples per group, produces arbitrarily
different results; I basically increased it (up to 9) until there were no
strongly deterministic periodicities. This is not a full-fledged experiment but
just a preliminary report that will show I have implemented a healthy system.
Proper experimental analysis comes after varying factors according to the 2^k*r
experimental design etc.
Some key points I would like to find out:
- whether aggregation is breaking the natural order of the measurements, i.e. if
there are 20 runtimes taken in that order, and I make groups of 10 measurements
(to compute statistics on them), the first group must contain the first 10
runtimes and the second group must contain the second 10 runtimes. I am not sure
if the choice of aggregation etc. respects this.
- whether it is best to do the binning by filling the bins by time
intervals or by number of observations.
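
To make the two binning options concrete, here is a minimal sketch on made-up
data. The `toy` data frame, `group_size` and `width` are hypothetical names and
values used only for illustration; they are not part of the script below.

```r
# Toy data (hypothetical): 20 runtimes recorded in order, timestamps in minutes.
toy <- data.frame(Time    = c(1,1,1,2,2,3,3,3,3,4,5,5,6,6,6,7,8,8,9,9),
                  Runtime = 1:20)

# Option A: bin by number of observations. Row order alone decides membership,
# so rows 1-10 form group 1 and rows 11-20 form group 2.
group_size <- 10                       # observations per bin (assumed value)
toy$count_group <- (seq_len(nrow(toy)) - 1) %/% group_size + 1

# Option B: bin by fixed-width time interval. Bins may hold unequal counts.
width <- 5                             # minutes per bin (assumed value)
toy$time_group <- (toy$Time - min(toy$Time)) %/% width + 1

# Order check: if the rows are in measurement order, group ids never decrease.
stopifnot(!is.unsorted(toy$count_group), !is.unsorted(toy$time_group))

table(toy$count_group)                 # bin sizes under option A
table(toy$time_group)                  # bin sizes under option B
```

With option A every bin has the same number of observations; with option B
every bin covers the same time span, which seems closer to what a throughput
per time unit needs.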
Your help will be greatly appreciated!
I have the data too, and the plots look very nice, but the data is a 4 MB file.
TIA
Best regards,
Giovanni
# ========================================================================================
# Advanced Systems Lab
# Milestone 1
# Author: Giovanni Azua
# Date: 22 October 2011
# ========================================================================================
rm(list=ls())                          # clear workspace
library(boot)                          # use boot library
library(ggplot2)                       # use ggplot2 library
library(doBy)                          # use doBy library
# ========================================================================================
# ETL Step
# ========================================================================================
data_file <- file("/Users/bravegag/code/asl11/trunk/report/experiment.dat")
df <- read.table(data_file)            # reads the data as data frame
class(df)                              # show the class to be 'data.frame'
names(df)                              # data is prepared correctly in Python
str(df)
head(df)
names(df)[names(df)=="V1"] <- "Time"   # change column names
names(df)[names(df)=="V2"] <- "Partitioning"
names(df)[names(df)=="V3"] <- "Workload"
names(df)[names(df)=="V4"] <- "Runtime"
str(df)
head(df)
# ========================================================================================
# Define utility functions
# ========================================================================================
se <- function(x) sqrt(var(x)/length(x))
sst <- function(x) sum((x-mean(x))^2)  # square each deviation, then sum
## ************************************ COPIED FROM ********************************************
## http://wiki.stdout.org/rcookbook/Graphs/Plotting%20means%20and%20error%20bars%20%28ggplot2%29
## *********************************************************************************************
## Summarizes data.
## Gives count, mean, standard deviation, standard error of the mean, and
## confidence interval (default 95%).
## If there are within-subject variables, calculate adjusted values using method
## from Morey (2008).
##   data: a data frame.
##   measurevar: the name of a column that contains the variable to be summarized
##   groupvars: a vector containing names of columns that contain grouping variables
##   na.rm: a boolean that indicates whether to ignore NA's
##   conf.interval: the percent range of the confidence interval (default is 95%)
summarySE <- function(data=NULL, measurevar, groupvars=NULL, na.rm=FALSE,
                      conf.interval=.95) {
    require(doBy)

    # New version of length which can handle NA's: if na.rm==T, don't count them
    length2 <- function (x, na.rm=FALSE) {
        if (na.rm) sum(!is.na(x))
        else       length(x)
    }

    # Collapse the data
    formula <- as.formula(paste(measurevar, paste(groupvars, collapse=" + "), sep=" ~ "))
    datac <- summaryBy(formula, data=data, FUN=c(length2,mean,sd), na.rm=na.rm)

    # Rename columns
    names(datac)[ names(datac) == paste(measurevar, ".mean",    sep="") ] <- measurevar
    names(datac)[ names(datac) == paste(measurevar, ".sd",      sep="") ] <- "sd"
    names(datac)[ names(datac) == paste(measurevar, ".length2", sep="") ] <- "N"

    datac$se <- datac$sd / sqrt(datac$N)  # Calculate standard error of the mean

    # Confidence interval multiplier for standard error
    # Calculate t-statistic for confidence interval:
    # e.g., if conf.interval is .95, use .975 (above/below), and use df=N-1
    ciMult <- qt(conf.interval/2 + .5, datac$N-1)
    datac$ci <- datac$se * ciMult

    return(datac)
}
# ========================================================================================
# Prepare the Throughput data
# ========================================================================================
throughput <- aggregate(x=df$Runtime, by=list(df$Time,df$Partitioning),
                        FUN=length)
head(throughput)
names(throughput)[names(throughput)=="Group.1"] <- "Time"   # change column names
names(throughput)[names(throughput)=="Group.2"] <- "Partitioning"
names(throughput)[names(throughput)=="x"] <- "Y"
head(throughput)
samples <- 9
throughput$Time_group <- floor(throughput$Time/samples) + 1 # generate Time groups of "samples"
dfc <- summarySE(throughput, measurevar="Y",
                 groupvars=c("Time_group", "Partitioning"))
# dfc has no Time column yet; dfc$Time only resolved to Time_group through
# partial matching of `$`, so the full column name is spelled out here
last <- length(dfc$Time_group)
dfc <- dfc[c(-1,-2,-(last-1),-last),]  # drop the first two and the last two rows
dfc$Time <- dfc$Time_group - min(dfc$Time_group) + 1
head(dfc)
# mu + se error bar
ggplot(dfc, aes(x=Time, y=Y, colour=Partitioning, group=Partitioning)) +
    geom_point(fill="white", size=3) +
    geom_line() + geom_errorbar(aes(ymin=Y-se, ymax=Y+se), width=.5) +
    theme_bw() +
    xlab("Minutes") + ylab("Throughput (Requests per Minute)") +
    scale_y_continuous(breaks=seq(0,max(dfc$Y + dfc$se), 50),
                       limits=c(0, max(dfc$Y + dfc$se))) +
    ggtitle("System Throughput\n2x Clients 2x Middlewares 2x Databases") +
    scale_x_continuous(breaks=0:length(dfc$Y),
                       labels=as.character(0:length(dfc$Y)*samples))
# ========================================================================================
# Prepare the Response Time data
# ========================================================================================