Hello all,

I really appreciate how helpful the people on this list are. Would it be too much to ask to send a small script to have it peer-reviewed, to make sure I am not making blatant mistakes? The script takes an experiment.dat as input and plots system throughput using ggplot2. It works now ... [sigh] but I have this nasty feeling that I might be doing something wrong :).

Changing "samples", i.e. the number of samples per group, produces arbitrarily different results; I basically increased it (up to 9) until there were no strongly deterministic periodicities.

This is not a full-fledged experiment, just a preliminary report to show that I have implemented a healthy system. Proper experimental analysis comes later, after varying the factors according to the 2^k*r experimental design etc.

Some key points I would like to find out:

- Whether aggregation preserves the natural order of the measurements, i.e. if there are 20 runtimes taken in that order and I make groups of 10 measurements (to compute statistics on them), the first group must contain the first 10 runtimes and the second group the second 10. I am not sure whether my choice of aggregation etc. respects this.
- Whether it is best to fill the bins by time intervals or by number of observations.

Your help will be greatly appreciated! I have the data too and the plots look very nice, but it is a 4 MB file.
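To make the two questions concrete, here is a minimal sketch of both binning options on made-up data. The data frame `toy`, its values, the bin parameters `n_per_bin` and `width`, and the helper `is_order_preserving` are all hypothetical and not part of the script; only base R is used.

```r
# Hypothetical toy data: 20 runtimes in arrival order, one per row.
# Column names mirror the script; the values are made up.
toy <- data.frame(Time = c(1, 1, 2, 2, 2, 3, 5, 5, 6, 6,
                           7, 8, 8, 8, 9, 10, 11, 11, 12, 12),
                  Runtime = seq(101, 120))

# Option A: bin by number of observations (fixed count per bin).
# Row i goes to bin ceiling(i / n_per_bin), so bins follow the row order.
n_per_bin <- 10
toy$count_bin <- ceiling(seq_len(nrow(toy)) / n_per_bin)

# Option B: bin by time interval (fixed width per bin).
# Bin sizes now depend on how many observations fall in each interval.
width <- 4
toy$time_bin <- floor((toy$Time - min(toy$Time)) / width) + 1

# Order check: bin ids must never decrease along the original row order.
is_order_preserving <- function(bin) all(diff(bin) >= 0)

count_sizes <- as.vector(table(toy$count_bin))   # observations per count bin
time_sizes  <- as.vector(table(toy$time_bin))    # observations per time bin

print(is_order_preserving(toy$count_bin))
print(is_order_preserving(toy$time_bin))
print(count_sizes)
print(time_sizes)

# aggregate() returns its groups sorted by bin id; each group holds exactly
# the rows that carry that id, so the first bin summarizes the first 10 rows.
first_bin_mean <- aggregate(Runtime ~ count_bin, data = toy, FUN = mean)$Runtime[1]
print(first_bin_mean)
```

Count-based bins preserve the order by construction and always have the same size. Time-based bins preserve it only if Time is non-decreasing in row order, and their sizes vary, which matters for the standard error of each bin.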
TIA

Best regards,
Giovanni

# ========================================================================================
# Advanced Systems Lab
# Milestone 1
# Author: Giovanni Azua
# Date: 22 October 2011
# ========================================================================================

rm(list=ls())          # clear workspace

library(boot)          # use boot library
library(ggplot2)       # use ggplot2 library
library(doBy)          # use doBy library

# ========================================================================================
# ETL Step
# ========================================================================================

data_file <- file("/Users/bravegag/code/asl11/trunk/report/experiment.dat")
df <- read.table(data_file)      # reads the data as data frame
class(df)                        # show the class to be 'data.frame'
names(df)                        # data is prepared correctly in Python
str(df)
head(df)

names(df)[names(df)=="V1"] <- "Time"           # change column names
names(df)[names(df)=="V2"] <- "Partitioning"
names(df)[names(df)=="V3"] <- "Workload"
names(df)[names(df)=="V4"] <- "Runtime"
str(df)
head(df)

# ========================================================================================
# Define utility functions
# ========================================================================================

se <- function(x) sqrt(var(x)/length(x))       # standard error of the mean
sst <- function(x) sum((x-mean(x))^2)          # total sum of squares

## ************************************ COPIED FROM ********************************************
## http://wiki.stdout.org/rcookbook/Graphs/Plotting%20means%20and%20error%20bars%20%28ggplot2%29
## *********************************************************************************************

## Summarizes data.
## Gives count, mean, standard deviation, standard error of the mean, and confidence interval (default 95%).
## If there are within-subject variables, calculate adjusted values using method from Morey (2008).
## data: a data frame.
## measurevar: the name of a column that contains the variable to be summarized
## groupvars: a vector containing names of columns that contain grouping variables
## na.rm: a boolean that indicates whether to ignore NA's
## conf.interval: the percent range of the confidence interval (default is 95%)
summarySE <- function(data=NULL, measurevar, groupvars=NULL, na.rm=FALSE, conf.interval=.95) {
    require(doBy)

    # New version of length which can handle NA's: if na.rm==T, don't count them
    length2 <- function (x, na.rm=FALSE) {
        if (na.rm) sum(!is.na(x))
        else       length(x)
    }

    # Collapse the data
    formula <- as.formula(paste(measurevar, paste(groupvars, collapse=" + "), sep=" ~ "))
    datac <- summaryBy(formula, data=data, FUN=c(length2,mean,sd), na.rm=na.rm)

    # Rename columns
    names(datac)[ names(datac) == paste(measurevar, ".mean", sep="") ] <- measurevar
    names(datac)[ names(datac) == paste(measurevar, ".sd", sep="") ] <- "sd"
    names(datac)[ names(datac) == paste(measurevar, ".length2", sep="") ] <- "N"

    datac$se <- datac$sd / sqrt(datac$N)  # Calculate standard error of the mean

    # Confidence interval multiplier for standard error
    # Calculate t-statistic for confidence interval:
    # e.g., if conf.interval is .95, use .975 (above/below), and use df=N-1
    ciMult <- qt(conf.interval/2 + .5, datac$N-1)
    datac$ci <- datac$se * ciMult

    return(datac)
}

# ========================================================================================
# Prepare the Throughput data
# ========================================================================================

# count the requests per (Time, Partitioning) combination
throughput <- aggregate(x=df$Runtime, by=list(df$Time,df$Partitioning), FUN=length)
head(throughput)

names(throughput)[names(throughput)=="Group.1"] <- "Time"           # change column names
names(throughput)[names(throughput)=="Group.2"] <- "Partitioning"
names(throughput)[names(throughput)=="x"] <- "Y"
head(throughput)

samples <- 9
throughput$Time_group <- floor(throughput$Time/samples) + 1         # generate Time groups of "samples"
dfc <-
    summarySE(throughput, measurevar="Y", groupvars=c("Time_group", "Partitioning"))
last <- nrow(dfc)                                      # number of summary rows
dfc <- dfc[c(-1,-2,-(last-1),-last),]                  # drop the first two and last two rows
dfc$Time <- dfc$Time_group - min(dfc$Time_group) + 1   # re-base the time groups to start at 1
head(dfc)

# mu + se error bar
ggplot(dfc, aes(x=Time, y=Y, colour=Partitioning, group=Partitioning)) +
    geom_point(fill="white", size=3) +
    geom_line() +
    geom_errorbar(aes(ymin=Y-se, ymax=Y+se), width=.5) +
    theme_bw() +
    xlab("Minutes") +
    ylab("Throughput (Requests per Minute)") +
    scale_y_continuous(breaks=seq(0, max(dfc$Y + dfc$se), 50), limits=c(0, max(dfc$Y + dfc$se))) +
    ggtitle("System Throughput\n2x Clients 2x Middlewares 2x Databases") +
    scale_x_continuous(breaks=0:length(dfc$Y), labels=as.character(0:length(dfc$Y)*samples))

# ========================================================================================
# Prepare the Response Time data
# ========================================================================================