andrewjacksonTCD
2008-Jul-10 14:01 UTC
[R] Non-normal data issues in PhD software engineering experiment
Hi All,

I hope I am not breaching any terms of this forum with this rather general post. There are very R-specific elements further along in this rather long posting. I will do my best to clearly explain my experiment, goals and problems here, but please let me know if I have left out any vital information or if there is any ambiguity I need to address so that you can help me. I have a very limited background in statistics - I have just completed a postgraduate course in statistics at TCD Dublin, Ireland.

*** Experimental setup ***

I have conducted a software engineering experiment in which I have taken measures of quality for a software system built using 2 different design paradigms (1 and 2) over 10 evolutionary versions of the system (1 - 10). So for each version I have a pair of systems that are identical in that they do precisely the same thing, and differ only in that they are built using the 2 different design paradigms. For each version and paradigm I have collected a data set of measures called sensitivity measures. So in total I have 20 different data sets: 10 for the 10 versions of the software under design paradigm 1 and 10 for the 10 versions under design paradigm 2.

*** Data ***

My data can be found at https://www.cs.tcd.ie/~ajackso/data.csv

In this file there are a number of columns: "version", "paradigm", "location", "coverage", "execution", "infection", "propogation", "sensitivity". Sensitivity is the main response - please ignore "coverage", "execution", "infection" and "propogation", as these were only used to calculate sensitivity. All 20 of my data sets are in this file - the columns version (1 - 10) and paradigm (1 or 2) differentiate them.

*** Goals ***

With this data collected I now want to do a number of things:

1) I want to look at the analysis of variance to see if there is a difference in mean between the two paradigms over the 10 versions. I want to remove the version-related variance by blocking on version, so that I get a picture of the variance related to paradigm only. My null hypothesis is that the means of the two paradigms are the same. I also want to look at each version individually to see if there is any difference between each pair of system designs.

2) I want to create two regression models, one for each paradigm, so that I can see how the quality of each paradigm is affected over time (versions). It would also be nice to have both confidence and prediction boundaries.

3) I want to be able to look at the power of all of this, and possibly see how many times I would need to repeat the experiment to have concrete evidence that one paradigm is different from/the same as/better than/worse than the other.

4) I am not 100% sure if it is relevant, but the analysis of divergence (something I came across in an R book - Introductory Statistics with R, Peter Dalgaard, Springer, p. 197) may fit what I am looking for to assess the difference between the two regression models in goal 2. I think it would assess the degree to which the regression models diverge over time.

*** Problems ***

1) The 20 data sets are of variable size. They are also non-normal - I have assessed this using normality tests (ad.test etc. in R, and Minitab). So as far as I understand it I have two choices: the first is to transform my non-normal data into normal data; the second is to use non-parametric approaches.
2) I tried to use R to carry out a Box-Cox transformation on each of my 20 data sets, but I couldn't figure out how to get past generating an optimal lambda (how far I got is in the P.S. at the end of this message). I then turned to Minitab and found that I could do the transformation there - the problem, however, was that there was a subset group option I didn't understand. I set it to various values but always seemed to get the same result, so it didn't appear to affect the outcome much, if at all. The result was non-normal data again. I then tried the Johnson transformation and found that it also failed to turn my non-normal data into normal data.

3) I have looked at the Friedman test as a means of performing a two-way analysis of variance for my scenario. I have tried to run it in R and Minitab but can't really figure out what my arguments should be.

Using R: I read my data into a data frame with read.table() and then try the following:

friedman.test(data$sensitivity ~ data$paradigm | data$version, data, data$version, na.action = na.exclude)

This produces the error "incorrect specification for 'formula'". I see that my formula needs to be of length == 3 for this test to be used (https://svn.r-project.org/R/trunk/src/library/stats/R/friedman.test.R). I don't think my formula should even look like this, but I wanted to stay as close as possible to the example provided in the R documentation.

I then tried kruskal.test as follows:

kruskal.test(data$sensitivity ~ data$sensitivity, data = data, na.action = na.exclude)

This gave me a result, but there was no account of the variance between versions. I also tried:

kruskal.test(data$sensitivity ~ data$version + data$paradigm, data = sensResults, na.action = na.exclude)

        Kruskal-Wallis rank sum test

data:  data$sensitivity by data$version by data$paradigm
Kruskal-Wallis chi-squared = 12.1449, df = 9, p-value = 0.2053

I have no idea if these tests are the right thing to do here. kruskal.test is advertised as a substitute for one-way ANOVA. My instinct tells me that I need to use friedman.test, but as you can see I am not having much luck with it. I have looked at the R source (link above) and can see where it is rejecting my formula - I just don't understand what I need to change in my call for it to be accepted.

4) The output of kruskal.test and friedman.test differs from an ANOVA table. By running the examples on the R man pages I can see that friedman.test produces the following output:

> friedman.test(x ~ w | t, data = wb)

        Friedman rank sum test

data:  x and w and t
Friedman chi-squared = 0.3333, df = 1, p-value = 0.5637

As you can see from the point above, the output of kruskal.test looks much the same. This is a big contrast to an ANOVA table: in an ANOVA table I can see the components of variance and the significance of each F test. These alternative tests do not seem to provide that information.

Using Minitab: I go to Stat -> Nonparametrics -> Friedman, which prompts me for columns for response, treatment and blocks. I provide the following:

response  <- sensitivity
treatment <- paradigm
blocks    <- version

When I try to execute this I get the following error:

Friedman 'sensitivity' 'paradigm' 'version' 'RESI1' 'FITS1'.
* ERROR * Must have one observation per cell.
* ERROR * Completion of computation impossible.
5) I have looked briefly at non-parametric approaches to regression - there seem to be many paths that can be taken (http://socserv.mcmaster.ca/jfox/Courses/Oxford-2005/R-nonparametric-regression.html). I need some guidance on which approach to follow. What are the trade-offs? How do I go about it?

Thank you and best regards,

Andrew Jackson
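P.S. In case it helps to see how far I got with the Box-Cox transformation in R before switching to Minitab, this is what I was doing with MASS::boxcox on one of my 20 data sets. The last two steps (picking the lambda and applying it by hand) are my own guess at what comes after generating the optimal lambda, and they assume the sensitivity values are strictly positive:

library(MASS)

dat <- read.csv("data.csv")                        # local copy of my csv file
sub <- subset(dat, version == 1 & paradigm == 1)   # one of the 20 data sets

# profile the Box-Cox log-likelihood over lambda and take the maximum
bc     <- boxcox(sensitivity ~ 1, data = sub, plotit = FALSE)
lambda <- bc$x[which.max(bc$y)]

# apply the transformation by hand (only valid for sensitivity > 0)
sub$sens.bc <- if (lambda == 0) log(sub$sensitivity) else (sub$sensitivity^lambda - 1) / lambda

shapiro.test(sub$sens.bc)   # re-check normality afterwards

Is that the correct way to apply the lambda, or am I missing a step?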
Andrew Jackson
2008-Jul-10 15:15 UTC
[R] Non-normal data issues in PhD software engineering experiment
Hi All,

This is a rather general post to begin with, because I need to give you some important context. There are very R-specific elements further along in this rather long posting, so I thank you in advance for your patience. I will do my best to clearly explain my experiment, data, goals and problems. Please let me know if I have left out any vital information or if there is any ambiguity I need to address so that you can help me.

I have a very limited background in statistics - I have just completed a postgraduate course in statistics at Trinity College Dublin, Ireland, so I have the basics and not much more. I would also like to say up front that I am not the most gifted in terms of maths. With that in mind, if you respond with equations and mathematical notation, I would appreciate it if you could also describe at a high level what the equation does or represents.

*** Experimental setup ***

I have conducted a software engineering experiment in which I have taken measures of quality for a software system built using 2 different design paradigms (1 and 2) over 10 evolutionary versions of the system (1 - 10). So for each version I have a pair of systems that are identical in that they do precisely the same thing, and differ only in that they are built using the 2 different design paradigms. For each version and paradigm I have collected a data set of measures called sensitivity measures. So in total I have 20 different data sets: 10 for the 10 versions of the software under design paradigm 1 and 10 for the 10 versions under design paradigm 2.

*** Data ***

My data can be found at https://www.cs.tcd.ie/~ajackso/data.csv

In this file there are a number of columns: "version", "paradigm", "location", "coverage", "execution", "infection", "propogation", "sensitivity". Sensitivity is the main response - please ignore "coverage", "execution", "infection" and "propogation", as these were only used to calculate sensitivity. All 20 of my data sets are in this file - the columns version (1 - 10) and paradigm (1 or 2) differentiate them.

*** Goals ***

With this data collected I now want to do a number of things:

1) I want to look at the analysis of variance to see if there is a difference in mean between the two paradigms over the 10 versions. I want to remove the version-related variance by blocking on version, so that I get a picture of the variance related to paradigm only. My null hypothesis is that the means of the two paradigms are the same. I also want to look at each version individually to see if there is any difference between each pair of system designs.

2) I want to create two regression models, one for each paradigm, so that I can see how the quality of each paradigm is affected over time (versions). It would also be nice to have both confidence and prediction boundaries.

3) I want to be able to look at the power of all of this, and possibly see how many times I would need to repeat the experiment to have concrete evidence that one paradigm is different from/the same as/better than/worse than the other.

4) I am not 100% sure if it is relevant, but the analysis of divergence (something I came across in an R book - Introductory Statistics with R, Peter Dalgaard, Springer, p. 197) may fit what I am looking for to assess the difference between the two regression models in goal 2. I think it would assess the degree to which the regression models diverge over time.

(A rough R sketch of what I mean by these goals follows below.)
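To make goals 1 - 3 (and possibly 4) concrete, this is the rough shape of the R code I have in mind, ignoring the non-normality problem for the moment. The file name, the choice to treat version as a categorical block in the ANOVA but as a numeric time axis in the regressions, and the delta and sd values in the power calculation are all guesses on my part rather than anything I am confident about:

dat <- read.csv("data.csv")           # local copy of https://www.cs.tcd.ie/~ajackso/data.csv
dat$paradigm <- factor(dat$paradigm)
dat$vfac     <- factor(dat$version)   # version treated as a categorical block

# Goal 1: ANOVA of sensitivity, blocking on version and testing paradigm
fit <- aov(sensitivity ~ vfac + paradigm, data = dat)
summary(fit)

# Goal 2: a separate straight-line regression per paradigm, with
# confidence and prediction intervals at each version
newv <- data.frame(version = 1:10)
for (p in levels(dat$paradigm)) {
    m <- lm(sensitivity ~ version, data = subset(dat, paradigm == p))
    print(predict(m, newdata = newv, interval = "confidence"))
    print(predict(m, newdata = newv, interval = "prediction"))
}

# Goal 4 (I think): do the two regression lines have different slopes?
# The version:paradigm interaction term should test exactly that.
anova(lm(sensitivity ~ version * paradigm, data = dat))

# Goal 3: rough sample-size calculation for a two-sample comparison;
# delta and sd are placeholders, not values estimated from my data
power.t.test(delta = 0.1, sd = 0.2, sig.level = 0.05, power = 0.8)

Is this even roughly the right shape of analysis, given that the data are not normal?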
*** Problems ***

1) The 20 data sets are of variable size. They are also non-normal - I have assessed this using normality tests (ad.test etc. in R, and Minitab). So as far as I understand it I have two choices: the first is to transform my non-normal data into normal data; the second is to use non-parametric approaches.

2) I tried to use R to carry out a Box-Cox transformation on each of my 20 data sets, but I couldn't figure out how to get past generating an optimal lambda. I then turned to Minitab and found that I could do the transformation there - the problem, however, was that there was a subset group option I didn't understand. I set it to various values but always seemed to get the same result, so it didn't appear to affect the outcome much, if at all. The result was non-normal data again. I then tried the Johnson transformation and found that it also failed to turn my non-normal data into normal data.

3) I have looked at the Friedman test as a means of performing a two-way analysis of variance for my scenario. I have tried to run it in R and Minitab but can't really figure out what my arguments should be (the closest I have come to a call that runs is sketched after point 4 below).

Using R: I read my data into a data frame with read.table() and then try the following:

friedman.test(data$sensitivity ~ data$paradigm | data$version, data, data$version, na.action = na.exclude)

This produces the error "incorrect specification for 'formula'". I see that my formula needs to be of length == 3 for this test to be used (https://svn.r-project.org/R/trunk/src/library/stats/R/friedman.test.R). I don't think my formula should even look like this, but I wanted to stay as close as possible to the example provided in the R documentation.

I then tried kruskal.test as follows:

kruskal.test(data$sensitivity ~ data$sensitivity, data = data, na.action = na.exclude)

This gave me a result, but there was no account of the variance between versions. I also tried:

kruskal.test(data$sensitivity ~ data$version + data$paradigm, data = sensResults, na.action = na.exclude)

        Kruskal-Wallis rank sum test

data:  data$sensitivity by data$version by data$paradigm
Kruskal-Wallis chi-squared = 12.1449, df = 9, p-value = 0.2053

I have no idea if these tests are the right thing to do here. kruskal.test is advertised as a substitute for one-way ANOVA. My instinct tells me that I need to use friedman.test, but as you can see I am not having much luck with it. I have looked at the R source (link above) and can see where it is rejecting my formula - I just don't understand what I need to change in my call for it to be accepted.

4) The output of kruskal.test and friedman.test differs from an ANOVA table. By running the examples on the R man pages I can see that friedman.test produces the following output:

> friedman.test(x ~ w | t, data = wb)

        Friedman rank sum test

data:  x and w and t
Friedman chi-squared = 0.3333, df = 1, p-value = 0.5637

As you can see from the point above, the output of kruskal.test looks much the same. This is a big contrast to an ANOVA table: in an ANOVA table I can see the components of variance and the significance of each F test. These alternative tests do not seem to provide that information.
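For what it is worth, the nearest I have got to a friedman.test call that actually runs is the following, and it only runs because I first collapse each version/paradigm cell to a single median value - which also seems to be what Minitab's "one observation per cell" error below is demanding. I have no idea whether throwing away the within-cell replication like this is legitimate; that is really part of my question. (dat is the data frame read from my csv file, with paradigm and version as in the sketch above.)

# collapse each paradigm x version cell to one value (its median)
agg <- aggregate(sensitivity ~ paradigm + version, data = dat, FUN = median)

# unreplicated complete block design: paradigm as treatment, version as block
friedman.test(sensitivity ~ paradigm | version, data = agg)

# and the Kruskal-Wallis test of paradigm alone, ignoring version entirely
kruskal.test(sensitivity ~ paradigm, data = dat)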
Using Minitab: I go to Stat -> Nonparametrics -> Friedman, which prompts me for columns for response, treatment and blocks. I provide the following:

response  <- sensitivity
treatment <- paradigm
blocks    <- version

When I try to execute this I get the following error:

Friedman 'sensitivity' 'paradigm' 'version' 'RESI1' 'FITS1'.
* ERROR * Must have one observation per cell.
* ERROR * Completion of computation impossible.

5) I have looked briefly at non-parametric approaches to regression - there seem to be many paths that can be taken (http://socserv.mcmaster.ca/jfox/Courses/Oxford-2005/R-nonparametric-regression.html). I need some guidance on which approach to follow. What are the trade-offs? How do I go about it? (The only attempt I have managed so far is the lowess smooth in the P.S. below.)

Thank you and best regards,

Andrew Jackson
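P.S. The only non-parametric regression I have actually produced so far is a simple lowess smooth of sensitivity against version for each paradigm, included here in case it makes clearer what I am after. I do not know whether lowess, a kernel smoother or one of the other methods on the page above is the more defensible choice:

dat <- read.csv("data.csv")   # local copy of my csv file

# scatter plot of all sensitivity values, marked by paradigm (1 or 2)
plot(jitter(dat$version), dat$sensitivity, pch = dat$paradigm,
     xlab = "version", ylab = "sensitivity")

# one lowess smooth per paradigm, drawn with a different line type
for (p in 1:2) {
    sub <- na.omit(subset(dat, paradigm == p, select = c(version, sensitivity)))
    lines(lowess(sub$version, sub$sensitivity), lty = p)
}
legend("topright", legend = c("paradigm 1", "paradigm 2"), pch = 1:2, lty = 1:2)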