Hi R friends,

I am posting this question even though I know that its nature is closer to general stats than R. Please let me know if you are aware of a list for general statistical questions.

I am looking for a simple method to distinguish two groups of data in a long vector of numbers:

list <- c(1,2,3,2,3,2,3,4,3,2,3,4,3,2,400,340,3,2,4,5,6,4,3,6,4,5,3)

I would like to 'learn' that 400 and 340 are different numbers by using a simple approach. The outcome of processing 'list' should therefore be:

listA <- c(1,2,3,2,3,2,3,4,3,2,3,4,3,2,3,2,4,5,6,4,3,6,4,5,3)
listB <- c(400,340)

I am thinking of a non-parametric test since I have no knowledge of the underlying distribution. The numbers are time differences between two actions recorded from the same person over time. Because the data was obtained from the same person I would naturally tend to use the Wilcoxon signed-rank test. Any thoughts on that?

Are there any R packages that would process such a vector and use non-parametric methods to split or divide groups based on their values? Could clustering be the answer, given that I already know that I always have two groups with a significant difference between the two?

Thanks a lot,
Ralf
One of many possible approaches is called k-means clustering.

my.data <- c(1,2,3,2,3,2,3,4,3,2,3,4,3,2,400,340,3,2,4,5,6,4,3,6,4,5,3)
split(my.data, kmeans(my.data, 2)$cluster)

$`1`
[1] 400 340

$`2`
[1] 1 2 3 2 3 2 3 4 3 2 3 4 3 2 3 2 4 5 6 4 3 6 4 5 3

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
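Because kmeans() chooses its starting centers at random, the cluster labels (1 vs. 2) can swap from one run to the next. A minimal sketch of one way to get a stable split, assuming a fixed seed and a hypothetical helper name split_two_groups (not part of any package), is to identify the groups by their cluster centers instead of by label:

```r
# Hypothetical helper (not from any package): split a numeric vector into a
# "low" and a "high" group using k-means with two centers.
split_two_groups <- function(x, seed = 1) {
  set.seed(seed)                             # assumption: fixed seed for reproducibility
  km <- kmeans(x, centers = 2, nstart = 10)  # several random starts, best one kept
  high <- which.max(km$centers[, 1])         # label of the cluster with the larger center
  list(listA = x[km$cluster != high],        # the small values
       listB = x[km$cluster == high])        # the large values
}

my.data <- c(1,2,3,2,3,2,3,4,3,2,3,4,3,2,400,340,3,2,4,5,6,4,3,6,4,5,3)
res <- split_two_groups(my.data)
res$listA
res$listB
```

The element names listA and listB simply mirror the names used in the question; the seed and the number of starts are arbitrary choices.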
On Wed, 5 May 2010, Ralf B wrote:

> I am looking for a simple method to distinguish two groups of data in
> a long vector of numbers:
>
> list <- c(1,2,3,2,3,2,3,4,3,2,3,4,3,2,400,340,3,2,4,5,6,4,3,6,4,5,3)
>
> I would like to 'learn' that 400,340 are different numbers by using a
> simple approach.

It seems that you want to cluster the data. There are, of course, loads of clustering algorithms around, see e.g.,
  http://CRAN.R-project.org/view=Cluster

In this simple example a standard hierarchical clustering approach shows you what you're after.

## data
list <- c(1,2,3,2,3,2,3,4,3,2,3,4,3,2,400,340,3,2,4,5,6,4,3,6,4,5,3)

## cluster using Ward's method on Euclidean distances
## (the "ward" method was renamed "ward.D" in R 3.1.0)
hc <- hclust(dist(list, method = "euclidean"), method = "ward.D")
plot(hc)
hc

## cut into two clusters
split(list, cutree(hc, k = 2))

hth,
Z
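The hierarchical clustering suggestion can also be wrapped in a small function that returns the two groups directly. This is only a sketch, assuming R 3.1.0 or later (where the "ward.D2" method is available) and a hypothetical helper name split_by_hclust; the groups are ordered by their means so that the first one always holds the small values:

```r
# Hypothetical helper (not from any package): cut a Ward dendrogram into two
# groups and return them ordered by group mean.
split_by_hclust <- function(x) {
  hc <- hclust(dist(x, method = "euclidean"), method = "ward.D2")
  groups <- split(x, cutree(hc, k = 2))            # two groups, original order kept
  groups <- groups[order(sapply(groups, mean))]    # group with the smaller mean first
  names(groups) <- c("listA", "listB")             # names taken from the question
  groups
}

x <- c(1,2,3,2,3,2,3,4,3,2,3,4,3,2,400,340,3,2,4,5,6,4,3,6,4,5,3)
res <- split_by_hclust(x)
res$listA
res$listB
```

Unlike k-means, this involves no random starts, so repeated runs give the same split without setting a seed. The vector is called x here rather than list to avoid masking the base function list().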