ngottlieb at marinercapital.com
2007-Jun-13 15:46 UTC
[R] Formatted Data File Question for Clustering -Quickie Project
I am trying to learn how to format Ascii data files for scan or read into R. Precisely for a quickie project, I found some code (at end of this email) to do exactly what I need: To cluster and graph a dendrogram from package (stats). I am stuck on how to format a text file to run the script. I looked at the dataset USArrests (which would be replaced by my data and labels) using UltraEdit. That data appears to be in binary format and I would simply like a readable ASCII text file. How can I: A) format this data to a file for the script below? B) I would like to use squared Euclidean distance, can hclust support this? Thanks, Neil Gottlieb Here is sub-set example of my data set, return series to cluster: 13 cases by 36 observations): Month Convertible Arbitrage Dedicated Short Bias Emerging Markets 1/31/1994 0.004 -0.016 0.105 2/28/1994 0.002 0.020 -0.011 3/31/1994 -0.010 0.072 -0.046 4/30/1994 -0.025 0.013 -0.084 5/31/1994 -0.010 0.023 -0.007 6/30/1994 0.002 0.064 0.005 7/31/1994 0.001 -0.012 0.058 8/31/1994 0.000 -0.057 0.164 9/30/1994 -0.012 0.016 0.052 10/31/1994 -0.014 -0.004 -0.035 11/30/1994 -0.002 0.030 -0.014 12/31/1994 -0.019 -0.002 -0.042 1/31/1995 -0.006 0.013 -0.100 2/28/1995 0.012 -0.022 -0.079 3/31/1995 0.013 0.004 -0.055 4/30/1995 0.023 -0.004 0.073 5/31/1995 0.017 -0.013 0.013 6/30/1995 0.019 -0.069 0.008 7/31/1995 0.009 -0.059 0.022 8/31/1995 0.008 0.008 0.010 9/30/1995 0.011 -0.029 0.019 10/31/1995 0.013 0.064 -0.057 11/30/1995 0.023 -0.010 -0.031 12/31/1995 0.014 0.049 0.007 1/31/1996 0.021 0.006 0.079 2/29/1996 0.012 -0.056 -0.006 3/31/1996 0.015 -0.009 -0.009 4/30/1996 0.013 -0.066 0.051 5/31/1996 0.016 0.000 0.045 6/30/1996 0.015 0.051 0.054 7/31/1996 0.014 0.098 -0.027 8/31/1996 0.013 -0.034 0.036 9/30/1996 0.011 -0.059 0.016 10/31/1996 0.014 0.043 0.017 11/30/1996 0.014 -0.029 0.026 Code Example from Help files: hc <- hclust(dist(USArrests), "ave") (dend1 <- as.dendrogram(hc)) # "print()" method str(dend1) # "str()" method str(dend1, max = 2) # only the first two sub-levels op <- par(mfrow= c(2,2), mar = c(5,2,1,4)) plot(dend1) ## "triangle" type and show inner nodes: plot(dend1, nodePar=list(pch = c(1,NA), cex=0.8, lab.cex = 0.8), type = "t", center=TRUE) plot(dend1, edgePar=list(col = 1:2, lty = 2:3), dLeaf=1, edge.root TRUE) plot(dend1, nodePar=list(pch = 2:1,cex=.4*2:1, col = 2:3), horiz=TRUE) dend2 <- cut(dend1, h=70) plot(dend2$upper) ## leafs are wrong horizontally: plot(dend2$upper, nodePar=list(pch = c(1,7), col = 2:1)) ## dend2$lower is *NOT* a dendrogram, but a list of .. : plot(dend2$lower[[3]], nodePar=list(col=4), horiz = TRUE, type = "tr") ## "inner" and "leaf" edges in different type & color : plot(dend2$lower[[2]], nodePar=list(col=1),# non empty list edgePar = list(lty=1:2, col=2:1), edge.root=TRUE) par(op) str(d3 <- dend2$lower[[2]][[2]][[1]]) nP <- list(col=3:2, cex=c(2.0, 0.75), pch= 21:22, bg= c("light blue", "pink"), lab.cex = 0.75, lab.col = "tomato") plot(d3, nodePar= nP, edgePar = list(col="gray", lwd=2), horiz = TRUE) addE <- function(n) { if(!is.leaf(n)) { attr(n, "edgePar") <- list(p.col="plum") attr(n, "edgetext") <- paste(attr(n,"members"),"members") } n } d3e <- dendrapply(d3, addE) plot(d3e, nodePar= nP) plot(d3e, nodePar= nP, leaflab = "textlike") -------------------------------------------------------- This information is being sent at the recipient's request or...{{dropped}}
if you look at the data USArrests by doing> data(USArrets) > USArretsyou will see that it is a data.frame. so by analogy you could do the following: Probably you have this data in Excel (I guess from the format in your mail). Have the data in a sheet as: converts shortBais .... 19940131 0.004 -0.0016 19940228 ....................... save this sheet as tab limited txt file. then use mydata <- read.table("yourdata.txt") now you can use the data.frame mydata in the cluster analysis. I would suggest you read the intro to R. You can also use read.csv see ?read.table for more info, read R import/export on http://finzi.psych.upenn.edu/R/doc/manual/R-data.html good luck. AA. ----- Original Message ----- From: <ngottlieb at marinercapital.com> To: <R-help at stat.math.ethz.ch> Sent: Wednesday, June 13, 2007 11:46 AM Subject: [R] Formatted Data File Question for Clustering -Quickie Project>I am trying to learn how to format Ascii data files for scan or read > into R. > > Precisely for a quickie project, I found some code (at end of this > email) to do exactly what I need: > To cluster and graph a dendrogram from package (stats). > > I am stuck on how to format a text file to run the script. > I looked at the dataset USArrests (which would be replaced by my data > and labels) using UltraEdit. That data appears to be in binary format > and I would simply like a readable ASCII text file. > > How can I: > A) format this data to a file for the script below? > B) I would like to use squared Euclidean distance, can hclust support > this? > > Thanks, > Neil Gottlieb > > Here is sub-set example of my data set, return series to cluster: 13 > cases by 36 observations): > Month Convertible Arbitrage Dedicated Short Bias Emerging > Markets > 1/31/1994 0.004 -0.016 0.105 > 2/28/1994 0.002 0.020 -0.011 > 3/31/1994 -0.010 0.072 -0.046 > 4/30/1994 -0.025 0.013 -0.084 > 5/31/1994 -0.010 0.023 -0.007 > 6/30/1994 0.002 0.064 0.005 > 7/31/1994 0.001 -0.012 0.058 > 8/31/1994 0.000 -0.057 0.164 > 9/30/1994 -0.012 0.016 0.052 > 10/31/1994 -0.014 -0.004 -0.035 > 11/30/1994 -0.002 0.030 -0.014 > 12/31/1994 -0.019 -0.002 -0.042 > 1/31/1995 -0.006 0.013 -0.100 > 2/28/1995 0.012 -0.022 -0.079 > 3/31/1995 0.013 0.004 -0.055 > 4/30/1995 0.023 -0.004 0.073 > 5/31/1995 0.017 -0.013 0.013 > 6/30/1995 0.019 -0.069 0.008 > 7/31/1995 0.009 -0.059 0.022 > 8/31/1995 0.008 0.008 0.010 > 9/30/1995 0.011 -0.029 0.019 > 10/31/1995 0.013 0.064 -0.057 > 11/30/1995 0.023 -0.010 -0.031 > 12/31/1995 0.014 0.049 0.007 > 1/31/1996 0.021 0.006 0.079 > 2/29/1996 0.012 -0.056 -0.006 > 3/31/1996 0.015 -0.009 -0.009 > 4/30/1996 0.013 -0.066 0.051 > 5/31/1996 0.016 0.000 0.045 > 6/30/1996 0.015 0.051 0.054 > 7/31/1996 0.014 0.098 -0.027 > 8/31/1996 0.013 -0.034 0.036 > 9/30/1996 0.011 -0.059 0.016 > 10/31/1996 0.014 0.043 0.017 > 11/30/1996 0.014 -0.029 0.026 > > Code Example from Help files: > hc <- hclust(dist(USArrests), "ave") > (dend1 <- as.dendrogram(hc)) # "print()" method > str(dend1) # "str()" method > str(dend1, max = 2) # only the first two sub-levels > > op <- par(mfrow= c(2,2), mar = c(5,2,1,4)) > plot(dend1) > ## "triangle" type and show inner nodes: > plot(dend1, nodePar=list(pch = c(1,NA), cex=0.8, lab.cex = 0.8), > type = "t", center=TRUE) > plot(dend1, edgePar=list(col = 1:2, lty = 2:3), dLeaf=1, edge.root > TRUE) > plot(dend1, nodePar=list(pch = 2:1,cex=.4*2:1, col = 2:3), horiz=TRUE) > > dend2 <- cut(dend1, h=70) > plot(dend2$upper) > ## leafs are wrong horizontally: > plot(dend2$upper, nodePar=list(pch = c(1,7), col = 2:1)) > ## dend2$lower is *NOT* a dendrogram, but a list of .. : > plot(dend2$lower[[3]], nodePar=list(col=4), horiz = TRUE, type = "tr") > ## "inner" and "leaf" edges in different type & color : > plot(dend2$lower[[2]], nodePar=list(col=1),# non empty list > edgePar = list(lty=1:2, col=2:1), edge.root=TRUE) > par(op) > str(d3 <- dend2$lower[[2]][[2]][[1]]) > > nP <- list(col=3:2, cex=c(2.0, 0.75), pch= 21:22, bg= c("light blue", > "pink"), > lab.cex = 0.75, lab.col = "tomato") > plot(d3, nodePar= nP, edgePar = list(col="gray", lwd=2), horiz = TRUE) > addE <- function(n) { > if(!is.leaf(n)) { > attr(n, "edgePar") <- list(p.col="plum") > attr(n, "edgetext") <- paste(attr(n,"members"),"members") > } > n > } > d3e <- dendrapply(d3, addE) > plot(d3e, nodePar= nP) > plot(d3e, nodePar= nP, leaflab = "textlike") > -------------------------------------------------------- > > > > This information is being sent at the recipient's request or...{{dropped}} > > ______________________________________________ > R-help at stat.math.ethz.ch mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide > http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code.
Vladimir Eremeev
2007-Jun-14 07:41 UTC
[R] Formatted Data File Question for Clustering -Quickie Project
The "R Data Import/Export" guide was mentioned already, it contains everything you should know about data exchange between R and other software. In case it says nothing about dates, try as.Date() and strftime(). For your example below, as.Date("1/31/1994",format="%m/%d/%Y") works. ngottlieb wrote:> > I am trying to learn how to format Ascii data files for scan or read > into R. > > Precisely for a quickie project, I found some code (at end of this > email) to do exactly what I need: > To cluster and graph a dendrogram from package (stats). > > I am stuck on how to format a text file to run the script. > I looked at the dataset USArrests (which would be replaced by my data > and labels) using UltraEdit. That data appears to be in binary format > and I would simply like a readable ASCII text file. > > How can I: > A) format this data to a file for the script below? > B) I would like to use squared Euclidean distance, can hclust support > this? > > Thanks, > Neil Gottlieb > > Here is sub-set example of my data set, return series to cluster: 13 > cases by 36 observations): > Month Convertible Arbitrage Dedicated Short Bias Emerging > Markets > 1/31/1994 0.004 -0.016 0.105 > 2/28/1994 0.002 0.020 -0.011 > 3/31/1994 -0.010 0.072 -0.046 > > [skip] >-- View this message in context: http://www.nabble.com/Formatted-Data-File-Question-for-Clustering--Quickie-Project-tf3915926.html#a11115436 Sent from the R help mailing list archive at Nabble.com.