Dear list,

I'm brand new to R (I started using it a few days ago), so my apologies if this is a stupid question.

Anyway, I'm using R to cluster my data. I have the dissimilarity matrix as a text file, with the numbers separated by spaces. At its largest it is roughly a 2300x2300 matrix.

Now, it seems to me that the process of importing the matrix into R is rather slow. For the peak size of 2300x2300 it takes almost two hours; the clustering itself takes very little time compared to importing the data. I have a 900MHz PC with 256MB of memory running Linux (RH7.1). The version of R is "Version 1.4.0 (2001-12-19)".

I have tried to follow all the recommendations I found in the documentation, so I do something like this (the file consists of 2300 rows, each containing 2300 real numbers separated by spaces, and nothing else):

__________________________

library(cluster)
CC <- c("numeric")
T1 <- read.table("matrix", nrows=2300, colClasses=CC)
T2 <- as.dist(T1)
rm(T1)
T3 <- agnes(T2, diss=TRUE)
write.table(T3$merge, file=outfile, quote=FALSE)

___________________________

The CC vector contains "numeric" only once, as I read that the values are "recycled".

So, is there any room for improvement? Any way to make the data import quicker?

Thanks a lot.

Best regards,

Filip

-----------------------------------------------------------------
Filip Ginter
Ph.D. student

Email: ginter at cs.utu.fi
Phone: +358-2-2154078
Office: 4122, 4th floor
ICQ: 146959496

Turku Centre for Computer Science
Lemminkäisenkatu 14A
20520 Turku
Finland
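The import step can be timed in isolation with system.time, which makes it easy to compare alternatives before committing to one. A minimal sketch, using the file name from the script above:

    # time just the read, leaving the clustering out of the measurement
    t.read <- system.time(T1 <- read.table("matrix", nrows = 2300,
                                           colClasses = "numeric"))
    t.read   # prints user, system and elapsed times (in seconds)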
There is an extra problem with R 1.4.0 (only), but this would always be slow. You are trying to read in a numeric matrix, but read.table is designed to read in a data frame, for which each of the 2300 columns could be a different type, so lots of housekeeping is inevitable.

I think you need

    T2 <- as.dist(matrix(scan("matrix"), 2300, 2300, byrow=TRUE))

and if the matrix is symmetric you can omit `byrow=TRUE'. (One recommendation you missed: using `comment.char = ""' will be appreciably faster.)

Even so, processing 5 million text strings to form a 40Mb object will not be instant: you would do better to write a binary file and read it with readBin. You could also write only one triangle of the matrix if it is symmetric.

On Thu, 24 Jan 2002, Filip Ginter wrote:

> [...]

--
Brian D. Ripley, ripley at stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272860 (secr)
Oxford OX1 3TG, UK, Fax: +44 1865 272595
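A minimal sketch of the binary-file route suggested above, assuming the matrix has already been read once via scan as shown; the file name "matrix.bin" is made up for the example:

    # one-off conversion: dump the numbers as raw doubles
    m <- matrix(scan("matrix"), 2300, 2300, byrow = TRUE)
    con <- file("matrix.bin", "wb")      # hypothetical file name
    writeBin(as.vector(m), con)          # if symmetric, m[lower.tri(m)] halves the file
    close(con)

    # subsequent runs: read the doubles straight back, no text parsing
    con <- file("matrix.bin", "rb")
    m2 <- matrix(readBin(con, what = "numeric", n = 2300 * 2300), 2300, 2300)
    close(con)
    T2 <- as.dist(m2)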
Perhaps

    mimatriz <- matrix(scan("directory_and_file_names"), byrow=TRUE, ncol=2300)

is more efficient. Be sure the file contains only numbers; if it also contains names, you need

    mimatriz <- matrix(scan("directory_and_file_names", what=""), byrow=TRUE, ncol=2300)

and then mimatriz will be character. You then subset the numeric columns and/or rows and apply as.numeric (see the sketch below).

Agus

Dr. Agustin Lobo
Instituto de Ciencias de la Tierra (CSIC)
Lluis Sole Sabaris s/n
08028 Barcelona SPAIN
tel 34 93409 5410
fax 34 93411 0012
alobo at ija.csic.es

On Thu, 24 Jan 2002, Filip Ginter wrote:

> [...]
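A sketch of that character route, under the hypothetical assumption that each row carries one leading label followed by the 2300 numbers (so 2301 fields per row):

    # hypothetical layout: read everything as character, then split
    # the labels off and convert the rest back to numeric
    raw <- matrix(scan("directory_and_file_names", what = ""),
                  byrow = TRUE, ncol = 2301)        # 2301 = label + 2300 values
    labels   <- raw[, 1]                            # the character column
    mimatriz <- matrix(as.numeric(raw[, -1]), nrow = 2300)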