Dear list,
I'm brand new to R (started using it a few days ago...), so sorry for a
possibly stupid question.
Anyway, I'm using R to cluster my data. I have the dissimilarity matrix as a
text file, numbers separated by spaces; at its largest it is something
like a 2300x2300 matrix.
Now, the process of importing the matrix into R seems rather slow: for the
peak size of 2300x2300 it takes almost two hours, while the clustering
itself takes a minimum of time by comparison. My machine is a 900MHz PC
with 256MB of memory running Linux (RH7.1), and the R version is
"Version 1.4.0 (2001-12-19)".
I have tried to follow all the recommendations I found in the documentation,
so I do something like this (the file consists of 2300 rows, each containing
2300 real numbers separated by spaces, and nothing else):
__________________________
library(cluster)
CC <- c("numeric")
T1 <- read.table("matrix", nrows=2300, colClasses=CC)
T2 <- as.dist(T1)
rm(T1)                                  # free the data frame
T3 <- agnes(T2, diss=TRUE)
write.table(T3$merge, file=outfile, quote=FALSE)  # outfile set elsewhere
___________________________
The CC vector contains "numeric" only once, as I read that the values are
"recycled"...
So, is there any room for improvement? Any way to make the data import
quicker?
Thanks a lot.
Best regards,
Filip
--
-----------------------------------------------------------------
Filip Ginter
Ph.D. student
Email: ginter at cs.utu.fi
Phone: +358-2-2154078
Office: 4122, 4th floor
ICQ: 146959496
Turku Centre for Computer Science
Lemminkäisenkatu 14A
20520 Turku
Finland
There is an extra problem with R 1.4.0 (only), but this would always be
slow. You are trying to read in a numeric matrix, but read.table is
designed to read in a data frame, for which each of the 2300 columns could
be a different type, so lots of housekeeping is inevitable.
I think you need
T2 <- as.dist(matrix(scan("matrix"), 2300, 2300, byrow=T))
and if the matrix is symmetric you can omit `byrow=T'.
(One recommendation you missed: using `comment.char = ""' will be
appreciably faster.)
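For reference, a sketch of the original read.table call with that option
added (same file and dimensions as in the script above):

T1 <- read.table("matrix", nrows=2300, colClasses="numeric",
                 comment.char="")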
Even so, processing 5 million text strings to form a 40Mb object will not
be instant: you would do better to write a binary file and read it with
readBin. You could also write only one triangle of the matrix if it is
symmetric.
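A minimal sketch of that binary route, assuming the matrix can be converted
once and re-read in later sessions (the file names are just placeholders):

# one-time conversion: parse the text file, then save as raw doubles
m <- matrix(scan("matrix"), 2300, 2300, byrow=T)
con <- file("matrix.bin", "wb")
writeBin(as.vector(m), con)
close(con)
# if the matrix is symmetric, writing one triangle halves the file:
#   writeBin(m[lower.tri(m)], con)

# later sessions: read the doubles back directly, no text parsing
con <- file("matrix.bin", "rb")
m2 <- matrix(readBin(con, what="numeric", n=2300*2300), 2300, 2300)
close(con)
T2 <- as.dist(m2)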
On Thu, 24 Jan 2002, Filip Ginter wrote:
> I'm using R to cluster my data. I have the dissimilarity matrix as a
> text file, numbers separated by spaces; at its largest it is something
> like a 2300x2300 matrix. The process of importing the matrix into R
> seems rather slow: for the peak size it takes almost two hours. [...]
>
> So, is there any room for improvement? Any way to make the data import
> quicker?
--
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel: +44 1865 272861 (self)
1 South Parks Road,                    +44 1865 272860 (secr)
Oxford OX1 3TG, UK                Fax: +44 1865 272595
Perhaps

mimatriz <- matrix(scan("directory_and_file_names"), byrow=TRUE, ncol=2300)

is more efficient. Be sure you only have numbers in the file; otherwise
you need

mimatriz <- matrix(scan("directory_and_file_names", what=""),
                   byrow=TRUE, ncol=2300)

and then mimatriz will be character. You can then subset the numeric
columns and/or rows and apply as.numeric.
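A minimal sketch of that last step, assuming (hypothetically) that the
first entry of each row is a label and the remaining 2300 are the numbers:

# everything comes in as character: one label + 2300 values per row
raw <- matrix(scan("directory_and_file_names", what=""),
              byrow=TRUE, ncol=2301)
vals <- raw[, -1]             # drop the label column
mode(vals) <- "numeric"       # convert in place, keeping dimensions
mimatriz <- vals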
Agus
Dr. Agustin Lobo
Instituto de Ciencias de la Tierra (CSIC)
Lluis Sole Sabaris s/n
08028 Barcelona SPAIN
tel 34 93409 5410
fax 34 93411 0012
alobo at ija.csic.es
On Thu, 24 Jan 2002, Filip Ginter wrote:
> I'm using R to cluster my data. I have the dissimilarity matrix as a
> text file, numbers separated by spaces; at its largest it is something
> like a 2300x2300 matrix. The process of importing the matrix into R
> seems rather slow: for the peak size it takes almost two hours. [...]
>
> So, is there any room for improvement? Any way to make the data import
> quicker?
-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject!) To: r-help-request at stat.math.ethz.ch
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._