Iris Kolder
2006-Dec-21 13:07 UTC
[R] Memory problem on a linux cluster using a large data set [Broadcast]
Thank you all for your help!

So with all your suggestions we will try to run it on a computer with a
64-bit processor. But I've been told that the new R versions all work on a
32-bit processor. I read in other posts that only the old R versions were
capable of handling larger data sets and ran under 64-bit processors. I
also read that they are adapting the new R version for 64-bit processors
again, so does anyone know if there is a version available that we could
use?

Iris Kolder

----- Original Message ----
From: "Liaw, Andy" <andy_liaw@merck.com>
To: Martin Morgan <mtmorgan@fhcrc.org>; Iris Kolder <iriskolder@yahoo.com>
Cc: r-help@stat.math.ethz.ch; N.C. Onland-moret <n.c.onland@umcutrecht.nl>
Sent: Monday, December 18, 2006 7:48:23 PM
Subject: RE: [R] Memory problem on a linux cluster using a large data set [Broadcast]

In addition to my off-list reply to Iris (pointing her to an old post of
mine that detailed the memory requirements of RF in R), she might consider
the following:

- Use a larger nodesize
- Use sampsize to control the size of the bootstrap samples

Both of these have the effect of reducing the sizes of the trees grown.
For a data set that large, it may not matter to grow smaller trees. Still,
with data of that size, I'd say 64-bit is the better solution.

Cheers,
Andy

From: Martin Morgan

> Iris --
>
> I hope the following helps; I think you have too much data for a 32-bit
> machine.
>
> Martin
>
> Iris Kolder <iriskolder@yahoo.com> writes:
>
> > Hello,
> >
> > I have a large data set of 320,000 rows and 1000 columns. All the data
> > has the values 0, 1, 2.
>
> It seems like a single copy of this data set will be at least a couple of
> gigabytes; I think you'll have access to only 4 GB on a 32-bit machine
> (see section 8 of the R Installation and Administration guide), and R
> will probably end up, even in the best of situations, making at least a
> couple of copies of your data. Probably you'll need a 64-bit machine, or
> figure out algorithms that work on chunks of data.
>
> > on a linux cluster with R version R 2.1.0, which operates on a 32
>
> This is quite old, and in general more recent R has become more sensitive
> to big-data issues and to tracking down unnecessary memory copying.
>
> > "cannot allocate vector size 1240 kb". I've searched through
>
> Use traceback() or options(error=recover) to figure out where this is
> actually occurring.
>
> > SNP <- read.table("file.txt", header=FALSE, sep="")  # read in data file
>
> This makes a data.frame, and data frames have several aspects (e.g.,
> automatic creation of row names on sub-setting) that can be problematic
> in terms of memory use. Probably better to use a matrix, for which:
>
>     'read.table' is not the right tool for reading large matrices,
>     especially those with many columns: it is designed to read _data
>     frames_ which may have columns of very different classes. Use
>     'scan' instead.
>
> (from the help page for read.table). I'm not sure of the details of the
> algorithms you'll invoke, but it might be a false economy to try to get
> scan to read in 'small' versions (e.g., integer, rather than numeric) of
> the data -- the algorithms might insist on numeric data, and then make a
> copy during coercion from your small version to numeric.
>
> > SNP$total.NAs = rowSums(is.na(SNP))  # calculate the number of NAs per row and add a column with the total
>
> This adds a column to the data.frame or matrix, probably causing at least
> one copy of the entire data. Create a separate vector instead, even
> though this unties the coordination between columns that a data frame
> provides.
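As an illustration of the advice above (scan() into a matrix, with the NA
counts kept in a separate vector), here is a minimal sketch; the file name,
the 1000-column width, the 9-as-missing coding and the 46-NA cutoff are
taken from Iris's script, everything else is an assumption, and as noted
above integer storage may be a false economy if the downstream algorithm
coerces to numeric anyway:

    n.col <- 1000
    v <- scan("file.txt", what = integer(), quiet = TRUE)  # one long integer vector
    SNP <- matrix(v, ncol = n.col, byrow = TRUE)           # rebuild the rows x columns layout
    rm(v)                                                  # release the intermediate vector

    SNP[SNP == 9] <- NA                # 9 codes a missing value
    n.na <- rowSums(is.na(SNP))        # per-row NA count as a separate vector, not a new column
    SNP <- SNP[n.na < 46, ]            # keep rows with fewer than 46 NAs
    SNP <- t(SNP)                      # transpose, as in the original script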
>
> > SNP = t(as.matrix(SNP))  # transpose rows and columns
>
> This will also probably trigger a copy;
>
> > snp.na <- SNP
>
> R might be clever enough to figure out that this simple assignment does
> not trigger a copy. But it probably means that any subsequent
> modification of snp.na or SNP *will* trigger a copy, so avoid the
> assignment if possible.
>
> > snp.roughfix <- na.roughfix(snp.na)
> > fSNP <- factor(snp.roughfix[, 1])  # assigns factor to case-control status
> >
> > snp.narf <- randomForest(snp.roughfix[,-1], fSNP,
> >     na.action=na.roughfix, ntree=500, mtry=10, importance=TRUE,
> >     keep.forest=FALSE, do.trace=100)
>
> Now you're entirely in the hands of randomForest. If memory problems
> occur here, perhaps you'll have gained enough experience to point the
> package maintainer to the problem and suggest a possible solution.
>
> > set it should be able to cope with that amount. Perhaps someone has
> > tried this before in R or is Fortran a better choice? I added my R
>
> If you mean a pure Fortran solution, including coding the random forest
> algorithm, then of course you have complete control over memory
> management. You'd still likely be limited to addressing 4 GB of memory.
>
> > I wrote a script to remove all the rows with more than 46 missing
> > values. This works perfectly on a smaller dataset. But the problem
> > arises when I try to run it on the larger data set: I get an error
> > "cannot allocate vector size 1240 kb". I've searched through previous
> > posts and found out that it might be because I'm running it on a linux
> > cluster with R version 2.1.0, which operates on a 32-bit processor.
> > But I could not find a solution for this problem. The cluster is a
> > really fast one and should be able to cope with these large amounts of
> > data; the system's configuration is 3.4 GHz speed and 4 GB memory. Is
> > there a way to change the settings or processor under R? I want to run
> > the function randomForest on my large data set; it should be able to
> > cope with that amount. Perhaps someone has tried this before in R, or
> > is Fortran a better choice? I added my R script down below.
> >
> > Best regards,
> >
> > Iris Kolder
> >
> > SNP <- read.table("file.txt", header=FALSE, sep="")  # read in data file
> > SNP[SNP==9] <- NA                    # change missing values from a 9 to a NA
> > SNP$total.NAs = rowSums(is.na(SNP))  # calculate the number of NAs per row and add a column with the total
> > SNP = SNP[SNP$total.NAs < 46, ]      # create a subset with no more than 5% (46) NAs
> > SNP$total.NAs = NULL                 # remove the added column with the sum of NAs
> > SNP = t(as.matrix(SNP))              # transpose rows and columns
> > set.seed(1)
> > snp.na <- SNP
> > snp.roughfix <- na.roughfix(snp.na)
> > fSNP <- factor(snp.roughfix[, 1])    # assigns factor to case-control status
> >
> > snp.narf <- randomForest(snp.roughfix[,-1], fSNP,
> >     na.action=na.roughfix, ntree=500, mtry=10, importance=TRUE,
> >     keep.forest=FALSE, do.trace=100)
> >
> > print(snp.narf)
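A hypothetical variant of the randomForest() call in the script above,
illustrating Andy's nodesize/sampsize suggestion; the particular values
(nodesize = 50, sampsize = 10000) are placeholders rather than
recommendations, and the objects snp.roughfix and fSNP come from the
script:

    library(randomForest)

    snp.narf <- randomForest(snp.roughfix[, -1], fSNP,
                             ntree    = 500,
                             mtry     = 10,
                             nodesize = 50,     # larger terminal nodes => smaller trees
                             sampsize = 10000,  # draw 10000 cases per tree instead of all rows
                             importance  = TRUE,
                             keep.forest = FALSE,
                             do.trace    = 100)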
--
Martin T. Morgan
Bioconductor / Computational Biology
http://bioconductor.org
Thomas Lumley
2006-Dec-21 16:07 UTC
[R] Memory problem on a linux cluster using a large data set [Broadcast]
On Thu, 21 Dec 2006, Iris Kolder wrote:

> Thank you all for your help!
>
> So with all your suggestions we will try to run it on a computer with a
> 64-bit processor. But I've been told that the new R versions all work on
> a 32-bit processor. I read in other posts that only the old R versions
> were capable of handling larger data sets and ran under 64-bit
> processors. I also read that they are adapting the new R version for
> 64-bit processors again, so does anyone know if there is a version
> available that we could use?

Huh? R 2.4.x runs perfectly happily accessing large memory under Linux on
64-bit processors (and Solaris, and probably others). I think it even
works on Mac OS X now. For example:

> x <- rnorm(1e9)
> gc()
             used   (Mb) gc trigger   (Mb)   max used   (Mb)
Ncells     222881   12.0     467875   25.0     350000   18.7
Vcells 1000115046 7630.3 1000475743 7633.1 1000115558 7630.3

-thomas

Thomas Lumley
Assoc. Professor, Biostatistics
tlumley at u.washington.edu
University of Washington, Seattle
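A quick way to check which build you are actually running (a small aside;
pointers are 8 bytes on a 64-bit build of R and 4 bytes on a 32-bit one):

    .Machine$sizeof.pointer   # 8 => 64-bit build, 4 => 32-bit build
    R.version$arch            # e.g. "x86_64" on a 64-bit Linux build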