Iris Kolder
2006-Dec-21 13:07 UTC
[R] Memory problem on a linux cluster using a large data set [Broadcast]
Thank you all for your help!

So with all your suggestions we will try to run it on a computer with a
64-bit processor. But I've been told that the new R versions all work on a
32-bit processor. I read in other posts that only the old R versions were
capable of handling larger data sets and ran under 64-bit processors. I
also read that they are adapting the new R version for 64-bit processors
again, so does anyone know if there is a version available that we could
use?

Iris Kolder

----- Original Message ----
From: "Liaw, Andy" <andy_liaw@merck.com>
To: Martin Morgan <mtmorgan@fhcrc.org>; Iris Kolder <iriskolder@yahoo.com>
Cc: r-help@stat.math.ethz.ch; N.C. Onland-moret <n.c.onland@umcutrecht.nl>
Sent: Monday, December 18, 2006 7:48:23 PM
Subject: RE: [R] Memory problem on a linux cluster using a large data set [Broadcast]

In addition to my off-list reply to Iris (pointing her to an old post of
mine that detailed the memory requirements of RF in R), she might consider
the following:

- Use a larger nodesize
- Use sampsize to control the size of the bootstrap samples

Both of these have the effect of reducing the sizes of the trees grown.
For a data set that large, it may not matter to grow smaller trees. Still,
with data of that size, I'd say 64-bit is the better solution.

Cheers,
Andy

From: Martin Morgan

> Iris --
>
> I hope the following helps; I think you have too much data for a 32-bit
> machine.
>
> Martin
>
> Iris Kolder <iriskolder@yahoo.com> writes:
>
> > Hello,
> >
> > I have a large data set of 320,000 rows and 1000 columns. All the data
> > has the values 0, 1, 2.
>
> It seems like a single copy of this data set will be at least a couple of
> gigabytes; I think you'll have access to only 4 GB on a 32-bit machine
> (see section 8 of the R Installation and Administration guide), and R
> will probably end up, even in the best of situations, making at least a
> couple of copies of your data. Probably you'll need a 64-bit machine, or
> figure out algorithms that work on chunks of data.
>
> > on a linux cluster with R version R 2.1.0, which operates on a 32
>
> This is quite old, and in general more recent R has become more sensitive
> to big-data issues and to tracking down unnecessary memory copying.
>
> > "cannot allocate vector size 1240 kb". I've searched through
>
> Use traceback() or options(error=recover) to figure out where this is
> actually occurring.
>
> > SNP <- read.table("file.txt", header=FALSE, sep="")  # read in data file
>
> This makes a data.frame, and data frames have several aspects (e.g.,
> automatic creation of row names on sub-setting) that can be problematic
> in terms of memory use. Probably better to use a matrix, for which:
>
>     'read.table' is not the right tool for reading large matrices,
>     especially those with many columns: it is designed to read _data
>     frames_ which may have columns of very different classes. Use
>     'scan' instead.
>
> (from the help page for read.table). I'm not sure of the details of the
> algorithms you'll invoke, but it might be a false economy to try to get
> scan to read in 'small' versions (e.g., integer, rather than numeric) of
> the data -- the algorithms might insist on numeric data, and then make a
> copy during coercion from your small version to numeric.
>
> > SNP$total.NAs = rowSums(is.na(SNP))  # calculate the number of NAs per row and add a column with the total
>
> This adds a column to the data.frame or matrix, probably causing at least
> one copy of the entire data. Create a separate vector instead, even
> though this unties the coordination between columns that a data frame
> provides.
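As an illustration of the advice above (scan() into a matrix, with the NA
counts kept in a separate vector), here is a minimal sketch; the file name,
the 1000-column width, the 9-as-missing coding and the 46-NA cutoff are
taken from Iris's script, everything else is an assumption, and as noted
above integer storage may be a false economy if the downstream algorithm
coerces to numeric anyway:

    n.col <- 1000
    v <- scan("file.txt", what = integer(), quiet = TRUE)  # one long integer vector
    SNP <- matrix(v, ncol = n.col, byrow = TRUE)           # rebuild the rows x columns layout
    rm(v)                                                  # release the intermediate vector

    SNP[SNP == 9] <- NA                # 9 codes a missing value
    n.na <- rowSums(is.na(SNP))        # per-row NA count as a separate vector, not a new column
    SNP <- SNP[n.na < 46, ]            # keep rows with fewer than 46 NAs
    SNP <- t(SNP)                      # transpose, as in the original script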
>
> > SNP = t(as.matrix(SNP))  # transpose rows and columns
>
> This will also probably trigger a copy;
>
> > snp.na <- SNP
>
> R might be clever enough to figure out that this simple assignment does
> not trigger a copy. But it probably means that any subsequent
> modification of snp.na or SNP *will* trigger a copy, so avoid the
> assignment if possible.
>
> > snp.roughfix <- na.roughfix(snp.na)
> > fSNP <- factor(snp.roughfix[, 1])  # assigns factor to case-control status
> >
> > snp.narf <- randomForest(snp.roughfix[,-1], fSNP,
> >     na.action=na.roughfix, ntree=500, mtry=10, importance=TRUE,
> >     keep.forest=FALSE, do.trace=100)
>
> Now you're entirely in the hands of randomForest. If memory problems
> occur here, perhaps you'll have gained enough experience to point the
> package maintainer to the problem and suggest a possible solution.
>
> > set it should be able to cope with that amount. Perhaps someone has
> > tried this before in R or is Fortran a better choice? I added my R
>
> If you mean a pure Fortran solution, including coding the random forest
> algorithm, then of course you have complete control over memory
> management. You'd still likely be limited to addressing 4 GB of memory.
>
> > I wrote a script to remove all the rows with more than 46 missing
> > values. This works perfectly on a smaller dataset. But the problem
> > arises when I try to run it on the larger data set: I get an error
> > "cannot allocate vector size 1240 kb". I've searched through previous
> > posts and found out that it might be because I'm running it on a linux
> > cluster with R version 2.1.0, which operates on a 32-bit processor.
> > But I could not find a solution for this problem. The cluster is a
> > really fast one and should be able to cope with these large amounts of
> > data; the system's configuration is 3.4 GHz speed and 4 GB memory. Is
> > there a way to change the settings or processor under R? I want to run
> > the function randomForest on my large data set; it should be able to
> > cope with that amount. Perhaps someone has tried this before in R, or
> > is Fortran a better choice? I added my R script down below.
> >
> > Best regards,
> >
> > Iris Kolder
> >
> > SNP <- read.table("file.txt", header=FALSE, sep="")  # read in data file
> > SNP[SNP==9] <- NA                    # change missing values from a 9 to a NA
> > SNP$total.NAs = rowSums(is.na(SNP))  # calculate the number of NAs per row and add a column with the total
> > SNP = SNP[SNP$total.NAs < 46, ]      # create a subset with no more than 5% (46) NAs
> > SNP$total.NAs = NULL                 # remove the added column with the sum of NAs
> > SNP = t(as.matrix(SNP))              # transpose rows and columns
> > set.seed(1)
> > snp.na <- SNP
> > snp.roughfix <- na.roughfix(snp.na)
> > fSNP <- factor(snp.roughfix[, 1])    # assigns factor to case-control status
> >
> > snp.narf <- randomForest(snp.roughfix[,-1], fSNP,
> >     na.action=na.roughfix, ntree=500, mtry=10, importance=TRUE,
> >     keep.forest=FALSE, do.trace=100)
> >
> > print(snp.narf)
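A hypothetical variant of the randomForest() call in the script above,
illustrating Andy's nodesize/sampsize suggestion; the particular values
(nodesize = 50, sampsize = 10000) are placeholders rather than
recommendations, and the objects snp.roughfix and fSNP come from the
script:

    library(randomForest)

    snp.narf <- randomForest(snp.roughfix[, -1], fSNP,
                             ntree    = 500,
                             mtry     = 10,
                             nodesize = 50,     # larger terminal nodes => smaller trees
                             sampsize = 10000,  # draw 10000 cases per tree instead of all rows
                             importance  = TRUE,
                             keep.forest = FALSE,
                             do.trace    = 100)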
--
Martin T. Morgan
Bioconductor / Computational Biology
http://bioconductor.org
Thomas Lumley
2006-Dec-21 16:07 UTC
[R] Memory problem on a linux cluster using a large data set [Broadcast]
On Thu, 21 Dec 2006, Iris Kolder wrote:

> Thank you all for your help!
>
> So with all your suggestions we will try to run it on a computer with a
> 64-bit processor. But I've been told that the new R versions all work on
> a 32-bit processor. I read in other posts that only the old R versions
> were capable of handling larger data sets and ran under 64-bit
> processors. I also read that they are adapting the new R version for
> 64-bit processors again, so does anyone know if there is a version
> available that we could use?

Huh? R 2.4.x runs perfectly happily accessing large memory under Linux on
64-bit processors (and Solaris, and probably others). I think it even
works on Mac OS X now. For example:

> x <- rnorm(1e9)
> gc()
             used   (Mb) gc trigger   (Mb)   max used   (Mb)
Ncells     222881   12.0     467875   25.0     350000   18.7
Vcells 1000115046 7630.3 1000475743 7633.1 1000115558 7630.3

-thomas

Thomas Lumley
Assoc. Professor, Biostatistics
tlumley at u.washington.edu
University of Washington, Seattle
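A quick way to check which build you are actually running (a small aside;
pointers are 8 bytes on a 64-bit build of R and 4 bytes on a 32-bit one):

    .Machine$sizeof.pointer   # 8 => 64-bit build, 4 => 32-bit build
    R.version$arch            # e.g. "x86_64" on a 64-bit Linux build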