Iris Kolder
2007-Jan-10 12:44 UTC
[R] Fw: Memory problem on a linux cluster using a large data set [Broadcast]
Hi,

I listened to all your advice and ran my data on a computer with a 64-bit
processor, but I still get the same error saying "it cannot allocate a
vector of that size 1240 kb". I don't want to cut my data into smaller
pieces because we are looking at interactions. So are there any other
options for me to try, or should I wait for the development of more
advanced computers?

Thanks,

Iris

----- Forwarded Message ----
From: Iris Kolder <iriskolder@yahoo.com>
To: r-help@stat.math.ethz.ch
Sent: Thursday, December 21, 2006 2:07:08 PM
Subject: Re: [R] Memory problem on a linux cluster using a large data set [Broadcast]

Thank you all for your help!

So with all your suggestions we will try to run it on a computer with a
64-bit processor. But I've been told that the new R versions all work on a
32-bit processor. I read in other posts that only the old R versions could
handle larger data sets and ran on 64-bit processors. I also read that they
are adapting the new R version for 64-bit processors again, so does anyone
know if there is a version available that we could use?

Iris Kolder

----- Original Message ----
From: "Liaw, Andy" <andy_liaw@merck.com>
To: Martin Morgan <mtmorgan@fhcrc.org>; Iris Kolder <iriskolder@yahoo.com>
Cc: r-help@stat.math.ethz.ch; N.C. Onland-moret <n.c.onland@umcutrecht.nl>
Sent: Monday, December 18, 2006 7:48:23 PM
Subject: RE: [R] Memory problem on a linux cluster using a large data set [Broadcast]

In addition to my off-list reply to Iris (pointing her to an old post of
mine that detailed the memory requirements of RF in R), she might consider
the following:

- Use a larger nodesize
- Use sampsize to control the size of the bootstrap samples

Both of these have the effect of reducing the sizes of the trees grown. For
a data set that large, it may not matter to grow smaller trees.
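For example, something along these lines (an untested sketch; x and y here
stand for the predictor matrix and the class factor, and the particular
values of nodesize and sampsize are only illustrative):

  library(randomForest)

  rf <- randomForest(x, y,
                     ntree    = 500,
                     nodesize = 50,                      # default is 1 for classification;
                                                         # larger values give smaller trees
                     sampsize = ceiling(0.1 * nrow(x)),  # e.g. 10% of the rows per tree
                     do.trace = 100)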
Still, with data of that size, I'd say 64-bit is the better solution.

Cheers,
Andy

From: Martin Morgan

> Iris --
>
> I hope the following helps; I think you have too much data for a 32-bit
> machine.
>
> Martin
>
> Iris Kolder <iriskolder@yahoo.com> writes:
>
> > Hello,
> >
> > I have a large data set, 320,000 rows and 1000 columns. All the data
> > has the values 0, 1, 2.
>
> It seems like a single copy of this data set will be at least a couple
> of gigabytes (see the arithmetic sketch at the end of this message); I
> think you'll have access to only 4 GB on a 32-bit machine (see section 8
> of the R Installation and Administration guide), and R will probably end
> up, even in the best of situations, making at least a couple of copies
> of your data. Probably you'll need a 64-bit machine, or figure out
> algorithms that work on chunks of data.
>
> > on a linux cluster with R version R 2.1.0. which operates on a 32
>
> This is quite old, and in general it seems like R has become more
> sensitive to big-data issues and to tracking down unnecessary memory
> copying.
>
> > "cannot allocate vector size 1240 kb". I've searched through
>
> Use traceback() or options(error=recover) to figure out where this is
> actually occurring.
>
> > SNP <- read.table("file.txt", header=FALSE, sep="")  # read in data file
>
> This makes a data.frame, and data frames have several aspects (e.g.,
> automatic creation of row names on sub-setting) that can be problematic
> in terms of memory use. Probably better to use a matrix, for which:
>
>   'read.table' is not the right tool for reading large matrices,
>   especially those with many columns: it is designed to read _data
>   frames_ which may have columns of very different classes. Use
>   'scan' instead.
>
> (from the help page for read.table). I'm not sure of the details of the
> algorithms you'll invoke, but it might be a false economy to try to get
> scan to read in 'small' versions (e.g., integer rather than numeric) of
> the data -- the algorithms might insist on numeric data, and then make a
> copy during coercion from your small version to numeric.
>
> > SNP$total.NAs = rowSums(is.na(SNP))  # count the NAs per row and add a column with the totals
>
> This adds a column to the data.frame or matrix, probably causing at
> least one copy of the entire data. Create a separate vector instead,
> even though this unties the coordination between columns that a data
> frame provides.
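>
> Something along these lines might work (an untested sketch; it assumes
> 1000 whitespace-separated values per row, with 9 coding a missing value,
> and reuses the file name from your script):
>
>   nc  <- 1000                                # number of columns, from your description
>   SNP <- matrix(scan("file.txt", what = integer(), na.strings = "9"),
>                 ncol = nc, byrow = TRUE)     # a plain matrix, no data-frame overhead
>                                              # (integer halves the size, but note the
>                                              # coercion caveat above)
>   na.per.row <- rowSums(is.na(SNP))          # a separate vector, no extra column in SNP
>   SNP <- SNP[na.per.row < 46, ]              # keep rows with fewer than 46 (~5%) NAs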
>
> > SNP = t(as.matrix(SNP))  # transpose rows and columns
>
> This will also probably trigger a copy;
>
> > snp.na<-SNP
>
> R might be clever enough to figure out that this simple assignment does
> not trigger a copy. But it probably means that any subsequent
> modification of snp.na or SNP *will* trigger a copy, so avoid the
> assignment if possible.
>
> > snp.roughfix<-na.roughfix(snp.na)
> > fSNP<-factor(snp.roughfix[, 1])  # assigns factor to case-control status
> >
> > snp.narf<- randomForest(snp.roughfix[,-1], fSNP,
> > na.action=na.roughfix, ntree=500, mtry=10, importance=TRUE,
> > keep.forest=FALSE, do.trace=100)
>
> Now you're entirely in the hands of randomForest. If memory problems
> occur here, perhaps you'll have gained enough experience to point the
> package maintainer to the problem and suggest a possible solution.
>
> > set it should be able to cope with that amount. Perhaps someone has
> > tried this before in R or is Fortran a better choice? I added my R
>
> If you mean a pure Fortran solution, including coding the random forest
> algorithm, then of course you have complete control over memory
> management. You'd still likely be limited to addressing 4 GB of memory.
>
> > I wrote a script to remove all the rows with more than 46 missing
> > values. This works perfectly on a smaller dataset, but the problem
> > arises when I try to run it on the larger data set: I get an error
> > "cannot allocate vector size 1240 kb". I've searched through previous
> > posts and found out that it might be because I'm running it on a linux
> > cluster with R version 2.1.0, which operates on a 32-bit processor.
> > But I could not find a solution for this problem. The cluster is a
> > really fast one and should be able to cope with these large amounts of
> > data; the system's configuration is Speed: 3.4 GHz, memory: 4 GByte.
> > Is there a way to change the settings or processor under R? I want to
> > run the randomForest function on my large data set; it should be able
> > to cope with that amount. Perhaps someone has tried this before in R,
> > or is Fortran a better choice? I added my R script down below.
> >
> > Best regards,
> >
> > Iris Kolder
> >
> > SNP <- read.table("file.txt", header=FALSE, sep="")  # read in data file
> > SNP[SNP==9] <- NA                      # change missing values from a 9 to NA
> > SNP$total.NAs = rowSums(is.na(SNP))    # count the NAs per row and add a column with the totals
> > SNP = SNP[ SNP$total.NAs < 46, ]       # keep rows with fewer than 46 (~5%) NAs
> > SNP$total.NAs = NULL                   # remove the added column with the sum of NAs
> > SNP = t(as.matrix(SNP))                # transpose rows and columns
> > set.seed(1)
> > snp.na <- SNP
> > snp.roughfix <- na.roughfix(snp.na)
> > fSNP <- factor(snp.roughfix[, 1])      # assigns factor to case-control status
> >
> > snp.narf <- randomForest(snp.roughfix[,-1], fSNP,
> >                          na.action=na.roughfix, ntree=500, mtry=10,
> >                          importance=TRUE, keep.forest=FALSE, do.trace=100)
> >
> > print(snp.narf)
>
> --
> Martin T. Morgan
> Bioconductor / Computational Biology
> http://bioconductor.org
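As a rough check on the "couple of gigabytes" estimate above (plain
arithmetic only; R stores a numeric value in 8 bytes and an integer in 4):

  rows <- 320000; cols <- 1000
  rows * cols * 8 / 2^30   # about 2.4 GB for a single numeric copy
  rows * cols * 4 / 2^30   # about 1.2 GB if the data are kept as integer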
Prof Brian Ripley
2007-Jan-10 15:09 UTC
[R] Fw: Memory problem on a linux cluster using a large data set [Broadcast]
On Wed, 10 Jan 2007, Iris Kolder wrote:

> Hi
>
> I listened to all your advice and ran my data on a computer with a
> 64-bit processor, but I still get the same error saying "it cannot
> allocate a vector of that size 1240 kb". I don't want to cut my data
> into smaller pieces because we are looking at interactions. So are there
> any other options for me to try, or should I wait for the development of
> more advanced computers?
>
> Thanks,
>
> Iris

Did you use a 64-bit build of R on that machine? If the message is the
same, I strongly suspect not. 64-bit builds are not the default on most
OSes.
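A quick way to check from within R (a sketch; both values are available in
any standard R session):

  .Machine$sizeof.pointer   # 8 on a 64-bit build of R, 4 on a 32-bit build
  R.version$arch            # e.g. "x86_64" for a 64-bit build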
--
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595