Iris Kolder
2006-Dec-18 10:19 UTC
[R] Memory problem on a linux cluster using a large data set
Hello,

I have a large data set: 320,000 rows and 1000 columns. All the data has the values 0, 1, 2. I wrote a script to remove all the rows with more than 46 missing values. This works perfectly on a smaller data set, but the problem arises when I try to run it on the larger data set: I get an error "cannot allocate vector size 1240 kb". I've searched through previous posts and found out that it might be because I'm running it on a Linux cluster with R version 2.1.0, which operates on a 32-bit processor, but I could not find a solution for this problem. The cluster is a really fast one and should be able to cope with these large amounts of data; the system's configuration is a 3.4 GHz CPU with 4 GB of memory. Is there a way to change the settings or processor under R? I want to run the function randomForest on my large data set; it should be able to cope with that amount. Perhaps someone has tried this before in R, or is Fortran a better choice? I added my R script down below.

Best regards,

Iris Kolder

library(randomForest)                                # provides randomForest() and na.roughfix()
SNP <- read.table("file.txt", header=FALSE, sep="")  # read in data file
SNP[SNP == 9] <- NA                                  # recode missing values from 9 to NA
SNP$total.NAs <- rowSums(is.na(SNP))                 # count the NAs per row and add a column with the total
SNP <- SNP[SNP$total.NAs < 46, ]                     # keep rows with fewer than 46 (5%) NAs
SNP$total.NAs <- NULL                                # remove the added column of NA counts
SNP <- t(as.matrix(SNP))                             # transpose rows and columns
set.seed(1)
snp.na <- SNP
snp.roughfix <- na.roughfix(snp.na)
fSNP <- factor(snp.roughfix[, 1])                    # case/control status as a factor
snp.narf <- randomForest(snp.roughfix[, -1], fSNP, na.action=na.roughfix,
                         ntree=500, mtry=10, importance=TRUE,
                         keep.forest=FALSE, do.trace=100)
print(snp.narf)
Martin Morgan
2006-Dec-18 18:32 UTC
[R] Memory problem on a linux cluster using a large data set
Iris --

I hope the following helps; I think you have too much data for a 32-bit machine.

Martin

Iris Kolder <iriskolder at yahoo.com> writes:

> Hello,
>
> I have a large data set: 320,000 rows and 1000 columns. All the data
> has the values 0, 1, 2.

It seems like a single copy of this data set will be at least a couple of gigabytes; I think you'll have access to only 4 GB on a 32-bit machine (see section 8 of the R Installation and Administration guide), and R will probably end up, even in the best of situations, making at least a couple of copies of your data. Probably you'll need a 64-bit machine, or figure out algorithms that work on chunks of data.

> running it on a Linux cluster with R version 2.1.0, which operates on
> a 32-bit processor

This is quite old; more recent versions of R have become more attentive to big-data issues and to tracking down unnecessary memory copying.

> I get an error "cannot allocate vector size 1240 kb". I've searched
> through previous posts

Use traceback() or options(error=recover) to figure out where this is actually occurring.

> SNP <- read.table("file.txt", header=FALSE, sep="")  # read in data file

This makes a data.frame, and data frames have several aspects (e.g., automatic creation of row names on sub-setting) that can be problematic in terms of memory use. Probably better to use a matrix, for which:

    'read.table' is not the right tool for reading large matrices,
    especially those with many columns: it is designed to read _data
    frames_ which may have columns of very different classes. Use
    'scan' instead.

(from the help page for read.table). I'm not sure of the details of the algorithms you'll invoke, but it might be a false economy to try to get scan to read in 'small' versions (e.g., integer rather than numeric) of the data -- the algorithms might insist on numeric data, and then make a copy during coercion from your small version to numeric.

> SNP$total.NAs <- rowSums(is.na(SNP))  # count the NAs per row and add a column with the total

This adds a column to the data.frame or matrix, probably causing at least one copy of the entire data. Create a separate vector instead, even though this unties the coordination between columns that a data frame provides.

> SNP <- t(as.matrix(SNP))  # transpose rows and columns

This will also probably trigger a copy.

> snp.na <- SNP

R might be clever enough to figure out that this simple assignment does not trigger a copy. But it probably means that any subsequent modification of snp.na or SNP *will* trigger a copy, so avoid the assignment if possible.

> snp.roughfix <- na.roughfix(snp.na)
> fSNP <- factor(snp.roughfix[, 1])  # case/control status as a factor
>
> snp.narf <- randomForest(snp.roughfix[, -1], fSNP, na.action=na.roughfix,
>                          ntree=500, mtry=10, importance=TRUE,
>                          keep.forest=FALSE, do.trace=100)

Now you're entirely in the hands of randomForest. If memory problems occur there, perhaps you'll have gained enough experience to point the package maintainer to the problem and suggest a possible solution.

> I want to run the function randomForest on my large data set; it
> should be able to cope with that amount. Perhaps someone has tried
> this before in R, or is Fortran a better choice?

If you mean a pure Fortran solution, including coding the random forest algorithm yourself, then of course you have complete control over memory management. You'd still likely be limited to addressing 4 GB of memory.
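A minimal, untested sketch of the scan()-plus-separate-vector approach described above; the column count, the whitespace-separated layout of file.txt, and the use of 9 as the missing-value code are carried over from the original script or simply assumed:

n.col <- 1000  # number of values per line in file.txt (assumed)

# scan() returns a plain numeric vector; na.strings="9" recodes missing
# values at read time, and byrow=TRUE makes each line of the file one
# row of the matrix, avoiding the data.frame overhead of read.table().
snp <- matrix(scan("file.txt", what = double(), na.strings = "9"),
              ncol = n.col, byrow = TRUE)

# Keep the per-row NA counts in a separate vector instead of adding a
# column to the data, so the whole object is not copied.
na.per.row <- rowSums(is.na(snp))
snp <- snp[na.per.row < 46, , drop = FALSE]  # rows with fewer than 46 NAs

If even a single copy of the full matrix is too much for a 32-bit session, the same filtering could be done piecewise with scan()'s skip and nlines arguments, accumulating only the rows that pass the NA threshold.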
--
Martin T. Morgan
Bioconductor / Computational Biology
http://bioconductor.org