Latchezar Dimitrov
2005-Jun-24 15:38 UTC
[R] Memory limits using read.table on Windows XP Pro
Hello,

When I try:

geno <- read.table("2500.geno.tab", header=TRUE, sep="\t", na.strings=".",
                   quote="", comment.char="", colClasses=c("factor"), nrows=2501)

I get, after hour(s) of work:

Error: cannot allocate vector of size 9 Kb

I have:

Rgui.exe --max-mem-size=3Gb

and

multi(0)disk(0)rdisk(0)partition(1)\WINDOWS="Microsoft Windows XP Professional" /fastdetect /NoExecute=OptIn /PAE /3GB

in boot.ini.

2500.geno.tab is a tab-delimited text table with 2500 x 125000 = 312,500,000 three-level (two alphabet characters) factors (x 4 bytes = 1,250,000,000 bytes, 1.25GB). Even if we double it (as per the read.table help), it's still 2.5GB < 3GB. And in fact Windows Task Manager shows peak memory use for Rgui of 2,056,992K (~2.057GB) and total memory used of 2.62GB. The total physical memory is 4GB (of which Windows recognizes just above 3GB).

Any help or suggestions?

Thanks,
Latchezar
Prof Brian Ripley
2005-Jun-24 16:47 UTC
[R] Memory limits using read.table on Windows XP Pro
On Fri, 24 Jun 2005, Latchezar Dimitrov wrote:

> Hello,
>
> When I try:
>
> geno <- read.table("2500.geno.tab", header=TRUE, sep="\t", na.strings=".",
>                    quote="", comment.char="", colClasses=c("factor"), nrows=2501)
>
> I get, after hour(s) of work:
>
> Error: cannot allocate vector of size 9 Kb
>
> I have:
>
> Rgui.exe --max-mem-size=3Gb
>
> and
>
> multi(0)disk(0)rdisk(0)partition(1)\WINDOWS="Microsoft Windows XP
> Professional" /fastdetect /NoExecute=OptIn /PAE /3GB
>
> in boot.ini
>
> 2500.geno.tab is a tab-delimited text table with 2500 x 125000 =
> 312,500,000 3-level (two alphabet characters) factors (x 4 bytes =
> 1,250,000,000 (1.25GB). Even if we double it (as per read.table help)
> it's still 2.5GB < 3Gb. And actually Windows Task Manager shows peak mem
> use for Rgui 2,056,992K (~2.057GB) and total memory used 2.62GB. And the
> total physical memory is 4GB (of which windows recognizes above 3GB)
>
> Any help or suggestions?

Do check the rw-FAQ. If you modified R to address more than 2GB, you omitted to tell us a vital fact, so I guess you did not.

I think you need to check the actual meaning of G and K, although they are much misused. 1,250,000,000 is 1.16GB in the units you are using for 3GB.

--
Brian D. Ripley, ripley at stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford,  Tel: +44 1865 272861 (self)
1 South Parks Road,         +44 1865 272866 (PA)
Oxford OX1 3TG, UK     Fax: +44 1865 272595
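[The unit arithmetic in this exchange can be checked directly in R; a minimal sketch, assuming one 4-byte integer code per factor entry and taking "GB" in the binary (2^30-byte) sense that --max-mem-size uses:]

```r
cells <- 2500 * 125000   # 312,500,000 factor entries
bytes <- cells * 4       # one 4-byte integer code per entry: 1,250,000,000
bytes / 2^30             # ~1.164 in the binary units --max-mem-size counts
bytes / 10^9             # 1.25 in decimal GB, the figure quoted in the question
```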
Latchezar Dimitrov
2005-Jun-24 18:09 UTC
[R] Memory limits using read.table on Windows XP Pro
Thank you very much for your attention.

I checked the rw-FAQ; I did not mention it though. Since it's a common requirement, I thought it was common practice too and decided not to abuse bandwidth. Apparently wrong. However, from what I presented you can easily (I guess) infer it as well. Your guess about what I used is absolutely correct, as I expected, BTW. Oh yeah, the water is wet, although I did not mention it either :-)

R FAQ, Frequently Asked Questions on R, Version 2.1.2005-06-22, ISBN 3-900051-08-9:

"7.28 Why is read.table() so inefficient?

By default, read.table() needs to read in everything as character data, and then try to figure out which variables to convert to numerics or factors. For a large data set, this takes condiderable amounts of time and memory. Performance can substantially be improved by using the colClasses argument to specify the classes to be assumed for the columns of the table."

(The vital word "condiderable" above is not explained anywhere, so I guess it means "considerable". I think you (all) need to check the spelling of the words you (all) use. Although spelling-checkers are much misused, they are sometimes useful.)

Is my use of read.table() in accordance with the above? Can it be improved with respect to my problem?

R for Windows FAQ, Version for rw2011, B. D. Ripley and D. J. Murdoch (it does not say Prof., but I guess it is "Prof. B. D. Ripley", isn't it?):

"2.11 There seems to be a limit on the memory it uses!

Indeed there is. It is set by the command-line flag --max-mem-size (see How do I install R for Windows?) and defaults to the smaller of the amount of physical RAM in the machine and 1Gb. It can be set to any amount over 16M. (R will not run in less.) Be aware though that Windows has (in most versions) a maximum amount of user virtual memory of 2Gb, and parts of this can be reserved by processes but not used."

So what is wrong, if anything, in my configuration, settings, parameters, flags, etc. (you name them) with respect to the above?
Although I did not mention it, I know very well the difference between GiB, GB, and Gb (as used in the rw-FAQ, wrongly I suppose), and your guess is incorrect here. Anyway, my estimates, as you can see, are conservative, so your note does not contribute essential info. Despite your blunder about my knowledge, I suspect that you secretly knew about the conservativeness above, so I wonder why, after your correct interpretation of my e-mail, I did not get a plain answer in straight English.

Best regards,
Latchezar Dimitrov

PS. Please do not reply if you do not have any help or suggestions to solve the problem (not about my education, experience, not mentioning all the trivia, etc). Thanks.

PPS. I also wonder if you have ever heard about "the magic word", or whether there is no such thing as magic for Prof.'s.

> -----Original Message-----
> From: Prof Brian Ripley [mailto:ripley at stats.ox.ac.uk]
> Sent: Friday, June 24, 2005 12:47 PM
> To: Latchezar Dimitrov
> Cc: r-help at stat.math.ethz.ch
> Subject: Re: [R] Memory limits using read.table on Windows XP Pro
>
> On Fri, 24 Jun 2005, Latchezar Dimitrov wrote:
>
> > Hello,
> >
> > When I try:
> >
> > geno <- read.table("2500.geno.tab", header=TRUE, sep="\t", na.strings=".",
> >                    quote="", comment.char="", colClasses=c("factor"), nrows=2501)
> >
> > I get, after hour(s) of work:
> >
> > Error: cannot allocate vector of size 9 Kb
> >
> > I have:
> >
> > Rgui.exe --max-mem-size=3Gb
> >
> > and
> >
> > multi(0)disk(0)rdisk(0)partition(1)\WINDOWS="Microsoft Windows XP
> > Professional" /fastdetect /NoExecute=OptIn /PAE /3GB
> >
> > in boot.ini
> >
> > 2500.geno.tab is a tab-delimited text table with 2500 x 125000 =
> > 312,500,000 3-level (two alphabet characters) factors (x 4 bytes =
> > 1,250,000,000 (1.25GB). Even if we double it (as per read.table help)
> > it's still 2.5GB < 3Gb. And actually Windows Task Manager shows peak
> > mem use for Rgui 2,056,992K (~2.057GB) and total memory used 2.62GB.
> > And the total physical memory is 4GB (of which windows recognizes
> > above 3GB)
> >
> > Any help or suggestions?
>
> Do check the rw-FAQ. If you modified R to address more than
> 2GB, you omitted to tell us a vital fact, so I guess you did not.
>
> I think you need to check the actual meaning of G and K,
> although they are much misused. 1,250,000,000 is 1.16GB in
> the units you are using for 3GB.
>
> --
> Brian D. Ripley, ripley at stats.ox.ac.uk
> Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
> University of Oxford,  Tel: +44 1865 272861 (self)
> 1 South Parks Road,         +44 1865 272866 (PA)
> Oxford OX1 3TG, UK     Fax: +44 1865 272595
Latchezar Dimitrov
2005-Jul-20 01:46 UTC
[R] Memory limits using read.table on Windows XP Pro
Hello everyone,

Would somebody please explain to me what my sin is (please see the code and timing below)? And how to improve myself and the following piece of R code? BTW, the code works.

This is R 64-bit, built by myself on Sun SPARC Solaris 9 with gcc-4.0.1 (64-bit), also built by yours truly. The machine is a Sun Fire V880 with 32GB memory and 8 CPUs. Nobody else was using it.

Thank you very much,

Latchezar Dimitrov

PS. Prof. Brian D. Ripley, thank you very much for not wasting your invaluable time responding to the above. I know your answer anyway. Sorry for making you read it though ...

> system.time(
+ haplo[i,2*j-1]<-substr(as.character(geno[i,j]),1,1)
+ ,TRUE)
[1] 66.27 13.67 80.02  0.00  0.00
> ls()
[1] "P"     "geno"  "haplo" "i"     "j"     "lr"    "model" "pheno"
> i
[1] 1
> j
[1] 1
> dim(P)
[1]      1 125000
> str(P)
 num [1, 1:125000] 0.6188 0.0533 0.0893 0.8994 0.0316 ...
> str(pheno)
'data.frame':   2500 obs. of  11 variables:
 $ pca    : int  1 1 1 1 1 1 1 1 1 1 ...
 $ her    : int  0 1 0 0 0 0 0 0 0 0 ...
 $ age    : num  67.1 70.4 64.9 60.8 64.3 ...
 $ t      : int  1 3 1 2 1 2 1 3 9 1 ...
 $ n      : int  9 9 9 9 9 9 9 0 9 9 ...
 $ m      : int  0 1 0 0 0 1 9 9 1 9 ...
 $ diffgrd: int  9 9 9 2 9 9 9 9 3 9 ...
 $ gs     : int  6 7 6 7 7 7 5 6 NA 6 ...
 $ psa    : num  13 75 11.3 13 9.1 51 4.3 10.7 NA 93 ...
 $ geo    : int  2 2 1 1 1 1 1 1 1 1 ...
 $ ageg   : int  6 7 5 5 5 5 3 5 4 5 ...
> str(geno)
'data.frame':   2500 obs. of  125000 variables:
^C
> dim(geno)
[1]   2500 125000
> dim(haplo)
[1]   2500 250000
> version()
Error: attempt to apply non-function
> version
         _
platform sparc-sun-solaris2.9
arch     sparc
os       solaris2.9
system   sparc, solaris2.9
status   Patched
major    2
minor    1.1
year     2005
month    07
day      09
language R
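[A likely culprit in the 66-second timing above: each single-cell assignment `haplo[i,2*j-1] <- ...` on a data frame this size can force R to duplicate the whole 2500 x 250000 object. A minimal sketch of a column-at-a-time alternative, assuming geno's columns hold two-character genotype codes such as "AA" or "AB" as in the original post:]

```r
## Sketch: fill both haplotype columns for SNP j in one shot, so each
## step assigns one whole column instead of one cell at a time.
for (j in seq_len(ncol(geno))) {
  g <- as.character(geno[[j]])         # e.g. "AA", "AB", "BB"
  haplo[[2*j - 1]] <- substr(g, 1, 1)  # first allele
  haplo[[2*j]]     <- substr(g, 2, 2)  # second allele
}
```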
Latchezar Dimitrov
2005-Jul-20 01:48 UTC
[R] Memory limits using read.table on Windows XP Pro
Really sorry for the wrong addressing. It was intended for the list only. I apologize.

Latchezar

> -----Original Message-----
> From: Latchezar Dimitrov
> Sent: Tuesday, July 19, 2005 9:47 PM
> To: Latchezar Dimitrov; 'Prof Brian Ripley'
> Cc: 'r-help at stat.math.ethz.ch'
> Subject: RE: [R] Memory limits using read.table on Windows XP Pro
>
> Hello everyone,
>
> Would somebody please explain to me what my sin is (please
> see the code and timing below)? And how to improve myself
> and the following piece of R code? BTW, the code works.
>
> This is R 64-bit, built by myself on Sun SPARC Solaris 9 with
> gcc-4.0.1 (64-bit), also built by yours truly. The machine is
> a Sun Fire V880 with 32GB memory and 8 CPUs. Nobody else was using it.
>
> Thank you very much,
>
> Latchezar Dimitrov
>
> PS. Prof. Brian D. Ripley, thank you very much for not
> wasting your invaluable time responding to the above. I know
> your answer anyway. Sorry for making you read it though ...
>
> > system.time(
> + haplo[i,2*j-1]<-substr(as.character(geno[i,j]),1,1)
> + ,TRUE)
> [1] 66.27 13.67 80.02  0.00  0.00
> > ls()
> [1] "P"     "geno"  "haplo" "i"     "j"     "lr"    "model" "pheno"
> > i
> [1] 1
> > j
> [1] 1
> > dim(P)
> [1]      1 125000
> > str(P)
>  num [1, 1:125000] 0.6188 0.0533 0.0893 0.8994 0.0316 ...
> > str(pheno)
> 'data.frame':   2500 obs. of  11 variables:
>  $ pca    : int  1 1 1 1 1 1 1 1 1 1 ...
>  $ her    : int  0 1 0 0 0 0 0 0 0 0 ...
>  $ age    : num  67.1 70.4 64.9 60.8 64.3 ...
>  $ t      : int  1 3 1 2 1 2 1 3 9 1 ...
>  $ n      : int  9 9 9 9 9 9 9 0 9 9 ...
>  $ m      : int  0 1 0 0 0 1 9 9 1 9 ...
>  $ diffgrd: int  9 9 9 2 9 9 9 9 3 9 ...
>  $ gs     : int  6 7 6 7 7 7 5 6 NA 6 ...
>  $ psa    : num  13 75 11.3 13 9.1 51 4.3 10.7 NA 93 ...
>  $ geo    : int  2 2 1 1 1 1 1 1 1 1 ...
>  $ ageg   : int  6 7 5 5 5 5 3 5 4 5 ...
> > str(geno)
> 'data.frame':   2500 obs. of  125000 variables:
> ^C
> > dim(geno)
> [1]   2500 125000
> > dim(haplo)
> [1]   2500 250000
> > version()
> Error: attempt to apply non-function
> > version
>          _
> platform sparc-sun-solaris2.9
> arch     sparc
> os       solaris2.9
> system   sparc, solaris2.9
> status   Patched
> major    2
> minor    1.1
> year     2005
> month    07
> day      09
> language R
Hello,

This is a speed question. I have a dataframe genoT:

> dim(genoT)
[1]   1002 238304
> str(genoT)
'data.frame':   1002 obs. of  238304 variables:
 $ SNP_A.4261647: Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
 $ SNP_A.4261610: Factor w/ 3 levels "0","1","2": 1 1 3 3 1 1 1 2 2 2 ...
 $ SNP_A.4261601: Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
 $ SNP_A.4261704: Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
 $ SNP_A.4261563: Factor w/ 3 levels "0","1","2": 3 1 2 1 2 3 2 3 3 1 ...
 $ SNP_A.4261554: Factor w/ 3 levels "0","1","2": 1 1 NA 1 NA 2 1 1 2 1 ...
 $ SNP_A.4261666: Factor w/ 3 levels "0","1","2": 1 1 2 1 1 1 1 1 1 2 ...
 $ SNP_A.4261634: Factor w/ 3 levels "0","1","2": 3 3 2 3 3 3 3 3 3 2 ...
 $ SNP_A.4261656: Factor w/ 3 levels "0","1","2": 1 1 2 1 1 1 1 1 1 2 ...
 $ SNP_A.4261637: Factor w/ 3 levels "0","1","2": 1 3 2 3 2 1 2 1 1 3 ...
 $ SNP_A.4261597: Factor w/ 3 levels "AA","AB","BB": 2 2 3 3 3 2 1 2 2 3 ...
 $ SNP_A.4261659: Factor w/ 3 levels "AA","AB","BB": 3 3 3 3 3 3 3 3 3 3 ...
 $ SNP_A.4261594: Factor w/ 3 levels "AA","AB","BB": 2 2 2 1 1 1 2 2 2 2 ...
 $ SNP_A.4261698: Factor w/ 2 levels "AA","AB": 1 1 1 1 1 1 1 1 1 1 ...
 $ SNP_A.4261538: Factor w/ 3 levels "AA","AB","BB": 2 3 2 2 3 2 2 1 1 2 ...
 $ SNP_A.4261621: Factor w/ 3 levels "AA","AB","BB": 1 1 1 1 1 1 1 1 1 1 ...
 $ SNP_A.4261553: Factor w/ 3 levels "AA","AB","BB": 1 1 2 1 1 1 1 1 1 1 ...
 $ SNP_A.4261528: Factor w/ 2 levels "AA","AB": 1 1 1 1 1 1 1 1 1 1 ...
 $ SNP_A.4261579: Factor w/ 3 levels "AA","AB","BB": 1 1 1 1 1 2 1 1 1 2 ...
 $ SNP_A.4261513: Factor w/ 3 levels "AA","AB","BB": 2 1 2 2 2 NA 1 NA 2 1 ...
 $ SNP_A.4261532: Factor w/ 3 levels "AA","AB","BB": 1 2 2 1 1 1 3 1 1 1 ...
 $ SNP_A.4261600: Factor w/ 2 levels "AB","BB": 2 2 2 2 2 2 2 2 2 2 ...
 $ SNP_A.4261706: Factor w/ 2 levels "AA","BB": 1 1 1 1 1 1 1 1 1 1 ...
 $ SNP_A.4261575: Factor w/ 3 levels "AA","AB","BB": 1 1 1 1 1 1 1 2 2 1 ...
Its columns are factors with different numbers of levels (from 1 to 3 - that's what I got from read.table, i.e., it dropped missing levels). I want to convert it to uniform factors with 3 levels. The first 10 lines above show already-converted columns, and the rest are not yet converted. Here's my attempt, which is a complete failure speed-wise:

> system.time(
+ for(j in 1:(10)){ #-- try the 1st 10 cols and measure the time; otherwise it is ncol(genoT) instead of 10
+   gt<-genoT[[j]] #-- this is to avoid 2D indices
+   for(l in 1:length(gt@levels)){
+     levels(gt)[l] <- switch(gt@levels[l],AA="0",AB="1",BB="2") #-- convert levels to "0","1", or "2"
+     genoT[[j]]<-factor(gt,levels=0:2) #-- make a 3-level factor and put it back
+   }
+ }
+ )
[1] 785.085   4.358 789.454   0.000   0.000

789s for 10 columns only! To me it seems like replacing 10 x 3 levels and then making a factor of a 1002-element vector x 10 is a "negligible" amount of operations. So, what's wrong with me? Any idea how to accelerate the transformation significantly, or (to go back to the very beginning) how to make read.table use a fixed set of levels ("AA","AB", and "BB") and not drop any (missing) level?

R-devel_2006-08-26, Sun Solaris 10 OS - x86 64-bit. The machine has 32GB RAM and an AMD Opteron 285 (2.? GHz), so that's not it.

Thank you very much for the help,

Latchezar Dimitrov, Analyst/Programmer IV, Wake Forest University School of Medicine, Winston-Salem, North Carolina, USA
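[One plausible reason the loop above is slow: `genoT[[j]] <- factor(...)` sits inside the inner per-level loop, so a column of the 238,304-column data frame is re-assigned once per level, and each such assignment can copy a great deal of data. A minimal sketch of a vectorized recode, assuming every level name is among "AA","AB","BB" or already "0","1","2":]

```r
## Sketch: rename all of a column's levels in one vectorized lookup and
## fix the level set, assigning back only once per column.
map <- c(AA = "0", AB = "1", BB = "2", "0" = "0", "1" = "1", "2" = "2")
for (j in seq_len(ncol(genoT))) {
  gt <- genoT[[j]]
  levels(gt) <- map[levels(gt)]                        # vectorized rename
  genoT[[j]] <- factor(gt, levels = c("0", "1", "2"))  # uniform 3 levels
}
```

[As for fixing the levels at the source: read.table itself offers no way to impose a level set, so one option is to read the columns as character (colClasses="character") and build each factor afterwards with an explicit levels= argument, so no missing level is ever dropped.]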