Hello everyone,

I recently made a 64-bit build of R-2.2.1 under Solaris 9 using gcc v.3.4.2.
The server has 12GB memory, 6 Sparc CPUs and plenty of swap space. I was the
only user at the time of the following experiment.

I wanted to benchmark R's capability to read large data files and used a
data set consisting of 2 million records with 65 variables in each row. All
but 2 of the variables are of character type and the other two are numeric.
The whole data set is about 600 MB when stored as a plain ASCII file.

The following code was used in the benchmarking runs:

c = list(var1=0, var2=0, var3="", var4="", ..., var65="")
A <- scan("test.dat", skip = 1, sep = ",", what = c, nmax = XXXXX,
          quiet = FALSE)
summary(A)

where XXXXX = 1000000 or 2000000.

I made two runs, with nmax=1000000 and nmax=2000000 respectively. The first
run completed successfully, in about an hour of CPU time. However, the
actual memory usage exceeded 2.2GB, about 7 times the amount of data
actually read from disk. The second run aborted when the memory usage
reached 4GB. The error message is "vector memory exhausted (limit
reached?)".

Three questions:
1) Why were so much memory and CPU consumed to read 300MB of data? Since
almost all of the variables are character, I expected an almost 1-1 mapping
between file size on disk and size in memory.
2) Since this is a 64-bit build, I expected it could handle more than the
600MB of data I used. What does the error message mean? I don't believe the
vector length exceeded the theoretical limit of about 1 billion.
3) The original file was compressed and I had to uncompress it before the
experiment. Is there a way to read compressed files directly in R?

Thanks so much for your help.

Min
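A minimal sketch of the same read with the `what` list built programmatically
rather than typed out by hand, assuming (as described above) that the first
two of the 65 columns are numeric and the remaining 63 are character, with
the file name and the nmax of the first run:

## build the template: var1, var2 numeric; var3 ... var65 character
what <- c(list(0, 0), as.list(rep("", 63)))
names(what) <- paste("var", 1:65, sep = "")

## read the first 1,000,000 records, skipping the header line
A <- scan("test.dat", skip = 1, sep = ",", what = what,
          nmax = 1000000, quiet = FALSE)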
On Wed, 26 Apr 2006, Min Shao wrote:

[...]

> 3) The original file was compressed and I had to uncompress it before the
> experiment. Is there a way to read compressed files directly in R?

A <- scan(gzfile("test.dat.gz", "r"), skip = 1, sep = ",", what = c,
          nmax = XXXXX, quiet = FALSE)

----------------------------------------------------------
SIGSIG -- signature too long (core dumped)
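The same read, sketched slightly more fully with the connection opened and
closed explicitly; the compressed file name is assumed, and `what` is the
template list from the original post:

## open a connection to the gzip-compressed file; no uncompressed copy is
## needed on disk
con <- gzfile("test.dat.gz", "r")
A <- scan(con, skip = 1, sep = ",", what = c, nmax = 1000000, quiet = FALSE)
close(con)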
R character vectors are stored as a list of character strings. On a 64-bit
system, each string has an overhead of about 64 bytes. R nowadays shares
strings if they are the same, but only for the first 'few': it gives up
after 10,000 distinct strings. Nevertheless, for many distinct short
strings this is very inefficient.

On Wed, 26 Apr 2006, Min Shao wrote:

> Hello everyone,
>
> I recently made a 64-bit build of R-2.2.1 under Solaris 9 using gcc v.3.4.2.

That's an inadvisable version of gcc, with a bug in g77 which affects some
R packages.

[...]

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
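A rough, illustrative way to see the per-string overhead described at the
top of this reply; the exact sizes are platform- and version-dependent, and
the figures in the comments are only ballpark:

## one million distinct 9-character strings: the in-memory size is several
## times the ~9 MB the text itself occupies, because of per-string overhead
x <- sprintf("id%07d", 1:1000000)
object.size(x)

## the two numeric columns, by contrast, cost about 8 bytes per value
y <- as.numeric(1:1000000)
object.size(y)

## a factor stores each value as a 4-byte integer code plus one copy of
## each distinct level, so repeated strings are far cheaper as factors
f <- factor(sample(letters, 1000000, replace = TRUE))
object.size(f)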