Hi,

First, thanks in advance. Some useful info:

> version
platform       x86_64-unknown-linux-gnu
arch           x86_64
os             linux-gnu
system         x86_64, linux-gnu
version.string R version 2.15.1 (2012-06-22)

I'm trying to use the table() function on a 2 column matrix that has 711
million rows (see below). However, it freezes. If I subset the matrix to
be less than or equal to 2^29 rows (500+ million) then the table()
function finishes in minutes. As soon as I go larger than that, beginning
with 2^29+1, it gets stuck, i.e. nothing happens even after hours of
running. I assume it has something to do with memory since I believe
that's the 32 bit limit, but I'm running on a 64 bit machine.

Here's the matrix:

> head(DRI.mtx)
     POSITION BP
     38076904 C
     38076905 C
     38076906 A
     38076907 T
     38076908 C
     38076909 C

The result from table() (if the matrix has no more than 2^29 rows) is

> head(table(DRI.mtx))
           BP
POSITION     A  C  G  N  T
  115247036 17  0  0  0  0
  115247037 31  0  0  0  0
  115247038 46  0  0  0  0
  115247039  0  0 54  0  0
  115247040  0  0  1  0 66
  115247041  0  0  0  0 78

I've tracked the problem down to the C file "unique.c". table() calls
factor(), which calls unique(), which I believe calls into "unique.c".
Browsing through the C file I found an if statement that checks whether
the size of the vector is larger than 2^30. If TRUE it gives the error
message "too large for hashing". I do not get any error message when I
run table() on the full matrix, but I wonder if maybe I should, and if
the limit of 2^30 is too high and should be lowered. Maybe it's just my
setup or maybe it has nothing to do with unique.c. I don't know.

Here's the part of unique.c I was referring to:

    /*
      Choose M to be the smallest power of 2
      not less than 2*n and set K = log2(M).
      Need K >= 1 and hence M >= 2, and 2^M <= 2^31 -1, hence n <= 2^30.

      Dec 2004: modified from 4*n to 2*n, since in the worst case we have
      a 50% full table, and that is still rather efficient -- see
      R. Sedgewick (1998) Algorithms in C++ 3rd edition p.606.
    */
    static void MKsetup(int n, HashData *d)
    {
        int n2 = 2 * n;
        if(n < 0 || n > 1073741824) /* protect against overflow to -ve */
            error(_("length %d is too large for hashing"), n);
        d->M = 2;
        d->K = 1;
        while (d->M < n2) {
            d->M *= 2;
            d->K += 1;
        }
    }

"n" I presume is the number of rows of the matrix, so I don't see why
this wouldn't run properly, though I'm not sure what is causing the
problem in the unique.c file and I have no idea how to troubleshoot.

I have a workaround that reads in chunks at a time, but I'm very
interested in why there appears to be a limit at 2^29 when according to
the unique.c file it should be twice that.

Thanks for the help.

-Sean
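The chunked workaround itself is not shown in the post. As a rough sketch of how counting in chunks can avoid hashing one very long vector, the C fragment below adds one chunk of (position, base) rows at a time to a running table of counts. Everything in it is an assumption made for illustration: the helper name tally_chunk, the dense table indexed by position minus a known minimum, and the fixed base alphabet A, C, G, N, T taken from the table() output above.

```c
#include <stddef.h>
#include <string.h>

enum { NBASES = 5 };                 /* columns A, C, G, N, T */
static const char BASES[] = "ACGNT";

/* Hypothetical helper (not part of R): add one chunk of (position, base)
   rows to a running table of counts. counts holds npos rows of NBASES
   columns in row-major order, and row i belongs to position min_pos + i.
   Rows with an unknown base or a position outside the table are skipped.
   Returns the number of rows that were counted. */
static size_t tally_chunk(const long *pos, const char *bp, size_t n,
                          long min_pos, size_t npos, long *counts)
{
    size_t used = 0;
    for (size_t i = 0; i < n; i++) {
        /* strchr would also match the terminating NUL, so guard against it */
        const char *hit = bp[i] ? strchr(BASES, bp[i]) : NULL;
        if (hit == NULL || pos[i] < min_pos)
            continue;
        size_t row = (size_t)(pos[i] - min_pos);
        if (row >= npos)
            continue;
        counts[row * NBASES + (size_t)(hit - BASES)]++;
        used++;
    }
    return used;
}
```

Calling tally_chunk once per chunk with the same counts array accumulates the totals, so no single call ever sees more rows than one chunk holds.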
On Aug 9, 2012, at 5:29 PM, Sean Ruddy wrote:

> I'm trying to use the table() function on a 2 column matrix that has
> 711 million rows (see below). However, it freezes. If I subset the
> matrix to be less than or equal to 2^29 (500+ million) then the
> table() function finishes in minutes. As soon as I go larger than
> that, beginning with 2^29+1, it gets stuck, i.e. nothing happens even
> after hours of running. I assume it has something to do with memory
> since I believe that's the 32 bit limit but I'm running on a 64 bit
> machine.

The maximum size of a vector or matrix (= nrow x ncol) is the same on 32
and 64 bit machines: 2^31-1

> [...]
>
> I have a workaround that reads in chunks at a time, but I'm very
> interested in why there appears to be a limit at 2^29 when according
> to the unique.c file it should be twice that.

Matrices are stored as vectors, so the maximum number of rows of a two
column matrix _should_ be half of the maximum length of a vector. Issues
with reaching the limits for matrix or vector sizes come up from time to
time but this is the first in my memory for size of factor objects.

David Winsemius, MD
Alameda, CA, USA
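The arithmetic behind that reply can be checked with a small sketch. It takes the vector-length limit to be 2^31 - 1 cells, the largest 32-bit signed int, which is what R 2.15.x uses for vector lengths. The helper names max_rows and fits are hypothetical, and 711000000 is only the rounded row count from the original post.

```c
#include <stdint.h>

/* Hypothetical helper: largest row count of a matrix with ncol columns
   when the total number of cells must not exceed max_len. */
static int64_t max_rows(int64_t max_len, int64_t ncol)
{
    return ncol > 0 ? max_len / ncol : 0;
}

/* Returns 1 if an nrow x ncol matrix fits within max_len cells, else 0. */
static int fits(int64_t nrow, int64_t ncol, int64_t max_len)
{
    return nrow >= 0 && nrow <= max_rows(max_len, ncol);
}
```

With max_len = INT32_MAX, max_rows gives 1073741823 for two columns, so fits(711000000, 2, INT32_MAX) is 1. A matrix of roughly 711 million rows is therefore within the vector-length limit, and that limit alone does not explain a freeze that starts at 2^29 + 1 rows.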
R. Michael Weylandt <michael.weylandt@gmail.com>
2012-Aug-10 02:59 UTC
[R] Vector size limit for table() in R-2.15.1
On Aug 9, 2012, at 7:29 PM, Sean Ruddy <sruddy17 at gmail.com> wrote:

> [...]
>
> I have a workaround that reads in chunks at a time, but I'm very
> interested in why there appears to be a limit at 2^29 when according to the
> unique.c file it should be twice that.

I believe Prof Ripley has touched this code in R-devel recently, but I
can't remember if he changed the size of the hash table. See if one of
the nightly builds can help you out. Also, possibly see if Simon
Urbanek's fastmatch package can help.

Sorry for the lack of concrete pointers but I'm working from my phone,
which isn't svn equipped ;-)

Michael

> Thanks for the help.
>
> -Sean
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
As the posting guide asked you to before posting, try R-patched. That
has the NEWS items

  * duplicated(), unique() and similar now support vectors of lengths
    above 2^29 on 64-bit platforms.

  * unique() and similar would infinite-loop if called on a vector of
    length > 2^29 (but reported that the vector was too long for 2^30
    or more).

If you want to work on such large datasets, you might want to consider
using R-devel which has a number of enhancements already with more in
the pipeline.

On 10/08/2012 01:29, Sean Ruddy wrote:
> I'm trying to use the table() function on a 2 column matrix that has 711
> million rows (see below). However, it freezes. If I subset the matrix to be
> less than or equal to 2^29 (500+ million) then the table() function
> finishes in minutes.
>
> [...]

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
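The infinite loop named in the second NEWS item is consistent with the doubling loop in the quoted MKsetup. Once n exceeds 2^29, n2 = 2*n exceeds 2^30, so M has to double past 2^30, which does not fit in a 32-bit int. The sketch below replays that loop with explicit 32-bit wraparound and an iteration cap. It is an illustration under assumptions, not R's code: signed overflow is undefined behavior in C, wraparound is only what typical compilers produce, and the names wrap32 and mksetup_iterations are hypothetical.

```c
#include <stdint.h>

/* Reduce a 64-bit value to what a wrapping 32-bit signed int would hold. */
static int32_t wrap32(int64_t x)
{
    uint32_t u = (uint32_t)x;        /* modulo 2^32, well defined in C */
    return (u > (uint32_t)INT32_MAX)
        ? (int32_t)((int64_t)u - 4294967296LL)
        : (int32_t)u;
}

/* Hypothetical helper: replay the "while (M < n2) M *= 2" loop from
   MKsetup with wrapping 32-bit arithmetic. Returns the number of
   doublings, or -1 if the loop is still running after max_iter steps. */
static int mksetup_iterations(int32_t n, int max_iter)
{
    int32_t n2 = wrap32(2 * (int64_t)n);
    int32_t M = 2;
    int iter = 0;
    while (M < n2) {
        if (iter >= max_iter)
            return -1;               /* treated as "never terminates" */
        M = wrap32(2 * (int64_t)M);  /* 2^30 -> -2^31 -> 0 -> 0 -> ... */
        iter++;
    }
    return iter;
}
```

Under these assumptions, mksetup_iterations(536870912, 1000) returns 29, while mksetup_iterations(536870913, 1000) returns -1: M wraps from 2^30 to -2^31 and then to 0, where doubling leaves it stuck below n2. That matches a table() call that finishes at 2^29 rows and hangs without an error at 2^29 + 1.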