A possible regex bug when working with large strings. The following code snippet

    t5 <- paste( c( "# === TEST", rep(' ', 2452294) ), collapse='')
    str( sub("^.*TEST", "xyz", t5) )
    str( sub("^.*TEST", "xyz", substr(t5,0,200)) )

doesn't behave right; on one machine, the second and third lines print different results [the second line, on the long string, doesn't do the substitution], while on another, the second line causes a segfault. Both are running R 1.8.1 with PCRE, under NetBSD (1.6.1 and 1.6 respectively).

Possibly related (although perhaps not a bug):

    function(n) {
      line <- paste(as.character(trunc(runif(n)*100)), collapse=" ")
      system.time( rep <- gsub("[[:space:]]", "-", line) )
    }

gives rather long times, rising very sharply for big strings (e.g. 2.2s at n=2e4, 360s at n=2e5 on a 1.2 GHz AMD). Other languages aren't so slow on this task (e.g. at n=2e5: 0.4s for Ruby 1.8.1 and 5.2s for Python 2). Doubtless my extremely-quick-hack benchmarks aren't fair, but the difference still seems rather big.

Mark <><
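
A minimal sketch of how the anonymous timing function above can be named and called at two sizes to see the superlinear growth; the name timeGsub and the chosen sizes are illustrative and not from the original report, and absolute times will depend on the machine and R build:

    ## Wrap the reporter's benchmark in a named function.
    timeGsub <- function(n) {
      ## build a string of n random integers separated by single spaces
      line <- paste(as.character(trunc(runif(n) * 100)), collapse = " ")
      ## time only the substitution of every space with a hyphen
      system.time(gsub("[[:space:]]", "-", line))
    }

    ## The reported slowdown shows up as a much-worse-than-10x jump in
    ## elapsed time between these two sizes.
    timeGsub(2e4)
    timeGsub(2e5)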
Prof Brian Ripley
2004-Feb-28 12:31 UTC
[Rd] Regular expressions & large strings (PR#6617)
I was able to confirm the error on RH8.0 Linux and the segfault on Windows. Note that PCRE is not being used, and if you add perl=TRUE to your [g]sub calls you get correct results extremely fast.

The segfault is occurring in regexec, that is, in the GNU regex code included in R. I am not clear it is worth spending any time on trying to find the problem in that code as

- you can use perl=TRUE as an alternative
- we will be replacing the GNU regex code in due course to cope with internationalization issues.

On Fri, 27 Feb 2004 mjw@celos.net wrote:

> A possible regex bug when working with large strings. The
> following code snippet
>
> t5 <- paste( c( "# === TEST", rep(' ', 2452294) ), collapse='')
> str( sub("^.*TEST", "xyz", t5) )
> str( sub("^.*TEST", "xyz", substr(t5,0,200)) )
>
> doesn't behave right; on one machine, the second and third
> lines print different results [the second line, on the long
> string, doesn't do the substitution], while on another, the
> second line causes a segfault. Both are running R 1.8.1
> with PCRE, under NetBSD (1.6.1 and 1.6 respectively).
>
> Possibly related (although perhaps not a bug):
>
> function(n) {
>   line <- paste(as.character(trunc(runif(n)*100)), collapse=" ")
>   system.time( rep <- gsub("[[:space:]]", "-", line) )
> }
>
> gives rather long times rising very sharply for big strings (e.g.
> 2.2s at n=2e4, 360s at n=2e5 on a 1.2 GHz AMD). Other languages
> aren't so slow on this task (e.g. n=2e5: 0.4s ruby 1.8.1, and
> 5.2s python 2). Doubtless my extremely-quick-hack benchmarks
> aren't fair, but the difference still seems rather big.
>
> Mark <><

-- 
Brian D. Ripley,                  ripley@stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
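
For reference, a minimal sketch of the perl=TRUE workaround suggested above, applied to both of the reporter's examples; it assumes an R build with PCRE support, and the comments only restate the behaviour described in the reply:

    ## Regex bug example: with perl = TRUE the substitution also
    ## succeeds on the long string, matching the substr() case.
    t5 <- paste(c("# === TEST", rep(" ", 2452294)), collapse = "")
    str(sub("^.*TEST", "xyz", t5, perl = TRUE))
    str(sub("^.*TEST", "xyz", substr(t5, 0, 200), perl = TRUE))

    ## Performance example: the PCRE engine handles the large gsub()
    ## quickly as well.
    line <- paste(as.character(trunc(runif(2e5) * 100)), collapse = " ")
    system.time(gsub("[[:space:]]", "-", line, perl = TRUE))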