Toby Hocking
2019-Feb-20 18:16 UTC
[Rd] Bug: time complexity of substring is quadratic as string size and number of substrings increases
Hi all, (and especially hi to Tomas Kalibera who accepted my patch sent yesterday) I believe that I have found another bug, this time in the substring function. The use case that I am concerned with is when there is a single (character scalar) text/subject, and many substrings to extract. For example substring("AAAA", 1:4, 1:4) or more generally, N=1000 substring(paste(rep("A", N), collapse=""), 1:N, 1:N) The problem I observe is that the time complexity is quadratic in N, as shown on this figure https://github.com/tdhock/namedCapture-article/blob/master/figure-substring-bug.png source: https://github.com/tdhock/namedCapture-article/blob/master/figure-substring-bug.R I expected the time complexity to be linear in N. The example above may seem contrived/trivial, but it is indeed relevant to a number of packages (rex, rematch2, namedCapture) which provide functions that use gregexpr and then substring to extract the text in the captured sub-patterns. The figure https://github.com/tdhock/namedCapture-article/blob/master/figure-trackDb-pkgs.png shows the issue: these packages have quadratic time complexity, whereas other packages (and the gregexpr function with perl=TRUE after applying the patch discussed yesterday) have linear time complexity. I believe the problem is the substring function. Source for this figure: https://github.com/tdhock/namedCapture-article/blob/master/figure-trackDb-pkgs.R I suspect that a fix can be accomplished by optimizing the implementation of substring, for the special case when the text/subject is a single element (character scalar). Right now I notice that the substring R code uses rep_len so that the text/subject which is passed to the C code is a character vector with the same length as the number of substrings to extract. Maybe the C code is calling strlen for each of these (identical) text/subject elements? Anyway, it would be useful to have some feedback to make sure this is indeed a bug before I post on bugzilla. (btw thanks Martin for signing me up for an account) Toby [[alternative HTML version deleted]]
Toby Hocking
2019-Feb-20 18:55 UTC
[Rd] Bug: time complexity of substring is quadratic as string size and number of substrings increases
Update: I have observed that stringi::stri_sub is linear time complexity, and it computes the same thing as base::substring. figure https://github.com/tdhock/namedCapture-article/blob/master/figure-substring-bug.png source: https://github.com/tdhock/namedCapture-article/blob/master/figure-substring-bug.R To me this is a clear indication of a bug in substring, but again it would be nice to have some feedback/confirmation before posting on bugzilla. Also this suggests a fix -- just need to copy whatever stringi::stri_sub is doing. On Wed, Feb 20, 2019 at 11:16 AM Toby Hocking <tdhock5 at gmail.com> wrote:> Hi all, (and especially hi to Tomas Kalibera who accepted my patch sent > yesterday) > > I believe that I have found another bug, this time in the substring > function. The use case that I am concerned with is when there is a single > (character scalar) text/subject, and many substrings to extract. For example > > substring("AAAA", 1:4, 1:4) > > or more generally, > > N=1000 > substring(paste(rep("A", N), collapse=""), 1:N, 1:N) > > The problem I observe is that the time complexity is quadratic in N, as > shown on this figure > https://github.com/tdhock/namedCapture-article/blob/master/figure-substring-bug.png > source: > https://github.com/tdhock/namedCapture-article/blob/master/figure-substring-bug.R > > I expected the time complexity to be linear in N. > > The example above may seem contrived/trivial, but it is indeed relevant to > a number of packages (rex, rematch2, namedCapture) which provide functions > that use gregexpr and then substring to extract the text in the captured > sub-patterns. The figure > https://github.com/tdhock/namedCapture-article/blob/master/figure-trackDb-pkgs.png > shows the issue: these packages have quadratic time complexity, whereas > other packages (and the gregexpr function with perl=TRUE after applying the > patch discussed yesterday) have linear time complexity. I believe the > problem is the substring function. Source for this figure: > https://github.com/tdhock/namedCapture-article/blob/master/figure-trackDb-pkgs.R > > I suspect that a fix can be accomplished by optimizing the implementation > of substring, for the special case when the text/subject is a single > element (character scalar). Right now I notice that the substring R code > uses rep_len so that the text/subject which is passed to the C code is a > character vector with the same length as the number of substrings to > extract. Maybe the C code is calling strlen for each of these (identical) > text/subject elements? > > Anyway, it would be useful to have some feedback to make sure this is > indeed a bug before I post on bugzilla. (btw thanks Martin for signing me > up for an account) > > Toby >[[alternative HTML version deleted]]
Tomas Kalibera
2019-Feb-22 19:16 UTC
[Rd] Bug: time complexity of substring is quadratic as string size and number of substrings increases
On 2/20/19 7:55 PM, Toby Hocking wrote:> Update: I have observed that stringi::stri_sub is linear time complexity, > and it computes the same thing as base::substring. figure > https://github.com/tdhock/namedCapture-article/blob/master/figure-substring-bug.png > source: > https://github.com/tdhock/namedCapture-article/blob/master/figure-substring-bug.R > > To me this is a clear indication of a bug in substring, but again it would > be nice to have some feedback/confirmation before posting on bugzilla. > > Also this suggests a fix -- just need to copy whatever stringi::stri_sub is > doing.Thanks for the report, I am working on a patch that will address this. I confirm there is a lot of potential for speedup. On my system, 'N=200000; x <- substring(paste(rep("A", N), collapse=""), 1:N, 1:N)' spends 96% time in checking if the string is ascii and 3% in strlen(); if we take advantage of the pre-computed value in the ASCII bit, the speed up is about 40x. Of course, with micro-benchmarks, any performance limitation can be arbitrarily inflated, users cannot expect to see these or any close speedups in applications as a result of the patch. The patch is going to do other easy optimizations that will not complicate the code, including avoiding the strlen() call (taking advantage of pre-computed length of R character object). Best Tomas> > > > On Wed, Feb 20, 2019 at 11:16 AM Toby Hocking <tdhock5 at gmail.com> wrote: > >> Hi all, (and especially hi to Tomas Kalibera who accepted my patch sent >> yesterday) >> >> I believe that I have found another bug, this time in the substring >> function. The use case that I am concerned with is when there is a single >> (character scalar) text/subject, and many substrings to extract. For example >> >> substring("AAAA", 1:4, 1:4) >> >> or more generally, >> >> N=1000 >> substring(paste(rep("A", N), collapse=""), 1:N, 1:N) >> >> The problem I observe is that the time complexity is quadratic in N, as >> shown on this figure >> https://github.com/tdhock/namedCapture-article/blob/master/figure-substring-bug.png >> source: >> https://github.com/tdhock/namedCapture-article/blob/master/figure-substring-bug.R >> >> I expected the time complexity to be linear in N. >> >> The example above may seem contrived/trivial, but it is indeed relevant to >> a number of packages (rex, rematch2, namedCapture) which provide functions >> that use gregexpr and then substring to extract the text in the captured >> sub-patterns. The figure >> https://github.com/tdhock/namedCapture-article/blob/master/figure-trackDb-pkgs.png >> shows the issue: these packages have quadratic time complexity, whereas >> other packages (and the gregexpr function with perl=TRUE after applying the >> patch discussed yesterday) have linear time complexity. I believe the >> problem is the substring function. Source for this figure: >> https://github.com/tdhock/namedCapture-article/blob/master/figure-trackDb-pkgs.R >> >> I suspect that a fix can be accomplished by optimizing the implementation >> of substring, for the special case when the text/subject is a single >> element (character scalar). Right now I notice that the substring R code >> uses rep_len so that the text/subject which is passed to the C code is a >> character vector with the same length as the number of substrings to >> extract. Maybe the C code is calling strlen for each of these (identical) >> text/subject elements? >> >> Anyway, it would be useful to have some feedback to make sure this is >> indeed a bug before I post on bugzilla. (btw thanks Martin for signing me >> up for an account) >> >> Toby >> > [[alternative HTML version deleted]] > > ______________________________________________ > R-devel at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-devel
Possibly Parallel Threads
- Bug: time complexity of substring is quadratic as string size and number of substrings increases
- Bug: time complexity of substring is quadratic as string size and number of substrings increases
- patch for gregexpr(perl=TRUE)
- Error in substring: invalid multibyte string
- Error in substring: invalid multibyte string