thr3ads.net - R devel - [Rd] Bug: time complexity of substring is quadratic as string size and number of substrings increases [Feb 2019]

If this information is useful, please help other people find it:
Share via:

Toby Hocking

2019-Feb-20 18:16 UTC

[Rd] Bug: time complexity of substring is quadratic as string size and number of substrings increases

Hi all, (and especially hi to Tomas Kalibera who accepted my patch sent
yesterday)

I believe that I have found another bug, this time in the substring
function. The use case that I am concerned with is when there is a single
(character scalar) text/subject, and many substrings to extract. For example

substring("AAAA", 1:4, 1:4)

or more generally,

N=1000
substring(paste(rep("A", N), collapse=""), 1:N, 1:N)

The problem I observe is that the time complexity is quadratic in N, as
shown on this figure
https://github.com/tdhock/namedCapture-article/blob/master/figure-substring-bug.png
source:
https://github.com/tdhock/namedCapture-article/blob/master/figure-substring-bug.R

I expected the time complexity to be linear in N.

The example above may seem contrived/trivial, but it is indeed relevant to
a number of packages (rex, rematch2, namedCapture) which provide functions
that use gregexpr and then substring to extract the text in the captured
sub-patterns. The figure
https://github.com/tdhock/namedCapture-article/blob/master/figure-trackDb-pkgs.png
shows the issue: these packages have quadratic time complexity, whereas
other packages (and the gregexpr function with perl=TRUE after applying the
patch discussed yesterday) have linear time complexity. I believe the
problem is the substring function. Source for this figure:
https://github.com/tdhock/namedCapture-article/blob/master/figure-trackDb-pkgs.R

I suspect that a fix can be accomplished by optimizing the implementation
of substring, for the special case when the text/subject is a single
element (character scalar). Right now I notice that the substring R code
uses rep_len so that the text/subject which is passed to the C code is a
character vector with the same length as the number of substrings to
extract. Maybe the C code is calling strlen for each of these (identical)
text/subject elements?

Anyway, it would be useful to have some feedback to make sure this is
indeed a bug before I post on bugzilla. (btw thanks Martin for signing me
up for an account)

Toby

	[[alternative HTML version deleted]]

Toby Hocking

2019-Feb-20 18:55 UTC

head link

[Rd] Bug: time complexity of substring is quadratic as string size and number of substrings increases

Update: I have observed that stringi::stri_sub is linear time complexity,
and it computes the same thing as base::substring. figure
https://github.com/tdhock/namedCapture-article/blob/master/figure-substring-bug.png
source:
https://github.com/tdhock/namedCapture-article/blob/master/figure-substring-bug.R

To me this is a clear indication of a bug in substring, but again it would
be nice to have some feedback/confirmation before posting on bugzilla.

Also this suggests a fix -- just need to copy whatever stringi::stri_sub is
doing.




On Wed, Feb 20, 2019 at 11:16 AM Toby Hocking <tdhock5 at gmail.com>
wrote:
> Hi all, (and especially hi to Tomas Kalibera who accepted my patch sent
> yesterday)
>
> I believe that I have found another bug, this time in the substring
> function. The use case that I am concerned with is when there is a single
> (character scalar) text/subject, and many substrings to extract. For
example
>
> substring("AAAA", 1:4, 1:4)
>
> or more generally,
>
> N=1000
> substring(paste(rep("A", N), collapse=""), 1:N, 1:N)
>
> The problem I observe is that the time complexity is quadratic in N, as
> shown on this figure
>
https://github.com/tdhock/namedCapture-article/blob/master/figure-substring-bug.png
> source:
>
https://github.com/tdhock/namedCapture-article/blob/master/figure-substring-bug.R
>
> I expected the time complexity to be linear in N.
>
> The example above may seem contrived/trivial, but it is indeed relevant to
> a number of packages (rex, rematch2, namedCapture) which provide functions
> that use gregexpr and then substring to extract the text in the captured
> sub-patterns. The figure
>
https://github.com/tdhock/namedCapture-article/blob/master/figure-trackDb-pkgs.png
> shows the issue: these packages have quadratic time complexity, whereas
> other packages (and the gregexpr function with perl=TRUE after applying the
> patch discussed yesterday) have linear time complexity. I believe the
> problem is the substring function. Source for this figure:
>
https://github.com/tdhock/namedCapture-article/blob/master/figure-trackDb-pkgs.R
>
> I suspect that a fix can be accomplished by optimizing the implementation
> of substring, for the special case when the text/subject is a single
> element (character scalar). Right now I notice that the substring R code
> uses rep_len so that the text/subject which is passed to the C code is a
> character vector with the same length as the number of substrings to
> extract. Maybe the C code is calling strlen for each of these (identical)
> text/subject elements?
>
> Anyway, it would be useful to have some feedback to make sure this is
> indeed a bug before I post on bugzilla. (btw thanks Martin for signing me
> up for an account)
>
> Toby
>
	[[alternative HTML version deleted]]

Tomas Kalibera

2019-Feb-22 19:16 UTC

head link

[Rd] Bug: time complexity of substring is quadratic as string size and number of substrings increases

On 2/20/19 7:55 PM, Toby Hocking wrote:> Update: I have observed that stringi::stri_sub is linear time complexity,
> and it computes the same thing as base::substring. figure
>
https://github.com/tdhock/namedCapture-article/blob/master/figure-substring-bug.png
> source:
>
https://github.com/tdhock/namedCapture-article/blob/master/figure-substring-bug.R
>
> To me this is a clear indication of a bug in substring, but again it would
> be nice to have some feedback/confirmation before posting on bugzilla.
>
> Also this suggests a fix -- just need to copy whatever stringi::stri_sub is
> doing.
Thanks for the report, I am working on a patch that will address this.

I confirm there is a lot of potential for speedup. On my system,

'N=200000; x <- substring(paste(rep("A", N),
collapse=""), 1:N, 1:N)'

spends 96% time in checking if the string is ascii and 3% in strlen(); 
if we take advantage of the pre-computed value in the ASCII bit, the 
speed up is about 40x. Of course, with micro-benchmarks, any performance 
limitation can be arbitrarily inflated, users cannot expect to see these 
or any close speedups in applications as a result of the patch. The 
patch is going to do other easy optimizations that will not complicate 
the code, including avoiding the strlen() call (taking advantage of 
pre-computed length of R character object).

Best
Tomas
>
>
>
> On Wed, Feb 20, 2019 at 11:16 AM Toby Hocking <tdhock5 at gmail.com>
wrote:
>
>> Hi all, (and especially hi to Tomas Kalibera who accepted my patch sent
>> yesterday)
>>
>> I believe that I have found another bug, this time in the substring
>> function. The use case that I am concerned with is when there is a
single
>> (character scalar) text/subject, and many substrings to extract. For
example
>>
>> substring("AAAA", 1:4, 1:4)
>>
>> or more generally,
>>
>> N=1000
>> substring(paste(rep("A", N), collapse=""), 1:N,
1:N)
>>
>> The problem I observe is that the time complexity is quadratic in N, as
>> shown on this figure
>>
https://github.com/tdhock/namedCapture-article/blob/master/figure-substring-bug.png
>> source:
>>
https://github.com/tdhock/namedCapture-article/blob/master/figure-substring-bug.R
>>
>> I expected the time complexity to be linear in N.
>>
>> The example above may seem contrived/trivial, but it is indeed relevant
to
>> a number of packages (rex, rematch2, namedCapture) which provide
functions
>> that use gregexpr and then substring to extract the text in the
captured
>> sub-patterns. The figure
>>
https://github.com/tdhock/namedCapture-article/blob/master/figure-trackDb-pkgs.png
>> shows the issue: these packages have quadratic time complexity, whereas
>> other packages (and the gregexpr function with perl=TRUE after applying
the
>> patch discussed yesterday) have linear time complexity. I believe the
>> problem is the substring function. Source for this figure:
>>
https://github.com/tdhock/namedCapture-article/blob/master/figure-trackDb-pkgs.R
>>
>> I suspect that a fix can be accomplished by optimizing the
implementation
>> of substring, for the special case when the text/subject is a single
>> element (character scalar). Right now I notice that the substring R
code
>> uses rep_len so that the text/subject which is passed to the C code is
a
>> character vector with the same length as the number of substrings to
>> extract. Maybe the C code is calling strlen for each of these
(identical)
>> text/subject elements?
>>
>> Anyway, it would be useful to have some feedback to make sure this is
>> indeed a bug before I post on bugzilla. (btw thanks Martin for signing
me
>> up for an account)
>>
>> Toby
>>
> 	[[alternative HTML version deleted]]
>
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel

Reasonably Related Threads

Search for more reasonably related threads

R devel - Feb 2019 - Bug: time complexity of substring is quadratic as string size and number of substrings increases

[Rd] Bug: time complexity of substring is quadratic as string size and number of substrings increases

[Rd] Bug: time complexity of substring is quadratic as string size and number of substrings increases

[Rd] Bug: time complexity of substring is quadratic as string size and number of substrings increases

Reasonably Related Threads