thr3ads.net - similar to: "Bug: time complexity of substring is quadratic as string size and number of substrings increases"

Bug: time complexity of substring is quadratic as string size and number of substrings increases

2019 Feb 22

On 2/20/19 7:55 PM, Toby Hocking wrote:
> Update: I have observed that stringi::stri_sub is linear time complexity,
> and it computes the same thing as base::substring. figure
> https://github.com/tdhock/namedCapture-article/blob/master/figure-substring-bug.png
> source:
> https://github.com/tdhock/namedCapture-article/blob/master/figure-substring-bug.R
>
> To me this is a

Bug: time complexity of substring is quadratic as string size and number of substrings increases

2019 Feb 20

Update: I have observed that stringi::stri_sub is linear time complexity, and it computes the same thing as base::substring. figure https://github.com/tdhock/namedCapture-article/blob/master/figure-substring-bug.png source: https://github.com/tdhock/namedCapture-article/blob/master/figure-substring-bug.R To me this is a clear indication of a bug in substring, but again it would be nice to have

patch for gregexpr(perl=TRUE)

2019 Feb 19

Hi all, Several people have noticed that gregexpr is very slow for large subject strings when perl=TRUE is specified.
- https://stackoverflow.com/questions/31216299/r-faster-gregexpr-for-very-large-strings
- http://r.789695.n4.nabble.com/strsplit-perl-TRUE-gregexpr-perl-TRUE-very-slow-for-long-strings-td4727902.html
- https://stat.ethz.ch/pipermail/r-help/2008-October/178451.html
I figured out

Feature request: non-dropping regmatches/strextract

2019 Aug 15

A very common use case for regmatches is to extract regex matches into a new column in a data.frame (or data.table, etc.) or otherwise use the extracted strings alongside the input. However, the default behavior is to drop empty matches, which results in mismatches in column length if reassignment is done without subsetting. For consistency with other R functions and compatibility with this use
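The dropping behavior described in this request can be sketched in base R as follows. This is only an illustration under assumptions: the input vector and the pattern are made up, and the NA-filling idiom is one common workaround, not the change proposed in the thread.

```r
# Illustrative input (hypothetical): the second element contains no digits.
x <- c("a1", "bb", "c22")

# regexpr() returns -1 for elements without a match.
m <- regexpr("[0-9]+", x)

# regmatches() drops the non-matching elements,
# so its result is shorter than x.
dropped <- regmatches(x, m)

# Keep the result aligned with x by filling only the matched positions.
aligned <- rep(NA_character_, length(x))
aligned[m > 0] <- dropped
```

Here `aligned` has the same length as `x`, so it can be assigned directly as a new data.frame column.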

Error in substring: invalid multibyte string

2020 Jun 27

Thanks for the quick response Ivan. readLines with encoding='latin1' works for me (on Ubuntu). However I was more concerned with the inconsistency in results between substr and regexpr. I was expecting that if one of them errors because of an unknown encoding then the other should as well. Even better, if regexpr works, why shouldn't substr work as well? Incidentally the analogous

Feature request: non-dropping regmatches/strextract

2019 Aug 29

If you want "to extract regex matches into a new column in a data.frame" then there are some package functions which do exactly that. Three examples are namedCapture::df_match_variable, rematch2::bind_re_match, and tidyr::extract. For a more detailed discussion see my R Journal submission (under review) about regular expression packages,

Error in substring: invalid multibyte string

2020 Jun 26

Hi all, I'm getting the following error from substring:
> substr("<I>Jens Oehlschl\xe4gel-Akiyoshi", 1, 100)
Error in substr("<I>Jens Oehlschl\xe4gel-Akiyoshi", 1, 100) :
  invalid multibyte string at '<e4>gel-A<6b>iyoshi'
Is that normal / intended? I've tried setting the Encoding/locale to Latin-1/UTF-8 but that does not help. nchar
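One commonly used way to handle such bytes is to convert them explicitly with iconv(), sketched below. This assumes the input really is Latin-1 and that the session can represent UTF-8. Whether it resolves the poster's case cannot be told from the truncated snippet, and the variable names are illustrative.

```r
# Same bytes as in the report; 0xE4 is "a with diaeresis" in Latin-1.
raw_text <- "<I>Jens Oehlschl\xe4gel-Akiyoshi"

# Assumption: the bytes are Latin-1. Convert them to UTF-8 explicitly
# instead of only declaring an encoding.
fixed_text <- iconv(raw_text, from = "latin1", to = "UTF-8")

# The original bytes are not valid UTF-8; the converted string is.
validUTF8(raw_text)    # FALSE
validUTF8(fixed_text)  # TRUE

# substr() can now count characters in the converted string.
head_part <- substr(fixed_text, 1, 100)
```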

valgrind false positive on R startup?

2020 Jun 09

Hi all, I'm on Ubuntu 18.04, running R-4.0.0 which I compiled from source, and using valgrind I am always seeing the following message. Does anybody else see that? Is that a known false positive? Any ideas how to fix/suppress? Seems related to TRE, do I need to upgrade that?
(base) tdhock at maude-MacBookPro:~/R/binsegRcpp$ R --vanilla -d valgrind -e 'extSoftVersion()'
==9565==

write.csv performance improvements?

2023 Mar 30

Dear R-devel, I did a systematic comparison of write.csv with similar functions, and observed two asymptotic inefficiencies that could be improved. 1. write.csv is quadratic time (N^2) in the number of columns N. Can write.csv be improved to use a linear time algorithm, so it can handle CSV files with larger numbers of columns? For more details including figures and session info, please see

str_count counts the substring

2013 Sep 30

I am trying to count the number of times a word occurs in a string, using the str_count function from the package stringr. This function counts substrings as well. Is there a way to exclude the substring count and count only exact matches? Thanks in advance. -- Thanks and Regards, Agrima Srivastava
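A minimal base R sketch of whole-word counting, assuming "exact match" means a match delimited by word boundaries. The helper name count_word and the sample sentence are hypothetical. The same `\\b...\\b` pattern can also be passed to stringr::str_count.

```r
# Hypothetical helper: count occurrences of `word` as a whole word in `text`.
# Note: `word` is inserted into a regular expression unescaped, so it
# should not contain regex metacharacters.
count_word <- function(text, word) {
  pattern <- paste0("\\b", word, "\\b")
  m <- gregexpr(pattern, text, perl = TRUE)[[1]]
  # gregexpr() returns -1 when there is no match, so count positive starts.
  sum(m > 0)
}

sentence <- "the cat sat with another cat, concatenated"
whole_words <- count_word(sentence, "cat")  # "concatenated" is not counted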

Bug: time complexity of substring is quadratic

2019 Feb 23

> From: Tomas Kalibera <tomas.kalibera at gmail.com>
>
> Thanks for the report, I am working on a patch that will address this.
>
> I confirm there is a lot of potential for speedup. On my system,
>
> 'N=200000; x <- substring(paste(rep("A", N), collapse=""), 1:N, 1:N)'
>
> spends 96% time in checking if the string is ascii and 3%
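The one-liner quoted above can be turned into a small timing sketch to observe how run time grows with N. The helper name time_substring and the chosen sizes are assumptions for illustration. Timings depend on the machine and the R version, so no expected output is given.

```r
# Hypothetical helper: time the quoted substring() call for a given N.
time_substring <- function(N) {
  s <- paste(rep("A", N), collapse = "")
  system.time(substring(s, 1:N, 1:N))[["elapsed"]]
}

# Sizes are kept small so the sketch runs quickly.
sizes <- c(1000, 2000, 4000, 8000)
timings <- vapply(sizes, time_substring, numeric(1))

# Under linear scaling, doubling N should roughly double the time;
# under quadratic scaling it should roughly quadruple it.
result <- data.frame(N = sizes, seconds = timings)
```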

install.packages bug (PR#8873)

2006 May 17

Hello, I've been using R for about 3 years now and I'm pretty sure this is a bug. I'm using R 2.2.0. The way R is set up to get packages from CRAN using install.packages is really convenient --- if you are installing to your system's main package directory. However, I observe the following problem: I want package X but it requires package Y. Further, I have neither package

how to count the total number of (INCLUDING overlapping) occurrences of a substring within a string?

2009 Dec 20

Last one for you guys: The command:
  length(gregexpr('cus','hocus pocus')[[1]])
  [1] 2
returns the number of times the substring 'cus' appears in 'hocus pocus' (which is two). It's returning the number of **disjoint** matches. So:
  length(gregexpr('aa','aaa')[[1]])
  [1] 1
returns 1. **What I want to do:** I'm looking for a way to count
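One way to count overlapping occurrences is to wrap the pattern in a zero-width lookahead, so each match consumes no characters. This is a sketch under assumptions: the helper name count_overlapping is hypothetical, and the pattern is treated as a regular expression, which requires perl = TRUE for lookahead support.

```r
# Hypothetical helper: count matches of `pattern` in `text`, overlaps included.
count_overlapping <- function(pattern, text) {
  # (?=...) matches at every position where `pattern` starts,
  # without consuming characters, so overlapping starts are all found.
  m <- gregexpr(paste0("(?=", pattern, ")"), text, perl = TRUE)[[1]]
  # gregexpr() returns -1 when there is no match.
  sum(m > 0)
}

count_overlapping("aa", "aaa")           # 2
count_overlapping("cus", "hocus pocus")  # 2
```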

stats::reshape quadratic in number of input columns

2019 Oct 29

Hi R-core, I have been performance testing R packages for wide-to-tall data reshaping and for the most part I see they differ by constant factors. However in one test, which involves converting into multiple output columns, I see that stats::reshape is in fact quadratic in the number of input columns. For example take the iris data, which has 4 input columns to reshape, and the desired output

read.csv quadratic time in number of columns

2023 Mar 30

Dear R-devel, A number of people have observed anecdotally that read.csv is slow for large number of columns, for example: https://stackoverflow.com/questions/7327851/read-csv-is-extremely-slow-in-reading-csv-files-with-large-numbers-of-columns I did a systematic comparison of read.csv with similar functions, and observed that read.csv is quadratic time (N^2) in the number of columns N, whereas

locate substring in the string it belong to

2009 Jul 20

Hi R users, I am trying to generate the indices for locating a substring in the string it comes from. Given the length of the string, it will take too long using combn() for further comparison. I am wondering if R has any built-in function for this purpose. To make it concrete:
this.substring="cc"
this.string="ccc"
start.location=1,2
end.location=2,3
Thanks in advance, Kevin
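The example in the question can be reproduced with a simple fixed-width scan, sketched below. The helper name locate_all is hypothetical, the substring is treated as a literal (not a regex), and the approach is meant for short to moderate strings.

```r
# Hypothetical helper: all start/end positions of literal `sub` within `s`,
# including overlapping occurrences.
locate_all <- function(sub, s) {
  k <- nchar(sub)
  n <- nchar(s)
  # No candidate window exists for an empty or too-long substring.
  if (k == 0 || k > n) {
    return(list(start = integer(0), end = integer(0)))
  }
  # Compare every window of width k against `sub`.
  starts <- seq_len(n - k + 1)
  hits <- starts[substring(s, starts, starts + k - 1L) == sub]
  list(start = hits, end = hits + k - 1L)
}

res <- locate_all("cc", "ccc")
res$start  # 1 2
res$end    # 2 3
```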

use sliding window to count substrings found in large string

2010 Jul 07

Hello all, I'm looking for advice on how to do some tests on strings. What I want to do is the following (just an example; real strings/sequences are about 200-400 characters long). Given a set of strings:
String1 abcdefgh
String2 bcdefgop
use a sliding window of size x to create a vector of all subsequences of size x found in the set (order matters!). Now create, for every string

mclapply memory leak?

2015 Sep 02

Dear R-devel, I am running mclapply with many iterations over a function that modifies nothing and makes no copies of anything. It is taking up a lot of memory, so it seems to me like this is a bug. Should I post this to bugs.r-project.org? A minimal reproducible example can be obtained by first starting a memory monitoring program such as htop, and then executing the following code while

Spliting columns, strings or reg exp returning substrings

2009 Sep 25

Currently as the first column in a data frame I have string values in the format xx_yy - I want to create a new column with just the substring xx (for each row in turn). Three possible ways to do this might be (1) split the string by '_' using strsplit and paste the first of the resulting variables into a new column, but I have been unable to do this for each row of my data frame in turn
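The xx_yy case described here can be handled without a per-row loop, since both sub() and strsplit() are vectorized. A minimal sketch, assuming a data frame with a character column; the column names id, prefix and prefix_split and the sample values are hypothetical.

```r
# Hypothetical data frame with values in the form xx_yy.
df <- data.frame(id = c("ab_cd", "ef_gh", "ij_kl"), stringsAsFactors = FALSE)

# Option 1: delete everything from the first underscore onward.
df$prefix <- sub("_.*$", "", df$id)

# Option 2: split on the literal "_" and keep the first piece of each row.
pieces <- strsplit(df$id, "_", fixed = TRUE)
df$prefix_split <- vapply(pieces, function(parts) parts[1], character(1))
```

Both options produce one value per row, so the new columns line up with the original ones.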

Getting many substrings but only loading the original string one time.

2011 Apr 11

Hi All, I'm looking for a way to get many substrings from a longer string and then stitch them together. But, since the longer string is really, really long (like 250 MB long), I don't want to do this in a loop and load and re-load the longer string many times. Does anybody have an idea? Maybe I could pass in two vectors (the first would have the starting coordinates, and the second
