bill.hopkins at level3.com
2009-Nov-09  23:40 UTC
[Rd] textConnection performance quadratic (PR#14053)
Full_Name: William E. Hopkins
Version: 2.9.0
OS: Windows XP
Submission from: (NULL) (209.244.4.106)

textConnection() has quadratic performance.

A function I wrote was taking an outrageous amount of time to execute on a large character vector (a small test set was used for functional development). I created a test harness to execute the function and gather stats (system.time) for various dataset sizes (datasets generated by sample() from a very large set). If I used textConnection() to provide input to read.csv(), the run time grew quadratically with dataset size. However, if I had the function write the character vector to a temp file and then read the data back in via read.csv(), the run time grew linearly.

The reason for using textConnection() was that the character vector was within a data frame read in via read.csv(). The character vector (URLs) needed to be parsed into separate vectors, but no mechanism exists to do that directly (that I know of). So I used sub() to extract the proper pieces and put commas in between, so that I could use read.csv() to read the comma-separated strings directly into vectors.
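For readers who want to try the comparison described above, here is a minimal sketch of the two approaches. The sample data, the helper names read_via_connection() and read_via_tempfile(), and the dataset sizes are all assumptions made for illustration; they are not the code from the original report, and timings will differ across R versions and platforms.

```r
# Hypothetical data: comma-joined URL pieces, similar in shape to what
# sub() would produce in the report (the real data is not shown there).
n <- 1000
lines <- paste("http", sprintf("host%d.example.com", seq_len(n)),
               "/index.html", sep = ",")

# Approach 1: feed the character vector to read.csv() via textConnection().
read_via_connection <- function(x) {
  con <- textConnection(x)
  on.exit(close(con))
  read.csv(con, header = FALSE, stringsAsFactors = FALSE)
}

# Approach 2: write the vector to a temp file, then read the file back.
read_via_tempfile <- function(x) {
  tmp <- tempfile(fileext = ".csv")
  on.exit(unlink(tmp))
  writeLines(x, tmp)
  read.csv(tmp, header = FALSE, stringsAsFactors = FALSE)
}

a <- read_via_connection(lines)
b <- read_via_tempfile(lines)

# Time both approaches at a few (arbitrary) sizes to see how each scales.
for (size in c(1000, 2000, 4000)) {
  x <- sample(lines, size, replace = TRUE)
  t_con  <- system.time(read_via_connection(x))[["elapsed"]]
  t_file <- system.time(read_via_tempfile(x))[["elapsed"]]
  cat(size, "rows: textConnection", t_con, "s, temp file", t_file, "s\n")
}
```

Doubling the size and comparing elapsed times gives a rough indication of whether an approach scales linearly (time roughly doubles) or quadratically (time roughly quadruples).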
Gabor Grothendieck
2009-Nov-10  00:10 UTC
[Rd] textConnection performance quadratic (PR#14053)
strsplit can split by separators and strapply in the gsubfn package can split by content.
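As a rough illustration of the strsplit suggestion, the sketch below splits comma-joined strings into separate vectors without going through read.csv() at all. The sample strings and variable names are assumptions, and the code presumes every string contains exactly three comma-separated fields; strapply from gsubfn is not shown here.

```r
# Hypothetical comma-joined strings, as might be produced by sub().
joined <- c("http,www.example.com,/a/index.html",
            "https,cran.example.org,/src/base")

# Split each string on literal commas; the result is a list of character vectors.
parts <- strsplit(joined, ",", fixed = TRUE)

# Bind the pieces into a character matrix (assumes equal field counts per row).
mat <- do.call(rbind, parts)

# Pull each column out as its own vector.
scheme <- mat[, 1]
host   <- mat[, 2]
path   <- mat[, 3]
```

If the field separator is already present in the original URLs (for example "/"), strsplit can be applied to them directly, which would make the intermediate sub() step unnecessary.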