Olivier Boudry
2009-Mar-18 16:16 UTC
[R] Profiling question: string formatting extremely slow
Hi all,

I'm using R to find duplicates in a set of 6 files containing Part Number information. Before applying the intersect method to identify the duplicates I need to normalize the P/Ns: convert the P/N to uppercase if it is alphanumeric, and zero-pad it to 18 characters if it is numeric.

When I apply the pn_formatting function (see code below) to the "Part Number" column of the data.frame (character vectors up to 18 characters long), it consumes a lot of memory: my computer (Windows XP SP3) starts to swap, CPU usage drops to zero, and the run takes hours to complete. Part Number columns can have from 7'000 to 80'000 records, and I have never had the patience to wait for a run of more than 17'000 records to finish.

Is there a way to find out which of the functions used below is the bottleneck: as.integer, is.na, sub, paste, nchar, toupper? Is there a profiler for R and, if so, where can I find documentation on how to use it?

The code:

# String contains digits only (can be converted to an integer)
digits_only <- function(x) { suppressWarnings(!is.na(as.integer(x))) }

# Remove blanks at both ends of a string
trim <- function (x) {
  sub("^\\s+((.*\\S)\\s+)?$", "\\2", x)
}

# P/N formatting
pn_formatting <- function(pn_in) {

  pn_out = trim(pn_in)
  if (digits_only(pn_out)) {

    # Zero padding
    pn_out <- paste("000000000000000000", pn_out, sep="")
    pn_len <- nchar(pn_out)
    pn_out <- substr(pn_out, pn_len - 17, pn_len)

  } else {
    # Uppercase
    pn_out <- toupper(pn_out)
  }
  pn_out
}

Thanks,

Olivier.
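A note on the two helpers in the post, offered as a hedged sketch rather than a statement of what the author intended. as.integer() returns NA for digit strings beyond the 32-bit integer range, so digits_only() would classify a 15-digit part number as non-numeric, and it accepts strings such as "12.5". The trim() pattern also appears to match only strings that both begin and end with whitespace (or are entirely blank), so one-sided padding is left in place. The names is_digits and trim_both below are made up for illustration; both rely only on base R regular expressions.

```r
# Hypothetical replacements for digits_only() and trim();
# these names are illustrative and do not come from the thread.

# TRUE only for strings made up entirely of the digits 0-9
is_digits <- function(x) grepl("^[0-9]+$", x)

# Strip leading and trailing whitespace, whichever side it appears on
trim_both <- function(x) gsub("^[[:space:]]+|[[:space:]]+$", "", x)

pn <- c("  123 ", "abc  ", "  x1", "123456789012345", "12.5")
trimmed    <- trim_both(pn)
numeric_pn <- is_digits(trimmed)

# The as.integer() test overflows on long digit strings and yields NA
overflow <- suppressWarnings(as.integer("123456789012345"))

print(data.frame(pn = pn, trimmed = trimmed, numeric = numeric_pn))
```

On R 3.2.0 or later, the built-in trimws() can stand in for trim_both().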
jim holtman
2009-Mar-18 17:09 UTC
[R] Profiling question: string formatting extremely slow
Try this way. Took less than 1 second for 50,000

> system.time({
+     x <- sample(50000)  # test data
+     x[sample(50000,10000)] <- 'asdfasdf'  # character strings
+     which.num <- grep("^[ 0-9]+$", x)  # find numbers
+     # convert to leading 0
+     x[which.num] <- sprintf("%018.0f", as.numeric(x[which.num]))
+     x[-which.num] <- toupper(x[-which.num])
+ })
   user  system elapsed
   0.25    0.00    0.25
> head(x,30)
 [1] "000000000000026550" "000000000000019100" "000000000000045961" "000000000000031473" "000000000000005031" "000000000000012266"
 [7] "000000000000034418" "000000000000042279" "000000000000041193" "ASDFASDF"           "000000000000005760" "000000000000035659"
[13] "ASDFASDF"           "000000000000008420" "000000000000042220" "ASDFASDF"           "000000000000039903" "000000000000032234"
[19] "000000000000024125" "000000000000032970" "000000000000006814" "000000000000000215" "ASDFASDF"           "000000000000045239"
[25] "ASDFASDF"           "ASDFASDF"           "000000000000043065" "ASDFASDF"           "000000000000007642" "000000000000019196"

--
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem that you are trying to solve?
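Two edge cases are worth guarding against in the approach above; what follows is a sketch under stated assumptions, not Jim's code. First, when grep() finds no match it returns integer(0), and x[-integer(0)] selects nothing rather than everything, so a logical mask from grepl() is the safer subscript. Second, as.numeric() stores doubles, which are exact only up to 2^53, so part numbers of 17 or 18 digits may not survive a round trip through sprintf("%018.0f", ...); padding the string directly avoids the conversion. The name normalize_pn and the width argument are hypothetical. trimws() needs R 3.2.0 or later and strrep() needs R 3.3.0 or later.

```r
# normalize_pn is a hypothetical name, not taken from the thread.
# Assumes R >= 3.3.0 (for trimws() and strrep()).
normalize_pn <- function(x, width = 18L) {
  x <- trimws(x)
  is_num <- grepl("^[0-9]+$", x)   # logical mask: safe even when nothing matches

  if (any(is_num)) {
    num <- x[is_num]
    pad <- width - nchar(num)
    pad[pad < 0L] <- 0L            # leave over-long numbers untouched
    x[is_num] <- paste0(strrep("0", pad), num)
  }

  x[!is_num] <- toupper(x[!is_num])
  x
}

out      <- normalize_pn(c(" 42", "abc-1", "123456789012345678", "7 "))
no_nums  <- normalize_pn(c("ab", "cd"))

# The pitfall with negative integer subscripts: an empty index drops everything
dropped <- c("a", "b")[-integer(0)]
```

The guard on any(is_num) is only there to make the empty case explicit; the uppercase branch needs no guard because a logical mask that is all FALSE simply selects nothing.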
Olivier Boudry
2009-Mar-18 21:53 UTC
[R] Profiling question: string formatting extremely slow
Bill, Jim and Martin,

Great! The code is much faster and even looks more R-ish. I'm very new to R and have some difficulty getting rid of my procedural programming habits.

Many thanks to all for the great help.

Olivier.

On Wed, Mar 18, 2009 at 9:38 PM, William Dunlap <wdunlap@tibco.com> wrote:
> Olivier,
>
> You can profile R code with Rprof(). E.g.,
>
>   Rprof(tmp<-tempfile()) # start profiling, saving results in a file
>   ... run your code: sapply(x, pn_formatting) ...
>   Rprof(NULL) # stop profiling
>   summaryRprof(tmp) # analyze the file and present results in a pair of data.frames
>
>   $by.self
>                       self.time self.pct total.time total.pct
>   sub                      2.26     24.3       2.26      24.3
>   structure                0.68      7.3       1.42      15.3
>   FUN                      0.66      7.1       8.82      94.8
>   withCallingHandlers      0.62      6.7       4.14      44.5
>   paste                    0.54      5.8       0.56       6.0
>   toupper                  0.44      4.7       0.46       4.9
>   as.integer               0.42      4.5       3.34      35.9
>   makeRestartList          0.38      4.1       1.36      14.6
>   unlist                   0.34      3.7       0.40       4.3
>   substr                   0.30      3.2       0.36       3.9
>   ...
>
>   $by.total
>                        total.time total.pct self.time self.pct
>   sapply                     9.30     100.0      0.00      0.0
>   lapply                     8.88      95.5      0.08      0.9
>   FUN                        8.82      94.8      0.66      7.1
>   digits_only                4.18      44.9      0.02      0.2
>   suppressWarnings           4.16      44.7      0.02      0.2
>   withCallingHandlers        4.14      44.5      0.62      6.7
>   as.integer                 3.34      35.9      0.42      4.5
>   withRestarts               2.92      31.4      0.10      1.1
>   .signalSimpleWarning       2.92      31.4      0.00      0.0
>   trim                       2.34      25.2      0.08      0.9
>   sub                        2.26      24.3      2.26      24.3
>
> You didn't mention that you used sapply() on your pn_formatting
> function, which I think you must have since you use a non-vectorized
> if statement in it. If you vectorize the numeric/non-numeric choice,
> as in the following code, you get a huge speedup because you don't
> have to use sapply:
>
>   pn_formatting1 <- function(pn_in) {
>       pn_out = trim(pn_in)
>       numeric <- digits_only(pn_out)
>       pn_out[!numeric] <- toupper(pn_out[!numeric])
>       pn_out[numeric] <- {
>           # Zero padding
>           tmp <- paste("000000000000000000", pn_out[numeric], sep="")
>           pn_len <- nchar(tmp)
>           substr(tmp, pn_len - 17, pn_len)
>       }
>       pn_out
>   }
>
> Jim's code is a bit cleaner (to my taste) but runs at the same
> speed as yours after this simple modification. His contains an
> error, in that it uses integer subscripts and does not check that
> there is at least one numeric entry in the input (in that case
> pn_in[-which.num] selects nothing instead of all of pn_in, and
> sprintf() dies because one of its arguments is 0-long).
>
> Bill Dunlap
> TIBCO Software Inc - Spotfire Division
> wdunlap tibco.com
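To make the profiling advice concrete, here is a small self-contained sketch that profiles a per-element call and then checks that a vectorized version gives the same result. pn_scalar and pn_vector are illustrative names modeled on the functions in the thread, with trimming left out for brevity, and the test data follows Jim's example at a smaller size. It assumes an R build with profiling support, which is the usual default. summaryRprof() is wrapped in tryCatch() because it can raise an error when the run is too short to collect any samples.

```r
# Illustrative names; modeled on, but not identical to, the thread's functions.
digits_only <- function(x) suppressWarnings(!is.na(as.integer(x)))

# Per-element version: the scalar `if` forces one call per part number
pn_scalar <- function(pn) {
  if (digits_only(pn)) {
    tmp <- paste("000000000000000000", pn, sep = "")
    substr(tmp, nchar(tmp) - 17, nchar(tmp))
  } else {
    toupper(pn)
  }
}

# Vectorized version: one pass over the whole column, no sapply()
pn_vector <- function(pn) {
  numeric <- digits_only(pn)
  pn[!numeric] <- toupper(pn[!numeric])
  tmp <- paste("000000000000000000", pn[numeric], sep = "")
  pn[numeric] <- substr(tmp, nchar(tmp) - 17, nchar(tmp))
  pn
}

set.seed(1)
x <- as.character(sample(20000))
x[sample(20000, 4000)] <- "asdfasdf"

prof_file <- tempfile()
Rprof(prof_file)                              # start profiling into a temp file
slow <- sapply(x, pn_scalar, USE.NAMES = FALSE)
Rprof(NULL)                                   # stop profiling

# Self-time table; NULL if no samples were collected
prof <- tryCatch(summaryRprof(prof_file)$by.self, error = function(e) NULL)
if (!is.null(prof)) print(head(prof))

fast <- pn_vector(x)
print(system.time(pn_vector(x)))
```

The by.self table shows where time is spent inside each function, while by.total attributes time to callers as well; in the profile quoted above, the warning machinery under suppressWarnings() and the per-call sub() are what dominate once the function is called once per element.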