The first thing to do is to run Rprof and see where the time is going;
here is the output from my computer:
                      self.time self.pct total.time total.pct
tolower                    4.42    39.46       4.42     39.46
sub                        3.56    31.79       3.56     31.79
nchar                      1.54    13.75       1.54     13.75
canonicalize.language      0.62     5.54      11.14     99.46
!=                         0.52     4.64       0.52      4.64
==                         0.26     2.32       0.26      2.32
&                          0.22     1.96       0.22      1.96
gc                         0.06     0.54       0.06      0.54
More than half the time is in 'tolower' and 'nchar', so it is not all
'sub's problem.
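For reference, a profile like this can be collected roughly as follows. This is only a sketch: the test vector `x` and the temporary file are made up for illustration (the real data has about 10 million elements), and the timings will differ from machine to machine.

```r
# Hypothetical test data: the six example codes, repeated many times.
x <- rep(c("aa", "bb-cc", "DD-abc", "eee", "ff_FF", "C"), 5e5)

# The original function from the question.
canonicalize.language <- function (s) {
  s <- tolower(s)
  long <- nchar(s) == 5
  s[long] <- sub("^([a-z]{2})[-_][a-z]{2}$", "\\1", s[long])
  s[nchar(s) != 2 & s != "c"] <- "unknown"
  s
}

prof.file <- tempfile()           # file that receives the profiler samples
Rprof(prof.file)                  # start profiling
res <- canonicalize.language(x)
Rprof(NULL)                       # stop profiling
summaryRprof(prof.file)$by.self   # time spent in each function itself
```

The `by.self` table is the one to read first, since it shows where the time is actually spent rather than which caller it is charged to.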
This version runs a little faster since it does not need the 'tolower':
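canonicalize.language <- function (s) {
  # s <- tolower(s)   # dropped: [[:alpha:]] matches both cases
  # note: the result keeps the input's case, so "C" now gives "unknown"
  long <- nchar(s) == 5
  s[long] <- sub("^([[:alpha:]]{2})[-_][[:alpha:]]{2}$","\\1",s[long])
  s[nchar(s) != 2 & s != "c"] <- "unknown"
  s
}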
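If the vector holds only a modest number of distinct codes (an assumption on my part, I have not seen the real data), another option is to do the string work once per distinct value and map the results back with match(). The name canonicalize.unique is made up for this sketch; it keeps the tolower() call so the output matches the original function.

```r
canonicalize.unique <- function (s) {
  u <- unique(s)        # distinct input values
  idx <- match(s, u)    # position of each element of s within u
  # Same logic as the original function, applied to the distinct values only.
  u <- tolower(u)
  long <- nchar(u) == 5
  u[long] <- sub("^([a-z]{2})[-_][a-z]{2}$", "\\1", u[long])
  u[nchar(u) != 2 & u != "c"] <- "unknown"
  u[idx]                # expand back to the full length of s
}

out <- canonicalize.unique(c("aa", "bb-cc", "DD-abc", "eee", "ff_FF", "C"))
out
# [1] "aa"      "bb"      "unknown" "unknown" "ff"      "c"
```

Whether this wins depends on the data: unique() and match() have their own cost, so with mostly distinct values it could well be slower. It is worth timing with system.time() on the real vector.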
On Fri, Sep 14, 2012 at 12:30 PM, Sam Steingold <sds at gnu.org> wrote:
> this function is supposed to canonicalize the language:
>
> --8<---------------cut here---------------start------------->8---
> canonicalize.language <- function (s) {
>   s <- tolower(s)
>   long <- nchar(s) == 5
>   s[long] <- sub("^([a-z]{2})[-_][a-z]{2}$","\\1",s[long])
>   s[nchar(s) != 2 & s != "c"] <- "unknown"
>   s
> }
> canonicalize.language(c("aa","bb-cc","DD-abc","eee","ff_FF","C"))
> [1] "aa"      "bb"      "unknown" "unknown" "ff"      "c"
> --8<---------------cut here---------------end--------------->8---
>
> it does what I want it to do, but it takes 4.5 seconds on a vector of
> length 10,256,341 - I wonder if I might be doing something awfully stupid.
> I thought that sub() was slow, but my second attempt:
> --8<---------------cut here---------------start------------->8---
> canonicalize.language <- function (s) {
>   s <- tolower(s)
>   good <- nchar(s) == 5 & substr(s,3,3) %in% c("_","-")
>   s[good] <- substr(s[good],1,2)
>   s[nchar(s) != 2 & s != "c"] <- "unknown"
>   s
> }
> --8<---------------cut here---------------end--------------->8---
> was even slower (6.4 sec).
>
> My two concerns are:
>
> 1. avoid allocating many small objects which are never collected
> 2. run fast
>
> Which would be the best implementation?
>
> Thanks a lot for your insight!
--
Jim Holtman
Data Munger Guru
What is the problem that you are trying to solve?
Tell me what you want to do, not how you want to do it.