Dear R-help list, Part of a program I wrote seem to take a significant amount of time, therefore I am looking for an alternative approach. In order to explain what is does: - the input is a sorted vector of integer numbers - some higher numbers may be derived (using a mathematical formula) from lower numbers, therefore they should be eliminated - at the end, the vector should contain only uniquely defined numbers Pet hypothetical example, input vector: - 2 3 4 5 6 7 8 9 10 - number 2 generates 4, 7, 10 - 2 3 5 6 8 9 (surviving vector) - number 3 generates 5 and 9 - 2 3 6 8 (surviving vector) - number 6 generates 8 - final surviving vector 2 3 6 Function foo(x, ...) generates the numbers, my current approach being: #### index <- 0 while ((index <- index + 1) < length(numbers)) { numbers <- setdiff(numbers, foo(numbers[index])) } #### This seem to take quite some time (but I don't know any other way of doing it), hence my question(s): - would there be another (quicker) implementation in R? - alternatively, should I go for a C implementation? (actually, I did create a C implementation, but it doesn't bring any more speed... it is actually a bit slower). A real-life pet example, using the function findSubsets() from the QCA package (our foo function above): #### library(QCA) testfoo <- function(x, y) { index <- 0 while((index <- index + 1) < length(x)) { x <- setdiff(x, findSubsets(y, x[index], max(x))) } return(x) } nofl <- rep(3, 14) set.seed(12345) numbers <- sort(sample(seq(prod(nofl)), 1000000)) system.time(result <- testfoo(numbers, nofl)) #### user system elapsed 8.168 2.049 10.148 Any hint will be highly appreciated, thanks in advance, Adrian -- Adrian Dusa Romanian Social Data Archive 1, Schitu Magureanu Bd. 050025 Bucharest sector 5 Romania Tel.:+40 21 3126618 \ ? ? ? ?+40 21 3120210 / int.101 Fax: +40 21 3158391
One place to start is to use Rprof to see where time is being spent. I used the sample you sent and this is what I got: 0 16.7 root 1. 16.2 system.time 2. . 16.1 testfoo 3. . . 16.1 setdiff 4. . . . 8.2 as.vector 5. . . . . 8.2 findSubsets 6. . . . . . 6.4 increment 7. . . . . . . 4.2 as.vector 8. . . . . . . . 3.6 outer 9. . . . . . . . . 0.3 rep.int 7. . . . . . . 1.6 c 7. . . . . . . 0.2 max 4. . . . 7.9 unique 5. . . . . 7.3 match 5. . . . . 0.3 unique.default 1. 0.5 sort 2. . 0.5 standardGeneric 3. . . 0.3 sample 3. . . 0.2 sort 4. . . . 0.2 sort.default 5. . . . . 0.2 sort.int Of the 16.7 seconds to execute the code, 16.1 was taken up in 'setdiff'. Maybe there is some other way you can determine the difference. So if you continue to use 'setdiff', it does not look like there is much that can be done. On Wed, Jun 27, 2012 at 10:36 AM, Adrian Du?a <dusa.adrian at gmail.com> wrote:> Dear R-help list, > > Part of a program I wrote seem to take a significant amount of time, > therefore I am looking for an alternative approach. > In order to explain what is does: > > - the input is a sorted vector of integer numbers > - some higher numbers may be derived (using a mathematical formula) > from lower numbers, therefore they should be eliminated > - at the end, the vector should contain only uniquely defined numbers > > Pet hypothetical example, input vector: > - 2 3 4 5 6 7 8 9 10 > - number 2 generates 4, 7, 10 > - 2 3 5 6 8 9 (surviving vector) > - number 3 generates 5 and 9 > - 2 3 6 8 (surviving vector) > - number 6 generates 8 > - final surviving vector 2 3 6 > > Function foo(x, ...) generates the numbers, my current approach being: > #### > index <- 0 > while ((index <- index + 1) < length(numbers)) { > ? ?numbers <- setdiff(numbers, foo(numbers[index])) > } > #### > > This seem to take quite some time (but I don't know any other way of > doing it), hence my question(s): > - would there be another (quicker) implementation in R? > - alternatively, should I go for a C implementation? > > (actually, I did create a C implementation, but it doesn't bring any > more speed... it is actually a bit slower). > > A real-life pet example, using the function findSubsets() from the QCA > package (our foo function above): > > #### > library(QCA) > testfoo <- function(x, y) { > ? ?index <- 0 > ? ?while((index <- index + 1) < length(x)) { > ? ? ? ?x <- setdiff(x, findSubsets(y, x[index], max(x))) > ? ?} > ? ?return(x) > } > > nofl <- rep(3, 14) > set.seed(12345) > numbers <- sort(sample(seq(prod(nofl)), 1000000)) > > system.time(result <- testfoo(numbers, nofl)) > #### > ? user ?system elapsed > ?8.168 ? 2.049 ?10.148 > > Any hint will be highly appreciated, thanks in advance, > Adrian > > -- > Adrian Dusa > Romanian Social Data Archive > 1, Schitu Magureanu Bd. > 050025 Bucharest sector 5 > Romania > Tel.:+40 21 3126618 \ > ? ? ? ?+40 21 3120210 / int.101 > Fax: +40 21 3158391 > > ______________________________________________ > R-help at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code.-- Jim Holtman Data Munger Guru What is the problem that you are trying to solve? Tell me what you want to do, not how you want to do it.
On Wed, Jun 27, 2012 at 05:36:08PM +0300, Adrian Du?a wrote:> Dear R-help list, > > Part of a program I wrote seem to take a significant amount of time, > therefore I am looking for an alternative approach. > In order to explain what is does: > > - the input is a sorted vector of integer numbers > - some higher numbers may be derived (using a mathematical formula) > from lower numbers, therefore they should be eliminated > - at the end, the vector should contain only uniquely defined numbers > > Pet hypothetical example, input vector: > - 2 3 4 5 6 7 8 9 10 > - number 2 generates 4, 7, 10 > - 2 3 5 6 8 9 (surviving vector) > - number 3 generates 5 and 9 > - 2 3 6 8 (surviving vector) > - number 6 generates 8 > - final surviving vector 2 3 6 > > Function foo(x, ...) generates the numbers, my current approach being: > #### > index <- 0 > while ((index <- index + 1) < length(numbers)) { > numbers <- setdiff(numbers, foo(numbers[index])) > } > #### > > This seem to take quite some time (but I don't know any other way of > doing it), hence my question(s): > - would there be another (quicker) implementation in R? > - alternatively, should I go for a C implementation? > > (actually, I did create a C implementation, but it doesn't bring any > more speed... it is actually a bit slower). > > A real-life pet example, using the function findSubsets() from the QCA > package (our foo function above): > > #### > library(QCA) > testfoo <- function(x, y) { > index <- 0 > while((index <- index + 1) < length(x)) { > x <- setdiff(x, findSubsets(y, x[index], max(x))) > } > return(x) > } > > nofl <- rep(3, 14) > set.seed(12345) > numbers <- sort(sample(seq(prod(nofl)), 1000000))Hi. How large the numbers are? If the bound 3^14 = 4782969 used above apply also to the real situation, then it is possible to represent the set using a logical vector of this length, which has TRUE for the numbers present in the set and FALSE for the other. Removing a number means to set its bit from TRUE to FALSE, which is a very simple operation. # prepare an example numbers <- c(2, 3, 6, 8, 12, 17) x <- rep(FALSE, times=max(numbers)) x[numbers] <- TRUE which(x) [1] 2 3 6 8 12 17 #remove 3, 8, 12 from the set x[c(3, 8, 12)] <- FALSE which(x) [1] 2 6 17 Hope this helps. Petr Savicky.