Hervé Pagès
2015-Jan-06 21:02 UTC
[Rd] setequal: better readability, reduced memory footprint, and minor speedup
Hi,

Current implementation:

    setequal <- function (x, y)
    {
        x <- as.vector(x)
        y <- as.vector(y)
        all(c(match(x, y, 0L) > 0L, match(y, x, 0L) > 0L))
    }

First, what about replacing 'match(x, y, 0L) > 0L' and 'match(y, x, 0L) > 0L' with 'x %in% y' and 'y %in% x', respectively? They're strictly equivalent, but the latter form is a lot more readable than the former (isn't this the "raison d'être" of %in%?):

    setequal <- function (x, y)
    {
        x <- as.vector(x)
        y <- as.vector(y)
        all(c(x %in% y, y %in% x))
    }

Furthermore, replacing 'all(c(x %in% y, y %in% x))' with 'all(x %in% y) && all(y %in% x)' improves readability even more and, more importantly, reduces memory footprint significantly on big vectors (e.g. by 15% on integer vectors with 15M elements):

    setequal <- function (x, y)
    {
        x <- as.vector(x)
        y <- as.vector(y)
        all(x %in% y) && all(y %in% x)
    }

It also seems to speed things up a little bit (not in a significant way, though).

Cheers,
H.

--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fredhutch.org
Phone: (206) 667-5791
Fax: (206) 667-1319
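The three variants in the message above can be compared side by side. The following is only a sketch: the names setequal_v1, setequal_v2 and setequal_v3 and the test inputs are made up for this comparison and do not come from the original post. It checks that the variants agree on a few small inputs; it does not measure memory or speed.

```r
# Hypothetical labels for the three forms discussed in the message.
setequal_v1 <- function(x, y) {
  x <- as.vector(x)
  y <- as.vector(y)
  all(c(match(x, y, 0L) > 0L, match(y, x, 0L) > 0L))
}

setequal_v2 <- function(x, y) {
  x <- as.vector(x)
  y <- as.vector(y)
  all(c(x %in% y, y %in% x))
}

setequal_v3 <- function(x, y) {
  x <- as.vector(x)
  y <- as.vector(y)
  # && short-circuits: y %in% x is only evaluated when all(x %in% y) is TRUE,
  # and no combined logical vector of length(x) + length(y) is built by c().
  all(x %in% y) && all(y %in% x)
}

# Illustrative inputs (assumptions, not from the thread).
cases <- list(
  list(1:5, 5:1),                    # same elements, different order
  list(c(1L, 2L, 2L), c(2L, 1L)),    # duplicates do not matter
  list(1:3, 1:4),                    # y has an extra element
  list(letters[1:3], c("a", "b")),   # x has an extra element
  list(integer(0), integer(0))       # two empty sets
)

results <- vapply(cases, function(p) {
  c(v1 = setequal_v1(p[[1]], p[[2]]),
    v2 = setequal_v2(p[[1]], p[[2]]),
    v3 = setequal_v3(p[[1]], p[[2]]))
}, logical(3))

print(results)
```

Each column of the printed matrix corresponds to one input pair; the three rows should be identical if the variants are equivalent on these inputs.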
peter dalgaard
2015-Jan-08 21:30 UTC
[Rd] setequal: better readability, reduced memory footprint, and minor speedup
If you look at the definition of %in%, you'll find that it is implemented using match, so if we did as you suggest, I give it about three days before someone suggests to inline the function call... Readability of source code is not usually our prime concern.

The && idea does have some merit, though.

Apropos, why is there no setcontains()?

-pd

--
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Email: pd.mes at cbs.dk  Priv: PDalgd at gmail.com
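The point that %in% is a thin wrapper around match can be checked directly. This is a small sketch with made-up vectors; it assumes the base R definition of %in% is match(x, table, nomatch = 0L) > 0L, which is how it is described in this thread.

```r
# Illustrative vectors (assumptions, not from the thread).
x <- c(3L, 7L, NA, 10L)
y <- c(10L, 3L, 5L)

via_in    <- x %in% y                      # readable form
via_match <- match(x, y, nomatch = 0L) > 0L  # inlined form

print(via_in)
print(identical(via_in, via_match))
```

Because both forms go through the same match() call, any difference between them would be limited to the overhead of one extra function call.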
Peter Haverty
2015-Jan-08 22:06 UTC
[Rd] setequal: better readability, reduced memory footprint, and minor speedup
How about calling unique() on both and comparing the lengths? It's less work, especially allocation.

Pete

____________________
Peter M. Haverty, Ph.D.
Genentech, Inc.
phaverty at gene.com
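One possible reading of this suggestion is sketched below. It is an assumption about what was meant, not code from the thread: equal counts of distinct values are necessary for set equality but not sufficient on their own (1:3 and 4:6 have the same count), so the sketch uses the length comparison as a cheap early exit and then checks inclusion in one direction only. The name setequal_unique is hypothetical.

```r
# Hypothetical helper; not base R's setequal().
setequal_unique <- function(x, y) {
  ux <- unique(as.vector(x))
  uy <- unique(as.vector(y))
  # Different numbers of distinct values: the sets cannot be equal.
  if (length(ux) != length(uy)) return(FALSE)
  # Same number of distinct values: if every element of ux is in uy,
  # then uy cannot contain anything else, so one direction suffices.
  all(ux %in% uy)
}

print(setequal_unique(c(1, 2, 2), c(2, 1)))
print(setequal_unique(1:3, 4:6))
print(setequal_unique(1:3, 1:4))
```

Whether this is actually cheaper than two match() calls depends on the inputs, since unique() also hashes both vectors; that would need to be measured.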
William Dunlap
2015-Jan-08 22:19 UTC
[Rd] setequal: better readability, reduced memory footprint, and minor speedup
> why is there no setcontains()?

Several packages define is.subset(), which I am assuming is what you are proposing, but with its arguments reversed. E.g., package:algstat has

    is.subset <- function(x, y) all(x %in% y)
    containsQ <- function(y, x) all(x %in% y)

and package:rje has essentially the same is.subset. package:arulesSequences and package:arules have an S4 generic called is.subset, which is entirely different (it is not a predicate, but returns a matrix).

Bill Dunlap
TIBCO Software
wdunlap tibco.com
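The argument order of the two predicates quoted above is easy to mix up, so here is a short usage sketch. The two definitions are copied from the message; the vectors and the helper setequal_via_subset are made up for illustration and are not part of any package mentioned here.

```r
# Definitions as quoted in the message (attributed there to package:algstat).
is.subset <- function(x, y) all(x %in% y)   # is every element of x in y?
containsQ <- function(y, x) all(x %in% y)   # does y contain every element of x?

# Illustrative inputs (assumptions).
small <- c(2L, 3L)
big   <- 1:5

print(is.subset(small, big))   # small is a subset of big
print(containsQ(big, small))   # same question, arguments reversed
print(is.subset(big, small))   # big is not a subset of small

# Hypothetical helper: set equality expressed as mutual inclusion.
setequal_via_subset <- function(x, y) is.subset(x, y) && is.subset(y, x)
print(setequal_via_subset(c(1, 2, 2), c(2, 1)))
```

Written this way, the && form of setequal proposed earlier in the thread is just is.subset applied in both directions.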
Hervé Pagès
2015-Jan-09 06:21 UTC
[Rd] setequal: better readability, reduced memory footprint, and minor speedup
On 01/08/2015 01:30 PM, peter dalgaard wrote:
> If you look at the definition of %in%, you'll find that it is implemented using match, so if we did as you suggest, I give it about three days before someone suggests to inline the function call...

But you wouldn't bet money on that, right? Because you know you would lose.

> Readability of source code is not usually our prime concern.

Don't sacrifice readability if you do not have a good reason for it. What's your reason here? Are you seriously suggesting that inlining makes a significant difference? As Michael pointed out, the expensive operation here is the hashing. But sadly some people like inlining and want to use it everywhere: it's easy and they feel good about it, even if it hurts readability and maintainability. If you use x %in% y instead of the inlined version, then the day someone changes the implementation of x %in% y for something faster, or fixes a bug in it, your code will automatically benefit. Right now it won't. More simply put: good readability generally leads to better code.

> The && idea does have some merit, though.
>
> Apropos, why is there no setcontains()?

Wait... shouldn't everybody use

    all(match(x, y, nomatch = 0L) > 0L)

?

H.