Peter Haverty
2015-Jan-08 22:06 UTC
[Rd] setequal: better readability, reduced memory footprint, and minor speedup
How about unique them both and compare the lengths? It's less work, especially allocation. Pete ____________________ Peter M. Haverty, Ph.D. Genentech, Inc. phaverty at gene.com On Thu, Jan 8, 2015 at 1:30 PM, peter dalgaard <pdalgd at gmail.com> wrote:> If you look at the definition of %in%, you'll find that it is implemented > using match, so if we did as you suggest, I give it about three days before > someone suggests to inline the function call... Readability of source code > is not usually our prime concern. > > The && idea does have some merit, though. > > Apropos, why is there no setcontains()? > > -pd > > > On 06 Jan 2015, at 22:02 , Herv? Pag?s <hpages at fredhutch.org> wrote: > > > > Hi, > > > > Current implementation: > > > > setequal <- function (x, y) > > { > > x <- as.vector(x) > > y <- as.vector(y) > > all(c(match(x, y, 0L) > 0L, match(y, x, 0L) > 0L)) > > } > > > > First what about replacing 'match(x, y, 0L) > 0L' and 'match(y, x, 0L) > > 0L' > > with 'x %in% y' and 'y %in% x', respectively. They're strictly > > equivalent but the latter form is a lot more readable than the former > > (isn't this the "raison d'?tre" of %in%?): > > > > setequal <- function (x, y) > > { > > x <- as.vector(x) > > y <- as.vector(y) > > all(c(x %in% y, y %in% x)) > > } > > > > Furthermore, replacing 'all(c(x %in% y, y %in x))' with > > 'all(x %in% y) && all(y %in% x)' improves readability even more and, > > more importantly, reduces memory footprint significantly on big vectors > > (e.g. by 15% on integer vectors with 15M elements): > > > > setequal <- function (x, y) > > { > > x <- as.vector(x) > > y <- as.vector(y) > > all(x %in% y) && all(y %in% x) > > } > > > > It also seems to speed up things a little bit (not in a significant > > way though). > > > > Cheers, > > H. > > > > -- > > Herv? Pag?s > > > > Program in Computational Biology > > Division of Public Health Sciences > > Fred Hutchinson Cancer Research Center > > 1100 Fairview Ave. N, M1-B514 > > P.O. Box 19024 > > Seattle, WA 98109-1024 > > > > E-mail: hpages at fredhutch.org > > Phone: (206) 667-5791 > > Fax: (206) 667-1319 > > > > ______________________________________________ > > R-devel at r-project.org mailing list > > https://stat.ethz.ch/mailman/listinfo/r-devel > > -- > Peter Dalgaard, Professor, > Center for Statistics, Copenhagen Business School > Solbjerg Plads 3, 2000 Frederiksberg, Denmark > Phone: (+45)38153501 > Email: pd.mes at cbs.dk Priv: PDalgd at gmail.com > > ______________________________________________ > R-devel at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-devel >[[alternative HTML version deleted]]
Michael Lawrence
2015-Jan-08 23:38 UTC
[Rd] setequal: better readability, reduced memory footprint, and minor speedup
Currently unique() does duplicated() internally and then extracts. One could make a countUnique that simply counts, rather than allocate the logical return value of duplicated(). But so much of the cost is in the hash operation that it probably won't help much, but that might depend on the sizes of things. The more unique elements, the better it would perform. On Thu, Jan 8, 2015 at 2:06 PM, Peter Haverty <haverty.peter at gene.com> wrote:> How about unique them both and compare the lengths? It's less work, > especially allocation. > > > > Pete > > ____________________ > Peter M. Haverty, Ph.D. > Genentech, Inc. > phaverty at gene.com > > On Thu, Jan 8, 2015 at 1:30 PM, peter dalgaard <pdalgd at gmail.com> wrote: > > > If you look at the definition of %in%, you'll find that it is implemented > > using match, so if we did as you suggest, I give it about three days > before > > someone suggests to inline the function call... Readability of source > code > > is not usually our prime concern. > > > > The && idea does have some merit, though. > > > > Apropos, why is there no setcontains()? > > > > -pd > > > > > On 06 Jan 2015, at 22:02 , Herv? Pag?s <hpages at fredhutch.org> wrote: > > > > > > Hi, > > > > > > Current implementation: > > > > > > setequal <- function (x, y) > > > { > > > x <- as.vector(x) > > > y <- as.vector(y) > > > all(c(match(x, y, 0L) > 0L, match(y, x, 0L) > 0L)) > > > } > > > > > > First what about replacing 'match(x, y, 0L) > 0L' and 'match(y, x, 0L) > > > > 0L' > > > with 'x %in% y' and 'y %in% x', respectively. They're strictly > > > equivalent but the latter form is a lot more readable than the former > > > (isn't this the "raison d'?tre" of %in%?): > > > > > > setequal <- function (x, y) > > > { > > > x <- as.vector(x) > > > y <- as.vector(y) > > > all(c(x %in% y, y %in% x)) > > > } > > > > > > Furthermore, replacing 'all(c(x %in% y, y %in x))' with > > > 'all(x %in% y) && all(y %in% x)' improves readability even more and, > > > more importantly, reduces memory footprint significantly on big vectors > > > (e.g. by 15% on integer vectors with 15M elements): > > > > > > setequal <- function (x, y) > > > { > > > x <- as.vector(x) > > > y <- as.vector(y) > > > all(x %in% y) && all(y %in% x) > > > } > > > > > > It also seems to speed up things a little bit (not in a significant > > > way though). > > > > > > Cheers, > > > H. > > > > > > -- > > > Herv? Pag?s > > > > > > Program in Computational Biology > > > Division of Public Health Sciences > > > Fred Hutchinson Cancer Research Center > > > 1100 Fairview Ave. N, M1-B514 > > > P.O. Box 19024 > > > Seattle, WA 98109-1024 > > > > > > E-mail: hpages at fredhutch.org > > > Phone: (206) 667-5791 > > > Fax: (206) 667-1319 > > > > > > ______________________________________________ > > > R-devel at r-project.org mailing list > > > https://stat.ethz.ch/mailman/listinfo/r-devel > > > > -- > > Peter Dalgaard, Professor, > > Center for Statistics, Copenhagen Business School > > Solbjerg Plads 3, 2000 Frederiksberg, Denmark > > Phone: (+45)38153501 > > Email: pd.mes at cbs.dk Priv: PDalgd at gmail.com > > > > ______________________________________________ > > R-devel at r-project.org mailing list > > https://stat.ethz.ch/mailman/listinfo/r-devel > > > > [[alternative HTML version deleted]] > > > ______________________________________________ > R-devel at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-devel > >[[alternative HTML version deleted]]
Peter Haverty
2015-Jan-08 23:50 UTC
[Rd] setequal: better readability, reduced memory footprint, and minor speedup
I was thinking something like: setequal <- function(x,y) { xu = unique(x) yu = unique(y) if (length(xu) != length(yu)) { return FALSE; } return (all( match( xu, yu, 0L ) > 0L ) ) } This lets you fail early for cheap (skipping the allocation from the ">0L"s). Whether or not this goes fast depends a lot on the uniqueness of x and y and whether or not you want to optimize for the TRUE or FALSE case. You'd do much better to make some real hashes in C and compare the keys, but it's probably not worth the complexity. Pete ____________________ Peter M. Haverty, Ph.D. Genentech, Inc. phaverty at gene.com On Thu, Jan 8, 2015 at 2:06 PM, Peter Haverty <phaverty at gene.com> wrote:> How about unique them both and compare the lengths? It's less work, > especially allocation. > > > > Pete > > ____________________ > Peter M. Haverty, Ph.D. > Genentech, Inc. > phaverty at gene.com > > On Thu, Jan 8, 2015 at 1:30 PM, peter dalgaard <pdalgd at gmail.com> wrote: > >> If you look at the definition of %in%, you'll find that it is implemented >> using match, so if we did as you suggest, I give it about three days before >> someone suggests to inline the function call... Readability of source code >> is not usually our prime concern. >> >> The && idea does have some merit, though. >> >> Apropos, why is there no setcontains()? >> >> -pd >> >> > On 06 Jan 2015, at 22:02 , Herv? Pag?s <hpages at fredhutch.org> wrote: >> > >> > Hi, >> > >> > Current implementation: >> > >> > setequal <- function (x, y) >> > { >> > x <- as.vector(x) >> > y <- as.vector(y) >> > all(c(match(x, y, 0L) > 0L, match(y, x, 0L) > 0L)) >> > } >> > >> > First what about replacing 'match(x, y, 0L) > 0L' and 'match(y, x, 0L) >> > 0L' >> > with 'x %in% y' and 'y %in% x', respectively. They're strictly >> > equivalent but the latter form is a lot more readable than the former >> > (isn't this the "raison d'?tre" of %in%?): >> > >> > setequal <- function (x, y) >> > { >> > x <- as.vector(x) >> > y <- as.vector(y) >> > all(c(x %in% y, y %in% x)) >> > } >> > >> > Furthermore, replacing 'all(c(x %in% y, y %in x))' with >> > 'all(x %in% y) && all(y %in% x)' improves readability even more and, >> > more importantly, reduces memory footprint significantly on big vectors >> > (e.g. by 15% on integer vectors with 15M elements): >> > >> > setequal <- function (x, y) >> > { >> > x <- as.vector(x) >> > y <- as.vector(y) >> > all(x %in% y) && all(y %in% x) >> > } >> > >> > It also seems to speed up things a little bit (not in a significant >> > way though). >> > >> > Cheers, >> > H. >> > >> > -- >> > Herv? Pag?s >> > >> > Program in Computational Biology >> > Division of Public Health Sciences >> > Fred Hutchinson Cancer Research Center >> > 1100 Fairview Ave. N, M1-B514 >> > P.O. Box 19024 >> > Seattle, WA 98109-1024 >> > >> > E-mail: hpages at fredhutch.org >> > Phone: (206) 667-5791 >> > Fax: (206) 667-1319 >> > >> > ______________________________________________ >> > R-devel at r-project.org mailing list >> > https://stat.ethz.ch/mailman/listinfo/r-devel >> >> -- >> Peter Dalgaard, Professor, >> Center for Statistics, Copenhagen Business School >> Solbjerg Plads 3, 2000 Frederiksberg, Denmark >> Phone: (+45)38153501 >> Email: pd.mes at cbs.dk Priv: PDalgd at gmail.com >> >> ______________________________________________ >> R-devel at r-project.org mailing list >> https://stat.ethz.ch/mailman/listinfo/r-devel >> > >[[alternative HTML version deleted]]
Peter Haverty
2015-Jan-09 00:57 UTC
[Rd] setequal: better readability, reduced memory footprint, and minor speedup
Try this out. It looks like a 2X speedup for some cases and a wash in others. "unique" does two allocations, but skipping the "> 0L" allocation could make up for it. library(microbenchmark) library(RUnit) x = sample.int(1e4, 1e5, TRUE) y = sample.int(1e4, 1e5, TRUE) set_equal <- function(x, y) { xu = .Internal(unique(x, FALSE, FALSE, NA)) yu = .Internal(unique(y, FALSE, FALSE, NA)) if (length(xu) != length(yu)) { return(FALSE); } return( all(match(xu, yu, 0L) > 0L) ) } set_equal2 <- function(x, y) { xu = .Internal(unique(x, FALSE, FALSE, NA)) yu = .Internal(unique(y, FALSE, FALSE, NA)) if (length(xu) != length(yu)) { return(FALSE); } return( !anyNA(match(xu, yu)) ) } microbenchmark( a = setequal(x, y), b = set_equal(x, y), c = set_equal2(x, y) ) checkIdentical(setequal(x, y), set_equal(x, y)) checkIdentical(setequal(x, y), set_equal2(x, y)) x = y microbenchmark( a = setequal(x, y), b = set_equal(x, y), c = set_equal2(x, y) ) checkIdentical(setequal(x, y), set_equal(x, y)) checkIdentical(setequal(x, y), set_equal2(x, y)) Sorry, I'm probably over-posting today. Regards, [[alternative HTML version deleted]]
Reasonably Related Threads
- setequal: better readability, reduced memory footprint, and minor speedup
- setequal: better readability, reduced memory footprint, and minor speedup
- setequal: better readability, reduced memory footprint, and minor speedup
- reducing redundant work in methods package
- reducing redundant work in methods package