thr3ads.net - R devel - [Rd] setequal: better readability, reduced memory footprint, and minor speedup [Jan 2015]

If this information is useful, please help other people find it:
Share via:

Peter Haverty

2015-Jan-08 22:06 UTC

[Rd] setequal: better readability, reduced memory footprint, and minor speedup

How about unique them both and compare the lengths?  It's less work,
especially allocation.



Pete

____________________
Peter M. Haverty, Ph.D.
Genentech, Inc.
phaverty at gene.com

On Thu, Jan 8, 2015 at 1:30 PM, peter dalgaard <pdalgd at gmail.com>
wrote:
> If you look at the definition of %in%, you'll find that it is
implemented
> using match, so if we did as you suggest, I give it about three days before
> someone suggests to inline the function call... Readability of source code
> is not usually our prime concern.
>
> The && idea does have some merit, though.
>
> Apropos, why is there no setcontains()?
>
> -pd
>
> > On 06 Jan 2015, at 22:02 , Herv? Pag?s <hpages at fredhutch.org>
wrote:
> >
> > Hi,
> >
> > Current implementation:
> >
> > setequal <- function (x, y)
> > {
> >  x <- as.vector(x)
> >  y <- as.vector(y)
> >  all(c(match(x, y, 0L) > 0L, match(y, x, 0L) > 0L))
> > }
> >
> > First what about replacing 'match(x, y, 0L) > 0L' and
'match(y, x, 0L) >
> 0L'
> > with 'x %in% y' and 'y %in% x', respectively.
They're strictly
> > equivalent but the latter form is a lot more readable than the former
> > (isn't this the "raison d'?tre" of %in%?):
> >
> > setequal <- function (x, y)
> > {
> >  x <- as.vector(x)
> >  y <- as.vector(y)
> >  all(c(x %in% y, y %in% x))
> > }
> >
> > Furthermore, replacing 'all(c(x %in% y, y %in x))' with
> > 'all(x %in% y) && all(y %in% x)' improves readability
even more and,
> > more importantly, reduces memory footprint significantly on big
vectors
> > (e.g. by 15% on integer vectors with 15M elements):
> >
> > setequal <- function (x, y)
> > {
> >  x <- as.vector(x)
> >  y <- as.vector(y)
> >  all(x %in% y) && all(y %in% x)
> > }
> >
> > It also seems to speed up things a little bit (not in a significant
> > way though).
> >
> > Cheers,
> > H.
> >
> > --
> > Herv? Pag?s
> >
> > Program in Computational Biology
> > Division of Public Health Sciences
> > Fred Hutchinson Cancer Research Center
> > 1100 Fairview Ave. N, M1-B514
> > P.O. Box 19024
> > Seattle, WA 98109-1024
> >
> > E-mail: hpages at fredhutch.org
> > Phone:  (206) 667-5791
> > Fax:    (206) 667-1319
> >
> > ______________________________________________
> > R-devel at r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-devel
>
> --
> Peter Dalgaard, Professor,
> Center for Statistics, Copenhagen Business School
> Solbjerg Plads 3, 2000 Frederiksberg, Denmark
> Phone: (+45)38153501
> Email: pd.mes at cbs.dk  Priv: PDalgd at gmail.com
>
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>
	[[alternative HTML version deleted]]

Michael Lawrence

2015-Jan-08 23:38 UTC

head link

[Rd] setequal: better readability, reduced memory footprint, and minor speedup

Currently unique() does duplicated() internally and then extracts. One
could make a countUnique that simply counts, rather than allocate the
logical return value of duplicated(). But so much of the cost is in the
hash operation that it probably won't help much, but that might depend on
the sizes of things. The more unique elements, the better it would perform.


On Thu, Jan 8, 2015 at 2:06 PM, Peter Haverty <haverty.peter at gene.com>
wrote:
> How about unique them both and compare the lengths?  It's less work,
> especially allocation.
>
>
>
> Pete
>
> ____________________
> Peter M. Haverty, Ph.D.
> Genentech, Inc.
> phaverty at gene.com
>
> On Thu, Jan 8, 2015 at 1:30 PM, peter dalgaard <pdalgd at gmail.com>
wrote:
>
> > If you look at the definition of %in%, you'll find that it is
implemented
> > using match, so if we did as you suggest, I give it about three days
> before
> > someone suggests to inline the function call... Readability of source
> code
> > is not usually our prime concern.
> >
> > The && idea does have some merit, though.
> >
> > Apropos, why is there no setcontains()?
> >
> > -pd
> >
> > > On 06 Jan 2015, at 22:02 , Herv? Pag?s <hpages at
fredhutch.org> wrote:
> > >
> > > Hi,
> > >
> > > Current implementation:
> > >
> > > setequal <- function (x, y)
> > > {
> > >  x <- as.vector(x)
> > >  y <- as.vector(y)
> > >  all(c(match(x, y, 0L) > 0L, match(y, x, 0L) > 0L))
> > > }
> > >
> > > First what about replacing 'match(x, y, 0L) > 0L' and
'match(y, x, 0L)
> >
> > 0L'
> > > with 'x %in% y' and 'y %in% x', respectively.
They're strictly
> > > equivalent but the latter form is a lot more readable than the
former
> > > (isn't this the "raison d'?tre" of %in%?):
> > >
> > > setequal <- function (x, y)
> > > {
> > >  x <- as.vector(x)
> > >  y <- as.vector(y)
> > >  all(c(x %in% y, y %in% x))
> > > }
> > >
> > > Furthermore, replacing 'all(c(x %in% y, y %in x))' with
> > > 'all(x %in% y) && all(y %in% x)' improves
readability even more and,
> > > more importantly, reduces memory footprint significantly on big
vectors
> > > (e.g. by 15% on integer vectors with 15M elements):
> > >
> > > setequal <- function (x, y)
> > > {
> > >  x <- as.vector(x)
> > >  y <- as.vector(y)
> > >  all(x %in% y) && all(y %in% x)
> > > }
> > >
> > > It also seems to speed up things a little bit (not in a
significant
> > > way though).
> > >
> > > Cheers,
> > > H.
> > >
> > > --
> > > Herv? Pag?s
> > >
> > > Program in Computational Biology
> > > Division of Public Health Sciences
> > > Fred Hutchinson Cancer Research Center
> > > 1100 Fairview Ave. N, M1-B514
> > > P.O. Box 19024
> > > Seattle, WA 98109-1024
> > >
> > > E-mail: hpages at fredhutch.org
> > > Phone:  (206) 667-5791
> > > Fax:    (206) 667-1319
> > >
> > > ______________________________________________
> > > R-devel at r-project.org mailing list
> > > https://stat.ethz.ch/mailman/listinfo/r-devel
> >
> > --
> > Peter Dalgaard, Professor,
> > Center for Statistics, Copenhagen Business School
> > Solbjerg Plads 3, 2000 Frederiksberg, Denmark
> > Phone: (+45)38153501
> > Email: pd.mes at cbs.dk  Priv: PDalgd at gmail.com
> >
> > ______________________________________________
> > R-devel at r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-devel
> >
>
>         [[alternative HTML version deleted]]
>
>
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>
>
	[[alternative HTML version deleted]]

Peter Haverty

2015-Jan-08 23:50 UTC

head link

[Rd] setequal: better readability, reduced memory footprint, and minor speedup

I was thinking something like:

setequal <- function(x,y) {
xu = unique(x)
yu = unique(y)
if (length(xu) != length(yu)) { return FALSE; }
return (all( match( xu, yu, 0L ) > 0L ) )
}

This lets you fail early for cheap (skipping the allocation from the
">0L"s).  Whether or not this goes fast depends a lot on the
uniqueness of
x and y and whether or not you want to optimize for the TRUE or FALSE case.
You'd do much better to make some real hashes in C and compare the keys,
but it's probably not worth the complexity.




Pete

____________________
Peter M. Haverty, Ph.D.
Genentech, Inc.
phaverty at gene.com

On Thu, Jan 8, 2015 at 2:06 PM, Peter Haverty <phaverty at gene.com>
wrote:
> How about unique them both and compare the lengths?  It's less work,
> especially allocation.
>
>
>
> Pete
>
> ____________________
> Peter M. Haverty, Ph.D.
> Genentech, Inc.
> phaverty at gene.com
>
> On Thu, Jan 8, 2015 at 1:30 PM, peter dalgaard <pdalgd at gmail.com>
wrote:
>
>> If you look at the definition of %in%, you'll find that it is
implemented
>> using match, so if we did as you suggest, I give it about three days
before
>> someone suggests to inline the function call... Readability of source
code
>> is not usually our prime concern.
>>
>> The && idea does have some merit, though.
>>
>> Apropos, why is there no setcontains()?
>>
>> -pd
>>
>> > On 06 Jan 2015, at 22:02 , Herv? Pag?s <hpages at
fredhutch.org> wrote:
>> >
>> > Hi,
>> >
>> > Current implementation:
>> >
>> > setequal <- function (x, y)
>> > {
>> >  x <- as.vector(x)
>> >  y <- as.vector(y)
>> >  all(c(match(x, y, 0L) > 0L, match(y, x, 0L) > 0L))
>> > }
>> >
>> > First what about replacing 'match(x, y, 0L) > 0L' and
'match(y, x, 0L)
>> > 0L'
>> > with 'x %in% y' and 'y %in% x', respectively.
They're strictly
>> > equivalent but the latter form is a lot more readable than the
former
>> > (isn't this the "raison d'?tre" of %in%?):
>> >
>> > setequal <- function (x, y)
>> > {
>> >  x <- as.vector(x)
>> >  y <- as.vector(y)
>> >  all(c(x %in% y, y %in% x))
>> > }
>> >
>> > Furthermore, replacing 'all(c(x %in% y, y %in x))' with
>> > 'all(x %in% y) && all(y %in% x)' improves
readability even more and,
>> > more importantly, reduces memory footprint significantly on big
vectors
>> > (e.g. by 15% on integer vectors with 15M elements):
>> >
>> > setequal <- function (x, y)
>> > {
>> >  x <- as.vector(x)
>> >  y <- as.vector(y)
>> >  all(x %in% y) && all(y %in% x)
>> > }
>> >
>> > It also seems to speed up things a little bit (not in a
significant
>> > way though).
>> >
>> > Cheers,
>> > H.
>> >
>> > --
>> > Herv? Pag?s
>> >
>> > Program in Computational Biology
>> > Division of Public Health Sciences
>> > Fred Hutchinson Cancer Research Center
>> > 1100 Fairview Ave. N, M1-B514
>> > P.O. Box 19024
>> > Seattle, WA 98109-1024
>> >
>> > E-mail: hpages at fredhutch.org
>> > Phone:  (206) 667-5791
>> > Fax:    (206) 667-1319
>> >
>> > ______________________________________________
>> > R-devel at r-project.org mailing list
>> > https://stat.ethz.ch/mailman/listinfo/r-devel
>>
>> --
>> Peter Dalgaard, Professor,
>> Center for Statistics, Copenhagen Business School
>> Solbjerg Plads 3, 2000 Frederiksberg, Denmark
>> Phone: (+45)38153501
>> Email: pd.mes at cbs.dk  Priv: PDalgd at gmail.com
>>
>> ______________________________________________
>> R-devel at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>
>
>
	[[alternative HTML version deleted]]

Peter Haverty

2015-Jan-09 00:57 UTC

head link

[Rd] setequal: better readability, reduced memory footprint, and minor speedup

Try this out. It looks like a 2X speedup for some cases and a wash in
others.  "unique" does two allocations, but skipping the ">
0L" allocation
could make up for it.

library(microbenchmark)
library(RUnit)

x = sample.int(1e4, 1e5, TRUE)
y = sample.int(1e4, 1e5, TRUE)

set_equal <- function(x, y) {
    xu = .Internal(unique(x, FALSE, FALSE, NA))
    yu = .Internal(unique(y, FALSE, FALSE, NA))
    if (length(xu) != length(yu)) {
        return(FALSE);
    }
    return( all(match(xu, yu, 0L) > 0L) )
}

set_equal2 <- function(x, y) {
    xu = .Internal(unique(x, FALSE, FALSE, NA))
    yu = .Internal(unique(y, FALSE, FALSE, NA))
    if (length(xu) != length(yu)) {
        return(FALSE);
    }
    return( !anyNA(match(xu, yu)) )
}

microbenchmark(
    a = setequal(x, y),
    b = set_equal(x, y),
    c = set_equal2(x, y)
    )
checkIdentical(setequal(x, y), set_equal(x, y))
checkIdentical(setequal(x, y), set_equal2(x, y))

x = y
microbenchmark(
    a = setequal(x, y),
    b = set_equal(x, y),
    c = set_equal2(x, y)
    )
checkIdentical(setequal(x, y), set_equal(x, y))
checkIdentical(setequal(x, y), set_equal2(x, y))


Sorry, I'm probably over-posting today.

Regards,

	[[alternative HTML version deleted]]

Seemingly Similar Threads

Search for more apparently analagous threads

R devel - Jan 2015 - setequal: better readability, reduced memory footprint, and minor speedup

[Rd] setequal: better readability, reduced memory footprint, and minor speedup

[Rd] setequal: better readability, reduced memory footprint, and minor speedup

[Rd] setequal: better readability, reduced memory footprint, and minor speedup

[Rd] setequal: better readability, reduced memory footprint, and minor speedup

Seemingly Similar Threads