thr3ads.net - R devel - [Rd] setequal: better readability, reduced memory footprint, and minor speedup [Jan 2015]

Hervé Pagès

2015-Jan-06 21:02 UTC

[Rd] setequal: better readability, reduced memory footprint, and minor speedup

Hi,

Current implementation:

   setequal <- function (x, y)
   {
     x <- as.vector(x)
     y <- as.vector(y)
     all(c(match(x, y, 0L) > 0L, match(y, x, 0L) > 0L))
   }

First what about replacing 'match(x, y, 0L) > 0L' and 'match(y, x, 0L) > 0L'
with 'x %in% y' and 'y %in% x', respectively. They're strictly
equivalent but the latter form is a lot more readable than the former
(isn't this the "raison d'être" of %in%?):

   setequal <- function (x, y)
   {
     x <- as.vector(x)
     y <- as.vector(y)
     all(c(x %in% y, y %in% x))
   }
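
One way to check the claimed equivalence is to compare both forms on a few inputs, including NA values and a zero-length vector. This is only a sketch: the helper name in_via_match is made up for illustration and is not part of base R.

```r
# Hypothetical helper: the explicit match() form used in the current setequal()
in_via_match <- function(x, table) match(x, table, 0L) > 0L

x <- c(1L, 2L, NA, 5L)
y <- c(5L, NA, 7L)

# Both forms should return the same logical vector, in either direction
same_xy <- identical(in_via_match(x, y), x %in% y)
same_yx <- identical(in_via_match(y, x), y %in% x)

# Zero-length input yields logical(0) in both forms
same_empty <- identical(in_via_match(integer(0), y), integer(0) %in% y)
```

If any of the three flags came out FALSE, the two spellings would not be interchangeable inside setequal().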

Furthermore, replacing 'all(c(x %in% y, y %in% x))' with
'all(x %in% y) && all(y %in% x)' improves readability even more and,
more importantly, reduces memory footprint significantly on big vectors
(e.g. by 15% on integer vectors with 15M elements):

   setequal <- function (x, y)
   {
     x <- as.vector(x)
     y <- as.vector(y)
     all(x %in% y) && all(y %in% x)
   }

It also seems to speed up things a little bit (not in a significant
way though).
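
The saving plausibly comes from no longer building the concatenated logical vector before calling all(). A rough sketch that compares the two variants and looks at the size of that intermediate is given below; the names setequal_c and setequal_and are hypothetical, and the reported size will vary by platform and R version.

```r
# Hypothetical names for the two variants discussed above
setequal_c <- function(x, y) {
  x <- as.vector(x)
  y <- as.vector(y)
  all(c(x %in% y, y %in% x))      # builds one logical vector of length(x) + length(y)
}

setequal_and <- function(x, y) {
  x <- as.vector(x)
  y <- as.vector(y)
  all(x %in% y) && all(y %in% x)  # second test is skipped when the first is FALSE
}

x <- 1:1000
y <- rev(x)
z <- c(x, 1001L)

# Both variants should agree on equal and on unequal sets
res_equal   <- c(setequal_c(x, y), setequal_and(x, y))
res_unequal <- c(setequal_c(x, z), setequal_and(x, z))

# Size of the extra intermediate that only the c() variant allocates
extra_bytes <- object.size(c(x %in% y, y %in% x))
```

For a real measurement on 15M-element vectors one would use a memory profiler rather than object.size(), which only reports the size of the one object passed to it.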

Cheers,
H.

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fredhutch.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319

peter dalgaard

2015-Jan-08 21:30 UTC

If you look at the definition of %in%, you'll find that it is implemented
using match, so if we did as you suggest, I give it about three days before
someone suggests to inline the function call... Readability of source code is
not usually our prime concern.

The && idea does have some merit, though. 

Apropos, why is there no setcontains()?

-pd
-- 
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Email: pd.mes at cbs.dk  Priv: PDalgd at gmail.com

Peter Haverty

2015-Jan-08 22:06 UTC

How about calling unique() on both and comparing the lengths?  It's less
work, especially allocation.
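
Read literally, equal counts of distinct values are necessary but not sufficient for set equality (1:3 and 4:6 have the same count), so the sketch below pairs the length check with a single containment test. This is one interpretation of the suggestion, not code from the thread, and the name setequal_unique is hypothetical.

```r
# Hypothetical variant: deduplicate first, then compare
setequal_unique <- function(x, y) {
  ux <- unique(as.vector(x))
  uy <- unique(as.vector(y))
  # Same number of distinct values, and every distinct x value occurs in y.
  # Together these imply that every distinct y value also occurs in x.
  length(ux) == length(uy) && all(ux %in% uy)
}

setequal_unique(c(1, 2, 2, 3), c(3, 1, 2))   # same set
setequal_unique(1:3, 4:6)                    # same length, different sets
setequal_unique(1:3, 1:4)                    # different lengths
```

Whether this is actually cheaper than two %in% passes would need to be measured, since unique() does its own hashing and allocation.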



Pete

____________________
Peter M. Haverty, Ph.D.
Genentech, Inc.
phaverty at gene.com


William Dunlap

2015-Jan-08 22:19 UTC

> why is there no setcontains()?
Several packages define is.subset(), which I am assuming is what you are
proposing, but with its arguments reversed.  E.g., package:algstat has
   is.subset <- function(x, y) all(x %in% y)
   containsQ <- function(y, x) all(x %in% y)
and package:rje has essentially the same is.subset.

package:arulesSequences and package:arules have an S4 generic called
is.subset, which is entirely different (it is not a predicate, but returns
a matrix).
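
The difference in argument order is easy to trip over. The sketch below copies the two one-line definitions quoted above into the session rather than loading algstat; the example vectors small and big are made up for illustration.

```r
# Local copies of the definitions quoted above (not loaded from algstat)
is.subset <- function(x, y) all(x %in% y)
containsQ <- function(y, x) all(x %in% y)

small <- c("a", "b")
big   <- c("a", "b", "c")

is.subset(small, big)   # is 'small' a subset of 'big'?
containsQ(big, small)   # does 'big' contain 'small'? Same question, arguments swapped
is.subset(big, small)   # the reverse question
```

A setcontains(x, y) along these lines would presumably follow the containsQ order, with the containing set first.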


Bill Dunlap
TIBCO Software
wdunlap tibco.com


Hervé Pagès

2015-Jan-09 06:21 UTC

On 01/08/2015 01:30 PM, peter dalgaard wrote:
> If you look at the definition of %in%, you'll find that it is implemented
> using match, so if we did as you suggest, I give it about three days
> before someone suggests to inline the function call...
But you wouldn't bet money on that, right? Because you know you would
lose.
> Readability of source code is not usually our prime concern.
Don't sacrifice readability if you do not have a good reason for it.
What's your reason here? Are you seriously suggesting that inlining
makes a significant difference? As Michael pointed out, the expensive
operation here is the hashing. But sadly some people like inlining and
want to use it everywhere: it's easy and they feel good about it, even
if it hurts readability and maintainability (if you use x %in% y
instead of the inlined version, the day someone changes the
implementation of x %in% y for something faster, or fixes a bug
in it, your code will automatically benefit, right now it won't).

More simply put: good readability generally leads to better code.
>
> The && idea does have some merit, though.
>
> Apropos, why is there no setcontains()?
Wait... shouldn't everybody use all(match(x, y, nomatch = 0L) > 0L) ?

H.
