thr3ads.net - R devel - [Rd] NAs and rle [Aug 2020]


Gabriel Becker

2020-Aug-26 05:57 UTC

[Rd] NAs and rle

Hi All,

A Twitter user, Mike fc (@coolbutuseless), mentioned today that he was
surprised that repeated NAs weren't treated as a run by the rle function.

Now I know why they are not. NAs represent values which could be the same
or different from each other if they were known, so from a purely conceptual
standpoint there is no way to tell whether they are the same and thus
constitute a run or not.

This conceptual strictness isn't universally observed, though, because we
get the following:
> unique(c(1, 2, 3, NA, NA, NA))
[1]  1  2  3 NA


Which means that rle(sort(x))$values is not guaranteed to be the same as
unique(x), which is a little strange (though likely of little practical
impact).
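
The mismatch can be reproduced in a few lines of base R. This is only an illustrative sketch: the variable names are made up here, and na.last = TRUE is passed because sort() drops NAs by default.

```r
x <- c(1, 2, 3, NA, NA, NA)

# Base R's rle() treats every NA as its own run of length 1.
r <- rle(x)
r$lengths
r$values

# unique() collapses the three NAs into a single NA.
u <- unique(x)

# sort() drops NAs unless na.last is set, so keep them explicitly.
s <- rle(sort(x, na.last = TRUE))$values

length(s)  # one entry per NA, plus the three known values
length(u)  # a single NA, plus the three known values
```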


Personally, to me it also seems that, from a purely data-compression
standpoint, it would be valid to collapse those missing values into a run
of missing, as it reduces size in-memory/on disk without losing any
information.

Now none of this is to say that I suggest the default behavior be changed
(that would surely disrupt some non-trivial amount of existing code), but
what do people think of a group.nas argument, defaulting to FALSE, that
controls the behavior?
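
To make the idea concrete, here is a rough sketch of what such an option could look like as a standalone wrapper. The function name rle_group_nas is hypothetical and nothing like it exists in base R; keeping NA and NaN as separate runs is an assumption of this sketch, not something the proposal settles.

```r
# Hypothetical helper, not part of base R: rle() with an opt-in switch
# that collapses adjacent missing values into a single run.
rle_group_nas <- function(x, group.nas = FALSE) {
  n <- length(x)
  # Default path: defer entirely to base rle().
  if (!group.nas || n == 0L) return(rle(x))
  stopifnot(is.atomic(x))

  a <- x[-1L]  # each element from the second onward
  b <- x[-n]   # its left-hand neighbour

  # Assumption: NA next to NA (or NaN next to NaN) continues a run,
  # while NA next to NaN starts a new one.
  both_missing <- is.na(a) & is.na(b) & (is.nan(a) == is.nan(b))
  differs <- !both_missing & (is.na(a) | is.na(b) | a != b)

  i <- c(which(differs), n)  # index of the last element of each run
  structure(list(lengths = diff(c(0L, i)), values = x[i]),
            class = "rle")
}

x <- c(11, 11, NA, NA, NA, NaN, 14, 14, 14, 14)
grouped <- rle_group_nas(x, group.nas = TRUE)
grouped$lengths
grouped$values
```

Because the result keeps class "rle", inverse.rle(grouped) should rebuild the original vector, which is the sense in which grouping loses no information.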

As a final point, there is some precedent here (though obviously not at all
binding), as Bioconductor's Rle functionality does group NAs.

Best,
~G


William Dunlap

2020-Aug-26 14:24 UTC


[Rd] NAs and rle

Splus's rle() also grouped NA's (separately from NaN's):

% Splus
TIBCO Software Inc. Confidential Information
Copyright (c) 1988-2008 TIBCO Software Inc. ALL RIGHTS RESERVED.
TIBCO Spotfire S+ Version 8.1.1 for Linux 2.6.9-34.EL, 32-bit : 2008
> dput(rle(c(11,11,NA,NA,NA,NaN,14,14,14,14)))
list("lengths" = c(2, 3, 1, 4)
, "values" = c(11., NA, NaN, 14.)
)

Bill Dunlap
TIBCO Software
wdunlap tibco.com

