thr3ads.net - R devel - [Rd] [External] Re: 1954 from NA [May 2021]

Duncan Murdoch

2021-May-26 15:43 UTC

[Rd] [External] Re: 1954 from NA

On 26/05/2021 10:22 a.m., Adrian Dușa wrote:
> Dear Duncan,
> 
> On Wed, May 26, 2021 at 2:27 AM Duncan Murdoch <murdoch.duncan at gmail.com>
> wrote:
> 
>     You've already been told how to solve this: just add attributes to the
>     objects. Use the standard NA to indicate that there is some kind of
>     missingness, and the attribute to describe exactly what it is. Stick a
>     class on those objects and define methods so that subsetting and
>     arithmetic preserves the extra info you've added. If you do some
>     operation that turns those NAs into NaNs, big deal: the attribute will
>     still be there, and is.na(NaN) still returns TRUE.
> 
> 
> I've already tried the attributes way, it is not so easy.

If you have specific operations that are needed but that you can't get 
to work, post the issue here.

> In the best case scenario, it unnecessarily triples the size of the 
> data, but perhaps this is the only way forward.

I don't see how it could triple the size.  Surely an integer has enough 
values to cover all possible kinds of missingness.  So on integer or 
factor data you'd double the size, on real or character data you'd 
increase it by 50%.  (This is assuming you're on a 64 bit platform with 
32 bit integers and 64 bit reals and pointers.)
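
As a rough check of that arithmetic, here is a minimal sketch using object.size() from the utils package. The vector length n and the helper size() are assumptions made up for the illustration. It also shows one caveat: the implementation in this message builds the kinds with ifelse(is.na(x), 1, 0), which yields doubles, so the integer figures only hold if the codes are kept as integers (for example with 1L and 0L).

```r
n <- 1e6                  # arbitrary length, for illustration only
x_int <- integer(n)       # 32-bit integers: 4 bytes per element
x_dbl <- numeric(n)       # 64-bit reals: 8 bytes per element
kind  <- integer(n)       # one 32-bit kind code per element

# Hypothetical helper: size of an object in bytes, as a plain number
size <- function(obj) as.numeric(utils::object.size(obj))

ratio_int <- (size(x_int) + size(kind)) / size(x_int)  # close to 2
ratio_dbl <- (size(x_dbl) + size(kind)) / size(x_dbl)  # close to 1.5

# ifelse() with 1 and 0 gives doubles; 1L and 0L keep the codes integer
codes_dbl <- ifelse(is.na(x_int), 1, 0)
codes_int <- ifelse(is.na(x_int), 1L, 0L)
is.double(codes_dbl)
is.integer(codes_int)
```

The small fixed vector header is why the ratios are only approximately 2 and 1.5.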

Here's a tiny implementation to show what I'm talking about:

asMultiMissing <- function(x) {
   if (isMultiMissing(x))
     return(x)
   missingKind <- ifelse(is.na(x), 1, 0)
   structure(x,
             missingKind = missingKind,
             class = c("MultiMissing", class(x)))
}

isMultiMissing <- function(x)
   inherits(x, "MultiMissing")

missingKind <- function(x) {
   if (isMultiMissing(x))
     attr(x, "missingKind")
   else
     ifelse(is.na(x), 1, 0)
}

`missingKind<-` <- function(x, value) {
   class(x) <- setdiff(class(x), "MultiMissing")
   x[value != 0] <- NA
   x <- asMultiMissing(x)
   attr(x, "missingKind") <- value
   x
}

`[.MultiMissing` <- function(x, i, ...) {
   missings <- missingKind(x)
   x <- NextMethod()
   missings <- missings[i]
   missingKind(x) <- missings
   x
}

print.MultiMissing <- function(x, ...) {
   vals <- as.character(x)
   if (!is.character(x) || inherits(x, "noquote"))
     print(noquote(vals))
   else
     print(vals)
}

`[<-.MultiMissing` <- function(x, i, value, ...) {
   missings <- missingKind(x)
   class(x) <- setdiff(class(x), "MultiMissing")
   x[i] <- value
   missings[i] <- missingKind(value)
   missingKind(x) <- missings
   x
}

as.character.MultiMissing <- function(x, ...) {
   missings <- missingKind(x)
   result <- NextMethod()
   ifelse(missings != 0,
          paste0("NA.", missings), result)

}

This is incomplete.  It doesn't do printing very well, and it doesn't 
handle the case of assigning a MultiMissing value to a regular vector at 
all.  (I think you'd need an S4 implementation if you want to support 
that.)  But it does the basics:

 > x <- 1:10
 > missingKind(x)[4] <- 23
 > x
  [1] 1     2     3     NA.23 5     6     7     8     9
[10] 10
 > is.na(x)
  [1] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE
[10] FALSE
 > missingKind(x)
  [1]  0  0  0 23  0  0  0  0  0  0
 >
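
The limitation mentioned above (assigning a MultiMissing value into a regular vector) can be seen in a small standalone sketch. It assumes nothing beyond base R; the class name "Tagged" and the attribute "kind" are hypothetical stand-ins, not part of the code in this message. Because `[<-` dispatches on the object being assigned into, a plain vector on the left-hand side uses the internal default method, and the attributes of the value are dropped.

```r
# Hypothetical tagged value: an NA carrying kind 23, plus one ordinary value
tagged <- structure(c(NA, 2L),
                    kind = c(23L, 0L),
                    class = c("Tagged", "integer"))

plain <- 1:5                 # a regular integer vector, no class attribute
plain[1:2] <- tagged         # dispatch is on 'plain', so no S3 method runs

kept_attributes <- attributes(plain)   # nothing survives from 'tagged'
na_pattern <- is.na(plain)             # the NA itself is still there
kept_attributes
na_pattern
```

So the missingness survives but its kind does not, which is the case the S3 methods here cannot intercept.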

Duncan Murdoch
> 
>     Base R doesn't need anything else.
> 
>     You complained that users shouldn't need to know about attributes, and
>     they won't: you, as the author of the package that does this, will
>     handle all those details. Working in your subject area you know all the
>     different kinds of NAs that people care about, and how they code them in
>     input data, so you can make it all totally transparent. If you do it
>     well, someone in some other subject area with a completely different set
>     of kinds of missingness will be able to adapt your code to their use.
> 
> 
> But that is the whole point: the package author does not define possible 
> NAs (the possibilities are infinite), users do that.
> The package should only provide a simple method to achieve that.
> 
> 
>     I imagine this has all been done in one of the thousands of packages on
>     CRAN, but if it hasn't been done well enough for you, do it better.
> 
> 
> If it were, I would have found it by now...
> 
> Best wishes,
> Adrian

Duncan Murdoch

2021-May-26 16:05 UTC


[Rd] [External] Re: 1954 from NA

After 5 minutes more thought:

- code non-missing as missingKind = NA, not 0, so that missingKind could 
be a character vector, or missingKind = 0 could be supported.

- print methods should return the main argument, so mine should be

print.MultiMissing <- function(x, ...) {
   vals <- as.character(x)
   if (!is.character(x) || inherits(x, "noquote"))
     print(noquote(vals))
   else
     print(vals)
   invisible(x)
}

This still needs a lot of improvement to be a good print method, but 
I'll leave that to you.
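
The first point can be sketched roughly as follows, assuming character labels for the kinds and NA_character_ for values that are not missing. The labels "refused" and "not applicable" are made up for the example, and the sketch uses a bare attribute with no class or methods, so it is not a drop-in replacement for the functions in the earlier message.

```r
x <- c(5L, NA, 3L, NA)

# Kind is NA where the value is present, a label where it is missing
kind <- rep(NA_character_, length(x))
kind[is.na(x)] <- c("refused", "not applicable")   # hypothetical labels
attr(x, "missingKind") <- kind

# Standard missingness is unaffected by the attribute
still_missing <- as.vector(is.na(x))

# Display form in the spirit of as.character.MultiMissing
labelled <- ifelse(is.na(kind), as.character(x), paste0("NA.", kind))
labelled
```

With this coding, a kind of 0 (or any other value) is free to be used as a real label.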

Duncan Murdoch


Adrian Dușa

2021-May-26 17:08 UTC


[Rd] [External] Re: 1954 from NA

On Wed, May 26, 2021 at 6:43 PM Duncan Murdoch <murdoch.duncan at gmail.com>
wrote:
> [...]
> > In the best case scenario, it unnecessarily triples the size of the
> > data, but perhaps this is the only way forward.
>
> I don't see how it could triple the size.  Surely an integer has enough
> values to cover all possible kinds of missingness.  So on integer or
> factor data you'd double the size, on real or character data you'd
> increase it by 50%.  (This is assuming you're on a 64 bit platform with
> 32 bit integers and 64 bit reals and pointers.)

Apologies, that was supposed to be double the size, not triple: 99% of the
survey data are integers.
But I suppose that is alright; space doesn't seem to be a problem.

Thank you very much for the examples, they do indeed seem to cover the
basics.
(That is what I meant when I wrote that there might be a way without
tagging NAs.)

Will take it from there, best wishes,
Adrian
