thr3ads.net - R devel - [Rd] 1954 from NA [May 2021]

Adrian Dușa

2021-May-23 18:04 UTC

[Rd] 1954 from NA

Dear Tomas,

I understand that perfectly, but that is fine.
The payload is not going to be used in any computations anyways, it is
strictly an information carrier that differentiates between different types
of (tagged) NA values.

Having only one NA value in R is extremely limiting for the social
sciences, where multiple missing values may exist, because respondents:
- did not know what to respond, or
- did not want to respond, or perhaps
- the question did not apply in a given situation etc.

All of these need to be captured, stored, and most importantly treated as
if they would be regular missing values. Whether the payload might be lost
in computations makes no difference: they were supposed to be "missing
values" anyways.

The original question is how the payload is currently stored: as an
unsigned int of 32 bits, or as an unsigned short of 16 bits. If the R
internals would not be affected (and I see no reason why they would be), it
would allow an entire universe for the social sciences that is not
currently available and which all other major statistical packages do offer.
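
To make the storage question concrete, here is a minimal sketch that builds the NA bit pattern twice: once from two 32-bit words (the layout this thread attributes to the current definition) and once from four 16-bit words. The function names are hypothetical and the words are combined with shifts rather than a union, so the result does not depend on endianness. One detail worth noting: with a four-word union, the two middle words have to be set to zero explicitly, since an automatic variable that is only partly assigned leaves the rest indeterminate.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Build the NA bit pattern from two 32-bit words:
   high word 0x7ff00000, low word 1954. */
static uint64_t na_bits_from_32bit_words(void)
{
    uint32_t hi = 0x7ff00000u, lo = 1954u;
    return ((uint64_t)hi << 32) | lo;
}

/* Build it from four 16-bit words, most significant first.
   The two middle words are written explicitly as zero. */
static uint64_t na_bits_from_16bit_words(void)
{
    uint16_t w[4] = { 0x7ff0u, 0u, 0u, 1954u };
    return ((uint64_t)w[0] << 48) | ((uint64_t)w[1] << 32) |
           ((uint64_t)w[2] << 16) | (uint64_t)w[3];
}

/* Reinterpret a 64-bit pattern as a double via memcpy. */
static double double_from_bits(uint64_t bits)
{
    double d;
    assert(sizeof d == sizeof bits);
    memcpy(&d, &bits, sizeof d);
    return d;
}
```

Both builders yield the same 64 bits, so the two layouts describe the same value as long as the middle bits are zero. Those middle bits are where any extra payload would have to live.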

Thank you very much, your attention is greatly appreciated,
Adrian

On Sun, May 23, 2021 at 7:59 PM Tomas Kalibera <tomas.kalibera at gmail.com> wrote:
> TLDR: tagging R NAs is not possible.
>
> External software should not depend on how R currently implements NA,
> this may change at any time. Tagging of NA is not supported in R (if it
> were, it would have been documented). It would not be possible to
> implement such tagging reliably with the current implementation of NA in R.
>
> NaN payload propagation is not standardized. Compilers are free to and
> do optimize code not preserving/achieving any specific propagation.
> CPUs/FPUs differ in how they propagate in binary operations, some zero
> the payload on any operation. Virtualized environments, binary
> translations, etc, may not preserve it in any way, either. ?NA has
> disclaimers about this, an NA may become NaN (payload lost) even in
> unary operations and also in binary operations not involving other NaN/NAs.
>
> Writing any new software that would depend on anything specific
> happening to the NaN payloads would not be a good idea. One can only
> reliably use the NaN payload bits for storage, that is if one avoids any
> computation at all, avoids passing the values to any external code
> unaware of such tagging (including R), etc. If such software wants any
> NaN to be understood as NA by R, it would have to use the documented R
> API for this (so essentially translating) - but given the problems
> mentioned above, there is really no point in doing that, because such
> NAs become NaNs at any time.
>
> Best
> Tomas
>
> On 5/23/21 9:56 AM, Adrian Dușa wrote:
> > Dear R devs,
> >
> > I am probably missing something obvious, but still trying to understand
> > why the 1954 from the definition of an NA has to fill 32 bits when it
> > normally doesn't need more than 16.
> >
> > Wouldn't the code below achieve exactly the same thing?
> >
> > typedef union
> > {
> >      double value;
> >      unsigned short word[4];
> > } ieee_double;
> >
> >
> > #ifdef WORDS_BIGENDIAN
> > static CONST int hw = 0;
> > static CONST int lw = 3;
> > #else  /* !WORDS_BIGENDIAN */
> > static CONST int hw = 3;
> > static CONST int lw = 0;
> > #endif /* WORDS_BIGENDIAN */
> >
> >
> > static double R_ValueOfNA(void)
> > {
> >      volatile ieee_double x;
> >      x.word[hw] = 0x7ff0;
> >      x.word[lw] = 1954;
> >      return x.value;
> > }
> >
> > This question has to do with the tagged NA values from package haven,
> > on which I want to improve. Every available bit counts, especially if
> > multi-byte characters are going to be involved.
> >
> > Best wishes,
>
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>

Tomas Kalibera

2021-May-23 19:14 UTC

[Rd] 1954 from NA

On 5/23/21 8:04 PM, Adrian Dușa wrote:
> Dear Tomas,
>
> I understand that perfectly, but that is fine.
> The payload is not going to be used in any computations anyways, it is
> strictly an information carrier that differentiates between different
> types of (tagged) NA values.

Good, but unfortunately the delineation between computation and
non-computation is not always transparent. Even if an operation doesn't
look like "computation" at a high level, it may internally involve
computation - so, really, an R NA can become R NaN and vice versa, at
any point (this is not a "feature", but it is how things are now).

> Having only one NA value in R is extremely limiting for the social
> sciences, where multiple missing values may exist, because respondents:
> - did not know what to respond, or
> - did not want to respond, or perhaps
> - the question did not apply in a given situation etc.
>
> All of these need to be captured, stored, and most importantly treated 
> as if they would be regular missing values. Whether the payload might 
> be lost in computations makes no difference: they were supposed to be 
> "missing values" anyways.

Ok, then I would probably keep the metadata about the missing values on
the side when implementing such missing values in such code, and treat
them explicitly in supported operations.
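
One way to read "metadata on the side" is a pair of parallel arrays: the numeric values, and one reason code per element. The sketch below assumes that layout. The enum values, the struct, and `tagged_mean` are hypothetical names for illustration, not part of R or any package. The mean is one example of a "supported operation" that looks at the reason codes explicitly instead of relying on bits inside the double.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Hypothetical reason codes; PRESENT means the value is not missing. */
enum missing_reason { PRESENT = 0, DONT_KNOW = 1, REFUSED = 2, NOT_APPLICABLE = 3 };

/* A column stored as two parallel arrays of the same length. */
struct tagged_column {
    const double *value;          /* NAN wherever reason != PRESENT */
    const unsigned char *reason;  /* one missing_reason per element */
    size_t n;
};

/* Mean over present values only; returns NAN when nothing is present. */
static double tagged_mean(const struct tagged_column *col)
{
    double sum = 0.0;
    size_t count = 0;
    assert(col != NULL);
    for (size_t i = 0; i < col->n; i++) {
        if (col->reason[i] == PRESENT) {
            sum += col->value[i];
            count++;
        }
    }
    return count > 0 ? sum / (double)count : NAN;
}
```

Because the reasons never pass through floating-point arithmetic, nothing a compiler or CPU does to NaN payloads can change them.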

But, in principle, you can use the floating-point NaN payloads, and you
can pass such values to R. You just need to be prepared that you would
lose not only your payloads/tags, but also the difference between R NA
and R NaNs. Thanks to the value semantics of R, you would not lose the tags
in input values with proper reference counts (e.g. marked immutable),
because those values will not be modified.
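
As a rough illustration of storage-only use of the payload, the sketch below writes an 8-bit tag into a NaN and reads it back. Everything here is an assumption made for the example: the helper names are hypothetical, the tag position (lowest byte of the high word) is an arbitrary choice, and the low-word check is modeled on the R_ValueOfNA code quoted in this thread rather than on any documented API.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NA_HIGH_WORD 0x7ff00000u  /* exponent all ones, as in the quoted code */
#define NA_LOW_WORD  1954u        /* low word used as the NA marker */

/* Hypothetical helper: an NA-like NaN carrying an 8-bit tag in the
   lowest byte of the high word. */
static double make_tagged(unsigned char tag)
{
    uint64_t bits = ((uint64_t)(NA_HIGH_WORD | tag) << 32) | NA_LOW_WORD;
    double d;
    assert(sizeof d == sizeof bits);
    memcpy(&d, &bits, sizeof d);
    return d;
}

/* Raw 64-bit pattern of a double. */
static uint64_t bits_of(double d)
{
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);
    return bits;
}

/* Tag stored in the value; meaningful only if the bits were never recomputed. */
static unsigned char tag_of(double d)
{
    return (unsigned char)((bits_of(d) >> 32) & 0xffu);
}

/* Does the low word still carry the NA marker? */
static int has_na_marker(double d)
{
    return (uint32_t)(bits_of(d) & 0xffffffffu) == NA_LOW_WORD;
}
```

Calling `tag_of(make_tagged('a'))` returns the tag as long as the value is only copied. After something like `make_tagged('a') + 1.0`, neither `tag_of` nor `has_na_marker` can be relied on, which is the loss described above.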

Best
Tomas

Avi Gross

2021-May-23 19:21 UTC

[Rd] 1954 from NA

Arguably, R was not developed to satisfy some needs in the way intended.

When I have had to work with datasets from some of the social sciences, I have
had to adapt to subtleties in how they did things with software like SPSS, in
which an NA was represented by an out-of-bounds marker like 999 or "." or
even a blank cell. The problem is that in R, data such as integers or
floating point numbers are normally not stored as text but in their own
formats, and a vector by definition can only contain ONE data type. So the
various forms of NA as well as NaN and Inf had to be grafted on to be considered
VALID to share the same storage area as if they sort of were an integer or
floating point number or text or whatever.

It does strike me as possible to simply have a column that is something like a
factor that can contain as many NA excuses as you wish, such as "NOT
ANSWERED" to "CANNOT READ THE SQUIGGLE" to "NOT SURE" to
"WILL BE FILLED IN LATER" to "I DON'T SPEAK ENGLISH AND
CANNOT ANSWER STUPID QUESTIONS". This additional column would presumably
only have content when the other column has an NA. Your queries and other
changes would work on something like a data.frame where both such columns
coexisted.

Note that reading in data with multiple NA reasons may take extra work. If your
error codes are text, it will all become text. If the error codes are 999 and 998
and 997, it may all be treated as numeric, and you may not want to convert all
such codes to an NA immediately. Rather, you would use the first vector/column
to make the second vector and THEN replace everything that should be an NA with
an actual NA and reparse the entire vector to become properly numeric, unless you
like working with text and will convert to numbers as needed on the fly.
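
The two-step order described above (first derive the reason column, then blank the values) can be sketched as follows. The codes 997, 998 and 999 are the illustrative sentinels from the paragraph, and `split_sentinels` is a hypothetical name. A real dataset would need its own list of codes.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Split raw survey values into a numeric column and a reason column.
   Reason 0 means the value is present. Returns the number of missing entries. */
static size_t split_sentinels(const double *raw, double *value,
                              int *reason, size_t n)
{
    size_t missing = 0;
    assert(raw != NULL && value != NULL && reason != NULL);
    for (size_t i = 0; i < n; i++) {
        if (raw[i] == 997.0 || raw[i] == 998.0 || raw[i] == 999.0) {
            reason[i] = (int)raw[i];  /* step 1: remember why it is missing */
            value[i] = NAN;           /* step 2: only then blank the value */
            missing++;
        } else {
            reason[i] = 0;
            value[i] = raw[i];
        }
    }
    return missing;
}
```

Doing it in this order matters: once a sentinel has been replaced by a plain missing value, the reason can no longer be recovered from the numeric column.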

Now this form of annotation may not be pleasing but I suggest that an
implementation that does allow annotation may use up space too. Of course, if
your NA values are rare and space is only used then, you might save space. But
if you could make a factor column and have it use the smallest int it can get as
a basis, it may be a way to save on space.

People who have done work with R, especially those using the tidyverse, are
quite used to using one column to explain another. So if you are asked to, say,
tabulate what percent of missing values are due to reasons A/B/C, then the added
column works fine for that calculation too.
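
A small sketch of that tabulation, assuming the reasons A/B/C are coded 1, 2 and 3 in a separate integer column and 0 means the value is present. The coding and the name `missing_shares` are made up for the example.

```c
#include <assert.h>
#include <stddef.h>

#define N_REASONS 3  /* hypothetical reasons A, B, C coded 1, 2, 3 */

/* Fill pct[0..2] with the share (in percent) of missing entries due to
   reasons 1..3. Code 0 means present. Returns the number of missing entries. */
static size_t missing_shares(const int *reason, size_t n, double pct[N_REASONS])
{
    size_t count[N_REASONS] = { 0 };
    size_t missing = 0;
    assert(reason != NULL && pct != NULL);
    for (size_t i = 0; i < n; i++) {
        if (reason[i] >= 1 && reason[i] <= N_REASONS) {
            count[reason[i] - 1]++;
            missing++;
        }
    }
    for (int k = 0; k < N_REASONS; k++)
        pct[k] = missing ? 100.0 * (double)count[k] / (double)missing : 0.0;
    return missing;
}
```

The numeric column is not touched at all here, so the calculation works the same whether the missing values are stored as NA, NaN, or anything else.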
