thr3ads.net - R devel - [Rd] 1954 from NA [May 2021]

brodie gaslam

2021-May-23 13:30 UTC

[Rd] 1954 from NA

I should add, I don't know that you can rely on this
particular encoding of R's NA. If I were trying to restore
an NA from some external format, I would just generate an
R NA via e.g. NA_real_ in the R session I'm restoring the
external data into, and not try to hand-assemble one.

Best,

B.


On Sunday, May 23, 2021, 9:23:54 AM EDT, brodie gaslam via R-devel <r-devel
at r-project.org> wrote:





This is because the NA in question is NA_real_, which
is encoded in double precision IEEE-754, which uses
64 bits. The "1954" is just part of the NA. The NA
must also conform to the NaN encoding for double precision
numbers, which requires that the "beginning" portion of
the number be "0x7ff0" (well, I think it should be "0x7ff8"
but that's a different story), as you can see here:

    x.word[hw] = 0x7ff0;
    x.word[lw] = 1954;

Both those components are part of the same double precision
value. They are just accessed this way to make it easy to
set the high bits (63-32) and the low bits (31-0).

So NA is not just 1954, it's 0x7ff0 0000 in the high bits and
1954 in the low bits (note I'm mixing hex and decimals here).

In IEEE 754 double precision encoding, numbers that start
with 0x7ff (ignoring the sign bit) are all NaNs, unless every
remaining bit is zero, which encodes infinity. Apart from the
first of those remaining bits, which designates "quiet" vs
"signaling" NaNs, the rest of the number can be anything. R has
taken advantage of that to designate the R NA by setting the
lower bits to be 1954.

Note I'm being pretty loose about endianness, etc. here, but
hopefully this conveys the problem.
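
A minimal C sketch of the layout described above, assuming a 64-bit IEEE 754 double whose byte order matches that of 64-bit integers. The helper names (double_from_words, high_word, low_word) are made up for illustration and are not part of R's API; the high word 0x7ff00000 is the "0x7ff0 0000" mentioned above.

```c
#include <assert.h>
#include <math.h>
#include <stdint.h>
#include <string.h>

/* The thread's implicit assumption: double is 64 bits wide. */
static_assert(sizeof(double) == sizeof(uint64_t), "double must be 64 bits");

/* Hypothetical helper: glue a 32-bit high word (bits 63-32) and a
   32-bit low word (bits 31-0) into a single double. */
static double double_from_words(uint32_t high, uint32_t low)
{
    uint64_t bits = ((uint64_t)high << 32) | low;
    double d;
    memcpy(&d, &bits, sizeof d);   /* copy the bit pattern, no conversion */
    return d;
}

/* Hypothetical helper: read back bits 63-32 of a double. */
static uint32_t high_word(double d)
{
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);
    return (uint32_t)(bits >> 32);
}

/* Hypothetical helper: read back bits 31-0 of a double. */
static uint32_t low_word(double d)
{
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);
    return (uint32_t)(bits & 0xffffffffu);
}
```

With these, double_from_words(0x7ff00000u, 1954u) is a NaN whose low word reads back as 1954, while double_from_words(0x7ff00000u, 0u) is not a NaN at all but positive infinity, which is why the low bits cannot all be zero.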

In terms of your proposal, I'm not entirely sure what you gain.
You're still attempting to generate a 64 bit representation
in the end. If all you need is to encode the fact that there
was an NA, and restore it later as a 64 bit NA, then you can do
whatever you want so long as the end result conforms to the
expected encoding.

In terms of using 'short' here (which again, I don't see the
need for as you're using it to generate the final 64 bit encoding),
I see two possible problems. You're adding the dependency that
short will be 16 bits. We already have the (implicit) assumption
in R that double is 64 bits, and explicit that int is 32 bits.
But I think you'd be going a bit out on a limb assuming that short
is 16 bits (not sure). More important, if short is indeed 16 bits,
I think in:

    x.word[hw] = 0x7ff0;

You overflow short.

Best,

B.



On Sunday, May 23, 2021, 8:56:18 AM EDT, Adrian Dușa <dusa.adrian at
unibuc.ro> wrote:





Dear R devs,

I am probably missing something obvious, but still trying to understand why
the 1954 from the definition of an NA has to fill 32 bits when it normally
doesn't need more than 16.

Wouldn't the code below achieve exactly the same thing?

typedef union
{
    double value;
    unsigned short word[4];
} ieee_double;


#ifdef WORDS_BIGENDIAN
static CONST int hw = 0;
static CONST int lw = 3;
#else  /* !WORDS_BIGENDIAN */
static CONST int hw = 3;
static CONST int lw = 0;
#endif /* WORDS_BIGENDIAN */


static double R_ValueOfNA(void)
{
    volatile ieee_double x;
    x.word[hw] = 0x7ff0;
    x.word[lw] = 1954;
    return x.value;
}

This question has to do with the tagged NA values from package haven, on
which I want to improve. Every available bit counts, especially if
multi-byte characters are going to be involved.
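
One detail worth checking in the proposal above: with four 16-bit words, only word[hw] and word[lw] are assigned, so the two middle words of the automatic variable are left indeterminate. The sketch below is one possible variant, not R's code: it zeroes the union first, uses uint16_t instead of unsigned short, and replaces the WORDS_BIGENDIAN macro with a run-time probe. All names (ieee_double16, host_is_little_endian, na_from_shorts, bits_of) are hypothetical, and it assumes doubles and integers share the same byte order.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

static_assert(sizeof(double) == 4 * sizeof(uint16_t), "double must be 64 bits");

typedef union
{
    double   value;
    uint16_t word[4];
} ieee_double16;

/* Run-time stand-in for a configure-time WORDS_BIGENDIAN macro:
   returns 1 if the least significant byte is stored first. */
static int host_is_little_endian(void)
{
    const uint16_t probe = 1;
    unsigned char first;
    memcpy(&first, &probe, 1);
    return first == 1;
}

/* Build the NA-like pattern from 16-bit words. */
static double na_from_shorts(void)
{
    ieee_double16 x;
    const int hw = host_is_little_endian() ? 3 : 0;  /* highest 16 bits */
    const int lw = host_is_little_endian() ? 0 : 3;  /* lowest 16 bits  */

    memset(&x, 0, sizeof x);   /* the two middle words must be zero too */
    x.word[hw] = 0x7ff0;
    x.word[lw] = 1954;
    return x.value;
}

/* Raw 64-bit pattern of a double, for inspection. */
static uint64_t bits_of(double d)
{
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);
    return bits;
}
```

Under these assumptions bits_of(na_from_shorts()) is 0x7ff00000000007a2, the same pattern obtained from a 32-bit high word 0x7ff00000 and a 32-bit low word 1954.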

Best wishes,
-- 
Adrian Dusa
University of Bucharest
Romanian Social Data Archive
Soseaua Panduri nr. 90-92
050663 Bucharest sector 5
Romania
https://adriandusa.eu


______________________________________________
R-devel at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel



Mark van der Loo

2021-May-23 14:31 UTC


[Rd] 1954 from NA

I wrote about this once over here:
http://www.markvanderloo.eu/yaRb/2012/07/08/representation-of-numerical-nas-in-r-and-the-1954-enigma/

-M




Adrian Dușa

2021-May-23 14:45 UTC


[Rd] 1954 from NA

On Sun, May 23, 2021 at 4:33 PM brodie gaslam via R-devel <
r-devel at r-project.org> wrote:
> I should add, I don't know that you can rely on this
> particular encoding of R's NA.  If I were trying to restore
> an NA from some external format, I would just generate an
> R NA via e.g NA_real_ in the R session I'm restoring the
> external data into, and not try to hand assemble one.
>
Thanks for your answer, Brodie, especially on Sunday (much appreciated).
The aim is not to reconstruct an NA, but to "tag" an NA (and yes, I was
referring to an NA_real_ of course), as seen in action here:
https://github.com/tidyverse/haven/blob/master/src/tagged_na.c

That code:
- preserves the first part 0x7ff0
- preserves the last part 1954
- adds one additional byte to store (tag) a character provided in the SEXP
vector
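
The three steps listed above can be sketched in C as follows. This is an illustration only and not haven's actual code: the function names are hypothetical, and the tag position (bits 32-39, the lowest byte of the high 32 bits) is an assumption made for the example. It also assumes a 64-bit IEEE 754 double with the same byte order as 64-bit integers.

```c
#include <assert.h>
#include <math.h>
#include <stdint.h>
#include <string.h>

static_assert(sizeof(double) == sizeof(uint64_t), "double must be 64 bits");

/* 0x7ff0 in the top 16 bits, 1954 (0x7a2) in the low 32 bits. */
#define NA_LIKE_BITS UINT64_C(0x7ff00000000007a2)

/* Assumed tag position for this sketch: bits 32-39. */
#define TAG_SHIFT 32

/* Store one byte in the payload while keeping 0x7ff0 and 1954 intact. */
static double tag_na(unsigned char tag)
{
    uint64_t bits = NA_LIKE_BITS | ((uint64_t)tag << TAG_SHIFT);
    double d;
    memcpy(&d, &bits, sizeof d);
    return d;
}

/* Read the tag byte back. */
static unsigned char tag_of(double d)
{
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);
    return (unsigned char)((bits >> TAG_SHIFT) & 0xffu);
}

/* Low 32 bits, which still hold 1954 after tagging. */
static uint32_t low_word32(double d)
{
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);
    return (uint32_t)(bits & 0xffffffffu);
}
```

For example, tag_na('a') is still a NaN, tag_of() returns 'a' for it, and its low 32 bits are still 1954.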

That is precisely my understanding, that doubles starting with 0x7ff are
all NaNs. My question was related to the additional part 1954 from the low
bits: why does it need 32 bits?

The binary value of 1954 is 11110100010, which is represented by 11 bits
occupying at most 2 bytes... So why does it need 4 bytes?

Re. the possible overflow, I am not sure: 0x7ff0 is the decimal 32752, or
the binary 111111111110000.
That is just about enough to fit in the available 16 bits (actually 15 to
leave one for the sign bit), so I don't really understand why it would. And
in any case, the union definition uses an unsigned short which (if my
understanding is correct) should certainly not overflow:

typedef union
{
    double value;
    unsigned short word[4];
} ieee_double;
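
The bit counts and the range argument above can be checked with a few lines of C. The helper names are made up for this sketch. Note that the C standard only guarantees that short has at least 16 bits (SHRT_MAX of at least 32767, USHRT_MAX of at least 65535), so the code tests against the platform's actual limits rather than assuming an exact width.

```c
#include <assert.h>
#include <limits.h>

/* Guaranteed by the C standard: unsigned short holds at least 16 bits. */
static_assert(USHRT_MAX >= 0xffff, "unsigned short has at least 16 bits");

/* Number of binary digits needed to write v (0 for v == 0). */
static int bit_length(unsigned long v)
{
    int n = 0;
    while (v != 0) {
        n++;
        v >>= 1;
    }
    return n;
}

/* Does v fit in a signed short on this platform? */
static int fits_short(unsigned long v)
{
    return v <= (unsigned long)SHRT_MAX;
}

/* Does v fit in an unsigned short on this platform? */
static int fits_ushort(unsigned long v)
{
    return v <= (unsigned long)USHRT_MAX;
}
```

Here bit_length(1954) is 11 and bit_length(0x7ff0) is 15, and 0x7ff0 (32752) passes both fits_short() and fits_ushort().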

What is gained with this proposal: 16 additional bits to do something with.
For the moment, only 16 are available (from the lower part of the high 32
bits). If the value 1954 were checked as a short instead of an int, the
other 16 bits would become available. And those bits could be extremely
valuable to tag multi-byte characters, for instance, but also numbers
higher than 32767.
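
To make the trade-off concrete, here is a hedged C sketch contrasting a check of 1954 against the whole low 32 bits (the "int" check this thread describes) with a hypothetical check against only the low 16 bits. None of these functions are R's or haven's API; the names and the payload position (bits 16-31) are assumptions for illustration, and a 64-bit IEEE 754 double with integer-matching byte order is assumed.

```c
#include <assert.h>
#include <math.h>
#include <stdint.h>
#include <string.h>

static_assert(sizeof(double) == sizeof(uint64_t), "double must be 64 bits");

/* 0x7ff0 in the top 16 bits, 1954 in the low bits. */
#define NA_LIKE_BITS UINT64_C(0x7ff00000000007a2)

static uint64_t raw_bits(double d)
{
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);
    return bits;
}

/* NA-like value carrying 16 extra payload bits in bits 16-31. */
static double na_with_payload16(uint16_t payload)
{
    uint64_t bits = NA_LIKE_BITS | ((uint64_t)payload << 16);
    double d;
    memcpy(&d, &bits, sizeof d);
    return d;
}

/* Check 1954 against the full low 32-bit word. */
static int is_na_32(double d)
{
    return isnan(d) && (raw_bits(d) & 0xffffffffu) == 1954u;
}

/* Hypothetical variant: check 1954 against the low 16 bits only. */
static int is_na_16(double d)
{
    return isnan(d) && (raw_bits(d) & 0xffffu) == 1954u;
}
```

A value such as na_with_payload16(0xbeef) passes is_na_16() but fails is_na_32(), so any code that still compares the full 32-bit word would no longer treat such a value as NA.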

Best wishes,
Adrian
