thr3ads.net - R devel - [Rd] 1954 from NA [May 2021]

Bertram, Alexander

2021-May-24 12:29 UTC

[Rd] 1954 from NA

Dear Adrian,
I just wanted to pipe in and underscore Tomas's point: the payload bits of
IEEE 754 floating point values are no place to store data that you care
about or need to keep. That is not only related to the R APIs, but also to how
processors handle floating point values and signaling and non-signaling
NaNs. It is very difficult to reason about when and under which
circumstances these bits are preserved. I spent a lot of time working on
Renjin's handling of these values and I can assure you that any such scheme
will end in tears.
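
For readers unfamiliar with the payload in question, here is a minimal sketch (not part of the original message) that inspects the bytes of R's own NA_real_. It assumes a standard R build in which NA_real_ is a NaN whose low 32-bit word holds 1954, which is the value the thread subject refers to. Whether that payload survives arithmetic is exactly the platform-dependent part, so the sketch makes no claim about it.

```r
# Dump the 8 bytes of NA_real_ in big-endian order (most significant first).
bytes <- writeBin(NA_real_, raw(), endian = "big")
print(bytes)

# Reassemble the low 32-bit word; this is the payload R checks to tell
# NA apart from an ordinary NaN.
low_word <- sum(as.integer(bytes[5:8]) * 256^(3:0))
print(low_word)

# After arithmetic the result is still "missing" for is.na(), but whether
# the payload bits are unchanged is not something to rely on.
x <- NA_real_ + 1
print(is.na(x))
```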

A far, far better option is to use R's attributes to store this kind of
metadata. This is exactly what this language feature is for. There is
already a standard 'levels' attribute that holds the labels of factors like
"Yes", "No", "Refused", "Interviewer error", etc. In the past, I've worked
on projects where we stored an additional attribute like "missingLevels"
that stores extra metadata on which levels should be used in which kind of
analysis. That way, you can preserve all the information, and then write a
utility function which automatically applies certain logic to a whole
dataframe just before passing the data to an analysis function. This is
also important because in surveys like this, different values should be
excluded at different times. For example, you might want to include all
responses in a data quality report, but exclude interviewer error and
refusals when conducting a PCA or fitting a model.
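
The approach described above could be sketched roughly as follows. This is an illustration under stated assumptions, not code from the projects mentioned: the data, the helper name apply_missing, and the exact shape of the "missingLevels" attribute (a character vector of level names) are all hypothetical.

```r
# Hypothetical survey responses stored as a factor.
answers <- factor(
  c("Yes", "No", "Refused", "Yes", "Interviewer error"),
  levels = c("Yes", "No", "Refused", "Interviewer error")
)

# Assumed convention: a character vector naming the levels that should be
# treated as missing in an analysis.
attr(answers, "missingLevels") <- c("Refused", "Interviewer error")

survey <- data.frame(id = 1:5)
survey$answer <- answers

# Hypothetical helper: for every factor column carrying a "missingLevels"
# attribute, turn the flagged levels into NA and drop them from the levels.
apply_missing <- function(df) {
  for (col in names(df)) {
    miss <- attr(df[[col]], "missingLevels")
    if (is.factor(df[[col]]) && !is.null(miss)) {
      values <- as.character(df[[col]])
      values[values %in% miss] <- NA
      keep <- setdiff(levels(df[[col]]), miss)
      df[[col]] <- factor(values, levels = keep)
    }
  }
  df
}

# The original data keeps every response (e.g. for a data quality report);
# the converted copy is what would go to a model or PCA.
model_data <- apply_missing(survey)
print(table(survey$answer))
print(table(model_data$answer, useNA = "ifany"))
```

The original data frame is left untouched, so different exclusion rules can be applied at different stages of an analysis.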

Best,
Alex

On Mon, May 24, 2021 at 2:03 PM Adrian Dușa <dusa.adrian at gmail.com> wrote:
> On Mon, May 24, 2021 at 1:31 PM Tomas Kalibera <tomas.kalibera at gmail.com>
> wrote:
>
> > [...]
> >
> > For the reasons I explained, I would be against such a change. Keeping the
> > data on the side, as also recommended by others on this list, would allow
> > you for a reliable implementation. I don't want to support fragile package
> > code building on unspecified R internals, and in this case particularly
> > internals that themselves have not stood the test of time, so are at high
> > risk of change.
> >
> I understand, and it makes sense.
> We'll have to wait for the R internals to settle (this really is
> surprising, I wonder how other software have solved this). In the meantime,
> I will probably go ahead with NaNs.
>
> Thank you again,
> Adrian

-- 
Alexander Bertram
Technical Director
*BeDataDriven BV*

Web: http://bedatadriven.com
Email: alex at bedatadriven.com
Tel. Nederlands: +31(0)647205388
Skype: akbertram

Adrian Dușa

2021-May-24 13:09 UTC

[Rd] 1954 from NA

Dear Alex,

Thanks for piping in, I am learning with each new message.
The problem is clear, but the solution escapes me. I've already tried
the attributes route, and it is going to triple the data size: along with the
additional (logical) variable that specifies which level is missing, one
also needs to store an index so that sorting the data still
maintains the correct information.

One also needs to think about subsetting (subset the attributes as well),
splitting (the same), aggregating multiple datasets (even more attention),
creating custom vectors out of multiple variables... complexity quickly
grows towards infinity.
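
A small sketch of the subsetting problem, added for illustration: the attribute name "na_codes" and the values are hypothetical, but the behaviour shown (plain `[` subsetting of an atomic vector drops custom attributes) is standard R.

```r
# Hypothetical income variable with two declared missing codes.
income <- c(1200, 999, 3400, 998)
attr(income, "na_codes") <- c(refused = 999, dont_know = 998)

# The attribute is present on the full vector ...
print(attributes(income))

# ... but an ordinary subset does not carry it along.
first_two <- income[1:2]
print(attributes(first_two))
```

Every operation that creates a new vector would need its own method to keep the metadata in sync.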

R factors are nice indeed, but:
- there are numerical variables which can hold multiple missing values (for
instance income)
- factors convert the original questionnaire values: if a missing value was
coded 999, turning that into a factor would convert that value into
something else
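
To make the second point concrete, a minimal sketch with made-up questionnaire codes: once the values go through factor(), the stored integer codes are level positions, not the original numbers.

```r
# Hypothetical questionnaire codes, with 999 used as a missing code.
codes <- c(1, 2, 999, 2)
f <- factor(codes)

# The underlying integers are positions in the level set, so 999 becomes 3.
internal <- as.integer(f)
print(internal)

# The original values are recoverable only by going through the labels.
recovered <- as.numeric(as.character(f))
print(recovered)
```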

I really, and wholeheartedly, do appreciate all advice: but please be
assured that I have been thinking about this for more than 10 years and
still haven't found a satisfactory solution.

Which makes it even more intriguing, since other software like SAS or Stata
solved this decades ago: what is their implementation, and how come
they don't seem to be affected by the new M1 architecture?
When package "haven" introduced the tagged NA values I said: ah-haa... so
that is how it's done... only to learn that its implementation is just as
fragile as the R internals.

There really should be a robust solution for this seemingly mundane
problem, but apparently it is far from mundane...

Best wishes,
Adrian


