Hi Adrian,
Have a look at the vctrs package: it has low-level primitives that might
simplify your life a bit. I think you can get quite far by creating a custom
type that stores NAs in an attribute and uses the vctrs proxy functionality to
preserve those attributes across different operations. Going that route will
likely give you a much more flexible and robust solution.
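For illustration, a minimal sketch of that idea using a vctrs record type. The class name `tagged_dbl` and the `tag` field are my invention for the example, not an existing API:

```r
library(vctrs)

# A hypothetical "tagged double": each value carries a parallel character
# tag; an empty tag means a regular observation, a non-empty tag names the
# kind of missingness ("refused", "dont_know", ...).
new_tagged_dbl <- function(value = double(), tag = character(length(value))) {
  new_rcrd(list(value = value, tag = tag), class = "tagged_dbl")
}

format.tagged_dbl <- function(x, ...) {
  out <- format(field(x, "value"))
  tagged <- field(x, "tag") != ""
  out[tagged] <- paste0("NA(", field(x, "tag")[tagged], ")")
  out
}

x <- new_tagged_dbl(c(1.5, 2, 999), c("", "", "refused"))

# Subsetting goes through the record proxy, so the tags travel with the
# values instead of being dropped:
field(x[3], "tag")  # "refused"
```

Because a record type stores the fields side by side and routes all subsetting, combining, and coercion through vctrs' proxy/restore mechanism, you do not have to reimplement attribute bookkeeping for every operation yourself.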
Best,
Taras
> On 24 May 2021, at 15:09, Adrian Dușa <dusa.adrian at gmail.com> wrote:
>
> Dear Alex,
>
> Thanks for piping in, I am learning with each new message.
> The problem is clear, the solution escapes me though. I've already tried
> the attributes route: it is going to triple the data size: along with the
> additional (logical) variable that specifies which level is missing, one
> also needs to store an index such that sorting the data would still
> maintain the correct information.
>
> One also needs to think about subsetting (subset the attributes as well),
> splitting (the same), aggregating multiple datasets (even more attention),
> creating custom vectors out of multiple variables... complexity quickly
> grows towards infinity.
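The fragility Adrian describes is easy to demonstrate in base R: ordinary attributes are silently dropped by subsetting and sorting. A small sketch (the attribute name `miss` is just for illustration):

```r
x <- c(1500, 2000, 999)
attr(x, "miss") <- c(FALSE, FALSE, TRUE)  # which entries are coded missing

y <- x[2:3]       # subsetting drops the attribute entirely
attr(y, "miss")   # NULL

z <- sort(x)      # so does sorting
attr(z, "miss")   # NULL
```

This is why a parallel attribute is not enough on its own: every operation that reorders or subsets the data needs extra code to keep the attribute in sync.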
>
> R factors are nice indeed, but:
> - there are numerical variables which can hold multiple missing values (for
> instance income)
> - factors convert the original questionnaire values: if a missing value was
> coded 999, turning that into a factor would convert that value into
> something else
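The second point is easy to verify: turning a numeric vector into a factor replaces the original questionnaire codes with internal level indices, so the coded value 999 is no longer directly visible in the numeric representation:

```r
income <- c(1500, 2000, 999)  # 999 is the questionnaire's "missing" code
f <- factor(income)

as.numeric(f)                 # 2 3 1 -- internal level indices, not the data
as.numeric(as.character(f))   # 1500 2000 999 -- the original values
```

The round trip through `as.character()` recovers the values, but any code that treats the factor numerically sees the indices instead.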
>
> I really, and wholeheartedly, do appreciate all advice: but please be
> assured that I have been thinking about this for more than 10 years and
> still haven't found a satisfactory solution.
>
> Which makes it even more intriguing, since other software like SAS or Stata
> have solved this for decades: what is their implementation, and how come
> they don't seem to be affected by the new M1 architecture?
> When package "haven" introduced the tagged NA values I said: ah-haa... so
> that is how it's done... only to learn that implementation is just as
> fragile as the R internals.
>
> There really should be a robust solution for this seemingly mundane
> problem, but apparently it is far from mundane...
>
> Best wishes,
> Adrian
>
>
> On Mon, May 24, 2021 at 3:29 PM Bertram, Alexander <alex at bedatadriven.com> wrote:
>
>> Dear Adrian,
>> I just wanted to pipe in and underscore Thomas' point: the payload
bits of
>> IEEE 754 floating point values are no place to store data that you care
>> about or need to keep. That is not only related to the R APIs, but also
how
>> processors handle floating point values and signaling and non-signaling
>> NaNs. It is very difficult to reason about when and under which
>> circumstances these bits are preserved. I spent a lot of time working
on
>> Renjin's handling of these values and I can assure that any such
scheme
>> will end in tears.
>>
>> A far, far better option is to use R's attributes to store this kind of
>> metadata. This is exactly what this language feature is for. There is
>> already a standard 'levels' attribute that holds the labels of factors like
>> "Yes", "No", "Refused", "Interviewer error", etc. In the past, I've worked
>> on projects where we stored an additional attribute like "missingLevels"
>> that stores extra metadata on which levels should be used in which kind of
>> analysis. That way, you can preserve all the information, and then write a
>> utility function which automatically applies certain logic to a whole
>> dataframe just before passing the data to an analysis function. This is
>> also important because in surveys like this, different values should be
>> excluded at different times. For example, you might want to include all
>> responses in a data quality report, but exclude interviewer error and
>> refusals when conducting a PCA or fitting a model.
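A minimal sketch of that pattern in base R. The attribute name `missingLevels` and the helper `apply_missing` follow Alex's description but are illustrative, not a standard API:

```r
# A factor with one level that should count as missing in most analyses
f <- factor(c("Yes", "No", "Refused", "Yes"),
            levels = c("Yes", "No", "Refused"))
attr(f, "missingLevels") <- "Refused"

# Turn the declared missing levels into real NAs just before analysis
apply_missing <- function(x) {
  miss <- attr(x, "missingLevels")
  if (is.null(miss)) return(x)
  x[x %in% miss] <- NA
  droplevels(x)
}

table(apply_missing(f), useNA = "ifany")
```

Applied over a whole data frame (e.g. via `lapply`), this keeps the full information in the stored data while each analysis sees only the values it should.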
>>
>> Best,
>> Alex
>>
>> On Mon, May 24, 2021 at 2:03 PM Adrian Dușa <dusa.adrian at gmail.com> wrote:
>>
>>> On Mon, May 24, 2021 at 1:31 PM Tomas Kalibera <tomas.kalibera at gmail.com>
>>> wrote:
>>>
>>>> [...]
>>>>
>>>> For the reasons I explained, I would be against such a change. Keeping the
>>>> data on the side, as also recommended by others on this list, would allow
>>>> for a reliable implementation. I don't want to support fragile package
>>>> code building on unspecified R internals, and in this case particularly
>>>> internals that themselves have not stood the test of time, so are at high
>>>> risk of change.
>>>>
>>> I understand, and it makes sense.
>>> We'll have to wait for the R internals to settle (this really is
>>> surprising; I wonder how other software has solved this). In the meantime,
>>> I will probably go ahead with NaNs.
>>>
>>> Thank you again,
>>> Adrian
>>>
>>>
>>> ______________________________________________
>>> R-devel at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>>
>>
>>
>> --
>> Alexander Bertram
>> Technical Director
>> *BeDataDriven BV*
>>
>> Web: http://bedatadriven.com
>> Email: alex at bedatadriven.com
>> Tel. Nederlands: +31(0)647205388
>> Skype: akbertram
>>
>