Dear Adrian,
SPSS and other packages handle this problem in a way very similar to what I
described: they store additional metadata for each variable. You can see
this in the way that SPSS organizes its file format: each "variable" has
additional metadata that indicates how specific values of the variable,
encoded as an integer or a floating point number, should be handled in
analysis. Before you actually run a crosstab in SPSS, the metadata is
(presumably) applied to the raw data to arrive at an in-memory buffer on
which the actual model is fitted, etc.
The 20-line solution in R looks like this:
df <- data.frame(
  q1 = c(1, 10, 50, 999),
  q2 = c("Yes", "No", "Don't know", "Interviewer napping"),
  stringsAsFactors = FALSE
)

# Declare which values of each variable represent missing responses.
attr(df$q1, 'missing') <- 999
attr(df$q2, 'missing') <- c("Don't know", "Interviewer napping")

# Convert declared-missing values to NA before analysis.
excludeMissing <- function(df) {
  for (q in names(df)) {
    v <- df[[q]]
    mv <- attr(v, 'missing')
    if (!is.null(mv)) {
      df[[q]] <- ifelse(v %in% mv, NA, v)
    }
  }
  df
}

table(excludeMissing(df))
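(By default table() drops the NA rows from the crosstab; pass
useNA = "ifany" if you want them counted.)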
If you want to preserve the missing attribute when subsetting the vectors,
then you will have to take the example further by adding a class and a
`[.withMissing` method; a minimal sketch follows below. This might bring
the whole project to a few hundred lines, but the rules that apply here
are well defined and well understood, giving you a proper basis on which
to build. And perhaps the vctrs package might make this even simpler;
take a look.
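Here is that sketch (illustrative only: the class name "withMissing" is
made up, and a real implementation would need more methods than `[`):

as_withMissing <- function(x, missing) {
  attr(x, "missing") <- missing
  class(x) <- c("withMissing", class(x))
  x
}

`[.withMissing` <- function(x, ...) {
  out <- NextMethod()                  # default `[` drops attributes
  attr(out, "missing") <- attr(x, "missing")
  class(out) <- oldClass(x)
  out
}

q1 <- as_withMissing(c(1, 10, 50, 999), missing = 999)
attr(q1[2:4], "missing")               # 999 is preserved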
Best,
Alex
On Mon, May 24, 2021 at 3:20 PM Taras Zakharko <taras.zakharko at uzh.ch> wrote:
> Hi Adrian,
>
> Have a look at the vctrs package; they have low-level primitives that
> might simplify your life a bit. I think you can get quite far by creating
> a custom type that stores NAs in an attribute and utilizes vctrs proxy
> functionality to preserve these attributes across different operations.
> Going that route will likely give you a much more flexible and robust
> solution.
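A minimal sketch of that idea (the class name "declared" is invented here;
vctrs reattaches attributes after operations through its proxy/restore
mechanism):

library(vctrs)

new_declared <- function(x = double(), missing = double()) {
  new_vctr(x, missing = missing, class = "declared")
}

x <- new_declared(c(1, 10, 50, 999), missing = 999)
attr(x[2:4], "missing")   # 999 survives subsetting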
>
> Best,
>
> Taras
>
> > On 24 May 2021, at 15:09, Adrian Dușa <dusa.adrian at gmail.com> wrote:
> >
> > Dear Alex,
> >
> > Thanks for piping in, I am learning with each new message.
> > The problem is clear; the solution escapes me, though. I've already
> > tried the attributes route: it is going to triple the data size. Along
> > with the additional (logical) variable that specifies which level is
> > missing, one also needs to store an index so that sorting the data
> > would still maintain the correct information.
> >
> > One also needs to think about subsetting (subset the attributes as
> > well), splitting (the same), aggregating multiple datasets (even more
> > attention), creating custom vectors out of multiple variables...
> > complexity quickly grows towards infinity.
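The subsetting problem in three lines:

x <- c(1, 10, 999)
attr(x, "missing") <- 999
attributes(x[1:2])   # NULL: base subsetting silently drops the attribute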
> >
> > R factors are nice indeed, but:
> > - there are numerical variables which can hold multiple missing values
> >   (for instance, income)
> > - factors convert the original questionnaire values: if a missing value
> >   was coded 999, turning that into a factor would convert that value
> >   into something else (see the illustration below)
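That conversion, concretely:

x <- c(1, 10, 999)
as.integer(factor(x))   # 1 2 3: the original code 999 is lost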
> >
> > I really, and wholeheartedly, do appreciate all the advice, but please
> > be assured that I have been thinking about this for more than 10 years
> > and still haven't found a satisfactory solution.
> >
> > Which makes it even more intriguing, since other software like SAS or
> > Stata solved this decades ago: what is their implementation, and how
> > come they don't seem to be affected by the new M1 architecture?
> > When the "haven" package introduced tagged NA values I said: ah-ha, so
> > that is how it's done... only to learn that the implementation is just
> > as fragile as the R internals.
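For reference, haven's tagged-NA API looks like this; the tags live in the
payload bits of a NaN, which is exactly the fragility discussed in this
thread:

library(haven)
x <- c(1, 10, tagged_na("a"), tagged_na("b"))
na_tag(x)           # NA NA "a" "b"
is_tagged_na(x)     # FALSE FALSE TRUE TRUE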
> >
> > There really should be a robust solution for this seemingly mundane
> > problem, but apparently it is far from mundane...
> >
> > Best wishes,
> > Adrian
> >
> >
> > On Mon, May 24, 2021 at 3:29 PM Bertram, Alexander <alex at bedatadriven.com> wrote:
> >
> >> Dear Adrian,
> >> I just wanted to pipe in and underscore Tomas' point: the payload bits
> >> of IEEE 754 floating point values are no place to store data that you
> >> care about or need to keep. That is not only related to the R APIs, but
> >> also to how processors handle floating point values and signaling and
> >> non-signaling NaNs. It is very difficult to reason about when and under
> >> which circumstances these bits are preserved. I spent a lot of time
> >> working on Renjin's handling of these values and I can assure you that
> >> any such scheme will end in tears.
> >>
> >> A far, far better option is to use R's attributes to store this kind
> >> of metadata. This is exactly what this language feature is for. There
> >> is already a standard 'levels' attribute that holds the labels of
> >> factors, like "Yes", "No", "Refused", "Interviewer error", etc. In the
> >> past, I've worked on projects where we stored an additional attribute
> >> like "missingLevels" that holds extra metadata on which levels should
> >> be used in which kind of analysis. That way, you can preserve all the
> >> information, and then write a utility function which automatically
> >> applies certain logic to a whole data frame just before passing the
> >> data to an analysis function. This is also important because in surveys
> >> like this, different values should be excluded at different times. For
> >> example, you might want to include all responses in a data quality
> >> report, but exclude interviewer error and refusals when conducting a
> >> PCA or fitting a model.
> >>
> >> Best,
> >> Alex
> >>
> >> On Mon, May 24, 2021 at 2:03 PM Adrian Dușa <dusa.adrian at gmail.com> wrote:
> >>
> >>> On Mon, May 24, 2021 at 1:31 PM Tomas Kalibera <tomas.kalibera at gmail.com> wrote:
> >>>
> >>>> [...]
> >>>>
> >>>> For the reasons I explained, I would be against such a change.
> >>>> Keeping the data on the side, as also recommended by others on this
> >>>> list, would allow you a reliable implementation. I don't want to
> >>>> support fragile package code building on unspecified R internals, and
> >>>> in this case particularly internals that themselves have not stood
> >>>> the test of time, so are at high risk of change.
> >>>>
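One minimal sketch of the "data on the side" approach (all names
illustrative): keep a parallel logical data frame marking declared-missing
cells, and blank them out just before analysis:

df   <- data.frame(q1 = c(1, 10, 50, 999),
                   q2 = c("Yes", "No", "Don't know", "Interviewer napping"))
miss <- data.frame(q1 = df$q1 %in% 999,
                   q2 = df$q2 %in% c("Don't know", "Interviewer napping"))

df_clean <- df
df_clean[as.matrix(miss)] <- NA   # declared-missing cells become NA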
> >>> I understand, and it makes sense.
> >>> We'll have to wait for the R internals to settle (this really is
> >>> surprising; I wonder how other software has solved this). In the
> >>> meantime, I will probably go ahead with NaNs.
> >>>
> >>> Thank you again,
> >>> Adrian
> >>>
--
Alexander Bertram
Technical Director
*BeDataDriven BV*
Web: http://bedatadriven.com
Email: alex at bedatadriven.com
Tel. Nederlands: +31(0)647205388
Skype: akbertram