thr3ads.net - R devel - [Rd] [External] Re: 1954 from NA [May 2021]

Avi Gross

2021-May-25 04:05 UTC

[Rd] [External] Re: 1954 from NA

I was thinking about how one does things in a language that is properly
object-oriented versus R, which makes various half-assed attempts at being such.

Clearly in some such languages you can make an object that is a wrapper that
allows you to save an item that is the main payload as well as anything else you
want. You might need a way to convince everything else to allow you to make
things like lists and vectors and other collections of the objects and perhaps
automatically unbox them for many purposes. As an example in a language like
Python, you might provide methods so that adding A and B actually gets the value
out of A and/or B and adds them properly.  But there may be too many edge cases
to handle and some software may not pay attention to what you want including
some libraries written in other languages.
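
A minimal sketch of the wrapper idea described above, in Python. The class name
`Tagged`, its `meta` field and the `unbox` helper are hypothetical names chosen
for illustration; keeping the left operand's metadata on addition is an
arbitrary choice, not something the thread prescribes.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Tagged:
    """Hypothetical wrapper: a numeric payload plus free-form metadata."""
    value: float
    meta: dict = field(default_factory=dict)

    @staticmethod
    def unbox(x: Any) -> float:
        # Pull the payload out of a Tagged object; pass plain numbers through.
        return x.value if isinstance(x, Tagged) else x

    def __add__(self, other: Any) -> "Tagged":
        # Add the payloads; keep the left operand's metadata (arbitrary choice).
        return Tagged(self.value + Tagged.unbox(other), dict(self.meta))

    def __radd__(self, other: Any) -> "Tagged":
        # Handles `plain_number + Tagged`, which is also what sum() relies on.
        return Tagged(Tagged.unbox(other) + self.value, dict(self.meta))


a = Tagged(2.0, {"source": "survey wave 1"})
b = Tagged(3.0, {"source": "survey wave 2"})
total = a + b      # payloads are unboxed and added
mixed = 1 + a      # works with a plain number on the left as well
```

Code that does not know about the wrapper is where the edge cases show up: for
instance, `math.sqrt(a)` raises a `TypeError` here because `Tagged` defines no
conversion to float.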

I mention Python for the odd reason that it is now possible to combine Python
and R in the same program and sort of switch back and forth between data
representations. This may provide some openings for preserving and accessing
metadata when needed.

Realistically, if R was being designed from scratch TODAY, many things might be
done differently. But I recall it being developed at Bell Labs for purposes
where it was sort of revolutionary at the time (back when it was S) and designed
to do things in a vectorized way and probably primarily for the kinds of
scientific and mathematical operations where a single NA (of several types
depending on the data) was enough when augmented by a few things like NaN,
Inf and -Inf. I doubt they seriously saw a need to build in an unlimited number
of NAs that were all the same AND also all different.
As noted, had they had a reason to make it fully object-oriented too and made
the base types such as integer into full-fledged objects with room for
additional metadata, then things might have been different. I note I have seen
languages which have both a data type called integer in lower case and Integer
in upper case. One of them is regularly boxed and unboxed automagically when
used in a context that needs the other. As far as efficiency goes, this
invisibly adds many steps. So do languages that sometimes take a variable that
is a pointer and invisibly dereference it to provide the underlying field rather
than make you do extra typing and so on.

So is there any reason only an NA should have such metadata? Why not have
reasons associated with Inf stating it was an Inf because you asked for one or
because it was the result of a calculation such as dividing by zero (albeit that
might be a NaN) and so on. Maybe I could annotate integers with whether they are
prime, or even versus odd, or a factor of 144, or anything else I can imagine.
But at some point, the overhead from allowing all this can become substantial. I
was amused at how Python allows a function to be annotated, including by itself,
since it is an object. So it can store such metadata, perhaps in an attached
dictionary, so a complex costly calculation can have the results cached and when
you ask for the same thing in the same session, it checks if it has done it and
just returns the result in roughly constant time. But after a while, how many
cached results can there be?
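
The self-annotating function mentioned above can be sketched as follows. The
function `slow_square` and its `cache` and `calls` attributes are hypothetical
names used only for illustration, and the multiplication stands in for a
genuinely expensive calculation.

```python
def slow_square(n: int) -> int:
    """Hypothetical costly calculation that caches results on itself."""
    cache = slow_square.cache      # the function object carries its own dict
    if n not in cache:
        slow_square.calls += 1     # count real computations only
        cache[n] = n * n           # stand-in for the expensive work
    return cache[n]


# Attach the metadata to the function object itself.
slow_square.cache = {}
slow_square.calls = 0

first = slow_square(12)
second = slow_square(12)   # served from the cache, no recomputation
```

The dictionary here grows without bound, which is the concern raised in the
last sentence. One way to cap it in the standard library is
`functools.lru_cache` with a `maxsize` argument, which evicts the least
recently used entries.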

-----Original Message-----
From: R-devel <r-devel-bounces at r-project.org> On Behalf Of luke-tierney at uiowa.edu
Sent: Monday, May 24, 2021 9:15 AM
To: Adrian Dușa <dusa.adrian at unibuc.ro>
Cc: Greg Minshall <minshall at umich.edu>; r-devel <r-devel at r-project.org>
Subject: Re: [Rd] [External] Re: 1954 from NA

On Mon, 24 May 2021, Adrian Dușa wrote:
> On Mon, May 24, 2021 at 2:11 PM Greg Minshall <minshall at umich.edu> wrote:
>
>> [...]
>> if you have 500 columns of possibly-NA'd variables, you could have
>> one column of 500 "bits", where each bit has one of N values, N being
>> the number of explanations the corresponding column has for why the
>> NA exists.
>>

PLEASE DO NOT DO THIS!

It will not work reliably, as has been explained to you ad nauseam in this
thread.

If you distribute code that does this it will only lead to bug reports on R that
will waste R-core time.

As Alex explained, you can use attributes for this. If you need operations to
preserve attributes across subsetting you can define subsetting methods that do
that.

If you are dead set on doing something in C you can try to develop an ALTREP
class that provides augmented missing value information.

Best,

luke

>
> The mere thought of implementing something like that gives me shivers. 
> Not to mention such a solution should also be robust when subsetting, 
> splitting, column and row binding, etc. and everything can be lost if 
> the user deletes that particular column without realising its importance.
>
> Social science datasets are much more alive and complex than one might
> first think: there are multi-wave studies with tens of countries, and
> aggregating such data is already a complex process without adding even
> more complexity on top of that.
>
> As undocumented as they may be, or even subject to change, I think the
> R internals are much more reliable than this.
>
> Best wishes,
> Adrian
>
>
--
Luke Tierney
Ralph E. Wareham Professor of Mathematical Sciences
University of Iowa                  Phone:             319-335-3386
Department of Statistics and        Fax:               319-335-3017
    Actuarial Science
241 Schaeffer Hall                  email:   luke-tierney at uiowa.edu
Iowa City, IA 52242                 WWW:  http://www.stat.uiowa.edu
______________________________________________
R-devel at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

Adrian Dușa

2021-May-25 06:16 UTC

[Rd] [External] Re: 1954 from NA

Dear Avi,

Thank you so much for the extended messages, I read them carefully.
While what you describe partially offers a solution (I've already been
there), it creates additional work for the user, and some of that is
unnecessary.

What I am trying to achieve is best described in this draft vignette:

devtools::install_github("dusadrian/mixed")
vignette("mixed")

Once a value is declared to be missing, the user should not do anything
else about it. Despite being present, the value should automatically be
treated as missing by the software. That is the way it's done in all major
statistical packages like SAS, Stata and even SPSS.

My end goal is to make R attractive for my faculty peers (and beyond),
almost all of whom are massively using SPSS and sometimes Stata. But in
order to convince them to (finally) make the switch, I need to provide
similar functionality, not additional work.

Re. your first part of the message, I am definitely not trying to change
the R internals. The NA will still be NA, exactly as currently defined.
My initial proposal was based on the observation that the 1954 payload was
stored as an unsigned int (thus occupying 32 bits) when it is obvious it
doesn't need more than 16. That was the only proposed modification, and
everything else stays the same.

I have now learned, thanks to all contributors on this list, that building
something around that payload is risky because we do not know exactly what
the compilers will do. One possible solution that I can think of, while
(still) maintaining the current functionality around the NA, is to use a
different high word for the NA that would not trigger compilation issues.
But I have absolutely no idea what that implies for the other inner
workings of R.

I very much trust the R core will eventually find a robust solution,
they've solved much more complicated problems than this. I just hope the
current thread will put the idea of tagged NAs on the table, for when they
discuss this.

Once that is solved, and despite the current advice discouraging this
route, I believe tagging NAs is a valuable idea that should not be
discarded. After all, the NA is nothing but a tagged NaN.
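
To make the "tagged NaN" remark concrete, here is a small Python sketch that
builds the bit pattern R is understood to use for its real-valued NA (high
word 0x7FF00000, low word 1954) and inspects it. The constant and helper names
are hypothetical, and the sketch only reinterprets bits; it does not show what
compilers or arithmetic operations do to the payload, which is the risk
discussed in this thread.

```python
import math
import struct

NA_HIGH_WORD = 0x7FF00000   # exponent bits all ones: an Inf/NaN bit pattern
NA_LOW_WORD = 1954          # the payload that tells NA apart from other NaNs


def bits_to_double(bits: int) -> float:
    # Reinterpret a 64-bit integer as an IEEE 754 double.
    return struct.unpack("<d", struct.pack("<Q", bits))[0]


def double_to_bits(x: float) -> int:
    # Reinterpret an IEEE 754 double as a 64-bit integer.
    return struct.unpack("<Q", struct.pack("<d", x))[0]


na_bits = (NA_HIGH_WORD << 32) | NA_LOW_WORD
na_like = bits_to_double(na_bits)

# To floating-point hardware this value is simply a NaN.
is_nan = math.isnan(na_like)

# Read the low 32 bits back after a plain store and load of the double.
low_word = double_to_bits(na_like) & 0xFFFFFFFF

# Number of bits the value 1954 actually occupies.
payload_bits = NA_LOW_WORD.bit_length()
```

The payload needs 11 bits, which is consistent with the observation above that
it fits comfortably within 16.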

All the best,
Adrian



-- 
Adrian Dusa
University of Bucharest
Romanian Social Data Archive
Soseaua Panduri nr. 90-92
050663 Bucharest sector 5
Romania
https://adriandusa.eu
