That gives me a better understanding of what you want to do, Adrian. Getting
anyone to switch is always a challenge, but changing R enough to tempt them may
be a bigger challenge. This is an old story. I was the first adopter of C++ in
my area and at first had to have my code built within an all-C project, making
me reinvent some wheels so the same "make" system knew how to build the two
compatibly and link them. Of course, they all eventually had to join me in a
later release, but I had moved forward by then.
I have changed (or more accurately added) lots of languages in my life and
continue to do so. The biggest challenge is not to just adapt and use it
similarly to the previous ones already mastered but to understand WHY someone
designed the language this way and what kind of idioms are common and useful
even if that means a new way of thinking. But, of course, any "older" language
has evolved and often drifted in multiple directions. Many now borrow heavily
from others even when the philosophy is different and often the results are not
pretty. Making major changes in R might have serious impacts on existing
programs including just by making them fail as they run out of memory.
If you look at R, there is plenty you can do in base R, sometimes by standing on
your head. Yet you see package after package coming along that offers not just
new things but sometimes a reworking and even remodeling of old things. R has a
base graphics system I now rarely use and another called lattice I have no
reason to use again because I can do so much quite easily in ggplot. Similarly,
the evolving tidyverse group of packages approaches things from an interesting
direction, to the point where many people mainly use it and not base R. So if
someone were to teach a class on how to gather your data, analyze it, and draw
pretty pictures, the students might walk away thinking they had learned R when
they had actually learned these packages.
Your scenario seems related to a common scenario of how we can have values that
signal beyond some range in an out-of-band manner. Years ago we had functions in
languages like C that would return a -1 on failure when only non-negative
results were otherwise possible. That can work fine but fails in cases when any
possible value in the range can be returned. We have languages that deal with
this kind of thing using error handling constructs like exceptions. Sometimes
you bundle up multiple items into a structure and return that with one element
of the structure holding some kind of return status and another holding the
payload. A variation on this theme, as in languages like Go, is to have
functions that return multiple values, with one of them containing nil on
success and an error structure on failure.
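To make the struct-style pattern concrete in R terms, here is a minimal sketch (the name safe_sqrt and its fields are made up for illustration) of a function that bundles a status with its payload in a plain list, much as Go does with multiple return values:

```r
# Hypothetical sketch: return a payload plus an error field, Go-style,
# using a plain R list instead of multiple return values.
safe_sqrt <- function(x) {
  if (x < 0) {
    list(value = NA_real_, error = "negative input")  # failure: carry a reason
  } else {
    list(value = sqrt(x), error = NULL)               # success: error is NULL
  }
}

res <- safe_sqrt(-4)
if (!is.null(res$error)) {
  message("failed: ", res$error)
}
```

The cost, as noted above, is that only your own code knows the convention; nothing else in R treats such a list specially.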
The situation we have here that seems to be of concern to you is that you would
like each item in a structure to have attributes that are recognized and
propagated as it is being processed. Older languages tended not to even have
such a concept, so basic types simply existed, and two instances of the number
5 might even share the same underlying storage, as might two strings with the
same contents, and so on.
You could of course play the game of making a struct, as mentioned above, but
then you needed your own code to do all the handling as nothing else knew it
contained multiple items and which ones had which purpose.
R did add generalized attributes, and some are fairly well integrated, at least
partially. "Names" were discussed as not being easy to keep around. Factors use
their own tagging method that seems to work fairly well but probably not
everywhere. But what you want may be more general and not built on similar
foundations.
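A quick base-R illustration of how loosely integrated attributes are (the na_reason attribute here is invented for the sketch, not anything standard): ordinary subsetting silently drops them.

```r
# Attributes in base R: easy to attach, easy to lose.
x <- c(10, 999, 12)
attr(x, "na_reason") <- c(NA, "refused to answer", NA)  # hypothetical tag

attr(x, "na_reason")     # still there on the whole vector
attr(x[2], "na_reason")  # NULL: subsetting dropped the attribute
```

This is exactly why any tagging scheme built on attributes needs its own subsetting methods to survive ordinary use.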
I look at languages like Python that are arguably more object-oriented now than
R is, and in some ways can be extended better, albeit not in others. If I wanted
to create an object to hold the number 5, I could add methods that allow it to
participate in various ways with other objects, sometimes exposing the main
payload and sometimes hidden metadata. I might pair it with the string "five"
but also with dozens of other strings for the word representing 5 in many
languages. So I might have it act like a number in numerical situations and
like text when someone is using it to write a novel in any of many languages.
You seem to want to have the original text visible that gives a reason something
is missing (or something like that) but have the software TREAT it like it is
missing in calculations. In effect, you want is.na() to be a bit more like
is.numeric() or is.character() and care more about the TYPE of what is being
stored. An item may contain a 999 and yet not be seen as a number but as an NA.
The problem I see is that you also may want the item to be a string like
"DELETED" and yet include it in a vector that R insists can only hold
integers. R does have a built-in data structure called a list that indeed allows
that. You can easily store data as a list of lists rather than a list of vectors
and many other structures. Some of those structures might handle your needs BUT
may only work properly if you build your own packages, as with the tidyverse,
and break as soon as any other function encounters them!
But then you would arguably no longer be in R but in your own universe based on
R.
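As a rough sketch of the list idea (the as_na_aware helper is invented for illustration, not a package function), a list column can mix numbers with marker strings, but only your own code knows the convention:

```r
# A list can mix an integer payload with marker strings like "DELETED";
# an integer vector cannot.
col <- list(4L, 7L, "DELETED", 2L)

# Hand-rolled convention: any character entry is treated as missing.
as_na_aware <- function(x) {
  vapply(x, function(v) if (is.character(v)) NA_integer_ else v, integer(1))
}

mean(as_na_aware(col), na.rm = TRUE)  # computes on 4, 7, 2 only
```

Everything downstream must be funneled through such helpers, which is the "your own universe" problem in miniature.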
I have written much code that does things a bit sideways. For example, I might
have a treelike structure in which you do some form of search till you encounter
a leaf node and return that value to be used in a calculation. To perform a
calculation using multiple trees such as taking an average, you always use
find_value(tree) and never hand over the tree itself. As I think I pointed out
earlier, you can do things like that in many places and hand over a variation of
your data. In the ggplot example, you might have:
ggplot(data = mydata, aes(x = abs(col1), y = convert_string_to_numeric(col2))) + ...
ggplot would not use the original data in plotting but the view it is asked to
use. The function I made up above would know which values are some form of NA
and convert all others like "12.3" to numeric form. BUT it would not act as
simply or smoothly as when your data is already in the format everyone else
uses.
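One possible sketch of that made-up convert_string_to_numeric helper (the markers "DELETED" and "LATE" are assumed, not standard): listed markers become NA and anything else non-numeric also falls through to NA.

```r
# Hypothetical helper: map known missing-value markers to NA, then
# convert the rest; unlisted non-numeric strings also become NA.
convert_string_to_numeric <- function(x, na_markers = c("DELETED", "LATE")) {
  x[x %in% na_markers] <- NA
  suppressWarnings(as.numeric(x))  # silence the coercion warning
}

convert_string_to_numeric(c("12.3", "DELETED", "7"))
```

Handing such a "view" to ggplot keeps the original column intact while the plot sees clean numerics.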
So how does R know what something is? Presumably there is some overhead
associated with a vector, or some table that records the type. A list presumably
depends on each internal item having such a type. So maybe what you want is for
each item in a vector to have a type, where one type is some form of NA. But as
noted, R often does not give a damn about an NA and happily uses it to create
more nonsense. The mean of a bunch of numbers that includes one or more copies
of things like NA (or NaN or Inf) can pollute them all. Generally R is not
designed to give a darn. When people complain, they are told to call mean with
na.rm=TRUE, or to remove the NAs some way before asking for a mean, or perhaps
reset them to something like zero.
So if you want to leave your variables in place with assorted meanings but a tag
saying they are to be treated as NA, much in R might have to change. Your
suggested approach, though, is not yet fully clear, but it might mean doing
something analogous to using extra bits and hoping nobody notices.
So, the solution is both blindingly obvious and even more blindingly stupid. Use
complex numbers! All normal content shall be stored as numbers like 5.3+0i and
any variant on NA shall be stored as something like 0+3i where 3 means an NA of
type 3.
OK, humor aside, since the social sciences do not tend to even know what complex
numbers are, this should provide another dimension to hide lots of meaningless
info. Heck, you could convert a message like "LATE" into some numeric form.
Assuming an English-centered world (which I do not!), you could store it with L
replaced by 12, A by 01, and so on, so the imaginary component might look like
0+12012005i and be easily decoded back into LATE when needed. Again, not a serious
proposal. The storage would probably be twice the size of a numeric, albeit you
can extract the real part when needed for normal calculations and the imaginary
part when you want to know the NA type or whatever.
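For what the joke is worth, the encoding does work mechanically in base R (is_tagged_na and na_type are invented names for this sketch):

```r
# Joke encoding made concrete: real part holds the value,
# imaginary part holds the "NA type" (0 = not missing).
vals <- c(5.3 + 0i, 0 + 3i, 2.1 + 0i)

is_tagged_na <- function(z) Im(z) != 0  # any nonzero imaginary part = missing
na_type      <- function(z) Im(z)       # recover which kind of missing

mean(Re(vals[!is_tagged_na(vals)]))  # ordinary calculation on the real parts
na_type(vals[is_tagged_na(vals)])    # the NA type, here 3
```

Of course, every function in the pipeline would have to know the trick, which is the same problem as before in a new disguise.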
What R really is missing is quaternions and octonions, the only two further
variations on complex numbers of this kind, which are sort of complex numbers
on steroids with either three or seven independent imaginary units, so they
allow storage along additional axes in other dimensions.
Yes, I am sure someone wrote a package for that! LOL!
Ah, here is one: https://cran.r-project.org/web/packages/onion/onion.pdf
I will end by saying my experience is that enticing people to do something new
is just a start. After they start, you often get lots of complaints and requests
for help and even requests to help them move back! Unless you make some popular
package everyone runs to, NOBODY else will be able to help them on some things.
The reality is that some of the more common tasks these people do are often
already optimized for them and do not require them to know more. I have had to
use these systems and for some common tasks they are easy. Dialog boxes can pop
up and let you check off various options, and off you go. No need to learn lots
of programming details like the names of the functions that do a Tukey test and
what arguments they need and what errors might have to be handled and so on.
I know SPSS often produces LOTS of output, including many things you do not want,
and then lets you remove parts you don't need or even know the meaning of. Sure,
R can have similar functionality, but often you are expected to sort of stitch
various parts together as well as ADD your own bits. I love that and value being
able to be creative. In my experience, most normal people just want to get the
job done, be fairly certain others accept the results, and then do other
activities they are better suited for, or at least think they are.
There are intermediate approaches I have used where I let them do various kinds
of processing in SPSS and save the result in some format I can read into R for
additional processing. The latter may not be stuff that requires keeping track
of multiple NA equivalents. Of course if you want to save the results and move
them back, that is a challenge. Hybrid approaches may tempt them to try
something and maybe later do more and more and move over.
From: Adrian Dușa <dusa.adrian at unibuc.ro>
Sent: Tuesday, May 25, 2021 2:17 AM
To: Avi Gross <avigross at verizon.net>
Cc: r-devel <r-devel at r-project.org>
Subject: Re: [Rd] [External] Re: 1954 from NA
Dear Avi,
Thank you so much for the extended messages, I read them carefully.
While partially offering a solution (I've already been there), it creates
additional work for the user, and some of that is unnecessary.
What I am trying to achieve is best described in this draft vignette:
devtools::install_github("dusadrian/mixed")
vignette("mixed")
Once a value is declared to be missing, the user should not do anything else
about it. Despite being present, the value should automatically be treated as
missing by the software. That is the way it's done in all major statistical
packages like SAS, Stata and even SPSS.
My end goal is to make R attractive for my faculty peers (and beyond), almost
all of whom are massively using SPSS and sometimes Stata. But in order to
convince them to (finally) make the switch, I need to provide similar
functionality, not additional work.
Re. your first part of the message, I am definitely not trying to change the R
internals. The NA will still be NA, exactly as currently defined.
My initial proposal was based on the observation that the 1954 payload was
stored as an unsigned int (thus occupying 32 bits) when it is obvious it
doesn't need more than 16. That was the only proposed modification, and
everything else stays the same.
I now learned, thanks to all contributors in this list, that building something
around that payload is risky because we do not know exactly what the compilers
will do. One possible solution that I can think of, while (still) maintaining
the current functionality around the NA, is to use a different high word for the
NA that would not trigger compilation issues. But I have absolutely no idea what
that implies for the other inner workings of R.
I very much trust the R core will eventually find a robust solution; they've
solved much more complicated problems than this. I just hope the current thread
will put the idea of tagged NAs on the table for when they discuss this.
Once that is solved, and despite the current advice discouraging this route, I
believe tagging NAs is a valuable idea that should not be discarded.
After all, the NA is nothing but a tagged NaN.
All the best,
Adrian
On Tue, May 25, 2021 at 7:05 AM Avi Gross via R-devel
<r-devel at r-project.org> wrote:
I was thinking about how one does things in a language that is properly
object-oriented versus R that makes various half-assed attempts at being such.
Clearly in some such languages you can make an object that is a wrapper that
allows you to save an item that is the main payload as well as anything else you
want. You might need a way to convince everything else to allow you to make
things like lists and vectors and other collections of the objects and perhaps
automatically unbox them for many purposes. As an example in a language like
Python, you might provide methods so that adding A and B actually gets the value
out of A and/or B and adds them properly. But there may be too many edge cases
to handle and some software may not pay attention to what you want including
some libraries written in other languages.
I mention Python for the odd reason that it is now possible to combine Python
and R in the same program and sort of switch back and forth between data
representations. This may provide some openings for preserving and accessing
metadata when needed.
Realistically, if R was being designed from scratch TODAY, many things might be
done differently. But I recall it being developed at Bell Labs for purposes
where it was sort of revolutionary at the time (back when it was S) and designed
to do things in a vectorized way and probably primarily for the kinds of
scientific and mathematical operations where a single NA (of several types
depending on the data) was enough when augmented by a few things like a NaN and
Inf and -Inf. I doubt they seriously saw a need for an unlimited number of NAs
that were all the same AND also all different, which they felt had to be built in.
As noted, had they had a reason to make it fully object-oriented too and made
the base types such as integer into full-fledged objects with room for
additional metadata, then things may be different. I note I have seen languages
which have both a data type called integer as lower case and Integer as upper
case. One of them is regularly boxed and unboxed automagically when used in a
context that needs the other. As far as efficiency goes, this invisibly adds
many steps. So do languages that sometimes take a variable that is a pointer and
invisibly reference it to provide the underlying field rather than make you do
extra typing and so on.
So is there any reason only an NA should have such meta-data? Why not have
reasons associated with Inf stating it was an Inf because you asked for one or
the result of a calculation such as dividing by Zero (albeit maybe that might be
a NaN) and so on. Maybe I could annotate integers with whether they are prime or
even versus odd or a factor of 144 or anything else I can imagine. But at some
point, the overhead from allowing all this can become substantial. I was amused
at how Python allows a function to be annotated, including by itself, since it
is an object. So it can store such metadata, perhaps in an attached dictionary,
so that a complex, costly calculation can have its results cached; when you ask
for the same thing in the same session, it checks if it has done it and just
returns the stored result. But after a while, how many cached results can there be?
-----Original Message-----
From: R-devel <r-devel-bounces at r-project.org> On Behalf Of luke-tierney at uiowa.edu
Sent: Monday, May 24, 2021 9:15 AM
To: Adrian Dușa <dusa.adrian at unibuc.ro>
Cc: Greg Minshall <minshall at umich.edu>; r-devel <r-devel at r-project.org>
Subject: Re: [Rd] [External] Re: 1954 from NA
On Mon, 24 May 2021, Adrian Dușa wrote:
> On Mon, May 24, 2021 at 2:11 PM Greg Minshall <minshall at umich.edu> wrote:
>
>> [...]
>> if you have 500 columns of possibly-NA'd variables, you could have
>> one column of 500 "bits", where each bit has one of N values, N being
>> the number of explanations the corresponding column has for why the
>> NA exists.
>>
PLEASE DO NOT DO THIS!
It will not work reliably, as has been explained to you ad nauseam in this
thread.
If you distribute code that does this it will only lead to bug reports on R that
will waste R-core time.
As Alex explained, you can use attributes for this. If you need operations to
preserve attributes across subsetting you can define subsetting methods that do
that.
If you are dead set on doing something in C you can try to develop an ALTREP
class that provides augmented missing value information.
Best,
luke
>
> The mere thought of implementing something like that gives me shivers.
> Not to mention such a solution should also be robust when subsetting,
> splitting, column and row binding, etc. and everything can be lost if
> the user deletes that particular column without realising its importance.
>
> Social science datasets are much more alive and complex than one might
> first think: there are multi-wave studies with tens of countries, and
> aggregating such data is already a complex process, without adding even more
> complexity on top of that.
>
> As undocumented as they may be, or even subject to change, I think the
> R internals are much more reliable than this.
>
> Best wishes,
> Adrian
>
>
--
Luke Tierney
Ralph E. Wareham Professor of Mathematical Sciences
University of Iowa                  Phone:  319-335-3386
Department of Statistics and        Fax:    319-335-3017
   Actuarial Science
241 Schaeffer Hall                  email:  luke-tierney at uiowa.edu
Iowa City, IA 52242                 WWW:    http://www.stat.uiowa.edu
______________________________________________
R-devel at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
--
Adrian Dusa
University of Bucharest
Romanian Social Data Archive
Soseaua Panduri nr. 90-92
050663 Bucharest sector 5
Romania
https://adriandusa.eu