thr3ads.net - R devel - [Rd] [External] Re: 1954 from NA [May 2021]

Duncan Murdoch

2021-May-25 23:27 UTC

[Rd] [External] Re: 1954 from NA

You've already been told how to solve this:  just add attributes to the 
objects. Use the standard NA to indicate that there is some kind of 
missingness, and the attribute to describe exactly what it is.  Stick a 
class on those objects and define methods so that subsetting and 
arithmetic preserves the extra info you've added. If you do some 
operation that turns those NAs into NaNs, big deal:  the attribute will 
still be there, and is.na(NaN) still returns TRUE.
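
A minimal sketch of that suggestion, assuming a hypothetical S3 class called tagged_na_vec with a per-element na_reason attribute (neither name comes from base R or from any package mentioned in this thread). It only covers subsetting with `[` and the arithmetic and comparison operators, and it assumes the other operand is a scalar or has the same length:

```r
# Hypothetical constructor: values plus one "reason" string per element.
tagged_na_vec <- function(x, reason = rep(NA_character_, length(x))) {
  stopifnot(length(reason) == length(x))
  structure(x, na_reason = reason, class = "tagged_na_vec")
}

# Hypothetical accessor for the extra information.
na_reason <- function(x) attr(x, "na_reason")

# Subsetting keeps the reasons aligned with the values they describe.
`[.tagged_na_vec` <- function(x, i) {
  tagged_na_vec(unclass(x)[i], na_reason(x)[i])
}

# Group generic for +, -, *, /, ==, < and friends: compute on the bare
# values, then reattach the reasons of whichever operand carried them.
Ops.tagged_na_vec <- function(e1, e2) {
  strip <- function(v) { attributes(v) <- NULL; v }
  tagged <- if (inherits(e1, "tagged_na_vec")) e1 else e2
  value <- if (missing(e2)) {
    get(.Generic)(strip(e1))
  } else {
    get(.Generic)(strip(e1), strip(e2))
  }
  tagged_na_vec(value, na_reason(tagged))
}

x <- tagged_na_vec(c(1, NA, 3, NA),
                   c(NA, "refused", NA, "not applicable"))
y <- (x * 2)[c(2, 4)]
is.na(y)      # both elements are still ordinary NAs
na_reason(y)  # and the reasons travelled with them
```

Functions without a method here (c(), mean() and so on) will drop or ignore the attribute, so a real package would need to cover more generics than this.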

Base R doesn't need anything else.

You complained that users shouldn't need to know about attributes, and 
they won't:  you, as the author of the package that does this, will 
handle all those details.  Working in your subject area you know all the 
different kinds of NAs that people care about, and how they code them in 
input data, so you can make it all totally transparent.  If you do it 
well, someone in some other subject area with a completely different set 
of kinds of missingness will be able to adapt your code to their use.

I imagine this has all been done in one of the thousands of packages on 
CRAN, but if it hasn't been done well enough for you, do it better.

Duncan Murdoch

On 25/05/2021 7:01 p.m., Adrian Dușa wrote:
> Dear Avi,
> 
> That was quite a lengthy email...
> What you write makes sense of course. I try hard not to deviate from the
> base R, and thought my solution does just that but apparently no such luck.
> 
> I suspect, however, that something will have to eventually change: since
> one of the R building blocks (such as an NA) is questioned by compilers, it
> is serious enough to attract attention from the R core and maintainers.
> And if that happens, my fingers are crossed the solution would allow users
> to declare existing values as missing.
> 
> The importance of that, for the social sciences, cannot be stressed enough.
> 
> Best wishes, thanks once again to everyone,
> Adrian
> 
> On Tue, May 25, 2021 at 10:03 PM Avi Gross via R-devel <
> r-devel at r-project.org> wrote:
> 
>> That helps get more understanding of what you want to do, Adrian.
>> Getting anyone to switch is always a challenge but changing R enough to
>> tempt them may be a bigger challenge. This is an old story. I was the
>> first adopter for C++ in my area and at first had to have my code be
>> built with an all C project making me reinvent some wheels so the same
>> "make" system knew how to build the two compatibly and link them. Of
>> course, they all eventually had to join me in a later release but I had
>> moved forward by then.
>>
>> I have changed (or more accurately added) lots of languages in my life
>> and continue to do so. The biggest challenge is not to just adapt and
>> use it similarly to the previous ones already mastered but to understand
>> WHY someone designed the language this way and what kind of idioms are
>> common and useful even if that means a new way of thinking. But, of
>> course, any "older" language has evolved and often drifted in multiple
>> directions. Many now borrow heavily from others even when the philosophy
>> is different and often the results are not pretty. Making major changes
>> in R might have serious impacts on existing programs including just by
>> making them fail as they run out of memory.
>>
>> If you look at R, there is plenty you can do in base R, sometimes by
>> standing on your head. Yet you see package after package coming along
>> that offers not just new things but sometimes a reworking and even
>> remodeling of old things. R has a base graphics system I now rarely use
>> and another called lattice I have no reason to use again because I can
>> do so much quite easily in ggplot. Similarly, the evolving tidyverse
>> group of packages approaches things from an interesting direction to the
>> point where many people mainly use it and not base R. So if they were to
>> teach a class in how to gather your data and analyze it and draw pretty
>> pictures, the students might walk away thinking they had learned R but
>> actually have learned these packages.
>>
>> Your scenario seems related to a common scenario of how we can have
>> values that signal beyond some range in an out-of-band manner. Years ago
>> we had functions in languages like C that would return a -1 on failure
>> when only non-negative results were otherwise possible. That can work
>> fine but fails in cases when any possible value in the range can be
>> returned. We have languages that deal with this kind of thing using
>> error handling constructs like exceptions. Sometimes you bundle up
>> multiple items into a structure and return that with one element of the
>> structure holding some kind of return status and another holding the
>> payload. A variation on this theme, as in languages like Go, is to have
>> functions that return multiple values with one of them containing nil on
>> success and an error structure on failure.
>>
>> The situation we have here that seems to be of concern to you is that
>> you would like each item in a structure to have attributes that are
>> recognized and propagated as it is being processed. Older languages
>> tended not to even have such a concept, so basic types simply existed
>> and two instances of the number 5 might even be the same underlying one,
>> or two strings with the same contents, and so on. You could of course
>> play the game of making a struct, as mentioned above, but then you
>> needed your own code to do all the handling as nothing else knew it
>> contained multiple items and which ones had which purpose.
>>
>> R did add generalized attributes and some are fairly well integrated or
>> at least partially. "Names" were discussed as not being easy to keep
>> around. Factors used their own tagging method that seems to work fairly
>> well but probably not everywhere. But what you want may be more general
>> and not built on similar foundations.
>>
>> I look at languages like Python that are arguably more object-oriented
>> now than R is and in some ways can be extended better, albeit not in
>> others. If I wanted to create an object to hold the number 5 and I add
>> methods to the object that allow it to participate in various ways with
>> other objects, sometimes using the visible value but also sometimes
>> using the hidden payload, then I might pair it with the string "five"
>> but also with dozens of other strings for the word representing 5 in
>> many languages. So I might have it act like a number in numerical
>> situations and like text when someone is using it in writing a novel in
>> any of many languages.
>>
>> You seem to want to have the original text visible that gives a reason
>> something is missing (or something like that) but have the software
>> TREAT it like it is missing in calculations. In effect, you want is.na()
>> to be a bit more like is.numeric() or is.character() and care more about
>> the TYPE of what is being stored. An item may contain a 999 and yet not
>> be seen as a number but as an NA. The problem I see is that you also may
>> want the item to be a string like "DELETED" and yet include it in the
>> vector that R insists can only hold integers. R does have a built-in
>> data structure called a list that indeed allows that. You can easily
>> store data as a list of lists rather than a list of vectors and many
>> other structures. Some of those structures might handle your needs BUT
>> may only work properly if you build your own packages as with the
>> tidyverse and break as soon as any other functions encounter them!
>>
>> But then you would arguably no longer be in R but in your own universe
>> based on R.
>>
>> I have written much code that does things a bit sideways. For example, I
>> might have a treelike structure in which you do some form of search till
>> you encounter a leaf node and return that value to be used in a
>> calculation. To perform a calculation using multiple trees such as
>> taking an average, you always use find_value(tree) and never hand over
>> the tree itself. As I think I pointed out earlier, you can do things
>> like that in many places and hand over a variation of your data. In the
>> ggplot example, you might have:
>>
>> ggplot(data=mydata, aes(x=abs(col1), y=convert_string_to_numeric(col2))) ...
>>
>> ggplot would not use the original data in plotting but the view it is
>> asked to use. The function I made up above would know what values are
>> some form of NA and convert all others like "12.3" to numeric form. BUT
>> it would not act as simply or smoothly as when your data is already in
>> the format everyone else uses.
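
For illustration only, one way the made-up convert_string_to_numeric() could look in base R. The na_codes argument and its default values are assumptions; the message does not say which strings would count as missing:

```r
# Hypothetical implementation of the made-up helper from the message above.
# na_codes lists the strings to treat as missing (assumed, not from the post).
convert_string_to_numeric <- function(x, na_codes = c("DELETED", "LATE", "")) {
  x <- trimws(as.character(x))
  x[x %in% na_codes] <- NA_character_
  # Any other non-numeric string also ends up as NA; the coercion warning
  # is suppressed because that outcome is intended here.
  suppressWarnings(as.numeric(x))
}

col2 <- c("12.3", "DELETED", "7", "LATE")
convert_string_to_numeric(col2)  # numeric vector with NA in positions 2 and 4
```

The original strings stay untouched in the data; only the view handed to the plotting or modelling function is numeric.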
>>
>> So how does R know what something is? Presumably there is some overhead
>> associated with a vector or some table that records the type. A list
>> presumably depends on each internal item to have such a type. So maybe
>> what you want is for each item in a vector to have a type where one type
>> is some form of NA. But as noted, R does often not give a damn about an
>> NA and happily uses it to create more nonsense. The mean of a bunch of
>> numbers that includes one or more copies of things like NA (or NaN or
>> Inf) can pollute them all. Generally R is not designed to give a darn.
>> When people complain, they may be told to have mean() use na.rm=TRUE, or
>> to remove them some way before asking for a mean, or perhaps to reset
>> them to something like zero.
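
The behaviour described above can be seen with a few lines of base R (the vector x is made up for the example):

```r
x <- c(2, 4, NA, 6)

mean(x)                 # NA: a single missing value contaminates the result
mean(x, na.rm = TRUE)   # 4: computed from the three observed values only

# Resetting the missing value to zero "works" but changes the answer.
x0 <- replace(x, is.na(x), 0)
mean(x0)                # 3

mean(c(1, NaN, 3))      # NaN propagates in the same way
```

Note that na.rm = TRUE and replacing by zero give different results, which is why the choice has to be made by someone who knows what the missing values mean.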
>>
>> So if you want to leave your variables in place with assorted meanings
>> but a tag saying they are to be treated as NA, much in R might have to
>> change. Your suggested approach though is not yet clear but might mean
>> doing something analogous to using extra bits and hoping nobody will
>> notice.
>>
>> So, the solution is both blindingly obvious and even more blindingly
>> stupid. Use complex numbers! All normal content shall be stored as
>> numbers like 5.3+0i and any variant on NA shall be stored as something
>> like 0+3i where 3 means an NA of type 3.
>>
>> OK, humor aside, since the social sciences do not tend to even know what
>> complex numbers are, this should provide another dimension to hide lots
>> of meaningless info. Heck, you could convert a message like "LATE" into
>> some numeric form. Assuming an English centered world (which I do not!)
>> you could store it with L replaced by 12 and A by 01 and so on so the
>> imaginary component might look like 0+12012005i and be easily decoded
>> back into LATE when needed. Again, not a serious proposal. The storage
>> probably would be twice the size of a numeric albeit you can extract the
>> real part when needed for normal calculations and the imaginary part
>> when you want to know about NA type or whatever.
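
Purely to make the tongue-in-cheek scheme concrete, here is a rough sketch in base R. All helper names (is_coded_na, na_type, values, encode_word, decode_word) are invented for this example, and the sketch assumes a nonzero imaginary part always means "missing":

```r
# Observed values sit on the real axis, NA codes on the imaginary axis.
x <- c(5.3 + 0i, 0 + 3i, 7.1 + 0i, 0 + 1i)

is_coded_na <- function(z) Im(z) != 0
na_type     <- function(z) ifelse(is_coded_na(z), Im(z), NA_real_)
values      <- function(z) ifelse(is_coded_na(z), NA_real_, Re(z))

mean(values(x), na.rm = TRUE)  # mean of the two observed values
na_type(x)                     # which kind of NA each element is, if any

# Two digits per letter: L = 12, A = 01, T = 20, E = 05.
# Doubles hold integers exactly only up to 2^53, so this is limited to
# words of about seven letters.
encode_word <- function(w) {
  idx <- match(strsplit(toupper(w), "")[[1]], LETTERS)
  as.numeric(paste(sprintf("%02d", idx), collapse = ""))
}

decode_word <- function(n) {
  digits <- sprintf("%.0f", n)
  if (nchar(digits) %% 2 == 1) digits <- paste0("0", digits)
  starts <- seq(1, nchar(digits), by = 2)
  paste(LETTERS[as.integer(substring(digits, starts, starts + 1))],
        collapse = "")
}

late <- complex(real = 0, imaginary = encode_word("LATE"))
decode_word(Im(late))
```

As the message says, this doubles the storage of every value, and any function that does not know the convention will happily do arithmetic on the imaginary codes.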
>>
>> What R really is missing is quaternions and octonions which are the only
>> two other variations on complex numbers that are possible and are sort
>> of complex numbers on steroids with either three or seven distinct
>> square roots of minus-one so they allow storage along additional axes in
>> other dimensions.
>>
>> Yes, I am sure someone wrote a package for that! LOL!
>>
>> Ah, here is one:
>> https://cran.r-project.org/web/packages/onion/onion.pdf
>>
>> I will end by saying my experience is that enticing people to do
>> something new is just a start. After they start, you often get lots of
>> complaints and requests for help and even requests to help them move
>> back! Unless you make some popular package everyone runs to, NOBODY else
>> will be able to help them on some things. The reality is that some of
>> the more common tasks these people do are sometimes already optimized
>> for them and often do not make them know more. I have had to use these
>> systems and for some common tasks they are easy. Dialog boxes can pop up
>> and let you check off various options and off you go. No need to learn
>> lots of programming details like the names of various functions that do
>> a Tukey test and what arguments they need and what errors might have to
>> be handled and so on. I know SPSS often produces LOTS of output
>> including many things you do not want and then lets you remove parts you
>> don't need or even know what they mean. Sure, R can have similar
>> functionality but often you are expected to sort of stitch various parts
>> together as well as ADD your own bits. I love that and value being able
>> to be creative. In my experience, most normal people just want to get
>> the job done and be fairly certain others accept the results and then do
>> other activities they are better suited for, or at least think they are.
>>
>> There are intermediates I have used where I let them do various kinds of
>> processing on SPSS and save the result in some format I can read into R
>> for additional processing. The latter may not be stuff that requires
>> keeping track of multiple NA equivalents. Of course if you want to save
>> the results and move them back, that is a challenge. Hybrid approaches
>> may tempt them to try something and maybe later do more and more and
>> move over.
>>
> 
> 	[[alternative HTML version deleted]]
> 
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>

Adrian Dușa

2021-May-26 14:22 UTC

[Rd] [External] Re: 1954 from NA

Dear Duncan,

On Wed, May 26, 2021 at 2:27 AM Duncan Murdoch <murdoch.duncan at gmail.com> wrote:
> You've already been told how to solve this:  just add attributes to the
> objects. Use the standard NA to indicate that there is some kind of
> missingness, and the attribute to describe exactly what it is.  Stick a
> class on those objects and define methods so that subsetting and
> arithmetic preserves the extra info you've added. If you do some
> operation that turns those NAs into NaNs, big deal:  the attribute will
> still be there, and is.na(NaN) still returns TRUE.
>
I've already tried the attributes way, it is not so easy.
In the best case scenario, it unnecessarily triples the size of the data,
but perhaps this is the only way forward.


> Base R doesn't need anything else.
>
> You complained that users shouldn't need to know about attributes, and
> they won't:  you, as the author of the package that does this, will
> handle all those details.  Working in your subject area you know all the
> different kinds of NAs that people care about, and how they code them in
> input data, so you can make it all totally transparent.  If you do it
> well, someone in some other subject area with a completely different set
> of kinds of missingness will be able to adapt your code to their use.
>
But that is the whole point: the package author does not define possible
NAs (the possibilities are infinite), users do that.
The package should only provide a simple method to achieve that.
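
As a rough sketch of what such a user-facing method might look like, assuming a hypothetical function declare_missing() that takes the user's own named codes (the function, its arguments and the attribute names are all invented here). To limit the overhead, it records the original code and label only for the positions that were declared missing:

```r
# Hypothetical: the user, not the package, supplies the codes and labels.
declare_missing <- function(x, codes) {
  hit <- match(x, codes)        # position of each value in the user's codes
  declared <- !is.na(hit)
  structure(
    replace(x, declared, NA),
    na_index = which(declared),               # where the declared NAs are
    na_code  = unname(codes[hit[declared]]),  # the original values
    na_label = names(codes)[hit[declared]]    # the user's descriptions
  )
}

answers <- c(3, -91, 5, -92)
x <- declare_missing(answers, c(refused = -91, "not applicable" = -92))

mean(x, na.rm = TRUE)   # computed from the observed values 3 and 5
attr(x, "na_label")     # why positions 2 and 4 are missing
```

Because the extra information is indexed by position, subsetting or reordering x with base functions would leave it out of sync; keeping it aligned needs class methods along the lines Duncan describes.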


> I imagine this has all been done in one of the thousands of packages on
> CRAN, but if it hasn't been done well enough for you, do it better.
>
If it were, I would have found it by now...

Best wishes,
Adrian

	[[alternative HTML version deleted]]
