Adrian,
This is an aside. I note that many machine-learning algorithms actually do
something along the lines being discussed. They may take an item like a
paragraph of words or an email message and add thousands of columns, each
one a Boolean specifying whether a particular word is or is not in that item.
They may then run an analysis that tries to heuristically match known SPAM
items so as to be able to predict whether new items might be SPAM. Some may
even have a column
for words taken two or more at a time, such as "must" followed by "have", or
"Your", "last", "chance", resulting in even more columns. The software that
does the analysis
can work on remarkably large collections of this kind, in some cases taking
multiple approaches to the same problem and choosing among them in some way.
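To make that concrete, here is a minimal sketch of that word-presence
encoding in R; the messages and vocabulary are made-up illustrative data:

    # Minimal sketch of the word-presence encoding described above;
    # 'msgs' and 'vocab' are made-up illustrative data.
    msgs  <- c("must have your last chance", "meeting moved to noon")
    vocab <- c("must", "have", "chance", "meeting")

    # One Boolean column per word: TRUE if that word occurs in the message.
    features <- as.data.frame(
      sapply(vocab, function(w) grepl(paste0("\\b", w, "\\b"), msgs))
    )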
In your case, yes, adding lots of columns seems like added work. But in data
science, often the easiest way to do some complex things is to loop over
selected existing columns and create multiple sets of additional columns that
simplify later calculations by just using these values rather than some
multi-line complex condition. As an example, I have run statistical analyses
where I kept a Boolean column recording whether the analysis failed (as in, I
caught the failure with try() or it would have killed my process), another
recording whether it was reported not to converge properly, and yet another
recording whether it failed some post-tests. That simplified queries that
excluded rows where any one of those flags was TRUE.
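In tidyverse terms, that kind of exclusion can be a one-liner; the data frame
and flag names below are assumptions for illustration, not my actual analysis:

    library(dplyr)

    # Keep only rows where none of the (hypothetical) failure flags is TRUE.
    clean <- results %>%
      filter(!if_any(c(failed, no_converge, failed_posttests)))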
I also stored columns for metrics like RMSEA and chi-squared values, sometimes
dozens. And for each of the above, I actually had a set of columns for various
models such as linear versus quadratic and more. Worse, as the analysis
continued, more derived columns were added as various measures of the above
results were compared to each other, so the different models could be
compared, as in how often each was better. Careful choices of naming
conventions and nice features of the tidyverse made it fairly simple to
operate on many columns in the same way, such as all columns whose names
start or end with a given string.
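For instance, something like this (with a made-up metric prefix):

    library(dplyr)

    # Apply one function to every column whose name starts with "rmsea"
    # (an illustrative prefix, not from the real analysis).
    results <- results %>%
      mutate(across(starts_with("rmsea"), ~ round(.x, 3)))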
And, yes, for some efficiency, I often made a narrower version of the above with
just the fields I needed and was careful not to remove what I might need later.
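Something along these lines, with illustrative column names:

    # Narrower working copy holding only the fields needed right now.
    narrow <- dplyr::select(results, id, model, starts_with("rmsea"))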
So it can be done and fairly trivially if you know what you are doing. If the
names of all your original columns that behave this way look like *.orig and
others look different, you can ask for a function to be applied to just
those, producing another set with the same prefixes but named *.converted,
and yet another called *.annotation, and so on. You may want to remove the
originals to save space, but you get the idea. The fact that there are six
hundred columns means little with such a design, as the above can be done to
all of them at once in probably a dozen lines of code.
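A sketch of that pattern, assuming a recent dplyr; convert_fun and
annotate_fun stand in for whatever per-column functions you need, and the
data frame name is likewise made up:

    library(dplyr)

    # For every column named *.orig, create a *.converted and a *.annotation
    # sibling; convert_fun and annotate_fun are hypothetical functions.
    df <- df %>%
      mutate(
        across(ends_with(".orig"), convert_fun,
               .names = "{sub('.orig', '.converted', .col, fixed = TRUE)}"),
        across(ends_with(".orig"), annotate_fun,
               .names = "{sub('.orig', '.annotation', .col, fixed = TRUE)}")
      )

    # Optionally drop the originals to save space:
    # df <- select(df, -ends_with(".orig"))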
For me, the above is way less complex than what you want to do and can have
benefits. For example, if you make a graph of points from my larger
tibble/data.frame using ggplot(), you can do things like specify what color to
use for a point using a variable that contains the reason the data was missing
(albeit that assumes the missing part is not what is being graphed) or add text
giving the reason just above each such point. Your method of faking multiple
things that YOU claim are an NA may not make the above example doable.
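A rough sketch of that ggplot() idea; the data frame and its columns (x, y,
miss_reason) are hypothetical:

    library(ggplot2)

    # Colour each point by the (hypothetical) reason-for-missingness column
    # and print that reason just above the point.
    ggplot(dat, aes(x = x, y = y, colour = miss_reason)) +
      geom_point() +
      geom_text(aes(label = miss_reason), nudge_y = 0.1, size = 3)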
From: Adrian Dușa <dusa.adrian at unibuc.ro>
Sent: Monday, May 24, 2021 8:18 AM
To: Greg Minshall <minshall at umich.edu>
Cc: Avi Gross <avigross at verizon.net>; r-devel <r-devel at r-project.org>
Subject: Re: [Rd] 1954 from NA
On Mon, May 24, 2021 at 2:11 PM Greg Minshall <minshall at umich.edu> wrote:
[...]
if you have 500 columns of possibly-NA'd variables, you could have one
column of 500 "bits", where each bit has one of N values, N being the
number of explanations the corresponding column has for why the NA
exists.
The mere thought of implementing something like that gives me shivers. Not to
mention that such a solution should also be robust under subsetting,
splitting, column and row binding, etc., and everything can be lost if the
user deletes that particular column without realising its importance.
Social science datasets are much more alive and complex than one might first
think: there are multi-wave studies with tens of countries, and aggregating such
data is already a complex process, without adding even more complexity on top
of it. As undocumented as they may be, or even subject to change, I think the
R internals are much more reliable than this.
Best wishes,
Adrian
--
Adrian Dusa
University of Bucharest
Romanian Social Data Archive
Soseaua Panduri nr. 90-92
050663 Bucharest sector 5
Romania
https://adriandusa.eu