Adrian,
This is an aside. I note that many machine-learning algorithms actually do
something along the lines being discussed. They may take an item like a
paragraph of words or an email message and add thousands of columns, each
one a Boolean specifying whether a particular word is or is not in that item.
They may then run an analysis that tries to heuristically match known SPAM
items so as to be able to predict whether new items might be SPAM. Some may
even have a column
for words taken two or more at a time, such as "must" followed by "have", or
"Your", "last", "chance", resulting in even more columns. The software that
does the analysis
can work on remarkably large collections of this kind, in some cases taking
multiple approaches to the same problem and choosing among them in some way.
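To make that concrete, here is a minimal sketch of that word-presence
encoding in R; the messages and vocabulary are made-up illustrative data:

    # Minimal sketch of the word-presence encoding described above;
    # 'msgs' and 'vocab' are made-up illustrative data.
    msgs  <- c("must have your last chance", "meeting moved to noon")
    vocab <- c("must", "have", "chance", "meeting")

    # One Boolean column per word: TRUE if that word occurs in the message.
    features <- as.data.frame(
      sapply(vocab, function(w) grepl(paste0("\\b", w, "\\b"), msgs))
    )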
In your case, yes, adding lots of columns seems like added work. But in data
science, often the easiest way to do some complex things is to loop over
selected existing columns and create multiple sets of additional columns that
simplify later calculations by just using these values rather than some
multi-line complex condition. As an example, I have run statistical analyses
where I kept a Boolean column recording whether the analysis failed (as in, I
caught the failure with try() or it would have killed my process), another
recording whether it was reported not to converge properly, and yet another
recording whether it failed some post-tests. That simplified queries that
excluded rows where any one of those flags was TRUE.
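In tidyverse terms, that kind of exclusion can be a one-liner; the data frame
and flag names below are assumptions for illustration, not my actual analysis:

    library(dplyr)

    # Keep only rows where none of the (hypothetical) failure flags is TRUE.
    clean <- results %>%
      filter(!if_any(c(failed, no_converge, failed_posttests)))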
I also stored columns for metrics like RMSEA and chi-squared values, sometimes
dozens. And for each of the above, I actually had a set of columns for various
models such as linear versus quadratic and more. Worse, as the analysis
continued, more derived columns were added as various measures of the above
results were compared to each other, so the different models could be
compared, as in how often each was better. Careful choices of naming
conventions and nice features of the tidyverse made it fairly simple to
operate on many columns in the same way, such as all columns whose names
start or end with a given string.
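For instance, something like this (with a made-up metric prefix):

    library(dplyr)

    # Apply one function to every column whose name starts with "rmsea"
    # (an illustrative prefix, not from the real analysis).
    results <- results %>%
      mutate(across(starts_with("rmsea"), ~ round(.x, 3)))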
And, yes, for some efficiency, I often made a narrower version of the above with
just the fields I needed and was careful not to remove what I might need later.
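Something along these lines, with illustrative column names:

    # Narrower working copy holding only the fields needed right now.
    narrow <- dplyr::select(results, id, model, starts_with("rmsea"))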
So it can be done and fairly trivially if you know what you are doing. If the
names of all your original columns that behave this way look like *.orig and
others look different, you can ask for a function to be applied to just
those, producing another set with the same prefixes but named *.converted,
and yet another called *.annotation, and so on. You may want to remove the
originals to save space, but you get the idea. The fact that there are six
hundred columns means little with such a design, as the above can be done to
all of them at once in probably a dozen lines of code.
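A sketch of that pattern, assuming a recent dplyr; convert_fun and
annotate_fun stand in for whatever per-column functions you need, and the
data frame name is likewise made up:

    library(dplyr)

    # For every column named *.orig, create a *.converted and a *.annotation
    # sibling; convert_fun and annotate_fun are hypothetical functions.
    df <- df %>%
      mutate(
        across(ends_with(".orig"), convert_fun,
               .names = "{sub('.orig', '.converted', .col, fixed = TRUE)}"),
        across(ends_with(".orig"), annotate_fun,
               .names = "{sub('.orig', '.annotation', .col, fixed = TRUE)}")
      )

    # Optionally drop the originals to save space:
    # df <- select(df, -ends_with(".orig"))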
For me, the above is way less complex than what you want to do and can have
benefits. For example, if you make a graph of points from my larger
tibble/data.frame using ggplot(), you can do things like specify what color to
use for a point using a variable that contains the reason the data was missing
(albeit that assumes the missing part is not what is being graphed) or add text
giving the reason just above each such point. Your method of faking multiple
things that YOU claim are an NA may not make the above example doable.
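A rough sketch of that ggplot() idea; the data frame and its columns (x, y,
miss_reason) are hypothetical:

    library(ggplot2)

    # Colour each point by the (hypothetical) reason-for-missingness column
    # and print that reason just above the point.
    ggplot(dat, aes(x = x, y = y, colour = miss_reason)) +
      geom_point() +
      geom_text(aes(label = miss_reason), nudge_y = 0.1, size = 3)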
From: Adrian Dușa <dusa.adrian at unibuc.ro>
Sent: Monday, May 24, 2021 8:18 AM
To: Greg Minshall <minshall at umich.edu>
Cc: Avi Gross <avigross at verizon.net>; r-devel <r-devel at r-project.org>
Subject: Re: [Rd] 1954 from NA
On Mon, May 24, 2021 at 2:11 PM Greg Minshall <minshall at umich.edu> wrote:
[...]
if you have 500 columns of possibly-NA'd variables, you could have one
column of 500 "bits", where each bit has one of N values, N being the
number of explanations the corresponding column has for why the NA
exists.
The mere thought of implementing something like that gives me shivers. Not to
mention that such a solution should also be robust under subsetting,
splitting, column and row binding, etc., and everything can be lost if the
user deletes that particular column without realising its importance.
Social science datasets are much more alive and complex than one might first
think: there are multi-wave studies with tens of countries, and aggregating such
data is already a complex process, without adding even more complexity on top
of it. As undocumented as they may be, or even subject to change, I think the
R internals are much more reliable than this.
Best wishes,
Adrian
--
Adrian Dusa
University of Bucharest
Romanian Social Data Archive
Soseaua Panduri nr. 90-92
050663 Bucharest sector 5
Romania
https://adriandusa.eu