Adrian,
Agreed. Doubling hundreds of columns of data just to annotate missing values is
indeed a pain. There are straightforward alternatives, especially if you use
tidyverse packages rather than base R. Fair warning: this message is a tad long,
so anyone not interested may want to skip it.
But a caution first about trying to change a feature nobody wanted changed until
you came along. R has all kinds of dependencies on the existing ways of looking
at an NA value, such as asking is.na(SOMETHING), the many functions like mean()
that handle mean(SOMETHING, na.rm=TRUE), the way ggplot graphs skip items that
are NA, and so on. Any solution you come up with to enlarge the kinds of NA may
break some of that, and then you will have no right to complain.
What does your data look like? For example, if all the data in a column is
small integers, say under a thousand, you can pick some number like 10,000 and
store NA categories as 10,000 + 1, then 10,000 + 2, and so on. You then have to
be careful, as noted above, to remove all such values in other contexts, either
by making a copy where all numbers above 10,000 are changed to NA for the
duration, or by taking subsets of the data that exclude them.
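A minimal sketch of that sentinel trick (the cutoff 10,000 and the codes 10001
and 10002 are made up for illustration):

```r
# Small integer data, with values above 10,000 standing in for NA categories.
# Here 10001 might mean "refused" and 10002 "illegible" -- invented codes.
x <- c(5L, 10001L, 12L, 10002L, 7L)

# Before any arithmetic, mask the sentinels back to a plain NA:
x_clean <- ifelse(x > 10000L, NA_integer_, x)
mean(x_clean, na.rm = TRUE)            # 8

# The reasons are still recoverable from the original vector:
reason_code <- x[x > 10000L] - 10000L  # 1, 2
```

The danger, as noted above, is forgetting the masking step somewhere and
letting a 10,001 leak into a calculation.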
Floating point can also be done that way or by using a negative number or other
tricks.
Character DATA can obviously use reserved words that will not occur in the rest
of the DATA, such as NA*NA:1 and NA*NA:2 or whatever makes sense to you.
Ideally these can be something you can remove all at once, perhaps with a
regular expression, when needed. If you use a factor to store such a field, as
is often a good idea, there are ways to force the indexes of your NA-like
levels to be whatever you want, such as the first 10 or even the last 10,
perhaps letting you play games when they need to be hidden, removed, or
refactored into a plain NA. It adds complexity and may break in unexpected ways.
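Here is a rough sketch of that reserved-level idea, reusing the NA*NA:n tokens
from above as the invented reason markers:

```r
# A factor whose first levels are reserved NA-reason tokens; the token names
# are arbitrary, just pick strings that cannot occur in the real data.
resp <- c("yes", "NA*NA:1", "no", "NA*NA:2", "yes")
f <- factor(resp, levels = c("NA*NA:1", "NA*NA:2", "yes", "no"))

# Collapse all reserved levels to a plain NA when an analysis needs it:
na_levels <- grep("^NA\\*NA:", levels(f), value = TRUE)
f_plain <- f
f_plain[f_plain %in% na_levels] <- NA
f_plain <- droplevels(f_plain)
table(f_plain)   # yes: 2, no: 1
```

The grep() on the level names is the "remove all at once with a regular
expression" step mentioned above.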
And, I shudder to say this, more modern usage allows you to use list variables
as columns instead of normal vectors. So you can replace a single column (or as
many as you want) with a tuple column where the first part holds your data,
including a plain NA when needed, and the second item holds something else,
such as a more specific reason whenever the first is NA. Heck, you can add a
series of Boolean entries to the list, where the second through the last each
encode TRUE if the entry has a particular excuse; an entry can even have
multiple excuses (or none). I repeat, I shudder, simply because many commonly
used R functions do not expect list columns, and you may need to call them
indirectly with something that first extracts only the part needed.
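A small sketch of such a list column; the field names value and reason, and the
reason strings, are invented for illustration:

```r
# Each cell of the list column holds the value plus an optional reason.
df <- data.frame(id = 1:3)
df$answer <- list(
  list(value = 42, reason = NULL),
  list(value = NA, reason = "RanOutOfTime"),
  list(value = NA, reason = "DidNotUnderstand")
)

# Most functions want plain vectors, so extract the parts first:
values <- vapply(df$answer, function(cell) as.numeric(cell$value), numeric(1))
reasons <- vapply(df$answer, function(cell)
  if (is.null(cell$reason)) NA_character_ else cell$reason, character(1))
```

The vapply() calls are exactly the "extract only the part needed first"
indirection I warned about.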
R does have some inconsistencies in how it handles things like name tags
associated with parts of a vector. Some functions preserve attributes used this
way and others do not. But if you want to emulate the tricks normally used in
making factors and matrices or giving column names, you can do something like
the following, which might work. My EXAMPLE below makes a vector of a dozen
sequential numbers as the VALUE, hides an attribute with month names to roughly
match, and then changes every third value to NA:
temp <- 1:12
attr(temp, "Month") <- month.abb
temp[3] <- temp[6] <- temp[9] <- temp[12] <- NA
The current value of temp may look like this:
> temp
 [1]  1  2 NA  4  5 NA  7  8 NA 10 11 NA
attr(,"Month")
 [1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov" "Dec"
So it has months attached as PLACEHOLDERS and four different NA values. To see
an NA value's REASON, the two share the same index, so:
> attr(temp, "Month")[is.na(temp)]
[1] "Mar" "Jun" "Sep" "Dec"
The above asks what text is associated with each NA. You can use many
techniques like this to find out why a particular item is NA. If you want to
know why the sixth item is NA (R indexes from 1, so index 6):
> attr(temp, "Month")[6]
[1] "Jun"
And it can work both ways. If I now changed the above to store an NA in the
Months variable (renamed by you to Reason or something) except for entries
saying "RanOutOfTime", "DidNotUnderstandQuestion", and so on, you could search
the attribute first and get the index numbers of the questions that matched,
among other things.
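Something like this rough sketch, with made-up reason strings:

```r
# A Reason attribute parallel to the data: NA where the value is fine,
# a reason string where the value is missing. The strings are examples only.
temp <- c(1, 2, NA, 4, NA, 6)
attr(temp, "Reason") <- c(NA, NA, "RanOutOfTime",
                          NA, "DidNotUnderstandQuestion", NA)

# Which entries are missing for a particular reason?
which(attr(temp, "Reason") == "RanOutOfTime")   # 3

# And the reason for a particular NA:
attr(temp, "Reason")[5]                          # "DidNotUnderstandQuestion"
```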
There may well be a well-reasoned package that does just what I described,
perhaps one that uses less space. The very rough implementation above just
attaches a second vector, loosely tied to the first, in a way that may be
invisible to most other functionality. But it can easily have problems, as so
many operations make new vectors and drop your addition. Consider just doubling
the odd vector I created:
> temp2 <- c(temp, temp)
> temp2
[1] 1 2 NA 4 5 NA 7 8 NA 10 11 NA 1 2 NA 4 5 NA 7 8 NA 10 11 NA
The annotation is gone!
Now if you do something a tad more normal and re-use the names() feature, maybe
you can preserve it in more cases:
temp <- 1:12
names(temp) <- month.abb
temp[3] <- temp[6] <- temp[9] <- temp[12] <- NA
> temp
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
  1   2  NA   4   5  NA   7   8  NA  10  11  NA
Now NAMES used this way can be preserved sometimes. For example, some functions
have arguments like this:
> temp2 <- c(temp, temp, use.names=TRUE)
> temp2
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec Jan Feb Mar Apr May Jun Jul Aug
  1   2  NA   4   5  NA   7   8  NA  10  11  NA   1   2  NA   4   5  NA   7   8
Sep Oct Nov Dec
 NA  10  11  NA
So, it may well be you can play such games with your input, though doing that
for hundreds of columns is a tad of work; it can be automated easily enough if
all the columns are similar, such as repeats of data in a time series. As
noted, R functions that read in DATA expect all items in a column to be of the
same underlying type or NA. If your data has text giving a REASON, and you know
exactly which reasons are allowed, with any remaining values left as is, you
might do something like this in pseudo code.
Say column_orig looks like this: 5, 1, bad, 2, worse, 1, 2, 5, bad, 6, worse,
missing, 2
Your stuff may be read in as CHARACTER and look like:
> column_orig
 [1] "5"       "1"       "bad"     "2"       "worse"   "1"       "2"
 [8] "5"       "bad"     "6"       "worse"   "missing" "2"
So, you can process the above with something like ifelse() to make a temporary
version, VERY carefully, as ifelse() does not preserve name attributes!
> names.temp <- ifelse(column_orig %in% c("bad", "worse", "missing"), column_orig, NA)
> column_orig <- ifelse(column_orig %in% c("bad", "worse", "missing"), NA, column_orig)
> column_orig <- as.numeric(column_orig)
> names(column_orig) <- names.temp
> column_orig
 <NA>  <NA>   bad  <NA> worse  <NA>  <NA>  <NA>   bad  <NA> worse missing  <NA>
    5     1    NA     2    NA     1     2     5    NA     6    NA      NA     2
(The above may not show up formatted right in the email, but it shows the names
on the first line and the data on the second. Wherever the data is NA, the
reason is in the name.)
Again, I am just playing with your specified need and pointing out ways R may
partially support it, though probably far from ideally, as you are trying to do
something it was probably never designed for. I suspect the philosophy behind
using a tibble instead of a data.frame may preserve your meta-info better.
But if all you want is to know the reason for a missing observation while using
little space, there may be other ways to consider such as making a sparse matrix
from the original data if missing values are rare enough. Sure, it might have
600 columns and umpteen rows, but you can store a small integer or even a byte
in each entry and perhaps skip any row that has nothing missing. If you later
need the info and the data has not been scrambled such as by removing rows or
columns or sorting, you can easily find it. Or, if you simply add one more
column with some form of unique sequence number or ID and maintain it, you can
always index back to find what you want, WITHOUT all the warnings mentioned
above.
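A quick sketch of that ID-column idea; the table and column names are invented:

```r
# Keep NA reasons in a small side table keyed by a unique id, and join back
# whenever the main data has been filtered or reordered.
main    <- data.frame(id = 1:5, value = c(10, NA, 30, NA, 50))
reasons <- data.frame(id = c(2, 4), reason = c("bad", "worse"))

# Even after the main data is shuffled or subset, the reasons still line up:
shuffled <- main[c(5, 2, 1), ]
joined <- merge(shuffled, reasons, by = "id", all.x = TRUE)
joined   # rows for ids 1 and 5 get reason NA; id 2 gets "bad"
```

This avoids the fragility of attributes entirely, at the cost of carrying the
id column around.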
If memory is a huge concern, consider ways to massage your original data to
conserve what you need, save THAT to a file on disk, and remove the extra
objects so the space can be garbage collected. When and IF you ever need the
info at some later date, the form you chose can be read back in. But you need
to be careful, as such meta-info is lost unless you use a method that preserves
it. Do not save it as a CSV file, for example, but as something R writes and
can read back in the same way.
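For example, saveRDS() round-trips attributes where write.csv() would drop
them; a small sketch using the month-named vector from earlier:

```r
# A vector carrying meta-info in its names, as in the earlier example.
temp <- 1:12
names(temp) <- month.abb
temp[c(3, 6, 9, 12)] <- NA

# RDS preserves the full R object, names and attributes included.
path <- tempfile(fileext = ".rds")
saveRDS(temp, path)
restored <- readRDS(path)
identical(restored, temp)   # TRUE: the names survive the round trip
```

A CSV written from the same vector would keep only the printed values, so any
NA reasons hidden in names or attributes would be gone on re-import.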
Or, you can try your own twists on changing how NA works and take lots of
risks, as that is not doing something published and guaranteed.
I think I can now politely bow out of this topic and wish you luck with whatever
you choose. It may even be using something other than R!
From: Adrian Dușa <dusa.adrian at unibuc.ro>
Sent: Monday, May 24, 2021 5:26 AM
To: Avi Gross <avigross at verizon.net>
Cc: r-devel <r-devel at r-project.org>
Subject: Re: [Rd] 1954 from NA
Hmm...
If it was only one column then your solution is neat. But with 5-600 variables,
each of which can contain multiple missing values, to double this number of
variables just to describe NA values seems to me excessive.
Not to mention we should be able to quickly convert / import / export from one
software package to another. This would imply maintaining some sort of metadata
reference of which explanatory additional factor describes which original
variable.
All of this strikes me as a lot of hassle compared to storing some information
within a tagged NA value... I just need a little bit more bits to play with.
Best wishes,
Adrian
On Sun, May 23, 2021 at 10:21 PM Avi Gross via R-devel
<r-devel at r-project.org> wrote:
Arguably, R was not developed to satisfy some needs in the way intended.
When I have had to work with datasets from some of the social sciences I have
had to adapt to subtleties in how they did things with software like SPSS in
which an NA was done using an out of bounds marker like 999 or "." or
even a blank cell. The problem is that R has a concept where data such as
integers or floating point numbers is not stored as text normally but in their
own formats and a vector by definition can only contain ONE data type. So the
various forms of NA as well as Nan and Inf had to be grafted on to be considered
VALID to share the same storage area as if they sort of were an integer or
floating point number or text or whatever.
It does strike me as possible to simply have a column that is something like a
factor that can contain as many NA excuses as you wish, such as "NOT ANSWERED"
to "CANNOT READ THE SQUIGGLE" to "NOT SURE" to "WILL BE FILLED IN LATER" to
"I DON'T SPEAK ENGLISH AND CANNOT ANSWER STUPID QUESTIONS". This additional
column would presumably only have content when the other column has an NA.
Your queries and other changes would work on something like a data.frame where
both such columns coexisted.
Note reading in data with multiple NA reasons may take extra work. If your
error codes are text, it will all become text. If the errors are 999 and 998
and 997, it may all be treated as numeric and you may not want to convert all
such codes to an NA immediately. Rather, you would use the first vector/column
to make the second vector and THEN replace everything that should be an NA with
an actual NA and reparse the entire vector to become properly numeric unless you
like working with text and will convert to numbers as needed on the fly.
Now this form of annotation may not be pleasing but I suggest that an
implementation that does allow annotation may use up space too. Of course, if
your NA values are rare and space is only used then, you might save space. But
if you could make a factor column and have it use the smallest int it can get as
a basis, it may be a way to save on space.
People who have done work with R, especially those using the tidyverse, are
quite used to using one column to explain another. So if you are asked to, say,
tabulate what percent of missing values are due to reasons A/B/C, the added
column works fine for that calculation too.
--
Adrian Dusa
University of Bucharest
Romanian Social Data Archive
Soseaua Panduri nr. 90-92
050663 Bucharest sector 5
Romania
https://adriandusa.eu