Hadley,
Responses inline.
On Wed, Aug 8, 2018 at 7:34 AM, Hadley Wickham <h.wickham at gmail.com>
wrote:
> >>> Method dispatch for `vec_c()` is quite simple because associativity and
> >>> commutativity mean that we can determine the output type only by
> >>> considering a pair of inputs at a time. To this end, vctrs provides
> >>> `vec_type2()` which takes two inputs and returns their common type
> >>> (represented as zero length vector):
> >>>
> >>> str(vec_type2(integer(), double()))
> >>> #> num(0)
> >>>
> >>> str(vec_type2(factor("a"), factor("b")))
> >>> #> Factor w/ 2 levels "a","b":
> >>
> >>
> >> What is the reasoning behind taking the union of the levels here? I'm not
> >> sure that is actually the behavior I would want if I have a vector of
> >> factors and I try to append some new data to it. I might want/expect to
> >> retain the existing levels and get either NAs or an error if the new data
> >> has (present) levels not in the first data. The behavior as above doesn't
> >> seem in-line with what I understand the purpose of factors to be (explicit
> >> restriction of possible values).
> >
> > Originally (like a week ago ?), we threw an error if the factors
> > didn't have the same level, and provided an optional coercion to
> > character. I decided that while correct (the factor levels are a
> > parameter of the type, and hence factors with different levels aren't
> > comparable), that this fights too much against how people actually use
> > factors in practice. It also seems like base R is moving more in this
> > direction, i.e. in 3.4 factor("a") == factor("b") is an error, whereas
> > in R 3.5 it returns FALSE.
>
> I now have a better argument, I think:
>
> If you squint your brain a little, I think you can see that each set
> of automatic coercions is about increasing resolution. Integers are
> low resolution versions of doubles, and dates are low resolution
> versions of date-times. Logicals are low resolution versions of
> integers because there's a strong convention that `TRUE` and `FALSE`
> can be used interchangeably with `1` and `0`.
>
> But what is the resolution of a factor? We must take a somewhat
> pragmatic approach because base R often converts character vectors to
> factors, and we don't want to be burdensome to users.

I don't know, I personally just don't buy this line of reasoning. Yes, you
can convert between characters and factors, but that doesn't make factors
"a special kind of character", which you seem to be implicitly arguing they
are. Fundamentally they are different objects with different purposes. As I
said in my previous email, the primary semantic purpose of factors is value
restriction. You don't WANT to increase the set of levels when your set of
values has already been carefully curated. Certainly not automagically.
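
To make the alternative concrete, here is a minimal base R sketch of an append that keeps the existing levels fixed and gives either an error or NAs for values outside them. `append_strict()` and its `unknown` argument are hypothetical names used only for illustration; they are not part of base R or vctrs, and the sketch ignores ordered factors.

```r
# Hypothetical helper (not part of base R or vctrs): append `new` to the
# factor `x` while keeping levels(x) fixed.
append_strict <- function(x, new, unknown = c("error", "na")) {
  unknown <- match.arg(unknown)
  stopifnot(is.factor(x))
  new_chr <- as.character(new)
  # Values present in `new` that are not among the curated levels of `x`
  extra <- setdiff(new_chr[!is.na(new_chr)], levels(x))
  if (length(extra) > 0 && unknown == "error") {
    stop("values not in levels(x): ", paste(extra, collapse = ", "))
  }
  # factor() maps any value outside `levels` to NA
  factor(c(as.character(x), new_chr), levels = levels(x))
}

x <- factor(c("a", "b"), levels = c("a", "b"))

# unknown = "na": the levels stay a, b and the new value becomes NA
res_na <- append_strict(x, factor("c"), unknown = "na")

# unknown = "error" (the default): the append is refused
res_err <- tryCatch(
  append_strict(x, factor("c")),
  error = function(e) conditionMessage(e)
)
```

Either way the set of levels never grows unless the caller widens it explicitly.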
> So we say that a
> factor `x` has finer resolution than factor `y` if the levels of `y`
> are contained in `x`. So to find the common type of two factors, we
> take the union of the levels of each factor, giving a factor that has
> finer resolution than both.

I'm not so sure. I think a more useful definition of resolution may be that
it is about increasing the precision of information. In that case, a factor
with 4 levels each of which is present has a *higher* resolution than the
same data with additional-but-absent levels on the factor object. Now that
may be different when the new levels are not absent, but my point is
that it's not clear to me that resolution is a useful way of talking about
factors.
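
A small base R illustration of the "additional-but-absent levels" case, assuming the union-of-levels rule is applied by hand with `union()`; the variable names are made up for the example.

```r
x <- factor(c("a", "b", "c", "d"))   # 4 levels, each of which is present
y <- factor(c("e", "f"))

# Union of levels, as under the proposed common-type rule
u <- union(levels(x), levels(y))

# Same data as `x`, but carried on the wider set of levels
x_wide <- factor(x, levels = u)

same_data <- identical(as.character(x), as.character(x_wide))
n_before  <- nlevels(x)
n_after   <- nlevels(x_wide)

# The added levels are absent: their counts are zero
counts <- table(x_wide)

# Dropping unused levels recovers the original set of levels
x_back <- droplevels(x_wide)
```

The values are unchanged; the only difference is that the factor now admits two levels that never occur in the data.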
> Finally, you can think of a character
> vector as a factor with every possible level, so factors and character
> vectors are coercible.
>
If users want unrestricted character type behavior, then IMHO they should
just be using characters, and it's quite easy for them to do so in any case
I can easily think of where they have somehow gotten their hands on a
factor. If, however, they want a factor, it must be - I imagine - because
they actually want the semantics and behavior *specific* to factors.
Best,
~G
>
> (extracted from the in-progress vignette explaining how to extend
> vctrs to work with your own vctrs, now that vctrs has been rewritten
> to use double dispatch)
>
> Hadley
>
> --
> http://hadley.nz
>
--
Gabriel Becker, Ph.D
Scientist
Bioinformatics and Computational Biology
Genentech Research