Hi all,

I wanted to share with you an experimental package that I'm currently
working on: vctrs, <https://github.com/r-lib/vctrs>. The motivation for
vctrs is to think deeply about the output "type" of functions like
`c()`, `ifelse()`, and `rbind()`, with an eye to implementing one
strategy throughout the tidyverse (i.e. all the functions listed at
<https://github.com/r-lib/vctrs#tidyverse-functions>). Because this is
going to be a big change, I thought it would be very useful to get
comments from a wide audience, so I'm reaching out to R-devel to get
your thoughts.

There is quite a lot already in the readme
(<https://github.com/r-lib/vctrs#vctrs>), so here I'll try to motivate
vctrs as succinctly as possible by comparing `base::c()` to its
equivalent `vctrs::vec_c()`. I think the drawbacks of `c()` are well
known, but to refresh your memory, I've highlighted a few at
<https://github.com/r-lib/vctrs#compared-to-base-r>. I think they arise
because of two main challenges: `c()` has to both combine vectors *and*
strip attributes, and it only dispatches on the first argument.

The design of vctrs is largely driven by a pair of principles:

- The type of `vec_c(x, y)` should be the same as the type of
  `vec_c(y, x)`

- The type of `vec_c(x, vec_c(y, z))` should be the same as the type of
  `vec_c(vec_c(x, y), z)`

i.e. the type should be associative and commutative. I think these are
good principles because they make types simpler to understand and to
implement.

Method dispatch for `vec_c()` is quite simple because associativity and
commutativity mean that we can determine the output type by considering
only a pair of inputs at a time. To this end, vctrs provides
`vec_type2()`, which takes two inputs and returns their common type
(represented as a zero-length vector):

    str(vec_type2(integer(), double()))
    #> num(0)

    str(vec_type2(factor("a"), factor("b")))
    #> Factor w/ 2 levels "a","b":

    # NB: not all types have a common/unifying type
    str(vec_type2(Sys.Date(), factor("a")))
    #> Error: No common type for date and factor

(`vec_type2()` currently implements double dispatch through a
combination of S3 dispatch and if-else blocks, but this will change to
a pure S3 approach in the near future.)

To find the common type of multiple vectors, we can use `Reduce()`:

    vecs <- list(TRUE, 1:10, 1.5)

    type <- Reduce(vec_type2, vecs)
    str(type)
    #> num(0)

There's one other piece of the puzzle: casting one vector to another
type. That's implemented by `vec_cast()` (which also uses double
dispatch):

    str(lapply(vecs, vec_cast, to = type))
    #> List of 3
    #> $ : num 1
    #> $ : num [1:10] 1 2 3 4 5 6 7 8 9 10
    #> $ : num 1.5

All up, this means that we can implement the essence of `vec_c()` in
only a few lines:

    vec_c2 <- function(...) {
      args <- list(...)
      type <- Reduce(vec_type2, args)

      cast <- lapply(args, vec_cast, to = type)
      unlist(cast, recursive = FALSE)
    }

    vec_c(factor("a"), factor("b"))
    #> [1] a b
    #> Levels: a b

    vec_c(Sys.Date(), Sys.time())
    #> [1] "2018-08-06 00:00:00 CDT" "2018-08-06 11:20:32 CDT"

(The real implementation is a little more complex:
<https://github.com/r-lib/vctrs/blob/master/R/c.R>)

On top of this foundation, vctrs expands in a few different ways:

- To consider the "type" of a data frame, and what the common type of
  two data frames should be. This leads to a natural implementation of
  `vec_rbind()` which includes all columns that appear in any input.

- To create a new "list_of" type, a list where every element is of a
  fixed type (enforced by `[<-`, `[[<-`, and `$<-`); a rough sketch of
  the idea follows below.

- To think a little about the "shape" of a vector, and to consider
  recycling as part of the type system. (This thinking is not yet fully
  fleshed out.)
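To make the "list_of" bullet concrete, here is a minimal sketch of how
a fixed element type could be enforced through a `[[<-` method. It
assumes only the `vec_cast()` behaviour shown above; the names
`new_list_of2` and `list_of2` are made up for illustration and this is
not the vctrs implementation (which would also need `[<-`, `$<-`, and a
print method):

    new_list_of2 <- function(x = list(), ptype) {
      # cast every existing element to the fixed element type up front
      x[] <- lapply(x, vctrs::vec_cast, to = ptype)
      structure(x, ptype = ptype, class = "list_of2")
    }

    `[[<-.list_of2` <- function(x, i, value) {
      # any new element must also be castable to the fixed element type
      value <- vctrs::vec_cast(value, to = attr(x, "ptype"))
      NextMethod()
    }

    xs <- new_list_of2(list(1L, 2L), ptype = double())
    xs[[3]] <- TRUE   # stored as 1, i.e. cast to the fixed double type
    # xs[[4]] <- "a"  # expected to error or warn: no clean cast from
                      # character to double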
Thanks for making it to the bottom of this long email :) I would love
to hear your thoughts on vctrs. It's something that I've been having a
lot of fun exploring, and I'd like to make sure it is as robust as
possible (and the motivations are as clear as possible) before we start
using it in other packages.

Hadley

--
http://hadley.nz
Hadley,

Looks interesting, and like a fun project from what you said in the
email (I don't have time right now to dig deep into the readme). A few
thoughts.

First off, you are using the word "type" throughout this email; you
seem to mean class (judging by your Date and factor examples, and the
fact you mention S3 dispatch) as opposed to type in the sense of what
is returned by R's typeof() function. I think it would be clearer if
you called it class throughout, unless that isn't actually what you
mean (in which case I would have other questions...).

More thoughts inline.

On Mon, Aug 6, 2018 at 9:21 AM, Hadley Wickham <h.wickham at gmail.com> wrote:

> To this end, vctrs provides `vec_type2()`, which takes two inputs and
> returns their common type (represented as a zero-length vector):
>
>     str(vec_type2(integer(), double()))
>     #> num(0)
>
>     str(vec_type2(factor("a"), factor("b")))
>     #> Factor w/ 2 levels "a","b":

What is the reasoning behind taking the union of the levels here? I'm
not sure that is actually the behavior I would want if I have a vector
of factors and I try to append some new data to it. I might want/expect
to retain the existing levels and get either NAs or an error if the new
data has (present) levels not in the first data. The behavior as above
doesn't seem in line with what I understand the purpose of factors to
be (explicit restriction of possible values).

I guess what I'm saying is that while I agree associativity is good for
most things, it doesn't seem like the right behavior to me in the case
of factors.
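To make that concrete, the behaviour I'd expect is roughly what base R
gives when the existing levels are held fixed, so that unseen values
become NA (just a sketch of the expectation, not a proposed API):

    x <- factor(c("a", "b"))
    new <- c("b", "c")

    factor(c(as.character(x), new), levels = levels(x))
    #> [1] a    b    b    <NA>
    #> Levels: a b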
Also, while we're on factors, what does

    vec_type2(factor("a"), "a")

return, character or factor with levels "a"?

>     # NB: not all types have a common/unifying type
>     str(vec_type2(Sys.Date(), factor("a")))
>     #> Error: No common type for date and factor

Why is this not a list? Do you have the additional constraint that
vec_type2 must return the class of one of its operands? If so, what is
the justification for that? Are you not counting list as a "type of
vector"?

> On top of this foundation, vctrs expands in a few different ways:
>
> - To consider the "type" of a data frame, and what the common type of
>   two data frames should be. This leads to a natural implementation
>   of `vec_rbind()` which includes all columns that appear in any
>   input.

I must admit I'm a bit surprised here. rbind is one of the few places
that immediately come to mind where R takes a fail-early-and-loud
approach to likely errors (as opposed to the more permissive
do-something-that-could-be-what-they-meant approach of, e.g.,
out-of-bounds indexing). Are we sure we want rbind to get less strict
with respect to compatibility of the data.frames being combined?

Another "permissive" option would be to return a data.frame which has
only the intersection of the columns. There are certainly times when
that is what I want (rather than columns with tons of NAs in them) and
it would be convenient not to need to do the column subsetting myself.
This behavior would also meet your design goals of associativity and
commutativity.
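For what it's worth, the intersection variant is easy to express in
base R terms (`rbind_common` is just an illustrative name, not a
proposed API):

    rbind_common <- function(x, y) {
      shared <- intersect(names(x), names(y))
      rbind(x[shared], y[shared])
    }

    rbind_common(data.frame(x = 1, y = 2), data.frame(x = 3, z = 4))
    #>   x
    #> 1 1
    #> 2 3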
I want to be clear, I think what you describe is a useful operation, if
it is what is intended, but perhaps a different name rather than
calling it rbind? Maybe vec_rcbind, to indicate that both rows and
columns are potentially being added to any given individual input.

Best,
~G

--
Gabriel Becker, Ph.D
Scientist
Bioinformatics and Computational Biology
Genentech Research
> First off, you are using the word "type" throughout this email; you
> seem to mean class (judging by your Date and factor examples, and the
> fact you mention S3 dispatch) as opposed to type in the sense of what
> is returned by R's typeof() function. I think it would be clearer if
> you called it class throughout, unless that isn't actually what you
> mean (in which case I would have other questions...).

I used "type" to hand-wave away the precise definition - it's not S3
class or base type (i.e. typeof()) but some hybrid of the two. I do
want to emphasise that it's a type system, not an OO system, in that
coercions are not defined by superclass/subclass relationships.

> What is the reasoning behind taking the union of the levels here? I'm
> not sure that is actually the behavior I would want if I have a
> vector of factors and I try to append some new data to it. I might
> want/expect to retain the existing levels and get either NAs or an
> error if the new data has (present) levels not in the first data. The
> behavior as above doesn't seem in line with what I understand the
> purpose of factors to be (explicit restriction of possible values).

Originally (like a week ago?), we threw an error if the factors didn't
have the same levels, and provided an optional coercion to character. I
decided that while correct (the factor levels are a parameter of the
type, and hence factors with different levels aren't comparable), this
fights too much against how people actually use factors in practice. It
also seems like base R is moving more in this direction, i.e. in R 3.4
factor("a") == factor("b") is an error, whereas in R 3.5 it returns
FALSE.

I'm not wedded to the current approach, but it feels like the same
principle should apply in comparisons like x == y (even though == is
outside the scope of vctrs, ideally the underlying principles would be
robust enough to suggest what should happen).
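For concreteness, here is the kind of thing that principle would
suggest for ==, sketched in base R (`fct_equal` is just an illustrative
name; comparisons are not part of vctrs): put both factors on the union
of their levels, then compare.

    fct_equal <- function(x, y) {
      lev <- union(levels(x), levels(y))
      factor(x, levels = lev) == factor(y, levels = lev)
    }

    fct_equal(factor("a"), factor("b"))
    #> [1] FALSE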
> I guess what I'm saying is that while I agree associativity is good
> for most things, it doesn't seem like the right behavior to me in the
> case of factors.

I think associativity is such a strong and useful principle that it may
be worth making some sacrifices for factors. That said, my claim of
associativity is only about the type, not the values of the type:
vec_c(fa, fb) and vec_c(fb, fa) both return factors, but the levels are
in different orders.

> Also, while we're on factors, what does
>
>     vec_type2(factor("a"), "a")
>
> return, character or factor with levels "a"?

Character. Coercing to factor would potentially lose too much
information. I think you could argue that this could be an error, but
again I feel like this would make the type system a little too strict
and cause extra friction for most uses.

> Why is this not a list? Do you have the additional constraint that
> vec_type2 must return the class of one of its operands? If so, what
> is the justification for that? Are you not counting list as a "type
> of vector"?

You can always request a list, with `vec_type2(Sys.Date(), factor("a"),
.type = list())` - generally the philosophy is not to make major
changes to the type without explicit user input.

I can't currently fully articulate my reasoning for why some coercions
happen automatically and why some don't. I think these decisions have
to be made somewhat on the basis of pragmatics, and what R users are
currently familiar with. You can see a visual summary of implicit casts
(arrows) + explicit casts (circles) at
https://github.com/r-lib/vctrs/blob/master/man/figures/combined.png.
This matrix must be symmetric, and I think it should be block diagonal,
but I don't otherwise know what the constraints are.

> I must admit I'm a bit surprised here. rbind is one of the few places
> that immediately come to mind where R takes a fail-early-and-loud
> approach to likely errors (as opposed to the more permissive
> do-something-that-could-be-what-they-meant approach of, e.g.,
> out-of-bounds indexing). Are we sure we want rbind to get less strict
> with respect to compatibility of the data.frames being combined?

Pragmatically, it's clearly needed for data analysis.
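For example, combining data frames with different columns gives the
union of the columns, with the missing entries filled in with NA -
something like this (the exact printing may differ):

    vec_rbind(
      data.frame(x = 1, y = 2),
      data.frame(x = 3, z = TRUE)
    )
    #>   x  y    z
    #> 1 1  2   NA
    #> 2 3 NA TRUE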
Also note that there are some inputs to rbind() that lead to silent
data loss:

    rbind(data.frame(x = 1:3), c(1, 1000000))
    #>   x
    #> 1 1
    #> 2 2
    #> 3 3
    #> 4 1

So while it's pretty good in general, there are still a few
infelicities. (In particular, I suspect R-core might be interested in
fixing this one.)

> Another "permissive" option would be to return a data.frame which has
> only the intersection of the columns. There are certainly times when
> that is what I want (rather than columns with tons of NAs in them)
> and it would be convenient not to need to do the column subsetting
> myself. This behavior would also meet your design goals of
> associativity and commutativity.

Yes, I think that would make sense as an option and would be trivial to
implement (issue at https://github.com/r-lib/vctrs/issues/46).

Another thing I need to implement is the ability to specify the types
of some columns. Currently it's all or nothing:

    vec_rbind(
      data.frame(x = F, y = 1),
      data.frame(x = 1L, y = 2),
      .type = data.frame(x = logical())
    )
    #>       x
    #> 1 FALSE
    #> 2  TRUE
    #> Warning messages:
    #> 1: Lossy conversion from data.frame to data.frame
    #>    Dropped variables: y
    #> 2: Lossy conversion from data.frame to data.frame
    #>    Dropped variables: y

> I want to be clear, I think what you describe is a useful operation,
> if it is what is intended, but perhaps a different name rather than
> calling it rbind? Maybe vec_rcbind, to indicate that both rows and
> columns are potentially being added to any given individual input.

Sorry, I should have mentioned that this is unlikely to be the final
name. As well as the problem you mention, I think calling them
vec_cbind() and vec_rbind() over-emphasises the symmetry between the
two operations. cbind() and rbind() are symmetric for matrices, but for
data frames, rbind() is more about common types, and cbind() is more
about common shapes.

Thanks for your feedback, it's very useful!

Hadley

--
http://hadley.nz