thr3ads.net - R devel - [Rd] vctrs: a type system for the tidyverse [Aug 2018]

Martin Maechler

2018-Aug-08 15:54 UTC

[Rd] vctrs: a type system for the tidyverse

>>>>> Hadley Wickham 
>>>>>     on Wed, 8 Aug 2018 09:34:42 -0500 writes:
    >>>> Method dispatch for `vec_c()` is quite simple because
    >>>> associativity and commutativity mean that we can
    >>>> determine the output type only by considering a pair of
    >>>> inputs at a time. To this end, vctrs provides
    >>>> `vec_type2()` which takes two inputs and returns their
    >>>> common type (represented as zero length vector):
    >>>> 
    >>>> str(vec_type2(integer(), double())) #> num(0)
    >>>> 
    >>>> str(vec_type2(factor("a"), factor("b")))
    >>>> #> Factor w/ 2 levels "a","b":
    >>> 
    >>> 
    >>> What is the reasoning behind taking the union of the
    >>> levels here? I'm not sure that is actually the behavior
    >>> I would want if I have a vector of factors and I try to
    >>> append some new data to it. I might want/ expect to
    >>> retain the existing levels and get either NAs or an
    >>> error if the new data has (present) levels not in the
    >>> first data. The behavior as above doesn't seem in-line
    >>> with what I understand the purpose of factors to be
    >>> (explicit restriction of possible values).
    >> 
    >> Originally (like a week ago ?), we threw an error if the
    >> factors didn't have the same level, and provided an
    >> optional coercion to character. I decided that while
    >> correct (the factor levels are a parameter of the type,
    >> and hence factors with different levels aren't
    >> comparable), that this fights too much against how people
    >> actually use factors in practice. It also seems like base
    >> R is moving more in this direction, i.e. in 3.4
    >> factor("a") == factor("b") is an error, whereas in R 3.5
    >> it returns FALSE.

    > I now have a better argument, I think:

    > If you squint your brain a little, I think you can see
    > that each set of automatic coercions is about increasing
    > resolution. Integers are low resolution versions of
    > doubles, and dates are low resolution versions of
    > date-times. Logicals are low resolution versions of
    > integers because there's a strong convention that `TRUE`
    > and `FALSE` can be used interchangeably with `1` and `0`.

    > But what is the resolution of a factor? We must take a
    > somewhat pragmatic approach because base R often converts
    > character vectors to factors, and we don't want to be
    > burdensome to users. So we say that a factor `x` has finer
    > resolution than factor `y` if the levels of `y` are
    > contained in `x`. So to find the common type of two
    > factors, we take the union of the levels of each factor,
    > giving a factor that has finer resolution than
    > both. Finally, you can think of a character vector as a
    > factor with every possible level, so factors and character
    > vectors are coercible.

    > (extracted from the in-progress vignette explaining how to
    > extend vctrs to work with your own vectors, now that vctrs
    > has been rewritten to use double dispatch)
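
The "increasing resolution" idea in the quoted passage can be seen in base R's own `c()`. This is only an illustrative sketch using base R, not vctrs code; the variable names are made up for the example.

```r
# Base R's c() already promotes to the "finer" of the input types.
x <- c(TRUE, 2L)   # logical + integer
y <- c(1L, 2.5)    # integer + double
typeof(x)          # "integer"
typeof(y)          # "double"

# A character vector can hold any label, so moving a factor's values
# into a character vector loses none of them.
f <- factor(c("a", "b"))
z <- c(as.character(f), "c")
z                  # "a" "b" "c"
```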

I like this argumentation, and find it very nice indeed!
It confirms my own gut feeling which had led me to agreeing
with you, Hadley, that taking the union of all factor levels
should be done here.
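
For reference, the union-of-levels result can be reproduced by hand in base R. This is a sketch of the behaviour under discussion, not of how vctrs implements it; `f1`, `f2` and `common_levels` are illustrative names.

```r
f1 <- factor("a")
f2 <- factor("b")

# The common "type" of the two factors: a factor whose levels are the union.
common_levels <- union(levels(f1), levels(f2))

# Combine the values as character, then re-encode against the common levels.
combined <- factor(c(as.character(f1), as.character(f2)),
                   levels = common_levels)
levels(combined)   # "a" "b"
```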

As Gabe mentioned (and you've explained about) the term "type"
is really confusing here.  As you know, the R internals are all
about SEXPs, TYPEOF(), etc, and that's what the R level
typeof(.) also returns.  As you want to use something slightly
different, it should be different naming, ideally something not
existing yet in the R / S world, maybe 'kind' ?
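
The clash with the existing meaning of "type" is easy to see in base R, where `typeof()` reports the internal storage type and `class()` reports what the thread is calling the type. A small illustration (variable names are arbitrary):

```r
f <- factor(c("a", "b"))
d <- as.Date("2018-08-08")

typeof(f)   # "integer"  (storage type, i.e. what TYPEOF() sees)
class(f)    # "factor"
typeof(d)   # "double"
class(d)    # "Date"
```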

Martin


    > Hadley

    > -- 
    > http://hadley.nz

    > ______________________________________________
    > R-devel at r-project.org mailing list
    > https://stat.ethz.ch/mailman/listinfo/r-devel

Gabe Becker

2018-Aug-08 16:03 UTC

Actually, I sent that too quickly, I should have let it stew a bit more.
I've changed my mind about the resolution argument I was trying to make.
There is more information, technically speaking, in the factor with empty
levels. I'm still not convinced that it's the right behavior, personally. It
may just be me though, since Martin seems on board. Mostly I'm just very
wary of taking away the thing about factors that makes them fundamentally
not characters, and removing the effectiveness of the level restriction, in
practice, does that.
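
The stricter alternative, where appended data is held to the existing levels and anything else becomes NA, can be sketched in base R as follows. This only illustrates the behaviour being argued for; the names `existing`, `incoming` and `appended` are hypothetical.

```r
existing <- factor(c("a", "a"), levels = "a")
incoming <- factor("b")

# Keep only the levels of the existing factor; unknown values become NA.
appended <- factor(c(as.character(existing), as.character(incoming)),
                   levels = levels(existing))
levels(appended)   # "a"
is.na(appended)    # FALSE FALSE TRUE
```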

Best,
~G


-- 
Gabriel Becker, Ph.D
Scientist
Bioinformatics and Computational Biology
Genentech Research

Hadley Wickham

2018-Aug-08 17:37 UTC

> I like this argumentation, and find it very nice indeed!
> It confirms my own gut feeling which had led me to agreeing
> with you, Hadley, that taking the union of all factor levels
> should be done here.

That's great to hear :)

> As Gabe mentioned (and you've explained about) the term "type"
> is really confusing here.  As you know, the R internals are all
> about SEXPs, TYPEOF(), etc, and that's what the R level
> typeof(.) also returns.  As you want to use something slightly
> different, it should be different naming, ideally something not
> existing yet in the R / S world, maybe 'kind' ?

Agreed - I've been using type in the sense of "type system"
(particularly as it relates to algebraic data types), but that's not
obvious from the current presentation, and as you note, is confusing
with existing notions of type in R. I like your suggestion of kind,
but I think it might be possible to just talk about classes, and
instead emphasise that while the components of the system are classes
(and indeed it's implemented using S3), the coercion/casting
relationships do not strictly follow the subclass/superclass
relationships.

A good motivating example is now ordered vs factor - I don't think you
can say that ordered or factor have greater resolution than the other
so:

vec_c(factor("a"), ordered("a"))
#> Error: No common type for factor and ordered

This is not what you'd expect from an _object_ system since ordered is
a subclass of factor.
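
The subclass relationship itself is visible in base R, which is what makes the missing common type notable. A short check (the name `o` is arbitrary):

```r
o <- ordered("a")

class(o)                # "ordered" "factor"
inherits(o, "factor")   # TRUE: ordered is an S3 subclass of factor
is.factor(o)            # TRUE
```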

Hadley

-- 
http://hadley.nz

Iñaki Úcar

2018-Aug-08 17:47 UTC

On Wed, 8 Aug 2018 at 19:23, Gabe Becker (<becker.gabe at gene.com>) wrote:
>
> Actually, I sent that too quickly, I should have let it stew a bit more.
> I've changed my mind about the resolution argument I was trying to make.
> There is more information, technically speaking, in the factor with empty
> levels. I'm still not convinced that it's the right behavior, personally. It
> may just be me though, since Martin seems on board. Mostly I'm just very
> wary of taking away the thing about factors that makes them fundamentally
> not characters, and removing the effectiveness of the level restriction, in
> practice, does that.

For what it's worth, I always thought about factors as fundamentally
characters, but with restrictions: a subspace of all possible strings.
And I'd say that a non-negligible number of R users may think about
them in a similar way.

In fact, if you search "concatenation factors", you'll see that back
in 2008 somebody asked on R-help [1] because he wanted to do exactly
what Hadley is describing (i.e., concatenation as character with
levels as a union of the levels), and he was surprised because...
well, the behaviour of c.factor is quite surprising if you don't read
the manual.

BTW, the solution proposed was unlist(list(fct1, fct2)).
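
A small illustration of that workaround: `unlist()` treats a list whose elements are all factors specially and returns a factor whose levels are the union of the level sets. The contents of `fct1` and `fct2` here are made up for the example.

```r
fct1 <- factor(c("a", "b"))
fct2 <- factor(c("b", "c"))

res <- unlist(list(fct1, fct2))
levels(res)         # "a" "b" "c"
as.character(res)   # "a" "b" "b" "c"
```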

[1] https://www.mail-archive.com/r-help at r-project.org/msg38360.html

Iñaki

Hadley Wickham

2018-Aug-09 14:36 UTC

> > As Gabe mentioned (and you've explained about) the term "type"
> > is really confusing here.  As you know, the R internals are all
> > about SEXPs, TYPEOF(), etc, and that's what the R level
> > typeof(.) also returns.  As you want to use something slightly
> > different, it should be different naming, ideally something not
> > existing yet in the R / S world, maybe 'kind' ?
>
> Agreed - I've been using type in the sense of "type system"
> (particularly as it relates to algebraic data types), but that's not
> obvious from the current presentation, and as you note, is confusing
> with existing notions of type in R. I like your suggestion of kind,
> but I think it might be possible to just talk about classes, and
> instead emphasise that while the components of the system are classes
> (and indeed it's implemented using S3), the coercion/casting
> relationships do not strictly follow the subclass/superclass
> relationships.

I've taken another pass through (the first part of) the readme
(https://github.com/r-lib/vctrs#vctrs), and I'm now confident that I
can avoid using "type" by itself, and instead always use it in a
compound phrase (like type system) to avoid confusion. That leaves the
`.type` argument to many vctrs functions. I'm considering changing it to
.prototype, because what you actually give it is a zero-length vector
of the class you want, i.e. a prototype of the desired output. What do
you think of prototype as a name?
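
To make "a zero-length vector of the class you want" concrete, here is what such a prototype looks like in base R. This is only a sketch of the idea; `proto` is an illustrative name and no vctrs function is called.

```r
# A zero-length factor still carries its class and its levels,
# which is all a "prototype" needs to describe the desired output.
proto <- factor(character(), levels = c("a", "b"))

length(proto)   # 0
class(proto)    # "factor"
levels(proto)   # "a" "b"
```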

Do you have any thoughts on good names for distinguishing vectors without
a class (i.e. logical, integer, double, ...) from vectors with a class
(e.g. factors, dates, etc). I've been thinking bare vector and S3
vector (leaving room to later think about S4 vectors). Do those sound
reasonable to you?
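
One way to express the bare-versus-classed distinction in base R is with `is.object()`, which is FALSE for vectors without a class attribute. The helper `is_bare()` below is hypothetical, written only for this illustration; note that `is.object()` is also TRUE for S4 objects, so it does not by itself separate S3 from S4 vectors.

```r
# Hypothetical helper: an atomic vector with no class attribute.
is_bare <- function(x) is.atomic(x) && !is.object(x)

is_bare(1:3)                      # TRUE  (integer, no class)
is_bare(c(1.5, 2))                # TRUE  (double, no class)
is_bare(factor("a"))              # FALSE (S3 class "factor")
is_bare(as.Date("2018-08-09"))    # FALSE (S3 class "Date")
```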

Hadley

-- 
http://hadley.nz
