> First off, you are using the word "type" throughout this email; you seem to
> mean class (judging by your Date and factor examples, and the fact you
> mention S3 dispatch) as opposed to type in the sense of what is returned by
> R's typeof() function. I think it would be clearer if you called it class
> throughout unless that isn't actually what you mean (in which case I would
> have other questions...)

I used "type" to hand wave away the precise definition - it's not S3
class or base type (i.e. typeof()) but some hybrid of the two. I do
want to emphasise that it's a type system, not an OO system, in that
coercions are not defined by superclass/subclass relationships.

> More thoughts inline.
>
> On Mon, Aug 6, 2018 at 9:21 AM, Hadley Wickham <h.wickham at gmail.com> wrote:
>>
>> Hi all,
>>
>> I wanted to share with you an experimental package that I'm currently
>> working on: vctrs, <https://github.com/r-lib/vctrs>. The motivation for
>> vctrs is to think deeply about the output "type" of functions like
>> `c()`, `ifelse()`, and `rbind()`, with an eye to implementing one
>> strategy throughout the tidyverse (i.e. all the functions listed at
>> <https://github.com/r-lib/vctrs#tidyverse-functions>). Because this is
>> going to be a big change, I thought it would be very useful to get
>> comments from a wide audience, so I'm reaching out to R-devel to get
>> your thoughts.
>>
>> There is quite a lot already in the readme
>> (<https://github.com/r-lib/vctrs#vctrs>), so here I'll try to motivate
>> vctrs as succinctly as possible by comparing `base::c()` to its
>> equivalent `vctrs::vec_c()`. I think the drawbacks of `c()` are well
>> known, but to refresh your memory, I've highlighted a few at
>> <https://github.com/r-lib/vctrs#compared-to-base-r>. I think they arise
>> because of two main challenges: `c()` has to both combine vectors *and*
>> strip attributes, and it only dispatches on the first argument.
>>
>> The design of vctrs is largely driven by a pair of principles:
>>
>> - The type of `vec_c(x, y)` should be the same as `vec_c(y, x)`
>>
>> - The type of `vec_c(x, vec_c(y, z))` should be the same as
>>   `vec_c(vec_c(x, y), z)`
>>
>> i.e. the type should be associative and commutative. I think these are
>> good principles because they make types simpler to understand and to
>> implement.
>>
>> Method dispatch for `vec_c()` is quite simple because associativity and
>> commutativity mean that we can determine the output type only by
>> considering a pair of inputs at a time. To this end, vctrs provides
>> `vec_type2()` which takes two inputs and returns their common type
>> (represented as zero length vector):
>>
>>     str(vec_type2(integer(), double()))
>>     #> num(0)
>>
>>     str(vec_type2(factor("a"), factor("b")))
>>     #> Factor w/ 2 levels "a","b":
>
> What is the reasoning behind taking the union of the levels here? I'm not
> sure that is actually the behavior I would want if I have a vector of
> factors and I try to append some new data to it. I might want/expect to
> retain the existing levels and get either NAs or an error if the new data
> has (present) levels not in the first data. The behavior as above doesn't
> seem in-line with what I understand the purpose of factors to be (explicit
> restriction of possible values).

Originally (like a week ago), we threw an error if the factors
didn't have the same levels, and provided an optional coercion to
character. I decided that while correct (the factor levels are a
parameter of the type, and hence factors with different levels aren't
comparable), this fights too much against how people actually use
factors in practice. It also seems like base R is moving more in this
direction, i.e. in 3.4 factor("a") == factor("b") is an error, whereas
in R 3.5 it returns FALSE.
I'm not wedded to the current approach, but it feels like the same
principle should apply in comparisons like x == y (even though == is
outside the scope of vctrs, ideally the underlying principles would be
robust enough to suggest what should happen).

> I guess what I'm saying is that while I agree associativity is good for most
> things, it doesn't seem like the right behavior to me in the case of
> factors.

I think associativity is such a strong and useful principle that it
may be worth making some sacrifices for factors. That said, my claim
of associativity is only on the type, not the values of the type:
vec_c(fa, fb) and vec_c(fb, fa) both return factors, but the levels
are in different orders.

> Also, while we're on factors, what does
>
> vec_type2(factor("a"), "a")
>
> return, character or factor with levels "a"?

Character. Coercing to factor would potentially lose too much
information. I think you could argue that this could be an error, but
again I feel like this would make the type system a little too strict
and cause extra friction for most uses.

>> # NB: not all types have a common/unifying type
>> str(vec_type2(Sys.Date(), factor("a")))
>> #> Error: No common type for date and factor
>
> Why is this not a list? Do you have the additional constraint that vec_type2
> must return the class of one of its operands? If so, what is the
> justification of that? Are you not counting list as a "type of vector"?

You can always request a list, with `vec_type2(Sys.Date(),
factor("a"), .type = list())` - generally the philosophy is to not
make major changes to the type without explicit user input.

I can't currently fully articulate my reasoning for why some coercions
happen automatically, and why some don't. I think these decisions have
to be made somewhat on the basis of pragmatics, and what R users are
currently familiar with.
You can see a visual summary of implicit casts (arrows) + explicit
casts (circles) at
https://github.com/r-lib/vctrs/blob/master/man/figures/combined.png.
This matrix must be symmetric, and I think it should be block
diagonal, but I don't otherwise know what the constraints are.

>> On top of this foundation, vctrs expands in a few different ways:
>>
>> - To consider the "type" of a data frame, and what the common type of
>>   two data frames should be. This leads to a natural implementation of
>>   `vec_rbind()` which includes all columns that appear in any input.
>
> I must admit I'm a bit surprised here. rbind is one of the few places that
> immediately come to mind where R takes a fail early and loud approach to
> likely errors (as opposed to the more permissive do something that could be
> what they meant approach of, e.g., out-of-bounds indexing). Are we sure we
> want rbind to get less strict with respect to compatibility of the
> data.frames being combined?

Pragmatically, it's clearly needed for data analysis.

Also note that there are some inputs to rbind that lead to silent data loss:

    rbind(data.frame(x = 1:3), c(1, 1000000))
    #>   x
    #> 1 1
    #> 2 2
    #> 3 3
    #> 4 1

So while it's pretty good in general, there are still a few
infelicities (in particular, I suspect R-core might be interested in
fixing this one).

> Another "permissive" option would be to return a
> data.frame which has only the intersection of the columns. There are
> certainly times when that is what I want (rather than columns with tons of
> NAs in them) and it would be convenient not to need to do the column
> subsetting myself. This behavior would also meet your design goals of
> associativity and commutativity.

Yes, I think that would make sense as an option and would be trivial
to implement (issue at https://github.com/r-lib/vctrs/issues/46).

Another thing I need to implement is the ability to specify the types
of some columns.
Currently it's all or nothing:

    vec_rbind(
      data.frame(x = FALSE, y = 1),
      data.frame(x = 1L, y = 2),
      .type = data.frame(x = logical())
    )
    #>       x
    #> 1 FALSE
    #> 2  TRUE
    #> Warning messages:
    #> 1: Lossy conversion from data.frame to data.frame
    #> Dropped variables: y
    #> 2: Lossy conversion from data.frame to data.frame
    #> Dropped variables: y

> I want to be clear, I think what you describe is a useful operation, if it
> is what is intended, but perhaps a different name rather than calling it
> rbind? maybe vec_rcbind to indicate that both rows and columns are being
> potentially added to any given individual input.

Sorry, I should have mentioned that this is unlikely to be the final
name. As well as the problem you mention, I think calling them
vec_cbind() and vec_rbind() over-emphasises the symmetry between the
two operations. cbind() and rbind() are symmetric for matrices, but
for data frames, rbind() is more about common types, and cbind() is
more about common shapes.

Thanks for your feedback, it's very useful!

Hadley

-- 
http://hadley.nz
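
The two column-matching options discussed in the message above (keep the
union of columns and fill the gaps with NA, or keep only the intersection)
can be sketched in base R. This is only an illustration of the idea and not
how `vec_rbind()` is implemented; the helper names `rbind_union()` and
`rbind_intersect()` are hypothetical and exist in neither vctrs nor base R.

```r
# Hypothetical helpers for illustration; not part of vctrs or base R.

# Keep every column that appears in either input, filling gaps with NA.
rbind_union <- function(a, b) {
  cols <- union(names(a), names(b))
  fill <- function(df) {
    for (nm in setdiff(cols, names(df))) {
      df[[nm]] <- rep(NA, nrow(df))
    }
    df[cols]  # put the columns in one agreed order
  }
  rbind(fill(a), fill(b))
}

# Keep only the columns that both inputs share.
rbind_intersect <- function(a, b) {
  cols <- intersect(names(a), names(b))
  rbind(a[cols], b[cols])
}

df1 <- data.frame(x = 1:2, y = c("a", "b"))
df2 <- data.frame(x = 3L, z = TRUE)

names(rbind_union(df1, df2))      # "x" "y" "z"
names(rbind_intersect(df1, df2))  # "x"
```

Both helpers return the same set of columns whichever input comes first,
although the union version orders the columns by first appearance.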
>>> Method dispatch for `vec_c()` is quite simple because associativity and
>>> commutativity mean that we can determine the output type only by
>>> considering a pair of inputs at a time. To this end, vctrs provides
>>> `vec_type2()` which takes two inputs and returns their common type
>>> (represented as zero length vector):
>>>
>>>     str(vec_type2(integer(), double()))
>>>     #> num(0)
>>>
>>>     str(vec_type2(factor("a"), factor("b")))
>>>     #> Factor w/ 2 levels "a","b":
>>
>> What is the reasoning behind taking the union of the levels here? I'm not
>> sure that is actually the behavior I would want if I have a vector of
>> factors and I try to append some new data to it. I might want/expect to
>> retain the existing levels and get either NAs or an error if the new data
>> has (present) levels not in the first data. The behavior as above doesn't
>> seem in-line with what I understand the purpose of factors to be (explicit
>> restriction of possible values).
>
> Originally (like a week ago), we threw an error if the factors
> didn't have the same level, and provided an optional coercion to
> character. I decided that while correct (the factor levels are a
> parameter of the type, and hence factors with different levels aren't
> comparable), that this fights too much against how people actually use
> factors in practice. It also seems like base R is moving more in this
> direction, i.e. in 3.4 factor("a") == factor("b") is an error, whereas
> in R 3.5 it returns FALSE.

I now have a better argument, I think:

If you squint your brain a little, I think you can see that each set
of automatic coercions is about increasing resolution. Integers are
low resolution versions of doubles, and dates are low resolution
versions of date-times. Logicals are low resolution versions of
integers because there's a strong convention that `TRUE` and `FALSE`
can be used interchangeably with `1` and `0`.

But what is the resolution of a factor?
We must take a somewhat pragmatic approach because base R often
converts character vectors to factors, and we don't want to be
burdensome to users. So we say that a factor `x` has finer resolution
than factor `y` if the levels of `y` are contained in `x`. So to find
the common type of two factors, we take the union of the levels of
each factor, giving a factor that has finer resolution than both.
Finally, you can think of a character vector as a factor with every
possible level, so factors and character vectors are coercible.

(extracted from the in-progress vignette explaining how to extend
vctrs to work with your own vctrs, now that vctrs has been rewritten
to use double dispatch)

Hadley

-- 
http://hadley.nz
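
The "union of levels" rule described above can be sketched in base R. This
is only a sketch of the idea, not the vctrs implementation; the helper name
`common_factor_type()` is hypothetical, and it returns a zero-length factor
to mirror the way the thread describes a type being represented.

```r
# Hypothetical helper; a base-R sketch of the idea, not vctrs's code.
common_factor_type <- function(x, y) {
  stopifnot(is.factor(x), is.factor(y))
  # The common type is a zero-length factor whose levels are the union.
  factor(character(), levels = union(levels(x), levels(y)))
}

fa <- factor("a")
fb <- factor("b")

levels(common_factor_type(fa, fb))  # "a" "b"
levels(common_factor_type(fb, fa))  # "b" "a"

# Either argument order gives a factor with the same set of levels;
# only the order of the levels differs.
setequal(levels(common_factor_type(fa, fb)),
         levels(common_factor_type(fb, fa)))  # TRUE
```

Under this sketch each input's levels are contained in the result's levels,
which is the sense in which the result has "finer resolution" than both.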
>>>>> Hadley Wickham
>>>>>     on Wed, 8 Aug 2018 09:34:42 -0500 writes:

    >>>> Method dispatch for `vec_c()` is quite simple because
    >>>> associativity and commutativity mean that we can
    >>>> determine the output type only by considering a pair of
    >>>> inputs at a time. To this end, vctrs provides
    >>>> `vec_type2()` which takes two inputs and returns their
    >>>> common type (represented as zero length vector):
    >>>>
    >>>>     str(vec_type2(integer(), double()))
    >>>>     #> num(0)
    >>>>
    >>>>     str(vec_type2(factor("a"), factor("b")))
    >>>>     #> Factor w/ 2 levels "a","b":
    >>>
    >>> What is the reasoning behind taking the union of the
    >>> levels here? I'm not sure that is actually the behavior
    >>> I would want if I have a vector of factors and I try to
    >>> append some new data to it. I might want/expect to
    >>> retain the existing levels and get either NAs or an
    >>> error if the new data has (present) levels not in the
    >>> first data. The behavior as above doesn't seem in-line
    >>> with what I understand the purpose of factors to be
    >>> (explicit restriction of possible values).
    >>
    >> Originally (like a week ago), we threw an error if the
    >> factors didn't have the same level, and provided an
    >> optional coercion to character. I decided that while
    >> correct (the factor levels are a parameter of the type,
    >> and hence factors with different levels aren't
    >> comparable), that this fights too much against how people
    >> actually use factors in practice. It also seems like base
    >> R is moving more in this direction, i.e. in 3.4
    >> factor("a") == factor("b") is an error, whereas in R 3.5
    >> it returns FALSE.

    > I now have a better argument, I think:

    > If you squint your brain a little, I think you can see
    > that each set of automatic coercions is about increasing
    > resolution. Integers are low resolution versions of
    > doubles, and dates are low resolution versions of
    > date-times.
    > Logicals are low resolution versions of
    > integers because there's a strong convention that `TRUE`
    > and `FALSE` can be used interchangeably with `1` and `0`.

    > But what is the resolution of a factor? We must take a
    > somewhat pragmatic approach because base R often converts
    > character vectors to factors, and we don't want to be
    > burdensome to users. So we say that a factor `x` has finer
    > resolution than factor `y` if the levels of `y` are
    > contained in `x`. So to find the common type of two
    > factors, we take the union of the levels of each factor,
    > giving a factor that has finer resolution than
    > both. Finally, you can think of a character vector as a
    > factor with every possible level, so factors and character
    > vectors are coercible.

    > (extracted from the in-progress vignette explaining how to
    > extend vctrs to work with your own vctrs, now that vctrs
    > has been rewritten to use double dispatch)

I like this argumentation, and find it very nice indeed!
It confirms my own gut feeling, which had led me to agree
with you, Hadley, that taking the union of all factor levels
should be done here.

As Gabe mentioned (and you've explained), the term "type"
is really confusing here. As you know, the R internals are all
about SEXPs, TYPEOF(), etc, and that's what the R level
typeof(.) also returns. As you want to use something slightly
different, it should have a different name, ideally something
not existing yet in the R / S world, maybe 'kind'?

Martin

    > Hadley

    > --
    > http://hadley.nz

    > ______________________________________________
    > R-devel at r-project.org mailing list
    > https://stat.ethz.ch/mailman/listinfo/r-devel
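
The naming concern raised above is easy to see in base R, where `typeof()`
reports the internal storage type and `class()` reports the S3 class. A
short illustration using only base functions; the variable names are
arbitrary:

```r
f <- factor("a")
d <- as.Date("2018-08-06")

# Internal storage type versus S3 class for a factor.
typeof(f)  # "integer"
class(f)   # "factor"

# The same distinction for a Date.
typeof(d)  # "double"
class(d)   # "Date"
```

Neither function alone captures the notion discussed in the thread, which
is described earlier as a hybrid of the two.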
Hadley,

Responses inline.

On Wed, Aug 8, 2018 at 7:34 AM, Hadley Wickham <h.wickham at gmail.com> wrote:

>>>> Method dispatch for `vec_c()` is quite simple because associativity and
>>>> commutativity mean that we can determine the output type only by
>>>> considering a pair of inputs at a time. To this end, vctrs provides
>>>> `vec_type2()` which takes two inputs and returns their common type
>>>> (represented as zero length vector):
>>>>
>>>>     str(vec_type2(integer(), double()))
>>>>     #> num(0)
>>>>
>>>>     str(vec_type2(factor("a"), factor("b")))
>>>>     #> Factor w/ 2 levels "a","b":
>>>
>>> What is the reasoning behind taking the union of the levels here? I'm not
>>> sure that is actually the behavior I would want if I have a vector of
>>> factors and I try to append some new data to it. I might want/expect to
>>> retain the existing levels and get either NAs or an error if the new data
>>> has (present) levels not in the first data. The behavior as above doesn't
>>> seem in-line with what I understand the purpose of factors to be (explicit
>>> restriction of possible values).
>>
>> Originally (like a week ago), we threw an error if the factors
>> didn't have the same level, and provided an optional coercion to
>> character. I decided that while correct (the factor levels are a
>> parameter of the type, and hence factors with different levels aren't
>> comparable), that this fights too much against how people actually use
>> factors in practice. It also seems like base R is moving more in this
>> direction, i.e. in 3.4 factor("a") == factor("b") is an error, whereas
>> in R 3.5 it returns FALSE.
>
> I now have a better argument, I think:
>
> If you squint your brain a little, I think you can see that each set
> of automatic coercions is about increasing resolution. Integers are
> low resolution versions of doubles, and dates are low resolution
> versions of date-times.
> Logicals are low resolution versions of
> integers because there's a strong convention that `TRUE` and `FALSE`
> can be used interchangeably with `1` and `0`.
>
> But what is the resolution of a factor? We must take a somewhat
> pragmatic approach because base R often converts character vectors to
> factors, and we don't want to be burdensome to users.

I don't know, I personally just don't buy this line of reasoning. Yes, you
can convert between characters and factors, but that doesn't make factors
"a special kind of character", which you seem to be implicitly arguing they
are. Fundamentally they are different objects with different purposes. As I
said in my previous email, the primary semantic purpose of factors is value
restriction. You don't WANT to increase the set of levels when your set of
values has already been carefully curated. Certainly not automagically.

> So we say that a
> factor `x` has finer resolution than factor `y` if the levels of `y`
> are contained in `x`. So to find the common type of two factors, we
> take the union of the levels of each factor, giving a factor that has
> finer resolution than both.

I'm not so sure. I think a more useful definition of resolution may be
that it is about increasing the precision of information. In that case, a
factor with 4 levels, each of which is present, has a *higher* resolution
than the same data with additional-but-absent levels on the factor object.
Now that may be different when the new levels are not absent, but my point
is that it's not clear to me that resolution is a useful way of talking
about factors.

> Finally, you can think of a character
> vector as a factor with every possible level, so factors and character
> vectors are coercible.

If users want unrestricted character type behavior, then IMHO they should
just be using characters, and it's quite easy for them to do so in any case
I can easily think of where they have somehow gotten their hands on a
factor.
If, however, they want a factor, it must be - I imagine - because they
actually want the semantics and behavior *specific* to factors.

Best,
~G

> (extracted from the in-progress vignette explaining how to extend
> vctrs to work with your own vctrs, now that vctrs has been rewritten
> to use double dispatch)
>
> Hadley
>
> --
> http://hadley.nz

-- 
Gabriel Becker, Ph.D
Scientist
Bioinformatics and Computational Biology
Genentech Research
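
The "value restriction" semantics argued for above can be seen in base R,
where assigning a value outside a factor's declared levels produces NA
rather than silently widening the level set. A minimal illustration using
only base functions; the level names are made up for the example:

```r
f <- factor(c("low", "high"), levels = c("low", "high"))

# Assigning a value outside the declared levels yields NA. Base R also
# emits an "invalid factor level" warning, suppressed here for brevity.
g <- f
suppressWarnings(g[3] <- "medium")
is.na(g[3])  # TRUE
levels(g)    # "low" "high"

# Widening the set of levels is an explicit, opt-in step.
h <- f
levels(h) <- c(levels(h), "medium")
h[3] <- "medium"
as.character(h[3])  # "medium"
```

The disagreement in the thread is over whether combining two factors should
behave like the first case or perform the second step automatically.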