thr3ads.net - R devel - [Rd] vctrs: a type system for the tidyverse [Aug 2018]

Iñaki Úcar

2018-Aug-08 17:47 UTC

[Rd] vctrs: a type system for the tidyverse

On Wed, 8 Aug 2018 at 19:23, Gabe Becker (<becker.gabe at gene.com>) wrote:
>
> Actually, I sent that too quickly, I should have let it stew a bit more.
> I've changed my mind about the resolution argument I was trying to make.
> There is more information, technically speaking, in the factor with empty
> levels. I'm still not convinced that it's the right behavior, personally. It
> may just be me though, since Martin seems on board. Mostly I'm just very
> wary of taking away the thing about factors that makes them fundamentally
> not characters, and removing the effectiveness of the level restriction, in
> practice, does that.

For what it's worth, I always thought about factors as fundamentally
characters, but with restrictions: a subspace of all possible strings.
And I'd say that a non-negligible number of R users may think about
them in a similar way.

In fact, if you search "concatenation factors", you'll see that back
in 2008 somebody asked on R-help [1] because he wanted to do exactly
what Hadley is describing (i.e., concatenation as character with
levels as a union of the levels), and he was surprised because...
well, the behaviour of c.factor is quite surprising if you don't read
the manual.

BTW, the solution proposed was unlist(list(fct1, fct2)).
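
As a minimal illustration of that workaround (the factor contents here are made up, and only documented base R behaviour is assumed): when every element of the list is a factor, unlist() returns a factor whose levels are the union of the input level sets. c() on factors has behaved differently across R versions, so it is deliberately left out.

```r
# Two factors with different level sets (illustrative data)
fct1 <- factor(c("a", "b"))
fct2 <- factor(c("b", "c"))

# unlist() on a list of factors returns a factor; its levels are the
# union of the input level sets, in the order they occur
combined <- unlist(list(fct1, fct2))

levels(combined)        # "a" "b" "c"
as.character(combined)  # "a" "b" "b" "c"
```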

[1] https://www.mail-archive.com/r-help@r-project.org/msg38360.html

Iñaki
>
> Best,
> ~G
>
> On Wed, Aug 8, 2018 at 8:54 AM, Martin Maechler <maechler at stat.math.ethz.ch>
> wrote:
>
> > >>>>> Hadley Wickham
> > >>>>>     on Wed, 8 Aug 2018 09:34:42 -0500 writes:
> >
> >     >>>> Method dispatch for `vec_c()` is quite simple because
> >     >>>> associativity and commutativity mean that we can
> >     >>>> determine the output type only by considering a pair of
> >     >>>> inputs at a time. To this end, vctrs provides
> >     >>>> `vec_type2()` which takes two inputs and returns their
> >     >>>> common type (represented as zero length vector):
> >     >>>>
> >     >>>> str(vec_type2(integer(), double()))
> >     >>>> #> num(0)
> >     >>>>
> >     >>>> str(vec_type2(factor("a"), factor("b")))
> >     >>>> #> Factor w/ 2 levels "a","b":
> >     >>>
> >     >>>
> >     >>> What is the reasoning behind taking the union of the
> >     >>> levels here? I'm not sure that is actually the behavior
> >     >>> I would want if I have a vector of factors and I try to
> >     >>> append some new data to it. I might want/expect to
> >     >>> retain the existing levels and get either NAs or an
> >     >>> error if the new data has (present) levels not in the
> >     >>> first data. The behavior as above doesn't seem in line
> >     >>> with what I understand the purpose of factors to be
> >     >>> (explicit restriction of possible values).
> >     >>
> >     >> Originally (like a week ago), we threw an error if the
> >     >> factors didn't have the same levels, and provided an
> >     >> optional coercion to character. I decided that while
> >     >> correct (the factor levels are a parameter of the type,
> >     >> and hence factors with different levels aren't
> >     >> comparable), this fights too much against how people
> >     >> actually use factors in practice. It also seems like base
> >     >> R is moving more in this direction, i.e. in R 3.4
> >     >> factor("a") == factor("b") is an error, whereas in R 3.5
> >     >> it returns FALSE.
> >
> >     > I now have a better argument, I think:
> >
> >     > If you squint your brain a little, I think you can see
> >     > that each set of automatic coercions is about increasing
> >     > resolution. Integers are low resolution versions of
> >     > doubles, and dates are low resolution versions of
> >     > date-times. Logicals are low resolution versions of
> >     > integers because there's a strong convention that `TRUE`
> >     > and `FALSE` can be used interchangeably with `1` and `0`.
> >
> >     > But what is the resolution of a factor? We must take a
> >     > somewhat pragmatic approach because base R often converts
> >     > character vectors to factors, and we don't want to be
> >     > burdensome to users. So we say that a factor `x` has finer
> >     > resolution than factor `y` if the levels of `y` are
> >     > contained in `x`. So to find the common type of two
> >     > factors, we take the union of the levels of each factor,
> >     > giving a factor that has finer resolution than
> >     > both. Finally, you can think of a character vector as a
> >     > factor with every possible level, so factors and character
> >     > vectors are coercible.
> >
> >     > (extracted from the in-progress vignette explaining how to
> >     > extend vctrs to work with your own vectors, now that vctrs
> >     > has been rewritten to use double dispatch)
> >
> > I like this argumentation, and find it very nice indeed!
> > It confirms my own gut feeling which had led me to agreeing
> > with you, Hadley, that taking the union of all factor levels
> > should be done here.
> >
> > As Gabe mentioned (and you've explained about) the term "type"
> > is really confusing here.  As you know, the R internals are all
> > about SEXPs, TYPEOF(), etc, and that's what the R level
> > typeof(.) also returns.  As you want to use something slightly
> > different, it should be different naming, ideally something not
> > existing yet in the R / S world, maybe 'kind' ?
> >
> > Martin
> >
> >
> >     > Hadley
> >
> >     > --
> >     > http://hadley.nz
>
>
> --
> Gabriel Becker, Ph.D
> Scientist
> Bioinformatics and Computational Biology
> Genentech Research
>

Joris Meys

2018-Aug-09 08:58 UTC

[Rd] vctrs: a type system for the tidyverse

I sent this to Iñaki personally by mistake. Thank you for notifying me.

On Wed, Aug 8, 2018 at 7:53 PM Iñaki Úcar <i.ucar86 at gmail.com> wrote:
>
> For what it's worth, I always thought about factors as fundamentally
> characters, but with restrictions: a subspace of all possible strings.
> And I'd say that a non-negligible number of R users may think about
> them in a similar way.
>
That idea has been a common source of bugs and the most important reason
why I always explain to my students that factors are a special kind of
numeric (integer), not character. Especially people coming from SPSS see
immediately the link with categorical variables in that way, and understand
that a factor is a modeling aid rather than an alternative for characters.
It is a categorical variable and a more readable way of representing a set
of dummy variables.

I do agree that some of the factor behaviour is confusing at best, but that
doesn't change the appropriate use and meaning of factors as categorical
variables.

Even more, I oppose the ideas that:

1) factors with different levels should be concatenated.

2) when combining factors, the union of the levels would somehow be a good
choice.

Factors with different levels are variables with different information, not
more or less information. If one factor codes low and high and another
codes low, mid and high, you can't say whether mid in one factor would be
low or high in the first one. The second has a higher resolution, and
that's exactly the reason why they should NOT be combined. Different levels
indicate a different grouping, and hence that data should never be used as
one set of dummy variables in any model.

Even when combining factors, the union of levels only makes sense to me if
there's no overlap between levels of both factors. In all other cases, a
researcher will need to determine whether levels with the same label do
mean the same thing in both factors, and that's not guaranteed. And when
we're talking about a factor with a higher resolution and a lower resolution, the
correct thing to do modelwise is to recode one of the factors so they have
the same resolution and every level the same definition before you merge
that data.
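
A rough sketch of that recoding step in base R, using the low/mid/high example from above. The data and the mapping of "mid" onto "high" are assumptions made purely for illustration; which coarse level "mid" belongs to is exactly the decision the researcher has to make.

```r
# Illustrative data: one factor coded low/high, one coded low/mid/high
coarse <- factor(c("low", "high", "low"), levels = c("low", "high"))
fine   <- factor(c("low", "mid", "high"), levels = c("low", "mid", "high"))

# ASSUMPTION for illustration only: the analyst decided that "mid"
# belongs with "high". This is a substantive choice, not a default.
recode_map <- c(low = "low", mid = "high", high = "high")

# Recode the finer factor onto the level set of the coarser one
fine_recoded <- factor(unname(recode_map[as.character(fine)]),
                       levels = levels(coarse))

# Merge only once both factors have the same levels in the same order
stopifnot(identical(levels(coarse), levels(fine_recoded)))
merged <- factor(c(as.character(coarse), as.character(fine_recoded)),
                 levels = levels(coarse))
```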

So imho the combination of two factors with different levels (or even
levels in a different order) should give an error. Which R currently
doesn't throw, so I get there's room for improvement.
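
For concreteness, the strict behaviour argued for here could look roughly like the following base R sketch. `combine_strict` is a hypothetical name, not an existing base R or vctrs function, and the sketch ignores ordered factors.

```r
# Hypothetical strict combiner: refuses to combine factors unless their
# levels are identical, including the order of the levels.
combine_strict <- function(x, y) {
  stopifnot(is.factor(x), is.factor(y))
  if (!identical(levels(x), levels(y))) {
    stop("factors have different levels (or a different level order)")
  }
  factor(c(as.character(x), as.character(y)), levels = levels(x))
}

f1 <- factor(c("low", "high"), levels = c("low", "high"))
f2 <- factor(c("high", "high"), levels = c("low", "high"))
f3 <- factor(c("low", "mid"), levels = c("low", "mid", "high"))

ok  <- combine_strict(f1, f2)                     # same levels: combined
bad <- try(combine_strict(f1, f3), silent = TRUE) # different levels: error
```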

Cheers
Joris
-- 
Joris Meys
Statistical consultant

Department of Data Analysis and Mathematical Modelling
Ghent University
Coupure Links 653, B-9000 Gent (Belgium)

Hadley Wickham

2018-Aug-09 12:36 UTC

[Rd] vctrs: a type system for the tidyverse

On Thu, Aug 9, 2018 at 3:57 AM Joris Meys <jorismeys at gmail.com> wrote:
>
> I sent this to Iñaki personally by mistake. Thank you for notifying me.
>
> On Wed, Aug 8, 2018 at 7:53 PM Iñaki Úcar <i.ucar86 at gmail.com> wrote:
> >
> > For what it's worth, I always thought about factors as fundamentally
> > characters, but with restrictions: a subspace of all possible strings.
> > And I'd say that a non-negligible number of R users may think about
> > them in a similar way.
>
> That idea has been a common source of bugs and the most important reason
> why I always explain to my students that factors are a special kind of
> numeric (integer), not character. Especially people coming from SPSS see
> immediately the link with categorical variables in that way, and understand
> that a factor is a modeling aid rather than an alternative for characters.
> It is a categorical variable and a more readable way of representing a set
> of dummy variables.
>
> I do agree that some of the factor behaviour is confusing at best, but that
> doesn't change the appropriate use and meaning of factors as categorical
> variables.
>
> Even more, I oppose the ideas that:
>
> 1) factors with different levels should be concatenated.
>
> 2) when combining factors, the union of the levels would somehow be a good
> choice.
>
> Factors with different levels are variables with different information, not
> more or less information. If one factor codes low and high and another
> codes low, mid and high, you can't say whether mid in one factor would be
> low or high in the first one. The second has a higher resolution, and
> that's exactly the reason why they should NOT be combined. Different levels
> indicate a different grouping, and hence that data should never be used as
> one set of dummy variables in any model.
>
> Even when combining factors, the union of levels only makes sense to me if
> there's no overlap between levels of both factors. In all other cases, a
> researcher will need to determine whether levels with the same label do
> mean the same thing in both factors, and that's not guaranteed. And when
> we're talking about a factor with a higher resolution and a lower resolution, the
> correct thing to do modelwise is to recode one of the factors so they have
> the same resolution and every level the same definition before you merge
> that data.
>
> So imho the combination of two factors with different levels (or even
> levels in a different order) should give an error. Which R currently
> doesn't throw, so I get there's room for improvement.

I 100% agree with you, and this is the behaviour that vctrs used to
have and dplyr currently has (at least in bind_rows()). But
pragmatically, my experience with dplyr is that people find this
behaviour confusing and unhelpful. And when I played the full
expression of this behaviour in vctrs, I found that it forced me to
think about the levels of factors more than I'd otherwise like to: it
made me think like a programmer, not like a data analyst. So in an
ideal world, yes, I think factors would have stricter behaviour, but
my sense is that imposing this strictness now will be onerous to most
analysts.

Hadley

-- 
http://hadley.nz
