thr3ads.net - R devel - [Rd] vctrs: a type system for the tidyverse [Aug 2018]

Hadley Wickham

2018-Aug-09 12:36 UTC

[Rd] vctrs: a type system for the tidyverse

On Thu, Aug 9, 2018 at 3:57 AM Joris Meys <jorismeys at gmail.com> wrote:
>
> I sent this to Iñaki personally by mistake. Thank you for notifying me.
>
> On Wed, Aug 8, 2018 at 7:53 PM Iñaki Úcar <i.ucar86 at gmail.com> wrote:
>
> > For what it's worth, I always thought about factors as fundamentally
> > characters, but with restrictions: a subspace of all possible strings.
> > And I'd say that a non-negligible number of R users may think about
> > them in a similar way.
> >
>
> That idea has been a common source of bugs and the most important reason
> why I always explain to my students that factors are a special kind of
> numeric (integer), not character. Especially people coming from SPSS see
> immediately the link with categorical variables in that way, and understand
> that a factor is a modeling aid rather than an alternative for characters.
> It is a categorical variable and a more readable way of representing a set
> of dummy variables.
>
> I do agree that some of the factor behaviour is confusing at best, but that
> doesn't change the appropriate use and meaning of factors as categorical
> variables.
>
> Even more, I oppose the ideas that:
>
> 1) factors with different levels should be concatenated.
>
> 2) when combining factors, the union of the levels would somehow be a good
> choice.
>
> Factors with different levels are variables with different information, not
> more or less information. If one factor codes low and high and another
> codes low, mid and high, you can't say whether mid in the second factor
> would be low or high in the first one. The second has a higher resolution,
> and that's exactly the reason why they should NOT be combined. Different
> levels indicate a different grouping, and hence that data should never be
> used as one set of dummy variables in any model.
>
> Even when combining factors, the union of levels only makes sense to me if
> there's no overlap between levels of both factors. In all other cases, a
> researcher will need to determine whether levels with the same label do
> mean the same thing in both factors, and that's not guaranteed. And when
> we're talking about a factor with a higher resolution and one with a lower
> resolution, the correct thing to do modelwise is to recode one of the
> factors so they have the same resolution and every level the same
> definition before you merge that data.
>
> So imho the combination of two factors with different levels (or even
> levels in a different order) should give an error. Which R currently
> doesn't throw, so I get there's room for improvement.
I 100% agree with you, and this is the behaviour that vctrs used to
have and dplyr currently has (at least in bind_rows()). But
pragmatically, my experience with dplyr is that people find this
behaviour confusing and unhelpful. And when I played the full
expression of this behaviour in vctrs, I found that it forced me to
think about the levels of factors more than I'd otherwise like to: it
made me think like a programmer, not like a data analyst. So in an
ideal world, yes, I think factors would have stricter behaviour, but
my sense is that imposing this strictness now will be onerous to most
analysts.
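
To make the two positions concrete, here is a minimal base R sketch. The helpers combine_strict() and combine_union() are hypothetical names invented for this illustration; they are not functions from vctrs, dplyr, or base R, and the choice to fold "mid" into "low" is an arbitrary stand-in for a decision only the researcher can make.

```r
# Hypothetical helper (not from base R, vctrs, or dplyr): combine two factors
# only when their levels are identical, including their order.
combine_strict <- function(x, y) {
  stopifnot(is.factor(x), is.factor(y))
  if (!identical(levels(x), levels(y))) {
    stop("factors have different levels; recode them before combining")
  }
  factor(c(as.character(x), as.character(y)), levels = levels(x))
}

# Hypothetical lenient alternative: keep all values and use the union of
# the two level sets.
combine_union <- function(x, y) {
  stopifnot(is.factor(x), is.factor(y))
  all_levels <- union(levels(x), levels(y))
  factor(c(as.character(x), as.character(y)), levels = all_levels)
}

coarse   <- factor(c("low", "high"), levels = c("low", "high"))
detailed <- factor(c("low", "mid", "high"), levels = c("low", "mid", "high"))

# Strict: the mismatch is reported instead of being silently merged.
strict_result <- tryCatch(
  combine_strict(coarse, detailed),
  error = function(e) conditionMessage(e)
)

# Lenient: succeeds, but "mid" now sits next to a coding that never had it.
union_result <- combine_union(coarse, detailed)

# Recoding first (illustrative assumption: the researcher decides that "mid"
# belongs with "low") brings both factors to the same resolution.
recoded <- detailed
levels(recoded) <- list(low = c("low", "mid"), high = "high")
merged <- combine_strict(coarse, recoded)
```

The strict version pushes the recoding decision onto the analyst, which is the extra thinking about levels described above; the union version avoids that at the cost of mixing two codings.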

Hadley

-- 
http://hadley.nz

Joris Meys

2018-Aug-09 12:55 UTC

Hi Hadley,

my point actually came from a data analyst point of view. A character
variable is something used for extra information, eg the "any other ideas?"
field of a questionnaire. A categorical variable is a variable describing
categories defined by the researcher. If it is made clear that a factor is
the object type needed for a categorical variable, there is no confusion.
All my students get it. But I agree that in many cases people are taught
that a factor is somehow related to character variables. And that does not
make sense from a data analyst point of view if you think about variables
as continuous, ordinal and nominal in a model context.
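
As a small illustration of this view, the base R sketch below shows that a factor is stored as integer codes plus a set of levels, and that a model formula expands it into dummy (indicator) columns. The variable name severity and its values are made up for the example.

```r
# Illustrative data: a nominal variable with researcher-defined categories.
severity <- factor(c("low", "high", "mid", "low"),
                   levels = c("low", "mid", "high"))

storage <- typeof(severity)      # storage type of the underlying codes
codes   <- as.integer(severity)  # positions within levels(severity)
labels  <- levels(severity)

# In a model context the factor becomes a set of dummy variables
# (default treatment contrasts, first level as the reference).
dummies <- model.matrix(~ severity)
colnames(dummies)
```

A character vector has no such fixed set of categories, which is why a free-text field and a categorical variable behave differently once they reach a model.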

So I don't think adding more confusing behaviour and pitfalls is a solution
to something that's essentially a misunderstanding. It's something that's
only solved by explaining it correctly imho.

Cheers
Joris

On Thu, Aug 9, 2018 at 2:36 PM Hadley Wickham <h.wickham at gmail.com> wrote:
>
> I 100% agree with you, and this is the behaviour that vctrs used to
> have and dplyr currently has (at least in bind_rows()). But
> pragmatically, my experience with dplyr is that people find this
> behaviour confusing and unhelpful. And when I played the full
> expression of this behaviour in vctrs, I found that it forced me to
> think about the levels of factors more than I'd otherwise like to: it
> made me think like a programmer, not like a data analyst. So in an
> ideal world, yes, I think factors would have stricter behaviour, but
> my sense is that imposing this strictness now will be onerous to most
> analysts.
>
> Hadley
>
> --
> http://hadley.nz
>

-- 
Joris Meys
Statistical consultant

Department of Data Analysis and Mathematical Modelling
Ghent University
Coupure Links 653, B-9000 Gent (Belgium)

-----------
Biowiskundedagen 2017-2018
http://www.biowiskundedagen.ugent.be/

-------------------------------
Disclaimer : http://helpdesk.ugent.be/e-maildisclaimer.php


Hadley Wickham

2018-Aug-09 14:30 UTC

On Thu, Aug 9, 2018 at 7:54 AM Joris Meys <jorismeys at gmail.com> wrote:
>
> Hi Hadley,
>
> my point actually came from a data analyst point of view. A character
> variable is something used for extra information, eg the "any other
> ideas?" field of a questionnaire. A categorical variable is a variable
> describing categories defined by the researcher. If it is made clear that
> a factor is the object type needed for a categorical variable, there is no
> confusion. All my students get it. But I agree that in many cases people
> are taught that a factor is somehow related to character variables. And
> that does not make sense from a data analyst point of view if you think
> about variables as continuous, ordinal and nominal in a model context.
>
> So I don't think adding more confusing behaviour and pitfalls is a
> solution to something that's essentially a misunderstanding. It's
> something that's only solved by explaining it correctly imho.
I agree with your definition of character and factor variables. It's
an important distinction, and I agree that the blurring of factors and
characters is generally undesirable. However, the merits of respecting
R's existing behaviour, and Martin Mächler's support, mean that I'm
not going to change vctrs's approach at this point in time. However, I
hear from you and Gabe that this is an important issue, and I'll
definitely keep it in mind as I solicit further feedback from users.
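
One place where base R already takes the union of levels is rbind() on data frames with factor columns. Whether this is the specific existing behaviour meant here is an assumption; the sketch below only shows what rbind() does, and the data frame and column names are made up.

```r
# Two illustrative data frames whose factor columns use different codings.
survey_a <- data.frame(
  level = factor(c("low", "high"), levels = c("low", "high"))
)
survey_b <- data.frame(
  level = factor(c("low", "mid", "high"), levels = c("low", "mid", "high"))
)

# rbind() on data frames expands factor levels as needed rather than
# raising an error for the mismatch.
combined <- rbind(survey_a, survey_b)

is.factor(combined$level)
levels(combined$level)
```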

Hadley

-- 
http://hadley.nz
