On Thu, Aug 9, 2018 at 3:57 AM Joris Meys <jorismeys at gmail.com> wrote:
>
> I sent this to Iñaki personally by mistake. Thank you for notifying me.
>
> On Wed, Aug 8, 2018 at 7:53 PM Iñaki Úcar <i.ucar86 at gmail.com> wrote:
>
> >
> > For what it's worth, I always thought about factors as fundamentally
> > characters, but with restrictions: a subspace of all possible strings.
> > And I'd say that a non-negligible number of R users may think about
> > them in a similar way.
> >
>
> That idea has been a common source of bugs and the most important reason
> why I always explain to my students that factors are a special kind of
> numeric (integer), not character. Especially people coming from SPSS see
> immediately the link with categorical variables in that way, and understand
> that a factor is a modeling aid rather than an alternative for characters.
> It is a categorical variable and a more readable way of representing a set
> of dummy variables.
>
> I do agree that some of the factor behaviour is confusing at best, but that
> doesn't change the appropriate use and meaning of factors as categorical
> variables.
>
> Even more, I oppose the ideas that:
>
> 1) factors with different levels should be concatenated.
>
> 2) when combining factors, the union of the levels would somehow be a good
> choice.
>
> Factors with different levels are variables with different information, not
> more or less information. If one factor codes low and high and another
> codes low, mid and high, you can't say whether mid in one factor would be
> low or high in the first one. The second has a higher resolution, and
> that's exactly the reason why they should NOT be combined. Different levels
> indicate a different grouping, and hence that data should never be used as
> one set of dummy variables in any model.
>
> Even when combining factors, the union of levels only makes sense to me if
> there's no overlap between levels of both factors. In all other cases, a
> researcher will need to determine whether levels with the same label do
> mean the same thing in both factors, and that's not guaranteed. And when
> we're talking about a factor with a higher resolution and a lower
> resolution, the correct thing to do modelwise is to recode one of the
> factors so they have the same resolution and every level the same
> definition before you merge that data.
>
> So imho the combination of two factors with different levels (or even
> levels in a different order) should give an error. Which R currently
> doesn't throw, so I get there's room for improvement.

I 100% agree with you, and this is the behaviour that vctrs used to
have and dplyr currently has (at least in bind_rows()). But
pragmatically, my experience with dplyr is that people find this
behaviour confusing and unhelpful. And when I played the full
expression of this behaviour in vctrs, I found that it forced me to
think about the levels of factors more than I'd otherwise like to: it
made me think like a programmer, not like a data analyst.

So in an ideal world, yes, I think factors would have stricter
behaviour, but my sense is that imposing this strictness now will be
onerous to most analysts.

Hadley

-- 
http://hadley.nz

Hi Hadley,

my point actually came from a data analyst point of view. A character
variable is something used for extra information, e.g. the "any other
ideas?" field of a questionnaire. A categorical variable is a variable
describing categories defined by the researcher. If it is made clear that a
factor is the object type needed for a categorical variable, there is no
confusion. All my students get it. But I agree that in many cases people are
taught that a factor is somehow related to character variables. And that
does not make sense from a data analyst point of view if you think about
variables as continuous, ordinal and nominal in a model context.

So I don't think adding more confusing behaviour and pitfalls is a solution
to something that's essentially a misunderstanding. It's something that's
only solved by explaining it correctly imho.

Cheers
Joris

On Thu, Aug 9, 2018 at 2:36 PM Hadley Wickham <h.wickham at gmail.com> wrote:
>
> I 100% agree with you, and this is the behaviour that vctrs used to
> have and dplyr currently has (at least in bind_rows()). But
> pragmatically, my experience with dplyr is that people find this
> behaviour confusing and unhelpful. And when I played the full
> expression of this behaviour in vctrs, I found that it forced me to
> think about the levels of factors more than I'd otherwise like to: it
> made me think like a programmer, not like a data analyst. So in an
> ideal world, yes, I think factors would have stricter behaviour, but
> my sense is that imposing this strictness now will be onerous to most
> analysts.
>
> Hadley
>
> --
> http://hadley.nz
>

-- 
Joris Meys
Statistical consultant

Department of Data Analysis and Mathematical Modelling
Ghent University
Coupure Links 653, B-9000 Gent (Belgium)

On Thu, Aug 9, 2018 at 7:54 AM Joris Meys <jorismeys at gmail.com> wrote:
>
> Hi Hadley,
>
> my point actually came from a data analyst point of view. A character
> variable is something used for extra information, e.g. the "any other
> ideas?" field of a questionnaire. A categorical variable is a variable
> describing categories defined by the researcher. If it is made clear
> that a factor is the object type needed for a categorical variable,
> there is no confusion. All my students get it. But I agree that in many
> cases people are taught that a factor is somehow related to character
> variables. And that does not make sense from a data analyst point of
> view if you think about variables as continuous, ordinal and nominal in
> a model context.
>
> So I don't think adding more confusing behaviour and pitfalls is a
> solution to something that's essentially a misunderstanding. It's
> something that's only solved by explaining it correctly imho.

I agree with your definition of character and factor variables. It's
an important distinction, and I agree that the blurring of factors and
characters is generally undesirable. However, the merits of respecting
R's existing behaviour, and Martin Mächler's support, mean that I'm
not going to change vctrs's approach at this point in time.

However, I hear from you and Gabe that this is an important issue, and
I'll definitely keep it in mind as I solicit further feedback from
users.

Hadley

-- 
http://hadley.nz
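
To make the positions in this thread concrete, here is a minimal base R sketch of the "strict" view Joris describes: a factor is integer codes plus a set of levels, and two factors are only combined after they have been recoded to the same levels. The data and the helper `combine_strict()` are hypothetical illustrations, not functions from base R, dplyr or vctrs, and the choice to fold "mid" into "low" is an arbitrary assumption that a researcher would have to make for real data. The sketch deliberately avoids calling `c()` on factors, whose result has differed between R versions.

```r
# Hypothetical example data: the same kind of measurement at two resolutions.
coarse <- factor(c("low", "high", "low"), levels = c("low", "high"))
fine   <- factor(c("low", "mid", "high"), levels = c("low", "mid", "high"))

# A factor is stored as integer codes plus a "levels" attribute.
typeof(fine)      # "integer"
as.integer(fine)  # 1 2 3
levels(fine)      # "low" "mid" "high"

# In a model, a factor is expanded into dummy (indicator) columns.
model.matrix(~ fine)

# Hypothetical helper: refuse to combine factors unless their levels
# (including their order) are identical.
combine_strict <- function(x, y) {
  stopifnot(is.factor(x), is.factor(y))
  if (!identical(levels(x), levels(y))) {
    stop("factors have different levels; recode them to a common definition first")
  }
  factor(c(as.character(x), as.character(y)), levels = levels(x))
}

# Recode the finer factor to the coarser resolution before merging.
# Folding "mid" into "low" is an assumption made only for this example.
fine_recoded <- fine
levels(fine_recoded) <- list(low = c("low", "mid"), high = "high")

combined <- combine_strict(coarse, fine_recoded)
```

With this helper, `combine_strict(coarse, fine)` stops with an error, while the recoded version yields a single factor with levels `low` and `high`. The more permissive alternative discussed above would instead take the union of the levels.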