Tibor Kiss
2022-Sep-19 12:07 UTC
[R] Question concerning side effects of treating invalid factor levels
Hi,

this is a misunderstanding of my question. I wasn't worried about invalid factor levels that produce NA. My question was why a column changes its class, which I thought was a side effect.

If you build a vector with c() that contains even one character string, the whole vector becomes _chr_. When that vector is then added as a row, the two factor columns get NA, and the third element, which is now a character string, is responsible for turning the numeric column into a character column (see ?c, where you find: "The output type is determined from the highest type of the components in the hierarchy NULL < raw < logical < integer < double < complex < character < list < expression.").

Best

Tibor

> On 19.09.2022 at 13:59, Ebert,Timothy Aaron <tebert at ufl.edu> wrote:
>
> In your example code, the variable remains of class factor, and all entries are valid. The variables will behave as expected given the factor levels in the original data frame.
>
> (At least on my system: R 4.2, in RStudio, on Windows.) R returns a couple of warnings telling me that I was bad.
> What you get is NA for "not available", or "not appropriate", or a missing value. You gave the system an invalid factor level, so it was entered as missing. If you get data that has a new factor level, you need to tell R to expect a new factor level first.
>
> levels(f1) <- c(levels(f1), "New Level")
> levels(f1) <- c(levels(f1), c("NL1", "NL2"))
>
> Tim
>
> -----Original Message-----
> From: R-help <r-help-bounces at r-project.org> On Behalf Of Tibor Kiss via R-help
> Sent: Monday, September 19, 2022 6:11 AM
> To: r-help at r-project.org
> Subject: [R] Question concerning side effects of treating invalid factor levels
>
> Dear List members,
>
> I have tried several times now to find out about a side effect of treating invalid factor levels, but did not find an answer. Various answers on stackexchange etc. produce the behaviour that irritates me without even mentioning it.
> So I am asking the list (apologies if this has been treated in the past).
>
> If you add an invalid factor level to a column in a data frame, this has the side effect of turning a numerical column into a column with character strings. Here is a simple example:
>
> > df <- data.frame(
>     P = factor(c("mittels", "mit", "mittels", "ueber", "mit", "mit")),
>     ANSWER = factor(c(rep("PP>OBJ", 4), rep("OBJ>PP", 2))),
>     RT = round(runif(6, 7000, 16000), 0))
>
> > str(df)
> 'data.frame': 6 obs. of 3 variables:
>  $ P     : Factor w/ 3 levels "mit","mittels",..: 2 1 2 3 1 1
>  $ ANSWER: Factor w/ 2 levels "OBJ>PP","PP>OBJ": 2 2 2 2 1 1
>  $ RT    : num 11157 13719 14388 14527 14686 ..
>
> > df <- rbind(df, c("in", "V>N", round(runif(1, 7000, 16000), 0)))
>
> > str(df)
> 'data.frame': 7 obs. of 3 variables:
>  $ P     : Factor w/ 3 levels "mit","mittels",..: 2 1 2 3 1 1 NA
>  $ ANSWER: Factor w/ 2 levels "OBJ>PP","PP>OBJ": 2 2 2 2 1 1 NA
>  $ RT    : chr "11478" "15819" "8305" "8852" ...
>
> You see that RT has changed from _num_ to _chr_ as a side effect of adding the invalid factor level as NA. I would appreciate understanding what the purpose of the type coercion is.
>
> Thanks in advance
>
> Tibor
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.r-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
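The coercion rule quoted from ?c in the message above can be checked directly at the console. The following is only a minimal sketch: the value 12345 is an arbitrary stand-in for the randomly drawn RT value, and the name new_row is made up for illustration.

```r
# c() returns an atomic vector, so mixed inputs are coerced to the
# highest type in the hierarchy quoted from ?c.
new_row <- c("in", "V>N", 12345)  # 12345: arbitrary stand-in for an RT value

typeof(new_row)       # "character": the number was converted to a string
new_row[3]            # "12345"

# The same rule applies further down the hierarchy.
typeof(c(TRUE, 1L))   # "integer"
typeof(c(1L, 2.5))    # "double"
typeof(c(2.5, "a"))   # "character"
```

So the number is already a string before rbind() receives the vector, which is consistent with RT ending up as _chr_ in the str() output above.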
Sarah Goslee
2022-Sep-20 13:02 UTC
[R] Question concerning side effects of treating invalid factor levels
Hi Tibor,

No, you are misunderstanding the source of the problem. It has nothing to do with factors. Instead, it has to do with the fact that an atomic vector cannot hold more than one type. You are using rbind() to add a new row to your data frame, but the vector you pass has already been coerced to character by c(). That's what forces your numeric column to become character: you're adding a character value to it.

> c("in", "V>N", round(runif(1, 7000, 16000), 0))
[1] "in"    "V>N"   "15709"

It has nothing whatsoever to do with factors or factor levels, and would occur just the same if P and ANSWER were character columns rather than factors.

If you want to mix types, you cannot use a vector. Use a one-row data frame instead:

> c2 <- data.frame(P = "in", ANSWER = "V>N", RT = round(runif(1, 7000, 16000), 0))
> str(rbind(df, c2))
'data.frame': 7 obs. of 3 variables:
 $ P     : Factor w/ 4 levels "mit","mittels",..: 2 1 2 3 1 1 4
 $ ANSWER: Factor w/ 3 levels "OBJ>PP","PP>OBJ",..: 2 2 2 2 1 1 3
 $ RT    : num 10867 14808 11600 15881 8984 ...

Sarah

On Tue, Sep 20, 2022 at 8:45 AM Tibor Kiss via R-help <r-help at r-project.org> wrote:
>
> Hi,
>
> this is a misunderstanding of my question. I wasn't worried about invalid factor levels that produce NA. My question was why a column changes its class, which I thought was a side effect. If you add a vector containing one character string, the class of the whole vector becomes _chr_. And after this element has been added to a column, we have two NAs for the column which are factors, and a character string, which is responsible for the change of a numerical vector into a character string vector (see ?c, where you find: "The output type is determined from the highest type of the components in the hierarchy NULL < raw < logical < integer < double < complex < character < list < expression.").
>
> Best
>
> Tibor

--
Sarah Goslee (she/her)
http://www.numberwright.com
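To compare the two ways of appending a row side by side, here is a small sketch based on the example from this thread. The RT values are fixed, arbitrary numbers chosen so that the result is reproducible (the original used runif()), and the names new_row, bad and good are made up for illustration.

```r
# Example data from the thread, with fixed (arbitrary) RT values.
df <- data.frame(
  P = factor(c("mittels", "mit", "mittels", "ueber", "mit", "mit")),
  ANSWER = factor(c(rep("PP>OBJ", 4), rep("OBJ>PP", 2))),
  RT = c(11157, 13719, 14388, 14527, 14686, 9000)
)

# Appending an atomic vector: c() has already turned 15709 into "15709",
# so RT becomes character. suppressWarnings() hides the invalid factor
# level warnings, which are a separate issue.
bad <- suppressWarnings(rbind(df, c("in", "V>N", 15709)))
class(bad$RT)        # "character"

# Appending a one-row data frame: each column keeps its own type.
new_row <- data.frame(P = "in", ANSWER = "V>N", RT = 15709)
good <- rbind(df, new_row)
class(good$RT)       # "numeric"
levels(good$P)       # now also contains "in"
```

With the one-row data frame, RT stays numeric and the new values are added as factor levels instead of becoming NA, which matches the str() output shown in the reply above.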