Ebert,Timothy Aaron
2022-Sep-19 11:59 UTC
[R] Question concerning side effects of treating invalid factor levels
In your example code, the variables remain of class factor, and all entries are valid. The variables will behave as expected given the factor levels in the original data frame.

(At least on my system: R 4.2, in RStudio, on Windows.) R returns a couple of warning messages telling me that I was bad. What you get is NA ("not available"), that is, a missing value. You gave the system an invalid factor level, so it was entered as missing. If you get data that has a new factor level, you need to tell R to expect the new level first.

levels(f1) <- c(levels(f1), "New Level")
levels(f1) <- c(levels(f1), c("NL1", "NL2"))

Tim

-----Original Message-----
From: R-help <r-help-bounces at r-project.org> On Behalf Of Tibor Kiss via R-help
Sent: Monday, September 19, 2022 6:11 AM
To: r-help at r-project.org
Subject: [R] Question concerning side effects of treating invalid factor levels

Dear List members,

I have tried several times to find out about a side effect of treating invalid factor levels, but did not find an answer. Various answers on stackexchange etc. show the behavior that puzzles me without even mentioning it. So I am asking the list (apologies if this has been treated in the past).

If you add an invalid factor level to a column in a data frame, this has the side effect of turning a numerical column into a column of character strings. Here is a simple example:

> df <- data.frame(
    P = factor(c("mittels", "mit", "mittels", "ueber", "mit", "mit")),
    ANSWER = factor(c(rep("PP>OBJ", 4), rep("OBJ>PP", 2))),
    RT = round(runif(6, 7000, 16000), 0))

> str(df)
'data.frame': 6 obs. of 3 variables:
 $ P     : Factor w/ 3 levels "mit","mittels",..: 2 1 2 3 1 1
 $ ANSWER: Factor w/ 2 levels "OBJ>PP","PP>OBJ": 2 2 2 2 1 1
 $ RT    : num 11157 13719 14388 14527 14686 ..

> df <- rbind(df, c("in", "V>N", round(runif(1, 7000, 16000), 0)))

> str(df)
'data.frame': 7 obs. of 3 variables:
 $ P     : Factor w/ 3 levels "mit","mittels",..: 2 1 2 3 1 1 NA
 $ ANSWER: Factor w/ 2 levels "OBJ>PP","PP>OBJ": 2 2 2 2 1 1 NA
 $ RT    : chr "11478" "15819" "8305" "8852" ...

You see that RT has changed from _num_ to _chr_ as a side effect of adding the invalid factor level as NA. I would appreciate understanding what the purpose of the type coercion is.

Thanks in advance

Tibor

______________________________________________
R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.r-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
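Tim's advice above, to declare a new level before using it, can be sketched as follows. This is a minimal illustration rather than code from the thread: the factor f1 and its values are hypothetical, and suppressWarnings() is used only to keep the invalid assignment quiet.

```r
# f1 is a hypothetical factor, used here only for illustration.
f1 <- factor(c("mit", "mittels", "ueber"))

# Assigning a value that is not among the levels yields NA
# (R normally warns about the invalid factor level here).
f_bad <- f1
suppressWarnings(f_bad[2] <- "in")
is.na(f_bad[2])

# Declaring the level first makes the same assignment valid.
f_ok <- f1
levels(f_ok) <- c(levels(f_ok), "in")
f_ok[2] <- "in"
as.character(f_ok)
```

The new level is appended at the end of the existing levels, so the integer codes of the values already stored in the factor do not change.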
tibor.kiss at rub.de
2022-Sep-19 12:07 UTC
[R] Question concerning side effects of treating invalid factor levels
Hi,

this is a misunderstanding of my question. I wasn't worried about invalid factor levels that produce NA. My question was why a column changes its class, which I thought was a side effect.

If you build a vector with c() and it contains even one character string, the whole vector becomes _chr_. Once that vector has been added as a row, each of the two factor columns receives an NA, and the numeric column receives a character string, which is responsible for turning the numeric vector into a character vector (see ?c, where you find: "The output type is determined from the highest type of the components in the hierarchy NULL < raw < logical < integer < double < complex < character < list < expression.").

Best

Tibor
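The coercion rule quoted from ?c can be checked directly. The sketch below is only an illustration: the value 12345 stands in for the random number in the original example, and the names new_row and new_list are hypothetical.

```r
# c() returns an atomic vector, so all elements are coerced to the
# highest type in the hierarchy; here that is character.
new_row <- c("in", "V>N", 12345)
typeof(new_row)   # "character"
new_row[3]        # "12345", now a string rather than a number

# A list keeps each element's own type.
new_list <- list("in", "V>N", 12345)
sapply(new_list, typeof)   # "character" "character" "double"
```

So the number has already become a string before rbind() ever sees it.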
Ebert,Timothy Aaron
2022-Sep-19 12:22 UTC
[R] Question concerning side effects of treating invalid factor levels
Sorry, my bad.

An atomic vector must be of a single type. When you declare

c("in", "V>N", round(runif(1, 7000, 16000), 0))

R will calculate the random number, but then convert it to character to conform with the other two elements in that vector. R then binds this to your original df and finds that it must add a character value to a numeric column. To keep the column all of the same type, it converts everything to character.

Better?

Tim
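One way to avoid the coercion described above is to pass the new row as a one-row data frame, so that each value keeps its own type. This is a sketch under assumptions, not something proposed in the thread: df_demo is a shortened, hypothetical version of the original data with fixed RT values, and the new row's factor columns are built with factor() so that rbind() merges their levels.

```r
# Shortened, hypothetical version of the data frame from the question.
df_demo <- data.frame(
  P      = factor(c("mittels", "mit", "ueber")),
  ANSWER = factor(c("PP>OBJ", "PP>OBJ", "OBJ>PP")),
  RT     = c(11157, 13719, 14388)
)

# Binding an atomic vector: everything in it is already character,
# so RT becomes character and the unknown levels become NA.
bad <- suppressWarnings(rbind(df_demo, c("in", "V>N", 9000)))
class(bad$RT)     # "character"
is.na(bad$P[4])   # TRUE

# Binding a one-row data frame: RT stays numeric, and the factor
# levels of both data frames are combined.
new_row <- data.frame(P = factor("in"), ANSWER = factor("V>N"), RT = 9000)
good <- rbind(df_demo, new_row)
class(good$RT)    # "numeric"
levels(good$P)    # "mit" "mittels" "ueber" "in"
```

Whether the new level should be accepted silently like this, or rejected as invalid, depends on the data; the point is only that the numeric column is no longer affected.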