Hi Bob,

many thanks for your reply.

I have read the documentation. In my current project I use "item
batteries" for dimensions of touchpoints which are rated by our
customers. I wrote functions to analyse them. If I create a factor
before filtering and analysing, I lose the original values of the
variable. If I use the original variable for filtering and analysis, it
might happen that for some dimensions certain values were not selected.
This means they are not NA, but none of the respondents chose "4", for
instance, on a scale from 1 to 6. As a result, creating a factor from
the analysed data with the labels of the complete scale (1:6) fails
because the vector lengths differ (number of unique values remaining in
the analysis vs. number of values in the scale). As I have a function
doing the analysis, I am looking for a way to make my function robust to
such circumstances so that I can use it to analyse all "item batteries".
Thus my question. I believe my findings are not odd. Maybe there is a
way of dealing with this kind of problem in R, and I am eager to learn
how it can be solved using R.

What would you suggest?

Kind regards

Georg

From: "Bob O'Hara" <rni.boh at gmail.com>
To: G.Maubach at weinwolf.de
Cc: r-help <r-help at r-project.org>
Date: 09.05.2017 12:26
Subject: Re: [R] Factors and Alternatives

That's easy! First

> str(test3)
 Factor w/ 2 levels "WITHOUT Contact",..: 2 2 2 2 1 1 1 1 1 1

tells you that the internal values are 1 and 2, and the labels are
"WITHOUT Contact" and "WITH Contact". If you read the help page for
factor() you'll see this:

  levels: an optional vector of the values (as character strings) that
          'x' might have taken. The default is the unique set of
          values taken by 'as.character(x)', sorted into increasing
          order _of 'x'_. Note that this set can be specified as
          smaller than 'sort(unique(x))'.

  labels: _either_ an optional character vector of (unique) labels for
          the levels (in the same order as 'levels' after removing
          those in 'exclude'), _or_ a character string of length 1.

So, when you create test3 you say that test1 can take values 0 and 1,
and these should be labelled as "WITHOUT Contact" and "WITH Contact".
So R internally codes "0" as 1 and "1" as 2 (internally R codes
factors as integers, which can be both useful and dangerous), and then
gives them labels "WITHOUT Contact" and "WITH Contact". It now doesn't
care that they were 0 and 1, because you've told it to change the
labels.

If you want to filter by the original values, then don't change the
labels (or at least not until after you've filtered by the original
labels), or convert the filter to the new labels. You're asking for a
data structure with two sets of labels, which sounds odd in general.

Bob

On 9 May 2017 at 12:12, <G.Maubach at weinwolf.de> wrote:
> Hi All,
>
> I am using factors in a study for the social sciences.
>
> I discovered the following:
>
> -- cut --
>
> library(dplyr)
>
> test1 <- c(rep(1, 4), rep(0, 6))
> d_test1 <- data.frame(test1)
>
> test2 <- factor(test1)
> d_test2 <- data.frame(test2)
>
> test3 <- factor(test1,
>                 levels = c(0, 1),
>                 labels = c("WITHOUT Contact", "WITH Contact"))
> d_test3 <- data.frame(test3)
>
> d_test1 %>% filter(test1 == 0)  # works OK
> d_test2 %>% filter(test2 == 0)  # works OK
> d_test3 %>% filter(test3 == 0)  # does not work, why?
>
> myf <- function(ds) {
>   print(levels(ds$test3))
>   print(labels(ds$test3))
>   print(as.numeric(ds$test3))
>   print(as.character(ds$test3))
> }
>
> # This shows that it is not possible to access the original
> # values which were the basis to build the factor:
> myf(d_test3)
>
> -- cut --
>
> Why is it not possible to use a factor with labels for filtering with
> the original values?
> Is there a data structure that works like a factor but also gives
> access to the original values?
>
> Kind regards
>
> Georg
>
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

--
Bob O'Hara
NOTE NEW ADDRESS!!!
Institutt for matematiske fag
NTNU
7491 Trondheim
Norway

Mobile: +49 1515 888 5440
Journal of Negative Results - EEB: www.jnr-eeb.org
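The behaviour Bob describes can be reproduced in base R without dplyr. The sketch below reuses test1 and test3 from the quoted post; the data frame `d` and its column names `code` and `contact` are hypothetical names introduced only for illustration. It shows one possible way to keep the original values available: store them in a column next to the labelled factor.

```r
# Original codes and the labelled factor built from them
test1 <- c(rep(1, 4), rep(0, 6))
test3 <- factor(test1,
                levels = c(0, 1),
                labels = c("WITHOUT Contact", "WITH Contact"))

# Comparing against the old value matches nothing: the factor only
# knows its labels, and 0 is compared as the string "0".
sum(test3 == 0)                   # 0

# Option 1: filter by the label instead of the original value
sum(test3 == "WITHOUT Contact")   # 6

# Option 2 (assumption: hypothetical data frame 'd'): keep the
# original codes in their own column next to the factor
d <- data.frame(code = test1, contact = test3)
d_without <- d[d$code == 0, ]
nrow(d_without)                   # 6
```

With the codes kept in a separate column, filtering can use either representation, at the cost of carrying two columns per variable.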
For the problem you state, would it be enough to explicitly define
your levels?

fac <- rep(c("a", "b", "d"), each=4)
fac.f <- factor(fac, levels=c("a", "b", "c", "d"))
table(fac.f)

# but be warned: calling factor() again drops the unused level "c"
fac.f2 <- factor(fac.f)
table(fac.f2)

This has the advantage that the code explicitly documents what the
possible values are, so if something goes wrong downstream, you know
it is a real problem (well, unless you have some type conversions
screwing things up).

You might also want to do some defensive programming, and put some
checks in the code, to make sure your factors have the right number of
levels.

Bob
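Bob's suggestion of defensive checks could look roughly like the sketch below. The helper name `make_scale_factor` and its arguments are hypothetical and do not come from the thread; the sketch assumes that the responses are integer codes and that the full scale is known in advance.

```r
# Hypothetical helper: build a factor over the full scale and fail
# early if the data contain a value outside that scale.
make_scale_factor <- function(x, scale = 1:6) {
  bad <- setdiff(unique(x[!is.na(x)]), scale)
  if (length(bad) > 0) {
    stop("values outside the scale: ", paste(bad, collapse = ", "))
  }
  f <- factor(x, levels = scale)
  # Check that the factor really carries one level per scale value
  stopifnot(nlevels(f) == length(scale))
  f
}

answers <- c(1, 2, 2, 5, 6, 3, 1)   # made-up data, nobody chose 4
f <- make_scale_factor(answers)
table(f)                            # level 4 is present with count 0

# Re-applying factor() silently drops the unused level, which is the
# situation Bob warns about; comparing level counts detects it.
nlevels(factor(f)) == nlevels(f)    # FALSE
```

Failing loudly on out-of-scale values is a design choice here; depending on the analysis, turning them into NA with a warning may be preferable.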
I'm not sure I understand your question, but you can easily include
all possible answers when you create the factor by using the levels=
argument, as Bob pointed out. Here is an example of values that range
from 1 to 6, but value 3 is not represented. Notice that a factor
level 3 is created even though it does not appear in the data:

> set.seed(42)
> x <- sample.int(6, 10, replace=TRUE)
> table(x)
x
1 2 4 5 6
1 1 3 3 2
> y <- factor(x, levels=1:6)
> y
 [1] 6 6 2 5 4 4 5 1 4 5
Levels: 1 2 3 4 5 6

-------------------------------------
David L Carlson
Department of Anthropology
Texas A&M University
College Station, TX 77840-4352
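One practical consequence for the "item batteries" mentioned earlier in the thread, sketched under the assumption that each item is a column of integer codes on the same 1 to 6 scale (the data frame `battery` and its column names are made up for illustration): with the levels fixed, every item yields a frequency table of the same length, so the tables can be combined into one matrix even when some values were never chosen.

```r
# Made-up item battery: three items rated on a 1-6 scale
battery <- data.frame(
  item_a = c(1, 2, 2, 5, 6, 6),
  item_b = c(1, 1, 3, 3, 4, 6),
  item_c = c(2, 2, 2, 5, 5, 5)
)

# Fixing the levels gives one row per scale value for every item,
# with a count of 0 where nobody chose that value.
counts <- sapply(battery, function(x) table(factor(x, levels = 1:6)))
counts
```

Without the explicit levels, the per-item tables would have different lengths and sapply() would return a list instead of a matrix.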
G.Maubach at weinwolf.de
2017-May-09 13:37 UTC
[R] Antwort: RE: Antwort: Re: Factors and Alternatives (SOLVED)
Hi David, Hi Bob,

many thanks for your help. Your solution - just use all levels instead
of only the ones found in the data - helped.

The original code looked like this:

-- cut --

c_v10_val_labs <- c(
  "1 = sehr gut",
  "2",
  "3",
  "4",
  "5",
  "6 = sehr schlecht"
)

# where c_v10_val_labs is handed over to my function as "val_labs".

ds_results$value <- factor(ds_results$value,
                           levels = sort(unique(ds_results$value)),  # old code
                           labels = sort(unique(val_labs)))

-- cut --

If I write instead

-- cut --

ds_results$value <- factor(ds_results$value,
                           levels = seq_along(val_labs),  # new code 1st version
                           labels = sort(unique(val_labs)))

-- cut --

the factor is built with all levels, even if a value is not present in
the data (not NA, it simply does not occur, i.e. it was not stated by
any respondent).

In Zumel's book "Practical Data Science with R"
(https://www.amazon.de/Practical-Data-Science-Nina-Zumel/dp/1617291560),
Shelter Island: Manning, 2014, p. 23-24, Listing 2-5, a mapping using
subscripts is described:

-- cut --

mapping <- list(
  'A40'='car (new)',
  'A41'='car (used)',
  'A42'='furniture/equipment',
  'A43'='radio/television',
  'A44'='domestic appliances',
  ...
)
for(i in 1:(dim(d))[2]) {
  if(class(d[,i])=='character') {
    d[,i] <- as.factor(as.character(mapping[d[,i]]))
  }
}

-- cut --

Simply stated, this would mean:

-- cut --

val_labs <- list(
  "1" = "1 = sehr gut",
  "2" = "2",
  "3" = "3",
  "4" = "4",
  "5" = "5",
  "6" = "6 = sehr schlecht"
)

set.seed(12345)
answers = c(sample(1:5, 10, replace = TRUE))
test <- factor(unlist(val_labs[answers]))

# or just

val_labs <- c(
  "1 = sehr gut",
  "2",
  "3",
  "4",
  "5",
  "6 = sehr schlecht"
)

set.seed(12345)
answers = c(sample(1:5, 10, replace = TRUE))
test <- val_labs[answers]

-- cut --

Adapting this to my code would give:

-- cut --

ds_results$value <- factor(ds_results$value,
                           levels = sort(unique(ds_results$value)),
                           labels = val_labs[sort(unique(ds_results$value))])  # new code 2nd version

-- cut --

This results in a factor with just as many levels as there are unique
values in the results.

Both solutions work. Which version is best depends on the overall
process and the purpose of the code. I am documenting all this for
readers who later refer to the list archives.

Using your version and running my code reveals that ggplot runs into
difficulties because the legend lacks values and the sequence and
coloring of the legend are wrong. But that's another story.

Many thanks again for your help.

Kind regards

Georg
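To make the difference between the two versions concrete, here is a small sketch with made-up responses in which nobody chose 4 (the vector `value` stands in for ds_results$value). It passes val_labs directly as labels rather than sort(unique(val_labs)), on the assumption that the labels are already stored in scale order; sorting them would only be safe when alphabetical order happens to coincide with scale order.

```r
val_labs <- c("1 = sehr gut", "2", "3", "4", "5", "6 = sehr schlecht")
value <- c(1, 2, 2, 3, 5, 6, 6)   # made-up responses, no 4

# 1st version: all levels of the scale, used or not
v1 <- factor(value,
             levels = seq_along(val_labs),
             labels = val_labs)

# 2nd version: only the levels that occur in the data
present <- sort(unique(value))
v2 <- factor(value,
             levels = present,
             labels = val_labs[present])

nlevels(v1)   # 6
nlevels(v2)   # 5

# Both versions attach the same label to each response
identical(as.character(v1), as.character(v2))   # TRUE
```

If the full scale is needed for tables but the reduced one elsewhere, base R's droplevels(v1) yields the same levels as the second version, so only the first version has to be stored.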