Mychaleckyj, Josyf C (jcm6t)
2019-Mar-12 20:39 UTC
[Rd] as.data.frame.table() does not recognize default.stringsAsFactors()
Reporting a possible inconsistency or bug in handling stringsAsFactors in as.data.frame.table()

Here is a simple test

> options()$stringsAsFactors
[1] TRUE
> x<-c("a","b","c","a","b")
> d<-as.data.frame(table(x))
> d
  x Freq
1 a    2
2 b    2
3 c    1
> class(d$x)
[1] "factor"
> d2<-as.data.frame(table(x),stringsAsFactors=F)
> class(d2$x)
[1] "character"
> options(stringsAsFactors=F)
> options()$stringsAsFactors
[1] FALSE
> d3<-as.data.frame(table(x))
> d3
  x Freq
1 a    2
2 b    2
3 c    1
> class(d3$x)
[1] "factor"
> d4<-as.data.frame(table(x),stringsAsFactors=F)
> class(d4$x)
[1] "character"

# Display the code showing the different stringsAsFactors handling in table and matrix:

> as.data.frame.table
function (x, row.names = NULL, ..., responseName = "Freq", stringsAsFactors = TRUE,
    sep = "", base = list(LETTERS))
{
    ex <- quote(data.frame(do.call("expand.grid", c(dimnames(provideDimnames(x,
        sep = sep, base = base)), KEEP.OUT.ATTRS = FALSE, stringsAsFactors = stringsAsFactors)),
        Freq = c(x), row.names = row.names))
    names(ex)[3L] <- responseName
    eval(ex)
}
<bytecode: 0x28769f8>
<environment: namespace:base>

> as.data.frame.matrix
function (x, row.names = NULL, optional = FALSE, make.names = TRUE,
    ..., stringsAsFactors = default.stringsAsFactors())
{
    d <- dim(x)
    nrows <- d[[1L]]
    ncols <- d[[2L]]
    ic <- seq_len(ncols)
    dn <- dimnames(x)
    if (is.null(row.names))
        row.names <- dn[[1L]]
    collabs <- dn[[2L]]
    if (any(empty <- !nzchar(collabs)))
        collabs[empty] <- paste0("V", ic)[empty]
    value <- vector("list", ncols)
    if (mode(x) == "character" && stringsAsFactors) {
        for (i in ic) value[[i]] <- as.factor(x[, i])
    }
    else {
        for (i in ic) value[[i]] <- as.vector(x[, i])
    }
    autoRN <- (is.null(row.names) || length(row.names) != nrows)
    if (length(collabs) == ncols)
        names(value) <- collabs
    else if (!optional)
        names(value) <- paste0("V", ic)
    class(value) <- "data.frame"
    if (autoRN)
        attr(value, "row.names") <- .set_row_names(nrows)
    else .rowNamesDF(value, make.names = make.names) <- row.names
    value
}
<bytecode: 0x29995c0>
<environment: namespace:base>

> sessionInfo()
R version 3.5.2 (2018-12-20)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS: /usr/lib64/libblas.so.3.4.2
LAPACK: /usr/lib64/liblapack.so.3.4.2

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

loaded via a namespace (and not attached):
[1] compiler_3.5.2 tools_3.5.2

Thanks,
Joe
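The behaviour reported above can be condensed into a short script. This is only a sketch: it deliberately avoids touching options(stringsAsFactors), so it does not depend on the session default, and the names tab, d_default and d_char are made up for illustration.

```r
# Same toy data as in the report above.
x <- c("a", "b", "c", "a", "b")
tab <- table(x)

# No explicit argument: the method's own default (stringsAsFactors = TRUE)
# applies, so the classifying column comes back as a factor.
d_default <- as.data.frame(tab)

# Explicit argument: the classifying column comes back as character.
d_char <- as.data.frame(tab, stringsAsFactors = FALSE)

class(d_default$x)
class(d_char$x)
```

Only the inline argument changes the class of the x column; the Freq column is unaffected in both cases.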
peter dalgaard
2019-Mar-14 15:18 UTC
[Rd] as.data.frame.table() does not recognize default.stringsAsFactors()
I have no recollection of the original rationale for as.data.frame.table, but I actually think it is fine as it is:

The classifying _factors_ of a crosstable should be factors unless very specifically directed otherwise and that should not depend on the setting of an option that controls the conversion of character data.

For as.data.frame.matrix, in contrast, it is the _content_ of the matrix that is being converted, and it seems much more reasonable to follow the same path as for other character data.

-pd

--
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Office: A 4.23
Email: pd.mes at cbs.dk Priv: PDalgd at gmail.com
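For comparison, here is a small sketch of the matrix case described above, where it is the character content of the matrix that gets converted. The argument is passed explicitly in both calls so the result does not depend on any session default; the matrix m and its column names u and v are invented for the example.

```r
# A character matrix: here the strings are the data, not classifying factors.
m <- matrix(c("a", "b", "c", "d"), nrow = 2,
            dimnames = list(NULL, c("u", "v")))

# as.data.frame.matrix converts the cell contents according to the argument.
df_fac <- as.data.frame(m, stringsAsFactors = TRUE)
df_chr <- as.data.frame(m, stringsAsFactors = FALSE)

sapply(df_fac, class)
sapply(df_chr, class)
```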
Mychaleckyj, Josyf C (jcm6t)
2019-Mar-14 16:33 UTC
[Rd] as.data.frame.table() does not recognize default.stringsAsFactors()
Peter,

Thanks for the response. I have no wish to prolong this and have no axe to grind. I'm sure you were delighted to see another stringsAsFactors issue.

Perhaps we are talking about the conflation of two steps: the first is the language 'pure' conversion of the table to a data.frame with the cross-tab factor, followed by an optional subsequent step, with programmatic utility for a specific application, of converting that factor to a character column.

As my toy example shows, the as.data.frame.table() function permits passing the inline stringsAsFactors argument and returns a data.frame with the factor cross-tab column coerced to a character column, permitting these two steps to be accomplished in a single function. If you intend the function to meet only the first step, then I would suggest you remove stringsAsFactors as an argument to this function and amend the documentation. Following this, if an application needed a coercion to character, it should be accomplished in a second step.

If you are implying that the core team intended options(stringsAsFactors) to be a 'selective' global option, then I guess I am confused, and I have not seen documentation about a limited scope of the session-wide options().

?options
'stringsAsFactors': The default setting for arguments of 'data.frame' and 'read.table'.

As a practical programming matter, this inconsistency created a bug in our code that was very insidious and cost hours of debugging and a lot of head scratching. Chars and factors are always prime candidates, but we never even considered that the session option would not have been respected by a low-level core function for which the function call in the documentation explicitly included the inline argument.

?as.data.frame.table()
From the Usage section of as.data.frame.table()

## S3 method for class 'table'
as.data.frame(x, row.names = NULL, ...,
              responseName = "Freq", stringsAsFactors = TRUE,
              sep = "", base = list(LETTERS))

Thanks,
Joe.
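The two-step route described above (a plain conversion first, then an application-specific coercion) might look like the following sketch. The lapply() idiom is just one way to do the second step, and the variable names are illustrative only.

```r
x <- c("a", "b", "c", "a", "b")

# Step 1: plain conversion; the cross-tab column is a factor.
d <- as.data.frame(table(x))

# Step 2: coerce every factor column to character, leaving other columns
# (here the integer Freq column) untouched. Assigning into d[] keeps the
# data.frame structure.
d[] <- lapply(d, function(col) if (is.factor(col)) as.character(col) else col)

str(d)
```

Keeping the coercion as a separate, visible step makes the resulting column type independent of both the session option and the method's default.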
Martin Maechler
2019-Mar-14 16:40 UTC
[Rd] as.data.frame.table() does not recognize default.stringsAsFactors()
>>>>> peter dalgaard
>>>>>     on Thu, 14 Mar 2019 16:18:55 +0100 writes:

    > I have no recollection of the original rationale for as.data.frame.table, but I actually think it is fine as it is:
    > The classifying _factors_ of a crosstable should be factors unless very specifically directed otherwise and that should not depend on the setting of an option that controls the conversion of character data.
    > For as.data.frame.matrix, in contrast, it is the _content_ of the matrix that is being converted, and it seems much more reasonable to follow the same path as for other character data.
    > -pd

I very strongly agree that as.data.frame.table() should not be changed to follow a global option.

To the contrary: I've repeatedly mentioned that in my view it has been a design mistake to allow data.frame() and as.data.frame() to be influenced by a global option
[and we should've tried harder to keep things purely functional (R remaining as closely as possible a "functional language"), e.g. by providing wrapper functions the same way we have such wrappers for versions of read.table() with different defaults for some of the arguments ]

Martin
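A wrapper with a different default, in the spirit of the read.table() variants mentioned above, could be sketched as follows. The function table_as_chr_df() is hypothetical and not part of base R; it only changes the default of the existing stringsAsFactors argument and forwards everything else.

```r
# Hypothetical wrapper (not in base R): same conversion, different default.
# The caller can still override stringsAsFactors explicitly.
table_as_chr_df <- function(x, ..., stringsAsFactors = FALSE) {
  as.data.frame(x, ..., stringsAsFactors = stringsAsFactors)
}

x <- c("a", "b", "c", "a", "b")
counts_chr <- table_as_chr_df(table(x))
counts_fac <- table_as_chr_df(table(x), stringsAsFactors = TRUE)

class(counts_chr$x)
class(counts_fac$x)
```

The choice of behaviour is then visible at the call site through the function name, rather than hidden in session state.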