Chris Evans
2016-Dec-06 22:10 UTC
[R] Odd behaviour of mean() with a numeric column in a tibble
{{SIGH}}

You are absolutely right.

I wonder if I am losing some cognitive capacities that are needed to be part of the evolving R community. It seems to me that if a tibble is designed to be an enhanced replacement for a dataframe then it shouldn't quite so radically change things.

I notice that the documentation on tibble says "[ Never simplifies (drops), so always returns data.frame"
That is much less explicit than I would have liked and actually doesn't seem to be true. In fact, as you rightly say, it generally, but not quite always, returns a tibble. In fact it can be fooled into a vector of length 1.

> tmpTibble[[1,]]
Error in `[[.data.frame`(tmpTibble, 1, ) :
  argument "..2" is missing, with no default
> tmpTibble[1]
# A tibble: 26 × 1
      ID
   <chr>
1      a
2      b
3      c
4      d
5      e
6      f
7      g
8      h
9      i
10     j
# ... with 16 more rows
> tmpTibble[,1]
# A tibble: 26 × 1
      ID
   <chr>
1      a
2      b
3      c
4      d
5      e
6      f
7      g
8      h
9      i
10     j
# ... with 16 more rows
> tmpTibble[1,]
Error in `[<-.data.frame`(`*tmp*`, , value = list(ID = c("a", "a", "a", :
  replacement element 3 is a matrix/data frame of 26 rows, need 1
In addition: Warning messages:
1: In `[<-.data.frame`(`*tmp*`, , value = list(ID = c("a", "a", "a", :
  replacement element 1 has 26 rows to replace 1 rows
2: In `[<-.data.frame`(`*tmp*`, , value = list(ID = c("a", "a", "a", :
  replacement element 2 has 26 rows to replace 1 rows
> tmpTibble[1,1:26]
Error: Invalid column indexes: 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26
> tmpTibble[[1,2]]
[1] 1
> str(tmpTibble[[1,2]])
 int 1
> str(tmpTibble[[1:2,2]])
Error in col[[i, exact = exact]] :
  attempt to select more than one element in vectorIndex
>
> tmpTibble[[1,1:2]]
[1] "b"
>

So [[a,b]] works if a and b are legal with the dimensions of the tibble and if a is of length 1 but returns NOT a tibble but a vector of length 1 (I think), I can see that's logical but not what it says in the documentation.

[[a]] and [[,a]] return the same result, that seems excessively tolerant to me.

[[a,b:c]] actually returns [[a,c]] and again as a single value, NOT a tibble.

And row subsetting/indexing has gone.

Why create a replacement for a dataframe that has no row indexing and so radically redefines column indexing, in fact redefines the whole of indexing and subsetting?

OK. I will go to sleep now and hope to feel less dumb(ed) when I wake. Perhaps Prof. Wickham or someone can spell out a bit less tersely, and I think incompletely, than the tibble documentation does, why all this is good.

Thanks anyway Ista, you certainly hit the issue!

Very best all,

Chris

> From: "Ista Zahn" <istazahn at gmail.com>
> To: "Chris Evans" <chrishold at psyctc.org>
> Cc: "r-helpr-project.org" <r-help at r-project.org>
> Sent: Tuesday, 6 December, 2016 21:40:41
> Subject: Re: [R] Odd behaviour of mean() with a numeric column in a tibble

> Not at a computer to check right now, but I believe single bracket indexing a
> tibble always returns a tibble. To extract a vector use [[

> On Dec 6, 2016 4:28 PM, "Chris Evans" < chrishold at psyctc.org > wrote:

> > I hope I am obeying the list rules here. I am using a raw R IDE for this and
> > running 3.3.2 (2016-10-31) on x86_64-w64-mingw32/x64 (64-bit)

> > Here is a reproducible example. Code only first

> > require(tibble)
> > tmpTibble <- tibble(ID=letters,num=1:26)
> > min(tmpTibble[,2]) # fine
> > max(tmpTibble[,2]) # fine
> > median(tmpTibble[,2]) # not fine
> > mean(tmpTibble[,2]) # not fine

> I think you want

> mean(tmpTibble[[2]])

> > newMeanFun <- function(x) {mean(as.numeric(unlist(x)))}
> > newMeanFun(tmpTibble[,2]) # solved problem but surely shouldn't be necessary?!
> > newMedianFun <- function(x) {median(as.numeric(unlist(x)))}
> > newMedianFun(tmpTibble[,2]) # ditto
> > str(tmpTibble[,2])

> > ### then I tried this to make sure it wasn't about having fed in integers

> > tmpTibble2 <- tibble(ID=letters,num=1:26,num2=(1:26)/10)
> > tmpTibble2
> > mean(tmpTibble2[,3]) # not fine, not about integers!

> > ### before I just created tmpTibble2 I found myself trying to add a column to
> > tmpTibble
> > tmpTibble$newNum <- tmpTibble[,2]/10 # NO!
> > tmpTibble[["newNum"]] <- tmpTibble[,2]/10 # NO!
> > ### and oddly enough ...
> > add_column(tmpTibble,newNum = tmpTibble[,2]/10) # NO!

> > Now here it is with the output:

> > > require(tibble)
> > Loading required package: tibble
> > > tmpTibble <- tibble(ID=letters,num=1:26)
> > > min(tmpTibble[,2]) # fine
> > [1] 1
> > > max(tmpTibble[,2]) # fine
> > [1] 26
> > > median(tmpTibble[,2]) # not fine
> > Error in median.default(tmpTibble[, 2]) : need numeric data
> > > mean(tmpTibble[,2]) # not fine
> > [1] NA
> > Warning message:
> > In mean.default(tmpTibble[, 2]) :
> > argument is not numeric or logical: returning NA
> > > newMeanFun <- function(x) {mean(as.numeric(unlist(x)))}
> > > newMeanFun(tmpTibble[,2]) # solved problem but surely shouldn't be necessary?!
> > [1] 13.5
> > > newMedianFun <- function(x) {median(as.numeric(unlist(x)))}
> > > newMedianFun(tmpTibble[,2]) # ditto
> > [1] 13.5
> > > str(tmpTibble[,2])
> > Classes 'tbl_df', 'tbl' and 'data.frame': 26 obs. of 1 variable:
> > $ num: int 1 2 3 4 5 6 7 8 9 10 ...

> > > ### then I tried this to make sure it wasn't about having fed in integers

> > > tmpTibble2 <- tibble(ID=letters,num=1:26,num2=(1:26)/10)
> > > tmpTibble2
> > # A tibble: 26 × 3
> >       ID   num  num2
> >    <chr> <int> <dbl>
> > 1      a     1   0.1
> > 2      b     2   0.2
> > 3      c     3   0.3
> > 4      d     4   0.4
> > 5      e     5   0.5
> > 6      f     6   0.6
> > 7      g     7   0.7
> > 8      h     8   0.8
> > 9      i     9   0.9
> > 10     j    10   1.0
> > # ... with 16 more rows
> > > mean(tmpTibble2[,3]) # not fine, not about integers!
> > [1] NA
> > Warning message:
> > In mean.default(tmpTibble2[, 3]) :
> > argument is not numeric or logical: returning NA

> > > ### before I just created tmpTibble2 I found myself trying to add a column to
> > > tmpTibble
> > > tmpTibble$newNum <- tmpTibble[,2]/10 # NO!
> > > tmpTibble[["newNum"]] <- tmpTibble[,2]/10 # NO!
> > > ### and oddly enough ...
> > > add_column(tmpTibble,newNum = tmpTibble[,2]/10) # NO!
> > Error: Each variable must be a 1d atomic vector or list.
> > Problem variables: 'newNum'

> > I discovered this when I hit odd behaviour after using read_spss() from the
> > haven package for the first time as it seemed to be offering a step forward
> > over good old read.spss() from the excellent foreign package. I am reporting it
> > here not directly to Prof. Wickham as the issues seem rather general though I'm
> > guessing that it needs to be fixed with a fix to tibble. Or perhaps I've
> > completely missed something.

> > TIA,

> > Chris

> > ______________________________________________
> > R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
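
The mean() result reported in this thread can be reproduced without tibble at all. The following is a minimal sketch in base R, assuming that drop = FALSE on an ordinary data frame is a fair stand-in for the way a tibble's `[` keeps the data-frame shape; the names oneCol, asVec, meanOfFrame and meanOfVec are made up for the illustration.

```r
# Base R only. drop = FALSE is used here as a stand-in (an assumption of this
# sketch) for the tibble behaviour of never simplifying a `[` result.
df <- data.frame(ID = letters, num = 1:26, stringsAsFactors = FALSE)

oneCol <- df[, 2, drop = FALSE]  # still a data frame, i.e. a list holding one column
asVec  <- df[[2]]                # the integer vector stored in that column

is.data.frame(oneCol)  # TRUE
is.numeric(oneCol)     # FALSE, which is why mean.default() gives up
is.numeric(asVec)      # TRUE

meanOfFrame <- suppressWarnings(mean(oneCol))  # NA (the warning is silenced here)
meanOfVec   <- mean(asVec)                     # 13.5
```

min() and max() still work on oneCol because data frames have their own Summary methods, while mean() and median() fall through to the default methods, which expect a numeric vector.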
Jeff Newmiller
2016-Dec-06 23:23 UTC
[R] Odd behaviour of mean() with a numeric column in a tibble
You really need sleep. Then you need to read

?`[[`

and in particular read about the second argument to the `[[` function, since you don't seem to understand what it is for. Maybe reread the Introduction to R document that comes with R.

The simplest solution is to treat `[[` as supporting one index and `[` as supporting either one or two.

As for expecting any form of row indexing of data frames or tibbles to return a vector, that is hopeless because each column can have a different type. dta[ 1, ] returns exactly what it has to return to avoid losing fidelity. If you really need row indexing to return a vector you should be using a matrix.
--
Sent from my phone. Please excuse my brevity.
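
The rule of thumb above (one index for `[[`, one or two for `[`) and the point about rows can be sketched in base R as follows. This is only an illustration; the object names are invented for it and it uses a plain data frame, not a tibble.

```r
df <- data.frame(ID = letters[1:3], num = 1:3, stringsAsFactors = FALSE)

colAsVector <- df[["num"]]  # `[[` with one index: the column as a plain vector
colsAsFrame <- df["num"]    # `[` with one index: a data frame of the chosen columns
rowAsFrame  <- df[1, ]      # `[` with two indices: one row, still a data frame,
                            # because ID is character and num is integer

# A matrix holds a single type, so a row can safely come back as a vector.
m <- matrix(1:6, nrow = 3)
rowOfMatrix <- m[1, ]       # c(1L, 4L)
```

Keeping the row as a data frame is what preserves the different column types; coercing it to a vector would have to turn everything into character.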
Ista Zahn
2016-Dec-07 00:33 UTC
[R] Odd behaviour of mean() with a numeric column in a tibble
On Tue, Dec 6, 2016 at 5:10 PM, Chris Evans <chrishold at psyctc.org> wrote:
> {{SIGH}}
>
> You are absolutely right.
>
> I wonder if I am losing some cognitive capacities that are needed to be part of the evolving R community. It seems to me that if a tibble is designed to be an enhanced replacement for a dataframe then it shouldn't quite so radically change things.

Well, there are some things about data frames that are darn annoying, and tibbles exist partly as an attempt to eliminate some of the inconsistencies with data.frames. That necessarily means changing things.

>
> I notice that the documentation on tibble says "[ Never simplifies (drops), so always returns data.frame"
> That is much less explicit than I would have liked and actually doesn't seem to be true. In fact, as you rightly say, it generally, but not quite always, returns a tibble. In fact it can be fooled into a vector of length 1.

Really? How?

>
>> tmpTibble[[1,]]
> Error in `[[.data.frame`(tmpTibble, 1, ) :
> argument "..2" is missing, with no default

That doesn't have anything to do with tibbles:

as.data.frame(tmpTibble)[[1, ]]

gives the same thing.

>
>> tmpTibble[1]
> # A tibble: 26 × 1
>       ID
>    <chr>
> 1      a
> 2      b
> 3      c
> 4      d
> 5      e
> 6      f
> 7      g
> 8      h
> 9      i
> 10     j
> # ... with 16 more rows

Again, just what you expect from a data.frame (except for the print method).

>> tmpTibble[,1]
> # A tibble: 26 × 1
>       ID
>    <chr>
> 1      a
> 2      b
> 3      c
> 4      d
> 5      e
> 6      f
> 7      g
> 8      h
> 9      i
> 10     j
> # ... with 16 more rows

That is different, and by design as you noted. It is different from data.frame indexing, but the data.frame behavior is needlessly complicated. Sometimes you get a vector, sometimes a data.frame. That hardly seems worth it given that we already have $ or [[ if you really wanted a vector.

>> tmpTibble[1,]
> Error in `[<-.data.frame`(`*tmp*`, , value = list(ID = c("a", "a", "a", :
> replacement element 3 is a matrix/data frame of 26 rows, need 1
> In addition: Warning messages:
> 1: In `[<-.data.frame`(`*tmp*`, , value = list(ID = c("a", "a", "a", :
> replacement element 1 has 26 rows to replace 1 rows
> 2: In `[<-.data.frame`(`*tmp*`, , value = list(ID = c("a", "a", "a", :
> replacement element 2 has 26 rows to replace 1 rows

That's not what I get.

> tmpTibble[1,]
# A tibble: 1 × 2
     ID   num
  <chr> <int>
1     a     1

works just as I would expect here.

>> tmpTibble[1,1:26]
> Error: Invalid column indexes: 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26

Other than providing more information about what went wrong this is the same as data.frame:

> as.data.frame(tmpTibble)[1,1:26]
Error in `[.data.frame`(as.data.frame(tmpTibble), 1, 1:26) :
  undefined columns selected

>> tmpTibble[[1,2]]
> [1] 1

Same as data.frame (and not at odds with the documentation, which says that [ (not [[ ) always returns a data.frame).

>> str(tmpTibble[[1,2]])
> int 1
>> str(tmpTibble[[1:2,2]])
> Error in col[[i, exact = exact]] :
> attempt to select more than one element in vectorIndex

Same behavior as data.frame.

>>
>> tmpTibble[[1,1:2]]
> [1] "b"
>>

Same behavior as data.frame.

>
> So [[a,b]] works if a and b are legal with the dimensions of the tibble and if a is of length 1 but returns NOT a tibble but a vector of length 1 (I think), I can see that's logical but not what it says in the documentation.

In what documentation? The documentation that says [ always returns a data.frame? Note that [ and [[ are not the same, and only [ is documented to always return a data.frame.

>
> [[a]] and [[,a]] return the same result, that seems excessively tolerant to me.

Not for me:

> tmpTibble[[1]]
 [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
[20] "t" "u" "v" "w" "x" "y" "z"
> tmpTibble[[, 1]]
Error in `[[.data.frame`(tmpTibble, , 1) :
  argument "..1" is missing, with no default

(this is the same thing that happens with a data.frame)

>
> [[a,b:c]] actually returns [[a,c]] and again as a single value, NOT a tibble.

That is weird, but not different than data.frame. See above regarding "NOT a tibble".

>
> And row subsetting/indexing has gone.

Whatever do you mean?

> tmpTibble[tmpTibble$ID == "d", ]
# A tibble: 1 × 2
     ID   num
  <chr> <int>
1     d     4

>
> Why create replacement for a dataframe that has no row indexing and so radically redefines column indexing, in fact redefines the whole of indexing and subsetting?

It has row indexing, and besides [, x] not dropping dimension it works pretty much the same.

>
> OK. I will go to sleep now and hope to feel less dumb(ed) when I wake. Perhaps Prof. Wickham or someone can spell out a bit less tersely, and I think incompletely, than the tibble documentation does, why all this is good.

Most of the things you identify here are issues inherited from data.frame, and not due to differences between tibbles and data.frames.

Best,
Ista
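
The "sometimes a vector, sometimes a data.frame" behaviour of `[` on a plain data frame, which the tibble design removes, can be sketched like this. The first part is base R; the tibble part runs only if the tibble package happens to be installed, and it uses tibble::as_tibble() to build the comparison object.

```r
df <- data.frame(ID = letters, num = 1:26, stringsAsFactors = FALSE)

class(df[, 2])                # "integer": a single column is dropped to a vector
class(df[, 1:2])              # "data.frame": two columns are not dropped
class(df[, 2, drop = FALSE])  # "data.frame": consistent only when drop = FALSE is given

# Optional comparison, skipped when tibble is not available.
if (requireNamespace("tibble", quietly = TRUE)) {
  tb <- tibble::as_tibble(df)
  print(class(tb[, 2]))  # "tbl_df" "tbl" "data.frame": never simplified
  print(class(tb[[2]]))  # "integer": `[[` is the way to get the vector
}
```

Code that relies on df[, i] returning a vector is therefore depending on the default drop = TRUE, which is the one part of `[` that a tibble deliberately does not reproduce.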
Chris Evans
2016-Dec-10 21:57 UTC
[R] Odd behaviour of mean() with a numeric column in a tibble
Thanks to both Jeff and Ista for your inputs some days back. I confess I was _indeed_ too tired to be thinking well and laterally, and even to be copying things into Emails successfully. I have since had more sleep (!) and I have read ?`[[`, gone back to the pertinent parts of "Introduction to R" and generally pondered all this. I confess I had always avoided [[ and only ever used it for lists that were not data frames. I can now see just how badly I was misguessing its behaviour: apologies, I should have realised that I needed to go right back to basics. I _can_ see that there are things in the behaviour of data frames that are not that obvious but I had become very used to them. I can see values in converting to using tibbles instead of data frames and may try to do that. However, I think the documentation for tibble would be improved for people like myself if it started with something that made it even clearer that tibbles are lists, just as data frames are, but that whereas a data frame has a single class(df) of "data.frame", class(tibble) is: c("tbl_df","tbl","data.frame"). I can now see that what I get from ?tibble, i.e. "tibble is a trimmed down version of data.frame" is probably technically true though I'd describe it as a rationalised or even a beefed up version of data.frame. I can also now see that what I find in https://cran.r-project.org/web/packages/tibble/tibble.pdf: "[ Never simplifies (drops), so always returns data.frame" is true, but only to the extent that any tibble is still a data.frame but with "data.frame" moved to the third position in the classes of the tibble where it would be the first and only class were it a pure data.frame. I can also see now that that is not really inconsistent with what I get in https://github.com/tidyverse/tibble: "Tibbles also clearly delineate [ and [[: [ always returns another tibble, [[ always returns a vector. No more drop = FALSE!" 
However, I think it would be better if the tibble.pdf document said: "[ Never simplifies (drops), so always returns tibble" even though "[ Never simplifies (drops), so always returns data.frame" is technically true, up to and including passing is.data.frame() as Finally, I think I can see that if want various functions I have written that worked fine on data frames, but which depended on indexing or subsetting those data frames using [,i] or sometimes [,i:j]to select vectors or matrices, then I will have to modify them so they test whether the input is a simple data frame or a data frame that is also a tibble. I guess that I could have trapped things had my functions (where appropriate) had an is.numeric() input check ... and that I have to use an is.tibble() check, not an is.data.frame() check to distinguish the two! Ah well, even after years of part-time use of R, I guess it's been good for my soul and my deeper and wider understanding of R to go right back to the basics. Thanks again to you both. I am posting here to convey thanks and in case this is useful to anyone like myself who benefits from a bit more narrative than is usually offered by R definitions and help entries. Chris ----- Original Message -----> From: "Jeff Newmiller" <jdnewmil at dcn.davis.ca.us> > To: "Chris Evans" <chrishold at psyctc.org>, "r-helpr-project.org" <r-help at r-project.org> > Sent: Tuesday, 6 December, 2016 23:23:28 > Subject: Re: [R] Odd behaviour of mean() with a numeric column in a tibble> You really need sleep. Then you need to read > > ?`[[` > > and in particular read about the second argument to the `[[` function, since you > don't seem to understand what it is for. Maybe reread the Introduction to R > document that comes with R. > > The simplest solution is to treat `[[` as supporting one index and `[` as > supporting either one or two. 
> > As for expecting any form of row indexing of data frames or tibbles to return a > vector, that is hopeless because each column can have a different type. dta[ > 1, ] returns exactly what it has to return to avoid losing fidelity. If you > really need row indexing to return a vector you should be using a matrix. > -- > Sent from my phone. Please excuse my brevity. > > On December 6, 2016 2:10:15 PM PST, Chris Evans <chrishold at psyctc.org> wrote: >>{{SIGH}} >> >>You are absolutely right. >> >>I wonder if I am losing some cognitive capacities that are needed to be >>part of the evolving R community. It seems to me that if a tibble is >>designed to be an enhanced replacement for a dataframe then it >>shouldn't quite so radically change things. >> >>I notice that the documentation on tibble says "[ Never simplifies >>(drops), so always returns data.frame" >>That is much less explicit than I would have liked and actually doesn't >>seem to be true. In fact, as you rightly say, it generally, but not >>quite always, returns a tibble. In fact it can be fooled into a vector >>of length 1. >> >>> tmpTibble[[1,]] >>Error in `[[.data.frame`(tmpTibble, 1, ) : >>argument "..2" is missing, with no default >> >>> tmpTibble[1] >># A tibble: 26 ? 1 >>ID >><chr> >>1 a >>2 b >>3 c >>4 d >>5 e >>6 f >>7 g >>8 h >>9 i >>10 j >># ... with 16 more rows >>> tmpTibble[,1] >># A tibble: 26 ? 1 >>ID >><chr> >>1 a >>2 b >>3 c >>4 d >>5 e >>6 f >>7 g >>8 h >>9 i >>10 j >># ... 
with 16 more rows >>> tmpTibble[1,] >>Error in `[<-.data.frame`(`*tmp*`, , value = list(ID = c("a", "a", "a", >>: >>replacement element 3 is a matrix/data frame of 26 rows, need 1 >>In addition: Warning messages: >>1: In `[<-.data.frame`(`*tmp*`, , value = list(ID = c("a", "a", "a", : >>replacement element 1 has 26 rows to replace 1 rows >>2: In `[<-.data.frame`(`*tmp*`, , value = list(ID = c("a", "a", "a", : >>replacement element 2 has 26 rows to replace 1 rows >>> tmpTibble[1,1:26] >>Error: Invalid column indexes: 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, >>15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26 >>> tmpTibble[[1,2]] >>[1] 1 >>> str(tmpTibble[[1,2]]) >>int 1 >>> str(tmpTibble[[1:2,2]]) >>Error in col[[i, exact = exact]] : >>attempt to select more than one element in vectorIndex >>> >>> tmpTibble[[1,1:2]] >>[1] "b" >>> >> >>So [[a,b]] works if a and b are legal with the dimensions of the tibble >>and if a is of length 1 but returns NOT a tibble but a vector of length >>1 (I think), I can see that's logical but not what it says in the >>documentation. >> >>[[a]] and [[,a]] return the same result, that seems excessively >>tolerant to me. >> >>[[a,b:c]] actually returns [[a,c]] and again as a single value, NOT a >>tibble. >> >>And row subsetting/indexing has gone. >> >>Why create replacement for a dataframe that has no row indexing and so >>radically redefines column indexing, in fact redefines the whole of >>indexing and subsetting? >> >>OK. I will go to sleep now and hope to feel less dumb(ed) when I wake. >>Perhaps Prof. Wickham or someone can spell out a bit less tersely, and >>I think incompletely, than the tibble documentation does, why all this >>is good. >> >>Thanks anyway Ista, you certainly hit the issue! 
>>
>>Very best all,
>>
>>Chris
>>
>>> From: "Ista Zahn" <istazahn at gmail.com>
>>> To: "Chris Evans" <chrishold at psyctc.org>
>>> Cc: "r-helpr-project.org" <r-help at r-project.org>
>>> Sent: Tuesday, 6 December, 2016 21:40:41
>>> Subject: Re: [R] Odd behaviour of mean() with a numeric column in a tibble
>>
>>> Not at a computer to check right now, but I believe single bracket indexing a
>>> tibble always returns a tibble. To extract a vector use [[
>>
>>> On Dec 6, 2016 4:28 PM, "Chris Evans" < chrishold at psyctc.org > wrote:
>>
>>> > I hope I am obeying the list rules here. I am using a raw R IDE for this and
>>> > running 3.3.2 (2016-10-31) on x86_64-w64-mingw32/x64 (64-bit)
>>
>>> > Here is a reproducible example. Code only first
>>
>>> > require(tibble)
>>> > tmpTibble <- tibble(ID=letters,num=1:26)
>>> > min(tmpTibble[,2]) # fine
>>> > max(tmpTibble[,2]) # fine
>>> > median(tmpTibble[,2]) # not fine
>>> > mean(tmpTibble[,2]) # not fine
>>
>>> I think you want
>>
>>> mean(tmpTibble[[2]])
>>
>>> > newMeanFun <- function(x) {mean(as.numeric(unlist(x)))}
>>> > newMeanFun(tmpTibble[,2]) # solved problem but surely shouldn't be necessary?!
>>> > newMedianFun <- function(x) {median(as.numeric(unlist(x)))}
>>> > newMedianFun(tmpTibble[,2]) # ditto
>>> > str(tmpTibble[,2])
>>
>>> > ### then I tried this to make sure it wasn't about having fed in integers
>>
>>> > tmpTibble2 <- tibble(ID=letters,num=1:26,num2=(1:26)/10)
>>> > tmpTibble2
>>> > mean(tmpTibble2[,3]) # not fine, not about integers!
>>
>>
>>> > ### before I just created tmpTibble2 I found myself trying to add a column to tmpTibble
>>> > tmpTibble$newNum <- tmpTibble[,2]/10 # NO!
>>> > tmpTibble[["newNum"]] <- tmpTibble[,2]/10 # NO!
>>> > ### and oddly enough ...
>>> > add_column(tmpTibble,newNum = tmpTibble[,2]/10) # NO!
>>
>>> > Now here it is with the output:
>>
>>> > > require(tibble)
>>> > Loading required package: tibble
>>> > > tmpTibble <- tibble(ID=letters,num=1:26)
>>> > > min(tmpTibble[,2]) # fine
>>> > [1] 1
>>> > > max(tmpTibble[,2]) # fine
>>> > [1] 26
>>> > > median(tmpTibble[,2]) # not fine
>>> > Error in median.default(tmpTibble[, 2]) : need numeric data
>>> > > mean(tmpTibble[,2]) # not fine
>>> > [1] NA
>>> > Warning message:
>>> > In mean.default(tmpTibble[, 2]) :
>>> > argument is not numeric or logical: returning NA
>>> > > newMeanFun <- function(x) {mean(as.numeric(unlist(x)))}
>>> > > newMeanFun(tmpTibble[,2]) # solved problem but surely shouldn't be necessary?!
>>> > [1] 13.5
>>> > > newMedianFun <- function(x) {median(as.numeric(unlist(x)))}
>>> > > newMedianFun(tmpTibble[,2]) # ditto
>>> > [1] 13.5
>>> > > str(tmpTibble[,2])
>>> > Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 26 obs. of 1 variable:
>>> > $ num: int 1 2 3 4 5 6 7 8 9 10 ...
>>
>>> > > ### then I tried this to make sure it wasn't about having fed in integers
>>
>>> > > tmpTibble2 <- tibble(ID=letters,num=1:26,num2=(1:26)/10)
>>> > > tmpTibble2
>>> > # A tibble: 26 × 3
>>> >       ID   num  num2
>>> >    <chr> <int> <dbl>
>>> > 1      a     1   0.1
>>> > 2      b     2   0.2
>>> > 3      c     3   0.3
>>> > 4      d     4   0.4
>>> > 5      e     5   0.5
>>> > 6      f     6   0.6
>>> > 7      g     7   0.7
>>> > 8      h     8   0.8
>>> > 9      i     9   0.9
>>> > 10     j    10   1.0
>>> > # ... with 16 more rows
>>> > > mean(tmpTibble2[,3]) # not fine, not about integers!
>>> > [1] NA
>>> > Warning message:
>>> > In mean.default(tmpTibble2[, 3]) :
>>> > argument is not numeric or logical: returning NA
>>
>>
>>> > > ### before I just created tmpTibble2 I found myself trying to add a column to tmpTibble
>>> > > tmpTibble$newNum <- tmpTibble[,2]/10 # NO!
>>> > > tmpTibble[["newNum"]] <- tmpTibble[,2]/10 # NO!
>>> > > ### and oddly enough ...
>>> > > add_column(tmpTibble,newNum = tmpTibble[,2]/10) # NO!
>>> > Error: Each variable must be a 1d atomic vector or list.
>>> > Problem variables: 'newNum'
>>
>>
>>
>>> > I discovered this when I hit odd behaviour after using read_spss() from the
>>> > haven package for the first time as it seemed to be offering a step forward
>>> > over good old read.spss() from the excellent foreign package. I am reporting it
>>> > here not directly to Prof. Wickham as the issues seem rather general though I'm
>>> > guessing that it needs to be fixed with a fix to tibble. Or perhaps I've
>>> > completely missed something.
>>
>>> > TIA,
>>
>>> > Chris
>>
>>> > ______________________________________________
>>> > R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>> > https://stat.ethz.ch/mailman/listinfo/r-help
>>> > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>> > and provide commented, minimal, self-contained, reproducible code.
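
For anyone finding this thread in the archives, the two points made above (use [[ to get a vector out of a column, and use a matrix if a row has to come back as a vector) can be sketched in base R alone. This is only an illustration under stated assumptions: dta is a made-up stand-in for tmpTibble, and drop = FALSE is used to imitate the way a tibble's [ never simplifies to a vector, so nothing here depends on the tibble package being installed.

```r
# Base-R stand-in for the thread's tmpTibble (assumption: drop = FALSE
# imitates a tibble's `[`, which never simplifies to a vector).
dta <- data.frame(ID = letters, num = 1:26, stringsAsFactors = FALSE)

col_df  <- dta[, 2, drop = FALSE]  # one-column data frame, like tmpTibble[, 2]
col_vec <- dta[[2]]                # plain integer vector, like tmpTibble[[2]]

# mean() on the one-column data frame warns and returns NA;
# on the extracted vector it gives the expected answer.
mean_df  <- suppressWarnings(mean(col_df))
mean_vec <- mean(col_vec)

# Adding a derived column: build it from the vector, not from a data frame.
dta$newNum <- dta[["num"]] / 10

# Row indexing keeps a data frame because the columns differ in type.
row1 <- dta[1, ]

# If a row really has to be a vector, hold the numeric columns in a matrix.
m <- as.matrix(dta[, c("num", "newNum")])
row1_vec <- m[1, ]
```

With a real tibble the same [[ extraction, e.g. mean(tmpTibble[[2]]) or tmpTibble$num, avoids the NA, and the derived column can be added as tmpTibble$newNum <- tmpTibble[["num"]] / 10.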