Thanks Ivan for the answer.

So it confirms my first thought that these two functions are equivalent when applied to a "simple" data.frame.

The reason I was asking is that I have gotten used to using length() in my scripts. It works perfectly and I understand it easily. But to be honest, ncol() is more intuitive to most users (especially novices), so I was thinking about switching to this function instead (all my data.frames are created from read.csv() or similar functions, so there should not be any issue). But before doing that, I want to be sure that it is not going to produce unexpected results.

Thank you,
Ivan

--
Dr. Ivan Calandra
TraCEr, laboratory for Traceology and Controlled Experiments
MONREPOS Archaeological Research Centre and
Museum for Human Behavioural Evolution
Schloss Monrepos
56567 Neuwied, Germany
+49 (0) 2631 9772-243
https://www.researchgate.net/profile/Ivan_Calandra

On 31/03/2020 16:00, Ivan Krylov wrote:
> On Tue, 31 Mar 2020 14:47:54 +0200
> Ivan Calandra <calandra at rgzm.de> wrote:
>
>> On a simple data.frame (i.e. each element is a vector), ncol() and
>> length() will give the same result.
>> Are they just equivalent on such objects, or are there differences in
>> some cases?
>
> I am not aware of any exceptions to ncol(dataframe)==length(dataframe)
> (in fact, ncol(x) is dim(x)[2L] and ?dim says that dim(dataframe)
> returns c(length(attr(dataframe, 'row.names')), length(dataframe))), but
> watch out for AsIs columns which can have columns of their own:
>
> x <- data.frame(I(volcano))
> dim(x)
> # [1] 87 1
> length(x)
> # [1] 1
> dim(x[,1])
> # [1] 87 61
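The equivalence discussed above can be checked directly. The following is only a minimal sketch: the CSV text and the column names (id, height, weight, m) are made up for illustration and do not come from the thread.

```r
# Hypothetical data: a small CSV parsed the same way a file would be.
csv_text <- "id,height,weight
1,170,65
2,182,80
3,165,58"
df <- read.csv(text = csv_text)

# For a data.frame built this way, both calls count the columns.
length(df)                       # number of list elements, i.e. columns
ncol(df)                         # second element of dim(df)
identical(ncol(df), length(df))

# An AsIs matrix column still counts as ONE column for both functions,
# even though the column itself is a 3 x 2 matrix.
m <- matrix(1:6, nrow = 3)
df2 <- data.frame(id = 1:3, m = I(m))
c(ncol(df2), length(df2))
dim(df2$m)
```

In other words, the AsIs case does not make ncol() and length() disagree; it only means that neither of them counts the columns nested inside such a column.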
Dear Ivan,

if I enter ncol in the console, I get

function (x)
dim(x)[2L]
<bytecode: 0x5559e9429030>
<environment: namespace:base>

indicating that the function dim is called. The function dim has a method for data.frame; see methods("dim").

The dim method for data.frame is

dim.data.frame
function (x)
c(.row_names_info(x, 2L), length(x))
<bytecode: 0x5559eb80da40>
<environment: namespace:base>

Hence, it calls length on the provided data.frame. In addition, some "magic" with .row_names_info is performed, where

base:::.row_names_info
function (x, type = 1L)
.Internal(shortRowNames(x, type))
<bytecode: 0x5559ece50160>
<environment: namespace:base>

Best
Matthias

On 31.03.20 at 16:10, Ivan Calandra wrote:
> Thanks Ivan for the answer.
>
> So it confirms my first thought that these two functions are equivalent
> when applied to a "simple" data.frame.
> [...]
>
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

--
Prof. Dr. Matthias Kohl
www.stamats.de
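The call chain described above (ncol, then dim, then dim.data.frame, then length) can be traced on a toy object. This is a sketch under the assumption of a small example data.frame; the variable names (via_ncol, via_dim, and so on) are invented for illustration.

```r
df <- data.frame(x = 1:4, y = letters[1:4], z = c(TRUE, FALSE, TRUE, FALSE))

# ncol() is defined as dim(x)[2L]; for a data.frame, dim() dispatches to
# dim.data.frame(), whose second element is length(x).
via_ncol   <- ncol(df)
via_dim    <- dim(df)[2L]
via_method <- dim.data.frame(df)[2L]
via_length <- length(df)

# The first element of dim() comes from the row names, not the columns.
n_rows <- .row_names_info(df, 2L)

c(via_ncol, via_dim, via_method, via_length)  # the same column count four times
c(n_rows, nrow(df))                           # the row count twice
```

All four routes to the column count should agree, which is why the two functions cannot differ on a data.frame.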
I should have added: for a data.frame, dim(x)[2L] therefore reduces to length(x).

On 31.03.20 at 16:21, Prof. Dr. Matthias Kohl wrote:
> Dear Ivan,
>
> if I enter ncol in the console, I get
> [...]

--
Prof. Dr. Matthias Kohl
www.stamats.de
Hi Ivan,

Like Ivan Krylov, I'm not aware of circumstances for simple dataframes where ncol(DF) does not equal length(DF).

As I understand it, the choice between ncol() and length() matters when you're examining an object that may no longer be a dataframe, for example one returned from a function like sapply(), since sapply() may simplify its result to a plain vector. The same happens when a single column is extracted with DF[, 1]. Much has been made of this simplification, but it's simple enough in your code to test whether ncol() returns NULL for such an object:

> dim(testDF)
[1] 14  8
> length(testDF)
[1] 8
> length(testDF[1])
[1] 1
> length(testDF[, 1])
[1] 14
>
> ncol(testDF)
[1] 8
> ncol(testDF[1])
[1] 1
> ncol(testDF[, 1])
NULL
> is.null(ncol(testDF[, 1]))
[1] TRUE
>

HTH, Bill.

W. Michels, Ph.D.

On Tue, Mar 31, 2020 at 7:11 AM Ivan Calandra <calandra at rgzm.de> wrote:
> Thanks Ivan for the answer.
>
> So it confirms my first thought that these two functions are equivalent
> when applied to a "simple" data.frame.
> [...]
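The NULL result shown above can be reproduced, and avoided, on a toy object. The following is a sketch only: the data.frame and the names v, one_col and means are assumptions made for illustration, and drop = FALSE and NCOL() are base R options that the thread itself does not mention.

```r
df <- data.frame(a = 1:3, b = c(2.5, 3.5, 4.5))

# Extracting one column with [, j] drops to a plain vector by default.
v <- df[, 1]
is.null(ncol(v))      # a vector has no dim attribute
length(v)             # now the number of rows, not of columns

# drop = FALSE keeps the result a one-column data.frame.
one_col <- df[, 1, drop = FALSE]
ncol(one_col)

# sapply() simplifies here to a named numeric vector.
means <- sapply(df, mean)
is.null(ncol(means))
NCOL(means)           # NCOL() treats a vector as a single column
```

When the object might have been simplified, checking is.null(ncol(x)) or using NCOL(x) is safer than assuming a column count.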
As others have pointed out, ncol calls the length function (via dim), so you are pretty safe in getting the same result when it is applied to the results of functions like read.csv (there will be a big difference if you ever apply those functions to a matrix or some other data structure).

One thing that I have not seen yet is a comparison of timing, so here goes:

> library(microbenchmark)
> microbenchmark(
+   length = length(iris),
+   ncol = ncol(iris)
+ )
Unit: nanoseconds
   expr  min   lq mean median   uq   max neval
 length  700  750  869    800  800  7400   100
   ncol 2400 2500 2981   2600 2700 31900   100

So ncol takes about 3 times as long to run as length on the iris data frame (5 columns). You can rerun the above code with data frames closer in size to the ones you will be using to see if that makes any difference. But also notice that the units are nanoseconds, so the median time for ncol to run (2600 ns) is less than the time it takes light to travel a kilometer in a vacuum, or about the time it takes light to go 1/3 of a mile through a fiber optic cable (en.wikipedia.org/wiki/Microsecond). If this is used as part of a simulation or other repeated procedure and it is done one million times, then you will add about 2 seconds to the overall run. If this is just part of code where length/ncol will be called fewer than 10 times, then nobody is going to notice.

So the trade-off of moving from length to ncol is a slight decrease in speed for an increase in readability. I think that I would go with the readability myself.

On Tue, Mar 31, 2020 at 8:11 AM Ivan Calandra <calandra at rgzm.de> wrote:
> Thanks Ivan for the answer.
>
> So it confirms my first thought that these two functions are equivalent
> when applied to a "simple" data.frame.
> [...]

--
Gregory (Greg) L. Snow Ph.D.
538280 at gmail.com
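Two points from the message above can be made concrete: the difference on a matrix, and the estimate of about 2 seconds per million calls. This is a rough sketch; the matrix and the variable names are invented for illustration, and the timings are simply the medians quoted above, not new measurements.

```r
# On a matrix, length() counts cells while ncol() counts columns.
m <- matrix(1:12, nrow = 3)
length(m)
ncol(m)

# On a data.frame, both count columns (iris has 5).
length(datasets::iris)
ncol(datasets::iris)

# Overhead estimate from the median timings quoted above (nanoseconds).
median_length_ns <- 800
median_ncol_ns   <- 2600
n_calls          <- 1e6
extra_seconds <- (median_ncol_ns - median_length_ns) * n_calls / 1e9
extra_seconds
```

The estimate comes to 1.8 seconds, which matches the "about 2 seconds" figure; actual timings will vary by machine and R version.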
Thank you Greg for the insights!

I agree with you that the slight decrease in speed is worth the gain in readability, and I'll change my length() calls to ncol().

Best,
Ivan

On 03/04/2020 17:45, Greg Snow wrote:
> [...]
> So the trade-off of moving from length to ncol is a slight decrease in
> speed for an increase of readability. I think that I would go with
> the readability myself.