Andreas Leha
2015-Jul-09 01:42 UTC
[R] why must a named colClasses in read.table be in correct order
Hi Henrik,

Thanks for your reply.

I am not (yet) convinced, though. The help page for read.table
mentions named colClasses, and if I specify colClasses for only some
of the columns, the names are taken into account:

--8<---------------cut here---------------start------------->8---
kkk <- c("a\tb",
         "3.14\tx")
str(read.table(textConnection(kkk),
               sep="\t",
               header = TRUE))

str(read.table(textConnection(kkk),
               sep="\t",
               header = TRUE,
               colClasses=c(b="character")))
--8<---------------cut here---------------end--------------->8---

What am I missing?

Best,
Andreas

On 09/07/2015 02:21, Henrik Bengtsson wrote:
> read.table() does not make use of names(colClasses) - only its values.
> Because of this, ordering is critical, as you noted. It shouldn't be
> too hard to add support for a named `colClasses` argument of
> utils::read.table(), but someone needs to convince the R core team
> that this is a good idea.
>
> As an alternative, see R.filesets::readDataFrame() for a
> read.table()-like function that matches names(colClasses) to column
> names, if they exist.
>
> /Henrik
> (author of R.filesets)
>
> On Wed, Jul 8, 2015 at 5:41 PM, Andreas Leha
> <andreas.leha at med.uni-goettingen.de> wrote:
>> Hi all,
>>
>> Apparently, the colClasses argument to read.table needs to be in the
>> order of the columns *even when it is named*. Why is that? And where
>> would I find it in the documentation?
>>
>> Here is a MWE:
>>
>> --8<---------------cut here---------------start------------->8---
>> kkk <- c("a\tb",
>>          "3.14\tx")
>> read.table(textConnection(kkk),
>>            sep="\t",
>>            header = TRUE)
>>
>> cclasses=c(b="character",
>>            a="numeric")
>>
>> read.table(textConnection(kkk),
>>            sep="\t",
>>            header = TRUE,
>>            colClasses = cclasses) ## <--- error
>>
>> read.table(textConnection(kkk),
>>            sep="\t",
>>            header = TRUE,
>>            colClasses = cclasses[order(names(cclasses))])
>> --8<---------------cut here---------------end--------------->8---
>>
>>
>> Thanks,
>> Andreas
>>
>> ______________________________________________
>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
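The order(names(cclasses)) call in the MWE only helps when the columns happen to be in alphabetical order. A slightly more general workaround might look like the sketch below: read the header first, then reorder the named vector to match it. This assumes every column is named in cclasses; hdr and ordered are illustrative names that do not appear in the thread, and text= is used in place of textConnection() only to avoid leaving connections open.

```r
kkk <- c("a\tb",
         "3.14\tx")

cclasses <- c(b = "character",
              a = "numeric")

## Read the header plus one row just to learn the column order.
hdr <- names(read.table(text = kkk, sep = "\t", header = TRUE, nrows = 1))

## Reorder the named vector so that it matches the file's column order.
## Assumption: every name in hdr is present in names(cclasses).
ordered <- cclasses[hdr]

dat <- read.table(text = kkk, sep = "\t", header = TRUE,
                  colClasses = ordered)
str(dat)  ## a is read as numeric, b as character
```

Because the reordered vector is already positionally correct, this should not depend on whether read.table() honours the names.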
Henrik Bengtsson
2015-Jul-09 02:54 UTC
[R] why must a named colClasses in read.table be in correct order
Thanks for insisting; I was wrong, and I'm happy to see that there is
indeed code intended for named 'colClasses', which even goes back to
2004. But as you report, the names only work when
length(colClasses) < cols (which also explains why I thought it was not
supported). I'm not sure if that _strictly less than_ test is
intentional or a mistake, but I would propose the following patch:

[HB-X201]{hb}: svn diff src\library\utils\R\readtable.R
Index: src/library/utils/R/readtable.R
===================================================================
--- src/library/utils/R/readtable.R (revision 68642)
+++ src/library/utils/R/readtable.R (working copy)
@@ -139,7 +139,7 @@
     if (rlabp) col.names <- c("row.names", col.names)

     nmColClasses <- names(colClasses)
-    if(length(colClasses) < cols)
+    if(length(colClasses) <= cols)
         if(is.null(nmColClasses)) {
             colClasses <- rep_len(colClasses, cols)
         } else {


Your example works with this patch. I've made it source():able so you
can try it out (if you cannot source() https://, then download the
file and source it locally):

source("https://gist.githubusercontent.com/HenrikBengtsson/ed1eeb41a1b4d6c43b47/raw/ebe58f76e518dd014423bea466a5c93d2efd3c99/readtable-fix.R")

kkk <- c("a\tb",
         "3.14\tx")

colClasses <- c(a="numeric", b="character")
data <- read.table(textConnection(kkk),
                   sep="\t",
                   header = TRUE,
                   colClasses = colClasses)
str(data)
### 'data.frame': 1 obs. of 2 variables:
### $ a: num 3.14
### $ b: chr "x"

## Does not work with the unpatched utils::read.table(), but does with the patch
data <- read.table(textConnection(kkk),
                   sep="\t",
                   header = TRUE,
                   colClasses = rev(colClasses))
str(data)
### 'data.frame': 1 obs. of 2 variables:
### $ a: num 3.14
### $ b: chr "x"

Let's hope that the above is a (10-year-old) typo, and that changing a
< to a <= adds support for named 'colClasses', which is really useful
functionality.

/Henrik
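To illustrate why the one-character change matters, here is a rough sketch of the expansion step. It is not the actual readtable.R source: expand_col_classes() and its patched argument are hypothetical names made up for this illustration, and the name-matching part is only an assumption about what the else branch of the patched code does.

```r
## Hypothetical helper, not part of utils: expands a colClasses vector
## to one entry per column, with either the "<" or the "<=" test.
expand_col_classes <- function(colClasses, col.names, patched = FALSE) {
  cols <- length(col.names)
  nm <- names(colClasses)

  ## Unpatched test is strictly "less than"; the proposed patch uses "<=".
  expand <- if (patched) length(colClasses) <= cols else length(colClasses) < cols

  ## Otherwise the vector is used positionally and its names are ignored.
  if (!expand) return(colClasses)

  ## Unnamed: recycle to the number of columns.
  if (is.null(nm)) return(rep_len(colClasses, cols))

  ## Named: match names to column names, leaving NA for unnamed columns.
  out <- rep_len(NA_character_, cols)
  names(out) <- col.names
  i <- match(nm, col.names, nomatch = 0L)
  out[i[i > 0L]] <- colClasses[i > 0L]
  out
}

cn <- c("a", "b")

## Partial, named: names are honoured even without the patch.
expand_col_classes(c(b = "character"), cn)

## Full, named, wrong order: left as-is without the patch ...
expand_col_classes(c(b = "character", a = "numeric"), cn)

## ... but reordered by name with it.
expand_col_classes(c(b = "character", a = "numeric"), cn, patched = TRUE)
```

In the unpatched case the full-length vector falls through untouched, so "character" ends up applied to column a and "numeric" to column b, which matches the error reported in the MWE.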
Andreas Leha
2015-Jul-09 03:15 UTC
[R] why must a named colClasses in read.table be in correct order
Hi Henrik,

Thank you very much for looking into this. And thanks for the patch!

Yes, let's hope this is a typo that gets fixed.

Regards,
Andreas