Hi,

I have been using 'read.table' regularly to read tab-delimited text files with data. No problem, until now.
Now I have a file that appeared to read fine, and the data inside looks correct (structure etc.), except that I only got 15000+ rows out of the expected 24000. Using 'readLines' instead, and breaking up the data by tabs, gives me the expected result.
I do not understand why this is happening, and I can't find anything obvious in the data to explain the behaviour...
Does anybody have an explanation? Something to watch out for?

If I run this I get the incomplete set:

> oldprobesets<-read.table("All_norm_calls.txt",sep="\t",header=T,stringsAsFactors=F)
> dim(oldprobesets)
[1] 15733    11

but I get the right data if I use:

> probesets<-readLines("All_norm_calls.txt")
> tmp<-matrix(ncol=11,nrow=24000)
> for (i in 1:24000) tmp[i,]<-unlist(strsplit(probesets[i+1],split="\t"))
> colnames(tmp)<-unlist(strsplit(probesets[1],split="\t"))
> probesets<-data.frame(tmp,stringsAsFactors=F)
> dim(probesets)
[1] 24000    11

Here's my sessionInfo output:

> sessionInfo()
R version 2.7.0 (2008-04-22)
i386-pc-mingw32

locale:
LC_COLLATE=English_United Kingdom.1252;LC_CTYPE=English_United Kingdom.1252;LC_MONETARY=English_United Kingdom.1252;LC_NUMERIC=C;LC_TIME=English_United Kingdom.1252

attached base packages:
[1] stats     graphics  grDevices datasets  tcltk     utils     methods
[8] base

other attached packages:
[1] limma_2.14.0   svSocket_0.9-5 svIO_0.9-5     R2HTML_1.59    svMisc_0.9-5
[6] svIDE_0.9-5

loaded via a namespace (and not attached):
[1] tools_2.7.0

Thanks!

Jose

--
Dr. Jose I. de las Heras                 Email: J.delasHeras at ed.ac.uk
The Wellcome Trust Centre for Cell Biology    Phone: +44 (0)131 6513374
Institute for Cell & Molecular Biology        Fax:   +44 (0)131 6507360
Swann Building, Mayfield Road
University of Edinburgh
Edinburgh EH9 3JR
UK

--
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.

J.delasHeras at ed.ac.uk wrote:
> I have been using 'read.table' regularly to read tab-delimited text
> files with data. No problem, until now.
> Now I have a file that appeared to have read fine, and the data inside
> looks correct (structure etc), except I only had 15000+ rows out of
> the expected 24000. Using 'readLines' instead, and breaking up the
> data by tabs, gives me the expected result.
> I do not understand why this is happening and I can't find anything
> obvious in the data to explain the behaviour...
> Does anybody have an explanation? something to watch out for?

Hmm:

- completely blank lines
- filling
- quotes

My bet would be on the last one. Does read.delim work better?

Also, just in case: check length(probesets) after the readLines call.

> If I run this I get the incomplete set:
>
>> oldprobesets<-read.table("All_norm_calls.txt",sep="\t",header=T,stringsAsFactors=F)
>> dim(oldprobesets)
> [1] 15733    11
>
> but I get the right data if I use:
>
>> probesets<-readLines("All_norm_calls.txt")
>> tmp<-matrix(ncol=11,nrow=24000)
>> for (i in 1:24000) tmp[i,]<-unlist(strsplit(probesets[i+1],split="\t"))
>> colnames(tmp)<-unlist(strsplit(probesets[1],split="\t"))
>> probesets<-data.frame(tmp,stringsAsFactors=F)
>> dim(probesets)
> [1] 24000    11
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

--
   O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B
  c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
 (*) \(*) -- University of Copenhagen   Denmark      Ph:  (+45) 35327918
~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)              FAX: (+45) 35327907
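
To make the "quotes" suspicion concrete, here is a minimal sketch on a made-up file. The toy data, the stray apostrophes, and the variable names are assumptions for illustration only; nothing here is taken from All_norm_calls.txt. It relies on the documented defaults: read.table uses quote = "\"'" while read.delim uses quote = "\"".

```r
# Toy tab-delimited file: 1 header line + 4 data lines.
# Rows 2 and 4 each contain a single apostrophe (hypothetical data).
toy <- c("id\tdesc\tvalue",
         "p1\tkinase\t1",
         "p2\t3' UTR probe\t2",
         "p3\tphosphatase\t3",
         "p4\t5' end probe\t4")
path <- tempfile(fileext = ".txt")
writeLines(toy, path)

# Sanity check suggested in the reply: how many physical lines are there?
n_lines <- length(readLines(path))

# Default read.table quoting treats ' as a quote character, so the
# apostrophe in row 2 can open a "string" that only closes at the
# apostrophe in row 4, swallowing the lines in between.
with_quotes <- read.table(path, sep = "\t", header = TRUE,
                          stringsAsFactors = FALSE)

# Disabling quoting altogether reads every line as its own row.
no_quotes <- read.table(path, sep = "\t", header = TRUE,
                        stringsAsFactors = FALSE, quote = "")

# read.delim only treats " as a quote, so apostrophes are harmless here.
via_delim <- read.delim(path, stringsAsFactors = FALSE)

# Compare row counts of the three approaches.
c(lines = n_lines - 1L,
  read.table = nrow(with_quotes),
  read.table_noquote = nrow(no_quotes),
  read.delim = nrow(via_delim))
```

If the real file behaves like this toy one, adding quote="" to the original read.table call (or switching to read.delim) would be the first thing to try. A file containing a stray double quote would still need quote="" even with read.delim.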

Try looking at the result of count.fields to diagnose it.

On Tue, Sep 23, 2008 at 5:19 AM, <J.delasHeras at ed.ac.uk> wrote:
> I have been using 'read.table' regularly to read tab-delimited text files
> with data. No problem, until now.
> Now I have a file that appeared to have read fine, and the data inside looks
> correct (structure etc), except I only had 15000+ rows out of the expected
> 24000. Using 'readLines' instead, and breaking up the data by tabs, gives me
> the expected result.
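
One way the count.fields suggestion might be applied, sketched on a made-up file (the toy data and variable names are assumptions, not the poster's file). The idea is to count fields per line twice, once with the default quoting (quote = "\"'") and once with quoting disabled, and see where the two disagree.

```r
# Hypothetical toy file standing in for the real data.
toy <- c("id\tdesc\tvalue",
         "p1\tkinase\t1",
         "p2\t3' UTR probe\t2",
         "p3\tphosphatase\t3",
         "p4\t5' end probe\t4")
path <- tempfile(fileext = ".txt")
writeLines(toy, path)

# Field counts using count.fields' default quoting rules.
cf_default <- count.fields(path, sep = "\t")

# Field counts with quoting switched off: one count per physical line.
cf_raw <- count.fields(path, sep = "\t", quote = "")

# Tabulate both; a clean file should show a single count (here 3)
# in each table, with no NA entries and no odd field counts.
table(cf_default, useNA = "ifany")
table(cf_raw)

# Entries that are NA or differ from the header's field count point at
# records affected by quoting.
suspect <- which(is.na(cf_default) | cf_default != cf_raw[1])
suspect
```

On the real file, the same two calls with "All_norm_calls.txt" in place of `path` should show whether the missing rows are explained by quote characters or by lines with an unexpected number of fields.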