I have been trying to read in a large data set using read.table, but
I've only been able to grab the first 50,871 rows of the total 122,269 rows.

> f <- read.table("http://www.cs.odu.edu/~fmccown/R/Tchange_rates_crawled.dat",
+                 header=TRUE, nrows=123000, comment.char="", sep="\t")
> length(f$change_rate)
[1] 50871

From searching the email archives, I believe this is due to size limits
of a data frame. So...

1) Why doesn't read.table give a proper warning when it doesn't place
every read item into a data frame?

2) Why isn't there a parameter to read.table that allows the user to
specify which columns s/he is interested in? This functionality would
allow extraneous columns to be ignored, which would improve memory usage.

I've already made a work-around by loading the table into MySQL and
doing a select on the two columns I need. I just wonder why the above
two points aren't implemented. Maybe they are and I'm totally missing it.

Thanks,
Frank

-- 
Frank McCown
Old Dominion University
http://www.cs.odu.edu/~fmccown/
Frank McCown wrote:
> I have been trying to read in a large data set using read.table, but
> I've only been able to grab the first 50,871 rows of the total 122,269 rows.
>
> > f <- read.table("http://www.cs.odu.edu/~fmccown/R/Tchange_rates_crawled.dat",
> +                 header=TRUE, nrows=123000, comment.char="", sep="\t")
> > length(f$change_rate)
> [1] 50871
>
> From searching the email archives, I believe this is due to size limits
> of a data frame. So...

I think you believe wrongly...

> 1) Why doesn't read.table give a proper warning when it doesn't place
> every read item into a data frame?

That isn't the problem; it is a somewhat obscure interaction between
quote= and sep= that is doing you in. Remove the sep="\t" and/or add
quote="" and your life should be easier.

> 2) Why isn't there a parameter to read.table that allows the user to
> specify which columns s/he is interested in? This functionality would
> allow extraneous columns to be ignored which would improve memory usage.

There is! Check out colClasses:

> cc <- rep("NULL", 5)
> cc[4:5] <- NA
> f <- read.table("http://www.cs.odu.edu/~fmccown/R/Tchange_rates_crawled.dat",
+                 header=TRUE, sep="\t", quote="", colClasses=cc)
> str(f)
'data.frame':   122271 obs. of  2 variables:
 $ recovered  : Factor w/ 5 levels "changed","identical",..: 5 3 3 3 2 2 2 2 1 2 ...
 $ change_rate: num  1 0 0 1 0 0 0 0 0 0 ...

-- 
   O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B
  c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
 (*) \(*) -- University of Copenhagen   Denmark      Ph:  (+45) 35327918
~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)              FAX: (+45) 35327907
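The quote/sep interaction described above can be reproduced with a tiny
in-memory example. This is a hedged sketch using made-up data (not
Frank's file) showing how a single stray apostrophe swallows the
following lines:

```r
# Two-column tab-separated data; the apostrophe in "don't" is the trap.
txt <- "a\tb\n1\tdon't\n2\tx\n3\ty\n"

# With sep="\t" given, the default quote = "\"'" is still in effect, so
# the ' in "don't" opens a quoted field that runs on until the next '
# (here: end of input), silently merging the remaining lines into one field.
bad  <- read.table(textConnection(txt), header=TRUE, sep="\t")

# Disabling quoting restores one data-frame row per input line.
good <- read.table(textConnection(txt), header=TRUE, sep="\t", quote="")

nrow(bad)   # fewer rows than the input has
nrow(good)  # 3 rows, as intended
```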
Frank McCown schrieb:
> I have been trying to read in a large data set using read.table, but
> I've only been able to grab the first 50,871 rows of the total 122,269 rows.
> [...]
> From searching the email archives, I believe this is due to size limits
> of a data frame. So...

It is not due to size limits, see below.

> 1) Why doesn't read.table give a proper warning when it doesn't place
> every read item into a data frame?

In your case, read.table behaves as documented. The ' character is one of
the standard quoting characters. Some (but very few) of the entries contain
single ' characters, so sometimes more than ten thousand lines are just
treated as a single entry. Try using quote="" to disable quoting, as
documented on the help page:

> f <- read.table("http://www.cs.odu.edu/~fmccown/R/Tchange_rates_crawled.dat",
+                 header=TRUE, nrows=123000, comment.char="", sep="\t", quote="")
> length(f$change_rate)
[1] 122271

> 2) Why isn't there a parameter to read.table that allows the user to
> specify which columns s/he is interested in? This functionality would
> allow extraneous columns to be ignored which would improve memory usage.

There is (colClasses, works as documented). Try

> f <- read.table("http://www.cs.odu.edu/~fmccown/R/Tchange_rates_crawled.dat",
+                 header=TRUE, nrows=123000, comment.char="", sep="\t", quote="",
+                 colClasses=c("character","NULL","NULL","NULL","NULL"))
> dim(f)
[1] 122271      1

Did you read the help page?

Regards,
Martin
The problem is somewhere in the file, probably with tab characters, as
removing sep="\t" from your call does the job:

> dfr <- read.table("Tchange_rates_crawled.dat", header=TRUE)
> str(dfr)
'data.frame':   122271 obs. of  5 variables:
[skipped]
> dfr <- read.table("Tchange_rates_crawled.dat", header=TRUE,
+                   stringsAsFactors=FALSE)
> str(dfr)
'data.frame':   122271 obs. of  5 variables:
[skipped]

R has no such limitation as the one you're talking about. A couple of
hours ago my R successfully read in 460,000 rows of data from a text
file into a data frame using read.table. I have also processed even
larger data sets (more than 500,000 rows).
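A quick way to diagnose this kind of problem before loading the whole
file is count.fields(), which reports how many fields each input line
yields under a given sep/quote setting. This is a hedged sketch with
made-up data, not Frank's file:

```r
# Same trap as in the thread: an apostrophe mid-field.
txt <- "a\tb\n1\tdon't\n2\tx\n"

# Count the logical lines seen with quoting on vs. off; a shortfall in
# the first count means some quote character is swallowing whole lines.
n_quoted   <- length(count.fields(textConnection(txt), sep="\t"))
n_unquoted <- length(count.fields(textConnection(txt), sep="\t", quote=""))

n_unquoted  # 3: header plus two data lines
```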