Asis Hallab
2013-Aug-05 11:11 UTC
[R] Function read.table(…) reads in only 40% of my table's lines
Dear R experts,

I have a large table saved in a file called "plant_genome.gff". The file has 481848 lines in nine TAB-delimited columns and is 53 megabytes in size. For anyone who might know the GFF3 format: the table holds a plant genome's annotation.

If I read in the table with
    read.table( "plant_genome.gff" )
I get the following error:
    "line 2 did not have 12 elements"

If I read in the table with
    read.table( "plant_genome.gff", sep="\t" )
no error or warning is given, but my resulting table has only 193547 instead of the expected 481848 rows! 60% of the lines are omitted.

Passing in the argument
    as.is = TRUE
or setting the columns' classes with
    colClasses = c( "character", …, "integer", "integer", "numeric", "character", … )
    # columns 4 and 5 are integers, column 6 is numeric, all others are characters
does not resolve the problem either.

If I read in the file with readLines, manually split the lines with
    strsplit(…)
and combine them into a data.frame with
    as.data.frame( do.call( "rbind", splitted.lines ), colClasses=… )
I get the expected and correct data.frame representing my GFF3 data.

My questions are:
1) Am I using read.table wrongly, or did I miss something in the documentation?
2) Or is this a known problem with large TAB-delimited tables whose columns contain white-space and are not surrounded by quotes?

Unfortunately, due to the unpublished nature of the plant genome, I am not allowed to give access to the GFF table that causes this problem.

Any ideas, hints, help - or comments on my stupidity at having missed something important - will be much appreciated!

Cheers!
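A minimal sketch of the readLines/strsplit workaround described above, with the parts the post elides with "…" filled in as assumptions: the tab split, dropping any "#" comment lines, and converting columns 4-6 after the fact (as.data.frame() itself has no colClasses argument):

    ## Manual workaround sketch; "plant_genome.gff" is the file named in the post.
    raw.lines <- readLines( "plant_genome.gff" )
    ## Assumption: drop GFF comment lines, if any, so every row has nine fields.
    raw.lines <- raw.lines[ !grepl( "^#", raw.lines ) ]
    splitted.lines <- strsplit( raw.lines, "\t", fixed = TRUE )
    gff <- as.data.frame( do.call( "rbind", splitted.lines ),
                          stringsAsFactors = FALSE )
    ## Assumption: columns 4 and 5 are integers, column 6 is numeric (per the post).
    gff[[ 4 ]] <- as.integer( gff[[ 4 ]] )
    gff[[ 5 ]] <- as.integer( gff[[ 5 ]] )
    gff[[ 6 ]] <- as.numeric( gff[[ 6 ]] )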
jim holtman
2013-Aug-05 12:00 UTC
[R] Function read.table(…) reads in only 40% of my table's lines
Couple of things to try. The file may contain an extra quote character, so add
    quote = ''
as one of the parameters. It might also contain comment characters, so try
    comment.char = ""
Take a look at your file, determine which line was the last complete one that was read, and see whether there might be a problem in that line or the preceding ones.

On Mon, Aug 5, 2013 at 7:11 AM, Asis Hallab <asis.hallab@gmail.com> wrote:

> Dear R experts,
>
> I have a large table saved in a file called "plant_genome.gff". The
> file has 481848 lines in nine TAB-delimited columns and is
> 53 megabytes in size.
> For anyone who might know the GFF3 format: the table holds a plant
> genome's annotation.
>
> If I read in the table with
>     read.table( "plant_genome.gff" )
> I get the following error:
>     "line 2 did not have 12 elements"
>
> If I read in the table with
>     read.table( "plant_genome.gff", sep="\t" )
> no error or warning is given, but my resulting table has only 193547
> instead of the expected 481848 rows! 60% of the lines are omitted.
>
> Passing in the argument
>     as.is = TRUE
> or setting the columns' classes with
>     colClasses = c( "character", …, "integer", "integer", "numeric", "character", … )
>     # columns 4 and 5 are integers, column 6 is numeric, all others are characters
> does not resolve the problem either.
>
> If I read in the file with readLines, manually split the lines with
>     strsplit(…)
> and combine them into a data.frame with
>     as.data.frame( do.call( "rbind", splitted.lines ), colClasses=… )
> I get the expected and correct data.frame representing my GFF3 data.
>
> My questions are:
> 1) Am I using read.table wrongly, or did I miss something in the documentation?
> 2) Or is this a known problem with large TAB-delimited tables whose
> columns contain white-space and are not surrounded by quotes?
>
> Unfortunately, due to the unpublished nature of the plant genome, I am
> not allowed to give access to the GFF table that causes this problem.
>
> Any ideas, hints, help - or comments on my stupidity at having missed
> something important - will be much appreciated!
>
> Cheers!

--
Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?
Tell me what you want to do, not how you want to do it.
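Putting both suggestions together, a sketch along these lines (file name taken from the original post; stringsAsFactors = FALSE is an added assumption) should show whether quoting or comment characters are the culprit. read.table's defaults are quote = "\"'" and comment.char = "#", both of which can silently merge or truncate lines when a GFF attributes column contains quotes or "#":

    ## Disable quote and comment handling so every TAB-delimited line survives.
    gff <- read.table( "plant_genome.gff", sep = "\t",
                       quote = "", comment.char = "",
                       stringsAsFactors = FALSE )
    nrow( gff )   # compare against the expected 481848 lines

    ## count.fields() helps locate the problem lines Jim mentions:
    ## any entry different from 9 marks a row worth inspecting.
    fields <- count.fields( "plant_genome.gff", sep = "\t",
                            quote = "", comment.char = "" )
    which( fields != 9 )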
David Winsemius
2013-Aug-05 16:20 UTC
[R] Function read.table(…) reads in only 40% of my table's lines
On Aug 5, 2013, at 4:11 AM, Asis Hallab wrote:

> Dear R experts,
>
> I have a large table saved in a file called "plant_genome.gff". The
> file has 481848 lines in nine TAB-delimited columns and is
> 53 megabytes in size.
> For anyone who might know the GFF3 format: the table holds a plant
> genome's annotation.
>
> If I read in the table with
>     read.table( "plant_genome.gff" )
> I get the following error:
>     "line 2 did not have 12 elements"
>
> If I read in the table with
>     read.table( "plant_genome.gff", sep="\t" )
> no error or warning is given, but my resulting table has only 193547
> instead of the expected 481848 rows! 60% of the lines are omitted.
>
> Passing in the argument
>     as.is = TRUE
> or setting the columns' classes with
>     colClasses = c( "character", …, "integer", "integer", "numeric", "character", … )
>     # columns 4 and 5 are integers, column 6 is numeric, all others are characters
> does not resolve the problem either.
>
> If I read in the file with readLines, manually split the lines with
>     strsplit(…)

That doesn't unambiguously define the process.

> and combine them into a data.frame with
>     as.data.frame( do.call( "rbind", splitted.lines ), colClasses=… )
> I get the expected and correct data.frame representing my GFF3 data.
>
> My questions are:
> 1) Am I using read.table wrongly, or did I miss something in the documentation?
> 2) Or is this a known problem with large TAB-delimited tables whose
> columns contain white-space and are not surrounded by quotes?

I would think this is not "a known problem" but rather "entirely expected and documented behavior". The read.table function uses white-space as its default separation rule, and the largeness of the file has nothing to do with it: you would get the same problem with a very small example. If you want tab separation, then use read.delim, which has sep="\t" as its default.

--
David Winsemius
Alameda, CA, USA
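A hedged sketch of that route: read.delim() already uses sep = "\t" and comment.char = "" by default, but its header = TRUE and quote = "\"" defaults are probably not what a header-less GFF3 file wants, so they are overridden here (the overrides and stringsAsFactors = FALSE are assumptions, not something stated in the thread):

    ## read.delim variant; file name as in the original post.
    gff <- read.delim( "plant_genome.gff", header = FALSE,
                       quote = "", stringsAsFactors = FALSE )
    str( gff )   # check column types and how many rows were read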