Asis Hallab
2013-Aug-05 11:11 UTC
[R] Function read.table(…) reads in only 40% of my table's lines
Dear R experts,

I have a large table saved in a file called "plant_genome.gff". The file has 481848 lines in nine TAB-delimited columns and is 53 megabytes in size. For anyone who might know the GFF3 format: the table holds a plant genome's annotation.

If I read in the table with
    read.table( "plant_genome.gff" )
I get the following error:
    "line 2 did not have 12 elements"

If I read in the table with
    read.table( "plant_genome.gff", sep="\t" )
no error or warning is given, but my resulting table has only 193547 instead of the expected 481848 rows! 60% of the lines are omitted.

Passing in the argument
    as.is = TRUE
or setting the columns' classes with
    colClasses = c( "character", …, "integer", "integer", "numeric", "character", … )
    # columns 4 and 5 are integers, column 6 is numeric, all others are characters
does not resolve the problem either.

If I read in the file with readLines, manually split the lines with
    strsplit(…)
and combine them into a data.frame with
    as.data.frame( do.call( "rbind", splitted.lines ), colClasses=… )
I get the expected and correct data.frame representing my GFF3 data.

My questions are:
1) Am I using read.table wrongly, or did I miss something in the documentation?
2) Or is this a known problem with large TAB-delimited tables whose columns contain white-space and are not surrounded by quotes?

Unfortunately, due to the unpublished nature of the plant genome, I am not allowed to give access to the GFF table that causes this problem.

Any ideas, hints, help - or comments on my stupidity at having missed something important - will be much appreciated!

Cheers!
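A minimal sketch of the readLines/strsplit workaround described above, with the parts the post elides with "…" filled in as assumptions: the tab split, dropping any "#" comment lines, and converting columns 4-6 after the fact (as.data.frame() itself has no colClasses argument):

    ## Manual workaround sketch; "plant_genome.gff" is the file named in the post.
    raw.lines <- readLines( "plant_genome.gff" )
    ## Assumption: drop GFF comment lines, if any, so every row has nine fields.
    raw.lines <- raw.lines[ !grepl( "^#", raw.lines ) ]
    splitted.lines <- strsplit( raw.lines, "\t", fixed = TRUE )
    gff <- as.data.frame( do.call( "rbind", splitted.lines ),
                          stringsAsFactors = FALSE )
    ## Assumption: columns 4 and 5 are integers, column 6 is numeric (per the post).
    gff[[ 4 ]] <- as.integer( gff[[ 4 ]] )
    gff[[ 5 ]] <- as.integer( gff[[ 5 ]] )
    gff[[ 6 ]] <- as.numeric( gff[[ 6 ]] )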
jim holtman
2013-Aug-05 12:00 UTC
[R] Function read.table(…) reads in only 40% of my table's lines
Couple of things to try. The file may contain an extra quote character, so add
    quote = ''
as one of the parameters. It might also contain comment characters, so try
    comment.char = ""
Take a look at your file, determine which line was the last complete one that was read, and see whether there might be a problem in that line or the preceding ones.

On Mon, Aug 5, 2013 at 7:11 AM, Asis Hallab <asis.hallab@gmail.com> wrote:

> Dear R experts,
>
> I have a large table saved in a file called "plant_genome.gff". The
> file has 481848 lines in nine TAB-delimited columns and is
> 53 megabytes in size.
> For anyone who might know the GFF3 format: the table holds a plant
> genome's annotation.
>
> If I read in the table with
>     read.table( "plant_genome.gff" )
> I get the following error:
>     "line 2 did not have 12 elements"
>
> If I read in the table with
>     read.table( "plant_genome.gff", sep="\t" )
> no error or warning is given, but my resulting table has only 193547
> instead of the expected 481848 rows! 60% of the lines are omitted.
>
> Passing in the argument
>     as.is = TRUE
> or setting the columns' classes with
>     colClasses = c( "character", …, "integer", "integer", "numeric", "character", … )
>     # columns 4 and 5 are integers, column 6 is numeric, all others are characters
> does not resolve the problem either.
>
> If I read in the file with readLines, manually split the lines with
>     strsplit(…)
> and combine them into a data.frame with
>     as.data.frame( do.call( "rbind", splitted.lines ), colClasses=… )
> I get the expected and correct data.frame representing my GFF3 data.
>
> My questions are:
> 1) Am I using read.table wrongly, or did I miss something in the documentation?
> 2) Or is this a known problem with large TAB-delimited tables whose
> columns contain white-space and are not surrounded by quotes?
>
> Unfortunately, due to the unpublished nature of the plant genome, I am
> not allowed to give access to the GFF table that causes this problem.
>
> Any ideas, hints, help - or comments on my stupidity at having missed
> something important - will be much appreciated!
>
> Cheers!

--
Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?
Tell me what you want to do, not how you want to do it.
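Putting both suggestions together, a sketch along these lines (file name taken from the original post; stringsAsFactors = FALSE is an added assumption) should show whether quoting or comment characters are the culprit. read.table's defaults are quote = "\"'" and comment.char = "#", both of which can silently merge or truncate lines when a GFF attributes column contains quotes or "#":

    ## Disable quote and comment handling so every TAB-delimited line survives.
    gff <- read.table( "plant_genome.gff", sep = "\t",
                       quote = "", comment.char = "",
                       stringsAsFactors = FALSE )
    nrow( gff )   # compare against the expected 481848 lines

    ## count.fields() helps locate the problem lines Jim mentions:
    ## any entry different from 9 marks a row worth inspecting.
    fields <- count.fields( "plant_genome.gff", sep = "\t",
                            quote = "", comment.char = "" )
    which( fields != 9 )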
David Winsemius
2013-Aug-05 16:20 UTC
[R] Function read.table(…) reads in only 40% of my table's lines
On Aug 5, 2013, at 4:11 AM, Asis Hallab wrote:

> Dear R experts,
>
> I have a large table saved in a file called "plant_genome.gff". The
> file has 481848 lines in nine TAB-delimited columns and is
> 53 megabytes in size.
> For anyone who might know the GFF3 format: the table holds a plant
> genome's annotation.
>
> If I read in the table with
>     read.table( "plant_genome.gff" )
> I get the following error:
>     "line 2 did not have 12 elements"
>
> If I read in the table with
>     read.table( "plant_genome.gff", sep="\t" )
> no error or warning is given, but my resulting table has only 193547
> instead of the expected 481848 rows! 60% of the lines are omitted.
>
> Passing in the argument
>     as.is = TRUE
> or setting the columns' classes with
>     colClasses = c( "character", …, "integer", "integer", "numeric", "character", … )
>     # columns 4 and 5 are integers, column 6 is numeric, all others are characters
> does not resolve the problem either.
>
> If I read in the file with readLines, manually split the lines with
>     strsplit(…)

That doesn't unambiguously define the process.

> and combine them into a data.frame with
>     as.data.frame( do.call( "rbind", splitted.lines ), colClasses=… )
> I get the expected and correct data.frame representing my GFF3 data.
>
> My questions are:
> 1) Am I using read.table wrongly, or did I miss something in the documentation?
> 2) Or is this a known problem with large TAB-delimited tables whose
> columns contain white-space and are not surrounded by quotes?

I would think this is not "a known problem" but rather "entirely expected and documented behavior". The read.table function uses white-space as its default separation rule, and the largeness of the file has nothing to do with it: you would get the same problem with a very small example. If you want tab separation, then use read.delim, which has sep="\t" as its default.

--
David Winsemius
Alameda, CA, USA
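A hedged sketch of that route: read.delim() already uses sep = "\t" and comment.char = "" by default, but its header = TRUE and quote = "\"" defaults are probably not what a header-less GFF3 file wants, so they are overridden here (the overrides and stringsAsFactors = FALSE are assumptions, not something stated in the thread):

    ## read.delim variant; file name as in the original post.
    gff <- read.delim( "plant_genome.gff", header = FALSE,
                       quote = "", stringsAsFactors = FALSE )
    str( gff )   # check column types and how many rows were read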