thr3ads.net - R help - [R] watch out for quotes in data files [Jul 2001]

Douglas Bates

2001-Jul-10 03:53 UTC

[R] watch out for quotes in data files

I have just spent a day trying to determine why I seemed to be unable
to read a file of microarray expression results into R properly.  The
file was produced by the Dchip software developed by Li and Wong at
Harvard's Department of Biostatistics.  It contains rows of
tab-delimited fields in the order

Probe set identifier
Probe set description
Array 1 expression
Array 1 call
Array 2 expression
Array 2 call
...

plus an extra tab (which I think is due to a programming glitch).

There are 7130 rows, including the column headers, for results from
Affymetrix Hu6800 chips. 

When I read this file using

  read.table(filename, sep = "\t", head = TRUE)

I got only 3720 rows.  Furthermore count.fields(filename, sep = "\t")
gave a result of length 7130 but several of the rows were reported as
having only two fields instead of 15 like the other rows.
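
A small sketch of this kind of diagnosis, using a made-up four-column
file rather than the real dChip output (the file contents and the names
demo_file, n_default, n_noquote and suspect are illustrative assumptions).
Comparing count.fields() with and without quote handling points at the
lines where a quote character interferes. The exact values reported for
affected lines may depend on the R version: the run described above gave 2,
while current documentation describes NA for lines inside a multi-line
quoted field.

```r
# Hypothetical miniature of the data file: 4 tab-separated columns.
# Rows A01 and A03 contain an apostrophe in the description.
demo_file <- tempfile(fileext = ".txt")
writeLines(c(
  "ProbeSet\tDescription\tExpr1\tCall1",
  "A01\tgene alpha, 5' end\t101.5\tP",
  "A02\tgene beta, 5 prime end\t88.2\tA",
  "A03\tgene gamma, 3' end\t45.0\tP",
  "A04\tgene delta\t12.3\tA"
), demo_file)

# Default quoting treats both ' and " as quote characters.
n_default <- count.fields(demo_file, sep = "\t")
# With quoting disabled, every line is counted on its own.
n_noquote <- count.fields(demo_file, sep = "\t", quote = "")

print(n_default)
print(n_noquote)

# Lines whose count changes (or becomes NA) are affected by a quote character.
suspect <- which(is.na(n_default) | n_default != n_noquote)
print(suspect)
```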

It seemed to me that the important characteristic of these rows was
their having a very long "Probe set description" and I wasted quite a
bit of time looking for possible buffer overflows that might be
triggered by this.

When I finally came to my senses and created a much smaller input file
that only contained a few rows, including one that was giving an
aberrant field count, I could directly examine the results of scan()
applied to it.  I noticed that the second field for the aberrant line
contained all the subsequent lines and then I saw that its description
included "5'" (as in the 5' end of the sequence versus the 3' end).
Other descriptions had this written as "5 prime" but this one used
"5'".  What was happening was that everything from there to the next
"'" character in the file was being included as part of that
description.
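
The effect can be reproduced on a tiny file. This is only a sketch with
invented rows (small_file, fields_default and fields_raw are hypothetical
names), but it shows scan() swallowing everything between two apostrophes
into a single field when the default quote characters are in force.

```r
# Hypothetical three-line excerpt (no header); two descriptions contain "'".
small_file <- tempfile(fileext = ".txt")
writeLines(c(
  "A01\tgene alpha, 5' end\t101.5\tP",
  "A02\tgene beta, 5 prime end\t88.2\tA",
  "A03\tgene gamma, 3' end\t45.0\tP"
), small_file)

# scan() treats both ' and " as quote characters by default.
fields_default <- scan(small_file, what = "", sep = "\t", quiet = TRUE)
# Disabling quoting keeps every field separate.
fields_raw <- scan(small_file, what = "", sep = "\t", quote = "", quiet = TRUE)

print(length(fields_default))  # fewer fields than the file really has
print(length(fields_raw))      # 3 lines x 4 fields

# The longest field shows what was absorbed between the two apostrophes.
print(fields_default[which.max(nchar(fields_default))])
```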

I could read the file properly by adding the optional argument
quote = "" to the call to read.table.

The moral of the story is to watch out for molecular biologists who
use unpaired quote characters in their descriptions.
-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
r-help mailing list -- Read ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !)  To: r-help-request at
stat.math.ethz.ch
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._

Samak, Vele [EQRE]

2001-Jul-10 13:50 UTC

[R] watch out for quotes in data files

I have had similar problems with R 1.2.2. Every time a string contains a
single quote ('), it reads up to a maximum of 8192 characters into the item,
creating memory and parsing problems. I now make it a habit to remove all
single quotes from the text or replace them with double quotes.
-- 
Vele Samak, Vice President
Global Quantitative Research
Salomon Smith Barney 
7 WTC, New York, NY 10048, 212-783-7007
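
The workaround described in this message can also be applied in memory,
without editing the file on disk. A minimal sketch, assuming a small
invented file (raw_file, clean and cleaned are hypothetical names): the
lines are read as plain text, single quotes are stripped, and the result is
parsed with read.table().

```r
# Hypothetical tab-delimited file with apostrophes in two descriptions.
raw_file <- tempfile(fileext = ".txt")
writeLines(c(
  "ProbeSet\tDescription\tExpr1\tCall1",
  "A01\tgene alpha, 5' end\t101.5\tP",
  "A02\tgene beta, 5 prime end\t88.2\tA",
  "A03\tgene gamma, 3' end\t45.0\tP",
  "A04\tgene delta\t12.3\tA"
), raw_file)

# Read the file as raw text lines; no field or quote parsing happens here.
lines <- readLines(raw_file)
# Drop every single-quote character before parsing.
clean <- gsub("'", "", lines, fixed = TRUE)

# Parse the cleaned text; the default quote handling is now harmless.
cleaned <- read.table(text = clean, sep = "\t", header = TRUE)
print(dim(cleaned))
```

Note that this changes the data itself ("5' end" becomes "5 end"), which
may or may not be acceptable for the descriptions.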



Henrik Bengtsson

2001-Jul-13 09:07 UTC

[R] watch out for quotes in data files

I had exactly the same problem with some GenePix Results Data files. The
solution is to add the argument quote="" to read.table() and/or scan().
In your case I believe you should use

  read.table(filename, sep = "\t", quote = "", header = TRUE)

instead. You don't have to modify the source files.
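
A short sketch of the difference on an invented nine-row file (dat_file,
bad and good are hypothetical names, and the data are made up). The
apostrophes are placed in rows 6 and 8, so that the lines read.table()
inspects first to work out the column count are clean and only the quote
handling differs between the two calls.

```r
# Hypothetical data: 9 rows, 4 tab-separated columns, plus a header line.
ids  <- sprintf("A%02d", 1:9)
desc <- paste("gene", 1:9)
desc[6] <- "gene 6, 5' end"   # opens a "quoted string" under default quoting
desc[8] <- "gene 8, 3' end"   # the next apostrophe closes it

dat_file <- tempfile(fileext = ".txt")
writeLines(c(
  "ProbeSet\tDescription\tExpr1\tCall1",
  paste(ids, desc, 10 * (1:9), "P", sep = "\t")
), dat_file)

# Default quoting: the text between the two apostrophes is read as part of
# one field, so several lines collapse into a single row.
bad <- read.table(dat_file, sep = "\t", header = TRUE)

# Quoting disabled: every line becomes its own row.
good <- read.table(dat_file, sep = "\t", quote = "", header = TRUE)

print(nrow(bad))
print(nrow(good))
```

In current versions of R, read.delim() uses quote = "\"" by default, so it
does not treat the apostrophe as a quote character either.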

Henrik Bengtsson