I have just spent a day trying to determine why I seemed to be unable to read a file of microarray expression results into R properly. The file was produced by the Dchip software developed by Li and Wong at Harvard's Department of Biostatistics. It contains rows of tab-delimited fields in the order

  Probe set identifier
  Probe set description
  Array 1 expression
  Array 1 call
  Array 2 expression
  Array 2 call
  ...

plus an extra tab (which I think is due to a programming glitch). There are 7130 rows, including the column headers, for results from Affymetrix Hu6800 chips.

When I read this file using read.table(filename, sep = "\t", head = TRUE) I got only 3720 rows. Furthermore, count.fields(filename, sep = "\t") gave a result of length 7130, but several of the rows were reported as having only two fields instead of 15 like the other rows.

It seemed to me that the important characteristic of these rows was their very long "Probe set description", and I wasted quite a bit of time looking for possible buffer overflows that might be triggered by this.

When I finally came to my senses and created a much smaller input file containing only a few rows, including one that was giving an aberrant field count, I could directly examine the results of scan() applied to it. I noticed that the second field of the aberrant line contained all the subsequent lines, and then I saw that its description included "5'" (as in the 5' end of the sequence versus the 3' end). Other descriptions had this written as "5 prime" but this one used "5'". The ' was being treated as an opening quote, so everything from there to the next "'" character in the file was being included as part of that description.

I could read the file properly by adding the optional argument quote = "" to the call to read.table.

The moral of the story is to watch out for molecular biologists who use unpaired quote characters in their descriptions.
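The diagnosis described above can be reproduced on a toy file. The following is only a minimal sketch: the probe identifiers, descriptions, and values are invented for illustration and are not taken from the Dchip output. It compares count.fields() with and without quote handling and lists the lines that contain an apostrophe.

```r
# Invented stand-in for the Dchip file: tab-delimited, with 5' and 3'
# appearing inside two of the descriptions.
lines <- c(
  "probe\tdescription\tarray1\tcall1",
  "P1\tkinase, 5' end\t101.5\tP",
  "P2\tphosphatase\t87.2\tA",
  "P3\ttransporter, 3' end\t45.0\tP",
  "P4\treceptor\t12.3\tA"
)
tmp <- tempfile(fileext = ".txt")
writeLines(lines, tmp)

# Default quote set is "\"'": the ' on line 2 opens a quoted string that
# is only closed by the ' on line 4, so the per-line counts are irregular.
n_default <- count.fields(tmp, sep = "\t")

# With quoting disabled every line reports the same number of fields.
n_noquote <- count.fields(tmp, sep = "\t", quote = "")

# Lines containing an apostrophe are the ones worth inspecting.
suspect <- grep("'", readLines(tmp), fixed = TRUE)

print(n_default)
print(n_noquote)  # 4 4 4 4 4
print(suspect)    # 2 4
```

Comparing the two count.fields() results, or simply searching for apostrophes, should point to the offending lines much faster than hunting for buffer overflows.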
-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.- r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html Send "info", "help", or "[un]subscribe" (in the "body", not the subject !) To: r-help-request at stat.math.ethz.ch _._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._
I have had similar problems with R 1.2.2. Every time a string contains the ' single quote, R reads up to a maximum of 8192 characters into the item, creating memory and parsing problems. I make it a habit now to remove all ' (single quotes) from the text or replace them with double quotes.

--
Vele Samak, Vice President
Global Quantitative Research
Salomon Smith Barney
7 WTC, New York, NY 10048, 212-783-7007

-----Original Message-----
From: Douglas Bates [mailto:bates at stat.wisc.edu]
Sent: Monday, July 09, 2001 11:53 PM
To: R-help at stat.math.ethz.ch
Subject: [R] watch out for quotes in data files
I had exactly the same problem with some GenePix Results Data files. The solution is to add an argument quote="" to read.table() and/or scan(). In your case I believe you should use

  read.table(filename, sep = "\t", quote = "", header = TRUE)

instead. You don't have to modify the source files.

Henrik Bengtsson
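As a rough illustration of the fix suggested in this thread, the sketch below reads a small invented file (the probe names and values are hypothetical) once with the default quote handling and once with quote = "". The default read is wrapped in tryCatch() because, depending on the data, it may either merge rows silently or fail outright.

```r
# Invented tab-delimited data with apostrophes inside two descriptions.
lines <- c(
  "probe\tdescription\tarray1\tcall1",
  "P1\tkinase, 5' end\t101.5\tP",
  "P2\tphosphatase\t87.2\tA",
  "P3\ttransporter, 3' end\t45.0\tP",
  "P4\treceptor\t12.3\tA"
)
tmp <- tempfile(fileext = ".txt")
writeLines(lines, tmp)

# Default quoting: rows between the two apostrophes may be swallowed.
default_rows <- tryCatch(
  nrow(read.table(tmp, sep = "\t", header = TRUE)),
  error = function(e) NA_integer_
)

# Quoting disabled: apostrophes are treated as ordinary characters.
fixed <- read.table(tmp, sep = "\t", quote = "", header = TRUE,
                    stringsAsFactors = FALSE)

print(default_rows)
print(nrow(fixed))           # 4
print(fixed$description[1])  # "kinase, 5' end"
```

In current versions of R, read.delim() sets sep = "\t" and quote = "\"" by default, so it does not treat the apostrophe as a quote character either; quote = "" is still the safer choice when a file may also contain stray double quotes.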