Robert M. Flight
2011-Feb-15 17:21 UTC
[R] expected behavior when parsing lines with special characters
Say I have a tab-delimited table I want to read into R. What should I expect to happen if some of the entries contain the character " ' "? I thought it would read the file fine, but that is not what happens. Instead, all the values in between two " ' "s get read into one field, and things are just seriously messed up. Is this a bug, and besides removing the offending characters, is there a fix?

Example Input file:

testFile.txt:
3499    9031    424823   COP'B2           118094989  XP_422637.2
3499    7955    114454   copb2            50080158   NP_001001940.1
3499    7227    45757    betaCop          24584107   NP_524836.2
3499    7165    1278426  AgaP_AGAP004798  158297839  XP_318012.4
3499    6239    177779   F38E11.5         17540286   NP_501671.1
3499    4896    2540050  sec'27           19113604   NP_596811.1
3499    4932    852740   SEC27            6321301    NP_011378.1
3499    28985   2897447  KLLA0B01958g     50303353   XP_451618.1
3499    33169   4621659  AGOS_AFL118W     45198403   NP_985432.1
3499    148305  2682116  MGG_10504        145615762  XP_366285.2
3499    5141    2709504  NCU07319.1       32414251   XP_327605.1
3499    3702    820842   AT3G15980        30683862   NP_850592.1
3499    3702    841666   AT1G52360        15218215   NP_175645.1
3499    3702    844339   AT1G79990        30699476   NP_178116.2
3499    4530    4340097  Os06g0143900     115466360  NP_001056779.1

testDat <- read.table('testFile.txt',sep='\t')
testDat

     V1     V2      V3
1  3499   9031  424823
2  3499   4932  852740
3  3499  28985 2897447
4  3499  33169 4621659
5  3499 148305 2682116
6  3499   5141 2709504
7  3499   3702  820842
8  3499   3702  841666
9  3499   3702  844339
10 3499   4530 4340097

V4
1  COPB2\t118094989\tXP_422637.2\n3499\t7955\t114454\tcopb2\t50080158\tNP_001001940.1\n3499\t7227\t45757\tbetaCop\t24584107\tNP_524836.2\n3499\t7165\t1278426\tAgaP_AGAP004798\t158297839\tXP_318012.4\n3499\t6239\t177779\tF38E11.5\t17540286\tNP_501671.1\n3499\t4896\t2540050\tsec27
2  SEC27
3  KLLA0B01958g
4  AGOS_AFL118W
5  MGG_10504
6  NCU07319.1
7  AT3G15980
8  AT1G52360
9  AT1G79990
10 Os06g0143900

          V5             V6
1   19113604    NP_596811.1
2    6321301    NP_011378.1
3   50303353    XP_451618.1
4   45198403    NP_985432.1
5  145615762    XP_366285.2
6   32414251    XP_327605.1
7   30683862    NP_850592.1
8   15218215    NP_175645.1
9   30699476    NP_178116.2
10 115466360 NP_001056779.1

I would appreciate any feedback.

Thanks,

-Robert

> sessionInfo()
R version 2.12.1 (2010-12-16)
Platform: x86_64-pc-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

loaded via a namespace (and not attached):
[1] tools_2.12.1

Robert M. Flight, Ph.D.
University of Louisville Bioinformatics Laboratory
University of Louisville
Louisville, KY

PH 502-852-1809 (HSC)
PH 502-852-0467 (Belknap)
EM robert.flight at louisville.edu
EM rflight79 at gmail.com

Williams and Holland's Law:
If enough data is collected, anything may be proven by statistical methods.
Peter Langfelder
2011-Feb-15 17:25 UTC
[R] expected behavior when parsing lines with special characters
On Tue, Feb 15, 2011 at 9:21 AM, Robert M. Flight <rflight79 at gmail.com> wrote:
> Say I have a tab-delimited table I want to read into R. What should I
> expect to happen if some of the entries contain the character " ' "? I
> thought it would read the file fine, but that is not what happens.
> Instead, all the values in between two " ' "s get read into one field,
> and things are just seriously messed up. Is this a bug, and besides
> removing the offending characters, is there a fix?

Yes, use the argument quote="\"" (so that only the double quote acts as a quoting character) or even quote="" to disable quoting altogether. See ?read.table

Peter
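For anyone who wants to try the two settings side by side, here is a minimal sketch. It assumes a three-row excerpt of the posted data written to a temporary file; the names rows, path, fixed and dq_only are only illustrative and do not come from the thread.

```r
# Hypothetical three-row excerpt of the tab-delimited file from the original post.
rows <- c(
  "3499\t9031\t424823\tCOP'B2\t118094989\tXP_422637.2",
  "3499\t7955\t114454\tcopb2\t50080158\tNP_001001940.1",
  "3499\t4896\t2540050\tsec'27\t19113604\tNP_596811.1"
)
path <- tempfile(fileext = ".txt")
writeLines(rows, path)

# quote = "" disables quoting, so the apostrophes are read as ordinary characters.
fixed <- read.table(path, sep = "\t", quote = "", stringsAsFactors = FALSE)

# quote = "\"" keeps the double quote as the only quoting character.
dq_only <- read.table(path, sep = "\t", quote = "\"", stringsAsFactors = FALSE)

print(dim(fixed))   # rows and columns actually read
print(fixed$V4)     # the column that contained the apostrophes
```

Both calls should give the same data frame for this excerpt, since it contains no double quotes. quote="\"" is probably the better choice when some fields are wrapped in double quotes; quote="" treats every quote character as plain data.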
David Wolfskill
2011-Feb-15 17:26 UTC
[R] expected behavior when parsing lines with special characters
On Tue, Feb 15, 2011 at 12:21:18PM -0500, Robert M. Flight wrote:
> Say I have a tab-delimited table I want to read into R. What should I
> expect to happen if some of the entries contain the character " ' "? I
> thought it would read the file fine, but that is not what happens.
> Instead, all the values in between two " ' "s get read into one field,
> and things are just seriously messed up. Is this a bug, and besides
> removing the offending characters, is there a fix?
>
> Example Input file:
>
> testFile.txt:
> 3499 9031 424823 COP'B2 118094989 XP_422637.2
> 3499 7955 114454 copb2 50080158 NP_001001940.1
> 3499 7227 45757 betaCop 24584107 NP_524836.2
> ...
>
> testDat <- read.table('testFile.txt',sep='\t')
> testDat

I believe you want to use:

testDat <- read.table('testFile.txt',sep='\t',quote="")

Ref.:

quote: the set of quoting characters. To disable quoting altogether, use 'quote = ""'. See 'scan' for the behaviour on quotes embedded in quotes. Quoting is only considered for columns read as character, which is all of them unless 'colClasses' is specified.

> ...

Peace,
david
--
David H. Wolfskill david at catwhisker.org
Depriving a girl or boy of an opportunity for education is evil.

See http://www.catwhisker.org/~david/publickey.gpg for my public key.
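A quick way to check that a given quote setting splits every line into the expected number of fields is count.fields() from the utils package, which takes the same sep and quote arguments. The sketch below again assumes a small excerpt of the posted data in a temporary file; rows, path and n_fields are illustrative names.

```r
# Hypothetical three-row excerpt of the tab-delimited file from the original post.
rows <- c(
  "3499\t9031\t424823\tCOP'B2\t118094989\tXP_422637.2",
  "3499\t7955\t114454\tcopb2\t50080158\tNP_001001940.1",
  "3499\t4896\t2540050\tsec'27\t19113604\tNP_596811.1"
)
path <- tempfile(fileext = ".txt")
writeLines(rows, path)

# Count the fields on each line with quoting disabled.
n_fields <- count.fields(path, sep = "\t", quote = "")
print(n_fields)

# Every line of this excerpt should have the same six fields.
stopifnot(all(n_fields == 6))
```

If the counts differ between lines, or differ from the number of columns you expect, it is likely that a quote or comment character is being interpreted somewhere in the file.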
jim holtman
2011-Feb-15 17:28 UTC
[R] expected behavior when parsing lines with special characters
Check out the arguments for read.table, especially 'quote'; you probably want quote='' to suppress the special meaning of the quote characters. You might also need comment.char in the future.

On Tue, Feb 15, 2011 at 12:21 PM, Robert M. Flight <rflight79 at gmail.com> wrote:
> Say I have a tab-delimited table I want to read into R. What should I
> expect to happen if some of the entries contain the character " ' "? I
> thought it would read the file fine, but that is not what happens.
> Instead, all the values in between two " ' "s get read into one field,
> and things are just seriously messed up. Is this a bug, and besides
> removing the offending characters, is there a fix?
>
> ...

--
Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?
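On the comment.char point: read.table treats "#" as the start of a comment by default, so an identifier containing "#" would lose the rest of its line. The sketch below assumes a made-up identifier "SEC#27" (it is not in the posted data) and turns off both quoting and comment handling; rows, path and dat are illustrative names.

```r
# Two hypothetical rows: one with a '#' (invented for illustration) and one
# with an apostrophe taken from the original post.
rows <- c(
  "3499\t4932\t852740\tSEC#27\t6321301\tNP_011378.1",
  "3499\t4896\t2540050\tsec'27\t19113604\tNP_596811.1"
)
path <- tempfile(fileext = ".txt")
writeLines(rows, path)

# quote = "" disables quoting; comment.char = "" disables comment handling,
# so both "'" and "#" are read as ordinary characters.
dat <- read.table(path, sep = "\t", quote = "", comment.char = "",
                  stringsAsFactors = FALSE)

print(dat$V4)
```

Setting both arguments to "" is a reasonable default for machine-generated tab-delimited files where no field is quoted and no line is a comment.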