I'm trying to read a comma delimited dataset that uses '.' for NA. I
found that if the last field on a line was a missing '.'
it was not read as NA, but just a '.', and the life variable was made a
factor. The data looks like this,

income,imr,region,oilexprt,imr80,gnp80,life
Afghanistan,75,400.0,4,0,185.0,.,37.5
Algeria,400,86.3,2,1,20.5,1920,50.7
Argentina,1191,59.6,1,0,40.8,2390,67.1
Australia,3426,26.7,4,0,12.5,9820,71.0
Austria,3350,23.7,3,0,14.8,10230,70.4
Bangladesh,100,124.3,4,0,139.0,120,.
Belgium,3346,17.0,3,0,11.2,12180,70.6
Benin,81,109.6,2,0,109.6,300,.
Bolivia,200,60.4,1,0,77.3,570,49.7
Brazil,425,170.0,1,0,84.0,2020,60.7
Britain,2503,17.5,3,0,12.6,7920,72.0
Burma,73,200.0,4,0,195.0,180,42.3
...

and I used

> nations <- read.delim("~/sasuser/data/nations2.dat",na.strings=".",row.name=1,sep=",",header=TRUE)
> nations[1:10,]
            income   imr region oilexprt imr80 gnp80 life
Afghanistan     75 400.0      4        0 185.0    NA 37.5
Algeria        400  86.3      2        1  20.5  1920 50.7
Argentina     1191  59.6      1        0  40.8  2390 67.1
Australia     3426  26.7      4        0  12.5  9820 71.0
Austria       3350  23.7      3        0  14.8 10230 70.4
Bangladesh     100 124.3      4        0 139.0   120    .
Belgium       3346  17.0      3        0  11.2 12180 70.6
Benin           81 109.6      2        0 109.6   300    .
Bolivia        200  60.4      1        0  77.3   570 49.7
Brazil         425 170.0      1        0  84.0  2020 60.7
> summary(nations$life)
   . 27.0 31.6 32.0 32.6 34.5 35.0 36.0 36.7 36.9 37.1 37.2 37.5 38.5 38.8 40.5
   2    1    1    1    1    1    2    1    1    1    1    1    1    3    1    1
40.6 41.0 41.2 42.3 43.5 43.7 44.9 45.1 46.8 47.5 47.6 49.0 49.7 49.9 50.0 50.5
   1    6    1    4    1    1    1    1    1    3    1    3    1    1    2    1

After much hair-pulling, I discovered that the data lines for Bangladesh
and Benin contained a trailing space after the '.'. Removing those made
the problem go away, but that shouldn't happen and I wonder if this is
still a potential problem for others. I'm using R 1.8.1.

-Michael

--
Michael Friendly     Email: friendly at yorku.ca
Professor, Psychology Dept.
York University      Voice: 416 736-5115 x66249 Fax: 416 736-5814
4700 Keele Street    http://www.math.yorku.ca/SCS/friendly.html
Toronto, ONT  M3J 1P3 CANADA

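
[Editorial note] The trailing spaces described above are invisible in printed output, so one way to find them is to scan the raw lines before parsing. The following is only a sketch: the three-line `lines` vector is a stand-in for the file (with the trailing space on the Bangladesh record assumed from the description), and for a real file one would presumably start from `readLines()` on the file path instead.

```r
# Stand-in for the raw file contents; the Bangladesh record ends in
# ". " (dot followed by a space), as described in the post.
lines <- c(
  "income,imr,region,oilexprt,imr80,gnp80,life",
  "Austria,3350,23.7,3,0,14.8,10230,70.4",
  "Bangladesh,100,124.3,4,0,139.0,120,. "
)

# TRUE for each line that ends in one or more spaces or tabs
has_trailing_ws <- grepl("[ \t]+$", lines)

# Positions (line numbers) of the records with trailing whitespace
bad_lines <- which(has_trailing_ws)
print(bad_lines)
```

In this excerpt only the third line is flagged, which is the Bangladesh record.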
On Wed, 6 Oct 2004, Michael Friendly wrote:

> I'm trying to read a comma delimited dataset that uses '.' for NA. I
> found that if the last field on a line was a missing '.'
> it was not read as NA, but just a '.', and the life variable was made a
> factor.
>
> [...]
>
> After much hair-pulling, I discovered that the data lines for Bangladesh
> and Benin contained a trailing space after the '.'. Removing those made
> the problem go away, but that shouldn't happen and I wonder if this is
> still a potential problem for others. I'm using R 1.8.1.

It should happen. The entry there is ". " and that is not an NA string.
If you use a non-whitespace delimiter, all whitespace is significant.

--
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595

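
[Editorial note] The point about significant whitespace can be reproduced without the original file. This is a minimal sketch, assuming a two-record excerpt of the posted data with the trailing space restored on the Bangladesh line; it uses `read.table()` with its `text` argument (available in later R versions than the 1.8.1 mentioned in the thread) rather than the original `read.delim()` call on a file.

```r
# Two-record excerpt; the space after the final "." is deliberate.
txt <- c(
  "income,imr,region,oilexprt,imr80,gnp80,life",
  "Austria,3350,23.7,3,0,14.8,10230,70.4",
  "Bangladesh,100,124.3,4,0,139.0,120,. "
)

# The header has one fewer field than the data lines, so the first
# column becomes the row names.
nations <- read.table(text = txt, sep = ",", header = TRUE,
                      na.strings = ".", stringsAsFactors = FALSE)

# The last field of the Bangladesh record is ". ", which is not equal to
# the NA string ".", so nothing in the column is marked missing and the
# column cannot be converted to numeric.
is.character(nations$life)
anyNA(nations$life)
```

With `stringsAsFactors = TRUE` (the default in older R versions) the same column comes back as a factor, which matches the `summary()` output in the original post.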
Dear Mike,

This is a trap, but it's not a bug, and to "correct" it wouldn't be
appropriate, I think. That is, the string ". " wasn't declared as NA. One
could do the following to avoid the problem:

> read.csv("c:/temp/test.txt", na.strings=".", strip.white=TRUE)
            income   imr region oilexprt imr80 gnp80 life
Afghanistan     75 400.0      4        0 185.0    NA 37.5
Algeria        400  86.3      2        1  20.5  1920 50.7
Argentina     1191  59.6      1        0  40.8  2390 67.1
Australia     3426  26.7      4        0  12.5  9820 71.0
Austria       3350  23.7      3        0  14.8 10230 70.4
Bangladesh     100 124.3      4        0 139.0   120   NA
Belgium       3346  17.0      3        0  11.2 12180 70.6
Benin           81 109.6      2        0 109.6   300   NA
Bolivia        200  60.4      1        0  77.3   570 49.7
Brazil         425 170.0      1        0  84.0  2020 60.7
Britain       2503  17.5      3        0  12.6  7920 72.0
Burma           73 200.0      4        0 195.0   180 42.3

Regards,
 John

> -----Original Message-----
> From: r-help-bounces at stat.math.ethz.ch
> [mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of
> Michael Friendly
> Sent: Wednesday, October 06, 2004 8:18 AM
> To: R-help
> Subject: [R] read.delim problem with trailing spaces
>
> [...]

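
[Editorial note] The `strip.white=TRUE` suggestion can be checked on a small in-memory excerpt. The sketch below assumes the same two-record excerpt of the posted data (trailing space restored on the Bangladesh line) and uses `read.table(text = ...)` instead of a file path. The second call, which lists both "." and ". " in `na.strings`, is an alternative not proposed in the thread and is shown only for comparison.

```r
# Two-record excerpt; the space after the final "." is deliberate.
txt <- c(
  "income,imr,region,oilexprt,imr80,gnp80,life",
  "Austria,3350,23.7,3,0,14.8,10230,70.4",
  "Bangladesh,100,124.3,4,0,139.0,120,. "
)

# Approach from the thread: strip leading/trailing whitespace from each
# field, so ". " becomes "." and matches the NA string.
stripped <- read.table(text = txt, sep = ",", header = TRUE,
                       na.strings = ".", strip.white = TRUE)

# Alternative (not from the thread): declare the padded form as an NA
# string as well, leaving other whitespace untouched.
padded <- read.table(text = txt, sep = ",", header = TRUE,
                     na.strings = c(".", ". "))

# In both cases the column is numeric and Bangladesh is missing.
is.numeric(stripped$life)
is.na(stripped$life)
is.na(padded$life)
```

Stripping whitespace is the more general of the two, since it also covers fields padded with more than one space; listing extra NA strings only handles the exact variants named.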
Michael Friendly wrote:

> I'm trying to read a comma delimited dataset that uses '.' for NA. I
> found that if the last field on a line was a missing '.'
> it was not read as NA, but just a '.', and the life variable was made a
> factor.
>
> [...]
>
> and I used
>
> nations <-
> read.delim("~/sasuser/data/nations2.dat",na.strings=".",row.name=1,sep=",",header=TRUE)

You need to specify the argument strip.white = TRUE

BTW: Do you know that R-2.0.0 has been released?

Uwe