Bob O'Hara
2014-May-13 13:35 UTC
[R] File coding problem: how to read a windows-1252 encoded file
I'm trying to read a text file (actually the FTP file in the command below), and I'm getting an error:

> SpCodes=read.fwf("ftp://ftpext.usgs.gov/pub/er/md/laurel/BBS/DataFiles/SpeciesList.txt",
+   widths=c(7,6,51,51), skip=6, n=5, header=F, stringsAsFactors=F)
Error in substring(x, first, last) :
  invalid multibyte string at '<e0> vent'

The problem is caused by "Dendrocygne à ventre noir", which contains a French character. There are more such characters throughout the file, and I want to read the whole file (I'm only picking out a few lines above to make it easier), so I can't delete them manually. The file is apparently in the ISO-8859 format (or it might be windows-1252), but using that in either encoding= or fileEncoding= doesn't work:

SpCodes=read.fwf("ftp://ftpext.usgs.gov/pub/er/md/laurel/BBS/DataFiles/SpeciesList.txt",
  widths=c(7,6,51,51), skip=6, n=5, header=F,
  stringsAsFactors=F, fileEncoding="ISO-8859")

Can anyone suggest a solution? In case it helps, here's my session info:

> sessionInfo()
R version 3.1.0 (2014-04-10)
Platform: x86_64-pc-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

loaded via a namespace (and not attached):
[1] tools_3.1.0

-- 
Bob O'Hara
Biodiversity and Climate Research Centre
Senckenberganlage 25
D-60325 Frankfurt am Main, Germany
Tel: +49 69 798 40226
Mobile: +49 1515 888 5440
WWW: http://www.bik-f.de/root/index.php?page_id=219
Blog: http://occamstypewriter.org/boboh/
Journal of Negative Results - EEB: www.jnr-eeb.org
Prof Brian Ripley
2014-May-13 14:03 UTC
[R] File coding problem: how to read a windows-1252 encoded file
On 13/05/2014 14:35, Bob O'Hara wrote:
> I'm trying to read a text file (actually the ftp file in command below),
> and I'm getting an error:
>
> > SpCodes=read.fwf("ftp://ftpext.usgs.gov/pub/er/md/laurel/BBS/DataFiles/SpeciesList.txt",
> +   widths=c(7,6,51,51), skip=6, n=5, header=F, stringsAsFactors=F)
> Error in substring(x, first, last) :
>   invalid multibyte string at '<e0> vent'
>
> The problem is caused by "Dendrocygne à ventre noir", which has a French
> character which seems to be causing the problems: there are more throughout
> the file (and I want to read the whole file: I'm picking out bits above to
> make it easier), so I can't manually delete this. The file is apparently in
> the ISO-8859 format (or it might be windows-1252), but using that in either
> encoding= or fileEncoding= doesn't work:

Why do you expect them to? read.fwf reads the file (not read.table), and it does not have those arguments. You need to give it a file/url connection with a specified encoding.

    con <- url("ftp://ftpext.usgs.gov/pub/er/md/laurel/BBS/DataFiles/SpeciesList.txt", encoding = "cp1252")
    read.fwf(con, widths=c(7,6,51,51), skip=6, n=5, header=F)
    close(con)

> SpCodes=read.fwf("ftp://ftpext.usgs.gov/pub/er/md/laurel/BBS/DataFiles/SpeciesList.txt",
>   widths=c(7,6,51,51), skip=6, n=5, header=F,
>   stringsAsFactors=F, fileEncoding="ISO-8859")

-- 
Brian D. Ripley, ripley at stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK Fax: +44 1865 272595
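The connection-with-encoding idea can be tried without network access. The following is only a minimal sketch under stated assumptions: the two-record file, the widths c(5, -1, 25) and the temporary path are made up for illustration and are not the real SpeciesList.txt layout. It writes the byte 0xE0 ("à" in windows-1252) into a local file and reads it back through a file() connection that declares the encoding.

```r
# Hypothetical two-record fixed-width file (not the real USGS layout).
# Byte 0xE0 is "à" in windows-1252 / Latin-1 and is invalid on its own in UTF-8.
path <- tempfile(fileext = ".txt")
writeBin(c(charToRaw("01770 Dendrocygne "), as.raw(0xe0),
           charToRaw(" ventre noir\n01780 Dendrocygne fauve\n")),
         path)

# Declare the encoding on the connection, so the lines are re-encoded to the
# session's native encoding before read.fwf() splits them with substring().
con <- file(path, open = "rt", encoding = "cp1252")
SpCodes <- read.fwf(con, widths = c(5, -1, 25), header = FALSE,
                    colClasses = "character", stringsAsFactors = FALSE)
close(con)  # the connection was opened here, so it is closed here

print(SpCodes)
```

The connection is opened explicitly so that the caller stays responsible for closing it. The negative width skips the separating blank, and colClasses = "character" keeps the leading zero of the code column. The encoding name is handed to iconv, so which spellings are accepted can vary between platforms.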
peter dalgaard
2014-May-13 14:10 UTC
[R] File coding problem: how to read a windows-1252 encoded file
Hi Bob,

Long time no see. The following works for me. In general, I think it is tricky to rely on encodings to be passed on to the appropriate agent, so try to be as specific as possible about it.

    con <- url("ftp://ftpext.usgs.gov/pub/er/md/laurel/BBS/DataFiles/SpeciesList.txt", encoding="Latin1")
    SpCodes=read.fwf(con, widths=c(7,6,51,51), skip=6, n=5, header=F, stringsAsFactors=F)

AFAICT, the root cause is that encoding= is passed by read.fwf() to read.table(), once the columns are split out, but not to the file connection used to get the data for splitting.

It also worked to get the whole enchilada using readLines, convert with iconv() and then use read.fwf on a textConnection to the converted lines.

And, BTW, even though encoding names vary between platforms, "ISO-8859" is almost surely wrong, because there is "ISO-8859-1", "ISO-8859-2", ...

- Peter

-- 
Peter Dalgaard, Professor
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Email: pd.mes at cbs.dk Priv: PDalgd at gmail.com
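The readLines / iconv() / textConnection route mentioned in this reply might look roughly like the sketch below. It is only an illustration under stated assumptions: the two-record local file, the widths c(5, -1, 25) and the choice of "latin1" as the source encoding are made up for the example, and the exact code used in the thread was not posted.

```r
# Hypothetical two-record fixed-width file with one non-ASCII byte (0xE0 = "à").
path <- tempfile(fileext = ".txt")
writeBin(c(charToRaw("01770 Dendrocygne "), as.raw(0xe0),
           charToRaw(" ventre noir\n01780 Dendrocygne fauve\n")),
         path)

# Step 1: read the lines as they are; no encoding is declared at this point.
raw_lines <- readLines(path)

# Step 2: convert explicitly from the assumed source encoding to UTF-8.
utf8_lines <- iconv(raw_lines, from = "latin1", to = "UTF-8")

# Step 3: let read.fwf() split the already-converted lines.
tc <- textConnection(utf8_lines)
SpCodes2 <- read.fwf(tc, widths = c(5, -1, 25), header = FALSE,
                     colClasses = "character", stringsAsFactors = FALSE)
close(tc)

print(SpCodes2)
```

Compared with setting the encoding on the connection, this variant makes the conversion an explicit step, so the intermediate lines can be inspected (for example for NA values returned by iconv() when the assumed encoding is wrong) before any columns are split.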