I have a large (105MB) data file, tab-delimited with a header. There are some odd characters at the beginning of the file that are preventing it from being read by R.

    > dfTemp = read.delim(filename)
    Error in make.names(col.names, unique = TRUE) :
      invalid multibyte string at '<ff><fe>m'

When I view the file with head, I see:

    ??muni_code parcel_id?

The file is too large to edit in a graphical text editor (gedit). I tried just dropping the header row with

    sed '1 d' <old.txt >new.txt

but then

    > dfTemp = read.delim(filename)
    Error in read.table(file = file, header = header, sep = sep, quote = quote, :
      empty beginning of file

I tried some other shenanigans with sed (with which I am not really experienced) but did not get a usable file. Does anyone have any ideas for how to (a) directly read this into R, skipping the offending line or characters, or (b) preprocess it so that I can read it into R?

Best,
--Lee

R version 2.14.1 (2011-12-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Linux Mint 13

--
Lee Hachadoorian
Assistant Professor in Geography, Dartmouth College
http://freecity.commons.gc.cuny.edu

On 08/11/2012 07:11, Lee Hachadoorian wrote:
> I have a large (105MB) data file, tab-delimited with a header. There are
> some odd characters at the beginning of the file that are preventing it
> from being read by R.
>
> > dfTemp = read.delim(filename)
> Error in make.names(col.names, unique = TRUE) :
>   invalid multibyte string at '<ff><fe>m'
>
> When I view the file with head, I see:
>
> ??muni_code parcel_id?
>
> The file is too large to edit in a graphical text editor (gedit). I
> tried just dropping the header row with
>
> sed '1 d' <old.txt >new.txt
>
> but then
>
> > dfTemp = read.delim(filename)
> Error in read.table(file = file, header = header, sep = sep, quote = quote, :
>   empty beginning of file
>
> I tried some other shenanigans with sed (with which I am not really
> experienced) but did not get a usable file. Does anyone have any ideas
> for how to (a) directly read this into R, skipping the offending line or
> characters, or (b) preprocess it so that I can read it into R?

That is a BOM mark in UCS-2 encoding. Was this file created on Windows?

If so, try using iconv to convert it to UTF-8, or in R use

    read.delim(filename, fileEncoding = "UCS-2LE")

> Best,
> --Lee
>
> R version 2.14.1 (2011-12-22)
> Platform: x86_64-pc-linux-gnu (64-bit)
> Linux Mint 13

Yes, but what locale? See the 'at a minimum' information asked for in your posting guide.

--
Brian D. Ripley, ripley at stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK Fax: +44 1865 272595

On 11/08/2012 02:51 AM, Prof Brian Ripley wrote:
> On 08/11/2012 07:11, Lee Hachadoorian wrote:
>> I have a large (105MB) data file, tab-delimited with a header. There are
>> some odd characters at the beginning of the file that are preventing it
>> from being read by R.
>>
> That is a BOM mark in UCS-2 encoding. Was this file created on Windows?
>
> If so, try using iconv to convert it to UTF-8, or in R use
>
> read.delim(filename, fileEncoding = "UCS-2LE")

Perfect. I tried it both ways, and both iconv and the fileEncoding parameter did the trick. As far as I know, the file (which was provided by a public agency) was created in Windows.

Thanks,
--Lee

--
Lee Hachadoorian
Assistant Professor in Geography, Dartmouth College
http://freecity.commons.gc.cuny.edu
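
For anyone finding this thread later, here is a minimal R sketch of the fileEncoding approach discussed above. It is only an illustration, not the exact session from this thread: it builds a tiny stand-in file (the names sample_path and sample_text and the data values are made up), checks that the first two bytes are the FF FE byte-order mark seen in the error message, and then reads the file with fileEncoding = "UCS-2LE". It assumes an R build whose iconv supports UTF-16LE and UCS-2LE, which is the usual case on Linux.

```r
# Build a small stand-in for the real file: tab-delimited text stored as
# UCS-2LE/UTF-16LE and preceded by the byte-order mark FF FE.
# (sample_path and sample_text are hypothetical names for this sketch.)
sample_path <- tempfile(fileext = ".txt")
sample_text <- "muni_code\tparcel_id\n101\tA1\n102\tB2\n"

con <- file(sample_path, open = "wb")
writeBin(as.raw(c(0xff, 0xfe)), con)  # the BOM
writeBin(iconv(sample_text, from = "UTF-8", to = "UTF-16LE",
               toRaw = TRUE)[[1]], con)
close(con)

# Inspect the first two bytes to confirm the little-endian BOM is present.
first_two <- readBin(sample_path, what = "raw", n = 2)
has_utf16le_bom <- identical(first_two, as.raw(c(0xff, 0xfe)))

# Read the file, telling R how the bytes on disk are encoded.
dfTemp <- read.delim(sample_path, fileEncoding = "UCS-2LE")
print(names(dfTemp))
```

On a real file, the readBin check is a cheap way to confirm the encoding before attempting a full read of 100MB or more; if the first two bytes were FE FF instead, the big-endian variant of the encoding would be the one to try.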