I have a text file that is UTF-16LE encoded, with CRLF line endings and '@' as the field separator, that I want to read in R on a Linux system. That would be fine as

    read.table("foo.txt", fileEncoding = "UTF-16LE", sep = "@", ...)

*except* that the data may contain bare LF characters, which R treats as end-of-line and then barfs that there are too few elements on that line.

Any suggestions for how to process this one efficiently in R? There is probably a solution that uses read.table(..., nrows = 1, ...) to get the header, splits it on '@', builds a list with that many character(0) elements, and then uses scan(..., multi.line = TRUE, ...), but that all sounds very complicated.

Allan.
I ended up pre-processing the files outside of R using a script along the lines of

    #!/bin/bash
    for f in *_table_extract_*.txt; do
        echo -n "Processing $f..."
        o="${f}.xz"
        # Pipeline stages: convert to UTF-8, drop the 3-byte BOM,
        # split records on CR and delete every LF, drop blank lines
        # and "(N row(s) affected)" lines, then compress.
        iconv -f "UTF-16LE" -t "UTF-8" "$f" | \
            tail -c +4 | \
            perl -l012 -015 -pe 's/\n//g' | \
            perl -ne 'print if (!m{\A \( \d+ \s row\(s\) \s affected \) \s* \z}ixms && !m{\A \s* \z}xms)' | \
            xz -7 > "$o"
        echo "done."
    done

Ugly, but it worked for me. You can change the first perl regular expression to do different things with line-terminating \n versus in-field \n characters, but I just dropped them all. The tail command drops the byte-order mark (which we do not need for UTF-8), and the second perl command drops blank lines and the "(N row(s) affected)" lines that a stupid SQL tool writes into its output.

Thanks to Prof. Brian Ripley who, essentially, pointed out that with embedded linefeed characters my file was a binary file and not really a text file. Her Majesty's government respectfully begs to disagree [1], but that's the R definition so we'll use it on this list.

Allan

[1] Original data sets described at http://www.hm-treasury.gov.uk/psr_coins_data.htm and downloaded from http://data.gov.uk/dataset/coins (hint: you'll need p7zip to unpack them on a Linux box).

On 04/06/10 14:49, Allan Engelhardt wrote:
> I have a text file that is UTF-16LE encoded with CRLF line endings and
> '@' as field separators that I want to read in R on a Linux system.
> Which would be fine as
>
> read.table("foo.txt", fileEncoding = "UTF-16LE", sep = "@", ...)
>
> *except* that the data may contain the LF character which R treats as
> end-of-line and then barfs that there are too few elements on that line.
>
> Any suggestions for how to process this one efficiently in R? [...]