thr3ads.net - R help - [R] Reading newlines with read.table? [Jun 2010]


Allan Engelhardt

2010-Jun-04 13:49 UTC

[R] Reading newlines with read.table?

I have a text file that is UTF-16LE encoded, with CRLF line endings and 
'@' as the field separator, that I want to read in R on a Linux system.  
That would be fine with

read.table("foo.txt", fileEncoding = "UTF-16LE", sep = "@", ...)

*except* that the data may contain the LF character, which R treats as 
end-of-line; it then barfs that there are too few elements on that line.

Any suggestions for how to process this efficiently in R?  There is 
probably a solution using read.table(..., nrows = 1, ...) to get the 
header, splitting it on '@', building a list with that many character(0) 
elements, and then using scan(..., multi.line = TRUE, ...), but that 
all sounds very complicated.

Allan.

Allan Engelhardt

2010-Jun-04 16:07 UTC


[R] Reading newlines with read.table?

I ended up pre-processing the files outside of R using a script along 
the lines of

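#!/bin/bash
# For each extract: convert UTF-16LE to UTF-8, drop the BOM, delete every LF
# (records are split on CR), filter out noise lines, and compress with xz.
for f in *_table_extract_*.txt; do
     echo -n "Processing $f..."
     o="${f}.xz"
     iconv -f "UTF-16LE" -t "UTF-8" "$f" | \
         tail -c +4 | \
         perl -l012 -015 -pe 's/\n//g' | \
         perl -ne 'print if (!m{\A \( \d+ \s row\(s\) \s affected \) \s* \z}ixms && !m{\A \s* \z}xms)' | \
         xz -7 > "$o"
     echo "done."
done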

Ugly, but it worked for me.  You can change the first perl regular 
expression to treat line-terminating \n differently from in-field \n 
characters, but I just dropped them all.  The tail command drops the 
byte-order mark (which we do not need for UTF-8), and the second perl 
command drops blank lines and the "(N row(s) affected)" lines emitted 
by a stupid SQL tool.

Thanks to Prof. Brian Ripley who, essentially, pointed out that with 
embedded linefeed characters my file was a binary file and not really a 
text file.  Her Majesty's government respectfully begs to disagree [1] 
but that's the R definition so we'll use it on this list.

Allan

[1] Original data sets described at 
http://www.hm-treasury.gov.uk/psr_coins_data.htm and downloaded from 
http://data.gov.uk/dataset/coins (hint: you'll need p7zip to unpack them 
on a Linux box).

On 04/06/10 14:49, Allan Engelhardt wrote:
> I have a text file that is UTF-16LE encoded with CRLF line endings and 
> '@' as field separators that I want to read in R on a Linux system.
> Which would be fine as
>
> read.table("foo.txt", fileEncoding = "UTF-16LE", sep = "@", ...)
>
> *except* that the data may contain the LF character which R treats as 
> end-of-line and then barfs that there are too few elements on that line.
>
> Any suggestions for how to process this one efficiently in R?  [...]
