Andzsin
2010-Feb-03 04:41 UTC
[R] "read.table" and "scan" skips newlines which "count.fields" finds in Thai textfile
Hi there,

I have some problems reading in a Thai text file: some of the newlines are skipped (see the contents of my file below).

R> count.fields("my.txt", sep='\n', quote="")
[1] 1 1 1

Three lines with one item each, right?

R> scan("my.txt", what="", sep="\t", quote="")
Read 2 items
[1] "?\x83???\x88 ?\x84?????\x9a\n?\x83???\x88 ?\x84????"
[2] "?\x83???\x88 ?\x84?????\x9a\n"

Two items. Note that arguments to "count.fields" and "scan" are the same. There is a newline ("\n") within the first item.

R> read.table("my.txt", encoding="UTF-8", header=F, sep="\t", quote="")
[1] V1
<0 rows> (or 0-length row.names)

Zero items. My file has just been reduced to nothing.

Needless to say, my editors (Vim, Em, Hidemaru) show 3 lines, and the hex dump shows the newline characters clearly (see below).

I have seen related questions but not a solution, e.g.
http://n4.nabble.com/problem-with-scan-recognizing-newline-n-td896114.html#a896115

And just for fun:

R> read.table("my.txt", encoding="justkidding")
[1] V1
<0 rows> (or 0-length row.names)

It's funny NOT to see any complaint about the "justkidding" encoding... (it is so not R-ish :-)

We are using R 2.8 and are stuck with it for a while. (I briefly installed R 2.10, but it did not seem to overcome the problem.)

Any kind of help is greatly appreciated.

Best,
andzsin

PS: replacing the Thai text with Japanese text (same UTF-8) gave slightly different results (only some of the newlines were ignored).

******** Details: *************

name    : my.txt
lg      : Thai
enc     : UTF-8
EOL     : CR+LF (0d0a)
content :
ใช่ ครับ
ใช่ ค่ะ
ใช่ ครับ
[EOF]

HEX : <copy-paste to some prg that goes with fixed-width chars>

00000000 e0b9 83e0 b88a e0b9 8820 e0b8 84e0 b8a3 `9.`8.`9. `8.`8#
00000010 e0b8 b1e0 b89a 0d0a e0b9 83e0 b88a e0b9 `81`8...`9.`8.`9
                        ^^^^
00000020 8820 e0b8 84e0 b988 e0b8 b00d 0ae0 b983 . `8.`9.`80..`9.
                                    ^^^^^
00000030 e0b8 8ae0 b988 20e0 b884 e0b8 a3e0 b8b1 `8.`9. `8.`8#`81
00000040 e0b8 9a0d 0a                            `8...
                ^^^^^

R> sessionInfo()
R version 2.8.0 (2008-10-20)
i386-pc-mingw32

locale:
LC_COLLATE=Japanese_Japan.932;LC_CTYPE=Japanese_Japan.932;LC_MONETARY=Japanese_Japan.932;LC_NUMERIC=C;LC_TIME=Japanese_Japan.932

--
View this message in context: http://n4.nabble.com/read-table-and-scan-skips-newlines-which-count-fields-finds-in-Thai-textfile-tp1460736p1460736.html
Sent from the R help mailing list archive at Nabble.com.
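
For anyone who wants to reproduce this, the file can be rebuilt byte for byte from the hex dump above. The following is only a minimal sketch, not a fix. The variable names and the temporary-directory location are my own choices, and `strtoi()` may not be available in an R as old as 2.8, so treat it as something to run on a more recent installation. It writes the 69 bytes with `writeBin()`, counts the LF bytes directly (independent of any encoding handling), and reads the file back with `readLines()` for comparison.

```r
# Hex digits copied from the dump above, one string per 16-byte row.
hex <- paste(
  "e0b983e0b88ae0b98820e0b884e0b8a3",
  "e0b8b1e0b89a0d0ae0b983e0b88ae0b9",
  "8820e0b884e0b988e0b8b00d0ae0b983",
  "e0b88ae0b98820e0b884e0b8a3e0b8b1",
  "e0b89a0d0a",
  sep = ""
)

# Split into two-character byte strings and convert them to raw bytes.
starts <- seq(1, nchar(hex), by = 2)
bytes  <- as.raw(strtoi(substring(hex, starts, starts + 1), base = 16L))

# Hypothetical location: a scratch copy, so the real my.txt is untouched.
path <- file.path(tempdir(), "my.txt")
writeBin(bytes, path)

# Number of LF (0x0a) bytes actually present in the file.
n_lf <- sum(bytes == as.raw(0x0a))

# readLines() accepts LF, CRLF or CR as line terminators; encoding= only
# marks the resulting strings as UTF-8, it does not re-encode them.
lines <- readLines(path, encoding = "UTF-8")

cat("bytes:", length(bytes), " LF bytes:", n_lf,
    " lines read:", length(lines), "\n")
```

The `count.fields()`, `scan()` and `read.table()` calls from the post can then be pointed at `path` instead of "my.txt", to see whether the item counts still disagree with the number of LF bytes on a given R version and locale.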