For an historical paper I'm working on, I have some Spanish plaintext, presently in the form of a Word .doc file,

http://euclid.psych.yorku.ca/SCS/Gallery/images/Private/Langren/Verdadera-spanish-stripped.doc

and also some ciphered text from the same original source. The ultimate goal is to use some frequency analysis of letters and word lengths in the plaintext to help decode the ciphered text.

For now, I'm stuck on how to read the Spanish plaintext into R as a text string, given that it is in a Word .doc file using some form of latin1 encoding. From Word, I can Save As .. plain text (.txt), but I'm worried about losing character encoding information and I don't see anything in the list of Other encodings presented that seems helpful.

A naive attempt to read the .doc file directly gives:

> langren.sp.file <- "http://euclid.psych.yorku.ca/SCS/Gallery/images/Private/Langren/Verdadera-spanish-stripped.doc"
>
> langren.txt <- scan(langren.sp.file, encoding="latin1")
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
  scan() expected 'a real', got '??????'
>

Can someone help?

--
Michael Friendly     Email: friendly AT yorku DOT ca
Professor, Psychology Dept.
York University      Voice: 416 736-5115 x66249 Fax: 416 736-5814
4700 Keele Street    http://www.math.yorku.ca/SCS/friendly.html
Toronto, ONT  M3J 1P3 CANADA
When I open that link in OpenOffice.org Writer and then save in "Text encoded" format with "Unicode" encoding, the diacriticals (is that the correct font-ish term?) seem to remain intact when re-opened. When I read that file in, not with scan() but with readLines(), here is what I get for the second string:

langren.txt <- readLines("/Users/davidwinsemius/Downloads/Verdadera-spanish-stripped-1.txt", encoding="UTF-8")
langren.txt[2]

[2] "MIGUEL FLORENCIO VAN LANGREN Matemático y cosmógrafo de su Majestad presenta las siguientes consideraciones de la Longitud por Mar y Tierra; y dice que su Padre y Abuelo fueron astrónomos y geógrafos, y en particular su padre asistió a las observaciones celestes realizadas por el famoso astrónomo Ticho Brahe, de quien recibió sus primeras observaciones, como consta por las obras del dicho Ticho. Así mismo su padre sirvió a su majestad como cosmógrafo en Flandes. Y el dicho VAN LANGREN, a imitación de sus antepasados, ha ejercitado en esas artes y descubierto cosas que no se sabían sobre la verdadera longitud por mar y tierra, apoyándose más en lo esencial que en lo especulativo. Y habiéndolo propuesto a la infanta Isabel, muy aficionada a dichas artes, ella le recomendó al rey por una carta en 1629 (página 9 de este documento), para que le encargase corregir la geografía. Su majestad lo aprobó por una real cédula, debido a los enormes errores que muestran las distancias calculadas por eminentes astrónomos y geógrafos entre Toledo y Roma, tal como se muestra en esta línea, por la cual se pueden conjeturar los errores entre lugares más distantes."
Mind you, this was on a Mac, so the usual cross-platform caveats apply:

> sessionInfo()
R version 2.9.1 Patched (2009-07-04 r48897)
x86_64-apple-darwin9.7.0

locale:
en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] splines   stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] lattice_0.17-25 MASS_7.2-46     plotrix_2.6-4   plyr_0.1.9      Design_2.1-2    survival_2.35-4
[7] Hmisc_3.5-2

loaded via a namespace (and not attached):
[1] cluster_1.12.0 grid_2.9.1     tools_2.9.1

--
DW

On Aug 5, 2009, at 2:19 PM, Michael Friendly wrote:
> For an historical paper I'm working on, I have some Spanish
> plaintext, presently in the form of a Word .doc file [...]

David Winsemius, MD
Heritage Laboratories
West Hartford, CT
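Once the text is in R as a UTF-8 string, the letter and word-length counts mentioned in the original question could be sketched roughly as follows. This is only an illustration, not anyone's actual analysis from this thread: the sample string is a short fragment of the Langren text with its accents written as \u escapes, and the names txt, letter_freq and word_len are made up for the example. The Unicode class \p{L} with perl = TRUE is used so that accented letters count as letters.

```r
# Hypothetical sample; in practice this would be the vector from readLines(),
# pasted into one string.
txt <- "Matem\u00e1tico y cosm\u00f3grafo de su Majestad presenta las siguientes consideraciones"

# Letter frequencies: lower-case, split into single characters, keep letters only.
chars <- strsplit(tolower(txt), "")[[1]]
letters_only <- chars[grepl("^\\p{L}$", chars, perl = TRUE)]
letter_freq <- sort(table(letters_only), decreasing = TRUE)

# Word lengths: split on runs of non-letters and drop any empty pieces.
words <- strsplit(txt, "[^\\p{L}]+", perl = TRUE)[[1]]
words <- words[nzchar(words)]
word_len <- table(nchar(words))

print(letter_freq)
print(word_len)
```

The two tables can then be compared with the corresponding counts from the ciphered text.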
I used the readDOC function in tm. After storing the document locally on a Windows PC...

library(tm)  # provides Corpus(), DirSource() and readDOC(); readDOC relies on antiword
langren.sp.path <- "C:\\text\\"  # store file by itself in this directory
langren.corpus <- Corpus(DirSource(langren.sp.path),
    readerControl = list(reader = readDOC(AntiwordOptions = "-t"),
                         language = "spa", load = TRUE))
(langren.sp.file <- langren.corpus[[1]])[1:10]  # show the first 10 lines

I think the default encoding for antiword is latin1, but antiword's -m option can handle other mappings.

Sam Thomas

-----Original Message-----
From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of Michael Friendly
Sent: Wednesday, August 05, 2009 2:19 PM
To: R-Help
Subject: [R] reading and frequency analysis of Spanish text

For an historical paper I'm working on, I have some Spanish plaintext, presently in the form of a Word .doc file [...]
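If the extracted text does come out as latin1, as suggested above for antiword's default, one way to get it into R as UTF-8 might look like the sketch below. It is an assumption-laden illustration only: the temporary file stands in for a text file produced by antiword (or by Word's Save As), and the bytes written to it are just the word "más" in latin1. Passing encoding = "latin1" to readLines() only marks the strings; iconv() does the actual conversion.

```r
# Hypothetical stand-in for a latin1 text file: "m\u00e1s" plus a newline, as raw bytes.
tmp <- tempfile(fileext = ".txt")
writeBin(as.raw(c(0x6d, 0xe1, 0x73, 0x0a)), tmp)

# Read the lines and declare them as latin1 (no re-encoding happens here).
x <- readLines(tmp, encoding = "latin1")

# Convert explicitly to UTF-8 so later string functions see one consistent encoding.
x_utf8 <- iconv(x, from = "latin1", to = "UTF-8")

print(Encoding(x_utf8))
print(nchar(x_utf8))
unlink(tmp)
```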