When I open that link in OpenOffice.org Writer and then save in "Text
encoded" format with "Unicode" encoding, the diacriticals (is that the
correct font-ish term?) seem to remain intact when re-opened. When I
read that file in, not with scan() but with readLines(), here is what
I get for the second string:
langren.txt <- readLines("/Users/davidwinsemius/Downloads/Verdadera-spanish-stripped-1.txt", encoding="UTF-8")
langren.txt[2]
[2] "MIGUEL FLORENCIO VAN LANGREN Matemático y cosmógrafo de su
Majestad presenta las siguientes consideraciones de la Longitud por
Mar y Tierra; y dice que su Padre y Abuelo fueron astrónomos y
geógrafos, y en particular su padre asistió a las observaciones
celestes realizadas por el famoso astrónomo Ticho Brahe, de quien
recibió sus primeras observaciones, como consta por las obras del
dicho Ticho. Así mismo su padre sirvió a su majestad como cosmógrafo
en Flandes. Y el dicho VAN LANGREN, a imitación de sus antepasados, ha
ejercitado en esas artes y descubierto cosas que no se sabían sobre la
verdadera longitud por mar y tierra, apoyándose más en lo esencial que
en lo especulativo. Y habiéndolo propuesto a la infanta Isabel, muy
aficionada a dichas artes, ella le recomendó al rey por una carta en
1629 (página 9 de este documento), para que le encargase corregir la
geografía. Su majestad lo aprobó por una real cédula, debido a los
enormes errores que muestran las distancias calculadas por eminentes
astrónomos y geógrafos entre Toledo y Roma, tal como se muestra en
esta línea, por la cual se pueden conjeturar los errores entre lugares
más distantes."
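One thing that may be worth checking: as far as I understand the documentation, the encoding= argument of readLines() only marks the strings as being in a given encoding; it does not convert them. If the saved file turns out to be latin1 rather than UTF-8, an explicit iconv() step is one way to convert. The sketch below is only an illustration under that assumption. The temporary file and the word in it are made up and merely stand in for the real document.

```r
# Hypothetical stand-in for the real file: the bytes of "astrónomo" in latin1,
# written raw so the example does not depend on the session locale.
tmp <- tempfile(fileext = ".txt")
writeBin(as.raw(c(0x61, 0x73, 0x74, 0x72, 0xf3, 0x6e, 0x6f, 0x6d, 0x6f, 0x0a)), tmp)

# encoding= only declares (marks) the encoding; the bytes are left unchanged.
x <- readLines(tmp, encoding = "latin1")
Encoding(x)

# Explicit conversion to UTF-8, which is what a UTF-8 locale expects to display.
x_utf8 <- iconv(x, from = "latin1", to = "UTF-8")
Encoding(x_utf8)

# Inspect the bytes: the single latin1 byte f3 is now the UTF-8 pair c3 b3.
charToRaw(x_utf8)
```

If the file really is UTF-8 already, the iconv() step is unnecessary and readLines(..., encoding="UTF-8") as used here should be enough.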
Mind you, this was on a Mac, so the usual cross-platform caveats apply:
> sessionInfo()
R version 2.9.1 Patched (2009-07-04 r48897)
x86_64-apple-darwin9.7.0
locale:
en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] splines stats graphics grDevices utils datasets methods base
other attached packages:
[1] lattice_0.17-25 MASS_7.2-46 plotrix_2.6-4 plyr_0.1.9 Design_2.1-2 survival_2.35-4
[7] Hmisc_3.5-2
loaded via a namespace (and not attached):
[1] cluster_1.12.0 grid_2.9.1 tools_2.9.1
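On the scan() error in the quoted message: scan() defaults to what = double(), so it tries to parse the file as numbers, which would explain "expected 'a real'". Here is a rough sketch of the difference between scan() and readLines(), plus one possible way to start on the letter and word-length counts mentioned in the original question. The two-line temporary file is invented for illustration, and the variable names are my own; none of this has been run against the real text.

```r
# Self-contained stand-in for the real file: two short lines of plain text.
tmp <- tempfile(fileext = ".txt")
writeLines(c("VERDADERA LONGITUD", "por mar y tierra"), tmp)

# scan() with its default what = double() fails on ordinary text;
# try() captures the error so the script keeps running.
res <- try(scan(tmp, quiet = TRUE), silent = TRUE)
scan_failed <- inherits(res, "try-error")

# Asking for character data works, and splits the input on whitespace.
words <- scan(tmp, what = "character", quiet = TRUE)

# readLines() returns one string per line instead.
lines <- readLines(tmp)

# Word-length distribution from the whitespace-separated tokens.
word_lengths <- table(nchar(words))

# Letter frequencies: lower-case everything, split into single characters,
# then tabulate and sort from most to least frequent.
chars <- unlist(strsplit(tolower(paste(words, collapse = "")), ""))
letter_freq <- sort(table(chars), decreasing = TRUE)
```

For the real text, punctuation and digits would presumably need to be removed before counting, for instance with gsub() and a suitable character class.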
--
DW
On Aug 5, 2009, at 2:19 PM, Michael Friendly wrote:
> For an historical paper I'm working on, I have some Spanish
> plaintext, presently in the form of a Word .doc file,
>
> euclid.psych.yorku.ca/SCS/Gallery/images/Private/Langren/Verdadera-spanish-stripped.doc
>
> and also some ciphered text from the same original source. The
> ultimate goal is to use some
> frequency analysis of letters and word lengths in the plaintext to
> help decode the ciphered text.
>
> For now, I'm stuck on how to read the Spanish plaintext into R as a
> text string, given that it is in a Word .doc file
> using some form of latin1 encoding. From Word, I can Save As ..
> plain text (.txt), but I'm worried about losing
> character encoding information and I don't see anything in the list
> of Other encodings presented that seems
> helpful.
> A naive attempt to read the .doc file directly gives:
>
> > langren.sp.file <- "euclid.psych.yorku.ca/SCS/Gallery/images/Private/Langren/Verdadera-spanish-stripped.doc"
> >
> > langren.txt <- scan(langren.sp.file, encoding="latin1")
> Error in scan(file, what, nmax, sep, dec, quote, skip, nlines,
> na.strings, :
> scan() expected 'a real', got '??????'
> >
>
> Can someone help?
>
> --
> Michael Friendly Email: friendly AT yorku DOT ca
> Professor, Psychology Dept.
> York University Voice: 416 736-5115 x66249 Fax: 416 736-5814
> 4700 Keele Street math.yorku.ca/SCS/friendly.html
> Toronto, ONT M3J 1P3 CANADA
>
David Winsemius, MD
Heritage Laboratories
West Hartford, CT