Hi Sascha
Your code gives the correct results on my machine (OS X),
either reading from the file directly or via readLines() and passing
the text to xmlEventParse().
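
For reference, the two routes I mean can be sketched roughly as below. This is only a sketch under assumptions: the temporary file, its shortened content, and the helper make_handlers() are hypothetical stand-ins for your test file and handlers, and the parsing part runs only if the XML package is installed.

```r
# Write a small ISO-8859-1 test file to a temporary path (hypothetical stand-in
# for the real test file; the content is shortened).
xml_utf8 <- paste0(
  "<?xml version=\"1.0\" encoding=\"iso-8859-1\"?>\n",
  "<s type=\"manual\">daf\u00fcr zur Verf\u00fcgung</s>\n"
)
tmp <- tempfile(fileext = ".xml")
writeBin(iconv(xml_utf8, from = "UTF-8", to = "latin1", toRaw = TRUE)[[1]], tmp)

# Hypothetical helper: SAX handlers that collect the text inside <s> elements
# in a closure instead of in global variables.
make_handlers <- function() {
  collected <- character(0)
  in_s <- FALSE
  list(
    startElement = function(name, attrs) if (name == "s") in_s <<- TRUE,
    endElement   = function(name, ...) if (name == "s") in_s <<- FALSE,
    text         = function(content, ...) {
      if (in_s && nzchar(content)) collected <<- c(collected, content)
    },
    collected    = function() collected
  )
}

sax_names <- c("startElement", "endElement", "text")

if (requireNamespace("XML", quietly = TRUE)) {
  # Route 1: let the parser read the file itself.
  h1 <- make_handlers()
  XML::xmlEventParse(tmp, handlers = h1[sax_names], saxVersion = 1)
  print(h1$collected())

  # Route 2: read the lines in R first, then pass the text to the parser.
  h2 <- make_handlers()
  txt <- paste(readLines(tmp, warn = FALSE), collapse = " ")
  XML::xmlEventParse(txt, asText = TRUE, handlers = h2[sax_names], saxVersion = 1)
  print(h2$collected())
}
```

If the two printed results differ on your machine, that difference is worth including in your report.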
The problem might lie in the version of the XML package you have installed
or in your environment settings, so it is important to report your session
information. Please provide the output of
sessionInfo()
Sys.getenv()
libxmlVersion()
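
If it helps, the same information can be gathered in one go. The following is a minimal sketch, not a required procedure: collect_diagnostics() is a hypothetical helper name, the locale entries are an extra I am assuming might be relevant to an encoding problem, and libxmlVersion() is only called if the XML package is installed.

```r
# Hypothetical helper that bundles the requested diagnostics into one list.
collect_diagnostics <- function() {
  list(
    session = sessionInfo(),                 # R version, platform, loaded packages
    env     = Sys.getenv(),                  # all environment variables
    locale  = Sys.getlocale("LC_CTYPE"),     # character-type locale (assumed relevant)
    l10n    = l10n_info(),                   # whether the session is UTF-8 / Latin-1
    libxml  = if (requireNamespace("XML", quietly = TRUE)) {
      XML::libxmlVersion()                   # version of libxml2 used by the XML package
    } else {
      NA                                     # XML package not installed
    }
  )
}

diag <- collect_diagnostics()
str(diag, max.level = 1)
```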
D
On 7/15/13 4:41 AM, Sascha Wolfer wrote:
> Dear list,
>
> I have got a weird encoding problem with the xmlEventParse() function from the 'XML' package.
>
> I searched the web for an answer for several hours, and a Stack Exchange question did not bring any success either :(
>
> So here's the problem. I created a small XML test file, which looks like this:
>
> <?xml version="1.0" encoding="iso-8859-1"?>
> <!DOCTYPE testFile>
> <s type="manual">auch der Schulleiter steht daf?r zur Verf?gung. Das ist se?haft mit ? und ?...</s>
>
> This file is encoded in ISO-8859-1, which is also declared in its header.
>
> I have 3 handler functions, definitions as follows:
>
> sE2 <- function (name, attrs) {
>   if (name == "s") {
>     get.text <<- TRUE
>   }
> }
>
> eE2 <- function (name, attrs) {
>   if (name == "s") {
>     get.text <<- FALSE
>   }
> }
>
> tS2 <- function (content, ...) {
>   if (get.text && nchar(content) > 0) {
>     collected.text <<- c(collected.text, content)
>   }
> }
>
> I have one wrapper function around xmlEventParse(), definition as follows:
>
> get.all.text <- function (file) {
>   t1 <- Sys.time()
>   read.file <- paste(readLines(file, encoding = ""), collapse = " ")
>   print(read.file)
>   assign("collected.text", c(), envir = .GlobalEnv)
>   assign("get.text", FALSE, envir = .GlobalEnv)
>   xmlEventParse(read.file, asText = TRUE,
>                 list(startElement = sE2,
>                      endElement = eE2,
>                      text = tS2),
>                 error = function (...) { },
>                 saxVersion = 1)
>   t2 <- Sys.time()
>   cat("That took", round(difftime(t2, t1, units = "secs"), 1), "seconds.\n")
>   cat("Result of reading is in variable 'collected.text'.\n")
>   collected.text
> }
>
> The output of calling get.all.text(<test file>) is as follows:
> [1] "<?xml version=\"1.0\" encoding=\"iso-8859-1\"?> <!DOCTYPE testFile> <s type=\"manual\">auch der Schulleiter steht daf?r zur Verf?gung. Das ist se?haft mit ? und ?...</s> "
> That took 0 seconds.
> Result of reading is in variable 'collected.text'.
> [1] "auch der Schulleiter steht daf" "??r zur Verf??gung. Das ist se??haft mit ?? und ??..."
>
> Now the REALLY weird thing (for me) is that R obviously reads in the file correctly (first output) with 'readLines()'.
> Then this output is passed to xmlEventParse(). Afterwards the output is broken, and it sometimes also inserts weird breaks where special characters occur.
>
> Do you have any ideas how to solve this problem?
>
> I cannot use the xmlParse() function because I need the SAX functionality of xmlEventParse(). I also tried reading the file with xmlEventParse() directly (with asText = F). No changes...
>
> Thanks a lot,
> Sascha W.