Alexander Heidrich
2007-Sep-04 16:17 UTC
[R] SOLVED: importing huge XML-Files -- new problem: special characters
Hi all,

thanks to the people who replied to my question! I finally solved the issue by writing my own handlers and using xmlEventParse, which leads to the following problem. It is so odd that it is probably a bug.

I use several special characters in my XML file, e.g. umlauts or ? or ?. No matter how I encode my XML (UTF or ISO), or whether I escape these characters, xmlEventParse always stops parsing after the first umlaut and pretends to have more than one node even if there is really just one!

Example:

    <locations>abc ab?cd abdec</locations>

causes two events for locations and produces output in the form of:

         [,1] [,2] [,3]
    [1,] abc
    [2,] ab?cd abdec

Should it be like that? If I remove the umlauts, then everything is fine!

If I do the following:

    <locations>?abc ab?cd abdec</locations>

the output is

         [,1] [,2] [,3]
    [1,] ?abc ab?cd abdec

Any suggestions?

Thanks in advance and many greetings!

Alex
Duncan Temple Lang
2007-Sep-04 17:15 UTC
[R] SOLVED: importing huge XML-Files -- new problem: special characters
Alexander Heidrich wrote:
> Hi all,
>
> thanks to the people who replied to my question! I finally solved the
> issue by writing my own handlers and using xmlEventParse, which leads
> to the following problem. It is so odd that it is probably a bug.
>
> I use several special characters in my XML file, e.g. umlauts or ? or
> ?. No matter how I encode my XML (UTF or ISO), or whether I escape
> these characters, xmlEventParse always stops parsing after the first
> umlaut and pretends to have more than one node even if there is really
> just one!
>
> Example:
>
> <locations>abc ab?cd abdec</locations>
>
> causes two events for locations and produces output in the form of:
>
>      [,1] [,2] [,3]
> [1,] abc
> [2,] ab?cd abdec
>

Well, your output is particular to your text event handlers, so what you show us does not tell us what the inputs were. If you have two events and you got "abc " and "abocd abdec" (or the trailing spaces from the first appeared on the second and not the first), that would not surprise me.

The underlying XML parser is extracting content from a stream of bytes. It makes no guarantee that contiguous text content is delivered in a single event to the handlers. Instead, it consumes as much of the stream as it wants, delivers that, and then continues from where it left off in the stream. If it encounters a text node with a large amount of text, it will deliver that in smaller chunks.

This undoubtedly makes the processing of the stream slightly harder for the handler, as it has to remember where it "was", but this is true of all handlers and so is not a significant burden.

The branches parameter of the xmlEventParse() function does provide a way to mix SAX/event parsing with the easier DOM/node style parsing.

D.

>
> Should it be like that? If I remove the umlauts, then everything is
> fine!
>
> If I do the following:
>
> <locations>?abc ab?cd abdec</locations>
>
> the output is
>
>      [,1] [,2] [,3]
> [1,] ?abc ab?cd abdec
>
> Any suggestions?
>
> Thanks in advance and many greetings!
>
> Alex

-- 
Duncan Temple Lang                    duncan at wald.ucdavis.edu
Department of Statistics              work: (530) 752-4782
4210 Mathematical Sciences Bldg.      fax:  (530) 752-7099
One Shields Ave.
University of California at Davis
Davis, CA 95616, USA
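
The point made in the reply, that an event parser may hand one text node to the handler in several pieces, can be sketched in code. The thread is about R's xmlEventParse(); the sketch below swaps in Python's standard xml.sax module purely as a stand-in. The names LocationsHandler and parse_locations, the sample document, and the choice of "ö" for the character that was lost in the archive are all assumptions made for illustration. The handler buffers every piece it receives between the start tag and the end tag and joins them only at the end tag, so the result does not depend on how the parser happens to split the text.

```python
import xml.sax


class LocationsHandler(xml.sax.ContentHandler):
    """Collect the full text of every <locations> element.

    characters() may be called several times for a single text node,
    so the pieces are buffered and joined only when the end tag arrives.
    """

    def __init__(self):
        super().__init__()
        self.chunks = None      # pieces seen so far, or None when outside <locations>
        self.chunk_counts = []  # number of pieces each element arrived in
        self.values = []        # one joined string per <locations> element

    def startElement(self, name, attrs):
        if name == "locations":
            self.chunks = []

    def characters(self, content):
        # Do not assume this is the whole text node; just remember the piece.
        if self.chunks is not None:
            self.chunks.append(content)

    def endElement(self, name):
        if name == "locations":
            self.values.append("".join(self.chunks))
            self.chunk_counts.append(len(self.chunks))
            self.chunks = None


def parse_locations(xml_bytes):
    """Parse a document given as bytes and return the populated handler."""
    handler = LocationsHandler()
    xml.sax.parseString(xml_bytes, handler)
    return handler


# Hypothetical sample input: "\u00f6" (o-umlaut) stands in for the lost character.
doc = "<root><locations>abc ab\u00f6cd abdec</locations></root>".encode("utf-8")
result = parse_locations(doc)
print(result.values)        # the reassembled text, one entry per element
print(result.chunk_counts)  # how many pieces the parser actually delivered
```

The number of pieces reported in chunk_counts is up to the parser and may differ between inputs and versions, which is why the handler never relies on it. The same pattern (append in the text handler, assemble in the end-element handler) should carry over to handlers written for xmlEventParse().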