thr3ads.net - R help - [R] SOLVED: importing huge XML-Files -- new problem: special characters [Sep 2007]


Alexander Heidrich

2007-Sep-04 16:17 UTC

[R] SOLVED: importing huge XML-Files -- new problem: special characters

Hi all,

thanks to everyone who replied to my question! I finally solved the
issue by writing my own handlers and using xmlEventParse, which leads
to the following problem, one so odd that it's probably a bug.

I use several special characters in my XML file, e.g. umlauts or ? or
? - but no matter how I encode my XML (UTF or ISO) or whether I escape
these characters, xmlEventParse always stops parsing after the first
umlaut and pretends to have more than one node even if there is really
just one!

Example:

<locations>abc	ab?cd	abdec</locations>

causes two events for locations and produces output in the form of:

	[,1]	[,2]	[,3]
[1,]	abc
[2,]	ab?cd	abdec


Should it be like that? If I remove the umlauts, then everything is
fine!

If I do the following:

<locations>?abc	ab?cd	abdec</locations>

the output is

	[,1]	[,2]	[,3]
[1,]	?abc	ab?cd	abdec

Any suggestions?

Thanks in advance and many greetings!

Alex

Duncan Temple Lang

2007-Sep-04 17:15 UTC


[R] SOLVED: importing huge XML-Files -- new problem: special characters

Alexander Heidrich wrote:
> Hi all,
> 
> thanks to everyone who replied to my question! I finally solved the
> issue by writing my own handlers and using xmlEventParse, which leads
> to the following problem, one so odd that it's probably a bug.
> 
> I use several special characters in my XML file, e.g. umlauts or ? or
> ? - but no matter how I encode my XML (UTF or ISO) or whether I escape
> these characters, xmlEventParse always stops parsing after the first
> umlaut and pretends to have more than one node even if there is really
> just one!
> 
> Example:
> 
> <locations>abc	ab?cd	abdec</locations>
> 
> causes two events for locations and produces output in the form of:
> 
> 	[,1]	[,2]	[,3]
> [1,]	abc
> [2,]	ab?cd	abdec
> 
Well, your output is particular to your text event handlers, so
what you show us does not tell us what the inputs were.
If you got two events, with "abc     " in the first
and "abocd	abdec" in the second (or with the trailing whitespace
from the first appearing on the second instead), that would not
surprise me.

The underlying XML parser is extracting content from a stream
of bytes. It makes no guarantee that contiguous text
content is delivered in a single event to the handlers.
Instead, it consumes as much of the stream as it wants
and delivers that and then continues from where it left off
in the stream. If it encounters a text node with a large amount
of text, it will deliver that in smaller chunks. 
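
One way a handler can cope with text arriving in pieces is to buffer each chunk and only assemble the full string when the end tag arrives. The following is a minimal sketch of that idea, not the original poster's code: makeLocationCollector() and its internal names are hypothetical, and the umlaut in the simulated input is a stand-in because the original character was lost in the archive. The handlers are driven by hand here to imitate the parser delivering one text node as two events.

```r
# Sketch: buffer text chunks for one element and join them at the end tag.
# makeLocationCollector() is a hypothetical helper, not part of the XML package.
makeLocationCollector <- function(tag = "locations") {
  inside <- FALSE        # TRUE while we are between <tag> and </tag>
  chunks <- character()  # text pieces received so far for the open element
  rows <- list()         # one character vector of fields per finished element

  handlers <- list(
    startElement = function(name, attrs, ...) {
      if (name == tag) {
        inside <<- TRUE
        chunks <<- character()
      }
    },
    text = function(x, ...) {
      # May be called several times for a single text node.
      if (inside) chunks <<- c(chunks, x)
    },
    endElement = function(name, ...) {
      if (name == tag) {
        full <- paste(chunks, collapse = "")
        rows[[length(rows) + 1L]] <<- strsplit(full, "\t", fixed = TRUE)[[1]]
        inside <<- FALSE
      }
    }
  )
  list(handlers = handlers, rows = function() rows)
}

# Simulate the parser splitting one text node into two events.
collector <- makeLocationCollector()
h <- collector$handlers
h$startElement("locations", NULL)
h$text("abc\t")
h$text("ab\u00f6cd\tabdec")
h$endElement("locations")
collector$rows()
```

With the XML package installed, the same handlers could presumably be passed as xmlEventParse(file, handlers = collector$handlers, trim = FALSE). Setting trim = FALSE is an assumption worth checking against your data: with trimming on, whitespace such as a tab at a chunk boundary may be stripped before the pieces are joined.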

This undoubtedly makes the processing of the stream slightly harder
for the handler as it has to remember where it "was", but this is true
of all handlers so not a significant burden.

The branches parameter of the xmlEventParse() function does provide
a way to mix SAX/event parsing with the easier DOM/node style parsing.
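
As a rough illustration of the branches approach: each entry in branches is a function named after an element, and it is called with the complete node once that element has been read, so the text does not have to be reassembled by hand. This is a sketch under assumptions. splitLocations(), locationsBranch() and the sample document are made up for illustration, and the parse is skipped if the XML package is not installed.

```r
# Split one tab-separated <locations> string into its fields.
# (Hypothetical helper for this example.)
splitLocations <- function(txt) strsplit(txt, "\t", fixed = TRUE)[[1]]

rows <- list()  # one character vector of fields per <locations> element

# Branch handler: receives the fully built <locations> node.
locationsBranch <- function(node, ...) {
  rows[[length(rows) + 1L]] <<- splitLocations(XML::xmlValue(node))
}

# Only attempt the parse if the XML package is available.
if (requireNamespace("XML", quietly = TRUE)) {
  doc <- "<root><locations>abc\tab\u00f6cd\tabdec</locations></root>"
  XML::xmlEventParse(doc, handlers = list(), asText = TRUE,
                     branches = list(locations = locationsBranch))
}
```

The trade-off is memory: each matched element is built as a small tree before the function sees it, which is fine for many small records but gives up some of the benefit of event parsing if a single element is very large.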

 D.
-- 
Duncan Temple Lang                duncan at wald.ucdavis.edu
Department of Statistics          work:  (530) 752-4782
4210 Mathematical Sciences Bldg.  fax:   (530) 752-7099
One Shields Ave.
University of California at Davis
Davis, CA 95616, USA



