I am trying to read a number of XML files using xmlTreeParse(). Unfortunately, some of them are malformed in a way that makes R crash. The problem is that closing tags are sometimes repeated like this:

<tag>value1</tag><tag>value2</tag>some garbage</tag></tag><tag>value3</tag>

I want to preprocess the contents of the XML file using gsub() before feeding them to xmlTreeParse() to clean them up, but I can't figure out how to do it. What I need is something that transforms the example above into:

<tag>value1</tag><tag>value2</tag><tag>value3</tag>

Some kind of "</tag>.*</tag>" that only matches if there is no "<tag>" in ".*".

Thanks in advance for your ideas,

Uli
Ulrich Keller <ulrich.keller at emacs.lu> writes:

> Some kind of "</tag>.*</tag>" that only matches if there is no "<tag>" in ".*".

Hmm, there are things you just cannot do with REs, and I suspect that this is one of them. Something involving explicit splitting of the strings might work, though. How's this for size?

    # x holds the example string from the question
    x <- "<tag>value1</tag><tag>value2</tag>some garbage</tag></tag><tag>value3</tag>"
    # in each piece, drop everything after the first closing tag, then rejoin
    trim <- function(x) paste(sub("</tag>.*", "</tag>", x), collapse = "<tag>")
    sapply(strsplit(x, "<tag>"), trim)
    # [1] "<tag>value1</tag><tag>value2</tag><tag>value3</tag>"

-- 
Peter Dalgaard
Øster Farimagsgade 5, Entr.B
Dept. of Biostatistics, PO Box 2099, 1014 Cph. K
University of Copenhagen, Denmark
Ph: (+45) 35327918, FAX: (+45) 35327907
(p.dalgaard at biostat.ku.dk)
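The split-and-trim idea above can be wrapped up for other tag names and for a whole vector of strings. This is only a sketch: the helper name strip_repeated_closers and its tag argument are made up for illustration, and it assumes the tag name contains no regular-expression metacharacters and that tags of the same name are never nested.

```r
# Hypothetical helper (not from the thread): split on the opening tag,
# keep each piece only up to its first closing tag, then rejoin.
strip_repeated_closers <- function(x, tag = "tag") {
  open_tag  <- paste0("<", tag, ">")
  close_tag <- paste0("</", tag, ">")
  vapply(strsplit(x, open_tag, fixed = TRUE), function(pieces) {
    # Drop everything that follows the first closing tag in each piece.
    trimmed <- sub(paste0(close_tag, ".*"), close_tag, pieces)
    paste(trimmed, collapse = open_tag)
  }, character(1), USE.NAMES = FALSE)
}

x <- "<tag>value1</tag><tag>value2</tag>some garbage</tag></tag><tag>value3</tag>"
strip_repeated_closers(x)
# [1] "<tag>value1</tag><tag>value2</tag><tag>value3</tag>"
```

Anything between a closing tag and the next opening tag is discarded, so this is only appropriate when such text is known to be garbage.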
Gabor Grothendieck
2007-Feb-24 16:07 UTC
[R] gsub: replacing a.*a if no occurence of b in .*
I assume <tag> is known. This removes any occurrence of </tag>.*</tag> where .* does not contain <tag> or </tag>.

The regular expression, re, matches </tag>, then does a non-greedy match (?U) for anything followed by </tag>, but uses a zero-width lookahead subexpression (?=...) for the second </tag> so that it can be matched again.

gsubfn in package gsubfn is like the usual gsub except that instead of replacing the match with a string it passes the match to function f and then replaces the match with the output of f. See the gsubfn home page, http://code.google.com/p/gsubfn/, and the vignette.

    library(gsubfn)
    text <- paste("<tag>value1</tag><tag>value2</tag>some",
                  "garbage</tag></tag><tag>value3</tag>")
    re <- "</tag>((?U).*(?=</tag>))"
    # keep the match if it contains an opening tag, otherwise delete it
    f <- function(x) if (regexpr("<tag>", x) > 0) x else ""
    gsubfn(re, f, text, backref = 0, perl = TRUE)
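For comparison, the "no <tag> in between" condition can also be written into the pattern itself with a negative lookahead, so that plain gsub() with perl = TRUE suffices. This is a sketch and not code posted in the thread; the variable names are illustrative, and it assumes the whole document is held in a single string.

```r
text <- "<tag>value1</tag><tag>value2</tag>some garbage</tag></tag><tag>value3</tag>"

# </tag>, then as few characters as possible, none of which may start "<tag>",
# up to (but not including) the next </tag>, which the lookahead leaves in place.
re <- "</tag>((?!<tag>).)*?(?=</tag>)"

cleaned <- gsub(re, "", text, perl = TRUE)
cleaned
# [1] "<tag>value1</tag><tag>value2</tag><tag>value3</tag>"
```

Note that "." does not match a newline in PCRE by default, so for input spanning several lines the pattern would need the (?s) flag or the text would have to be handled line by line.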
Duncan Temple Lang
2007-Feb-24 18:01 UTC
[R] gsub: replacing a.*a if no occurence of b in .*
Ulrich Keller wrote:

> I am trying to read a number of XML files using xmlTreeParse(). Unfortunately,
> some of them are malformed in a way that makes R crash.

Instead of using xmlTreeParse(), which really expects well-formed XML, and assuming you cannot have the XML generation mechanism fixed, you might try to use htmlTreeParse(). While the name suggests it is for HTML, it is really a "relaxed" XML parser that is capable of handling malformed XML. Malformed markup typically occurs in HTML, hence the name. Of course, since the XML is malformed, the results will be hard to predict, as it is hard to make sense of "non-sense".

If xmlTreeParse() is actually causing R to exit (i.e. what some people refer to as crashing), then, as Jeff (Horner) said, we would like to be able to stop this. We will need the actual text/file passed to xmlTreeParse(), the versions of the operating system, R and the XML package, and any locale information. However, if by crashing you mean that it generates an error, then that is expected on malformed XML inputs.
HTH,
D.

-- 
Duncan Temple Lang                duncan at wald.ucdavis.edu
Department of Statistics          work: (530) 752-4782
4210 Mathematical Sciences Bldg.  fax:  (530) 752-7099
One Shields Ave.
University of California at Davis
Davis, CA 95616, USA
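A sketch of the two points above, assuming the XML package is installed (the block does nothing otherwise). The variable names and the XPath query "//tag" are illustrative only, and, as noted, what the relaxed parser makes of the garbage is not guaranteed, so no output is shown.

```r
text <- "<tag>value1</tag><tag>value2</tag>some garbage</tag></tag><tag>value3</tag>"

if (requireNamespace("XML", quietly = TRUE)) {
  # A strict parse of malformed input is expected to raise an ordinary R error,
  # which tryCatch() can intercept; a genuine crash of R could not be caught.
  strict <- tryCatch(
    XML::xmlTreeParse(text, asText = TRUE),
    error = function(e) NULL
  )

  # The relaxed parser builds a tree anyway.
  doc <- XML::htmlTreeParse(text, asText = TRUE, useInternalNodes = TRUE)

  # Pull out whatever <tag> elements the parser recovered.
  values <- XML::xpathSApply(doc, "//tag", XML::xmlValue)
  print(values)
}
```

If strict comes back NULL here, the strict parser signalled an error rather than taking R down, which is the expected behaviour for malformed input.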