Is there some way to remove superscripts from objects returned by htmlParse or xmlParse (XML package)?

library(XML)
h <- "<html><p>Cat<sup>a</sup></p><p>Dog</p></html>"
doc <- htmlParse(h)
xpathSApply(doc, "//p", xmlValue)
[1] "Cata" "Dog"

I could probably remove the <sup> tags from the "h" object above, but I'd rather work with the results from htmlParse if possible (and not use readLines to load the raw HTML first).

Thanks,
Chris Stubben

--
View this message in context: http://r.789695.n4.nabble.com/Remove-superscripts-from-HTML-objects-tp4550738p4550738.html
Sent from the R help mailing list archive at Nabble.com.

Hi,

h <- "<html><p>Cat<sup>a</sup></p><p>Dog</p></html>"
sub("<sup.*sup>","",h)

See http://en.wikibooks.org/wiki/R_Programming/Text_Processing for more information.

Regards!

> h <- "<html><p>Cat<sup>a</sup></p><p>Dog</p></html>"
> sub("<sup.*sup>","",h)

Probably safer to use gsub("<sup.*?sup>","",h), so that each superscript is removed on its own. With the greedy .* a string containing several superscripts loses everything between the first <sup and the last sup>. For example:

h2 <- "<html><p>Cat<sup>a</sup></p><p>Dog</p><p>Mouse<sup>a</sup></p><p>Raccoon</p></html>"
sub("<sup.*sup>","",h2)   # drops everything between the first <sup and the last sup>
gsub("<sup.*?sup>","",h2) # drops each <sup>xxx</sup> separately

Sorry if I was not clear. I wanted to remove the superscripts using XPath queries if possible. For example, this will get the p nodes with superscripts, but how do I remove the superscripts if there are many matching nodes and different superscripts?

xpathSApply(doc, "//p[sup]", xmlValue)
[1] "Cata"

Chris
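
One possible way to do this on the parsed document, offered as a sketch rather than a definitive answer: the XML package provides removeNodes(), which can delete every node matched by an XPath query, and xmlValue can also be applied to only the direct text children of each p node. The string h3 and the result names cleaned and text_only are made up for this illustration, and asText = TRUE is passed only to make explicit that the input is a string rather than a file name.

```r
library(XML)

# Hypothetical test input with two different superscripts
h3 <- "<html><p>Cat<sup>a</sup></p><p>Dog</p><p>Mouse<sup>b</sup></p></html>"

# Option 1: delete all <sup> nodes from the tree, then query as before.
# Note that this modifies doc in place.
doc <- htmlParse(h3, asText = TRUE)
invisible(removeNodes(getNodeSet(doc, "//sup")))
cleaned <- xpathSApply(doc, "//p", xmlValue)

# Option 2: leave the tree untouched and paste together only the text
# nodes that are direct children of each <p>
doc2 <- htmlParse(h3, asText = TRUE)
text_only <- xpathSApply(doc2, "//p", function(p) {
  paste(xpathSApply(p, "./text()", xmlValue), collapse = "")
})

print(cleaned)
print(text_only)
```

Option 1 is simpler when the superscripts are never needed again. Option 2 keeps the document intact, but it also drops text inside any other child element of p (for example <b> or <a>), so it only fits markup as flat as the example.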