I need help creating a custom XML reader for use with the tm package. The objective is to create a corpus for analysis. The files I'm working with come from Solr and are in a funky XML format; nevertheless, I'm able to parse them using the solrDocs.R function provided by Duncan Temple Lang.

The problem is that once I have parsed the documents, I need to create a custom reader that is compatible with the tm package.

If someone has built a custom reader for the tm package, or has some ideas of how to go about this, I would greatly appreciate the help.

Thanks

--
View this message in context: http://r.789695.n4.nabble.com/tm-package-custom-reader-tp4292766p4292766.html
Sent from the R help mailing list archive at Nabble.com.

On Friday, 13 January 2012 at 09:00 -0800, pl.rudy at gmail.com wrote:
> I need help creating a custom XML reader for use with the tm package. The
> objective is to create a corpus for analysis. The files I'm working with
> come from Solr and are in a funky XML format; nevertheless, I'm able to
> parse them using the solrDocs.R function provided by Duncan Temple
> Lang.
>
> The problem is that once I have parsed the documents, I need to create a
> custom reader that is compatible with the tm package.
>
> If someone has built a custom reader for the tm package, or has some ideas
> of how to go about this, I would greatly appreciate the help.

I wrote a custom XML source for tm just a few days ago, so I guess I can help.

First, tm has a document explaining how to write an XML reader [1], and it's relatively easy. However, I think you shouldn't base your tm reader on the functions in solrDocs.R, since they don't produce the structure that tm expects. But you can probably adapt the code from there.

To sum up how tm extensions work: you need one function that parses the XML file and returns one XML string for each document in the corpus; this is the source. You also need a second function, the reader, that parses these per-document XML strings and fills in the document's body and meta-data from the XML tags.

I think your code can be simpler than solrDocs.R, since you probably know beforehand which tags are useful to you, which aren't, and what their types are.

Feel free to ask for help on any specific issues you run into. But please provide a short XML example (and, if possible, code).

Also, when you're done, please consider making this available, either as part of tm itself or in a new package, if it can be useful to others.

Regards

1: http://cran.r-project.org/web/packages/tm/vignettes/extensions.pdf
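
As an illustration of the source/reader split described above, here is a minimal sketch in R that uses only the XML package, not tm itself. The sample XML, the field names (`id`, `title`, `text`) and the helper names `split_solr_docs()` and `read_solr_doc()` are assumptions made up for this example; a real Solr export will likely have other fields and nested `<arr>` elements that this does not handle.

```r
library(XML)  # provides xmlParse(), getNodeSet(), xmlValue(), xmlGetAttr()

# A tiny made-up Solr-style response; real files will have other fields.
solr_xml <- '<response>
  <result name="response" numFound="2" start="0">
    <doc>
      <str name="id">doc-1</str>
      <str name="title">First title</str>
      <str name="text">Body of the first document.</str>
    </doc>
    <doc>
      <str name="id">doc-2</str>
      <str name="title">Second title</str>
      <str name="text">Body of the second document.</str>
    </doc>
  </result>
</response>'

# Step 1 (the "source" role, hypothetical helper): one node per document.
split_solr_docs <- function(xml_text) {
  tree <- xmlParse(xml_text, asText = TRUE)
  getNodeSet(tree, "//result/doc")
}

# Step 2 (the "reader" role, hypothetical helper): turn one <doc> node
# into a body string plus a named list of meta-data.
read_solr_doc <- function(doc_node, body_field = "text") {
  fields <- getNodeSet(doc_node, "./str")
  values <- vapply(fields, xmlValue, character(1))
  names(values) <- vapply(fields, function(n) xmlGetAttr(n, "name"),
                          character(1))
  list(body = unname(values[body_field]),
       meta = as.list(values[names(values) != body_field]))
}

docs <- lapply(split_solr_docs(solr_xml), read_solr_doc)
```

Each element of `docs` then holds a `body` string and a `meta` list, which is roughly the information a tm reader would need to fill in a document. If a full custom source turns out to be unnecessary, the extracted bodies could probably be handed to tm through its `VectorSource()` instead.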

On Sat, Jan 14, 2012 at 12:41 PM, Milan Bouchet-Valat <nalimilan@club.fr> wrote:
> On Saturday, 14 January 2012 at 12:24 -0600, Andy Adamiec wrote:
> > Hi Milan,
> >
> > The xml solr files are not in a typical format, here is an example
> > http://www.omegahat.org/RSXML/solr.xml
> > I'm not sure how to parse the documents without using the solrDocs.R
> > function, and how to make the function compatible with the tm package.
>
> Indeed, this doesn't seem to be easy to parse using the generic XML
> source from tm. So it will be easier for you to create your own custom
> source from scratch. Have a look at the source.R and reader.R files in
> the tm source: you need to replicate the behavior of one of the sources.
>
> The code should include the following functions:
>
> readSorl <- FunctionGenerator(function(...) {
>     function(elem, language, id) {
>         # Use elem$content, which contains an item set by SorlSource() below,
>         # and create a PlainTextDocument() from it,
>         # putting the data where appropriate (text, meta-data)
>     }
> })
>
> SorlSource <- function(x) {
>     # Parse the XML file using functions from solrDocs.R, and
>     # create "content", which is a list with one item for each document,
>     # to pass to readSorl() one by one
>
>     s <- tm:::.Source(readSorl, "UTF-8", length(content), FALSE,
>                       seq(1, length(content)), 0, FALSE)
>     s$Content <- content
>     s$URI <- match.call()$x
>     class(s) = c("SorlSource", "Source")
>     s
> }
>
> getElem <- function(x) UseMethod("getElem", x)
> getElem.SorlSource <- function(x) {
>     list(content = x$Content[[x$Position]], uri = match.call()$x)
> }
>
> eoi <- function(x) UseMethod("eoi", x)
> eoi.SorlSource <- function(x) length(x$Content) <= x$Position
>
> Hope this helps
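
To make the roles of `getElem()` and `eoi()` in the quoted skeleton more concrete, here is a toy loop in base R. It is not tm's actual internals: `ToySource`, `step_next()` and `read_toy()` are hypothetical names invented for this sketch, and the "documents" are plain lists rather than tm objects. It only shows how a position counter, an end-of-input test and a per-element reader could fit together.

```r
# Hypothetical stand-in for a source: holds the items and a cursor.
ToySource <- function(content) {
  structure(list(Content = content, Position = 0),
            class = c("ToySource", "Source"))
}

# Same generics as in the quoted skeleton, defined locally for the sketch.
getElem <- function(x) UseMethod("getElem", x)
getElem.ToySource <- function(x) list(content = x$Content[[x$Position]])

eoi <- function(x) UseMethod("eoi", x)
eoi.ToySource <- function(x) length(x$Content) <= x$Position

# Hypothetical helper: advance the cursor by one item.
step_next <- function(x) {
  x$Position <- x$Position + 1
  x
}

# Hypothetical reader: turns one element into a "document" (a plain list).
read_toy <- function(elem, id) list(id = id, text = toupper(elem$content))

src <- ToySource(list("first doc", "second doc", "third doc"))
corpus <- list()
while (!eoi(src)) {
  src <- step_next(src)
  corpus[[src$Position]] <- read_toy(getElem(src), id = src$Position)
}
```

Starting the cursor at 0 and advancing before each read matches the `0` passed as the position in the quoted `SorlSource()` and the `<=` comparison in `eoi.SorlSource()`: the loop stops once the position equals the number of items.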