thr3ads.net - R help - [R] tm package, custom reader [Jan 2012]

pl.rudy at gmail.com

2012-Jan-13 17:00 UTC

[R] tm package, custom reader

I need help creating a custom XML reader for use with the tm package. The
objective is to create a corpus for analysis. The files I'm working with
come from Solr and are in a funky XML format; nevertheless, I'm able to
parse them using the functions in solrDocs.R provided by Duncan Temple
Lang.

The problem is that once I have parsed a document, I need to create a
custom reader that is compatible with the tm package.

If someone has built a custom reader for the tm package, or has some ideas
about how to go about this, I would greatly appreciate the help.

Thanks 


Milan Bouchet-Valat

2012-Jan-14 14:20 UTC

On Friday 13 January 2012 at 09:00 -0800, pl.rudy at gmail.com wrote:
> I need help creating a custom XML reader for use with the tm package. The
> objective is to create a corpus for analysis. The files I'm working with
> come from Solr and are in a funky XML format; nevertheless, I'm able to
> parse them using the functions in solrDocs.R provided by Duncan Temple
> Lang.
>
> The problem is that once I have parsed a document, I need to create a
> custom reader that is compatible with the tm package.
>
> If someone has built a custom reader for the tm package, or has some ideas
> about how to go about this, I would greatly appreciate the help.

I wrote a custom XML source for tm just a few days ago, so I guess
I can help. First, tm has a document explaining how to write an XML
reader [1], and it's relatively easy.

That said, I think you shouldn't base your tm reader directly on the
functions in solrDocs.R, since they don't follow the structure that tm
expects. But you can probably adapt the code from there.

To sum up how tm extensions work, you need two functions. The first
parses the XML file and returns one XML string for each document in the
corpus: this is the source. The second parses these per-document XML
strings and fills the document's body and metadata from the XML tags:
this is the reader. I think your code can be simpler than solrDocs.R,
since you probably know beforehand which tags are useful to you, which
aren't, and what their types are.
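
The source/reader split described above can be sketched roughly as
follows. This is only an illustration under stated assumptions: it
assumes the XML package is installed, the sample input is a made-up
Solr-style fragment (real Solr files may be laid out differently), and
the names split_solr_docs() and read_solr_doc() are hypothetical helpers,
not part of tm or solrDocs.R.

```r
library(XML)

# Hypothetical Solr-style input, invented for illustration only.
solr_xml <- '<response>
  <result name="response" numFound="2" start="0">
    <doc><str name="id">doc1</str><str name="text">First body.</str></doc>
    <doc><str name="id">doc2</str><str name="text">Second body.</str></doc>
  </result>
</response>'

# "Source" step: parse the whole file once and return one XML string
# per <doc> element.
split_solr_docs <- function(xml_text) {
  tree <- xmlParse(xml_text, asText = TRUE)
  vapply(getNodeSet(tree, "//doc"), saveXML, character(1))
}

# "Reader" step: parse a single per-document XML string and pull out
# the fields we already know we need (here: id and text).
read_solr_doc <- function(doc_xml) {
  node <- xmlRoot(xmlParse(doc_xml, asText = TRUE))
  field <- function(name) {
    xpathSApply(node, sprintf("./str[@name='%s']", name), xmlValue)
  }
  list(id = field("id"), text = field("text"))
}

doc_strings <- split_solr_docs(solr_xml)
docs <- lapply(doc_strings, read_solr_doc)
```

In a real tm extension, the list returned by read_solr_doc() would
instead be used to fill a tm document object and its metadata; a plain
list is used here only to keep the sketch independent of tm's internals.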

Feel free to ask for help on any specific issues you run into. But please
provide a short XML example (and possibly code). Also, when you're done,
please consider making this available, either in tm itself or in a
new package, if it can be useful to others.


Regards

1: cran.r-project.org/web/packages/tm/vignettes/extensions.pdf

Andy Adamiec

2012-Jan-14 19:51 UTC

On Sat, Jan 14, 2012 at 12:41 PM, Milan Bouchet-Valat
<nalimilan@club.fr> wrote:
> On Saturday 14 January 2012 at 12:24 -0600, Andy Adamiec wrote:
> > Hi Milan,
> >
> > The Solr XML files are not in a typical format; here is an example:
> > omegahat.org/RSXML/solr.xml
> > I'm not sure how to parse the documents without using the solrDocs.R
> > functions, or how to make those functions compatible with the tm package.
> Indeed, this doesn't seem to be easy to parse using the generic XML
> source from tm. So it will be easier for you to create your own custom
> source from scratch. Have a look at the source.R and reader.R files in
> the tm source: you need to replicate the behavior of one of the sources.
>
> The code should include the following functions:
>
> readSorl <- FunctionGenerator(function(...) {
>    function(elem, language, id) {
>        # Use elem$content, which contains an item set by SorlSource() below,
>        # and create a PlainTextDocument() from it,
>        # putting the data where appropriate (text, meta-data)
>    }
> })
>
> SorlSource <- function(x) {
>    # Parse the XML file using functions from solrDocs.R, and
>    # create "content", which is a list with one item for each document,
>    # to pass to readSorl() one by one
>
>    s <- tm:::.Source(readSorl, "UTF-8", length(content), FALSE,
>                      seq_along(content), 0, FALSE)
>    s$Content <- content
>    s$URI <- match.call()$x
>    class(s) <- c("SorlSource", "Source")
>    s
> }
>
> # Return the item at the current position of the source
> getElem <- function(x) UseMethod("getElem", x)
> getElem.SorlSource <- function(x) {
>    list(content = x$Content[[x$Position]], uri = match.call()$x)
> }
>
> # TRUE once the last item has been read ("end of input")
> eoi <- function(x) UseMethod("eoi", x)
> eoi.SorlSource <- function(x) length(x$Content) <= x$Position
>
>
> Hope this helps
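
To make the getElem()/eoi() protocol in the skeleton above more concrete,
here is a minimal base-R sketch that does not depend on tm at all.
Everything in it is an assumption made for illustration: ToySource,
toyReader, stepNext and read_all are hypothetical names, the generics are
redefined locally, and the loop in read_all() is only a guess at how a
corpus constructor might consume such a source, not tm's actual code.

```r
# Generics named after those in the quoted skeleton, defined locally so
# that the sketch runs without tm.
getElem  <- function(x) UseMethod("getElem", x)
eoi      <- function(x) UseMethod("eoi", x)
stepNext <- function(x) UseMethod("stepNext", x)

# Hypothetical toy source: per-document items plus a cursor that starts
# before the first item.
ToySource <- function(content) {
  structure(list(Content = content, Position = 0L),
            class = c("ToySource", "Source"))
}

# Advance the cursor by one item and return the updated source.
stepNext.ToySource <- function(x) {
  x$Position <- x$Position + 1L
  x
}

# Return the item at the current cursor position.
getElem.ToySource <- function(x) list(content = x$Content[[x$Position]])

# TRUE once the cursor has reached the last item.
eoi.ToySource <- function(x) length(x$Content) <= x$Position

# Reader generator: the outer call fixes options, the inner function
# turns one element into a document (a plain list in this sketch).
toyReader <- function(...) {
  function(elem, language, id) {
    list(id = id, language = language, text = elem$content)
  }
}

# Assumed consumption loop: step, fetch, read, until end of input.
read_all <- function(src, reader, language = "en") {
  docs <- list()
  while (!eoi(src)) {
    src <- stepNext(src)
    docs[[length(docs) + 1L]] <-
      reader(getElem(src), language, as.character(src$Position))
  }
  docs
}

docs <- read_all(ToySource(list("first text", "second text")), toyReader())
```

The point of the sketch is the division of labour: the source only knows
how to hand out items one at a time, while the reader only knows how to
turn a single item into a document.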

