thr3ads.net - R help - [R] Extracting information from a tm corpus. [Mar 2017]

Shawn Way

2017-Mar-02 16:23 UTC

[R] Extracting information from a tm corpus.

I'm trying to use the tm package to extract text from a corpus of documents.
I'm able to read in a set of PDFs and filter the corpus down to the
documents that contain a term, for example, "hot water". I'm also able to
get a list of those documents using the names() function, but I cannot work
out how to get the matching lines out of the corpus.

I would like to end up with a corpus that holds just the filtered content,
i.e. the lines containing the term.

I can manage to do this for a single document using something like:

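  library(tm)
  library(tidyverse)
  library(tidytext)
  library(stringr)

  # Read every PDF in ./pdfs into a corpus and lower-case the text
  cname <- file.path(".", "pdfs")
  docs <- Corpus(DirSource(cname), readerControl = list(reader = readPDF))
  docs <- tm_map(docs, content_transformer(tolower))

  # Keep only the documents in which at least one line matches the term
  # (grepl() returns logicals, so any() no longer has to coerce indices)
  search.par <- c("18")
  docs_filtered <- docs %>%
      tm_filter(FUN = function(x) any(grepl(search.par, content(x))))

  # Lines of the first filtered document that contain the term
  content(docs_filtered[[1]])[grep(search.par, content(docs_filtered[[1]]))]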

This gives me the lines that contain the term "18" in corpus document 1.
Is there any way to do this for all the documents in the corpus?

What I would like is something that holds, for each document in the corpus,
the lines containing the search parameter, so that they can be printed, at
least to screen.
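
For illustration, here is a rough sketch of the kind of result I mean. It
uses only base R on a named list of character vectors (one vector of lines
per document). The list doc_lines and the helper matching_lines() are made
up for this example, and the sample text is invented. I am assuming the real
list could be built from the corpus with lapply(docs_filtered, content), but
I have not confirmed that.

```r
# Stand-in for the corpus: one character vector of lines per document.
# (Invented sample data; with tm this would presumably come from
#  doc_lines <- lapply(docs_filtered, content), which is an assumption.)
doc_lines <- list(
  "a.pdf" = c("hot water at 18 degrees", "no match here", "page 18"),
  "b.pdf" = c("nothing relevant", "tank 18 inspected"),
  "c.pdf" = c("no hits in this one")
)

search.par <- "18"

# Hypothetical helper: return the matching lines of every document,
# dropping documents that have no matching line at all.
matching_lines <- function(doc_lines, pattern) {
  hits <- lapply(doc_lines, function(lines) {
    # fixed = TRUE treats the pattern as a literal string, not a regex
    grep(pattern, lines, value = TRUE, fixed = TRUE)
  })
  hits[lengths(hits) > 0]
}

hits <- matching_lines(doc_lines, search.par)

# Print the matches to screen, grouped by document name
for (doc in names(hits)) {
  cat("==", doc, "==\n")
  cat(hits[[doc]], sep = "\n")
  cat("\n")
}
```

The result is a named list, so hits[["a.pdf"]] gives the matching lines of a
single document, and unlist(hits) flattens everything into one vector.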

Thank you!

Shawn Way