thr3ads.net - R help - [R] using package tm to find phrases [Aug 2009]

Mark Kimpel

2009-Aug-13 19:36 UTC

[R] using package tm to find phrases

I am using the package "tm" for text-mining of abstracts and would like to
use it to find instances of gene names that may contain white space. For
instance "gene regulatory protein 1". The default behavior of tm is to parse
this into 4 separate words, but I would like to use the class constructor
"dictionary" to define phrases such as just mentioned.

Is this possible? If so, how?

Thanks,
Mark
------------------------------------------------------------
Mark W. Kimpel MD  ** Neuroinformatics ** Dept. of Psychiatry
Indiana University School of Medicine

15032 Hunter Court, Westfield, IN  46074

(317) 490-5129 Work, & Mobile & VoiceMail

"The real problem is not whether machines think but whether men do."
-- B. F. Skinner
******************************************************************

Ingo Feinerer

2009-Aug-13 22:11 UTC

On Thu, Aug 13, 2009 at 03:36:22PM -0400, Mark Kimpel wrote:
> I am using the package "tm" for text-mining of abstracts and would like to use
> it to find instances of gene names that may contain white space. For instance
> "gene regulatory protein 1". The default behavior of tm is to parse this into 4
> separate words, but I would like to use the class constructor "dictionary" to
> define phrases such as just mentioned.
>
> Is this possible? If so, how?

Yes.

* In case you only need to find instances, you could use full text
  search on your corpus, e.g.

  R> tmIndex(yourCorpus, "gene regulatory protein 1")

  would return the indices of all documents in your corpus containing
  this phrase.

* If you need tokens (in a term-document matrix) of length 4, you could
  use an n-gram tokenizer (n = 4). See e.g.,
  tm.r-forge.r-project.org/faq.html#Bigrams. Then you can use
  the dictionary argument to store only your selection of gene
  names, i.e., something like

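  R> # Tokenizer emitting only 4-word n-grams (both functions are from RWeka)
  R> yourTokenizer <- function(x)
       RWeka::NGramTokenizer(x, RWeka::Weka_control(min = 4, max = 4))
  R> # Keep only the n-grams that appear in yourDictionary
  R> TermDocumentMatrix(crude,
       control = list(tokenize = yourTokenizer,
                      dictionary = yourDictionary))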

  where yourDictionary contains the gene names (a character vector
  suffices) to be included in the term-document matrix.

* If you want to extract arbitrary patterns of different length that
  could match some gene names (and build a dictionary from that), you
  need some custom functionality. Regular expressions might be a good
  starting point ...
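
As a minimal sketch of the regular-expression route in the last point,
using only base R rather than tm: the sample abstracts, the gene_names
vector, and the helpers phrase_to_regex() and count_phrase() are
hypothetical names introduced for illustration. The sketch assumes the
phrases contain only letters, digits, and spaces (no regex
metacharacters).

```r
# Hypothetical example data: three short "abstracts" as plain strings.
abstracts <- c(
  "We show that gene regulatory protein 1 binds the promoter.",
  "No multi-word gene names are mentioned here.",
  "Both Gene  Regulatory Protein 1 and heat shock protein 70 were induced."
)

# Hypothetical dictionary of multi-word gene names.
gene_names <- c("gene regulatory protein 1", "heat shock protein 70")

# Build a regex from a phrase: words must appear in order, separated by
# one or more whitespace characters, with word boundaries at both ends.
phrase_to_regex <- function(phrase) {
  words <- strsplit(phrase, "[[:space:]]+")[[1]]
  paste0("\\b", paste(words, collapse = "[[:space:]]+"), "\\b")
}

# Count case-insensitive occurrences of one phrase in each text.
count_phrase <- function(phrase, texts) {
  hits <- gregexpr(phrase_to_regex(phrase), texts,
                   ignore.case = TRUE, perl = TRUE)
  # gregexpr() returns -1 for "no match", so count positive positions only.
  vapply(hits, function(h) sum(h > 0), integer(1))
}

# Rows are documents, columns are phrases.
phrase_counts <- sapply(gene_names, count_phrase, texts = abstracts)

# Indices of the documents containing the first phrase.
docs_with_grp1 <- which(phrase_counts[, "gene regulatory protein 1"] > 0)
print(phrase_counts)
print(docs_with_grp1)
```

  The resulting matrix plays a role similar to a small document-term
  matrix restricted to the dictionary phrases, but it is not a tm
  object.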

Best regards, Ingo

-- 
Ingo Feinerer
Vienna University of Technology
dbai.tuwien.ac.at/staff/feinerer
