On Thu, Aug 13, 2009 at 03:36:22PM -0400, Mark Kimpel wrote:
> I am using the package "tm" for text-mining of abstracts and would like to use
> it to find instances of gene names that may contain white space. For instance
> "gene regulatory protein 1". The default behavior of tm is to parse this into 4
> separate words, but I would like to use the class constructor "dictionary" to
> define phrases such as just mentioned.
>
> Is this possible? If so, how?
Yes.
* In case you only need to find instances, you could use full text
  search on your corpus, e.g.,

    R> tmIndex(yourCorpus, "gene regulatory protein 1")

  would return the indices of all documents in your corpus containing
  this phrase.
* If you need tokens (in a term-document matrix) of length 4, you could
  use an n-gram tokenizer (n = 4). See e.g.,
  http://tm.r-forge.r-project.org/faq.html#Bigrams. Then you can use
  the dictionary argument to store only your selection of gene
  names. I.e., something like

    R> yourTokenizer <- function(x)
         RWeka::NGramTokenizer(x, RWeka::Weka_control(min = 4, max = 4))
    R> TermDocumentMatrix(crude, control = list(tokenize = yourTokenizer,
                                                dictionary = yourDictionary))

  where crude is one of the example corpora shipped with tm and
  yourDictionary contains the gene names (a character vector
  suffices) to be included in the term-document matrix.
* If you want to extract arbitrary patterns of different lengths that
  could match some gene names (and build a dictionary from that), you
  need some custom functionality. Regular expressions might be a good
  starting point ...
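
To make the regular expression idea a bit more concrete, here is a minimal
sketch in base R only (no tm involved). The abstracts, the name genePattern,
and the pattern itself (two lowercase words, then "protein" and a number) are
assumptions made up for illustration; real gene nomenclature is far more
varied, so you would have to adapt the pattern to your data.

```r
# Hypothetical abstracts; replace with the text of your own documents.
abstracts <- c(
  "We show that gene regulatory protein 1 binds the promoter.",
  "No gene names are mentioned in this abstract.",
  "Both gene regulatory protein 1 and heat shock protein 70 were induced."
)

# Assumed pattern: two lowercase words, then "protein" and a number.
genePattern <- "\\b[a-z]+ [a-z]+ protein [0-9]+\\b"

# Extract all matches per document; the result is a list with one
# character vector per abstract (empty if nothing matched).
matches <- regmatches(abstracts, gregexpr(genePattern, abstracts, perl = TRUE))

# Collect the unique phrases; a character vector like this could serve
# as the dictionary mentioned above.
yourDictionary <- unique(unlist(matches))

# Indices of the abstracts containing one fixed phrase (plain substring search).
hits <- which(grepl("gene regulatory protein 1", abstracts, fixed = TRUE))

print(yourDictionary)
print(hits)
```

With these three example abstracts, yourDictionary holds the two phrases
"gene regulatory protein 1" and "heat shock protein 70", and hits is 1 and 3.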
Best regards, Ingo
--
Ingo Feinerer
Vienna University of Technology
http://www.dbai.tuwien.ac.at/staff/feinerer