thr3ads.net - R help - [R] Finding words that are within +/- X words of "KRAS" using tm package or other means [May 2012]

Paul Miller

2012-May-16 18:27 UTC

[R] Finding words that are within +/- X words of "KRAS" using tm package or other means

Hello All,

This will probably be easy for some but isn't for me. I'm currently working
on a text mining exercise. I want to be able to predict whether cancer patients
got KRAS testing and, if so, whether the test yielded a result of wild
type/negative or mutant/positive. I've begun with a "bag-of-words"
approach that looks at the counts of specific terms in the medical records
and then uses some of those as predictors.

This works great for predicting whether or not patients got tested. It's not
so good though when it comes to predicting the outcome of testing. Trouble is
that patients can have a reference to KRAS testing and also have a lot of
references to, say, "positive" where that term has nothing to do with
the result of their KRAS testing.

So I'd like to be able to count the instances in a patient's medical
record where relevant terms like "wild type", "negative", "mutant", or
"positive" come either shortly before or shortly after "KRAS". It would
be great if there is a way to do this within the tm package. I've found
that package very helpful for preparing my data thus far.

If not though, I have a data frame that contains patient number in one column
and the patient's complete text medical record in another. So some sort of
regular expression likely would work just fine.

Here are some examples of the sort of thing I'm looking to count:

"Received KRAS testing results on xx/xx/xxxx. Test results indicate the
presence of a mutation."

"Tumor is KRAS negative"

"KRAS (mutated)" 

"Tumor is positive for KRAS mutation" 

And here's an example of something I want to ignore.

"Will conduct KRAS testing prior to initiation of therapy. ... (Several
lines of material) ... Bilirubin positive."

A couple of things stand out here. The first is that I need to be able to pick
up on variations of the relevant terms. So, for example, that means being able
to identify that either "mutant" or "mutated" came in close
proximity to "KRAS".
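One way to pick up such variations is a single pattern built from word stems. The following is only a minimal sketch in base R; the pattern, the example words, and the variable names are illustrative assumptions, not a vetted clinical vocabulary.

```r
# Hypothetical stem-based pattern: matches "mutant", "mutated", "mutation(s)",
# "positive", "negative", and "wild type" / "wild-type" / "wildtype".
# \\b anchors on word boundaries so "permutation" is not picked up.
pattern <- "\\b(mutant|mutat(ed|ion)s?|positive|negative|wild[ -]?type)\\b"

# Illustrative inputs (not taken from any real record)
candidates <- c("mutant", "mutated", "mutations", "Wild-Type",
                "KRAS (mutated)", "permutation", "bilirubin")

# TRUE where a relevant term (in any listed variation) is present
is_relevant <- grepl(pattern, candidates, ignore.case = TRUE, perl = TRUE)
```

Stemming the whole corpus with tm first would be an alternative; the explicit pattern simply keeps the list of accepted variations visible and easy to edit.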

The other thing is that while increasing the number of words to look forward and
backward will identify more valid cases, it will also tend to identify more
invalid ones. For example, looking as many as 12 words after KRAS will
lead to correct identification of:

"Received KRAS testing results on xx/xx/xxxx. Test results indicate the
presence of a mutation."

but also incorrect identification of:

"Will conduct KRAS testing prior to initiation of therapy. Note that
patient was positive for Lynch mutation."

I'm thinking I will need to keep the window short in order to obtain the best
results. It would be nice if I could easily increase or decrease the number of
words to look forward and backward though. It would also be good if I could, say,
select a relatively small number of words to look in one direction and a larger
number of words to look in the other.
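The adjustable, asymmetric window described above can be sketched in base R as follows. This is a rough illustration under stated assumptions, not a tested solution: the function name count_near, its arguments, and the default term pattern are all hypothetical, and "word" is taken to mean a whitespace-separated token with punctuation stripped.

```r
# Count relevant terms within `before` words before and `after` words after
# each occurrence of a keyword. All names and defaults are hypothetical.
count_near <- function(text, keyword = "kras",
                       terms = "^(wild|negative|positive|mutant|mutat)",
                       before = 3, after = 3) {
  # Split on whitespace, then strip punctuation from each token
  words <- strsplit(tolower(text), "\\s+")[[1]]
  words <- gsub("[^a-z0-9]", "", words)
  words <- words[nzchar(words)]

  hits <- which(words == keyword)
  if (length(hits) == 0) return(0L)

  # Positions covered by the window around each keyword occurrence
  idx <- unlist(lapply(hits, function(i) {
    seq(max(1, i - before), min(length(words), i + after))
  }))
  idx <- setdiff(idx, hits)  # de-duplicate and drop the keyword itself

  sum(grepl(terms, words[idx]))
}

# Examples taken from the post
short_hit  <- count_near("Tumor is KRAS negative")
lynch_text <- paste("Will conduct KRAS testing prior to initiation of",
                    "therapy. Note that patient was positive for Lynch mutation.")
narrow <- count_near(lynch_text, before = 3, after = 3)
wide   <- count_near(lynch_text, before = 3, after = 12)
```

Here the narrow window ignores the Lynch sentence while the 12-word window picks up "positive", which is the trade-off described above. For a data frame with one record per row, something like vapply(records$text, count_near, integer(1), after = 5, USE.NAMES = FALSE) would give a count per patient, where records and text are assumed names. Hyphenated forms such as "KRAS-negative" collapse into a single token under this tokenization and would need extra handling.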

Having gotten to the end of this description it occurs to me this is actually
harder than I thought.

If one of you gurus could help me out, that would be greatly appreciated.

Thanks,

Paul
