Paul Miller
2017-Jul-13 19:00 UTC
[R] Extracting sentences with combinations of target words/terms from cancer patient text medical records
Hi Robert,
Thank you for your reply. An attempt to solve this via a regular expression
query is particularly helpful. Unfortunately, I don't have much time to play
around with this just now. Ultimately though, I think I would like to implement
a solution something along the lines of what you have done. I have a book on
regular expressions that I am now starting to read. In the meantime, the code
I'm using is a good way to assess the feasibility of some ideas I'd like
to implement.
The advantage of your approach I think is that it makes fewer passes through the
data. That should make it a lot faster and more efficient than what I've
done. I'm currently working with a little more than 2.5 million text records
and I think that number will only rise. So efficiency really should matter.
I've pasted the latest version of my sample code below. This shows how
I'd like to add the result of the text search as a column in a data frame.
It also shows how I'd like to append the sentence number to each identified
sentence. The single colon that appears where there is no match is not by
design. It's something that I need to tidy.
My sense is that if I used your regular expression as written, I'd lose the
information about the sentence number when I added the result as a column in my
data frame. Presumably, I'd need to collapse the information into a single
text string, and then the numbering would be lost. If you were going to get the
sentence numbers as well, without making several passes through the data like my
code does, how would you go about it?
Thanks,
Paul
library(tidyverse)
library(stringr)
library(lubridate)

sentence_match <- function(x){
  sentence_extract <- str_extract_all(x, boundary("sentence"), simplify = TRUE)
  sentence_number <- intersect(str_which(sentence_extract, "breast"),
                               str_which(sentence_extract, "metastatic|stage IV"))
  sentence_match <- str_c(sentence_number, ": ", sentence_extract[sentence_number], collapse = "")
  sentence_match
}

sampletxt <-
  structure(
    list(
      PTNO = c(1, 2, 2, 2),
      DATE = structure(c(16436, 16436, 16832, 16845), class = "Date"),
      TYPE = c("Progress note", "CAT scan", "Progress note", "Progress note"),
      TVAR = c(
        "This sentence contains the word metastatic. This sentence contains the term stage IV.",
        "This sentence contains no target words. This sentence also contains no target words.",
        "This sentence contains the word metastatic and the word breast. This sentence contains no target words.",
        "This sentence contains the words breast and the term metastatic. This sentence contains the word breast and the term stage IV."
      )
    ),
    .Names = c("PTNO", "DATE", "TYPE", "TVAR"),
    class = c("tbl_df", "tbl", "data.frame"),
    row.names = c(NA, -4L)
  )

sampletxt$EXTRACTED <- sapply(sampletxt$TVAR, sentence_match)
sampletxt$EXTRACTED

> sampletxt$EXTRACTED
[1] ": "
[2] ": "
[3] "1: This sentence contains the word metastatic and the word breast. "
[4] "1: This sentence contains the words breast and the term metastatic. 2: This sentence contains the word breast and the term stage IV."
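
A minimal sketch of one way to tidy the stray colon mentioned above, assuming it is acceptable to return NA when no sentence qualifies. The function name sentence_match_na and the naive sentence splitter are illustrative assumptions, not part of the original code; the splitter breaks after ".", "?" or "!" followed by whitespace, so abbreviations such as "Dr." would be split incorrectly.

```r
# Hypothetical variant of sentence_match(); base R only, names are illustrative.
sentence_match_na <- function(x) {
  # Naive sentence split: break after ., ? or ! when followed by whitespace
  sentences <- strsplit(x, "(?<=[.?!])\\s+", perl = TRUE)[[1]]
  hits <- which(grepl("breast", sentences) &
                  grepl("metastatic|stage IV", sentences))
  # Guard: with no qualifying sentence, return NA instead of a bare ": "
  if (length(hits) == 0) return(NA_character_)
  paste0(hits, ": ", sentences[hits], collapse = " ")
}

tvar <- c(
  "This sentence contains no target words. This sentence also contains no target words.",
  "This sentence contains the word metastatic and the word breast. This sentence contains no target words."
)
extracted <- vapply(tvar, sentence_match_na, character(1), USE.NAMES = FALSE)
print(extracted)
```

The same guard (checking length(sentence_number) before building the string) could presumably be added to the stringr version without changing anything else.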
________________________________
From: Robert McGehee <rmcgehee at walleyetrading.net>
Cc: "r-help at r-project.org" <r-help at r-project.org>
Sent: Wednesday, July 12, 2017 12:47 PM
Subject: RE: [R] Extracting sentences with combinations of target words/terms
from cancer patient text medical records
Hi Paul,
Sounds like you have your answer, but for fun I thought I'd try solving your
problem using only a regular expression query and base R. I believe this works:
> txt <- "Patient had stage IV breast cancer. Nothing matches this sentence. Metastatic and breast match this sentence. French bike champion takes stage IV victory in Tour de France."
> pattern <- "([^.?!]*(?=[^.?!]*\\bbreast\\b)(?=[^.?!]*\\b(metastatic|stage IV)\\b)(?=[\\s.?!])[^.?!]*[.?!])"
> regmatches(txt, gregexpr(pattern, txt, perl=TRUE, ignore.case=TRUE))[[1]]
[1] "Patient had stage IV breast cancer."
[2] " Metastatic and breast match this sentence."
Cheers, Robert
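
As a sketch of how the pattern above might be applied to a whole column of records rather than a single string: gregexpr() and regmatches() both accept character vectors, so the result is a list with one element per record. The records vector and the choice to return NA for records without a match are assumptions made for illustration.

```r
# Illustrative records; not taken from the thread's data.
records <- c(
  "Patient had stage IV breast cancer. Nothing matches this sentence.",
  "This record mentions no target terms at all.",
  "Scan shows metastatic breast lesions. Follow-up in six weeks."
)

# Pattern copied from the message above.
pattern <- "([^.?!]*(?=[^.?!]*\\bbreast\\b)(?=[^.?!]*\\b(metastatic|stage IV)\\b)(?=[\\s.?!])[^.?!]*[.?!])"

# One list element per record, each holding the matching sentences.
hits <- regmatches(records,
                   gregexpr(pattern, records, perl = TRUE, ignore.case = TRUE))

# Collapse to one string per record; NA when nothing matched.
extracted <- vapply(hits, function(h) {
  if (length(h) == 0) return(NA_character_)
  paste(trimws(h), collapse = " ")
}, character(1))

print(extracted)
```

One caveat worth checking before relying on this: the lookahead (?=[\\s.?!]) appears to require at least one character before the first target term, so a record that begins with a target word (for example "Metastatic and breast ...") may not be matched.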
-----Original Message-----
From: R-help [mailto:r-help-bounces at r-project.org] On Behalf Of Paul Miller
via R-help
Sent: Wednesday, July 12, 2017 8:49 AM
To: Bert Gunter <bgunter.4567 at gmail.com>
Cc: R-help <r-help at r-project.org>
Subject: Re: [R] Extracting sentences with combinations of target words/terms
from cancer patient text medical records
Hi Bert,
Thanks for your reply. It appears that I didn't replace the variable name
"sampletxt" with the argument "x" in my function. I've
corrected that and now my code seems to be working fine.
Paul
________________________________
From: Bert Gunter <bgunter.4567 at gmail.com>
Cc: R-help <r-help at r-project.org>
Sent: Tuesday, July 11, 2017 2:00 PM
Subject: Re: [R] Extracting sentences with combinations of target words/terms
from cancer patient text medical records
Have you looked at the CRAN Natural Language Processing Task View? If not, why
not? If so, why were the resources described there inadequate?
Bert
On Jul 11, 2017 10:49 AM, "Paul Miller via R-help" <r-help at r-project.org> wrote:
>Hello All,
>
>I need some help figuring out how to extract combinations of target
>words/terms from cancer patient text medical records. I've provided some
>sample data and code below to illustrate what I'm trying to do. At the
>moment, I'm trying to extract sentences that contain the word
>"breast" plus either "metastatic" or "stage IV".
>
>It's been some time since I used R and I feel a bit rusty. I wrote a
>function called "sentence_match" that seemed to work well when applied
>to a single piece of text. You can see that by running the section titled
>"Working code". I thought that it might be possible easily to
>apply my function to a data set (tibble or df) but that doesn't seem to be
>the case. My unsuccessful attempt to do this appears in the section titled
>"Non-working code".
>
>If someone could help me get my code up and running, that would be greatly
>appreciated. I'm using a lot of functions from Hadley Wickham's
>packages, but that's not particularly necessary. Although I have only a few
>entries in my sample data, my actual data are pretty large. Currently, I'm
>working with over a million records. Some records contain only a single
>sentence, but many have several paragraphs. One concern I had was that, even if
>I could get my code working, it would be too inefficient to handle that volume
>of data.
>
>Thanks,
>
>Paul
>
>
>library(tidyverse)
>library(stringr)
>library(lubridate)
>
>sentence_match <- function(x){
>  sentence_extract <- str_extract_all(sampletxt, boundary("sentence"), simplify = TRUE)
>  sentence_number <- intersect(str_which(sentence_extract, "breast"),
>                               str_which(sentence_extract, "metastatic|stage IV"))
>  sentence_match <- str_c(sentence_number, ": ", sentence_extract[sentence_number], collapse = "")
>  sentence_match
>}
>
>#### Working code ####
>
>sampletxt <- "This sentence contains the word metastatic and the word breast. This sentence contains no target words."
>
>sentence_match(sampletxt)
>
>#### Non-working code ####
>
>sampletxt <-
>  structure(
>    list(
>      PTNO = c(1, 2, 2, 2),
>      DATE = structure(c(16436, 16436, 16832, 16845), class = "Date"),
>      TYPE = c("Progress note", "CAT scan", "Progress note", "Progress note"),
>      TVAR = c(
>        "This sentence contains the word metastatic. This sentence contains the term stage IV.",
>        "This sentence contains no target words. This sentence also contains no target words.",
>        "This sentence contains the word metastatic and the word breast. This sentence contains no target words.",
>        "This sentence contains the words breast and the term metastatic. This sentence contains the word breast and the term stage IV."
>      )
>    ),
>    .Names = c("PTNO", "DATE", "TYPE", "TVAR"),
>    class = c("tbl_df", "tbl", "data.frame"),
>    row.names = c(NA, -4L)
>  )
>
>sampletxt2 <- group_by_at(sampletxt, vars(PTNO, DATE, TYPE))
>sampletxt2 <-
> sampletxt2 %>%
> mutate(
> EXTRACTED = sentence_match(TVAR)
> )
>
>______________________________________________
>R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>https://stat.ethz.ch/mailman/listinfo/r-help
>PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>and provide commented, minimal, self-contained, reproducible code.
>
Robert McGehee
2017-Jul-13 20:23 UTC
[R] Extracting sentences with combinations of target words/terms from cancer patient text medical records
Hi Paul,
No need to collapse the information into a single text string: gregexpr() can
take a vector of strings (sentences in your case). You can split your sentences
up, number them how you want, then search for your pattern either via regex or
via the extra packages you use, which probably use the PCRE regex library
anyway. However, as this is basically what you did, I'm not sure why
you're not happy with your existing approach.
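
A rough sketch of the split, number, then search idea described above, in base R. The long-format data frame, its column names (record, sentence_no, sentence) and the naive sentence splitter are assumptions made for illustration; the text is split once and each target pattern is then matched against the whole sentence vector in a single vectorised call.

```r
# Sample text taken from the TVAR column earlier in the thread.
tvar <- c(
  "This sentence contains the word metastatic. This sentence contains the term stage IV.",
  "This sentence contains no target words. This sentence also contains no target words.",
  "This sentence contains the word metastatic and the word breast. This sentence contains no target words.",
  "This sentence contains the words breast and the term metastatic. This sentence contains the word breast and the term stage IV."
)

# Split every record into sentences (naive: break after ., ? or ! plus whitespace).
sentences <- strsplit(tvar, "(?<=[.?!])\\s+", perl = TRUE)

# One row per sentence, keeping the record index and the sentence number.
long <- data.frame(
  record      = rep(seq_along(sentences), lengths(sentences)),
  sentence_no = sequence(lengths(sentences)),
  sentence    = unlist(sentences),
  stringsAsFactors = FALSE
)

# Vectorised search over all sentences at once.
keep <- grepl("\\bbreast\\b", long$sentence, ignore.case = TRUE, perl = TRUE) &
  grepl("\\b(metastatic|stage IV)\\b", long$sentence, ignore.case = TRUE, perl = TRUE)
matches <- long[keep, ]

# Collapse back to one numbered string per record; NA where nothing matched.
labelled  <- paste0(matches$sentence_no, ": ", matches$sentence)
collapsed <- vapply(split(labelled, matches$record), paste, character(1), collapse = " ")
extracted <- rep(NA_character_, length(tvar))
extracted[as.integer(names(collapsed))] <- collapsed

print(extracted)
```

Keeping the intermediate long table around may also be useful in its own right, since the sentence numbers stay in a separate column instead of being embedded in the text.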