Paul Miller
2017-Jul-11 14:10 UTC
[R] Extracting sentences with combinations of target words/terms from cancer patient text medical records
Hello All,

I need some help figuring out how to extract combinations of target words/terms from cancer patient text medical records. I've provided some sample data and code below to illustrate what I'm trying to do. At the moment, I'm trying to extract sentences that contain the word "breast" plus either "metastatic" or "stage IV".

It's been some time since I used R and I feel a bit rusty. I wrote a function called "sentence_match" that seemed to work well when applied to a single piece of text. You can see that by running the section titled "Working code". I thought that it might be possible easily to apply my function to a data set (tibble or df) but that doesn't seem to be the case. My unsuccessful attempt to do this appears in the section titled "Non-working code".

If someone could help me get my code up and running, that would be greatly appreciated. I'm using a lot of functions from Hadley Wickham's packages, but that's not particularly necessary. Although I have only a few entries in my sample data, my actual data are pretty large. Currently, I'm working with over a million records. Some records contain only a single sentence, but many have several paragraphs. One concern I had was that, even if I could get my code working, it would be too inefficient to handle that volume of data.

Thanks,

Paul

library(tidyverse)
library(stringr)
library(lubridate)

sentence_match <- function(x){
  sentence_extract <- str_extract_all(sampletxt, boundary("sentence"), simplify = TRUE)
  sentence_number <- intersect(str_which(sentence_extract, "breast"), str_which(sentence_extract, "metastatic|stage IV"))
  sentence_match <- str_c(sentence_number, ": ", sentence_extract[sentence_number], collapse = "")
  sentence_match
}

#### Working code ####

sampletxt <- "This sentence contains the word metastatic and the word breast. This sentence contains no target words."

sentence_match(sampletxt)

#### Non-working code ####

sampletxt <-
  structure(
    list(
      PTNO = c(1, 2, 2, 2),
      DATE = structure(c(16436, 16436, 16832, 16845), class = "Date"),
      TYPE = c("Progress note", "CAT scan", "Progress note", "Progress note"),
      TVAR = c(
        "This sentence contains the word metastatic. This sentence contains the term stage IV.",
        "This sentence contains no target words. This sentence also contains no target words.",
        "This sentence contains the word metastatic and the word breast. This sentence contains no target words.",
        "This sentence contains the words breast and the term metastatic. This sentence contains the word breast and the term stage IV."
      )
    ),
    .Names = c("PTNO", "DATE", "TYPE", "TVAR"),
    class = c("tbl_df", "tbl", "data.frame"),
    row.names = c(NA,-4L)
  )

sampletxt2 <- group_by_at(sampletxt, vars(PTNO, DATE, TYPE))
sampletxt2 <-
  sampletxt2 %>%
  mutate(
    EXTRACTED = sentence_match(TVAR)
  )
Bert Gunter
2017-Jul-11 18:00 UTC
[R] Extracting sentences with combinations of target words/terms from cancer patient text medical records
Have you looked at the CRAN Natural Language Processing Task View? If not, why not? If so, why were the resources described there inadequate?

Bert
Paul Miller
2017-Jul-12 12:48 UTC
[R] Extracting sentences with combinations of target words/terms from cancer patient text medical records
Hi Bert,

Thanks for your reply. It appears that I didn't replace the variable name "sampletxt" with the argument "x" in my function. I've corrected that and now my code seems to be working fine.

Paul
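
For readers following the thread, the correction described above can be sketched roughly as follows. This is a minimal illustration, not the code Paul actually ran: the function body now reads its argument "x" instead of the global "sampletxt", which is the fix he reports. Three details are assumptions added for the sketch and do not appear in the original posts: taking the first list element with [[1]] instead of simplify = TRUE, returning NA_character_ when no sentence matches, and applying the function one record at a time with vapply() on a plain data frame named "notes" rather than through group_by_at() and mutate().

```r
library(stringr)

# Corrected sketch: the body uses the argument `x`, not the global `sampletxt`.
sentence_match <- function(x) {
  # Split one piece of text into sentences ([[1]] is an assumption of this sketch).
  sentence_extract <- str_extract_all(x, boundary("sentence"))[[1]]
  # Positions of sentences containing "breast" AND ("metastatic" OR "stage IV").
  sentence_number <- intersect(
    str_which(sentence_extract, "breast"),
    str_which(sentence_extract, "metastatic|stage IV")
  )
  # Assumption: report NA when no sentence qualifies, so every call returns one string.
  if (length(sentence_number) == 0) {
    return(NA_character_)
  }
  # Label each matching sentence with its position and join them into one string.
  str_c(sentence_number, ": ", sentence_extract[sentence_number], collapse = "")
}

# Hypothetical data frame holding the sample text column from the thread.
notes <- data.frame(
  PTNO = c(1, 2, 2, 2),
  TVAR = c(
    "This sentence contains the word metastatic. This sentence contains the term stage IV.",
    "This sentence contains no target words. This sentence also contains no target words.",
    "This sentence contains the word metastatic and the word breast. This sentence contains no target words.",
    "This sentence contains the words breast and the term metastatic. This sentence contains the word breast and the term stage IV."
  ),
  stringsAsFactors = FALSE
)

# Apply the function to each record separately; each call sees exactly one text.
notes$EXTRACTED <- vapply(notes$TVAR, sentence_match, character(1), USE.NAMES = FALSE)
print(notes$EXTRACTED)
```

Calling the function once per record means the result does not depend on every (PTNO, DATE, TYPE) group containing exactly one row, which happens to be true of the sample data but may not hold for the full set of records. Whether this is fast enough for over a million records would need to be measured on the real data.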