Tony B
2010-Mar-31 13:20 UTC
[R] regular expression help to extract specific strings from text
Dear all,

Let's say I have the following:

> x <- c("Eve: Going to try something new today...",
         "Adam: Hey @Eve, how are you finding R? #rstats",
         "Eve: @Adam, It's awesome, so much better at statistics that #Excel ever was! @Cain & @Able disagree though :(",
         "Adam: @Eve I'm sure they'll sort it out :)",
         "blahblah")
> x
[1] "Eve: Going to try something new today..."
[2] "Adam: Hey @Eve, how are you finding R? #rstats"
[3] "Eve: @Adam, It's awesome, so much better at statistics that \n#Excel ever was! @Cain & @Able disagree though :("
[4] "Adam: @Eve I'm sure they'll sort it out :)"
[5] "blahblah"

I would like to come up with a data frame which looks like this (pulling out the usernames and #tags):

> data.frame(Msg = x,
             Source = c("Eve", "Adam", "Eve", "Adam", NA),
             Mentions = c(NA, "Eve", "Adam, Cain, Able", "Eve", NA),
             HashTags = c(NA, "rstats", "Excel", NA, NA))

The best I can do so far is:

source <- lapply(x, function (x) {
  tmp <- strsplit(x, ":", fixed = TRUE)
  if(length(tmp[[1]]) < 2) {
    tmp <- c(NA, tmp)
  }
  return(tmp[[1]][1])
} )
source <- unlist(source)

[1] "Eve"  "Adam" "Eve"  "Adam" NA

I can't work out how to extract the usernames starting with '@' or the #tags. I can identify them using gsub and replace them, but I don't know how to extract just those terms, e.g. sort of the opposite of the following:

> gsub("@([A-Za-z0-9_]+)", "@[...]", x)
[1] "Eve: Going to try something new today..."
[2] "Adam: Hey @[...], how are you finding R? #rstats"
[3] "Eve: @[...], It's awesome, so much better at statistics that #Excel ever was! @[...] & @[...] disagree though :("
[4] "Adam: @[...] I'm sure they'll sort it out :)"
[5] "blahblah"

and

> gsub("#([A-Za-z0-9_]+)", "#[...]", x)
[1] "Eve: Going to try something new today..."
[2] "Adam: Hey @Eve, how are you finding R? #[...]"
[3] "Eve: @Adam, It's awesome, so much better at statistics that #[...] ever was! @Cain & @Able disagree though :("
[4] "Adam: @Eve I'm sure they'll sort it out :)"
[5] "blahblah"

I hope that makes sense, and thank you kindly in advance for your time.

Tony Breyal
Gabor Grothendieck
2010-Mar-31 14:37 UTC
[R] regular expression help to extract specific strings from text
strapply in gsubfn can extract matches based on content, which seems to be what you want:

library(gsubfn)

# Each element of the strapply result is passed to f as a separate argument;
# f pastes the matches for each string into one comma-separated string.
f <- function(...) sapply(list(...), paste, collapse = ", ")

DF <- data.frame(x,
  Source = strapply(x, "^(\\w+):", c, simplify = f),
  Mentions = strapply(x, "@(\\w+)", c, simplify = f),
  HashTags = strapply(x, "#(\\w+)", c, simplify = f))

# Strings with no match come back as "", so recode those as NA.
DF[DF == ""] <- NA
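For comparison, the same table can be sketched in base R without any extra packages. This is only an illustration and not part of the reply above: it assumes R 2.14.0 or later (for `regmatches`), and the helper name `extract_all` and its `drop` argument are made up for this example.

```r
x <- c("Eve: Going to try something new today...",
       "Adam: Hey @Eve, how are you finding R? #rstats",
       "Eve: @Adam, It's awesome, so much better at statistics that #Excel ever was! @Cain & @Able disagree though :(",
       "Adam: @Eve I'm sure they'll sort it out :)",
       "blahblah")

# Hypothetical helper: find every match of `pattern` in each string, drop the
# first `drop` characters of each match (e.g. the leading "@" or "#"), and
# collapse the matches into one string per element. Strings with no match
# give NA.
extract_all <- function(x, pattern, drop = 1L) {
  hits <- regmatches(x, gregexpr(pattern, x))
  vapply(hits, function(m) {
    if (length(m) == 0L) return(NA_character_)
    paste(substring(m, drop + 1L), collapse = ", ")
  }, character(1))
}

DF <- data.frame(
  Msg      = x,
  # Match "name:" at the start of the string, then strip the trailing colon.
  Source   = sub(":$", "", extract_all(x, "^[A-Za-z0-9_]+:", drop = 0L)),
  Mentions = extract_all(x, "@[A-Za-z0-9_]+"),
  HashTags = extract_all(x, "#[A-Za-z0-9_]+"),
  stringsAsFactors = FALSE
)
DF
```

The character class `[A-Za-z0-9_]` is the one from the original question; `gregexpr` finds all matches per string and `regmatches` pulls them out, which is the "opposite of gsub" behaviour asked for.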
hadley wickham
2010-Mar-31 14:50 UTC
[R] regular expression help to extract specific strings from text
On Wed, Mar 31, 2010 at 8:20 AM, Tony B <tony.breyal at googlemail.com> wrote:
> I would like to come up with a data frame which looks like this
> (pulling out the usernames and #tags):
>
> data.frame(Msg = x, Source = c("Eve", "Adam", "Eve", "Adam", NA), Mentions = c(NA, "Eve", "Adam, Cain, Able", "Eve", NA), HashTags = c(NA, "rstats", "Excel", NA, NA))

You can do this pretty easily with the stringr package:

library(stringr)
str_extract_all(x, "@[a-zA-Z]+")
sapply(str_extract_all(x, "@[a-zA-Z]+"), str_c, collapse = ", ")

Hadley

--
Assistant Professor / Dobelman Family Junior Chair
Department of Statistics / Rice University
http://had.co.nz/
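A note on character classes in patterns like these: the range `A-z` (with a lower-case final `z`) is wider than `A-Z`, because the characters `[`, `\`, `]`, `^`, `_` and the backtick sit between `Z` and `a` in ASCII order. The sketch below is not from the thread; it uses a made-up test string and base R with `perl = TRUE` rather than stringr, simply to show the difference.

```r
s <- "@Eve^2 and @Adam_Smith"   # made-up test string

# Range A-z also accepts "^" and "_", so the matches run on too far.
wide   <- regmatches(s, gregexpr("@[a-zA-z]+", s, perl = TRUE))[[1]]

# Range A-Z accepts letters only.
narrow <- regmatches(s, gregexpr("@[a-zA-Z]+", s, perl = TRUE))[[1]]

wide    # "@Eve^" "@Adam_Smith"
narrow  # "@Eve"  "@Adam"
```

If underscores and digits should count as part of a user name, a class such as `[A-Za-z0-9_]` (as in the original question) states that explicitly.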
Tony B
2010-Apr-01 11:22 UTC
[R] regular expression help to extract specific strings from text
Thank you guys, both solutions work great! Seems I have two new packages to investigate :)

Regards,
Tony Breyal