thr3ads.net - R help - [R] regular expression help to extract specific strings from text [Mar 2010]

Tony B

2010-Mar-31 13:20 UTC

[R] regular expression help to extract specific strings from text

Dear all,

Let's say I have the following:

> x <- c("Eve: Going to try something new today...",
         "Adam: Hey @Eve, how are you finding R? #rstats",
         "Eve: @Adam, It's awesome, so much better at statistics that #Excel ever was! @Cain & @Able disagree though :(",
         "Adam: @Eve I'm sure they'll sort it out :)",
         "blahblah")
> x
[1] "Eve: Going to try something new today..."
[2] "Adam: Hey @Eve, how are you finding R? #rstats"
[3] "Eve: @Adam, It's awesome, so much better at statistics that \n#Excel ever was! @Cain & @Able disagree though :("
[4] "Adam: @Eve I'm sure they'll sort it out :)"
[5] "blahblah"

I would like to come up with a data frame which looks like this
(pulling out the usernames and #tags):
> data.frame(Msg = x,
             Source = c("Eve", "Adam", "Eve", "Adam", NA),
             Mentions = c(NA, "Eve", "Adam, Cain, Able", "Eve", NA),
             HashTags = c(NA, "rstats", "Excel", NA, NA))

The best I can do so far is:

source <- lapply(x, function (x) {
   tmp <- strsplit(x, ":", fixed = TRUE)
   if(length(tmp[[1]]) < 2) {
     tmp <- c(NA, tmp)
   }
   return(tmp[[1]][1])
 } )
source <- unlist(source)

[1] "Eve"  "Adam" "Eve"  "Adam" NA

I can't work out how to extract the usernames starting with '@' or the
#tags. I can identify them using gsub and replace them, but I don't
know how to just extract those terms only, e.g. sort of the opposite
of the following:

> gsub("@([A-Za-z0-9_]+)", "@[...]", x)
[1] "Eve: Going to try something new today..."
[2] "Adam: Hey @[...], how are you finding R? #rstats"
[3] "Eve: @[...], It's awesome, so much better at statistics that #Excel ever was! @[...] & @[...] disagree though :("
[4] "Adam: @[...] I'm sure they'll sort it out :)"
[5] "blahblah"

and

> gsub("#([A-Za-z0-9_]+)", "#[...]", x)
[1] "Eve: Going to try something new today..."
[2] "Adam: Hey @Eve, how are you finding R? #[...]"
[3] "Eve: @Adam, It's awesome, so much better at statistics that #[...] ever was! @Cain & @Able disagree though :("
[4] "Adam: @Eve I'm sure they'll sort it out :)"
[5] "blahblah"

I hope that makes sense, and thank you kindly in advance for your
time.
Tony Breyal

Gabor Grothendieck

2010-Mar-31 14:37 UTC

strapply in gsubfn can extract matches based on content which seems to
be what you want:

library(gsubfn)

f <- function(...) sapply(list(...), paste, collapse = ", ")

DF <- data.frame(x,
	Source = strapply(x, "^(\\w+):", c, simplify = f),
	Mentions = strapply(x, "@(\\w+)", c, simplify = f),
	HashTags = strapply(x, "#(\\w+)", c, simplify = f))

DF[DF == ""] <- NA
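
The helper `f` and the final `DF[DF == ""] <- NA` line work together, and the mechanism can be seen in isolation with base R alone. The sketch below calls `f` by hand with three made-up components (the name `collapsed` and the inputs are illustrative, not from the thread). It assumes that strapply passes one component per input string and something empty where nothing matched; that is inferred from the `DF == ""` step, not checked against the gsubfn documentation.

```r
# Same helper as in the reply: collapse each argument into one string.
f <- function(...) sapply(list(...), paste, collapse = ", ")

# Hand-made stand-ins for per-message results: no match, one match, three matches.
collapsed <- f(NULL, "Eve", c("Adam", "Cain", "Able"))

# paste() on an empty input with `collapse` yields "", so empty results
# show up as empty strings; recode those as missing values.
collapsed[collapsed == ""] <- NA

print(collapsed)
```

Collapsing first and recoding afterwards keeps every column the same length as `x`, which is what data.frame() needs.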




hadley wickham

2010-Mar-31 14:50 UTC

You can do this pretty easily with the stringr package:

library(stringr)
str_extract_all(x, "@[a-zA-Z]+")
sapply(str_extract_all(x, "@[a-zA-Z]+"), str_c, collapse = ", ")

Hadley



-- 
Assistant Professor / Dobelman Family Junior Chair
Department of Statistics / Rice University
http://had.co.nz/
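
For readers without stringr, a similar list-of-matches result can be sketched in base R. This is only an illustration: it assumes a version of R that provides regmatches() (which, as far as I know, is newer than this thread), it reuses the character class from the original question, and the variable names `mentions` and `hashtags` are made up here.

```r
x <- c("Eve: Going to try something new today...",
       "Adam: Hey @Eve, how are you finding R? #rstats",
       "Eve: @Adam, It's awesome, so much better at statistics that #Excel ever was! @Cain & @Able disagree though :(",
       "Adam: @Eve I'm sure they'll sort it out :)",
       "blahblah")

# gregexpr() locates every match in each string; regmatches() pulls them out.
# The result is a list with one character vector per element of x.
mentions <- regmatches(x, gregexpr("@[A-Za-z0-9_]+", x))
hashtags <- regmatches(x, gregexpr("#[A-Za-z0-9_]+", x))

print(mentions[[3]])
print(hashtags[[2]])
```

Messages without any match come back as zero-length character vectors, so the list always has the same length as `x`.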

Tony B

2010-Apr-01 11:22 UTC

Thank you guys, both solutions work great! Seems I have two new
packages to investigate :)

Regards,
Tony Breyal
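
Both replies depend on add-on packages. As a rough base R sketch of the table the original post asked for, the following builds all four columns without them. The helper `extract_tokens` and the variables `has_source` and `src` are hypothetical names introduced for this example, and the code assumes an R version that has regmatches().

```r
x <- c("Eve: Going to try something new today...",
       "Adam: Hey @Eve, how are you finding R? #rstats",
       "Eve: @Adam, It's awesome, so much better at statistics that #Excel ever was! @Cain & @Able disagree though :(",
       "Adam: @Eve I'm sure they'll sort it out :)",
       "blahblah")

# Hypothetical helper (not from the thread): find every token that starts
# with `marker`, drop the marker, and collapse to one string per message.
extract_tokens <- function(text, marker) {
  pattern <- paste0(marker, "[A-Za-z0-9_]+")
  found <- regmatches(text, gregexpr(pattern, text))
  vapply(found, function(m) {
    if (length(m) == 0) {
      NA_character_                      # no token in this message
    } else {
      paste(sub(marker, "", m, fixed = TRUE), collapse = ", ")
    }
  }, character(1))
}

# Source: the word before the first colon, or NA when the message has none.
has_source <- grepl("^\\w+:", x)
src <- ifelse(has_source, sub("^(\\w+):.*$", "\\1", x), NA_character_)

DF <- data.frame(Msg = x,
                 Source = src,
                 Mentions = extract_tokens(x, "@"),
                 HashTags = extract_tokens(x, "#"),
                 stringsAsFactors = FALSE)
print(DF[, c("Source", "Mentions", "HashTags")])
```

Returning NA directly from the helper avoids the separate step of recoding empty strings afterwards.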

