R Help Forum

Is there a R library (or a way) that I can extract unique character strings,
or repeating patterns in textual strings. Say for example I have the
following records:

Abc_1234_kjhksh_276
Abc
Abc_1234_lakdofyo_324
Bce_876_skdhk_*&^%*&
Bce
Bce_454

And I would like to see the following results

Abc
Abc_1234
Bce

Jeff Reichman
The answer is, of course, using regular expressions and/or libraries
therefor. However, I do not think you have defined your problem
sufficiently. Some questions I have:
1. Do possible patterns to be matched always appear at the beginning
of your strings?
2. Always together between specified separators ("_" in your
example); or one of several specified separators; or otherwise?
3. Do spaces or other nonprinting characters occur in your strings?
e.g. would
abc_something
this.is_a long stringwithabcinthemiddle
be considered matching?
There are undoubtedly other possibilities that I've missed.
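The distinction behind question 1 can be sketched in base R. This is only an illustration: it assumes the key of interest is the literal string "abc" and that ".", "_" and space are the separators, and the variable names are made up for the example.

```r
x <- c("abc_something", "this.is_a long stringwithabcinthemiddle")

# Unanchored, literal search: finds "abc" anywhere in the string
anywhere <- grepl("abc", x, fixed = TRUE)

# Anchored search: "abc" must start the string and be followed by a
# separator (".", "_" or space) or by the end of the string
at_start <- grepl("^abc([._ ]|$)", x)

print(data.frame(x, anywhere, at_start))
```

Under the anchored reading only the first string matches; under the unanchored reading both do.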
You may also find it useful to check this "task view" out for
possibilities:
https://cran.r-project.org/web/views/NaturalLanguageProcessing.html
Cheers,
Bert
Bert Gunter
"The trouble with having an open mind is that people keep coming along
and sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
On Fri, May 4, 2018 at 3:25 PM, Jeff Reichman <reichmanj at sbcglobal.net>
wrote:
> R Help Forum
>
> Is there a R library (or a way) that I can extract unique character strings,
> or repeating patterns in textual strings.
"Does that help?"
No. I am not your private consultant. You need to reply to the list, which
I have cc'ed here, not just me.
I am still somewhat confused by your specifications, but others may not be.
Part of my confusion stems from your failure to provide a reproducible
example (see e.g. the posting guide linked below). For example, I cannot
tell from your text whether the Abc and Bce strings contain one or more
spaces at the end. I shall assume they may but need not.
Anyway, here is a reproducible example and solution that assumes that the
substrings/patterns of interest to you occur at the beginning of the
strings and may or may not be followed by one of ".", "_" or " " (space) and
then possibly further text which should be ignored. Assuming that you are
familiar with regular expressions, maybe this will help to get you started
even if I have misunderstood your specifications. If you aren't familiar
with regex's, maybe the stringr package may provide a gentler interface
than using R's raw regex functionality. Or maybe someone else can suggest a
better approach (which is another reason why you should reply to the list,
not just me).
z <- c("abc",
"abc_def",
"abc.def",
"abc def",
"abcd_ef",
"abcd",
"e","f")
pats <- unique(sub("^(.+)[. _]+.*", "\\1", z))
## gives:
> pats
[1] "abc"  "abcd" "e"    "f"

This gives you the four separate patterns that you could then use to group
your records, perhaps by:

> lapply(pats, function(x) grep(paste0("^", x, "([_. ]|$)"), z))
[[1]]
[1] 1 2 3 4

[[2]]
[1] 5 6

[[3]]
[1] 7

[[4]]
[1] 8

That is, indices 1-4 in z are the first group; 5 and 6 are the second; etc.
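One caveat, offered as a sketch rather than a correction: `.+` is greedy, so when a string contains several separators the sub() call above keeps everything up to the last one. If the key is meant to be only the text before the first separator (an assumption about the specification, not something the thread settles), a negated character class does that. The vector `recs` reuses records from the original question; the other names are made up for the example.

```r
recs <- c("Abc_1234_kjhksh_276", "Abc", "Abc_1234_lakdofyo_324", "Bce_454")

# Greedy .+ gives the capture group everything up to the LAST separator
greedy <- sub("^(.+)[. _]+.*", "\\1", recs)

# [^._ ]+ cannot cross a separator, so it stops at the FIRST one
first_token <- sub("^([^._ ]+).*", "\\1", recs)

print(greedy)
print(unique(first_token))
```

With the stringr package mentioned above, `str_extract(recs, "^[^._ ]+")` expresses the same first-token idea.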
Cheers,
Bert
On Fri, May 4, 2018 at 9:00 PM, Jeff Reichman <reichmanj at sbcglobal.net>
wrote:
> Bert
>
> Thank you for the link. Figured there might be something
>
> Regarding your questions
>
> This is from a large data set of 53 billion records. The column in question
> is AdNames (Real Time Bidding data).
>
> #1. Generally yes, but not always
>
> #2 Separators could be underscores (_) or dots (.) as in 1.2.3_ABC ......
>
> #3 Yes. So "Abc 123" could be a matching string
>
> This would not be considered a match ...
> abc_something
> this.is_a long stringwithabcinthemiddle
>
> The sequence(s) are always at the beginning (or so it appears). Out
> of the 54 billion records I am able to pull (SparkR sql) 948,679 unique
> strings. It is from these unique strings that I (if possible) want to
> identify the "key" strings.
>
> 1. Abc_1232.niok7j9hd
> 2. Abc
> 3. Abc.2#348hfk2.njilo
> 4. Abc.2
> 5. Abc.7
> 6. BAdfr_kajdhf98#kjsdh
> 7. BAdrf_gofer
> 948679 ....
>
>
> So I may have a thousand individual strings, all of which have Abc as a
> common string, or BAdrf. So I am looking to pull "Abc", "BAdrf", etc., so
> that I can go back and restructure the data to show that any record with
> Abc_1232.niok7j9hd is part of the Abc "Group," or Family ???
>
> Does that help
>
> Jeff
>
> -----Original Message-----
> From: Bert Gunter <bgunter.4567 at gmail.com>
> Sent: Friday, May 4, 2018 5:41 PM
> To: reichmanj at sbcglobal.net
> Cc: R-help <R-help at r-project.org>
> Subject: Re: [R] Discovering patterns in textual strings
>
> The answer is, of course, using regular expressions and/or libraries
> therefor. However, I do not think you have defined your problem
> sufficiently.
Jeff:

The previous solution I sent you was hugely inefficient and frankly kind of
stupid. Here is a much better and simpler solution.

> z <- c("abc","abc_def", "abc.def", "abc def", "abcd_ef", "abcd", "e","f")

## Create vector of patterns of same length as z, many of which are repeated
> pats <- sub("^(.+)[. _].*","\\1",z)

## Now can use tapply() to get indices if desired
## Note that the patterns label the groups
> tapply(seq_along(z),pats,I)
$abc
[1] 1 2 3 4

$abcd
[1] 5 6

$e
[1] 7

$f
[1] 8

No need to reply.

Cheers,
Bert
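Since the stated goal is to mark each record as part of a group or family, the pattern vector can also be kept next to the records. A small sketch under the same assumptions as the tapply() solution (key at the start of the string; ".", "_" and space as separators); the data frame and column names are made up for illustration.

```r
z <- c("abc", "abc_def", "abc.def", "abc def", "abcd_ef", "abcd", "e", "f")
pats <- sub("^(.+)[. _].*", "\\1", z)

# One row per record, labelled with its group
grouped <- data.frame(record = z, group = pats, stringsAsFactors = FALSE)

# Number of records in each group
group_sizes <- table(grouped$group)

# split() gives the same indices per group as tapply(seq_along(z), pats, I)
idx <- split(seq_along(z), pats)

print(grouped)
print(group_sizes)
```

Keeping the label as a column means the grouping can be joined back to the full data afterwards instead of being recomputed.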
Bert
Here are some examples of the type of text strings I'm dealing with:
??????.??.???
??????.??.??????????
?Torrent? Pro - Torrent App
?Torrent?-Torrent Downloader
1 Pic 8 Words - Syllables
1 Pic 8 Words - Syllables
27043_Spanish songs for children
28.android.com.alpha.horoscope
28.android.com.bravo.horoscope
28.Card Game - Offline
28.card Game Multiplayer
37045_Spanish songs for children
7 Minute Workout for Weight Loss: Daily Cardio App
7 Minute Workout Plus
7 Minute Workout_SMA_IA_$2.25_com.popularapp.sevenmins_CD_Android_MEDIUMRECTANGLE_300x250_IAB7
7 Nights at Pizza House - 2
7 Nights at Pizza House 3D
com.zombodroid
com.zombodroid.battle
com.zombodroid.memegenerator
com.zone.talking.pet
com.zone.yinshidaquan
Disney Kingdom
Disney Kingdom_Android
Evite
Evite Invitations
Evite IOS_Evite_IOS_320x50
Excavator Simulator 3D:Sand
Excavator Snow Plow Loader Truck
Flippy Knife
Flippy Knife - 654567
fliptech.iowafmworld
fliptech.serbiafmworld
Floor is lava!
Floor is lava: Escape
Go_Launcher
Go_Launcher_Lite
myyearbook Android
myyearbook.com-MeetMe_Android_300x250_UK
hoping to obtain something like ...
??????.??
Torrent
1 Pic 8 Words
7 Minute Workout
7 Nights at Pizza House
com.zombodroid
com.zone
Disney Kingdom
Flippy Knife
fliptech
Floor is lava
Go_Launcher
myyearbook
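For names like these, no single "first separator" rule yields the keys listed above, because a key may itself contain spaces, dots or underscores. One possible heuristic, sketched here as an assumption and not as anything proposed in the thread: build every leading chunk of each string that ends at a separator, count how many distinct strings share each chunk, and take as the key the longest chunk shared by at least two strings. The function names, the separator set and the threshold of two are all made up for illustration.

```r
ad_names <- c("Evite", "Evite Invitations", "Evite IOS_Evite_IOS_320x50",
              "Flippy Knife", "Flippy Knife - 654567",
              "Go_Launcher", "Go_Launcher_Lite",
              "com.zombodroid", "com.zombodroid.battle",
              "com.zombodroid.memegenerator")

# All leading chunks of s that end just before a separator, plus s itself.
# The separator set is an assumption; adjust it to the data.
leading_prefixes <- function(s, sep = "[ ._:-]") {
  cuts <- gregexpr(sep, s)[[1]]
  ends <- c(cuts[cuts > 1] - 1L, nchar(s))  # also drops "no match" (-1)
  p <- unique(trimws(substring(s, 1L, ends)))
  p[nzchar(p)]
}

prefix_list <- lapply(unique(ad_names), leading_prefixes)

# In how many distinct strings does each prefix occur?
prefix_counts <- table(unlist(prefix_list))

# Key of a string: its longest prefix shared with at least one other string,
# falling back to the whole string when nothing is shared.
pick_key <- function(p) {
  shared <- p[prefix_counts[p] >= 2]
  if (length(shared) == 0) return(p[length(p)])
  shared[which.max(nchar(shared))]
}

keys <- vapply(prefix_list, pick_key, character(1))
print(unique(keys))
```

On this subset the keys come out as Evite, Flippy Knife, Go_Launcher and com.zombodroid. The heuristic will not reproduce every line of the wished-for output: punctuation such as "!" is not in the separator set, and a key that sits in the middle of a name is not found at all.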
From: Bert Gunter <bgunter.4567 at gmail.com>
Sent: Saturday, May 5, 2018 2:14 AM
To: reichmanj at sbcglobal.net
Cc: R-help <r-help at r-project.org>
Subject: Re: [R] Discovering patterns in textual strings
I am still somewhat confused by your specifications, but others may not be. Part
of my confusion stems from your failure to provide a reproducible example (see
e.g. the posting guide linked below). For example, I cannot tell from your text
whether the Abc and Bce strings contain one or more spaces at the end. I shall
assume they may but need not.