thr3ads.net - R help - [R] Discovering patterns in textual strings [May 2018]


Jeff Reichman

2018-May-04 22:25 UTC

[R] Discovering patterns in textual strings

R Help Forum

 

Is there an R library (or some other way) to extract unique character strings,
or repeating patterns, from textual strings?  Say for example I have the
following records:

 

Abc_1234_kjhksh_276

Abc

Abc_1234_lakdofyo_324

Bce_876_skdhk_*&^%*&

Bce

Bce_454

 

And I would like to see the following results

Abc

Abc_1234

Bce

 

 

Jeff Reichman



Bert Gunter

2018-May-04 22:41 UTC


[R] Discovering patterns in textual strings

The answer is, of course, using regular expressions and/or libraries
therefor. However, I do not think you have defined your problem
sufficiently. Some questions I have:

1. Do possible patterns to be matched always appear at the beginning
of your strings?

2. Always together between specified separators ("_"  in your
example); or one of several specified separators; or otherwise?

3. Do spaces or other nonprinting characters occur in your strings?

e.g. would

abc_something
this.is_a long stringwithabcinthemiddle

be considered matching?
There are undoubtedly other possibilities that I've missed.
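The difference behind questions 1 and 2 can be sketched with the two example strings above. This is only an illustration in base R; the separator set ("_", "." and space) is an assumption at this point in the thread, not something the original poster has confirmed.

```r
x <- c("abc_something", "this.is_a long stringwithabcinthemiddle")

# Unanchored: "abc" may occur anywhere in the string
anywhere <- grepl("abc", x, fixed = TRUE)

# Anchored: "abc" must start the string and be followed by
# a separator ("_", "." or space) or by the end of the string
at_start <- grepl("^abc([_. ]|$)", x)

print(data.frame(x, anywhere, at_start))
```

Both strings contain "abc", but only the first one has it as a leading, separator-delimited token, so the two tests disagree on the second string.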

You may also find it useful to check this "task view" out for
possibilities:
https://cran.r-project.org/web/views/NaturalLanguageProcessing.html

Cheers,
Bert


Bert Gunter

"The trouble with having an open mind is that people keep coming along
and sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )


______________________________________________
R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Bert Gunter

2018-May-05 07:14 UTC


[R] Discovering patterns in textual strings

"Does that help?"

No. I am not your private consultant. You need to reply to the list, which
I have cc'ed here, not just me.

I am still somewhat confused by your specifications, but others may not be.
Part of my confusion stems from your failure to provide a reproducible
example (see e.g. the posting guide,
http://www.R-project.org/posting-guide.html).  For example, I cannot
tell from your text whether the Abc and Bce strings contain one or more
spaces at the end. I shall assume they may but need not.

Anyway, here is a reproducible example and solution that assumes that the
substrings/patterns of interest to you occur at the beginning of the
strings and may or may not be followed by one of ".", "_" or " " (space),
and then possibly further text which should be ignored. Assuming that you are
familiar with regular expressions, maybe this will help to get you started
even if I have misunderstood your specifications. If you aren't familiar
with regex's, maybe the stringr package may provide a gentler interface
than using R's raw regex functionality. Or maybe someone else can suggest a
better approach (which is another reason why you should reply to the list,
not just me).

z <- c("abc",
       "abc_def",
       "abc.def",
       "abc def",
       "abcd_ef",
       "abcd",
       "e","f")

pats <- unique(sub("^(.+)[. _]+.*", "\\1", z))
## gives:
> pats
[1] "abc"  "abcd" "e"    "f"


This gives you the four separate patterns that you could then use to group
your records, perhaps by:
> lapply(pats, function(x) grep(paste0("^", x, "([_. ]|$)"), z))
[[1]]
[1] 1 2 3 4

[[2]]
[1] 5 6

[[3]]
[1] 7

[[4]]
[1] 8

That is, indices 1-4 in z are the first group; 5 and 6 are the second; etc.
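One caveat when a string holds more than one separator, as in the records from the original question: the greedy ".+" keeps everything up to the last separator, not the first. The sketch below contrasts it with a negated character class that stops at the first separator. It assumes the key is the first separator-delimited token, which is a guess at the intent; it does not reproduce the "Abc_1234" entry in the requested output.

```r
recs <- c("Abc_1234_kjhksh_276", "Abc", "Abc_1234_lakdofyo_324",
          "Bce_876_skdhk_*&^%*&", "Bce", "Bce_454")

# Greedy ".+" backtracks only as far as the LAST separator
greedy <- sub("^(.+)[. _]+.*", "\\1", recs)

# "[^. _]+" cannot cross a separator, so it stops at the FIRST one
first_token <- sub("^([^. _]+).*", "\\1", recs)

print(unique(greedy))
print(unique(first_token))
```

On these six records the greedy version leaves several distinct keys such as "Abc_1234_kjhksh", while the first-token version reduces them to "Abc" and "Bce".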



Cheers,
Bert


On Fri, May 4, 2018 at 9:00 PM, Jeff Reichman <reichmanj at sbcglobal.net>
wrote:
> Bert
>
> Thank you for the link.  Figured there might be something.
>
> Regarding your questions:
>
> This is from a large data set of 53 billion records.  The column in
> question is AdNames (Real Time Bidding data).
>
> #1. Generally yes, but not always.
>
> #2. Separators could be underscores (_) or dots (.), as in 1.2.3_ABC ......
>
> #3. Yes. So "Abc 123" could be a matching string.
>
> This would not be considered a match ...
> abc_something
> this.is_a long stringwithabcinthemiddle
>
> The sequence(s) are always at the beginning (or so it appears).  Out
> of the 54 billion records I am able to pull (SparkR sql) 948,679 unique
> strings.  It is from these unique strings that I (if possible) want to
> identify the "key" strings.
>
> 1.  Abc_1232.niok7j9hd
> 2.  Abc
> 3.  Abc.2#348hfk2.njilo
> 4.  Abc.2
> 5.  Abc.7
> 6.  BAdfr_kajdhf98#kjsdh
> 7.  BAdrf_gofer
> 948679 ....
>
>
> So I may have a thousand individual strings, all of which have Abc as a
> common string, or Badrf.  So I am looking to pull "Abc," "BAdrf", etc., so
> that I can then go back and restructure the data to show that any record
> with Abc_1232.niok7j9hd is part of the Abc "Group," or Family.
>
> Does that help?
>
> Jeff
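The "group or family" idea described above can be sketched by attaching a key column to the unique strings. This is a minimal base R illustration that assumes the key is the leading run of characters before the first "_", "." or space; the column names ad_name and family are made up for the example.

```r
ad_names <- c("Abc_1232.niok7j9hd", "Abc", "Abc.2#348hfk2.njilo", "Abc.2",
              "Abc.7", "BAdfr_kajdhf98#kjsdh", "BAdrf_gofer")

# Key = leading run of characters up to the first "_", "." or space
family <- sub("^([^. _]+).*", "\\1", ad_names)

tagged <- data.frame(ad_name = ad_names, family = family,
                     stringsAsFactors = FALSE)
print(tagged)

# Number of unique strings per family
print(table(tagged$family))
```

Note that "BAdfr" and "BAdrf" are spelled differently in the sample, so an exact-match key like this one treats them as two separate families.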

Bert Gunter

2018-May-05 20:59 UTC


[R] Discovering patterns in textual strings

Jeff:

The previous solution I sent you was hugely inefficient and frankly kind of
stupid. Here is a much better and simpler solution.
> z <- c("abc",
       "abc_def",
       "abc.def",
       "abc def",
       "abcd_ef",
       "abcd",
       "e","f")

## Create vector of patterns of same length as z, many of which are repeated
> pats <- sub("^(.+)[. _].*","\\1",z)
## Now can use tapply() to get indices if desired
## Note that the patterns label the groups
> tapply(seq_along(z),pats,I)
$abc
[1] 1 2 3 4

$abcd
[1] 5 6

$e
[1] 7

$f
[1] 8
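If the member strings are wanted rather than their indices, split() on the same pattern vector is one possible variation. The sketch below reuses z and pats exactly as defined above and adds lengths() for the group sizes.

```r
z <- c("abc", "abc_def", "abc.def", "abc def", "abcd_ef", "abcd", "e", "f")
pats <- sub("^(.+)[. _].*", "\\1", z)

# split() returns the member strings of each group, named by pattern
groups <- split(z, pats)
print(groups)

# lengths() gives the size of each group
print(lengths(groups))
```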

No need to reply.

Cheers,
Bert




Jeff Reichman

2018-May-07 21:02 UTC


[R] Discovering patterns in textual strings

Bert

Here are some examples of the type of text strings I'm dealing with:

??????.??.???

??????.??.??????????

?Torrent? Pro - Torrent App

?Torrent?-Torrent Downloader

1 Pic 8 Words - Syllables

1 Pic 8 Words - Syllables

27043_Spanish songs for children

28.android.com.alpha.horoscope

28.android.com.bravo.horoscope

28.Card Game - Offline

28.card Game Multiplayer

37045_Spanish songs for children

7 Minute Workout for Weight Loss: Daily Cardio App

7 Minute Workout Plus

7 Minute Workout_SMA_IA_$2.25_com.popularapp.sevenmins_CD_Android_MEDIUMRECTANGLE_300x250_IAB7

7 Nights at Pizza House - 2

7 Nights at Pizza House 3D

com.zombodroid

com.zombodroid.battle

com.zombodroid.memegenerator

com.zone.talking.pet

com.zone.yinshidaquan

Disney Kingdom

Disney Kingdom_Android

Evite

Evite Invitations

Evite IOS_Evite_IOS_320x50

Excavator Simulator 3D:Sand

Excavator Snow Plow Loader Truck

Flippy Knife

Flippy Knife - 654567

fliptech.iowafmworld

fliptech.serbiafmworld

Floor is lava!

Floor is lava: Escape

Go_Launcher

Go_Launcher_Lite

myyearbook Android

myyearbook.com-MeetMe_Android_300x250_UK

I am hoping to obtain something like:

??????.??

Torrent

1 Pic 8 Words

7 Minute Workout

7 Nights at Pizza House

com.zombodroid

com.zone

Disney Kingdom

Flippy Knife

fliptech

Floor is lava

Go_Launcher

myyearbook 
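The keys in this list are not simply the first token ("7 Minute Workout", "com.zombodroid"), so a fixed separator regex will not recover them. One possible heuristic, which is not something proposed in this thread, is to sort the strings and take, for each one, the longest prefix it shares with a neighbour, cut back to a token boundary. The sketch below tries this on an ASCII subset of the strings above; common_key is a hypothetical helper, and treating every non-alphanumeric character as a separator is an assumption.

```r
# Longest prefix shared by a and b that ends on a token boundary in both
common_key <- function(a, b) {
  ca <- strsplit(a, "")[[1]]
  cb <- strsplit(b, "")[[1]]
  m <- min(length(ca), length(cb))
  if (m == 0) return("")
  same <- ca[seq_len(m)] == cb[seq_len(m)]
  n <- if (all(same)) m else which(!same)[1] - 1  # raw common prefix length
  is_sep <- function(ch) !grepl("[[:alnum:]]", ch)
  at_boundary <- function(chars, n) n == length(chars) || is_sep(chars[n + 1])
  # shrink until the prefix ends where a token ends in both strings
  while (n > 0 && !(at_boundary(ca, n) && at_boundary(cb, n))) n <- n - 1
  # drop separators left dangling at the end
  sub("[^[:alnum:]]+$", "", substr(a, 1, n))
}

ad_names <- c("com.zombodroid", "com.zombodroid.battle",
              "com.zombodroid.memegenerator", "com.zone.talking.pet",
              "com.zone.yinshidaquan", "Evite", "Evite Invitations",
              "Evite IOS_Evite_IOS_320x50", "Flippy Knife",
              "Flippy Knife - 654567", "Floor is lava!",
              "Floor is lava: Escape", "Go_Launcher", "Go_Launcher_Lite")

# Byte-wise sort so that the order does not depend on the locale
s <- sort(unique(ad_names), method = "radix")
k <- length(s)
with_prev <- c("", mapply(common_key, s[-1], s[-k], USE.NAMES = FALSE))
with_next <- c(mapply(common_key, s[-k], s[-1], USE.NAMES = FALSE), "")

# For each string keep the longer of the two neighbour prefixes
key <- ifelse(nchar(with_prev) >= nchar(with_next), with_prev, with_next)
print(sort(unique(key), method = "radix"))
```

On this subset the heuristic yields "Evite", "Flippy Knife", "Floor is lava", "Go_Launcher", "com.zombodroid" and "com.zone". A string with no similar neighbour gets an empty key, and near-duplicates that differ only late in the string can produce overly long keys, so the output would need manual review on real data.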

