thr3ads.net - R help - [R] regex - optional part isn't considered in replacement with gsub [Aug 2017]

Omar André Gonzáles Díaz

2017-Aug-27 16:18 UTC

[R] regex - optional part isn't considered in replacement with gsub

Hello, I need some help with regex.

I have these two sentences. I need to extract both "49MU6300" and "LE32S5970"
and put them in a new column, "SKU".

A) SMART TV UHD 49'' CURVO 49MU6300
B) SMART TV HD 32'' LE32S5970

DataFrame for testing:

ecommerce <- data.frame(a = c(1,2),
                        producto = c("SMART TV UHD 49'' CURVO 49MU6300",
                                     "SMART TV HD 32'' LE32S5970"))


I'm using gsub like this:

1.- This would capture A as intended but only "32S5970" from B (missing "LE").

ecommerce$sku <- gsub("(.*)([0-9]{2}[a-zA-Z]{1,2}[0-9]{2,4})(.*)", "\\2",
                      ecommerce$producto)


2.- This would capture "LE32S5970" but not "49MU6300".

ecommerce$sku <-
  gsub("(.*)([a-zA-Z]{2}[0-9]{2}[a-zA-Z]{1,2}[0-9]{2,4})(.*)", "\\2",
       ecommerce$producto)


3.- If I make the first 2 letters optional with:

ecommerce$sku <-
  gsub("(.*)([a-zA-Z]?{2}[0-9]{2}[a-zA-Z]{1,2}[0-9]{2,4})(.*)", "\\2",
       ecommerce$producto)


"49MU6300" is captured, but again only "32S5970" from B (missing "LE").


What should I do? How would you approach it?

	[[alternative HTML version deleted]]

Jeff Newmiller

2017-Aug-27 16:54 UTC

Clearly you are being too specific about the structure of the sku. In the
absence of better information about the sku you need to focus on identifying the
delimiters and position of the sku... one way might be:

ecommerce$sku <- sub("^(.*)[ \n]+([^ \n]+)$", "\\2", ecommerce$producto)
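
A minimal sketch of this suggestion, assuming each description sits on a single line and the SKU is always the last blank-delimited token. The two example rows satisfy that assumption; real data may not. The stringsAsFactors = FALSE argument is an addition here so that producto is a character column in R versions before 4.0.

```r
# Reproducible data, with each product description on one line.
ecommerce <- data.frame(
  a = c(1, 2),
  producto = c("SMART TV UHD 49'' CURVO 49MU6300",
               "SMART TV HD 32'' LE32S5970"),
  stringsAsFactors = FALSE
)

# Keep only the last run of non-blank characters in each description.
ecommerce$sku <- sub("^(.*)[ \n]+([^ \n]+)$", "\\2", ecommerce$producto)

print(ecommerce$sku)
# [1] "49MU6300"  "LE32S5970"
```

A row without a SKU would return its last word instead, so this approach needs a separate check before any row can be labelled "no SKU".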

Please learn to post using plain text format, as HTML mail gets corrupted on
this mailing list. The option exists in your email client (including the GMail
Web interface if that is what you use).

Bert Gunter

2017-Aug-27 17:10 UTC

You may have to provide us more detail on **exactly** the sorts of
patterns you wish to "capture" -- including exactly what you mean by
"capture" (what value do you wish to return?) -- as the "obvious"
answer is probably not sufficient:

## using your example -- thank you
> gsub(".*(49MU6300|LE32S5970).*","\\1",ecommerce[[2]])
[1] "49MU6300"  "LE32S5970"


Cheers,
Bert
Bert Gunter

"The trouble with having an open mind is that people keep coming along
and sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )



Bert Gunter

2017-Aug-27 23:01 UTC

Omar:

I don't think this can work. For example number-letter patterns 4),
5), and 6) would all be matched by pattern 6).

As Jeff indicated, you need to provide the delimiters -- what
characters come before and after the SKU patterns -- to be able to
recognize them. In a quick look at the text file you attached, the
delimiters appeared to be either "-" or " " (blank) and perhaps
<end of character string>. If that is correct or if you can tell us how to
make it correct, then it's straightforward to proceed. Otherwise, I am
unable to help. Maybe someone else can.
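
The overlap can be checked directly. This is only a sketch: it assumes pattern 6 ("1 letter, 3 numbers") is written as [A-Za-z][0-9]{3}, which is one possible translation and not code from the thread.

```r
# Example SKUs for patterns 4), 5) and 6) from the list.
skus <- c("LH5000", "B8500", "E310")

# Pattern 6 (1 letter, 3 digits) with no delimiters around it.
p6 <- "[A-Za-z][0-9]{3}"
unanchored <- grepl(p6, skus)

# The same pattern, required to span the whole token.
anchored <- grepl(paste0("^", p6, "$"), skus)

print(unanchored)  # [1] TRUE TRUE TRUE    ("H500" and "B850" also fit)
print(anchored)    # [1] FALSE FALSE  TRUE
```

Once the pattern has to cover everything between two delimiters, only "E310" fits it, which is why the delimiters matter more than the internal shape.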

Cheers,
Bert






On Sun, Aug 27, 2017 at 11:47 AM, Omar André Gonzáles Díaz
<oma.gonzales at gmail.com> wrote:
> Hi Jeff, Bert, thank you for your input.
>
> I'm attaching a sample of the data, feel free to explore it.
>
> As I said, I need to extract the SKUs of the products (a key that
> identifies every product). Not every producto (row) has a SKU; in this
> case "no SKU" should be the output.
>
> I've identified these patterns so far:
>
> 1.- 75Q8C: 2 numbers, 1 letter, 1 number, 1 letter.
> 2.- OLED65E7P: 4 letters, 2 numbers, 1 letter, 1 number, 1 letter.
> 3.- MT48AF: 2 letters, 2 numbers, 2 letters.
> 4.- LH5000: 2 letters, 4 numbers.
> 5.- B8500: 1 letter, 4 numbers.
> 6.- E310: 1 letter, 3 numbers.
> 7.- X541UJ: 1 letter, 3 numbers, 2 letters.
>
> I think those cover the majority of SKUs, so I would appreciate
> guidance on how to extract all those different patterns.
>
> Related, but not the question asked: the idea is that, after extracting
> the SKUs, there should be SKUs repeated across the different e-commerce
> sites. Those SKUs would permit us to compare the products and their prices.
>
> Thank you in advance.
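
One way the seven shapes listed above could be combined with the blank delimiter is sketched below. The names sku_shapes and extract_sku and the sample descriptions are invented for illustration; the attached data file is not part of this archive. Each shape is tested against whole blank-delimited tokens, and rows with no matching token get "no SKU".

```r
# One regular expression per listed shape (letters and digits only).
sku_shapes <- c(
  "[0-9]{2}[A-Za-z][0-9][A-Za-z]",             # 1) 75Q8C
  "[A-Za-z]{4}[0-9]{2}[A-Za-z][0-9][A-Za-z]",  # 2) OLED65E7P
  "[A-Za-z]{2}[0-9]{2}[A-Za-z]{2}",            # 3) MT48AF
  "[A-Za-z]{2}[0-9]{4}",                       # 4) LH5000
  "[A-Za-z][0-9]{4}",                          # 5) B8500
  "[A-Za-z][0-9]{3}",                          # 6) E310
  "[A-Za-z][0-9]{3}[A-Za-z]{2}"                # 7) X541UJ
)

# A token is a SKU only if one shape matches it from start to end.
sku_regex <- paste0("^(", paste(sku_shapes, collapse = "|"), ")$")

# Hypothetical helper: first matching token, or "no SKU".
extract_sku <- function(description) {
  tokens <- strsplit(description, " +")[[1]]
  hits <- tokens[grepl(sku_regex, tokens)]
  if (length(hits) == 0) "no SKU" else hits[1]
}

# Invented descriptions, for illustration only.
productos <- c("TV QLED 75Q8C", "TV OLED65E7P 65''",
               "LAPTOP X541UJ 15''", "FUNDA PARA CELULAR")
skus <- vapply(productos, extract_sku, character(1), USE.NAMES = FALSE)

print(skus)
# [1] "75Q8C"     "OLED65E7P" "X541UJ"    "no SKU"
```

The two SKUs from the original question, "49MU6300" and "LE32S5970", fit none of the seven shapes, so a complete list would need further entries.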

Bert Gunter

2017-Aug-28 05:15 UTC

> Please, consider that some SKUs have "-"
> in the middle, for example: "PG-9021".

Then you need to include these in the list of patterns you gave. Try it
again -- this time with a **complete** list.
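
As a small sketch of what such an extra entry could look like: if blank is the only delimiter, a hyphenated SKU stays in one token and only needs its own shape. The description and the shape "2 letters, hyphen, 4 digits" are assumptions based on the single example "PG-9021".

```r
# Invented description containing the hyphenated example SKU.
description <- "PARLANTE PORTATIL PG-9021 NEGRO"

# Split on blanks only, so the hyphen stays inside its token.
tokens <- strsplit(description, " +")[[1]]

# Assumed shape for SKUs like "PG-9021".
hyphen_shape <- "^[A-Za-z]{2}-[0-9]{4}$"
sku <- tokens[grepl(hyphen_shape, tokens)]

print(sku)
# [1] "PG-9021"
```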

-- Bert



On Sun, Aug 27, 2017 at 10:01 PM, Omar André Gonzáles Díaz
<oma.gonzales at gmail.com> wrote:
> Hi Bert,
>
> I would say that the delimiter is "blank"; every other row with "-" as
> delimiter should be ignored. Please consider that some SKUs have "-"
> in the middle, for example: "PG-9021".
>
> As for the <end of character string>, it's now corrected. There
> shouldn't be any case of this (if there are, just ignore them).
>
> I've tried to apply different gsub operations to capture different
> cases, for example:
>
> ecommerce$sku <-
>   gsub("(.*)([a-zA-Z]?{2}[0-9]{2}[a-zA-Z]{1,2}[0-9]{2,4})(.*)", "\\2",
>        ecommerce$producto)
>
> ecommerce$sku <- gsub("(.*)([0-9]{2}[a-zA-Z]{1,2}[0-9]{2,4})(.*)",
>                       "\\2", ecommerce$sku)
>
> ecommerce$sku <-
>   gsub("(.*)([0-9]{2}[a-zA-Z]{1,2}[0-9]{1}[a-zA-Z]{1})(.*)", "\\2",
>        ecommerce$sku)
>
> ecommerce$sku <- gsub("(.*)([a-zA-Z]{2}[0-9]{2}[a-zA-Z]{2})(.*)",
>                       "\\2", ecommerce$sku)
>
> ecommerce$sku <- gsub("(.*)([a-zA-Z]{2}[0-9]{3,4})(.*)", "\\2",
>                       ecommerce$sku)
>
> I don't know if that is the best approach, but I couldn't capture the
> case in the initial question. And as I've said, the important thing is
> to capture as many SKUs as possible.
>
> Thank you for your time, Sir.

Jeff Newmiller

2017-Aug-28 05:37 UTC

Omar, please remember that this is R-help,  not R-do-my-work-for-me... you have
already been given several hints as to how you can refine your patterns
yourself. These skills are key to real world data science, so you need to work
at being able to take hints and expand on them if you are to be successful in
these kinds of tasks. Also, if you cannot learn to make reproducible examples
([1][2][3]) to illustrate your problems then we have about reached the limit of
our ability to help you.

[1] http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example

[2] http://adv-r.had.co.nz/Reproducibility.html

[3] https://cran.r-project.org/web/packages/reprex/index.html (read the vignette)

Stefan Evert

2017-Aug-29 16:54 UTC

> On 27 Aug 2017, at 18:18, Omar André Gonzáles Díaz <oma.gonzales at gmail.com> wrote:
>
> 3.- If I make the first 2 letters optional with:
>
> ecommerce$sku <-
>   gsub("(.*)([a-zA-Z]?{2}[0-9]{2}[a-zA-Z]{1,2}[0-9]{2,4})(.*)", "\\2",
>        ecommerce$producto)
>
> "49MU6300" is captured, but again only "32S5970" from B (missing "LE").

Regular expressions are matched greedily from left to right, i.e. the first (.*)
will consume as many characters as possible (including the first two letters
because they're optional in the following subexpression).

If you make the first group non-greedy (.*?), this works for me:

	ecommerce$sku <-
	  gsub("(.*?)([a-zA-Z]{0,2}[0-9]{2}[a-zA-Z]{1,2}[0-9]{2,4})(.*)",
	       "\\2", ecommerce$producto)
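
The greedy behaviour can be made visible by printing what the first group captured. This is a sketch under two assumptions: [a-zA-Z]{0,2} stands in for the original [a-zA-Z]?{2} as the conventional way to write "up to two letters", and perl = TRUE is used so the matching follows the documented PCRE backtracking rules.

```r
producto <- "SMART TV HD 32'' LE32S5970"

# Greedy first group followed by an SKU with an optional letter prefix.
greedy <- "(.*)([a-zA-Z]{0,2}[0-9]{2}[a-zA-Z]{1,2}[0-9]{2,4})(.*)"

# Group 1 is what (.*) kept; group 2 is what was left for the SKU.
prefix <- sub(greedy, "\\1", producto, perl = TRUE)
sku    <- sub(greedy, "\\2", producto, perl = TRUE)

print(prefix)  # [1] "SMART TV HD 32'' LE"
print(sku)     # [1] "32S5970"
```

The "LE" ends up at the tail of group 1: since zero letters is an acceptable match for the optional prefix, (.*) never has to give those two characters back.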

But as others have pointed out, you might want to explore more robust approaches
(take a look at \\b to match a word boundary, for instance).
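
Both suggestions can be tried side by side on the two example rows. This is a sketch, not a general solution: the variable names are made up, perl = TRUE is an added assumption to get PCRE semantics, and the word-boundary variant uses regexpr() with regmatches() to pull out the match instead of replacing around it.

```r
productos <- c("SMART TV UHD 49'' CURVO 49MU6300",
               "SMART TV HD 32'' LE32S5970")
core <- "[a-zA-Z]{0,2}[0-9]{2}[a-zA-Z]{1,2}[0-9]{2,4}"

# Non-greedy prefix: the optional letters get the first chance to match.
lazy_skus <- gsub(paste0("(.*?)(", core, ")(.*)"), "\\2",
                  productos, perl = TRUE)

# Word boundaries: the SKU must start and end at the edge of a word.
bounded <- paste0("\\b", core, "\\b")
bounded_skus <- regmatches(productos,
                           regexpr(bounded, productos, perl = TRUE))

print(lazy_skus)     # [1] "49MU6300"  "LE32S5970"
print(bounded_skus)  # [1] "49MU6300"  "LE32S5970"
```

Note that regmatches() drops elements with no match, so on data where some rows lack a SKU its result is shorter than the input and cannot be assigned to a column directly.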

Best,
Stefan
