Omar André Gonzáles Díaz
2017-Aug-27 16:18 UTC
[R] regex - optional part isn't considered in replacement with gsub
Hello, I need some help with regex.

I have these two sentences. I need to extract both "49MU6300" and "LE32S5970" and put them in a new column, "SKU".

A) SMART TV UHD 49'' CURVO 49MU6300
B) SMART TV HD 32'' LE32S5970

Data frame for testing:

ecommerce <- data.frame(a = c(1,2), producto = c("SMART TV UHD 49'' CURVO 49MU6300", "SMART TV HD 32'' LE32S5970"))

I'm using gsub like this:

1.- This captures A as intended, but only "32S5970" from B (missing "LE").

ecommerce$sku <- gsub("(.*)([0-9]{2}[a-zA-Z]{1,2}[0-9]{2,4})(.*)", "\\2", ecommerce$producto)

2.- This captures "LE32S5970" but not "49MU6300".

ecommerce$sku <- gsub("(.*)([a-zA-Z]{2}[0-9]{2}[a-zA-Z]{1,2}[0-9]{2,4})(.*)", "\\2", ecommerce$producto)

3.- If I make the first two letters optional with:

ecommerce$sku <- gsub("(.*)([a-zA-Z]?{2}[0-9]{2}[a-zA-Z]{1,2}[0-9]{2,4})(.*)", "\\2", ecommerce$producto)

"49MU6300" is captured, but again only "32S5970" from B (missing "LE").

What should I do? How would you approach it?
Jeff Newmiller
2017-Aug-27 16:54 UTC
[R] regex - optional part isn't considered in replacement with gsub
Clearly you are being too specific about the structure of the sku. In the absence of better information about the sku you need to focus on identifying the delimiters and position of the sku... one way might be:

ecommerce$sku <- sub( "^(.*)[ \n]+([^ \n]+)$", "\\2", ecommerce$producto )

Please learn to post using plain text format, as HTML corrupts the latter on this mailing list. The option exists in your email client (including the GMail Web interface if that is what you use).

--
Sent from my phone. Please excuse my brevity.
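The delimiter-based suggestion above can be tried directly on the test data from the original post. This is only a minimal sketch, assuming the SKU is always the last blank-delimited token of the description (true for the two sample rows, not established for the full data set). `stringsAsFactors = FALSE` is added here so that `producto` is a character column on older R versions.

```r
# Test data from the original post
ecommerce <- data.frame(
  a = c(1, 2),
  producto = c("SMART TV UHD 49'' CURVO 49MU6300",
               "SMART TV HD 32'' LE32S5970"),
  stringsAsFactors = FALSE
)

# Greedy (.*) swallows everything up to the last run of blanks/newlines;
# group 2 is the final token, which replaces the whole string.
ecommerce$sku <- sub("^(.*)[ \n]+([^ \n]+)$", "\\2", ecommerce$producto)

print(ecommerce$sku)
```

A description that contains no blank at all does not match the pattern, so sub() returns it unchanged; such rows would need separate handling.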
Bert Gunter
2017-Aug-27 17:10 UTC
[R] regex - optional part isn't considered in replacement with gsub
You may have to provide us more detail on **exactly** the sorts of patterns you wish to "capture" -- including exactly what you mean by "capture" (what value do you wish to return?) -- as the "obvious" answer is probably not sufficient:

## using your example -- thank you

> gsub(".*(49MU6300|LE32S5970).*","\\1",ecommerce[[2]])
[1] "49MU6300" "LE32S5970"

Cheers,
Bert

Bert Gunter

"The trouble with having an open mind is that people keep coming along and sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
Bert Gunter
2017-Aug-27 23:01 UTC
[R] regex - optional part isn't considered in replacement with gsub
Omar:

I don't think this can work. For example, number-letter patterns 4), 5), and 6) would all be matched by pattern 6).

As Jeff indicated, you need to provide the delimiters -- what characters come before and after the SKU patterns -- to be able to recognize them. In a quick look at the text file you attached, the delimiters appeared to be either "-" or " " (blank) and perhaps <end of character string>. If that is correct or if you can tell us how to make it correct, then it's straightforward to proceed. Otherwise, I am unable to help. Maybe someone else can.

Cheers,
Bert

On Sun, Aug 27, 2017 at 11:47 AM, Omar André Gonzáles Díaz <oma.gonzales at gmail.com> wrote:
> Hi Jeff, Bert, thank you for your input.
>
> I'm attaching a sample of the data, feel free to explore it.
>
> As I said, I need to extract the SKUs of the products (a key that
> identifies every product). Not every producto (row) has a SKU; in this
> case "no SKU" should be the output.
>
> I've identified these patterns so far:
>
> 1.- 75Q8C: 2 numbers, 1 letter, 1 number, 1 letter.
> 2.- OLED65E7P: 4 letters, 2 numbers, 1 letter, 1 number, 1 letter.
> 3.- MT48AF: 2 letters, 2 numbers, 2 letters.
> 4.- LH5000: 2 letters, 4 numbers.
> 5.- B8500: 1 letter, 4 numbers.
> 6.- E310: 1 letter, 3 numbers.
> 7.- X541UJ: 1 letter, 3 numbers, 2 letters.
>
> I think those cover the majority of SKUs, so I would appreciate
> guidance on how to extract all those different patterns.
>
> Related, but not the question asked: the idea is that after extracting
> the SKUs, there should be SKUs repeated across the different ecommerce
> sites. Those SKUs would permit us to compare the products and their prices.
>
> Thank you in advance.
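The overlap described above is easy to reproduce. The following is a small sketch using the three example SKUs from the quoted list (patterns 4, 5, and 6); the variable names `skus` and `p6` are made up for illustration. Without delimiters or anchors, the "1 letter, 3 numbers" pattern finds a match inside all three strings; once each candidate is treated as a whole token, only the intended one matches.

```r
# Example SKUs for patterns 4), 5) and 6) from the quoted list
skus <- c("LH5000", "B8500", "E310")

# Pattern 6): 1 letter followed by 3 digits, with no delimiters or anchors
p6 <- "[a-zA-Z][0-9]{3}"

# Unanchored, the pattern matches somewhere inside every string
unanchored_hit <- grepl(p6, skus)

# ... and the piece it picks out is a truncated SKU for the first two
unanchored_piece <- regmatches(skus, regexpr(p6, skus))

# Anchored to the whole token, only the genuine pattern-6 SKU matches
anchored_hit <- grepl(paste0("^", p6, "$"), skus)

print(unanchored_hit)
print(unanchored_piece)
print(anchored_hit)
```

This is why knowing what surrounds a SKU (blank, "-", end of string) matters more than enumerating the internal letter/digit shapes.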
Bert Gunter
2017-Aug-28 05:15 UTC
[R] regex - optional part isn't considered in replacement with gsub
"Please, consider that some SKUs have "-" in the middle, for example: "PG-9021"."

Then you need to include these in the list of patterns you gave. Try it again -- this time with a **complete** list.

-- Bert

On Sun, Aug 27, 2017 at 10:01 PM, Omar André Gonzáles Díaz <oma.gonzales at gmail.com> wrote:
> Hi Bert,
>
> I would say that the delimiter is "blank"; every other row with "-" as
> delimiter should be ignored. Please, consider that some SKUs have "-"
> in the middle, for example: "PG-9021".
>
> As for the <end of character string>, it's now corrected. There
> shouldn't be any case of this (if there are, just ignore them).
>
> I've tried to apply different gsub operations to capture different
> cases, for example:
>
> ecommerce$sku <- gsub("(.*)([a-zA-Z]?{2}[0-9]{2}[a-zA-Z]{1,2}[0-9]{2,4})(.*)", "\\2", ecommerce$producto)
>
> ecommerce$sku <- gsub("(.*)([0-9]{2}[a-zA-Z]{1,2}[0-9]{2,4})(.*)", "\\2", ecommerce$sku)
>
> ecommerce$sku <- gsub("(.*)([0-9]{2}[a-zA-Z]{1,2}[0-9]{1}[a-zA-Z]{1})(.*)", "\\2", ecommerce$sku)
>
> ecommerce$sku <- gsub("(.*)([a-zA-Z]{2}[0-9]{2}[a-zA-Z]{2})(.*)", "\\2", ecommerce$sku)
>
> ecommerce$sku <- gsub("(.*)([a-zA-Z]{2}[0-9]{3,4})(.*)", "\\2", ecommerce$sku)
>
> I don't know if that is the best approach, but I couldn't capture the
> case in the initial question. And as I've said, the important thing is
> to capture as many SKUs as possible.
>
> Thank you for your time, Sir.
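One way to combine the points made so far (blank as the delimiter, a complete pattern list, "no SKU" as the fallback) is sketched below. It is not a tested solution for the attached data, which is not shown in the thread: the pattern list is transcribed from the examples quoted in this thread, the hyphenated shape is a guess based on the single example "PG-9021", the helper name `extract_sku` is made up, and the last two product descriptions are hypothetical.

```r
# Letter/digit shapes taken from the examples quoted in the thread
sku_patterns <- c(
  "[A-Z]{0,2}[0-9]{2}[A-Z]{1,2}[0-9]{2,4}",  # 49MU6300, LE32S5970
  "[0-9]{2}[A-Z][0-9][A-Z]",                 # 75Q8C
  "[A-Z]{4}[0-9]{2}[A-Z][0-9][A-Z]",         # OLED65E7P
  "[A-Z]{2}[0-9]{2}[A-Z]{2}",                # MT48AF
  "[A-Z]{2}[0-9]{4}",                        # LH5000
  "[A-Z][0-9]{4}",                           # B8500
  "[A-Z][0-9]{3}",                           # E310
  "[A-Z][0-9]{3}[A-Z]{2}",                   # X541UJ
  "[A-Z]{2}-[0-9]{4}"                        # PG-9021 (assumed shape)
)

# Anchor the alternation so a pattern must cover a whole token
sku_regex <- paste0("^(", paste(sku_patterns, collapse = "|"), ")$")

# Split each description on blanks and return the first token that
# looks like a SKU, or "no SKU" when none does.
extract_sku <- function(x) {
  tokens <- strsplit(x, " +")
  vapply(tokens, function(tok) {
    hit <- tok[grepl(sku_regex, tok)]
    if (length(hit) > 0) hit[1] else "no SKU"
  }, character(1))
}

productos <- c("SMART TV UHD 49'' CURVO 49MU6300",
               "SMART TV HD 32'' LE32S5970",
               "PARLANTE PG-9021 NEGRO",   # hypothetical row
               "CABLE HDMI 2 METROS")      # hypothetical row without a SKU

skus <- extract_sku(productos)
print(skus)
```

Because every pattern is anchored to a whole token, the overlap between the shorter and longer shapes no longer matters; the price is that any SKU shape missing from the list is reported as "no SKU".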
Jeff Newmiller
2017-Aug-28 05:37 UTC
[R] regex - optional part isn't considered in replacement with gsub
Omar, please remember that this is R-help, not R-do-my-work-for-me... you have already been given several hints as to how you can refine your patterns yourself. These skills are key to real world data science, so you need to work at being able to take hints and expand on them if you are to be successful in these kinds of tasks.

Also, if you cannot learn to make reproducible examples ([1][2][3]) to illustrate your problems then we have about reached the limit of our ability to help you.

[1] http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example
[2] http://adv-r.had.co.nz/Reproducibility.html
[3] https://cran.r-project.org/web/packages/reprex/index.html (read the vignette)
Stefan Evert
2017-Aug-29 16:54 UTC
[R] regex - optional part isn't considered in replacement with gsub
> On 27 Aug 2017, at 18:18, Omar André Gonzáles Díaz <oma.gonzales at gmail.com> wrote:
>
> 3.- If I make the first two letters optional with:
>
> ecommerce$sku <- gsub("(.*)([a-zA-Z]?{2}[0-9]{2}[a-zA-Z]{1,2}[0-9]{2,4})(.*)", "\\2", ecommerce$producto)
>
> "49MU6300" is captured, but again only "32S5970" from B (missing "LE").

Regular expressions are matched greedily from left to right, i.e. the first (.*) will consume as many characters as possible (including the first two letters, because they're optional in the following subexpression).

If you make the first group non-greedy (.*?), this works for me:

ecommerce$sku <- gsub("(.*?)([a-zA-Z]{0,2}[0-9]{2}[a-zA-Z]{1,2}[0-9]{2,4})(.*)", "\\2", ecommerce$producto)

But as others have pointed out, you might want to explore more robust approaches (take a look at \\b to match a word boundary, for instance).

Best,
Stefan
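The difference between the greedy and non-greedy first group can be seen side by side on the two sample strings. This is a small sketch, not the exact call from the message above: the names `sku_body`, `greedy`, `lazy`, and `bounded` are made up, and `perl = TRUE` is an added choice here so that all three calls use the same backtracking (PCRE) engine. The last call illustrates the word-boundary hint by extracting the match with regexpr() and regmatches() instead of replacing around it.

```r
producto <- c("SMART TV UHD 49'' CURVO 49MU6300",
              "SMART TV HD 32'' LE32S5970")

# SKU shape with up to two optional leading letters
sku_body <- "[a-zA-Z]{0,2}[0-9]{2}[a-zA-Z]{1,2}[0-9]{2,4}"

# Greedy prefix: (.*) keeps the optional letters "LE" for itself
greedy <- sub(paste0("(.*)(", sku_body, ")(.*)"), "\\2",
              producto, perl = TRUE)

# Non-greedy prefix: (.*?) stops as early as possible, so "LE" stays in group 2
lazy <- sub(paste0("(.*?)(", sku_body, ")(.*)"), "\\2",
            producto, perl = TRUE)

# Word boundaries: the match must start and end at the edge of a token
bounded <- regmatches(
  producto,
  regexpr(paste0("\\b", sku_body, "\\b"), producto, perl = TRUE)
)

print(greedy)
print(lazy)
print(bounded)
```

Note that regmatches() silently drops elements with no match, so on real data the result can be shorter than the input; the sub() variants instead return the unmatched string unchanged.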