Bill Dunlap
2006-Apr-04 15:54 UTC
[Rd] extending strsplit(): supply pattern to keep, not to split by
strsplit() is a convenient way to get a list of items from a string when you have a regular expression for what is not an item. E.g.,

   > strsplit("1.2, 34, 1.7e-2", split="[ ,] *")
   [[1]]:
   [1] "1.2"    "34"     "1.7e-2"

However, sometimes it is more convenient to give a pattern for the items you do want. E.g., suppose you want to pull all the numbers out of a string which contains a mix of numbers and words. Writing a pattern for what a number is turns out to be simpler than writing a pattern for what may come between the numbers.

   > number.pattern <- "[-+]?(([0-9]+(\\.[0-9]*)?)|(\\.[0-9]+))([eE][+-]?[0-9]+)?"

I propose adding a keep=FALSE argument to strsplit() to do this. If keep is FALSE, then the split argument matches the stuff to omit from the output; if keep is TRUE, then split matches the stuff to put into the output. Then we could do the following to get a list of all the numbers in a string (done in a version of strsplit() I'm working on for S-PLUS):

   > strsplit("1.2, 34, 1.7e-2", split=number.pattern, keep=TRUE)
   [[1]]:
   [1] "1.2"    "34"     "1.7e-2"

   > strsplit("Ibuprofin 200mg", split=number.pattern, keep=TRUE)
   [[1]]:
   [1] "200"

Is this a reasonable thing to want strsplit to do? Is this a reasonable parameterization of it?

----------------------------------------------------------------------------
Bill Dunlap
Insightful Corporation
bill at insightful dot com
360-428-8146

"All statements in this message represent the opinions of the author and do not necessarily reflect Insightful Corporation policy or position."
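For readers who want to try the keep=TRUE behaviour without a modified strsplit(), here is a minimal sketch using base R's gregexpr() and regmatches(). The helper name strkeep is hypothetical and is not part of base R or of the S-PLUS version described above; it only illustrates the idea of returning the pieces that match the pattern instead of the pieces between matches. It assumes an R version that provides regmatches().

```r
# Hypothetical helper (an assumption for illustration, not the S-PLUS code):
# return, for each element of x, the substrings that match `pattern`.
strkeep <- function(x, pattern, ...) {
  # gregexpr() locates every match; regmatches() extracts the matched text.
  regmatches(x, gregexpr(pattern, x, ...))
}

# The number pattern from the message above.
number.pattern <- "[-+]?(([0-9]+(\\.[0-9]*)?)|(\\.[0-9]+))([eE][+-]?[0-9]+)?"

strkeep("1.2, 34, 1.7e-2", number.pattern)
# [[1]]
# [1] "1.2"    "34"     "1.7e-2"

strkeep("Ibuprofin 200mg", number.pattern)
# [[1]]
# [1] "200"
```

Like strsplit(), the sketch returns a list with one character vector per input string; an input with no match yields an empty character vector.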
Gabor Grothendieck
2006-Apr-04 16:01 UTC
[Rd] extending strsplit(): supply pattern to keep, not to split by
gsubfn in package gsubfn can do this. See the examples in ?gsubfn