Cyclic Group Z_1
2019-Aug-15 05:56 UTC
[Rd] Feature request: non-dropping regmatches/strextract
A very common use case for regmatches is to extract regex matches into a new column in a data.frame (or data.table, etc.) or otherwise use the extracted strings alongside the input. However, the default behavior is to drop empty matches, which results in mismatches in column length if reassignment is done without subsetting.

For consistency with other R functions and compatibility with this use case, it would be nice if regmatches did not automatically drop empty matches and would instead insert an NA_character_ value (similar to stringr::str_extract). This alternative regmatches could be implemented through an optional drop argument, a new function, or mentioned in the documentation (a la resample in ?sample).

Alternatively, at the moment, there is a non-exported function strextract in utils which is very similar to stringr::str_extract. It would be great if this function, once exported, were to include a drop argument to prevent dropping positions with no matches.

An example solution (last option):

strextract <- function(pattern, x, perl = FALSE, useBytes = FALSE, drop = T) {
  m <- regexec(pattern, x, perl=perl, useBytes=useBytes)
  result <- regmatches(x, m)

  if(isTRUE(drop)){
    unlist(result)
  } else if(isFALSE(drop)) {
    unlist({result[lengths(result)==0] <- NA_character_; result})
  } else {
    stop("Invalid argument for `drop`")
  }
}

Based on Ricardo Saporta's response to "How to prevent regmatches drop non matches?"

--CG
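The length mismatch described in this message can be reproduced with a small sketch. The data frame `df`, its columns `text` and `num`, and the pattern are made up for illustration; only base R functions are used.

```r
# Illustrative data; the column names are assumptions, not from the thread.
df <- data.frame(text = c("apple 12", "banana", "cherry 7"),
                 stringsAsFactors = FALSE)

m <- regexpr("[0-9]+", df$text)   # -1 marks elements with no match
hits <- regmatches(df$text, m)    # non-matching elements are dropped
length(hits)                      # 2, although df has 3 rows

# Assigning the shorter vector as a column fails with a length error.
res <- tryCatch({ df$num <- hits; "ok" }, error = function(e) "error")

# Workaround with subsetting: fill with NA first, then assign only
# into the rows that actually matched.
df$num <- NA_character_
df$num[m != -1L] <- hits
```

The subsetting step is what the request aims to make unnecessary: the caller has to keep the match positions `m` around just to realign the result with the input.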
William Dunlap
2019-Aug-15 15:08 UTC
[Rd] Feature request: non-dropping regmatches/strextract
Changing the default behavior of regmatches would break its use with gregexpr, where the number of matches per input element varies, so a zero-length character vector makes more sense than NA_character_.

> x <- c("John Doe", "e e cummings", "Juan de la Madrid")
> m <- gregexpr("[A-Z]", x)
> regmatches(x,m)
[[1]]
[1] "J" "D"

[[2]]
character(0)

[[3]]
[1] "J" "M"

> vapply(.Last.value, function(x)paste(paste0(x, "."),collapse=""), "")
[1] "J.D." "."    "J.M."

(We don't want e e cummings' initials mapped to "NA.")

Bill Dunlap
TIBCO Software
wdunlap tibco.com
Cyclic Group Z_1
2019-Aug-15 18:31 UTC
[Rd] Feature request: non-dropping regmatches/strextract
I do think keeping the default behavior is desirable for backwards compatibility; my suggestion is not to change default behavior but to add an optional argument that allows a different behavior. Although this can be implemented in a user-defined function, retaining empty matches facilitates programmatic use, and seems to be something that should be available in base R. It is available, for example, in MATLAB, a comparable array language.

Alternatively, perhaps a nomatch (or maybe emptymatch) argument in the spirit of `[.data.table`? That is, an argument nomatch where nomatch = NULL (the default) results in drops for vector outputs and character(0) for list outputs and nomatch = NA results in insertion of NA_character_, and nomatch = '' results in insertion of empty string. I can submit proposed patch code if others think this is a good idea.

What are your thoughts on the proposed alteration to (currently nonexported) strextract? I assume (maybe wrongly) that the plan is to eventually export that function.

Thank you,
CG
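The nomatch semantics proposed in this message could look roughly like the sketch below for the vector (regexpr) case. `extract_first` is a hypothetical user-level helper, not a base R function and not the proposed patch; the name and signature are assumptions for illustration.

```r
# Hypothetical helper illustrating the proposed `nomatch` argument.
# nomatch = NULL keeps the current dropping behaviour of regmatches();
# any other length-one value is inserted where there is no match.
extract_first <- function(pattern, x, nomatch = NULL, ...) {
  m <- regexpr(pattern, x, ...)
  hits <- regmatches(x, m)              # matched substrings only
  if (is.null(nomatch)) {
    return(hits)                        # default: drop non-matches
  }
  stopifnot(length(nomatch) == 1L)
  found <- !is.na(m) & m != -1L         # positions that produced a match
  out <- rep(as.character(nomatch), length(x))
  out[found] <- hits
  out
}

x <- c("a1", "b", "c22")
extract_first("[0-9]+", x)                 # "1" "22"
extract_first("[0-9]+", x, nomatch = NA)   # "1" NA  "22"
extract_first("[0-9]+", x, nomatch = "")   # "1" ""  "22"
```

Because the default is NULL, existing calls would behave as before, which is the backwards-compatibility point made above.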
Toby Hocking
2019-Aug-29 21:00 UTC
[Rd] Feature request: non-dropping regmatches/strextract
If you want "to extract regex matches into a new column in a data.frame" then there are some package functions which do exactly that. Three examples are namedCapture::df_match_variable, rematch2::bind_re_match, and tidyr::extract. For a more detailed discussion see my R Journal submission (under review) about regular expression packages, https://raw.githubusercontent.com/tdhock/namedCapture-article/master/RJwrapper.pdf

Comments/suggestions welcome.
Cyclic Group Z_1
2019-Aug-29 21:19 UTC
[Rd] Feature request: non-dropping regmatches/strextract
Thank you, I am aware that there are packages that can accomplish this. I mentioned stringr::str_extract as a function that does not drop empty matches. I think that the behavior of regmatches(..., regexpr(...)) in base R should permit an option to prevent dropping of empty matches, both for the sake of consistency with the rest of the language (missing data does not yield a dropped index in other sorts of R functions, and an empty match conceptually corresponds with missing data) and for ease of use in data.frames. The behavior of regmatches(..., gregexpr(...)) is not objectionable to me, as lists do not drop indices when they contain character(0) vectors. Alternatively, perhaps this should be reflected in the (currently non-exported) strextract.

Best,
CG
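The contrast drawn here between the list and vector forms can be seen in a short base R sketch. The input vector is made up for illustration, and reducing the list to its first match with vapply is one possible approach, not something proposed in the thread.

```r
x <- c("apple 12", "banana", "cherry 7 8")

# List output from gregexpr keeps one element per input,
# using character(0) where nothing matched.
all_matches <- regmatches(x, gregexpr("[0-9]+", x))
length(all_matches) == length(x)   # TRUE

# Reducing the list to the first match per element, with NA for
# empty elements, gives a vector aligned with the input.
first <- vapply(all_matches,
                function(v) if (length(v)) v[1L] else NA_character_,
                character(1))
first                              # "12" NA "7"
```

This keeps indices aligned at the cost of computing all matches, which is part of why a non-dropping option for the regexpr form is being requested.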