Michael Chirico
2023-Oct-09 23:57 UTC
[Rd] FR: valid_regex() to test string validity as a regular expression
It will be useful to package authors trying to validate input which is
supposed to be a valid regular expression.

As near as I can tell, the only way we can do so now is to run any
regex function and check for the warning and/or condition to bubble
up:

valid_regex <- function(str) {
  stopifnot(is.character(str), length(str) == 1L)
  !inherits(tryCatch(grepl(str, ""), condition = identity), "condition")
}

That's pretty hefty/inscrutable for such a simple validation. I see a
variety of similar approaches in CRAN packages [1], all slightly
different. It would be good for R to expose a "canonical" way to run
this validation.

At root, the problem is that R does not expose the regex compilation
routines like 'tre_regcomp', so from the R side we have to resort to
hacky approaches.

Things get slightly complicated by encoding/useBytes modes
(tre_regwcomp, tre_regncomp, tre_regwncomp, tre_regcompb,
tre_regncompb; all in tre.h), but all are already present in other
regex routines, so this is doable.

Exposing a function to compile regular expressions is common in other
languages, e.g. Go [2], Python [3], JavaScript [4].

[1] https://github.com/search?q=lang%3AR+%2Fis%5Ba-zA-Z0-9._%5D*reg%5Ba-zA-Z0-9._%5D*ex.*%28%3C-%7C%3D%29%5Cs*function%2F+org%3Acran&type=code
[2] https://pkg.go.dev/regexp#Compile
[3] https://docs.python.org/3/library/re.html#re.compile
[4] https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/RegExp
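As a rough illustration of how the tryCatch() approach behaves, here is a minimal sketch. The name valid_regex_engine and its `perl` argument are hypothetical additions for illustration only (the argument is simply passed through to grepl()); they are not part of the proposal.

```r
# Hypothetical helper: same idea as valid_regex(), with the engine selectable.
# `perl` is passed through to grepl(); it is an assumption, not part of the
# original proposal.
valid_regex_engine <- function(str, perl = FALSE) {
  stopifnot(is.character(str), length(str) == 1L)
  # Any warning or error signalled while matching against "" is returned as a
  # condition object by tryCatch(); a successful match returns a logical.
  res <- tryCatch(grepl(str, "", perl = perl), condition = identity)
  !inherits(res, "condition")
}

valid_regex_engine("a+")              # TRUE
valid_regex_engine("(")               # FALSE: unbalanced parenthesis (TRE)
valid_regex_engine("(", perl = TRUE)  # FALSE: unbalanced parenthesis (PCRE)
```

Note that `condition = identity` catches every condition, including warnings and messages unrelated to the pattern itself, so this check is stricter than "the pattern compiles".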
Duncan Murdoch
2023-Oct-10 00:19 UTC
[Rd] FR: valid_regex() to test string validity as a regular expression
On 09/10/2023 7:57 p.m., Michael Chirico via R-devel wrote:
> It will be useful to package authors trying to validate input which is
> supposed to be a valid regular expression.
>
> As near as I can tell, the only way we can do so now is to run any
> regex function and check for the warning and/or condition to bubble
> up:
>
> valid_regex <- function(str) {
>   stopifnot(is.character(str), length(str) == 1L)
>   !inherits(tryCatch(grepl(str, ""), condition = identity), "condition")
> }
>
> That's pretty hefty/inscrutable for such a simple validation. I see a
> variety of similar approaches in CRAN packages [1], all slightly
> different. It would be good for R to expose a "canonical" way to run
> this validation.

I think currently we do as.character(str) (or some equivalent), so the
test shouldn't require str to be a character to start. For example,
this is currently valid code:

grepl(1, "abc123")

It's not great style, but shouldn't generate an error.

Duncan Murdoch

> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
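To make the coercion point concrete, here is a small sketch of a variant that converts its input with as.character() instead of rejecting non-character input. The name valid_regex_coerce is hypothetical, and the explicit as.character() call is an assumption that mirrors what grepl() appears to do with its pattern argument.

```r
# grepl() accepts a non-character pattern and coerces it, so this is valid:
grepl(1, "abc123")  # TRUE

# Hypothetical variant of valid_regex() that coerces rather than rejects.
valid_regex_coerce <- function(pattern) {
  pattern <- as.character(pattern)   # assumed to mirror grepl()'s own coercion
  stopifnot(length(pattern) == 1L)
  res <- tryCatch(grepl(pattern, ""), condition = identity)
  !inherits(res, "condition")
}

valid_regex_coerce(1)    # TRUE: "1" is a valid pattern
valid_regex_coerce("(")  # FALSE
```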
Tomas Kalibera
2023-Oct-10 06:30 UTC
[Rd] FR: valid_regex() to test string validity as a regular expression
On 10/10/23 01:57, Michael Chirico via R-devel wrote:
> It will be useful to package authors trying to validate input which is
> supposed to be a valid regular expression.
>
> As near as I can tell, the only way we can do so now is to run any
> regex function and check for the warning and/or condition to bubble
> up:
>
> valid_regex <- function(str) {
>   stopifnot(is.character(str), length(str) == 1L)
>   !inherits(tryCatch(grepl(str, ""), condition = identity), "condition")
> }
>
> That's pretty hefty/inscrutable for such a simple validation. I see a
> variety of similar approaches in CRAN packages [1], all slightly
> different. It would be good for R to expose a "canonical" way to run
> this validation.
>
> At root, the problem is that R does not expose the regex compilation
> routines like 'tre_regcomp', so from the R side we have to resort to
> hacky approaches.

Hi Michael,

I don't think you need compilation functions for that. If a regular
expression is found invalid by a specific third-party library R uses,
the library should return an error to R and R should return an error to
you, and you should probably propagate that to your users. Grepping an
empty string might work in many cases as a test, but it is probably
more portable to simply be prepared to propagate such errors from the
actual use on real inputs. In theory, there could be some optimization
for a particular case, so the checking may not be the same, but the
same holds, say, for compilation versus checking.

> Things get slightly complicated by encoding/useBytes modes
> (tre_regwcomp, tre_regncomp, tre_regwncomp, tre_regcompb,
> tre_regncompb; all in tre.h), but all are already present in other
> regex routines, so this is doable.

Re encodings, R strings should simply be valid in their encoding. This
is not just for regular expressions but also for anything else. You
shouldn't assume that R can handle invalid strings in any reasonable
way. You definitely shouldn't try adding invalid strings in tests:
behavior with invalid strings is unspecified. To test whether a string
is valid, there is validEnc() (or validUTF8()). But, again, it is
probably safest to propagate errors from the regular expression R
functions (in case the checks differ, particularly for non-UTF-8).
Also, duplicating the encoding checks can add non-trivial overhead.

If there was a strong need to have an automated way to somehow classify
specifically errors from the regex libraries, perhaps R could attach
some classes to them when the library reports them.

Tomas
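The "propagate errors from the actual use" suggestion could look roughly like the sketch below. The function name find_matches and the wording of the error message are hypothetical; only grepl(), tryCatch(), stop() and conditionMessage() are existing R functions.

```r
# Hypothetical wrapper: no up-front validation, just run the regex on the real
# input and re-signal any error with some context for the user.
find_matches <- function(pattern, x) {
  tryCatch(
    grepl(pattern, x),
    error = function(e) {
      stop("could not apply pattern '", pattern, "': ",
           conditionMessage(e), call. = FALSE)
    }
  )
}

find_matches("a", c("a", "b"))  # TRUE FALSE

# An invalid pattern surfaces as an ordinary error that callers can handle.
msg <- tryCatch(find_matches("(", c("a", "b")),
                error = function(e) conditionMessage(e))
```

With this design the regex engine's own check is the only check, so there is no risk of a separate validator disagreeing with the function that is actually used. The compilation warning the engine emits alongside the error is not intercepted here and will still be reported.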