Tomas Kalibera
2023-Oct-10 06:30 UTC
[Rd] FR: valid_regex() to test string validity as a regular expression
On 10/10/23 01:57, Michael Chirico via R-devel wrote:
> It will be useful to package authors trying to validate input which is
> supposed to be a valid regular expression.
>
> As near as I can tell, the only way we can do so now is to run any
> regex function and check for the warning and/or condition to bubble
> up:
>
> valid_regex <- function(str) {
>   stopifnot(is.character(str), length(str) == 1L)
>   !inherits(tryCatch(grepl(str, ""), condition = identity), "condition")
> }
>
> That's pretty hefty/inscrutable for such a simple validation. I see a
> variety of similar approaches in CRAN packages [1], all slightly
> different. It would be good for R to expose a "canonical" way to run
> this validation.
>
> At root, the problem is that R does not expose the regex compilation
> routines like 'tre_regcomp', so from the R side we have to resort to
> hacky approaches.

Hi Michael,

I don't think you need compilation functions for that. If a regular expression is found invalid by a specific third-party library R uses, the library should return an error to R, R should return an error to you, and you should probably propagate that to your users. Grepping an empty string might work in many cases as a test, but it is probably more portable to simply be prepared to propagate such errors from the actual use on real inputs. In theory, there could be some optimization for a particular case, so the checking may not be the same, but that is the same for, say, compilation and checking.

> Things get slightly complicated by encoding/useBytes modes
> (tre_regwcomp, tre_regncomp, tre_regwncomp, tre_regcompb,
> tre_regncompb; all in tre.h), but all are already present in other
> regex routines, so this is doable.

Re encodings: R strings should simply be valid in their encoding. This is not just for regular expressions but also for anything else. You shouldn't assume that R can handle invalid strings in any reasonable way. You definitely shouldn't try adding invalid strings in tests: behavior with invalid strings is unspecified. To test whether a string is valid, there is validEnc() (or validUTF8()). But, again, it is probably safest to propagate errors from the regular expression R functions (in case the checks differ, particularly for non-UTF-8). Also, duplicating the encoding checks can be a non-trivial overhead.

If there was a strong need to have an automated way to classify specifically the errors from the regex libraries, perhaps R could attach some classes to them when the library reports them.

Tomas

> Exposing a function to compile regular expressions is common in other
> languages, e.g. Go [2], Python [3], JavaScript [4].
>
> [1] https://github.com/search?q=lang%3AR+%2Fis%5Ba-zA-Z0-9._%5D*reg%5Ba-zA-Z0-9._%5D*ex.*%28%3C-%7C%3D%29%5Cs*function%2F+org%3Acran&type=code
> [2] https://pkg.go.dev/regexp#Compile
> [3] https://docs.python.org/3/library/re.html#re.compile
> [4] https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/RegExp
>
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
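
As an illustration of the two points in the message above (checking encoding validity with validUTF8()/validEnc(), and letting the regex function itself report an invalid pattern), here is a minimal sketch. The variable names are made up for the example, and the invalid string is built only so that the validity checks have something to reject; it is never passed to a regex function.

```r
# Two candidate patterns: the first is plain ASCII, the second contains a
# lone 0xE9 byte, which is not a valid UTF-8 sequence.
patterns <- c("^abc$", "caf\xe9")

# validUTF8() inspects the bytes directly and does not depend on the locale.
utf8_ok <- validUTF8(patterns)

# validEnc() checks each string against its declared encoding, so for
# strings in the native encoding its result depends on the locale.
enc_ok <- validEnc(patterns)

# Propagating the error from real use: the pattern "(" is rejected by the
# regex library, grepl() raises an error, and the caller decides what to do
# with it. Here it is caught only to look at the message.
res <- tryCatch(
  grepl("(", c("a(b", "ab")),
  error = function(e) conditionMessage(e)
)
```

The exact wording of the message in res comes from the regex library and R, so code that needs to distinguish regex errors from other errors cannot rely on it, which is where the idea of attaching a class to such errors comes in.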
Michael Chirico
2023-Oct-10 14:57 UTC
[Rd] FR: valid_regex() to test string validity as a regular expression
> Grepping an empty string might work in many cases...

That's precisely why a base R offering is important, as a surer way of validating in all cases. To be clear, I am trying to directly access the results of tre_regcomp().

> it is probably more portable to simply be prepared to propagate such
> errors from the actual use on real inputs

That works best in self-contained calls -- foo(re), where we execute re inside foo(). But the specific context where I found myself looking for a regex validator is more complicated (https://github.com/r-lib/lintr/pull/2225). The user supplies a regular expression in a configuration file, and only "later" is it actually supplied to grepl().

Till now, we've done your suggestion -- just surface the regex error at run time. But our goal is to make it friendlier and fail earlier, at "compile time" as the config is loaded, "long" before any regex is actually executed.

At a bare minimum this is a good place to return a classed warning (say invalid_regex_warning) to allow finer control than tryCatch(condition=).
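
The "fail early when the config is loaded" scenario from the reply can be approximated today with the empty-string workaround wrapped in a classed condition. The sketch below is only that: check_regex(), load_config() and the class name invalid_regex_error are hypothetical and exist neither in base R nor in lintr, and the check inherits the limitation discussed in the thread, namely that grepl() on "" is a stand-in for a real compile step.

```r
# Hypothetical helper: validate one pattern and signal a classed error.
check_regex <- function(pattern, ...) {
  stopifnot(is.character(pattern), length(pattern) == 1L, !is.na(pattern))
  tryCatch(
    grepl(pattern, "", ...),
    # Whatever the regex engine signals first (warning or error) ends up
    # here and is re-signalled with a class that callers can target.
    condition = function(cond) {
      stop(structure(
        class = c("invalid_regex_error", "error", "condition"),
        list(
          message = sprintf("invalid regular expression '%s': %s",
                            pattern, conditionMessage(cond)),
          call = NULL
        )
      ))
    }
  )
  invisible(pattern)
}

# Hypothetical config loader: every setting is assumed to be a regex and is
# validated up front, long before any of them is used for matching.
load_config <- function(settings) {
  for (name in names(settings)) check_regex(settings[[name]])
  settings
}

ok <- load_config(list(object_name = "^[a-z_]+$"))

bad <- tryCatch(
  load_config(list(object_name = "(unclosed")),
  invalid_regex_error = function(e) conditionMessage(e)
)
```

Because the handler targets invalid_regex_error rather than every condition, unrelated errors raised while loading the config still propagate unchanged, which is the finer control that tryCatch(condition=) alone does not give.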