I noticed a problem in strcapture() from R-devel (2016-09-27 r71386) when
the text contains a missing value and perl=TRUE.

{
   # NA in text input should map to row of NA's in output, without warning
   r9p <- strcapture(perl = TRUE, "(.).* ([[:digit:]]+)",
                     c("One 1", NA, "Fifty 50"),
                     data.frame(Initial=factor(), Number=numeric()))
   e9p <- structure(list(Initial = structure(c(2L, NA, 1L),
                                             .Label = c("F", "O"),
                                             class = "factor"),
                         Number = c(1, NA, 50)),
                    row.names = c(NA, -3L),
                    class = "data.frame")
   all.equal(e9p, r9p)
}
#Error in if (any(ind)) { : missing value where TRUE/FALSE needed

Bill Dunlap
TIBCO Software
wdunlap tibco.com

On Wed, Sep 21, 2016 at 2:32 PM, Michael Lawrence
<lawrence.michael at gene.com> wrote:
> The new behavior is that it yields NAs when the pattern does not match
> (like strptime) and for empty captures in a matching pattern it yields
> the empty string, which is consistent with regmatches().
>
> Michael
>
> On Wed, Sep 21, 2016 at 2:21 PM, William Dunlap <wdunlap at tibco.com> wrote:
> > If there are any matches then strcapture can see if the pattern has the
> > same number of capture expressions as the prototype has columns and give
> > an error if not.  That seems appropriate.
> >
> > If there are no matches, then there is no easy way to see if the
> > prototype is compatible with the pattern, so should strcapture just
> > assume the best and fill in the prototype with NA's?
> >
> > Should there be warnings?  This is kind of like strptime(), which
> > silently gives NA's when the format does not match the text input.
> >
> > Bill Dunlap
> > TIBCO Software
> > wdunlap tibco.com
> >
> > On Wed, Sep 21, 2016 at 2:10 PM, Michael Lawrence
> > <lawrence.michael at gene.com> wrote:
> >>
> >> Hi Bill,
> >>
> >> Thanks, another good suggestion. strcapture() now returns NAs for
> >> non-matches. It's nice to have someone kicking the tires on that
> >> function.
> >>
> >> Michael
> >>
> >> On Wed, Sep 21, 2016 at 12:11 PM, William Dunlap via R-devel
> >> <r-devel at r-project.org> wrote:
> >> > Michael, thanks for looking at my first issue with utils::strcapture.
> >> >
> >> > Another issue is how it deals with lines that don't match the pattern.
> >> > Currently it gives an error
> >> >
> >> > > strcapture("(.+) (.+)", c("One 1", "noSpaceInLine", "Three 3"),
> >> >     proto=list(Name="", Number=0))
> >> > Error in strcapture("(.+) (.+)", c("One 1", "noSpaceInLine", "Three 3"), :
> >> >   number of matches does not always match ncol(proto)
> >> >
> >> > First, isn't the 'number of matches' the number of parenthesized
> >> > subpatterns in the regular expression?  I thought that if the entire
> >> > pattern matches then the subpatterns without matches would be
> >> > shown as matches at position 0 with length 0.  Hence either the
> >> > pattern is compatible with the prototype or it isn't; it does not
> >> > depend on the text input.  E.g.,
> >> >
> >> > > regexec("^(([[:alpha:]]+)|([[:digit:]]+))$", c("Twelve", "12", "Z280"))
> >> > [[1]]
> >> > [1] 1 1 1 0
> >> > attr(,"match.length")
> >> > [1] 6 6 6 0
> >> > attr(,"useBytes")
> >> > [1] TRUE
> >> >
> >> > [[2]]
> >> > [1] 1 1 0 1
> >> > attr(,"match.length")
> >> > [1] 2 2 0 2
> >> > attr(,"useBytes")
> >> > [1] TRUE
> >> >
> >> > [[3]]
> >> > [1] -1
> >> > attr(,"match.length")
> >> > [1] -1
> >> > attr(,"useBytes")
> >> > [1] TRUE
> >> >
> >> > Second, an error message like 'some lines were bad' is not very
> >> > helpful.  Should it put NA's in all the columns of the current output
> >> > row if the input line didn't match the pattern and perhaps warn the
> >> > user that there were problems?  The user could then look for rows of
> >> > NA's to see where the problems were.
> >> >
> >> > Bill Dunlap
> >> > TIBCO Software
> >> > wdunlap tibco.com
> >> >
> >> > ______________________________________________
> >> > R-devel at r-project.org mailing list
> >> > https://stat.ethz.ch/mailman/listinfo/r-devel
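
For illustration, the behavior described in the quoted messages (NA rows for non-matching input, empty strings for capture groups that did not participate in a match) can be sketched with regexec() and regmatches() alone. This is only a sketch under stated assumptions, not the strcapture() implementation: the names `ncap`, `rows` and `mat` are made up for the example, and the number of capture groups is simply written down by hand.

```r
x <- c("Twelve", "12", "Z280")
pattern <- "^(([[:alpha:]]+)|([[:digit:]]+))$"

# Full match plus one entry per capture group for each matching element;
# a zero-length vector for an element that does not match.
m <- regmatches(x, regexec(pattern, x))

# Assumption for this sketch: the group count is known in advance.
ncap <- 3L

# Drop the full match; pad non-matching elements with NA so that every
# input element contributes exactly one row.
rows <- lapply(m, function(v) {
  if (length(v)) v[-1L] else rep(NA_character_, ncap)
})
mat <- do.call(rbind, rows)
mat
```

The third row of `mat` is all NA because "Z280" does not match, while the group that did not participate in the first two matches shows up as "" rather than NA.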
Hi Bill,

This is a bug in regexec() and I will commit a fix.

Thanks for the report,
Michael

On Tue, Oct 4, 2016 at 1:40 PM, William Dunlap <wdunlap at tibco.com> wrote:
> I noticed a problem in the strcapture from R-devel (2016-09-27 r71386), when
> the text contains a missing value and perl=TRUE.
It is also not catching the cases where the number of capture expressions
does not match the number of entries in proto.  I think all of the
following should give an error about the mismatch.

> strcapture("(.)(.)", c("ab", "cde", "fgh", "ij", "lm"), proto=list(A="",B="",C=""))
   A  B  C
1  a  b cd
2  d fg  f
3 ij  i  j
4  l  m ab
Warning message:
In matrix(as.character(unlist(str)), ncol = ntokens, byrow = TRUE) :
  data length [15] is not a sub-multiple or multiple of the number of rows [4]
> strcapture("(.)(.)(.)", c("abc", "def", "ghi", "jkl", "mno"), proto=list(A="",B=""))
    A   B
1   a   b
2 def   d
3   f ghi
4   h   i
5   j   k
6 mno   m
7   o abc
Warning message:
In matrix(as.character(unlist(str)), ncol = ntokens, byrow = TRUE) :
  data length [20] is not a sub-multiple or multiple of the number of rows [7]
> strcapture("(.)(.)(.)", c("abc", "def"), proto=list(A=""))
  A
1 a
2 c
3 d
4 f

Bill Dunlap
TIBCO Software
wdunlap tibco.com

On Tue, Oct 4, 2016 at 2:21 PM, Michael Lawrence
<lawrence.michael at gene.com> wrote:
> Hi Bill,
>
> This is a bug in regexec() and I will commit a fix.
>
> Thanks for the report,
> Michael
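
One way such a mismatch check could look is sketched below. The helper `check_capture_count()` is hypothetical (it is not part of utils and not the fix that went into R); it only assumes that regmatches() on a regexec() result returns the full match plus one entry per capture group for each matching element, and a zero-length vector for a non-match.

```r
# Hypothetical helper, not part of utils: stop if any matching input
# yields a number of capture groups different from length(proto).
check_capture_count <- function(pattern, x, proto, perl = FALSE) {
  m <- regmatches(x, regexec(pattern, x, perl = perl))
  # Matching elements hold the full match plus one entry per group,
  # so non-matching elements end up with a group count of -1 here.
  ngroups <- lengths(m) - 1L
  matched <- ngroups >= 0L
  bad <- matched & ngroups != length(proto)
  if (any(bad)) {
    stop(sprintf("pattern has %d capture group(s) but proto has %d entries",
                 ngroups[bad][1L], length(proto)))
  }
  invisible(TRUE)
}

# Passes silently: two groups, two entries in proto.
check_capture_count("(.)(.)", c("ab", "cde"), list(A = "", B = ""))

# Signals an error for the first case reported above.
res <- tryCatch(
  check_capture_count("(.)(.)", c("ab", "cde", "fgh", "ij", "lm"),
                      list(A = "", B = "", C = "")),
  error = function(e) conditionMessage(e)
)
res
```

Because the group count is read off the matches, input in which nothing matches cannot be checked this way, which is the limitation already raised earlier in the thread.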