Hi all,

Based upon an offlist communication this morning, I am somewhat confused
(more than I usually am on most Monday mornings...) about the use of
grep() with factors as the 'x' argument.

The argument guidance in ?grep indicates:

  x, text   a character vector where matches are sought. Coerced to
            character if possible.

and in the Details section:

  Arguments which should be character strings or character vectors are
  coerced to character if possible.

The wording of both would seem to reasonably lead to the conclusion that
a factor could be coerced to a character vector by the use of
as.character(FACTOR).

In tracing through the C code in character.c for do_grep(), which in
turn calls coerceVector() in coerce.c, unless I am mis-reading the code
(always possible), I don't see an indication that a factor would be
coerced to a character vector.

Since a factor -> character coercion would seem, at face value, the most
logical coercion to take place when using grep(), I am curious if I am
missing something, or if perhaps ?grep needs to be more clear in the
coercions that will or might take place. Perhaps even the consideration
of an error message if a factor is passed as the 'x' argument, if indeed
the coercion would not take place.

Perhaps the easiest example here might be:

# On R Version 2.3.1 (2006-06-01) on FC5

> grep("[a-z]", letters)
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22
[23] 23 24 25 26

> grep("[a-z]", factor(letters))
numeric(0)

Thanks for any comments or any virtual rotten tomatoes coming my way at
high speed. :-)

Marc Schwartz

On Mon, 5 Jun 2006, Marc Schwartz (via MN) wrote:

> Based upon an offlist communication this morning, I am somewhat confused
> (more than I usually am on most Monday mornings...) about the use of
> grep() with factors as the 'x' argument.
> ...
> > grep("[a-z]", letters)
>  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22
> [23] 23 24 25 26
>
> > grep("[a-z]", factor(letters))
> numeric(0)

I was recently surprised by this also. In addition, if R's grep did
support factors in this way, what sort of object (factor or character)
should it return when value=T? I recently changed Splus's grep to return
a character vector in that case.

Splus> grep("[def]", letters[26:1])
[1] 21 22 23
Splus> grep("[def]", factor(letters[26:1], levels=letters[26:1]))
[1] 21 22 23
Splus> grep("[def]", letters[26:1], value=T)
[1] "f" "e" "d"
Splus> grep("[def]", factor(letters[26:1], levels=letters[26:1]), value=T)
[1] "f" "e" "d"
Splus> class(.Last.value)
[1] "character"

R does this when grepping an integer vector.

R> grep("1", 0:11, value=T)
[1] "1"  "10" "11"

help(grep) says it returns "the matching elements themselves", but
doesn't say if "themselves" means before or after the conversion to
character.

----------------------------------------------------------------------------
Bill Dunlap
Insightful Corporation
bill at insightful dot com
360-428-8146

"All statements in this message represent the opinions of the author and do
not necessarily reflect Insightful Corporation policy or position."

On Mon, 5 Jun 2006, Marc Schwartz (via MN) wrote:

> Hi all,
>
> Based upon an offlist communication this morning, I am somewhat confused
> (more than I usually am on most Monday mornings...) about the use of
> grep() with factors as the 'x' argument.
>
> The argument guidance in ?grep indicates:
>
>   x, text   a character vector where matches are sought. Coerced to
>             character if possible.
>
> and in the Details section:
>
>   Arguments which should be character strings or character vectors are
>   coerced to character if possible.
>
> The wording of both would seem to reasonably lead to the conclusion that
> a factor could be coerced to a character vector by the use of
> as.character(FACTOR).

Well, that is not what is meant by the wording, nor what happens: there
is no method dispatch so the factor is coerced from an integer vector to
a character vector. 'coerced' usually means at low level: where
as.character() is involved we tend to say so.

As for the comments on what happens if value=TRUE: if the 'x' has been
coerced, I would expect the value to be based on the coerced value (and
it currently is).

> grep("1", factor(letters))
 [1]  1 10 11 12 13 14 15 16 17 18 19 21
> grep("1", factor(letters), value=TRUE)
 [1] "1"  "10" "11" "12" "13" "14" "15" "16" "17" "18" "19" "21"

So whereas I am quite happy to replace the low-level coercion by method
dispatch on as.character, I don't think this should be altered (and am
pretty sure there is code out there which expects a character vector
result).

> In tracing through the C code in character.c for do_grep(), which in
> turn calls coerceVector() in coerce.c, unless I am mis-reading the code
> (always possible), I don't see an indication that a factor would be
> coerced to a character vector.
>
> Since a factor -> character coercion would seem at face value, the most
> logical coercion to take place when using grep(), I am curious if I am
> missing something, or if perhaps ?grep needs to be more clear in the
> coercions that will or might take place. Perhaps even the consideration
> of an error message if a factor is passed as the 'x' argument, if indeed
> the coercion would not take place.
>
> Perhaps the easiest example here might be:
>
> # On R Version 2.3.1 (2006-06-01) on FC5
>
> > grep("[a-z]", letters)
>  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22
> [23] 23 24 25 26
>
> > grep("[a-z]", factor(letters))
> numeric(0)
>
> Thanks for any comments or any virtual rotten tomatoes coming my way at
> high speed. :-)
>
> Marc Schwartz
>
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
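
The distinction drawn above, between matching on a factor's labels and
matching on its integer codes, can be sketched as follows. This is only
an illustration and not code from any of the posters. The variable names
are hypothetical. What grep() does when handed a factor directly may
differ between R versions (the thread concerns R 2.3.1), so the sketch
converts explicitly in both cases and does not rely on grep()'s own
coercion of 'x'.

```r
# Hypothetical example data: the same factor used in the thread.
f <- factor(letters)

# Matching on the labels, which is what the original poster expected.
# as.character() on a factor returns its labels ("a", "b", ..., "z").
labels_hit <- grep("[a-z]", as.character(f))

# Matching on the integer codes, which is what low-level coercion of a
# factor amounts to. as.integer() returns the codes 1..26, and
# as.character() then turns them into "1", "2", ..., "26".
codes <- as.character(as.integer(f))
codes_hit <- grep("1", codes)
codes_val <- grep("1", codes, value = TRUE)

print(labels_hit)
print(codes_hit)
print(codes_val)
```

Calling as.character() on the factor before grep() makes the intent
explicit, whichever coercion grep() itself applies.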