Max Bane
2009-Mar-21 00:25 UTC
[R] Subsetting data where the condition is that the value of some column contains some substring
I have some data that looks like this:

> dataP
                                input output corpusFreq pvolOT pvolRatioOT
1       give(my sister, the old book)      P       47.0  56016   0.1543651
5               donate(her, the book)      P       48.7  68928   0.1899471
9           give(my sister, the book)      P       73.4  80136   0.2208333
13    donate(my sister, the old book)      P       79.0  57024   0.1571429
20                give(my sister, it)      P      100.0 132408   0.3648810
21                      give(her, it)      P      100.0 157248   0.4333333
24              donate(my sister, it)      P      100.0 130720   0.3602293
28                give(her, the book)      P        5.7  65232   0.1797619
31                    donate(her, it)      P      100.0 152064   0.4190476
35   give(my little sister, the book)      P       91.8 112032   0.3087302
39 donate(my little sister, the book)      P       98.4 114048   0.3142857
43        donate(my sister, the book)      P       94.4  82800   0.2281746

I would like to extract the subset of this data in which the value of the "input" column contains the substring "her". I was thinking I could use the grep function to test for the presence of this substring. For instance, if a string does not contain it, then grep returns a zero-length integer vector:

> grep("her", "give(my sister, it)")
integer(0)

And if the string does contain the substring, grep returns a vector of the indices where the substring is located:

> grep("her", "give(her, it)")
[1] 1

I can thus test for the presence of the substring by converting the length of the result of grep into a boolean:

> as.logical(length(grep("her", "give(my sister, it)")))
[1] FALSE
> as.logical(length(grep("her", "give(her, it)")))
[1] TRUE
> as.logical(length(grep("her", "give(her, it)"))) == TRUE
[1] TRUE
> as.logical(length(grep("her", "give(my sister, it)"))) == TRUE
[1] FALSE

I would like to use this test as a criterion for constructing a subset of my data. Unfortunately, it does not work:

> subset(dataP, as.logical(length(grep("her", input)))==TRUE)
                                input output corpusFreq pvolOT pvolRatioOT
1       give(my sister, the old book)      P       47.0  56016   0.1543651
5               donate(her, the book)      P       48.7  68928   0.1899471
9           give(my sister, the book)      P       73.4  80136   0.2208333
13    donate(my sister, the old book)      P       79.0  57024   0.1571429
20                give(my sister, it)      P      100.0 132408   0.3648810
21                      give(her, it)      P      100.0 157248   0.4333333
24              donate(my sister, it)      P      100.0 130720   0.3602293
28                give(her, the book)      P        5.7  65232   0.1797619
31                    donate(her, it)      P      100.0 152064   0.4190476
35   give(my little sister, the book)      P       91.8 112032   0.3087302
39 donate(my little sister, the book)      P       98.4 114048   0.3142857
43        donate(my sister, the book)      P       94.4  82800   0.2281746

As you can see, I get back the whole data set, rather than just the subset where the input column contains "her". And if I invert the test, which I would expect to give the subset *not* containing "her", I instead get the empty subset, rather mysteriously:

> subset(dataP, as.logical(length(grep("her", input)))==FALSE)
[1] input       output      corpusFreq  pvolOT      pvolRatioOT
<0 rows> (or 0-length row.names)

The type of the input column is definitely character. To be doubly sure:

> subset(dataP, as.logical(length(grep("her", as.character(input))))==TRUE)

does the same thing.

Could somebody with more R experience than I have please explain what I am doing wrong here? I'll be much obliged.

-- 
Max Bane
PhD Student, Linguistics
University of Chicago
bane at uchicago.edu
jim holtman
2009-Mar-21 00:57 UTC
[R] Subsetting data where the condition is that the value of some column contains some substring
Try using regexpr instead:

> x <- read.table(textConnection("input output corpusFreq pvolOT pvolRatioOT
+ give(mysister,theoldbook) P 47.0 56016 0.1543651
+ donate(her,thebook) P 48.7 68928 0.1899471
+ give(mysister,thebook) P 73.4 80136 0.2208333
+ donate(mysister,theoldbook) P 79.0 57024 0.1571429
+ give(mysister,it) P 100.0 132408 0.3648810
+ give(her,it) P 100.0 157248 0.4333333
+ donate(mysister,it) P 100.0 130720 0.3602293
+ give(her,thebook) P 5.7 65232 0.1797619
+ donate(her,it) P 100.0 152064 0.4190476
+ give(mylittlesister,thebook) P 91.8 112032 0.3087302
+ donate(mylittlesister,thebook) P 98.4 114048 0.3142857
+ donate(mysister,thebook) P 94.4 82800 0.2281746"), header=TRUE)
> # use regexpr
> matched <- regexpr("her", x$input) != -1
> notMatched <- !matched
> x[matched,]
                input output corpusFreq pvolOT pvolRatioOT
2 donate(her,thebook)      P       48.7  68928   0.1899471
6        give(her,it)      P      100.0 157248   0.4333333
8   give(her,thebook)      P        5.7  65232   0.1797619
9      donate(her,it)      P      100.0 152064   0.4190476
> x[notMatched,]
                            input output corpusFreq pvolOT pvolRatioOT
1       give(mysister,theoldbook)      P       47.0  56016   0.1543651
3          give(mysister,thebook)      P       73.4  80136   0.2208333
4     donate(mysister,theoldbook)      P       79.0  57024   0.1571429
5               give(mysister,it)      P      100.0 132408   0.3648810
7             donate(mysister,it)      P      100.0 130720   0.3602293
10   give(mylittlesister,thebook)      P       91.8 112032   0.3087302
11 donate(mylittlesister,thebook)      P       98.4 114048   0.3142857
12       donate(mysister,thebook)      P       94.4  82800   0.2281746

-- 
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem that you are trying to solve?
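
For readers wondering why the original attempt returned every row: when grep is given a whole column, it returns the indices of the elements that match, so length(grep(...)) is one number for the entire column rather than one value per row. Converted to a logical, that single TRUE is recycled across all rows by subset, and TRUE == FALSE likewise yields a single FALSE, which gives the empty result. The sketch below illustrates this on a small stand-in data frame (four rows copied from the post, with only two of the columns) and shows grepl as a per-row alternative to the regexpr test. It assumes an R version that provides grepl; fixed = TRUE is an added choice here so that "her" is treated as a literal substring rather than a regular expression.

```r
# Stand-in for the poster's dataP: four of the original rows, two columns only.
dataP <- data.frame(
  input = c("give(my sister, it)", "give(her, it)",
            "donate(her, the book)", "donate(my sister, the book)"),
  corpusFreq = c(100.0, 100.0, 48.7, 94.4),
  stringsAsFactors = FALSE
)

# grep() over a vector returns the positions of the matching elements,
# so its length is a single number for the whole column.
hits <- grep("her", dataP$input)
scalar_test <- as.logical(length(hits))  # length-one logical, recycled by subset()

# grepl() returns one logical per element, which is what subset() needs.
per_row <- grepl("her", dataP$input, fixed = TRUE)

with_her    <- subset(dataP, grepl("her", input, fixed = TRUE))
without_her <- subset(dataP, !grepl("her", input, fixed = TRUE))
```

The regexpr comparison in the reply above works for the same reason: regexpr also returns one value per element (-1 where there is no match), so the resulting logical vector lines up with the rows of the data frame.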