thr3ads.net - R help - [R] Subsetting data where the condition is that the value of some column contains some substring [Mar 2009]


Max Bane

2009-Mar-21 00:25 UTC

[R] Subsetting data where the condition is that the value of some column contains some substring

I have some data that looks like this:
> dataP
                                input output corpusFreq pvolOT pvolRatioOT
1       give(my sister, the old book)      P       47.0  56016   0.1543651
5               donate(her, the book)      P       48.7  68928   0.1899471
9           give(my sister, the book)      P       73.4  80136   0.2208333
13    donate(my sister, the old book)      P       79.0  57024   0.1571429
20                give(my sister, it)      P      100.0 132408   0.3648810
21                      give(her, it)      P      100.0 157248   0.4333333
24              donate(my sister, it)      P      100.0 130720   0.3602293
28                give(her, the book)      P        5.7  65232   0.1797619
31                    donate(her, it)      P      100.0 152064   0.4190476
35   give(my little sister, the book)      P       91.8 112032   0.3087302
39 donate(my little sister, the book)      P       98.4 114048   0.3142857
43        donate(my sister, the book)      P       94.4  82800   0.2281746

I would like to extract the subset of this data in which the value of
the "input" column contains the substring "her". I was thinking I
could use the grep function to test for the presence of this
substring. For instance, if a string does not contain it, then grep
returns a zero-length integer vector:
> grep("her", "give(my sister, it)")
integer(0)

And if the string does contain the substring, grep returns a vector of
the indices where the substring is located:
> grep("her", "give(her, it)")
[1] 1

I can thus test for the presence of the substring by converting the
length of the result of grep into a boolean:
> as.logical(length(grep("her", "give(my sister, it)")))
[1] FALSE
> as.logical(length(grep("her", "give(her, it)")))
[1] TRUE
> as.logical(length(grep("her", "give(her, it)"))) == TRUE
[1] TRUE
> as.logical(length(grep("her", "give(my sister, it)"))) == TRUE
[1] FALSE

I would like to use this test as a criterion for constructing a subset
of my data. Unfortunately, it does not work:
> subset(dataP, as.logical(length(grep("her", input)))==TRUE)
                                input output corpusFreq pvolOT pvolRatioOT
1       give(my sister, the old book)      P       47.0  56016   0.1543651
5               donate(her, the book)      P       48.7  68928   0.1899471
9           give(my sister, the book)      P       73.4  80136   0.2208333
13    donate(my sister, the old book)      P       79.0  57024   0.1571429
20                give(my sister, it)      P      100.0 132408   0.3648810
21                      give(her, it)      P      100.0 157248   0.4333333
24              donate(my sister, it)      P      100.0 130720   0.3602293
28                give(her, the book)      P        5.7  65232   0.1797619
31                    donate(her, it)      P      100.0 152064   0.4190476
35   give(my little sister, the book)      P       91.8 112032   0.3087302
39 donate(my little sister, the book)      P       98.4 114048   0.3142857
43        donate(my sister, the book)      P       94.4  82800   0.2281746

As you can see, I get back the whole data set, rather than just the
subset where the input column contains "her". And if I invert the
test, which I would expect to give the subset *not* containing "her",
I instead get the empty subset, rather mysteriously:
> subset(dataP, as.logical(length(grep("her", input)))==FALSE)
[1] input       output      corpusFreq  pvolOT      pvolRatioOT
<0 rows> (or 0-length row.names)

The type of the input column is definitely character. To be double sure:
> subset(dataP, as.logical(length(grep("her", as.character(input))))==TRUE)
does the same thing.

Could somebody with more R experience than I have please explain what
I am doing wrong here? I'll be much obliged.
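
One likely reading of the behavior described above, sketched on a small stand-in vector (the name `input` and its four values are taken from the table for illustration, not the full data frame): grep() is vectorized over its second argument and returns the positions of the elements that match, so length() of its result is the number of matching rows in the whole column rather than a per-row value. Any nonzero count becomes a single TRUE, and subset() recycles that one value across every row.

```r
# Hypothetical stand-in for dataP$input (four of the values shown above)
input <- c("give(my sister, it)", "give(her, it)",
           "donate(my sister, it)", "donate(her, it)")

# grep() returns the positions of the elements that contain "her"
hits <- grep("her", input)
hits

# length() collapses that to one number for the whole vector
n_hits <- length(hits)

# ...so the condition is a single TRUE, recycled over all rows by subset()
cond <- as.logical(n_hits)
cond

# Comparing with FALSE gives a single FALSE, which selects no rows at all
cond == FALSE
```

Under this reading, the "== TRUE" test keeps every row whenever at least one row matches, and the "== FALSE" test keeps none, which is consistent with both outputs shown above.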

-- 
Max Bane
PhD Student, Linguistics
University of Chicago
bane at uchicago.edu

jim holtman

2009-Mar-21 00:57 UTC


Try using regexpr instead:
> x <- read.table(textConnection("input output corpusFreq pvolOT pvolRatioOT
+ give(mysister,theoldbook)      P       47.0  56016   0.1543651
+ donate(her,thebook)      P       48.7  68928   0.1899471
+ give(mysister,thebook)      P       73.4  80136   0.2208333
+ donate(mysister,theoldbook)      P       79.0  57024   0.1571429
+ give(mysister,it)      P      100.0 132408   0.3648810
+ give(her,it)      P      100.0 157248   0.4333333
+ donate(mysister,it)      P      100.0 130720   0.3602293
+ give(her,thebook)      P        5.7  65232   0.1797619
+ donate(her,it)      P      100.0 152064   0.4190476
+ give(mylittlesister,thebook)      P       91.8 112032   0.3087302
+ donate(mylittlesister,thebook)      P       98.4 114048   0.3142857
+ donate(mysister,thebook)      P       94.4  82800   0.2281746"), header=TRUE)
> # use regexpr
> matched <- regexpr("her", x$input) != -1
> notMatched <- !matched
> x[matched,]
                input output corpusFreq pvolOT pvolRatioOT
2 donate(her,thebook)      P       48.7  68928   0.1899471
6        give(her,it)      P      100.0 157248   0.4333333
8   give(her,thebook)      P        5.7  65232   0.1797619
9      donate(her,it)      P      100.0 152064   0.4190476
> x[notMatched,]
                            input output corpusFreq pvolOT pvolRatioOT
1       give(mysister,theoldbook)      P       47.0  56016   0.1543651
3          give(mysister,thebook)      P       73.4  80136   0.2208333
4     donate(mysister,theoldbook)      P       79.0  57024   0.1571429
5               give(mysister,it)      P      100.0 132408   0.3648810
7             donate(mysister,it)      P      100.0 130720   0.3602293
10   give(mylittlesister,thebook)      P       91.8 112032   0.3087302
11 donate(mylittlesister,thebook)      P       98.4 114048   0.3142857
12       donate(mysister,thebook)      P       94.4  82800   0.2281746
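
In more recent versions of R, grepl() gives one TRUE/FALSE per element directly, which amounts to the same test as regexpr(...) != -1. The following is a minimal sketch rather than part of the original reply; the data frame `d` and the names `has_her`, `with_her`, `without_her` and `same` are hypothetical and use only three rows in the style of the table above.

```r
# Hypothetical miniature of the data frame used above
d <- data.frame(
  input = c("give(mysister,it)", "give(her,it)", "donate(her,thebook)"),
  corpusFreq = c(100.0, 100.0, 48.7),
  stringsAsFactors = FALSE
)

# One logical value per row; fixed = TRUE treats "her" as a literal substring
has_her <- grepl("her", d$input, fixed = TRUE)

# Index with the logical vector, or with its negation
with_her <- d[has_her, ]
without_her <- d[!has_her, ]

# The same per-row condition also works inside subset()
same <- subset(d, grepl("her", input, fixed = TRUE))
```

Because the condition has one element per row, nothing is recycled, and negating it selects the complementary rows.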



-- 
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem that you are trying to solve?
