Andrew Hoerner
2014-Apr-12 12:36 UTC
[R] Selecting rows from a DF where the value in a selected column matches any element of a vector.
Dear Folks--
I have a file with 3 million-odd rows of data from the 2007 U.S. Economic
Census. I am trying to pare it down to a subset of rows that both (1) has
any one of a vector of NAICS economic sector codes, and (2) also has any
one of a vector of geographic ID codes.
Here is the code I am trying to use.
ECwork <- EC07_A1[ any(GEO_ID == c("01000US",
"04000US06", "33000US488",
"31000US41860", "31400US4186036084",
"05000US06001", "E6000US0600153000") &
any(SECTOR == c("32", "33", "42", "44",
"45", "51", "54", "61", "71",
"81"), ]
I get back the following warning:
Warning message:
In EC07_A1$SECTOR == c("32", "33", "42",
"44", "45", "51", "54", :
longer object length is not a multiple of shorter object length
I see what R is doing. Instead of comparing each element of the column
SECTOR to the row vector of codes, and returning a logical vector of the
length of SECTOR with rows marked as TRUE that match any of the codes, it
is lining my code list up with SECTOR as a column vector and doing
element-by-element testing, and then recycling the code list over three
million rows. But I am not sure how to make it do what I want -- test the
sector code in each row against the vector of codes I am looking for. I
would be grateful if anyone could suggest an alternative that would achieve
my ends.
Oh, and I would add, if there is a way of correctly doing this with
the extract function [], I would like to know what it is. If not, I guess
I'd like to know that too.
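The behaviour described above can be reproduced on a small scale. The sketch below uses a made-up five-element vector in place of the SECTOR column (the names sector, codes, eq and kept are illustrative only). It shows two separate effects: == pairs elements by position and recycles the shorter vector, and any() then collapses the result to a single value, which cannot act as a per-row index.

```r
# Toy stand-in for the SECTOR column (hypothetical values, not Census data)
sector <- c("32", "11", "33", "42", "21")
codes  <- c("32", "33", "42")

# `==` compares position by position and recycles `codes` as
# "32", "33", "42", "32", "33"; the length mismatch (5 vs 3) triggers
# the "longer object length" warning, suppressed here for the demo
eq <- suppressWarnings(sector == codes)
eq
# [1]  TRUE FALSE FALSE FALSE FALSE

# any() reduces the whole comparison to one TRUE/FALSE
any(eq)
# [1] TRUE

# A single TRUE used as an index is itself recycled, so nothing is filtered
kept <- sector[any(eq)]
length(kept)
# [1] 5
```

Note that "33" and "42" are present in both vectors but are reported as non-matches, because they sit at different positions.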
Sincerely, Andrew Hoerner
--
J. Andrew Hoerner
Director, Sustainable Economics Program
Redefining Progress
(510) 507-4820
Sarah Goslee
2014-Apr-12 13:04 UTC
[R] Selecting rows from a DF where the value in a selected column matches any element of a vector.
You need %in% instead.
This is untested, but something like this should work:
ECwork <- EC07_A1[ EC07_A1$GEO_ID %in% c("01000US",
"04000US06", "33000US488",
"31000US41860", "31400US4186036084",
"05000US06001", "E6000US0600153000") &
EC07_A1$SECTOR %in% c("32", "33", "42",
"44", "45", "51", "54", "61", "71",
"81"), ]
(Note that your original code snippet had a shortage of ) and didn't
specify the data frame from which to take the columns.)
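For anyone who wants to check the idea before running it on three million rows, here is a minimal sketch on a made-up data frame. The object names (toy, geo_codes, sector_codes, keep, toy_sub) and the row values are assumptions for illustration, not part of the Census file. It relies only on base R: %in% yields one TRUE/FALSE per row, and that logical vector goes straight into the row position of [ ].

```r
# Toy data frame standing in for EC07_A1 (values are made up)
toy <- data.frame(
  GEO_ID = c("01000US", "04000US06", "04000US48", "01000US", "05000US06001"),
  SECTOR = c("32", "11", "33", "42", "81"),
  stringsAsFactors = FALSE
)

geo_codes    <- c("01000US", "05000US06001")
sector_codes <- c("32", "33", "42", "81")

# %in% asks, for each row, whether its value appears anywhere in the set,
# so each result has the same length as the column being tested
keep <- toy$GEO_ID %in% geo_codes & toy$SECTOR %in% sector_codes
keep
# [1]  TRUE FALSE FALSE  TRUE  TRUE

# Logical indexing with [ ]: rows where keep is TRUE, all columns
toy_sub <- toy[keep, ]
nrow(toy_sub)
# [1] 3
```

Storing the code sets in named vectors first also keeps the subsetting line short and makes the parentheses easier to balance.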
Sarah
--
Sarah Goslee
http://www.functionaldiversity.org