Andrew Hoerner
2014-Apr-12 12:36 UTC
[R] Selecting rows from a DF where the value in a selected column matches any element of a vector.
Dear Folks,

I have a file with 3 million-odd rows of data from the 2007 U.S. Economic Census. I am trying to pare it down to the subset of rows that (1) have any one of a vector of NAICS economic sector codes and (2) also have any one of a vector of geographic ID codes.

Here is the code I am trying to use:

    ECwork <- EC07_A1[ any(GEO_ID == c("01000US", "04000US06", "33000US488",
        "31000US41860", "31400US4186036084", "05000US06001", "E6000US0600153000") &
        any(SECTOR == c("32", "33", "42", "44", "45", "51", "54", "61", "71", "81"), ]

I get back the following warning:

    Warning message:
    In EC07_A1$SECTOR == c("32", "33", "42", "44", "45", "51", "54",  :
      longer object length is not a multiple of shorter object length

I see what R is doing. Instead of comparing each element of the column SECTOR to the vector of codes and returning a logical vector of the length of SECTOR, with TRUE marking the rows that match any of the codes, it is lining my code list up against SECTOR, testing element by element, and recycling the code list over three million rows. But I am not sure how to make it do what I want: test the sector code in each row against the vector of codes I am looking for. I would be grateful if anyone could suggest an alternative that would achieve my ends.

Oh, and I would add: if there is a way of correctly doing this with the extract function [], I would like to know what it is. If not, I guess I'd like to know that too.

Sincerely,
Andrew Hoerner

--
J. Andrew Hoerner
Director, Sustainable Economics Program
Redefining Progress
(510) 507-4820
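The recycling behaviour described in the question can be reproduced on a small vector. This is only a minimal sketch with made-up values; `sector` and `wanted` are hypothetical names and are not taken from the Census file.

```r
# Hypothetical toy values standing in for the SECTOR column and the code list.
sector <- c("32", "44", "99", "81", "32")
wanted <- c("32", "44", "81")

# `==` pairs elements by position and recycles the shorter vector, so this
# compares sector[1] with wanted[1], sector[2] with wanted[2],
# sector[3] with wanted[3], sector[4] with wanted[1], and so on.
# Because 5 is not a multiple of 3, R also emits the
# "longer object length is not a multiple of shorter object length" warning.
eq <- sector == wanted

# any() then collapses the whole comparison to a single logical value,
# so it cannot serve as a per-row selector.
collapsed <- any(eq)

print(eq)
print(collapsed)
```

Note that the fourth element, "81", is in `wanted` but is compared only against the recycled "32", so it is reported as FALSE. A single TRUE used as a row index is itself recycled, which would select every row of a data frame rather than the matching ones.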
Sarah Goslee
2014-Apr-12 13:04 UTC
You need %in% instead. This is untested, but something like this should work:

    ECwork <- EC07_A1[
        EC07_A1$GEO_ID %in% c("01000US", "04000US06", "33000US488", "31000US41860",
            "31400US4186036084", "05000US06001", "E6000US0600153000") &
        EC07_A1$SECTOR %in% c("32", "33", "42", "44", "45", "51", "54", "61", "71", "81"), ]

(Note that your original code snippet was missing closing parentheses and didn't specify the data frame from which to take the columns.)

Sarah

--
Sarah Goslee
http://www.functionaldiversity.org
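The `%in%` approach can be checked on a miniature data frame before running it on the full file. The sketch below is an illustration under assumptions: `toy`, `geo_wanted`, and `sector_wanted` are hypothetical names with made-up rows, standing in for EC07_A1 and the two code vectors.

```r
# Hypothetical miniature stand-in for EC07_A1.
toy <- data.frame(
  GEO_ID = c("01000US", "04000US06", "04000US48", "01000US", "05000US06001"),
  SECTOR = c("32", "99", "44", "81", "51"),
  stringsAsFactors = FALSE
)

geo_wanted    <- c("01000US", "04000US06", "05000US06001")
sector_wanted <- c("32", "44", "51", "81")

# %in% returns one TRUE/FALSE per row: TRUE when that row's value
# appears anywhere in the lookup vector. No recycling is involved.
keep <- toy$GEO_ID %in% geo_wanted & toy$SECTOR %in% sector_wanted

# Row selection with the extract operator `[`.
picked <- toy[keep, ]

# The same selection written with subset(), which evaluates the
# column names inside the data frame.
picked2 <- subset(toy, GEO_ID %in% geo_wanted & SECTOR %in% sector_wanted)

print(picked)
```

This also answers the question about `[`: the extract operator works as long as the row index is a logical vector with one element per row, which is what `%in%` produces. Unlike `==`, `%in%` never returns NA, so rows with missing codes are dropped rather than turned into NA rows.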