Hi!

I am using R version 2.7.0 and am working on a panel dataset read into R as a data frame; I call it "ex". The variables in "ex" are: id, year, x.

id: a character string which identifies the unit
year: identifies the time period
x: the variable of interest (which might contain NAs)

Here is an example:

> id <- rep(c("A","B","C"),2)
> year <- c(rep(1970,3),rep(1980,3))
> x <- c(20,30,40,25,35,45)
> ex <- data.frame(id=id,year=year,x=x)
> ex
  id year  x
1  A 1970 20
2  B 1970 30
3  C 1970 40
4  A 1980 25
5  B 1980 35
6  C 1980 45

I want to draw a subset of "ex" by selecting only the A and B units:

> ex1 <- subset(ex[which(ex$id=="A"|ex$id=="B"),])

Now I want to do some computations on x for each selected unit only:

> tapply(ex1$x, ex1$id, mean)
   A    B    C
22.5 32.5   NA

But this gives me an NA value for the unit C, which I thought I had already left out. How do I ensure that the computation (in the last step) is limited to only the units I have selected in the first step?

Dipankar
On Fri, May 09, 2008 at 11:23:37PM -0400, Dipankar Basu wrote:
> > ex <- data.frame(id=id,year=year,x=x)
> > ex1 <- subset(ex[which(ex$id=="A"|ex$id=="B"),])
> > tapply(ex1$x, ex1$id, mean)
>    A    B    C
> 22.5 32.5   NA

Dear Dipankar,

The reason for this behaviour is that the class of ex$id is "factor": in R 2.7.0, data.frame() converts character vectors to factors by default, and a factor keeps all of its levels (including "C") after subsetting. You can avoid the conversion by wrapping the column in the I() function, as in:

ex <- data.frame(id=I(id),year=year,x=x)

Have a nice day,

--
Charles Plessy
http://charles.plessy.org
Wakō, Saitama, Japan
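To make the factor explanation concrete, here is a small sketch that builds the same data twice, once with id stored as a factor and once as plain character. It uses the stringsAsFactors argument of data.frame() rather than I(); that is a swapped-in alternative with the same goal (keeping id as character), and the object names ex_f, ex_c, ex1_f, ex1_c and res_c are made up for this illustration.

```r
id   <- rep(c("A", "B", "C"), 2)
year <- c(rep(1970, 3), rep(1980, 3))
x    <- c(20, 30, 40, 25, 35, 45)

# id stored as a factor (the default in R 2.7.0; set explicitly here
# so the behaviour is the same regardless of the R version)
ex_f  <- data.frame(id = id, year = year, x = x, stringsAsFactors = TRUE)
ex1_f <- ex_f[ex_f$id %in% c("A", "B"), ]
levels(ex1_f$id)   # the unused level "C" is still listed

# id stored as character: there are no levels to carry over
ex_c  <- data.frame(id = id, year = year, x = x, stringsAsFactors = FALSE)
ex1_c <- ex_c[ex_c$id %in% c("A", "B"), ]
res_c <- tapply(ex1_c$x, ex1_c$id, mean)
res_c              # only groups A and B appear
```

With a character column, tapply() builds its grouping factor from the values actually present, so no empty group for C is reported.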
Because id is a factor in your data frame, its levels (including "C") are kept when the data frame is subsetted. Here is one way to get rid of "C":

> ex1$id <- factor(ex1$id)
> tapply(ex1$x, ex1$id, mean)
   A    B
22.5 32.5

On Sat, May 10, 2008 at 11:23 AM, Dipankar Basu <basu.15 at gmail.com> wrote:
> But this gives me an NA value for the unit C, which I thought I had already
> left out. How do I ensure that the computation (in the last step) is limited
> to only the units I have selected in the first step?

--
HUANG Ronggui, Wincent
Bachelor of Social Work, Fudan University, China
Master of sociology, Fudan University, China
Ph.D. Candidate, CityU of HK,
http://www.cityu.edu.hk/sa/psa_web2006/students/rdegree/huangronggui.html
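The re-factoring step can be tried end to end with the sketch below. It also shows droplevels(), which is an addition not mentioned in the thread: it was introduced in R 2.12.0, so it is not available in the R 2.7.0 used by the original poster. The names ex2 and res are hypothetical, and the subset is written with %in% instead of the original which() call.

```r
id   <- rep(c("A", "B", "C"), 2)
year <- c(rep(1970, 3), rep(1980, 3))
x    <- c(20, 30, 40, 25, 35, 45)
ex   <- data.frame(id = id, year = year, x = x, stringsAsFactors = TRUE)

# Select the A and B units; the factor still carries the level "C"
ex1 <- subset(ex, id %in% c("A", "B"))

# Rebuild the factor from the values that are actually present
ex1$id <- factor(ex1$id)
res <- tapply(ex1$x, ex1$id, mean)
res

# Alternative (R >= 2.12.0): drop unused levels from every factor column
ex2 <- droplevels(subset(ex, id %in% c("A", "B")))
levels(ex2$id)
```

Calling factor() on one column is the narrower fix; droplevels() on the whole data frame is convenient when several factor columns are affected by the same subset.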
You need to redefine the factor on ex1$id: rebuild it with factor() so that its levels are recomputed from the values that remain. Note that as.factor() will not do this, because it returns an existing factor unchanged.

Dipankar Basu wrote:
> But this gives me an NA value for the unit C, which I thought I had
> already left out. How do I ensure that the computation (in the last step)
> is limited to only the units I have selected in the first step?

-----
Yasir H. Kaheil
Catchment Research Facility
The University of Western Ontario
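A quick way to check which conversion actually recomputes the levels is to compare as.factor() and factor() on a factor that has already been subsetted. This is only a minimal sketch; the vector names f, sub, kept and dropped are invented for the illustration.

```r
f   <- factor(c("A", "B", "C", "A", "B", "C"))
sub <- f[f != "C"]             # values are A and B, but level "C" remains

kept    <- as.factor(sub)      # already a factor, so returned unchanged
dropped <- factor(sub)         # levels rebuilt from the values present

levels(kept)
levels(dropped)

# Subsetting with drop = TRUE also discards unused levels in one step
levels(f[f != "C", drop = TRUE])
```

Only factor() (or subsetting with drop = TRUE) removes the empty level, which is why tapply() stops reporting an NA group for C afterwards.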