Hi all,

I have a data management question. I am using a panel dataset read into R as a dataframe, call it "ex". The variables in "ex" are: id, year, x

id: a character string which identifies the unit
year: identifies the time period
x: the variable of interest (which might contain NAs).

Here is an example:

> id <- rep(c("A","B","C"),2)
> year <- c(rep(1970,3),rep(1980,3))
> x <- c(20,30,40,25,35,45)
> ex <- data.frame(id=id,year=year,x=x)
> ex
  id year  x
1  A 1970 20
2  B 1970 30
3  C 1970 40
4  A 1980 25
5  B 1980 35
6  C 1980 45

I want to draw a subset of "ex" by selecting only the A and B units:

> ex1 <- subset(ex[which(ex$id=="A"|ex$id=="B"),])

Now I want to do some computations on x for each unit:

> tapply(ex1$x, ex1$id, mean)
   A    B    C
22.5 32.5   NA

But this gives me an NA value for the unit C, which I thought I had already left out. How do I ensure that the computation (in the last step) is limited to only the units I have selected in the first step?

Deepankar
> I want to draw a subset of "ex" by selecting only the A and B units:
>
> > ex1 <- subset(ex[which(ex$id=="A"|ex$id=="B"),])

or a bit simpler:

ex1 <- subset(ex, ex$id %in% c('A','B'))

In your expression you don't need the subset function, as you are already using indexing to extract the desired subset. Furthermore, there is no need to use which(), because R will happily use a logical vector for indexing. Finally, I prefer the solution using %in% because it scales nicely to longer lists, where chaining '|' becomes cumbersome. So another way to put it would have been:

ex1 <- ex[ex$id %in% c('A','B'), ]

> > tapply(ex1$x, ex1$id, mean)
>    A    B    C
> 22.5 32.5   NA
>
> But this gives me an NA value for the unit C, which I thought I had
> already left out.

id is a factor, and extracting a subset does not alter the factor's set of levels, even when no rows with a given level remain:

> str(ex1)
'data.frame':   4 obs. of  3 variables:
 $ id  : Factor w/ 3 levels "A","B","C": 1 2 1 2
 $ year: num  1970 1970 1980 1980
 $ x   : num  20 30 25 35

If you want to get rid of the unused levels you can "re-build" the factor like this:

> ex1$id <- factor(ex1$id)
> str(ex1)
'data.frame':   4 obs. of  3 variables:
 $ id  : Factor w/ 2 levels "A","B": 1 2 1 2
 $ year: num  1970 1970 1980 1980
 $ x   : num  20 30 25 35
> tapply(ex1$x, ex1$id, mean)
   A    B
22.5 32.5

cu
Philipp

--
Dr. Philipp Pagel
Lehrstuhl für Genomorientierte Bioinformatik
Technische Universität München
Wissenschaftszentrum Weihenstephan
85350 Freising, Germany

and

Institut für Bioinformatik und Systembiologie / MIPS
Helmholtz Zentrum München - Deutsches Forschungszentrum für Gesundheit und Umwelt
Ingolstädter Landstrasse 1
85764 Neuherberg, Germany
http://mips.gsf.de/staff/pagel
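
An alternative to rebuilding the factor with factor() is droplevels(), which removes unused levels from every factor column of a data frame in one call. The sketch below rests on one assumption: id is stored as a factor. From R 4.0.0 onward data.frame() no longer converts strings to factors by default, so stringsAsFactors = TRUE is set explicitly here to reproduce the situation in the thread. The name "means" is only used for illustration.

```r
# Rebuild the example data. stringsAsFactors = TRUE makes id a factor,
# which is an assumption needed to reproduce the problem in R >= 4.0.0.
ex <- data.frame(id   = rep(c("A", "B", "C"), 2),
                 year = c(rep(1970, 3), rep(1980, 3)),
                 x    = c(20, 30, 40, 25, 35, 45),
                 stringsAsFactors = TRUE)

# Select the rows for units A and B, then drop factor levels
# that no longer occur in the subset.
ex1 <- droplevels(ex[ex$id %in% c("A", "B"), ])

levels(ex1$id)   # "A" "B"

# Per-unit means now cover only the selected units.
means <- tapply(ex1$x, ex1$id, mean)
means
```

If id is a plain character vector rather than a factor, tapply() groups only by the values actually present, so the extra NA column should not appear in the first place.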