thr3ads.net - R help - [R] data management question [May 2008]

Deepankar Basu

2008-May-09 04:29 UTC

[R] data management question

Hi all,

I have a data management question. I am using a panel dataset read into 
R as a dataframe, call it "ex". The variables in "ex" are:
id  year  x

id: a character string which identifies the unit
year: identifies the time period
x: the variable of interest (which might contain NAs).

Here is an example:
 > id <- rep(c("A","B","C"),2)
 > year <- c(rep(1970,3),rep(1980,3))
 > x <- c(20,30,40,25,35,45)
 > ex <- data.frame(id=id,year=year,x=x)
 > ex
  id year  x
1  A 1970 20
2  B 1970 30
3  C 1970 40
4  A 1980 25
5  B 1980 35
6  C 1980 45


I want to draw a subset of "ex" by selecting only the A and B units:

 > ex1 <- subset(ex[which(ex$id=="A"|ex$id=="B"),])

Now I want to do some computations on x for each unit:

 > tapply(ex1$x, ex1$id, mean)
   A    B    C
22.5 32.5   NA

But this gives me an NA value for the unit C, which I thought I had 
already left out. How do I ensure that the computation (in the last 
step) is limited to only the units I have selected in the first step?

Deepankar

Philipp Pagel

2008-May-09 07:01 UTC

[R] data management question

> I want to draw a subset of "ex" by selecting only the A and B units:
>
> > ex1 <- subset(ex[which(ex$id=="A"|ex$id=="B"),])

or a bit simpler:

ex1 <- subset(ex, ex$id %in% c('A','B'))

In your expression you don't need the subset function, as you are already
using indexing to extract the desired subset. Furthermore, there is no
need to use which() because R will happily use a logical vector for
indexing. Finally, I prefer the solution using %in% because it scales
nicely for longer lists where using '|' becomes cumbersome. So another
way to put it would have been:

ex1 <- ex[ex$id %in% c('A','B'), ]
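
As a rough check of the point above, the following sketch rebuilds the example data from the original post and compares the different ways of selecting the A and B units. The names `keep`, `by_which`, `by_logical`, `by_in` and `by_subset` are made up for this illustration and do not appear in the thread.

```r
# Rebuild the example data frame from the original post.
ex <- data.frame(id   = rep(c("A", "B", "C"), 2),
                 year = c(rep(1970, 3), rep(1980, 3)),
                 x    = c(20, 30, 40, 25, 35, 45))

keep <- c("A", "B")  # hypothetical helper: the units to retain

by_which   <- ex[which(ex$id == "A" | ex$id == "B"), ]  # indexing via which()
by_logical <- ex[ex$id == "A" | ex$id == "B", ]         # logical vector directly
by_in      <- ex[ex$id %in% keep, ]                     # %in% instead of '|'
by_subset  <- subset(ex, id %in% keep)                  # subset() with a condition

# All four approaches keep the same rows of the original data frame.
rownames(by_which)
rownames(by_logical)
rownames(by_in)
rownames(by_subset)
```

One difference worth keeping in mind: if id contained missing values, `ex$id == "A"` would be NA for those rows and plain logical indexing would return rows filled with NA, whereas which() drops them and %in% simply yields FALSE.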

> > tapply(ex1$x, ex1$id, mean)
>    A    B    C
> 22.5 32.5   NA
>
> But this gives me an NA value for the unit C, which I thought I had
> already left out.

id is a factor and the subset extraction does not alter the set of levels
of the factor even when no actual case of a level is left:
> str(ex1)
'data.frame':   4 obs. of  3 variables:
 $ id  : Factor w/ 3 levels "A","B","C": 1 2 1 2
 $ year: num  1970 1970 1980 1980
 $ x   : num  20 30 25 35
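
The leftover level can also be inspected directly. This is only a small sketch: it passes `stringsAsFactors = TRUE` explicitly, which is an assumption added here so that id becomes a factor regardless of the R version (since R 4.0.0, data.frame() leaves character columns as they are by default).

```r
# Example data, with id forced to be a factor (assumption: explicit
# stringsAsFactors = TRUE, so the behaviour does not depend on the R version).
ex <- data.frame(id   = rep(c("A", "B", "C"), 2),
                 year = c(rep(1970, 3), rep(1980, 3)),
                 x    = c(20, 30, 40, 25, 35, 45),
                 stringsAsFactors = TRUE)

ex1 <- ex[ex$id %in% c("A", "B"), ]

levels(ex1$id)  # still lists "A" "B" "C"
table(ex1$id)   # the count for C is 0, but the level is still there
```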

If you want to get rid of the unused levels you can "re-build" the
factor like this:
> ex1$id <- factor(ex1$id)
> str(ex1)
'data.frame':   4 obs. of  3 variables:
 $ id  : Factor w/ 2 levels "A","B": 1 2 1 2
 $ year: num  1970 1970 1980 1980
 $ x   : num  20 30 25 35
> tapply(ex1$x, ex1$id, mean)
   A    B
22.5 32.5
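
An alternative to rebuilding the factor by hand, not mentioned in the thread, is droplevels(), which is available in later R versions (2.12.0 and up) and drops unused levels from every factor column of a data frame at once. A minimal sketch, again assuming id is a factor (forced here with `stringsAsFactors = TRUE`); the name `means` is made up for the example:

```r
# Example data with id as a factor (assumption: explicit stringsAsFactors = TRUE).
ex <- data.frame(id   = rep(c("A", "B", "C"), 2),
                 year = c(rep(1970, 3), rep(1980, 3)),
                 x    = c(20, 30, 40, 25, 35, 45),
                 stringsAsFactors = TRUE)

# Subset, then drop the unused level "C" from all factor columns in one step.
ex1 <- droplevels(ex[ex$id %in% c("A", "B"), ])

levels(ex1$id)                         # only "A" and "B" remain
means <- tapply(ex1$x, ex1$id, mean)   # per-unit means, no entry for C
means
```

If the data frame holds several factor columns, this avoids calling factor() on each one separately.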
 

cu
	Philipp

-- 
Dr. Philipp Pagel
Lehrstuhl für Genomorientierte Bioinformatik
Technische Universität München
Wissenschaftszentrum Weihenstephan
85350 Freising, Germany
 
 and
 
Institut für Bioinformatik und Systembiologie / MIPS
Helmholtz Zentrum München -
Deutsches Forschungszentrum für Gesundheit und Umwelt
Ingolstädter Landstrasse 1
85764 Neuherberg, Germany
http://mips.gsf.de/staff/pagel
