thr3ads.net - R help - [R] Subset rows over multiple columns [Apr 2006]


Doran, Harold

2006-Apr-13 18:31 UTC

[R] Subset rows over multiple columns

I have a data frame where I need to subset certain rows before I compute
the mean of another variable. However, the value that I need to subset
by is found in multiple columns. For example, in the data below the
value R0000160 is found in the first and second columns (itd_1 and
itd_45). These data are student responses to multiple choice test items
from a computer adaptive test. So, the variable itd_1 denotes that item
i was presented to student k in position t, and the variables righta_a
and righta_b denote a correct (1) or incorrect (0) response to that
item when it was presented.

My goal is to get the p-value (mean of the binary variable) for each
item irrespective of when it was presented to the student.

So, in the sample case below, I would use all elements in righta_a
(except for the second to last) and then only the second to last value
in righta_b.

> tail(tt)
         itd_1   itd_45 righta_a righta_b
18407 R0000160 R0208470        1        0
18412 R0000160 R0238140        0        1
18417 R0000160 R0259690        1        1
18422 R0000160 R0000730        1        1
18450 R0113750 R0000160        1        1
18456 R0000160 R0238690        0        1
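
To make the target concrete, here is a minimal sketch that rebuilds the six rows printed above and works out the p-value for R0000160 by hand. It assumes that righta_a scores the item in itd_1 and righta_b scores the item in itd_45, as described in the post; the names `item`, `responses` and `p_value` are only illustrative.

```r
# Rebuild the six rows shown by tail(tt); item codes are kept as character.
tt <- data.frame(
  itd_1    = c("R0000160", "R0000160", "R0000160", "R0000160", "R0113750", "R0000160"),
  itd_45   = c("R0208470", "R0238140", "R0259690", "R0000730", "R0000160", "R0238690"),
  righta_a = c(1, 0, 1, 1, 1, 0),
  righta_b = c(0, 1, 1, 1, 1, 1),
  row.names = c("18407", "18412", "18417", "18422", "18450", "18456"),
  stringsAsFactors = FALSE
)

item <- "R0000160"

# Take righta_a where the item appeared as itd_1 and righta_b where it
# appeared as itd_45, then average the pooled responses.
responses <- c(tt$righta_a[tt$itd_1 == item], tt$righta_b[tt$itd_45 == item])
p_value <- mean(responses)  # 4 of the 6 pooled responses are correct
```

On these six rows the pooled responses are five values from righta_a plus the second-to-last value of righta_b, which matches the selection described above.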

One thing I can envision doing is using reshape so that itd_1 and
itd_45 would be in the "long" format. This would stack itd_1 and
itd_45 in a single column, and likewise righta_a and righta_b, and I
could then use tapply and get what I need without any subsetting.
That is

testScores <- reshape(tt, idvar='id',
    varying=list(c('itd_1', 'itd_45'), c('righta_a', 'righta_b')),
    v.names=c('item','answer'),
    timevar='item_position', direction='long')

with(testScores, tapply(answer, item, mean))

Or I could get

with(testScores, tapply(answer, list(item, item_position), mean))

The only problem here is that I have some duplicate IDs in the data and
reshape doesn't like turning data on its head in that situation, so I
would need to tinker with those first. 
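
One way to sidestep the duplicate IDs, sketched here on made-up data, is to give reshape a row-level identifier that is unique by construction instead of the student id. The column `row_id`, the `id` values and the four sample rows are assumptions for illustration only; the reshape arguments otherwise follow the call above.

```r
# Hypothetical data: student id 102 is duplicated on purpose.
tt <- data.frame(
  id       = c(101, 102, 102, 103),
  itd_1    = c("R0000160", "R0000160", "R0113750", "R0000160"),
  itd_45   = c("R0208470", "R0238140", "R0000160", "R0238690"),
  righta_a = c(1, 0, 1, 0),
  righta_b = c(0, 1, 1, 1),
  stringsAsFactors = FALSE
)

# A row counter is always unique, so it can serve as idvar even when
# the student id repeats.
tt$row_id <- seq_len(nrow(tt))

testScores <- reshape(tt, idvar = "row_id",
    varying = list(c("itd_1", "itd_45"), c("righta_a", "righta_b")),
    v.names = c("item", "answer"),
    timevar = "item_position", direction = "long")

# p-value per item, regardless of the position in which it was presented
p_values <- with(testScores, tapply(answer, item, mean))

# p-value per item and position (NA where an item never held that position)
p_by_position <- with(testScores, tapply(answer, list(item, item_position), mean))
```

The original id column is carried along unchanged in the long data, so it remains available for any later per-student summaries.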

So, I have what I think would be a solution, but I wonder if there is
another way to preserve the data in this "wide" format and get the
estimates I need. Maybe it is just easier to use reshape. Any
suggestions?

Harold
Windows XP
R 2.2.1


Gabor Grothendieck

2006-Apr-13 23:34 UTC



Try this:

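# Work on a copy with the item columns as character rather than factor
tt2 <- tt
tt2[,1] <- as.character(tt2[,1])
tt2[,2] <- as.character(tt2[,2])

# Mean response for item x: righta_a where x was shown as itd_1,
# righta_b where it was shown as itd_45
f <- function(x) with(tt2, mean(c(righta_a[x == itd_1], righta_b[x == itd_45])))
sapply(unique(unlist(tt2[,1:2])), f)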
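
A variant that also stays in the wide format, sketched here on the six sample rows from the question, is to concatenate the two item columns and the two response columns in the same order and hand both to tapply. It assumes that righta_a scores itd_1 and righta_b scores itd_45; the names `items`, `answers` and `p_values` are illustrative.

```r
# The six rows shown by tail(tt) in the question.
tt <- data.frame(
  itd_1    = c("R0000160", "R0000160", "R0000160", "R0000160", "R0113750", "R0000160"),
  itd_45   = c("R0208470", "R0238140", "R0259690", "R0000730", "R0000160", "R0238690"),
  righta_a = c(1, 0, 1, 1, 1, 0),
  righta_b = c(0, 1, 1, 1, 1, 1),
  stringsAsFactors = FALSE
)

# Concatenate items and responses in matching order, so each response
# stays paired with the item it scores. as.character() guards against
# the item columns being factors.
items   <- c(as.character(tt$itd_1), as.character(tt$itd_45))
answers <- c(tt$righta_a, tt$righta_b)

# One grouped mean per distinct item code
p_values <- tapply(answers, items, mean)
```

This makes a single pass over the data instead of one logical comparison per item, which may matter when there are many distinct items.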


