Hello,

This is not necessarily a question about R, but more about how we should display our data in general. (We will then use R to do it, once we know what to do. ;-) I have received good replies about such things on this mailing list in the past, so I'm giving it a go.

Here's what we did: we showed a fairly large number of subjects search engine queries and different possible search engine responses. We assumed that users would like some of our responses better than others and wanted to check this. Subjects could rate a query/response pair on a scale from 0 (very bad response) to 10 (very good response).

Here are all the judgments we received for one particular class of response to queries which we thought users would like:

Predicted-Good-0, 4
Predicted-Good-1, 1
Predicted-Good-2, 11
Predicted-Good-3, 8
Predicted-Good-4, 25
Predicted-Good-5, 12
Predicted-Good-6, 21
Predicted-Good-7, 25
Predicted-Good-8, 30
Predicted-Good-9, 52
Predicted-Good-10, 189

And here are all the judgments we received for one particular class of response to queries which we thought users would NOT like:

Predicted-Bad-0, 34
Predicted-Bad-1, 23
Predicted-Bad-2, 45
Predicted-Bad-3, 60
Predicted-Bad-4, 42
Predicted-Bad-5, 50
Predicted-Bad-6, 21
Predicted-Bad-7, 20
Predicted-Bad-8, 25
Predicted-Bad-9, 19
Predicted-Bad-10, 39

Here's a small table listing the number of observations, mean, standard deviation, and standard error:

Type, N, Mean, StDev, StErr
Predicted-Good, 378, 8.21693121693122, 2.47110906286224, 0.12710013550711
Predicted-Bad, 378, 4.5978835978836, 3.02059872953413, 0.155362834286119

The questions we have are:

a) It doesn't seem like our data follows a normal distribution. Is it therefore okay to calculate the mean, standard deviation, and standard error at all?

b) We initially created a figure plotting the mean and a bar around it indicating the standard deviation. Then somebody who knows more about statistics told us we should display the mean and error bars around it "to depict a 95% Confidence Interval, mean +/- 1.96*SE". But if we do this, aren't we omitting a vital part of our data, namely that we do indeed get better means for "Good" responses, but that the individual data points are all over the place (especially for "Predicted-Bad")? We would capture this by showing the standard deviation.

c) And finally: what would be the best way to present this data anyway?

Thanks a lot!
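[A minimal R sketch that reproduces the summary table from the counts above. Note that sd() in R uses the n-1 denominator, so it returns about 2.474 for the "good" group, whereas the StDev column above, which appears to use the n denominator, shows 2.471.]

# Expand the per-rating counts into one observation per response
ratings <- 0:10
good <- rep(ratings, c(4, 1, 11, 8, 25, 12, 21, 25, 30, 52, 189))
bad  <- rep(ratings, c(34, 23, 45, 60, 42, 50, 21, 20, 25, 19, 39))

# N, mean, standard deviation, and standard error for one group
summarise_group <- function(x) {
  c(N = length(x), Mean = mean(x), StDev = sd(x),
    StErr = sd(x) / sqrt(length(x)))
}

rbind("Predicted-Good" = summarise_group(good),
      "Predicted-Bad"  = summarise_group(bad))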
First compute side-by-side boxplots for the two data sets (treating the eleven per-rating counts in each group as the data). You will see that the Predicted-Good (PG) group has one extreme value (189), maybe two (also 52), whereas the Predicted-Bad (PB) group has none. The PG group will have a smaller median than the PB group. Means, standard deviations, and standard errors are legitimate statistics, but they do not have the usual (normal-theory) interpretation, at least until you can account for or eliminate the extreme values.
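[A minimal sketch of those side-by-side boxplots, reading each group's eleven per-rating counts as a data set, as the reply above does; the variable names are illustrative.]

good_counts <- c(4, 1, 11, 8, 25, 12, 21, 25, 30, 52, 189)
bad_counts  <- c(34, 23, 45, 60, 42, 50, 21, 20, 25, 19, 39)

# boxplot() flags 189 as an extreme value in the "good" group;
# 52 sits just inside the upper whisker fence
boxplot(list("Predicted-Good" = good_counts,
             "Predicted-Bad"  = bad_counts),
        ylab = "Number of responses per rating")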
mika03 wrote:
> Here's what we did: We showed a fairly large number of subjects search
> engine queries and different possible search engine responses. [...]
> Subjects could rate a query/response pair on a scale from 0 (very bad
> response) to 10 (very good response).
>
> Here are all the judgments we received for one particular class of
> response to queries which we thought users would like:
> [Predicted-Good counts snipped; see above]
>
> And here are all the judgments we received for one particular class of
> response to queries which we thought users would NOT like:
> [Predicted-Bad counts snipped; see above]

I interpret these as counts for each option on the scale 0-10.

> Here's a small table listing number of observations, mean, standard
> deviation and standard error:
> [table snipped; see above]
>
> a) It doesn't seem like our data follows a normal distribution.
> Therefore is it okay to calculate mean, standard deviation and
> standard error at all?

Yes, the mean is one way of describing the location of the aggregate response; the median is another. The calculations give sensible numbers, but ...

> b) We initially created a figure plotting the mean and a bar around it
> indicating standard deviation. Then somebody who knows more about
> statistics told us we should display the mean and error bars around it
> "to depict a 95% Confidence Interval, mean +/- 1.96*SE". But if we are
> doing this, aren't we forgetting to mention vital parts of our data
> [...]? We would capture this by showing standard deviation.

... when you start talking about confidence intervals, you have to assume that your observations come from some distribution whose distribution function is known or can be calculated. As the responses aren't normally distributed, you can't use the normal distribution function to calculate confidence intervals. You could estimate them by bootstrapping, or see below.

> c) And finally: What would be the best way to present this data anyway?
Here's a start - cmdf is a data frame with two columns, good (counts of "good" responses) and bad (counts of "bad" responses), filled in here from the data above:

# Counts from the original post, one row per rating 0-10
cmdf <- data.frame(
  good = c(4, 1, 11, 8, 25, 12, 21, 25, 30, 52, 189),
  bad  = c(34, 23, 45, 60, 42, 50, 21, 20, 25, 19, 39))

# Plot the two count distributions against the rating scale
plot(0:10, cmdf$good, pch = 1, col = 3, type = "b",
     main = "Distribution of response ratings",
     xlab = "Rating", ylab = "Count")
points(0:10, cmdf$bad, pch = 2, col = 2, type = "b")

# Mark each group's mean rating (counts expanded to raw ratings)
points(mean(rep(0:10, cmdf$good)), 150, pch = 1, col = 3)
points(mean(rep(0:10, cmdf$bad)), 150, pch = 2, col = 2)

# Median absolute deviation as a robust measure of spread
goodmad <- mad(rep(0:10, cmdf$good))
badmad  <- mad(rep(0:10, cmdf$bad))

# Horizontal bars of +/- one MAD around each mean
arrows(mean(rep(0:10, cmdf$good)) + c(-0.1, 0.1), 150,
       mean(rep(0:10, cmdf$good)) + c(-goodmad, goodmad), 150,
       angle = 90, col = 3)
arrows(mean(rep(0:10, cmdf$bad)) + c(-0.1, 0.1), 150,
       mean(rep(0:10, cmdf$bad)) + c(-badmad, badmad), 150,
       angle = 90, col = 2)

text(mean(rep(0:10, cmdf$good)), 170, "Good mean", col = 3)
text(mean(rep(0:10, cmdf$bad)), 170, "Bad mean", col = 2)

I'm being lazy here; you probably want confidence intervals, either bootstrapped or computed on the assumption that the "good" responses are exponentially distributed and the "bad" ones uniformly.

Jim
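[Following the bootstrap suggestion above, a minimal sketch of percentile bootstrap 95% intervals for the two group means, reusing the cmdf data frame defined in the previous block; boot_ci is an illustrative helper, not from the thread.]

set.seed(1)  # for reproducibility

# Resample the expanded ratings with replacement and take the
# 2.5% and 97.5% quantiles of the resampled means
boot_ci <- function(x, n_boot = 10000) {
  boot_means <- replicate(n_boot, mean(sample(x, replace = TRUE)))
  quantile(boot_means, c(0.025, 0.975))
}

boot_ci(rep(0:10, cmdf$good))  # interval for the "good" mean
boot_ci(rep(0:10, cmdf$bad))   # interval for the "bad" mean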