thr3ads.net - R help - [R] working with summarized data [Aug 2006]

Rick Bischoff

2006-Aug-30 14:27 UTC

[R] working with summarized data

The data sets I am working with all have a weight variable, i.e.,
each row doesn't represent exactly 1 observation.

With that in mind, nearly all of the graphs and summary statistics
are incorrect for my data, because they don't take the weight into
account.

****
For example "median" is incorrect, as the quantiles aren't calculated
with weights:

sum( weights[X < median(X)] ) / sum(weights)

This should be 0.5... of course it's not.
****
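
The check above can be tried out in base R. What follows is only a minimal sketch: `weighted_median` is a hypothetical helper (not a base R function), and the data are made up. Note that with discrete data the share strictly below even a weighted median is usually not exactly 0.5; the property to expect is that the share below is at most 0.5 and the share at or below is at least 0.5.

```r
# Hypothetical helper: smallest x whose cumulative weight share reaches 0.5
weighted_median <- function(x, w) {
  ord <- order(x)
  x <- x[ord]
  w <- w[ord]
  cum_share <- cumsum(w) / sum(w)
  x[which(cum_share >= 0.5)[1]]
}

# Made-up data: the last row stands for 10 observations
X <- c(10, 20, 30, 40, 50)
weights <- c(1, 1, 1, 1, 10)

m_plain <- median(X)                  # ignores the weights: 30
m_wtd <- weighted_median(X, weights)  # uses the weights: 50

# The check from the post, for both candidates
share_below_plain <- sum(weights[X < m_plain]) / sum(weights)     # 2/14
share_below_wtd <- sum(weights[X < m_wtd]) / sum(weights)         # 4/14
share_at_or_below_wtd <- sum(weights[X <= m_wtd]) / sum(weights)  # 14/14
```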

Unfortunately, it seems that most (all?) of R's graphics and summary
statistic functions don't take a weight or frequency argument.
(Fortunately the models do...)

Am I completely missing how to do this?  One way would be to
replicate each row in proportion to its weight (e.g. if the weight was
4, we would add 3 additional copies), but this will get prohibitive
pretty quickly as the dataset grows.
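
For reference, the replication approach can be sketched in base R as below. It assumes positive integer weights; the data frame `d` and its column names are made up for illustration.

```r
# Made-up summarized data: w is the number of observations each row stands for
d <- data.frame(x = c(10, 20, 30), w = c(4, 1, 2))

# Repeat each row index w times, then index the data frame with the result
expanded <- d[rep(seq_len(nrow(d)), times = d$w), , drop = FALSE]

n_expanded <- nrow(expanded)       # equals sum(d$w), here 7
med_expanded <- median(expanded$x) # unweighted functions now see every copy
```

The expanded data frame has `sum(d$w)` rows, which is why this stops being practical once the weights or the number of rows get large.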


Thanks in advance!

Gabor Grothendieck

2006-Aug-30 14:50 UTC

In each case, look around (help.search,
RSiteSearch) to see if you can find a function
that handles weights.  For the case you mention,
medians, it can be done via quantile regression:

	library(quantreg)
	x <- w <- 1:5
	# intercept-only fit at the default tau = 0.5 gives the weighted median
	coef(rq(x ~ 1, weights = w))
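
As a rough illustration of why this works: an intercept-only median regression picks the value that minimizes the weighted sum of absolute deviations, and that minimizer is a weighted median. The sketch below checks this by brute force in base R on the same toy data; the helper name `wtd_abs_loss` is made up, and the quantreg call only runs if the package happens to be installed.

```r
x <- w <- 1:5

# Hypothetical helper: weighted sum of absolute deviations from a candidate m
wtd_abs_loss <- function(m) sum(w * abs(x - m))

# Evaluate the loss at each observed value and keep the minimizer
losses <- sapply(x, wtd_abs_loss)
best <- x[which.min(losses)]

# Optional comparison with quantile regression, if quantreg is available
if (requireNamespace("quantreg", quietly = TRUE)) {
  print(coef(quantreg::rq(x ~ 1, weights = w)))
}
```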

______________________________________________
R-help at stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Rick Bischoff

2006-Aug-30 15:30 UTC

>> Unfortunately, it seems that most(all?) of R's graphics and summary
>> statistic functions don't take a weight or frequency argument.
>> (Fortunately the models do...)
>
> I have been meaning to add this functionality to my graphics
> package ggplot (http://had.co.nz/ggplot), but unfortunately haven't
> had time yet.  I'm guessing you want something like:
>
> * scatterplot: scale size of point according to weight (can do)
> * bar chart: bars should have height proportional to weight (can do)
> * histogram: area proportional to weighting variable (have some
> half-finished code to do)
> * smoothers: should automatically use weights
> * boxplot: use weighted quantiles/letter statistics (is there a
> function for that?)
>
> What else is there?

densityplot is the only other one I can think of at the moment...
With the rest of those, I could certainly live without it though!
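
In the meantime, two of the plots discussed here (histogram and density) can be approximated in base R. This is only a sketch with made-up data: the bin edges and the bandwidth are arbitrary choices, and the bandwidth is fixed by hand because automatic bandwidth selection does not take the weights into account.

```r
# Made-up data: w is the weight attached to each value of x
x <- c(1.2, 1.8, 2.5, 3.1, 3.3, 4.7)
w <- c(5, 1, 2, 8, 1, 3)

# Weighted histogram: sum the weights falling in each bin
breaks <- 1:5
bins <- cut(x, breaks = breaks, right = FALSE)
wtd_counts <- xtabs(w ~ bins)

# Rescale so that bar areas sum to 1, as in a density-scale histogram
bar_heights <- as.vector(wtd_counts) / (sum(w) * diff(breaks))

# Weighted kernel density: density() expects weights that sum to 1
dens <- density(x, weights = w / sum(w), bw = 0.5)

# Draw only in an interactive session
if (interactive()) {
  barplot(bar_heights, names.arg = levels(bins), space = 0,
          ylab = "weighted density")
  plot(dens, main = "Weighted kernel density")
}
```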

Thanks!

Greg Snow

2006-Aug-30 16:28 UTC

There are functions to do weighted summary statistics in the Hmisc
package (wtd.quantile, ...).

For more complicated analyses (but not plots yet), the biglm package has
a bigglm function that expects the data in chunks; you could write a
function that expands parts of the dataset at a time.
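
A possible shape for such a chunk-expanding function, as a sketch only: `make_chunk_reader` and the toy data are made up, positive integer weights are assumed, and the `reset` convention (return the next chunk, or NULL when the data are exhausted) follows my understanding of what bigglm expects from a data function, so check ?bigglm before relying on it.

```r
# Hypothetical factory: returns a reader that expands a few rows at a time
make_chunk_reader <- function(d, weight_col, rows_per_chunk) {
  starts <- seq(1, nrow(d), by = rows_per_chunk)
  i <- 0
  function(reset = FALSE) {
    if (reset) {
      i <<- 0            # rewind to the first chunk
      return(NULL)
    }
    i <<- i + 1
    if (i > length(starts)) return(NULL)  # no data left
    rows <- starts[i]:min(starts[i] + rows_per_chunk - 1, nrow(d))
    piece <- d[rows, , drop = FALSE]
    # Expand only this piece: repeat each row by its weight
    piece[rep(seq_len(nrow(piece)), times = piece[[weight_col]]), , drop = FALSE]
  }
}

# Made-up summarized data
d <- data.frame(y = c(1, 3, 2, 5), x = c(1, 2, 3, 4), w = c(2, 1, 3, 1))
reader <- make_chunk_reader(d, "w", rows_per_chunk = 2)

# Walk through all chunks and count the expanded rows
reader(reset = TRUE)
total <- 0
while (!is.null(chunk <- reader())) total <- total + nrow(chunk)

# Assumed usage with biglm (not run here):
# fit <- biglm::bigglm(y ~ x, data = reader)
```

Only one expanded chunk is held in memory at a time, which is the point of the approach.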

Hope this helps, 


-- 
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
greg.snow at intermountainmail.org
(801) 408-8111
 


Anupam Tyagi

2006-Aug-31 05:12 UTC

One solution is to simulate the population by repeating each row
"weight" number of times. This is inefficient: it may create a very
large dataset for a large sample survey. But some of the graphs and
other things may turn out to your liking, depending upon how the
functions are written.

Anupam.


Anupam Tyagi

2006-Sep-29 08:06 UTC

Hi Rick,

I came across your posting that I had replied to. I had assumed from
your posting that you had positive integer weights, and that you had a
certain kind of stratified sampling. For the general case, you may want
to look at the "survey" package. Graphical representation of survey
data, especially for large surveys, is a good research issue in
statistical graphics. R seems to be suitable for doing this kind of work.
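
A minimal sketch of what that could look like with non-integer weights. The data are made up, the design is assumed to be a simple one with no clusters or strata (a real survey would need its actual design specified), and the survey calls only run if the package is installed.

```r
# Made-up data with non-integer sampling weights
d <- data.frame(x = c(2, 4, 6, 8), w = c(1.5, 0.5, 2.0, 1.0))

# Base R reference value for the weighted mean
base_mean <- weighted.mean(d$x, d$w)

if (requireNamespace("survey", quietly = TRUE)) {
  # Assumption: independent sampling, no clustering or stratification
  des <- survey::svydesign(ids = ~1, weights = ~w, data = d)
  print(survey::svymean(~x, des))                       # weighted mean
  print(survey::svyquantile(~x, des, quantiles = 0.5))  # weighted median
}
```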

Anupam.

