The data sets I am working with all have a weight variable, i.e., each row doesn't mean 1 observation.

With that in mind, nearly all of the graphs and summary statistics are incorrect for my data, because they don't take the weights into account.

****
For example, "median" is incorrect, as the quantiles aren't calculated with weights:

    sum( weights[X < median(X)] ) / sum(weights)

This should be 0.5... of course it's not.
****

Unfortunately, it seems that most (all?) of R's graphics and summary statistic functions don't take a weight or frequency argument. (Fortunately the models do...)

Am I completely missing how to do this? One way would be to replicate each row in proportion to its weight (e.g. if the weight was 4, we would add 3 additional copies), but this will get prohibitive pretty quickly as the dataset grows.

Thanks in advance!
In each case, look around (help.search, RSiteSearch) to see if you can find a function that handles weights. For the case you mention, medians, it can be done via quantile regression:

    x <- w <- 1:5
    library(quantreg)
    coef(rq(x ~ 1, weights = w))

On 8/30/06, Rick Bischoff <rdbisch at gmail.com> wrote:
> [...]
> For example "median" is incorrect, as the quantiles aren't calculated
> with weights:
>
> sum( weights[X < median(X)] ) / sum(weights)
>
> This should be 0.5... of course it's not.
> [...]
>
> ______________________________________________
> R-help at stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
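For comparison, here is a minimal base-R sketch of a weighted median that does not need quantreg. The helper name `weighted_median` is made up for illustration, and it uses the "smallest value whose cumulative weight reaches half the total" convention, which is only one of several possible definitions. It reuses the toy data `x <- w <- 1:5` from the reply and runs the check from the original post.

```r
# Hypothetical helper (not from any package): weighted median as the
# smallest x whose cumulative weight reaches half of the total weight.
weighted_median <- function(x, w) {
  stopifnot(length(x) == length(w), all(w >= 0), sum(w) > 0)
  ord <- order(x)
  x <- x[ord]
  w <- w[ord]
  x[which(cumsum(w) >= sum(w) / 2)[1]]
}

x <- w <- 1:5

m_unweighted <- median(x)                # ignores the weights
m_weighted   <- weighted_median(x, w)

# Share of the total weight strictly below / at or below each candidate
share_below_unweighted <- sum(w[x <  m_unweighted]) / sum(w)
share_below_weighted   <- sum(w[x <  m_weighted])   / sum(w)
share_upto_weighted    <- sum(w[x <= m_weighted])   / sum(w)
```

With discrete data the share strictly below the median will rarely be exactly 0.5; the weighted median is the value at which the cumulative share crosses 0.5. For these data the intercept from the `rq` call should point at the same value.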
>> Unfortunately, it seems that most(all?) of R's graphics and summary
>> statistic functions don't take a weight or frequency argument.
>> (Fortunately the models do...)
>
> I have been meaning to add this functionality to my graphics
> package ggplot (http://had.co.nz/ggplot), but unfortunately haven't
> had time yet. I'm guessing you want something like:
>
> * scatterplot: scale size of point according to weight (can do)
> * bar chart: bars should have height proportional to weight (can do)
> * histogram: area proportional to weighting variable (have some
>   half-finished code to do)
> * smoothers: should automatically use weights
> * boxplot: use weighted quantiles/letter statistics (is there a
>   function for that?)
>
> What else is there?

densityplot is the only other one I can think of at the moment... With the rest of those, I could certainly live without it though!

Thanks!
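Two of the plots in the list above can be approximated in base R. The sketch below uses made-up data: it computes weighted histogram heights by summing the weights per bin (so bar area is proportional to weight), and it uses the `weights` argument of `stats::density()` for a weighted density estimate. The variable names, the bins, and the bandwidth choice are assumptions for illustration only.

```r
set.seed(1)
x <- rnorm(200)
w <- runif(200, min = 0.5, max = 4)      # made-up, non-integer weights

# Weighted histogram: sum the weights falling in each bin, then scale so
# that the total bar area is 1 (area proportional to weight).
breaks <- pretty(range(x), n = 10)
bin    <- cut(x, breaks = breaks, include.lowest = TRUE)
w_bin  <- tapply(w, bin, sum)
w_bin[is.na(w_bin)] <- 0                 # empty bins get zero weight
heights <- as.numeric(w_bin) / (sum(w) * diff(breaks))

# Weighted kernel density: density() expects weights that sum to 1.
# The bandwidth here is picked from the unweighted x, a simplification.
d <- density(x, bw = bw.nrd0(x), weights = w / sum(w))

# Draw both only in an interactive session
if (interactive()) {
  plot(range(breaks), c(0, max(heights, d$y)), type = "n",
       xlab = "x", ylab = "weighted density")
  rect(breaks[-length(breaks)], 0, breaks[-1], heights, col = "grey80")
  lines(d)
}
```

For a weighted scatterplot, something like `plot(x, y, cex = sqrt(w))` is a common shortcut, though how to scale the point size is a matter of taste.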
There are functions to do weighted summary statistics in the Hmisc package (wtd.quantile, ...).

For more complicated analyses (but not plots yet), the biglm package has a bigglm function that expects the data in chunks; you could write a function that expands parts of the dataset at a time.

Hope this helps,

--
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
greg.snow at intermountainmail.org
(801) 408-8111

-----Original Message-----
From: r-help-bounces at stat.math.ethz.ch [mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of Rick Bischoff
Sent: Wednesday, August 30, 2006 8:28 AM
To: r-help at stat.math.ethz.ch
Subject: [R] working with summarized data

[...]

Unfortunately, it seems that most(all?) of R's graphics and summary statistic functions don't take a weight or frequency argument. (Fortunately the models do...)

[...]
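The "expand parts of the dataset at a time" idea might look roughly like the following base-R sketch, which assumes positive integer weights. `make_chunk_reader` and its arguments are invented for this illustration; the reset-then-read pattern is modelled loosely on the function-style `data` argument described in the biglm documentation, so check `?bigglm` before relying on it. Here the chunks are only used to accumulate a mean, so that the result can be compared with `weighted.mean()`.

```r
# Made-up summarized data with positive integer weights
d <- data.frame(x = c(2.5, 3.0, 4.5, 6.0, 7.5, 9.0),
                w = c(4L, 1L, 3L, 2L, 5L, 1L))

# Hypothetical helper: returns a function that hands out the data a few
# rows at a time, each chunk expanded to one row per unit of weight.
make_chunk_reader <- function(data, weight_col, rows_per_chunk = 2) {
  pos <- 0
  function(reset = FALSE) {
    if (reset) {
      pos <<- 0                                   # start over
      return(invisible(NULL))
    }
    if (pos >= nrow(data)) return(NULL)           # no more data
    idx <- seq(pos + 1, min(pos + rows_per_chunk, nrow(data)))
    pos <<- max(idx)
    chunk <- data[idx, , drop = FALSE]
    expanded <- chunk[rep(seq_len(nrow(chunk)), chunk[[weight_col]]), ,
                      drop = FALSE]
    expanded[[weight_col]] <- NULL                # weights are now implicit
    expanded
  }
}

reader <- make_chunk_reader(d, "w")
reader(reset = TRUE)

# Accumulate a running sum and count; only one expanded chunk is held
# in memory at a time.
total <- 0
n <- 0
while (!is.null(chunk <- reader())) {
  total <- total + sum(chunk$x)
  n <- n + nrow(chunk)
}
chunked_mean <- total / n
```

The same reader could in principle feed any routine that consumes data chunk by chunk; whether it can be passed to `bigglm` unchanged is an assumption that has not been tested here.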
One solution is to simulate the population by repeating each row "weight" number of times. This is inefficient: it may create a very large dataset for a large sample survey. But some of the graphs and other things may turn out to your liking, depending upon how the functions are written.

Anupam.

Rick Bischoff wrote the following on 8/30/2006 7:57 PM:
> [...]
> Am I completely missing how to do this? One way would be to
> replicate each row proportional to the weight (e.g. if the weight was
> 4, we would add 3 additional copies) but this will get prohibitive pretty
> quickly as the dataset grows.
> [...]
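For positive integer weights, the row-replication idea takes one line of base R. The sketch below uses made-up data and shows that ordinary functions then give weighted answers, at the cost of a data frame with `sum(w)` rows.

```r
d <- data.frame(x = c(10, 20, 30, 40),
                w = c(1L, 4L, 2L, 3L))    # made-up integer weights

# Repeat each row index w times, then index the data frame with the result
expanded <- d[rep(seq_len(nrow(d)), times = d$w), , drop = FALSE]

n_rows          <- nrow(expanded)         # equals sum(d$w)
mean_expanded   <- mean(expanded$x)       # matches weighted.mean(d$x, d$w)
median_expanded <- median(expanded$x)     # median of the expanded column
```

This does not carry over to fractional weights, and the memory cost grows with the total weight rather than with the number of rows.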
Hi Rick,

I came across your posting that I had replied to. I had assumed from your posting that you had positive integer weights, and that you had a certain kind of stratified sampling. For the general case, you may want to look at the "survey" package.

Graphical representation of survey data, especially from large surveys, is a good research issue in statistical graphics. R seems to be suitable for doing this kind of work.

Anupam.

Anupam Tyagi wrote the following on 8/31/2006 10:40 AM:
> One solution is to simulate the population by repeating each row
> "weight" number of times. This is inefficient. It may create a very
> large dataset for a large sample survey.
> [...]