On 3/15/2006 8:31 AM, Vivek Satsangi wrote:
> Folks,
> Normally, in a data frame, one observation counts as one observation
> of the distribution. Thus one can easily produce a CDF and (in Splus
> at least) use cdf.compare to compare the CDF (BTW: what is the R
> equivalent of the SPlus cdf.compare() function, if any?)
>
> However, if each point should not count equally, how can I weight the
> points before comparing the distributions? I was thinking of somehow
> creating multiple observations for each actual observation based on
> weights and creating a new dataframe etc. -- but that seems excessive.
> Surely there is a simpler way?
>
>> x <- rnorm(100)
>> y <- rnorm(10)
>> xw <- rnorm(100) * 1.73 # The weights. These won't add up to 1
>> # or N or anything because of missing values.
>> yw <- rnorm(10) * 6.23 # The weights. These won't add up to 1 or
>> # to the same number as xw.
>> # The question to answer is, how can I create a qq plot or cdf compare
>> # of x vs. y, weighted by their weights, xw and yw (to eventually figure
>> # out if y comes from the population x, similar to Kolmogorov-Smirnov GOF)?
>> qqplot(x,y) # What now?
qqplot doesn't support weights, but it's a simple enough function that
you could write a version that did. Look at the cases where length(x)
is not equal to length(y): e.g. if length(y) < length(x), qqplot
constructs a linear approximation to a function mapping 1:nx onto the
sorted x values, then takes length(y) evenly spaced values from that
function. You want to do the same sort of thing, except that instead of
even spacing, you want to look at the cumulative sums of the weights.
You might want to use some kind of graphical indicator of whether points
are heavily weighted or not, but I don't know what to recommend for that.
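By the way, your example above will give negative weights in xw and yw,
because rnorm() returns values of either sign. The cumulative sums of the
weights would then not be increasing, so you probably won't like the
results if you do that.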
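
Here is a rough sketch of what such a weighted version might look like,
assuming strictly positive weights. The name wqqplot, its arguments, and
the midpoint plotting positions (cumsum(w) - w/2) / sum(w) are choices made
for this illustration and are not part of base R. Scaling the point size
by weight is only one possible way to flag heavily weighted points.

```r
# Hypothetical weighted QQ plot; an illustration, not the base R qqplot().
wqqplot <- function(x, y, wx = rep(1, length(x)), wy = rep(1, length(y)),
                    plot.it = TRUE) {
  stopifnot(length(x) == length(wx), length(y) == length(wy))
  # Drop observations whose value or weight is missing.
  okx <- !is.na(x) & !is.na(wx)
  oky <- !is.na(y) & !is.na(wy)
  x <- x[okx]; wx <- wx[okx]
  y <- y[oky]; wy <- wy[oky]
  # Assumption: weights must be strictly positive, so that the
  # cumulative sums below are strictly increasing.
  if (any(wx <= 0) || any(wy <= 0)) stop("weights must be positive")

  # Sort each sample and carry its weights along.
  ox <- order(x); oy <- order(y)
  sx <- x[ox]; wx <- wx[ox]
  sy <- y[oy]; wy <- wy[oy]

  # Cumulative-weight positions in (0, 1); with equal weights these
  # reduce to (i - 0.5) / n.
  px <- (cumsum(wx) - wx / 2) / sum(wx)
  py <- (cumsum(wy) - wy / 2) / sum(wy)

  # Interpolate the longer sample at the positions of the shorter one.
  if (length(sy) <= length(sx)) {
    qx <- approx(px, sx, xout = py, rule = 2)$y
    qy <- sy
    w <- wy
  } else {
    qx <- sx
    qy <- approx(py, sy, xout = px, rule = 2)$y
    w <- wx
  }

  if (plot.it) {
    # Point size grows with the weight of the plotted observation.
    plot(qx, qy, cex = sqrt(w / mean(w)),
         xlab = "weighted quantiles of x", ylab = "weighted quantiles of y")
  }
  invisible(list(x = qx, y = qy))
}

# Usage with positive weights (runif() instead of rnorm()).
set.seed(42)
x <- rnorm(100)
y <- rnorm(10)
xw <- runif(100) * 1.73
yw <- runif(10) * 6.23
wqqplot(x, y, xw, yw)
```

With equal weights and samples of equal length this reduces to plotting
sorted x against sorted y. For unequal lengths it will not reproduce
qqplot() exactly, because the plotting positions differ.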
Duncan Murdoch