tpw at google.com
2009-Jul-22 20:45 UTC
[Rd] ks.test - The two-sample two-sided Kolmogorov-Smirnov test with ties (PR#13848)
Full_Name: Thomas Waterhouse
Version: 2.9.1
OS: OS X 10.5.7
Submission from: (NULL) (216.239.45.4)

ks.test uses a biased approximation to the p-value in the case of the two-sample test with ties in the pooled data. This has been justified in R in the past by the argument that the KS test assumes a continuous distribution. But the two-sample test can be extended to arbitrary distributions by a combinatorial argument (below), which I would like to see implemented in R.

The p-value of the test is simply the probability that a random assignment of the pooled data to two sets of the sizes of the input data sets has a KS test statistic at least as great as that of the input data. The function psmirnov2x in ks.c calculates this probability by enumerating all such assignments as lattice walks, except that it doesn't know how to handle tied data points.

The correct procedure can be deduced by considering that steps in such lattice walks represent steps along the x-axis in computing the empirical CDFs of the two data sets. The trick is to consider all occurrences of a repeated data point together when determining whether a path exceeds the input statistic.

A FORTRAN implementation of this algorithm was published by Nikiforov (http://www.ams.sunysb.edu/~nkfr/5PUBL.HTM), but it's not for the faint of heart. It would be straightforward to change psmirnov2x to include this logic, but I found it easier simply to rewrite it (with comments this time!)
as follows:

    void psmirnov2x(double *x, Sint *m, Sint *n, double *pool)
    {
        /* Inputs:
         *   *x is the value of the KS test statistic (from 0.0 to 1.0)
         *   *m and *n are the lengths of each data set
         *   *pool is an array of length *m + *n formed by combining the
         *     two data sets and sorting in increasing order
         *
         * Output:
         *   *x is unity minus the p-value of the test
         *
         * Notes:
         *   The p-value of the two-sample KS test is the proportion of all
         *   assignments of the pooled data to sets of sizes *m and *n which
         *   have test statistic exceeding the test statistic of the actual
         *   data.  To calculate this, we interpret each assignment as a
         *   lattice walk from the origin to (*m, *n), where the i-th step
         *   represents the i-th data point in the pooled data set.  A step
         *   to the right represents assigning a data point to the *m set,
         *   and a step up represents assigning a data point to the *n set.
         *   The difference between the empirical CDFs of each set can be
         *   computed at each step, and the KS test statistic is simply the
         *   maximum of this difference along a given walk.  The p-value is
         *   obtained by counting the number of walks that do not exceed the
         *   input value (since this is easier than counting the number that
         *   do) and dividing by the total number of walks; this is what we
         *   return to the caller.
         *
         *   If the same value occurs multiple times in the pooled data, we
         *   need to count all multiplicities at the same time when computing
         *   the empirical CDFs.  In terms of the lattice walk, this means we
         *   look only at the final occurrence of each value when determining
         *   whether or not a walk exceeds the input statistic.
         */
        double md, nd, q, *u, w;
        Sint i, j;

        if (*x == 0.0) {
            *x = 1.0;
            return;
        }
        md = (double) (*m);
        nd = (double) (*n);
        q = floor(*x * md * nd - 1e-7) / (md * nd); /* just *x minus an epsilon */
        u = (double *) R_alloc(*n + 1, sizeof(double));
        for (i = 0; i <= *m; i++) {
            for (j = 0; j <= *n; j++) {
                /* At each step up, rescale by w := j/(j+*m); this is
                   equivalent to dividing by C(*m+*n, *m) at the end,
                   but without the risk of overflow */
                w = (double)(j) / (double)(j + *m);
                if (i + j == 0) {
                    /* Start with 1 walk from the origin to the origin */
                    u[j] = 1.0;
                } else if (i + j < *m + *n && pool[i + j - 1] != pool[i + j]
                           && fabs(i / md - j / nd) > q) {
                    /* This walk meets or exceeds *x, so don't count it */
                    u[j] = 0.0;
                } else if (i == 0) {
                    /* The *n-edge: the number of walks to (0, j) is the
                       same as the number of walks to (0, j-1) */
                    u[j] = u[j - 1] * w;
                } else if (j == 0) {
                    /* The *m-edge: the number of walks to (i, 0) is the
                       same as the number of walks to (i-1, 0), which u[0]
                       already holds, so this is a no-op */
                } else {
                    /* The general case: the number of walks to (i, j) is
                       the number of walks to (i-1, j) plus the number of
                       walks to (i, j-1) */
                    u[j] = u[j] + u[j - 1] * w;
                }
            }
        }
        /* Return the proportion of walks that do not exceed the input
           statistic: */
        *x = u[*n];
    }

ks.test.R must also be changed as per the following diff, which includes a corrected calculation of the test statistic:

    51d50
    <     TIES <- FALSE
    54,59c53,57
    <     z <- cumsum(ifelse(order(w) <= n.x, 1 / n.x, - 1 / n.y))
    <     if(length(unique(w)) < (n.x + n.y)) {
    <         warning("cannot compute correct p-values with ties")
    <         z <- z[c(which(diff(sort(w)) != 0), n.x + n.y)]
    <         TIES <- TRUE
    <     }
    ---
    >     steps = which(diff(sort(w)) != 0)
    >     if(length(steps) > 0)
    >         z <- cumsum(ifelse(order(w) <= n.x, 1 / n.x, - 1 / n.y))[steps]
    >     else
    >         z <- c(0)
    68c66
    <     if(exact && (alternative == "two.sided") && !TIES)
    ---
    >     if(exact && (alternative == "two.sided"))
    72a71
    >     sort(w),

Finally, the changed signature of psmirnov2x needs to be reflected in ctest.h and init.c.
I have built and tested the above implementation against 20 pairs of data sets for which I calculated the test statistics and p-values by brute force or Nikiforov's algorithm. I can provide these test sets if desired.