On Tue, 2005-06-28 at 13:03 -0400, ferdinand principia
wrote:> Hi,
>
> I need to calculate Kendall's tau for large data
> vectors (length > 100'000).
> Is somebody aware of a faster algorithm or package
> function than "cor(, method="kendall")"?
> There are ties in the data to be considered (Kendall's
> tau-b).
>
> Any suggestions?
>
> Regards
> Ferdinand
The time intensive part of the process is typically the ranking/ordering
of the vector pairs to calculate the numbers of concordant and
discordant pairs.
If the number of _unique pairs_ in your data is substantially less than
the number of total pairs (in other words, creating a smaller 2d
contingency table from a pair of your vectors makes sense), then the
following may be of help.
# Calculate CONcordant Pairs in a table
# cycle through x[r, c] and multiply by
# sum(x elements below and to the right of x[r, c])
# x = table
concordant <- function(x)
{
x <- matrix(as.numeric(x), dim(x))
# get sum(matrix values > r AND > c)
# for each matrix[r, c]
mat.lr <- function(r, c)
{
lr <- x[(r.x > r) & (c.x > c)]
sum(lr)
}
# get row and column index for each
# matrix element
r.x <- row(x)
c.x <- col(x)
# return the sum of each matrix[r, c] * sums
# using mapply to sequence thru each matrix[r, c]
sum(x * mapply(mat.lr, r = r.x, c = c.x))
}
# Calculate DIScordant Pairs in a table
# cycle through x[r, c] and multiply by
# sum(x elements below and to the left of x[r, c])
# x = table
discordant <- function(x)
{
x <- matrix(as.numeric(x), dim(x))
# get sum(matrix values > r AND < c)
# for each matrix[r, c]
mat.ll <- function(r, c)
{
ll <- x[(r.x > r) & (c.x < c)]
sum(ll)
}
# get row and column index for each
# matrix element
r.x <- row(x)
c.x <- col(x)
# return the sum of each matrix[r, c] * sums
# using mapply to sequence thru each matrix[r, c]
sum(x * mapply(mat.ll, r = r.x, c = c.x))
}
# Calculate Kendall's Tau-b
# x = table
calc.KTb <- function(x)
{
x <- matrix(as.numeric(x), dim(x))
c <- concordant(x)
d <- discordant(x)
n <- sum(x)
SumR <- rowSums(x)
SumC <- colSums(x)
KTb <- (2 * (c - d)) / sqrt(((n ^ 2) -
(sum(SumR ^ 2))) * ((n ^ 2) -
(sum(SumC ^ 2))))
KTb
}
Note that I made some modifications of the above, relative to prior
versions that I have posted to handle large numbers of pairs to avoid
integer overflows in summations. Hence the:
x <- matrix(as.numeric(x), dim(x))
conversion in each function.
Now, create some random test data, with 100,000 elements in each vector,
sampling from 'letters', which would yield a 26 x 26 table:
a <- sample(letters, 100000, replace = TRUE)
b <- sample(letters, 100000, replace = TRUE)
> dim(table(a, b))
[1] 26 26
> system.time(print(calc.KTb(table(a, b))))
[1] 0.0006906088
[1] 0.77 0.02 0.83 0.00 0.00
Note that in the above, the initial table takes most of the time:
> system.time(table(a, b))
[1] 0.55 0.00 0.56 0.00 0.00
Hence:
> tab.ab <- table(a, b)
> system.time(print(calc.KTb(tab.ab)))
[1] 0.0006906088
[1] 0.25 0.01 0.27 0.00 0.00
I should note that I also ran:
> system.time(print(cor(a, b, method = "kendall")))
[1] 0.0006906088
[1] 694.80 7.72 931.89 0.00 0.00
Nice to know the results work out at least... :-)
I have not tested with substantially larger 2d matrices, but would
envision that as the dimensions of the resultant tabulation increases,
my method probably approaches and may even become less efficient than
the approach implemented in cor(). Some testing would validate this and
perhaps point to coding the concordant() and discordant() functions in C
for improvement in timing.
HTH,
Marc Schwartz