There is a paper
(1) A new metric for categorical data, by
S. H. Al-Harbi, G. P. McKeown, and V. J. Rayward-Smith, in
Statistical Data Mining and Knowledge Discovery,
ed. Hamparsum Bozdogan, CRC Press, 2004.
This paper introduced a distance measure Dcv for clustering.
I have _not_ read this paper myself, and do not as yet have access
to a copy of it, but I'm trying to make sense of another paper
which criticises it on the grounds that it does rather badly on both
simulated data and one real data set (the Mushroom data set).
Call this other paper (2).
According to (2) (which is not yet published or I'd cite it),
Dcv(x,y) = sqrt( Dh(x,y) . C^-1 . Dh(x,y)^T )
where
x and y are rows of the same data frame,
Dh(x,y) = ifelse(x == y, 0, 1) (so that sum(Dh(x,y)) is Hamming distance),
C is the matrix of Cramer's V scores between the columns of the
data frame.
This is an analogue of Mahalanobis distance, with C playing the role of
the variance-covariance matrix.
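To pin down my reading of that formula, here is a sketch of Dcv in R; the
function dcv() and its argument names are mine, not from either paper:

```r
# My reading of the Dcv formula as reported in (2).
dcv <- function (x, y, CI) {
  # x, y: two rows of the data frame (equal-length vectors)
  # CI:   the inverse of the matrix C of Cramer's V scores
  d <- ifelse(x == y, 0, 1)   # Dh(x, y)
  sqrt(drop(d %*% CI %*% d))  # quadratic form, Mahalanobis-style
}
```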
I wanted to try this out, so I wrote some R code.
In particular, I couldn't persuade myself that Dcv(x,y) was always
defined. What if, for example, some element on the diagonal of C inverse
was negative? Then for x,y differing only in the corresponding property,
the argument of sqrt() would be negative. My mathematical skills were not
up to the task of proving that it couldn't happen, so experiment seemed
like a good idea.
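A toy example (made-up numbers, not the mushroom data) shows what would go
wrong if the inverse had a negative diagonal entry:

```r
# Hypothetical inverse with one negative diagonal entry.
CI <- diag(c(1, -0.5, 1))
d  <- c(0, 1, 0)      # x and y differ only in column 2
drop(d %*% CI %*% d)  # -0.5, so sqrt() of it is NaN
```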
First I read in the Mushroom data (8124 rows x 23 columns)
and deleted column 1 (the poisonous/edible class, to be predicted,
not clustered on) and column 17 (which has the same value in each row).
"mush" is the resulting data frame.
cramer <- function (x, y) {
  # x and y are factors of the same length.
  # cramer(x, y) is Cramer's V, measuring the association between x and y.
  n <- length(x)                        # number of observations
  o <- table(x, y)                      # observed table
  e <- outer(rowSums(o), colSums(o)/n)  # expected table under independence
  X <- sum((o - e)^2 / e)               # chi-squared statistic
  sqrt(X / (n * (min(dim(o)) - 1)))     # Cramer's V
}
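Two quick sanity checks I ran on cramer() (my own checks, assuming the
definition above): V should be 1 for a variable against itself and 0 for
variables that are exactly independent in the table.

```r
x <- factor(c("a", "a", "b", "b"))
y <- factor(c("c", "d", "c", "d"))
cramer(x, x)  # perfect association: 1
cramer(x, y)  # exact independence in this table: 0
```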
C <- matrix(nrow = 21, ncol = 21)
for (i in 1:21) {
  for (j in i:21) {
    x <- cramer(mush[, i], mush[, j])
    C[i, j] <- x
    C[j, i] <- x
  }
}
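For what it's worth, the same matrix can be built without explicit loops
(a sketch; unlike the loop above, this recomputes each off-diagonal pair
twice):

```r
# Loop-free construction of C from the columns of mush.
idx <- seq_len(ncol(mush))
C2  <- outer(idx, idx,
             Vectorize(function (i, j) cramer(mush[, i], mush[, j])))
```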
CI <- solve(C) # find inverse of C
I checked my understanding of solve() by trying CI%*%C and C%*%CI and
got the identity matrix to good precision.
Now comes the fatal question: _are_ any elements of the diagonal of
CI negative?
> sum(diag(CI) < 0)
[1] 3
> diag(CI)
[1] 1.0614520 1.1689474 2.9640138 0.8785050 2.7083945 -3.0289056
[7] 4.6415407 2.2881969 1.6595514 1.4349139 3.7682528 1.9774941
[13] 2.3342165 -0.9505056 1.5479098 -7.4458871 0.3985586 1.2470409
[19] 1.2829017 2.3821185 1.5060749
There seem to be three possibilities:
(1) The paper cited above has proposed a distance measure for
categorical variables which is sometimes undefined.
(2) The paper I'm studying has misunderstood that paper and
misrepresented what Dcv is really supposed to be.
(3) There is a mistake in my R code.
Ad (1), we're talking about a published paper, so I'd be reluctant to
think so.
Ad (2), I've actually looked at the (Python) code for calculating Dcv in
some detail. There's an extra scaling step, which multiplies the diagonal
elements by weights between 1/#values and 1 inclusive, but that only
changes the magnitudes of the diagonal elements, not their sign.
Ad (3), I make mistakes as often as the next man (hello, Eccles!), which is
why I'm asking here if anyone can see the mistake.
The only structural properties of C that I am sure of are
- all the elements are between 0 and 1
- it is symmetric
- the diagonal elements are all 1.
random.C.like.matrix <- function (n) {
  # A random symmetric n x n matrix with entries in [0, 1]
  # and all diagonal elements equal to 1.
  x <- matrix(runif(n * n), nrow = n, ncol = n)
  x <- (x + t(x)) * 0.5
  diag(x) <- 1
  x
}

count.negative.diagonal.elements <- function (x) {
  # How many diagonal elements of the inverse of x are negative?
  sum(diag(solve(x)) < 0)
}
> summary(sapply(1:100, function (i)
+ count.negative.diagonal.elements(random.C.like.matrix(21)) ))
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00 4.00 9.00 8.94 14.00 21.00
So clearly there is nothing unusual about a matrix with those properties
having negative diagonal elements. If Dcv is defined correctly in (2),
then C must have some additional structural property which ensures that
this can't happen, and there must be a mistake in my code that fails to
respect that structural property.
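One candidate for such an additional property is positive definiteness: if
C is symmetric positive definite then so is C^-1, every diagonal element of
C^-1 is strictly positive, and the whole quadratic form Dh.C^-1.Dh^T is
non-negative, so Dcv would always be defined. Whether the matrix of
Cramer's V scores is guaranteed to be positive definite is exactly what I
can't prove. A quick illustration of the idea on a small made-up matrix:

```r
# For a symmetric positive definite matrix (all eigenvalues > 0),
# the inverse is also positive definite, so its diagonal is all
# positive and sqrt() of the quadratic form is always defined.
A <- matrix(c(1.0, 0.5,
              0.5, 1.0), nrow = 2)
eigen(A, symmetric = TRUE, only.values = TRUE)$values  # 1.5, 0.5: both > 0
all(diag(solve(A)) > 0)                                # TRUE
```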