The function predict.lda() is answering a different question from the one you
are posing. It answers the question: given the values observed on this object,
what is the probability of membership in each of the groups used to construct
the discriminant functions in the first place? Those probabilities sum to 1 and
are generally called posterior probabilities. Your question is somewhat
different: if this object were a member of group x, what is the probability that
it would have values like these? These are typicality probabilities (how typical
is this observation of this group).
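To make the distinction concrete, here is a minimal sketch using the built-in
iris data as a stand-in for your training set (three groups rather than six).
The "outlier" observation is made up for illustration and lies far from every
group centroid, yet its posterior probabilities still sum to 1.

```r
library(MASS)

# Fit LDA on iris (an assumed stand-in for the real training set).
fit <- lda(Species ~ ., data = iris)

# A hypothetical "unknown" observation far from all three group centroids.
outlier <- data.frame(Sepal.Length = 15, Sepal.Width = 10,
                      Petal.Length = 0.5, Petal.Width = 5)

# Posterior probabilities are conditional on the observation belonging to
# one of the fitted groups, so they are normalized across groups.
post <- predict(fit, newdata = outlier)$posterior
post            # one row, one column per group
rowSums(post)   # sums to 1 no matter how atypical the observation is
```

The posteriors only say which of the fitted groups is relatively closest. They
carry no information about whether the observation is plausible for any of them.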
There are two ways to compute typicality probabilities. One is to use the
reduced space defined by the discriminant functions and measure the distance
from a new observation to the group centroid. This is the approach taken by
SPSS, which provides the typicality for the group that has the highest posterior
probability. Huberty and Olejnik recommend this procedure on the grounds that
the probability distribution is known. The alternative approach, which is
commonly used in compositional analysis, is to use the Mahalanobis distance,
with the probability assumed to follow a chi-square distribution. I am not aware
of a package that has a function to produce either of these.
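The Mahalanobis approach is short enough to write by hand. The following is a
rough sketch under stated assumptions, not an established package function:
typicality() is a hypothetical helper name, the covariance matrix is the pooled
within-group estimate (the same equal-covariance assumption lda() makes), and
the chi-square degrees of freedom are set to the number of variables. The
iris data and the two test observations are again only illustrative.

```r
# Hypothetical helper: typicality probabilities from squared Mahalanobis
# distances of each new observation to each group centroid.
typicality <- function(train, groups, new) {
  train  <- as.matrix(train)
  new    <- as.matrix(new)
  groups <- factor(groups)

  # Group centroids: one row per group, one column per variable.
  centroids <- apply(train, 2, function(v) tapply(v, groups, mean))

  # Pooled within-group covariance: center each row on its own group
  # centroid, then divide the cross-products by (n - number of groups).
  resid  <- train - centroids[as.character(groups), , drop = FALSE]
  pooled <- crossprod(resid) / (nrow(train) - nlevels(groups))

  # Squared Mahalanobis distance of every new row to every centroid.
  d2 <- sapply(levels(groups), function(g)
    mahalanobis(new, center = centroids[g, ], cov = pooled))
  d2 <- matrix(d2, nrow = nrow(new),
               dimnames = list(rownames(new), levels(groups)))

  # Upper-tail chi-square probability, df = number of variables.
  pchisq(d2, df = ncol(train), lower.tail = FALSE)
}

# One observation at the setosa centroid and one made-up outlier.
new <- rbind(typical = colMeans(iris[iris$Species == "setosa", 1:4]),
             outlier = c(15, 10, 0.5, 5))
typ <- typicality(iris[, 1:4], iris$Species, new)
round(typ, 4)
```

Unlike the posteriors, each row here need not sum to 1, so an observation that
is far from every centroid gets a probability near zero for every group. Note
that the chi-square reference distribution treats the estimated centroids and
covariance matrix as if they were known, so it is an approximation when group
sizes are small.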
Huberty, Carl J. and Stephen Olejnik. 2006. Applied MANOVA and Discriminant
Analysis. Second Edition. Wiley-Interscience.
David L. Carlson
Department of Anthropology
Texas A&M University
-----Original Message-----
From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org]
On Behalf Of Fraser D. Neiman
Sent: Friday, August 29, 2014 4:14 PM
To: r-help at r-project.org
Subject: [R] posterior probabilities from lda.predict
Dear All,
I have used the lda() function in the MASS library to estimate a set of
discriminant functions to assign samples from a training set to one of six
groups. The cross validation generates nearly perfect predictions for samples
in the training set. Hooray!
Now I want to use lda.predict() to estimate both discriminant function scores
and probabilities of group membership for a second set of samples whose group
membership is unknown. For each unknown sample, lda.predict() produces six
probabilities. These probabilities sum to one. So lda.predict() seems to assume
that the unknown samples do, in fact, belong to one of the six groups.
The problem is that it is nearly certain that some of the unknown samples in the
second set do not belong to any of the six groups. For those samples,
probabilities of group membership should be close to zero for all six groups.
In fact, identifying which samples are unlikely to belong to any of the six
groups is a major goal of the analysis.
So the question is, what is lda.predict() doing behind the scenes to force the
group membership probabilities to sum to one? How do I get it to not do this and
produce probabilities that accurately reflect the large Mahalanobis distances of
some of the unknown samples from any group centroid?
I have searched the R-list archive on this and have found several folks asking
similar questions, but no helpful answers.
Thanks very much!
Fraser