Hi,
Apologies for this question being off-topic (OT) for R, though I expect
this list might be the best place on the net to ask it.
In brief, the question is: what classification algorithm can one use if
the features are histograms?
I have a classification problem, and believe that histograms of the
distribution of some values may be the best "feature" to use.
To keep this mail short, here is a simpler example problem: try to
classify a person as, e.g., drunk or not drunk, given the histogram of
their driving speed.
In the training phase, we have a table whose rows contain the driver,
whether they are drunk, and a sample of driving speed. From this one
can build separate histograms of driving speed for drunk and non-drunk
drivers.
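
To make the training step concrete, here is a rough Python sketch of
what I mean. The rows, the bin edges, and the function names are all
made up for illustration; they are not my real data or code.

```python
from collections import defaultdict

# Hypothetical training rows: (driver, is_drunk, speed in km/h).
# All values are invented for illustration.
rows = [
    ("a", False, 52.0), ("a", False, 55.0), ("a", False, 49.0),
    ("b", True, 38.0), ("b", True, 71.0), ("b", True, 64.0),
    ("c", False, 51.0), ("d", True, 80.0),
]

# Assumed bin edges shared by both classes: [30,40), [40,50), ..., [80,90).
EDGES = [30, 40, 50, 60, 70, 80, 90]


def histogram(speeds, edges):
    """Count speeds per bin; values outside the edges are ignored."""
    counts = [0] * (len(edges) - 1)
    for s in speeds:
        for i in range(len(edges) - 1):
            if edges[i] <= s < edges[i + 1]:
                counts[i] += 1
                break
    return counts


def class_histograms(rows, edges):
    """Pool the speed samples by class label and histogram each pool."""
    by_class = defaultdict(list)
    for _driver, is_drunk, speed in rows:
        by_class[is_drunk].append(speed)
    return {label: histogram(v, edges) for label, v in by_class.items()}


hists = class_histograms(rows, EDGES)
print(hists[False])  # bin counts for non-drunk samples
print(hists[True])   # bin counts for drunk samples
```

Using the same bin edges for both classes is what makes the two
histograms directly comparable later on.
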
(In my actual application, I have several such histogram features, and
they are visibly different; they are also currently ranked by some
analytic pdf-distance measures such as KL divergence.)
Now, how to classify... Given a single speed, its probability can be
evaluated under the two classes, but a single speed sample is not going
to be reliable in this problem.
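
For reference, the single-sample evaluation I have in mind is roughly
the sketch below. The counts and bin edges are invented, and the add-one
smoothing is my own assumption, used only so that an empty bin does not
give a probability of exactly zero.

```python
# Assumed bin edges: [30,40), [40,50), ..., [80,90) in km/h.
EDGES = [30, 40, 50, 60, 70, 80, 90]

# Hypothetical per-class bin counts from the training phase (made up).
CLASS_COUNTS = {
    "sober": [0, 1, 3, 0, 0, 0],
    "drunk": [1, 0, 0, 1, 1, 1],
}


def bin_index(speed, edges):
    """Return the index of the bin containing speed, or None if outside."""
    for i in range(len(edges) - 1):
        if edges[i] <= speed < edges[i + 1]:
            return i
    return None


def bin_probability(speed, counts, edges, alpha=1.0):
    """Estimate P(bin of speed | class) with add-alpha smoothing.

    The smoothing constant alpha is an assumption, not part of the
    problem statement.
    """
    i = bin_index(speed, edges)
    if i is None:
        return 0.0
    return (counts[i] + alpha) / (sum(counts) + alpha * len(counts))


speed = 55.0
probs = {
    label: bin_probability(speed, counts, EDGES)
    for label, counts in CLASS_COUNTS.items()
}
print(probs)
```

With so few samples per bin the two numbers are clearly noisy, which is
exactly why one speed on its own is not enough.
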
Suppose instead that the _distribution_ of speeds is sufficient to
discriminate.
We have a driver, and a distribution of their speeds over time, from
which a histogram can be built. What should one do with this histogram?
Is there a standard classifier that can deal with this situation?
My thoughts:
- The test histogram could be compared to each of the training
  histograms with the Chi^2 measure (a sum of squared Gaussian
  deviations), and a probability obtained from that?
- Alternatively, consider training histograms with n bins as points in
  n-dimensional space, and use Euclidean closeness in this space.
  This may not generalize to more than one such histogram feature,
  though.
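
Both ideas can be sketched in a few lines of Python. This is only a
rough illustration under my own assumptions: the counts are invented,
the histograms are normalized to sum to 1 before comparison, and the
Chi^2 distance used is the symmetric variant sum((p - q)^2 / (p + q)),
which is one common form and not necessarily the right one here.

```python
import math


def normalize(counts):
    """Turn bin counts into relative frequencies that sum to 1."""
    total = sum(counts)
    return [c / total for c in counts]


def chi2_distance(p, q):
    """Symmetric Chi^2 distance; bins empty in both histograms are skipped."""
    return sum((a - b) ** 2 / (a + b) for a, b in zip(p, q) if a + b > 0)


def euclidean_distance(p, q):
    """Treat each n-bin histogram as a point in n-dimensional space."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def nearest_class(test_counts, class_counts, distance):
    """Return the label whose training histogram is closest to the test one."""
    test = normalize(test_counts)
    return min(
        class_counts,
        key=lambda label: distance(test, normalize(class_counts[label])),
    )


# Hypothetical training histograms (one per class) and one test driver.
class_counts = {
    "sober": [0, 1, 3, 0, 0, 0],
    "drunk": [1, 0, 0, 1, 1, 1],
}
test_counts = [1, 0, 1, 1, 2, 0]

by_chi2 = nearest_class(test_counts, class_counts, chi2_distance)
by_euclid = nearest_class(test_counts, class_counts, euclidean_distance)
print(by_chi2, by_euclid)
```

This gives a nearest-histogram label but not yet a probability, and it
handles only one histogram feature, which are the two open points above.
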
Thanks for any thoughts.
(Also, thanks for the replies to my recent question about
hashtable/dictionary.)