thr3ads.net - Xapian devel - KMeans Clusterer

Richhiey Thomas

2017-Jun-14 23:25 UTC

KMeans Clusterer - Going forward

Hello,

I have finished moving the API to PIMPL classes and will fix issues within
the current code over the next week, based on reviews from mentors.

The next step is to form document vectors that are reduced and more useful.
This helps significantly in saving run time (since the time for a distance
calculation depends on the number of terms). Keeping only the useful terms
in a document's vector can also improve clustering accuracy, because there
are fewer noise terms. Two important things to be done in this direction
are:

1) Stemming
This is easier because Xapian already provides stemmed terms.

2) Stopword removal
Use either Xapian::SimpleStopper or create a subclass of Xapian::Stopper to
decide whether a term that is fed to it is a stopword. For determining
which terms are stopwords, I was wondering whether we'd use the stopword
lists within xapian/languages/stopwords, or whether we'd have to create one
within the cluster directory?
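
To make the run-time point concrete, here is a minimal sketch (in Python
for brevity; it does not use Xapian's API) of a distance between two sparse
document vectors stored as term-to-weight maps. Cosine distance and the
function name cosine_distance are assumptions chosen for illustration, not
necessarily what the clusterer uses. The work done is proportional to the
number of terms in the vectors, which is why dropping terms saves time.

```python
import math


def cosine_distance(a, b):
    """Cosine distance between two sparse vectors given as {term: weight}."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the shorter vector for the dot product
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # convention chosen here: an empty vector is maximally distant
    return 1.0 - dot / (norm_a * norm_b)
```

Every loop above runs once per stored term, so a vector with stopwords and
duplicate word forms removed costs less per comparison.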

Over the next half of the month, the plan is to get feature extraction
and Elkan's k-means (with the triangle inequality) working well.
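
As a rough illustration of the triangle inequality idea, the sketch below
(Python, standalone; the names euclidean and assign_with_pruning are
hypothetical) shows only the basic pruning test: if the distance between
the current best centroid c and another centroid c' is at least twice the
distance from the point to c, then c' cannot be closer, so that distance
need not be computed. This is not the full algorithm, which also keeps
upper and lower bounds across iterations.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length coordinate tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def assign_with_pruning(points, centroids):
    """Assign each point to its nearest centroid.

    Returns (assignments, skipped), where skipped counts the
    point-to-centroid distance computations avoided by the pruning test.
    """
    k = len(centroids)
    # Centroid-to-centroid distances are computed once per assignment pass.
    cc = [[euclidean(centroids[i], centroids[j]) for j in range(k)]
          for i in range(k)]
    assignments = []
    skipped = 0
    for p in points:
        best = 0
        best_d = euclidean(p, centroids[0])
        for j in range(1, k):
            # d(c, c') >= 2 * d(p, c) implies d(p, c') >= d(p, c).
            if cc[best][j] >= 2 * best_d:
                skipped += 1
                continue
            d = euclidean(p, centroids[j])
            if d < best_d:
                best, best_d = j, d
        assignments.append(best)
    return assignments, skipped
```

The pruning never changes the result compared with checking every
centroid; it only avoids distance computations that cannot win.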

As Olly mentioned in one of his comments on the PR, it wouldn't be ideal
to use hard-coded criteria for feature selection, so using something like
an ExpandDecider would certainly be great. I will look into it and make my
approach clear as I go ahead.

Thanks,
Richhiey

James Aylett

2017-Jun-18 18:00 UTC

KMeans Clusterer - Going forward

On 15 Jun 2017, at 00:25, Richhiey Thomas <richhiey.thomas at gmail.com> wrote:

> The next step going forward is to start with forming document vectors that
> are reduced and more useful. This majorly helps in saving run time (since
> time for distance calculation depends on number of terms). Getting the
> useful terms within a document in its document vector can improve its
> accuracy, due to less noise terms. Two important things to be done in this
> direction are :
>
> 1) Stemming
> This is easier because xapian already provides stemmed terms.

Are you planning on dropping all the stemmed terms, or all the unstemmed terms?

> 2) Stopword removal
> Use either Xapian::SimpleStopper or create a subclass of Xapian::Stopper to
> determine whether a term that is fed to it is a stopword or not. But for
> determining which terms are stopwords, I was wondering whether we'd be
> using the stopword list within xapian/languages/stopwords or will we have
> to create one within the cluster directory?

I'd suggest that you allow users to pass in a Stopper subclass, which gives
them maximum control. You don't need to create a new stopword list, or
manage it at all. For documentation and examples, I'd either use a builtin
list or provide an explicit list of terms.

> Over the next half of the month, the plan will be to get feature extraction
> and elkans-kmeans (with triangle inequality) to be working well.

In that order, I assume, so focussing on the two straightforward dimensionality
reduction approaches (stemming and stopping) until they're working and
merged, and then looking at things like the triangle inequality optimisation.

> As Olly has mentioned in one of his comments on the PR, it wouldn't be
> ideal to use hard coded criteria for feature selection. Thus using
> something like an ExpandDecider would certainly be great. I will look into
> it and make my approach clear as I go ahead.

This is definitely nice to have, but I suspect getting a solid and performant
system is a better focus. A good thing to do is to keep track of ideas like this
as they come up, and reconsider them next time you look afresh at your timeline
and where you are against it. (It's good to do this at the evaluation points,
for instance.)

J

-- 
 James Aylett, occasional troublemaker & project governance
 xapian.org

James Aylett

2017-Jun-18 22:19 UTC

KMeans Clusterer - Going forward

[Please keep emails on the mailing list.]

> On 18 Jun 2017, at 22:43, Richhiey Thomas <richhiey.thomas at gmail.com> wrote:
>
>> Are you planning on dropping all the stemmed terms, or all the
>> unstemmed terms?
>
> I plan on dropping all the unstemmed terms since it reduces the size of the
> termlist to a larger extent and can also take care of noise within text data
> such as spelling mistakes.

Hmm, I wonder how much of a negative impact false positive conflation errors
from the stemmer will have here.
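
A small sketch of this trade-off (Python, standalone, with a hypothetical
helper name keep_stemmed and a made-up termlist): it assumes the documents
were indexed with Xapian's TermGenerator in its default stemming mode,
where stemmed forms are stored with a "Z" prefix alongside the unstemmed
terms. "universe" and "university" are a commonly cited pair that
Porter-style stemmers conflate to the same stem.

```python
def keep_stemmed(termlist):
    """Keep only stemmed terms, assumed here to carry the "Z" prefix."""
    return {term: wdf for term, wdf in termlist.items()
            if term.startswith("Z")}


# Illustrative termlist: {term: within-document frequency}.
termlist = {
    "universe": 1,
    "university": 2,
    "clustering": 1,
    "Zunivers": 3,   # both "universe" and "university" land here
    "Zcluster": 1,
}

reduced = keep_stemmed(termlist)
```

The reduced vector is shorter, but it can no longer tell "universe" from
"university", which is the kind of false positive conflation in question.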

It's fairly easy either way; I suspect in future we'll come up with
something more sophisticated and under control of the user, but that
shouldn't hold us up for now.

>> I'd suggest that you allow users to pass in a Stopper subclass,
>> which gives them maximum control. You don't need to create a new stopword
>> list, or manage it at all. For documentation and examples, I'd either use a
>> builtin list or provide an explicit list of terms.
>
> This sounds good. In a case where the user does not provide a Stopper, I
> guess it would be ideal to initialise the Stopper subclass with the common
> stopword list that we already have. This can be passed to KMeans in its
> constructor and any initializations can be done there.

I'd start with the default being no stopping if there's no explicit
stopper. There may be situations where that is the right approach (particularly
if you have a complex multi-language situation, or you aren't using
word-like terms at all), so it'd be a shame to make it impossible.
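
One way to picture that default, as a hedged sketch in Python rather than
the actual C++ API: the stopper is an optional callable that returns True
for stopwords (mirroring how a Xapian::Stopper is called with a term), and
passing nothing means no terms are dropped. The names build_vector and
simple_stopper are hypothetical.

```python
from typing import Callable, Dict, Iterable, Optional

# A stopper is any callable that returns True when a term is a stopword.
Stopper = Callable[[str], bool]


def simple_stopper(words: Iterable[str]) -> Stopper:
    """Build a stopper from an explicit list of stopwords."""
    stopset = frozenset(words)
    return lambda term: term in stopset


def build_vector(termlist: Dict[str, int],
                 stopper: Optional[Stopper] = None) -> Dict[str, int]:
    """Copy a termlist into a document vector, dropping stopwords.

    With no stopper supplied, every term is kept (no stopping by default).
    """
    if stopper is None:
        return dict(termlist)
    return {t: w for t, w in termlist.items() if not stopper(t)}
```

A caller who wants stopping supplies a stopper explicitly, for example
build_vector(terms, simple_stopper(["the", "of"])); a caller indexing
non-word terms or several languages simply omits it.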

J

-- 
 James Aylett, occasional troublemaker & project governance
 xapian.org
