Hello,

I have finished moving the API to PIMPL classes and will fix issues in the current code over the next week, based on reviews from mentors.

The next step is to start forming document vectors that are reduced and more useful. This mainly helps to save run time (since the time for a distance calculation depends on the number of terms). Keeping only the useful terms of a document in its document vector can also improve accuracy, because there are fewer noise terms. Two important things to be done in this direction are:

1) Stemming
This is easier because Xapian already provides stemmed terms.

2) Stopword removal
Use either Xapian::SimpleStopper or create a subclass of Xapian::Stopper to determine whether a term that is fed to it is a stopword or not. But for determining which terms are stopwords, I was wondering whether we'd be using the stopword list within xapian/languages/stopwords, or whether we will have to create one within the cluster directory?

Over the next half of the month, the plan is to get feature extraction and Elkan's k-means (with the triangle inequality) working well.

As Olly mentioned in one of his comments on the PR, it wouldn't be ideal to use hard-coded criteria for feature selection. Thus using something like an ExpandDecider would certainly be great. I will look into it and make my approach clear as I go ahead.

Thanks,
Richhiey
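The remark above that distance calculation time depends on the number of terms can be illustrated with a small standalone sketch. It assumes a sparse document vector stored as a sorted term-to-weight map; `DocVector`, `squared_distance` and `distance_example` are hypothetical names for illustration, not part of the Xapian API or of the clustering branch.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical sparse document vector: term -> weight (not a Xapian type).
using DocVector = std::map<std::string, double>;

// Squared Euclidean distance using a merge-style walk over both sorted maps.
// The loops visit each distinct term of the two vectors exactly once, so the
// work grows with the number of terms kept in the vectors.
double squared_distance(const DocVector& a, const DocVector& b) {
    double sum = 0.0;
    auto i = a.begin();
    auto j = b.begin();
    while (i != a.end() && j != b.end()) {
        if (i->first < j->first) {
            // Term only in a: b's weight is implicitly zero.
            sum += i->second * i->second;
            ++i;
        } else if (j->first < i->first) {
            // Term only in b: a's weight is implicitly zero.
            sum += j->second * j->second;
            ++j;
        } else {
            // Term in both vectors.
            const double d = i->second - j->second;
            sum += d * d;
            ++i;
            ++j;
        }
    }
    // Whatever is left over appears in only one of the vectors.
    for (; i != a.end(); ++i) sum += i->second * i->second;
    for (; j != b.end(); ++j) sum += j->second * j->second;
    return sum;
}

// Usage example with made-up weights.
void distance_example() {
    const DocVector a{{"cat", 1.0}, {"dog", 2.0}};
    const DocVector b{{"dog", 2.0}, {"fish", 3.0}};
    // cat: 1^2, dog: (2 - 2)^2, fish: 3^2
    assert(squared_distance(a, b) == 10.0);
}
```

Every term removed by stemming or stopping is one fewer iteration in this walk, for every pair of vectors compared.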
On 15 Jun 2017, at 00:25, Richhiey Thomas <richhiey.thomas at gmail.com> wrote:

> The next step going forward is to start with forming document vectors that are reduced and more useful. This majorly helps in saving run time (since time for distance calculation depends on number of terms). Getting the useful terms within a document in its document vector can improve its accuracy, due to less noise terms. Two important things to be done in this direction are :
>
> 1) Stemming
> This is easier because xapian already provides stemmed terms.

Are you planning on dropping all the stemmed terms, or all the unstemmed terms?

> 2) Stopword removal
> Use either Xapian::SimpleStopper or create a subclass of Xapian::Stopper to determine whether a term that is fed to it is a stopword or not. But for determining which terms are stopwords, I was wondering whether we'd be using the stopword list within xapian/languages/stopwords or will we have to create one within the cluster directory?

I'd suggest that you allow users to pass in a Stopper subclass, which gives them maximum control. You don't need to create a new stopword list, or manage it at all. For documentation and examples, I'd either use a builtin list or provide an explicit list of terms.

> Over the next half of the month, the plan will be to get feature extraction and elkans-kmeans (with triangle inequality) to be working well.

In that order, I assume, so focussing on the two straightforward dimensionality reduction approaches (stemming and stopping) until they're working and merged, and then looking at things like the triangle inequality optimisation.

> As Olly has mentioned in one of his comments on the PR, it wouldn't be ideal to use hard coded criteria for feature selection. Thus using something like an ExpandDecider would certainly be great. I will look into it and make my approach clear as I go ahead.

This is definitely nice to have, but I suspect getting a solid and performant system is a better focus.
A good thing to do is to keep track of ideas like this as they come up, and reconsider them next time you look afresh at your timeline and where you are against it. (It's good to do this at the evaluation points, for instance.)

J

--
James Aylett, occasional troublemaker & project governance
xapian.org
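The suggestion above, letting the user pass in their own stopper rather than managing a stopword list inside the clustering code, can be sketched roughly as follows. This is a standalone illustration: `TermStopper` is a hypothetical interface modelled on the shape of Xapian::Stopper (a predicate called with a term), and `ListStopper`, `remove_stopwords` and `remove_stopwords_example` are likewise made-up names, with an explicit list of terms standing in for a real stopword list.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical interface, modelled on the shape of Xapian::Stopper:
// returns true if the term should be treated as a stopword.
class TermStopper {
  public:
    virtual ~TermStopper() = default;
    virtual bool operator()(const std::string& term) const = 0;
};

// A user-defined stopper backed by an explicit list of terms.
class ListStopper : public TermStopper {
    std::set<std::string> words_;

  public:
    explicit ListStopper(std::set<std::string> words)
        : words_(std::move(words)) {}

    bool operator()(const std::string& term) const override {
        return words_.count(term) != 0;
    }
};

// The clustering side only sees the interface; it never owns or manages
// a stopword list itself.
std::vector<std::string> remove_stopwords(const std::vector<std::string>& terms,
                                          const TermStopper& stopper) {
    std::vector<std::string> kept;
    for (const std::string& term : terms) {
        if (!stopper(term)) kept.push_back(term);
    }
    return kept;
}

// Usage example with an explicit list of terms.
void remove_stopwords_example() {
    const ListStopper stopper(std::set<std::string>{"the", "of", "and"});
    const std::vector<std::string> terms = {"the", "history", "of", "clustering"};
    const std::vector<std::string> kept = remove_stopwords(terms, stopper);
    assert(kept.size() == 2);
    assert(kept[0] == "history");
    assert(kept[1] == "clustering");
}
```

Because the filtering code depends only on the predicate, a user with unusual terms or several languages can supply whatever rule suits their data.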
[Please keep emails on the mailing list.]

> On 18 Jun 2017, at 22:43, Richhiey Thomas <richhiey.thomas at gmail.com> wrote:
>
>> Are you planning on dropping all the stemmed terms, or all the unstemmed terms?
>
> I plan on dropping all the unstemmed terms since it reduces the size of the termlist to a larger extent and can also take care of noise within text data such as spelling mistakes.

Hmm, I wonder how much of a negative impact false positive conflation errors from the stemmer will have here. It's fairly easy either way; I suspect in future we'll come up with something more sophisticated and under control of the user, but that shouldn't hold us up for now.

>> I'd suggest that you allow users to pass in a Stopper subclass, which gives them maximum control. You don't need to create a new stopword list, or manage it at all. For documentation and examples, I'd either use a builtin list or provide an explicit list of terms.
>
> This sounds good. In a case where the user does not provide a Stopper, I guess it would be ideal to initialise the Stopper subclass with the common stopword list that we already have. This can be passed to KMeans in its constructor and any initializations can be done there.

I'd start with the default being no stopping if there's no explicit stopper. There may be situations where that is the right approach (particularly if you have a complex multi-language situation, or you aren't using word-like terms at all), so it'd be a shame to make it impossible.

J

--
James Aylett, occasional troublemaker & project governance
xapian.org
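The two decisions in this exchange, dropping unstemmed terms and defaulting to no stopping when no stopper is given, could look something like the sketch below. It is standalone and makes several assumptions: stemmed terms are recognised by the "Z" prefix that Xapian's TermGenerator gives them, the stopper is consulted with that prefix removed, and `StopPredicate`, `select_terms` and `select_terms_example` are hypothetical names rather than anything in the Xapian API.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Hypothetical predicate standing in for a user-supplied stopper.
// An empty std::function means "no stopper was given".
using StopPredicate = std::function<bool(const std::string&)>;

// Keep only stemmed terms, then apply the stopper if (and only if) one was
// supplied. With the default argument, no stopping happens at all.
std::vector<std::string> select_terms(
    const std::vector<std::string>& terms,
    const StopPredicate& is_stopword = StopPredicate()) {
    std::vector<std::string> kept;
    for (const std::string& term : terms) {
        // Assumption: stemmed terms carry a leading "Z"; drop everything else.
        if (term.empty() || term[0] != 'Z') continue;
        // Assumption: the stopper sees the term with the prefix removed.
        if (is_stopword && is_stopword(term.substr(1))) continue;
        kept.push_back(term);
    }
    return kept;
}

// Usage example with made-up terms.
void select_terms_example() {
    const std::vector<std::string> terms = {"Zrun", "running", "Zthe", "the"};

    // No stopper: only the unstemmed terms are dropped.
    const std::vector<std::string> unstopped = select_terms(terms);
    assert(unstopped.size() == 2);
    assert(unstopped[0] == "Zrun");
    assert(unstopped[1] == "Zthe");

    // Explicit stopper: "Zthe" is dropped as well.
    const StopPredicate stopper = [](const std::string& t) { return t == "the"; };
    const std::vector<std::string> stopped = select_terms(terms, stopper);
    assert(stopped.size() == 1);
    assert(stopped[0] == "Zrun");
}
```

Making "no stopper" the default keeps the multi-language and non-word-term cases possible, while a caller who wants stopping opts in explicitly.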