On Mon, Mar 07, 2016 at 01:36:43AM +0530, Richhiey Thomas wrote:

> My questions are:
> 1) Can you direct me on how to convert this raw idea into a proposal
> in context to Xapian with more detail? What areas do I focus on?

Our GSoC guide has an application template
<https://trac.xapian.org/wiki/GSoCApplicationTemplate> which you should
use to structure your proposal. It has some recommendations on how you
should lay out and think about a proposal, particularly in terms of
planning your timeline, which in the past has proven one of the keys to
a successful project.

> 2) It would be great if you could elaborate a little on the
> performance evaluation part that I haven't been able to follow too
> well.

'Performance' is a tricky word in information retrieval, so I'll break
this into two pieces: speed and quality.

As the project notes say, the previous attempt at implementing
clustering was far too slow to be practically useful. So that's the
speed side: we want to be able to cluster a fairly large number of
documents quickly (which would need some thought -- do we want to be
able to cluster 1000 documents in under a second? 10,000 in a handful
of seconds? or might 1000 documents in a handful of seconds be
sufficient?).

Quality can be judged in a number of ways, but we're generally trying
to produce 'good' clusters such as a human with knowledge of the
subject area would create. There's some discussion of how you might
evaluate this in Introduction to Information Retrieval, section 16.3
p356 (or online at
http://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-clustering-1.html).

It's perhaps worth pointing out that Hearst (2009, p200) suggests that
monothetic clusters 'may be easier for users to understand', although
it doesn't cite any specific work to back up this claim. That may argue
that a K-means based approach isn't necessarily going to be the most
helpful in all cases; there may be other approaches worth considering
instead.
(That entire section on clustering is worth reading if you have access
to the book.)

Hearst, MA (2009) 'Search User Interfaces', CUP, Cambridge.

J

--
James Aylett, occasional trouble-maker
xapian.org
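To make the cluster-quality side concrete, the purity measure from that
IIR section can be computed in a few lines. This is an illustrative
Python sketch, not anything from Xapian: the docids, topic labels and
function name are all made up for the example.

```python
from collections import Counter

def purity(clusters, gold):
    """Purity (IIR section 16.3): credit each cluster with its majority
    gold label; purity is the fraction of documents so covered."""
    total = sum(len(c) for c in clusters)
    majority = sum(Counter(gold[d] for d in c).most_common(1)[0][1]
                   for c in clusters if c)
    return majority / total

# Toy fixture: five docids with hand-assigned topic labels.
gold = {1: "sport", 2: "sport", 3: "news", 4: "news", 5: "sport"}
print(purity([[1, 2, 5], [3, 4]], gold))  # 1.0 -- both clusters pure
print(purity([[1, 2, 3], [4, 5]], gold))  # 0.6 -- strays in each cluster
```

Purity is easy to compute but rewards many small clusters, which is why
the IIR chapter pairs it with measures like NMI.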
On Mon, Mar 7, 2016 at 8:58 PM, James Aylett <james-xapian at tartarus.org> wrote:

> Our GSoC guide has an application template
> <https://trac.xapian.org/wiki/GSoCApplicationTemplate> which you
> should use to structure your proposal. It has some recommendations on
> how you should lay out and think about a proposal, particularly in
> terms of planning your timeline, which in the past has proven one of
> the keys to a successful project.

Thanks James!

Below is a draft of my proposal for Clustering of Search Results, based
on our previous mails.

CLUSTERING OF SEARCH RESULTS:

Motivation:
My main motivation for applying to this project is that clustering is a
topic I have always liked in machine learning, and I have already
worked with the two main components of this proposal, namely PSO and
K-means. This project is a way to bring my knowledge of these
algorithms to life in a real-world application like Xapian. Apart from
this, effective clustering of documents is an important retrieval
feature, and query processing speeds can be improved by using it.

Project Idea:
I would like to propose a hybrid PSO and K-means clustering algorithm
for document clustering in Xapian.

PSO stands for Particle Swarm Optimisation, a technique for finding
near-optimal solutions in a solution space. It is based on the way a
swarm of particles, initially randomly placed, will eventually follow
the particle with the best position relative to some fitness measure,
which is used to find better and better positions across the entire
solution space.

K-means is a partitioning clustering algorithm used to divide data into
clusters under some distance metric. The K-means algorithm starts by
selecting K random vectors in the search space and assigning them as
cluster centroids.
Then every other vector is assigned to its closest centroid, found
using the Euclidean distance metric, and the centroids are recomputed;
this continues until convergence. The main problem with K-means is the
choice of initial centroids, since a given initial configuration will
always produce the same result. So the problem of finding the optimal
clusters with K-means reduces to finding a good initial configuration
to start from.

The proposed approach is independent of the initial cluster centroids:
it keeps K-means from being trapped in a local optimum while still
using its fast convergence, combined with the global search ability of
PSO. The algorithm for this approach is as follows:

1) Treating every document as a particle, initialize each particle to
   contain K randomly selected cluster centers; at the same time,
   randomly choose K cluster centers for the K-means algorithm. In both
   cases the K cluster centers are obtained by randomly assigning each
   data vector to a cluster and computing the cluster center.
2) For each particle, compute the fitness value according to the
   fitness equation.
3) Compare the i-th particle's fitness value with pid_i, the best
   candidate solution encountered by particle i; if the current value
   is better than pid_i, set pid_i to the current particle.
4) Compare the particle's fitness value with pgd, the best candidate
   solution encountered by any particle; if the current value is better
   than pgd, set pgd to the current particle's value and location.
5) At the same time, compute sse, the sum of distances between each
   data vector and its cluster center in the K-means algorithm.
6) Compare pgd and 1/sse, and take the larger as pgd.
7) Update the velocity and location of each particle using the new pgd.
8) Assign each data vector to the closest cluster based on the new
   location of each particle, and then recalculate the cluster centers.
9) Repeat the above steps until the maximum number of iterations is
   exceeded.

This approach makes K-means independent of the initial cluster
centroids and finds better-quality clusters by combining PSO's global
search of the solution space with the fast convergence speed of
K-means.

For implementing this, we will find the TermList for every document and
pass these in a TermListGroup, along with the docids, to a function for
the initialization step of the algorithm. In this step we can create a
vector of K-means centroids, and every particle can contain a vector of
the cluster centers used by PSO. A function will be written to
calculate the best value for each particle and the best value for the
whole swarm, using the term vectors for each document and the equations
given in the paper on this algorithm. A Euclidean distance metric needs
to be implemented, in a similar way to the DocSimCosine class; with
this we can find the closest cluster center for each document and add
the document to that cluster. This can be done using a vector of
centroids and a vector of all other documents containing the docids. We
can find the centroid a document is closest to and then assign it to
the cluster using a struct similar to the ClusterAssignment struct in
the current implementation. In the next iteration, the new centers can
be found from the documents currently in each cluster, and we repeat.

The link to the paper I referred to for proposing this approach is:
http://ictactjournals.in/paper/IJSCPaper_206-208.pdf

Project Timeline:

April 22 - May 22 (before the coding period starts):
Finish any further research required by talking to the mentors and
consulting other resources.
Have a clear idea of how to go about the project before the coding
period.

May 23 - June 6 (official coding period starts now):
Define all the classes and functions that will be necessary for the
implementation of the project.

June 7 - June 20:
Implement and test the first phase, the part dealing with the PSO
algorithm.

MID TERM EVAL:

June 21 - June 27:
Make further changes to the code to improve functionality, handle
exceptions and remove bugs.

June 28 - July 10:
Implement and test the second phase, in which K-means is actually used
for clustering.

July 11 - July 16:
Make further changes to the code to improve functionality, handle
exceptions and remove bugs.

July 17 - July 28:
Implement or use a concrete measure to test the performance of the
newly implemented clustering algorithm against the previously
implemented one, in terms of speed and accuracy.

July 28 - August 1:
Address any drawbacks that come out of testing the proposed solution by
changing the parts of the code that affect speed or efficiency.

August 2 - August 16:
Clean up the written code, improve the documentation or write it
afresh, identify any remaining bugs and write tests before the final
submission.

I'm sorry if this is too long, but it would be great to hear how I can
improve this and make it better in any way, or whether there are any
points you would like me to elaborate on in more detail.

Thanks :)
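To make the numbered steps above concrete, here is a rough sketch of
the hybrid loop in plain Python. It is purely illustrative: the
function names, the particle count, and the w/c1/c2 values (common PSO
defaults) are my assumptions, and nothing here reflects the eventual
Xapian C++ API.

```python
import math
import random

def dist(a, b):
    # Euclidean distance between two feature vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def assign(docs, centres):
    # Step 8: each document joins the cluster of its closest centre.
    clusters = [[] for _ in centres]
    for d in docs:
        j = min(range(len(centres)), key=lambda i: dist(d, centres[i]))
        clusters[j].append(d)
    return clusters

def recentre(clusters, old):
    # Recompute each centre as the mean of its cluster (empty keeps old).
    return [[sum(col) / len(c) for col in zip(*c)] if c else o
            for c, o in zip(clusters, old)]

def fitness(docs, centres):
    # Step 6 compares pgd with 1/sse, so 1/sse serves as the fitness.
    sse = sum(min(dist(d, c) for c in centres) for d in docs)
    return 1.0 / sse if sse else float("inf")

def hybrid_pso_kmeans(docs, k, particles=5, iters=30,
                      w=0.72, c1=1.49, c2=1.49, seed=0):
    rng = random.Random(seed)
    dim = len(docs[0])
    # Step 1: every particle holds K centres drawn from random documents,
    # and the interleaved K-means gets its own random centres.
    swarm = [[list(rng.choice(docs)) for _ in range(k)]
             for _ in range(particles)]
    vel = [[[0.0] * dim for _ in range(k)] for _ in range(particles)]
    pbest = [[c[:] for c in p] for p in swarm]
    pbest_fit = [fitness(docs, p) for p in swarm]
    km = [list(rng.choice(docs)) for _ in range(k)]
    gbest_fit = max(pbest_fit)
    gbest = [c[:] for c in pbest[pbest_fit.index(gbest_fit)]]
    for _ in range(iters):                      # step 9: iteration cap
        for i, p in enumerate(swarm):
            f = fitness(docs, p)                # step 2
            if f > pbest_fit[i]:                # step 3: update pid_i
                pbest_fit[i], pbest[i] = f, [c[:] for c in p]
            if f > gbest_fit:                   # step 4: update pgd
                gbest_fit, gbest = f, [c[:] for c in p]
        km = recentre(assign(docs, km), km)     # step 5: one K-means pass
        kf = fitness(docs, km)
        if kf > gbest_fit:                      # step 6: K-means may win
            gbest_fit, gbest = kf, [c[:] for c in km]
        for i, p in enumerate(swarm):           # step 7: move particles
            for j in range(k):
                for d in range(dim):
                    r1, r2 = rng.random(), rng.random()
                    vel[i][j][d] = (w * vel[i][j][d]
                                    + c1 * r1 * (pbest[i][j][d] - p[j][d])
                                    + c2 * r2 * (gbest[j][d] - p[j][d]))
                    p[j][d] += vel[i][j][d]
    return gbest, assign(docs, gbest)           # step 8 with best centres

docs = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9)]
centres, clusters = hybrid_pso_kmeans(docs, 2)
```

In the real implementation the plain float tuples would be sparse term
vectors built from each document's TermList, and the distance and
fitness calculations would live in their own classes so each can be
tested separately.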
On Sat, Mar 12, 2016 at 04:27:55PM +0530, Richhiey Thomas wrote:

> Below I write a raw version of my proposal for Clustering of Search
> Results based on our previous mails.

Hi, Richhiey. Thanks for putting this together ahead of the formal
start of applications, and sharing it with us -- and it's really not
too long! Project proposals for something that will last the summer are
naturally going to be long and get into a lot of detail, so don't shy
away from that.

Comments inline (and I've chopped out bits where I don't have anything
useful to add, but it should be obvious from context what I'm
discussing).

Do please note that before submitting this you'll need to put some work
in to make it match our template
(https://trac.xapian.org/wiki/GSoCApplicationTemplate). For instance,
you haven't explicitly answered the 'who will benefit' question; you
hint at it a little in your motivation section, but being as clear as
possible about that question often helps when considering other details
(such as 'how fast is fast enough', which is a key issue with
clustering).

> The algorithm for the given approach will be as follows:

[snip]

> The link to the paper I referred to for proposing this approach is:
> http://ictactjournals.in/paper/IJSCPaper_206-208.pdf

I'd like to ask a few questions about the paper, which hopefully you
can answer, because I'm a little confused by it in places. For
instance, in equation (3) it looks to me like a measure with two free
variables (i and j) is being defined in terms of two free variables (j
and m) and one bound one (i). Is the summation actually supposed to be
over m? (Then within the summation x_i is a dw-dimensional vector, as
is c_j, ie they're both feature vectors, one representing a single item
and one representing a cluster centroid.) I may be misreading something
fundamentally here, because aspects of the combined algorithm just
don't make sense to me.
It doesn't help my understanding that the paper seems to use
overlapping references for the particle positions and clusters (eg in
(3), (4) and I think (5)), and for the cluster centres that are being
used as particle data (at the bottom of p207). I think rewriting it
with different variables might make this a lot clearer. (That may be
the main source of my confusion, in fact!)

As I understand it, though, there are still some parameters to this
algorithm: the maximum iterations, the inertia weight and two 'gravity'
constants for the velocity calculation in (1), and K. Is that correct?
Is there anything we can do to predict a useful value for any of them?

> For implementing this, we will find the TermLists for every document
> and then pass this in a TermListGroup along with the docids to a
> function for the initialization step of the algorithm. In this step,
> we can create a vector of K-means centroids and every particle can
> contain a vector of cluster centers used by PSO.

You mean that you're using the term lists of each document to generate
the particle's position vector in a term-based feature space? (The
initial centroids for K-means and the particle cluster centers are all
randomly initialised, correct?) (That's what I'm expecting, I just want
to be sure I'm understanding both the paper and what you've written
well enough!)

> Project Timeline:

Please don't be put off by the comments I'm making about the timeline
-- writing a plan like this that stretches over several months is
incredibly hard, even if you've done lots of them before.
You've done a good job of starting to break the problem down and think
about how you're going to manage it over the summer, but I think if you
break it down even further you'll end up with a more useful timeline
that lets you worry only about the work you're doing right now, rather
than having to think about where you are overall as well.

> April 22 - May 22 (Before coding period starts) : Finish any more
> research required by talking to the mentors and any other resources.
> Have a clear idea of how to go about the project before the coding
> period

It would be good to identify if there is specific work required during
this period. What further research do you think is required? Is there
anything else you need beyond the cited paper in order to implement
their algorithm?

> May 23 - June 6 (Official coding period starts now) :
> Define all the classes and functions that will be necessary for the
> implementation of the project

Why is this going to take two weeks? It sounds from your discussion as
if you have a good grasp of how to map this into classes already.
Remember that we really recommend having timeline sections of no more
than a week. The more you force yourself to break it down, the more
planning you have to do up front, which generally results in a smoother
project!

> June 7 - June 20 :
> Implement and test the first phase which is a part dealing with the
> PSO algorithm

Is there a way you can break this into smaller pieces? For instance,
are there datasets for PSO out there which you could use to build test
cases, ahead of implementing the algorithm itself? If not, you'll have
to generate your own, calculating the outcome for some data sets by
some other method. (You should be able to use a different
implementation to calculate the test data, because most implementations
won't have license conditions which restrict the subsequent
distribution of their output.)
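As one way of doing that, a throwaway reference implementation can
generate the expected output for a fixture. For example, a few lines of
plain Lloyd's K-means in Python (illustrative only, nothing to do with
the Xapian code under test) can produce expected cluster assignments:

```python
import math

def ref_kmeans(points, centres, iters=10):
    # Plain Lloyd's K-means, used only to compute expected cluster
    # assignments for test fixtures -- not the code under test.
    labels = []
    for _ in range(iters):
        labels = [min(range(len(centres)),
                      key=lambda j: math.dist(p, centres[j]))
                  for p in points]
        for j in range(len(centres)):
            mine = [p for p, lab in zip(points, labels) if lab == j]
            if mine:
                centres[j] = [sum(c) / len(mine) for c in zip(*mine)]
    return labels

# Two obviously separated groups; the labels become the fixture.
pts = [(0, 0), (0, 1), (10, 10), (10, 11)]
print(ref_kmeans(pts, [[0, 0], [10, 10]]))  # [0, 0, 1, 1]
```

The same idea works for the PSO side, though there you'd want to fix
the random seed so the reference run is reproducible.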
With step 6 of the algorithm, it looks to me like the K-means part that
is interwoven with PSO can affect the PSO part, by modifying pgd. Does
it make sense to tackle things in this way (PSO then K-means), or would
it make more sense to break it down as steps of the algorithm, with
each calculation tested separately? (It's probably easier to write the
tests that way.)

> June 21 - June 27 :
> Making further changes to the code to improve functionality,
> exception handling and bug removal

Exception handling is really something you should do as you go along.
I'm also not clear what 'improve functionality' means in this context.
Are you concerned that the previous two weeks won't be enough time to
complete the PSO algorithm? If so, you definitely want to break it down
into smaller pieces; three weeks is generally too long for a single
stretch of work (because it's difficult to tell two weeks in if it's
going well or not).

> June 28 - July 10 :
> Implement and test the second phase which deals with the K-means
> being actually used for clustering.

Again, hopefully you can break this down further.

> July 11 - July 16 :
> Making further changes to the code to improve functionality,
> exception handling and bug removal

And again, breaking down the previous block should mean you don't need
a 'catch up' period. Don't worry if your estimates go up when you break
something down further (so something that you thought would take about
two weeks, when broken into smaller pieces, might look like three or
even four weeks).

> July 17 - July 28 :
> Implement or use a concrete measure to test the performance of the
> newly implemented clustering algorithm against the previously
> implemented one in terms of speed and accuracy

Do you have an idea of what measure to use for accuracy? The paper you
cited uses between- and within-cluster measures, but there may be other
techniques.
(Also, I don't know if you have access to any pre-clustered test data,
which would allow you to calculate a precision rate.) It's also going
to be important to choose the test data you use for this carefully. Is
there anything in the TREC collections that would be suitable?

> July 28 - August 1:
> Find out drawbacks that may come out while testing the proposed
> solution and tackle it during this time by changing some parts of the
> code that may increase speed or efficiency.

It's good to have a period to incorporate learning from the evaluation
phase. Is a week enough? (This would be one area where I might consider
having a longer period, and be prepared to plan it while or just after
conducting the evaluation, based on what you learn.)

> August 2 - August 16:
> Cleaning the written code, improving the documentation or writing
> documentation afresh, identifying any bugs that may come out and
> writing tests before the final submissions

GSoC has a slightly unfortunate habit of suggesting in their own
timeline that you work on documentation and tests at the end. We
strongly recommend that you work on them as you are writing the code
(with tests in particular, it's much easier to be confident about where
you are if you have tests as you go, and it's also easier to write
documentation for something you're working on right now, rather than
something you coded a month or two previously).

We generally recommend that you include some 'stretch' goals in your
project, for if everything goes particularly well and you have more
time at the end. So it would be good to have a think about what else
could fit within this project. But don't worry if you can't think of
any; I think with the evaluation phase, and then working based on what
you learn from there, you should have enough to do!

J

--
James Aylett, occasional trouble-maker
xapian.org