Ganesh Prabu Ravi
2015-Feb-15 11:54 UTC
[Xapian-devel] GSOC 2015 Participation | Ganesh Prabu
Hi Developers,

I am Ganesh Prabu, a final-year computer science student at SASTRA University, India. I read through the project ideas page and found Clustering of Search Results to be the one that best fits my profile. Before going further, I will introduce myself and my programming background.

About:

I have excellent algorithmic skills and a good grasp of object-oriented design patterns. I did my internship at KLA-Tencor, where I worked on projects involving multithreading in C# and C++, so I have about five months of industrial experience. I have experience coding data mining algorithms as part of my academics. I have worked in CUDA, generating Mandelbrot and Julia sets. I am good at benchmarking and always like to find ways to improve a method. I have also done several projects, including a Chain Reaction game (JavaScript) and an AI Snake. I won first place in an intra-college competition conducted by Microsoft, and Raspberry Pi kits from KLA-Tencor for developing an OMR reader. I also participate in CodeChef and HackerRank to sharpen my algorithmic skills.

Here are my LinkedIn and GitHub accounts:
https://www.linkedin.com/in/ganeshpraburavi
https://github.com/ganeshpraburavi

I have started reading through the existing code, which implements the K-Means algorithm with TF-IDF as the similarity measure.

Problems in the existing method:

1. It does not do any dimensionality reduction (so the feature set is large).
2. It makes no effort at feature selection.

Even if it ran successfully, it would have produced poor clusters.

Solution:

1. Do dimensionality reduction (DRT) in such a way that it reduces the number of features and also selects the most relevant ones. [1]
2. Implement a parallel clustering algorithm such as Buckshot, suffix tree clustering or Lingo. These clustering algorithms are more suitable for web documents. [2]

*Note: Lingo is an algorithm employed in Carrot2 for clustering search results from Lucene and Solr.

I have yet to work out the exact method for solving this problem.
Is the idea of using a parallel programming paradigm okay? I would love to discuss how this could be taken further. I am very excited about this project and would be very glad to work on it with my fullest dedication and to accomplish each specified task before the fixed deadline.

[1] https://web.cs.dal.ca/~luo/AI2005.pdf
[2] http://project.carrot2.org/publications/wroblewski-2003-ahc.pdf

--
Thanks
Ganesh Prabu
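
To make the feature-selection point in the message above concrete, here is a minimal sketch of pruning terms by document frequency before building TF-IDF vectors. It is written in Python purely for illustration: it is not Xapian's code and not the method from [1]. The function names, the `min_df` and `max_df_ratio` thresholds, and the whitespace tokeniser are all assumptions made for the example.

```python
import math
from collections import Counter


def tfidf_vectors(docs, min_df=2, max_df_ratio=0.8):
    """Build TF-IDF vectors over a vocabulary pruned by document frequency.

    Terms seen in fewer than min_df documents, or in more than
    max_df_ratio of all documents, are dropped (a simple, assumed
    form of feature selection).
    """
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenised for term in set(tokens))
    vocab = sorted(
        term for term, count in df.items()
        if count >= min_df and count / n_docs <= max_df_ratio
    )
    idf = {term: math.log(n_docs / df[term]) for term in vocab}
    vectors = []
    for tokens in tokenised:
        tf = Counter(tokens)
        vectors.append([tf[term] * idf[term] for term in vocab])
    return vocab, vectors


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


docs = [
    "xapian search engine library",
    "xapian search results clustering",
    "clustering documents with k-means",
    "the cat sat",
]
vocab, vectors = tfidf_vectors(docs)
print(vocab)  # only terms that occur in at least two documents survive
```

With the pruned vocabulary each document is represented by far fewer dimensions, which is the effect the proposal is after; a real implementation would still need to choose thresholds and a similarity measure suited to the data.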
On 15 Feb 2015, at 11:54, Ganesh Prabu Ravi <ganeshprabhu1994 at gmail.com> wrote:

> I am yet to prepare to exact method for solving this problem. Is the idea of parallel programming paradigm is okay? I would love to have discussion on how it could be proceeded further.

Ganesh, that's a good start, and while having some more detail as part of your proposal would certainly be a good thing, we don't expect you to have all the details when applying; some will naturally come out as part of the work. If you go through our guide for writing proposals, working from where you are now, that should help you figure out what other details you want to put in. (And we're happy to feed back on proposals during that period of GSoC; note that we haven't been accepted as a participating organisation as yet, so there's some time before we get to that stage!)

On parallel programming, that's something we'd need to discuss. Is it always going to be available? (Some uses of Xapian might need to work in systems that don't support multi-process or multi-threaded concurrency, for instance.) Also, things like this, which are using more than the core of C++ and its library, would likely have to be very different for Windows, which we'd like to keep support for in all the main features of Xapian.

That said, there may be a suitable argument that those concerns don't apply in this case. But it may be worth thinking about what approaches you can take without parallelism first, and then to investigate optional performance improvements later.

J

--
James Aylett, occasional trouble-maker
xapian.org
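
One way to read the "without parallelism first, optional improvements later" suggestion is sketched below: the k-means assignment step runs serially by default and only uses concurrency when the caller supplies it. This is a hypothetical Python illustration, not Xapian's API; `nearest_centroid`, `assign_clusters` and the `executor` parameter are names invented for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def nearest_centroid(point, centroids):
    """Return the index of the centroid closest to point (squared Euclidean distance)."""
    distances = [
        sum((p - c) ** 2 for p, c in zip(point, centroid))
        for centroid in centroids
    ]
    return distances.index(min(distances))


def assign_clusters(points, centroids, executor=None):
    """K-means assignment step.

    Runs serially with the built-in map by default; if the caller passes
    an executor, its map method is used instead. The result is the same
    either way, so concurrency stays strictly optional.
    """
    def work(point):
        return nearest_centroid(point, centroids)

    mapper = map if executor is None else executor.map
    return list(mapper(work, points))


points = [[0, 0], [0, 1], [10, 10], [9, 10]]
centroids = [[0, 0], [10, 10]]

serial = assign_clusters(points, centroids)
with ThreadPoolExecutor(max_workers=2) as pool:
    parallel = assign_clusters(points, centroids, executor=pool)

print(serial == parallel)
```

The point here is the shape of the interface rather than any speed-up: a platform without threading support can call the serial path unchanged, and the parallel path can be benchmarked separately before anyone depends on it.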