Dear all,

I'm Guruprasad Hegde. I would like to contribute to Xapian through GSoC 2018. Thank you for this wonderful opportunity.

My introduction:
I am studying for an MSc in Computer Science at the University of Saarland and have finished my 4th semester. The courses I have taken include NLP, Information Retrieval & Data Mining, and Statistical Learning. These courses sparked my interest in information systems. I also gained good programming experience from systems-related courses.

What I have done so far as part of GSoC preparation:
* I started preparing a couple of weeks ago and have built the code successfully.
* I followed the "Getting Started Guide", trying the examples and sometimes looking into the related code.
* I had a look at the bite-size projects and good first bugs, and spent time on a few of them by reading the related code. I will ask questions or request clarification in the corresponding threads.
* The project 'Diversification of results' piqued my interest. I read the survey paper and now have a basic idea of the various approaches.

Plan:
* I will read the survey paper a couple more times to grasp the approaches presented, then read the main papers and evaluate which one is suitable for implementing in Xapian. I will report back on this very soon.
* I also find the recently added topic 'Math aware search' quite interesting. I haven't read the posted literature yet, but will do so.

Questions to mentors:
* Any suggestions or comments on how I should choose between the two topics?
* Is it possible to make the GSoC work part of my thesis work? I would choose a topic (diversification or math-aware search) and implement it in Xapian, and then, as research work, try to come up with improvements. This idea just crossed my mind, so I thought I would check whether it is possible. Regardless, I am interested in contributing to Xapian in this GSoC.

Looking forward to discussing the project! Thank you for your time.
Sincerely,
Guruprasad Hegde
On 9 Mar 2018, at 11:34, Guruprasad Hegde <guruhegde1308 at gmail.com> wrote:

> Question to mentors:
> * Any suggestion/comment on how can I proceed with choosing between two topics?
>
> * Is it possible to make the GSOC work as part of my thesis work? I choose a topic (diversification or math-aware search) and implement it in Xapian. As part of research work, try to come up with any improvement.
> This idea just crossed the mind, so I thought of checking the possibility. Irrespective of this, I am interested in contributing to Xapian in this GSoC.

Hi, Guruprasad!

I'd advise you to choose whichever interests you more, with perhaps a little bit of which one you think you can achieve more thrown in. (Generally, the more you're able to accomplish the more you'll enjoy it.)

GSoC isn't a place for research, however. We ask people to justify the algorithms and approaches they're proposing to take, with reference to published papers, books, and so forth. Of course, you'd be welcome to subsequently build on Xapian and your GSoC work to research improvements to what you've implemented.

J

-- 
James Aylett
devfort.com — spacelog.org — tartarus.org/james/
Hi,

I plan to propose the 'Math aware search' project. After reviewing the literature on the topic, I found that either the Tangent or the MIaS system would be a good starting point, so I studied both systems in detail. I plan to pick Tangent because it performs better. It also has good literature (a thesis report and a few papers) and reference code available. I have summarised both systems below, and I welcome any opinion on the choice.

Tangent:

Indexing stage (each document contains math formulae and text; text is indexed in the usual way):
  Preprocessing: Math formula (Presentation MathML) => symbol layout tree => generate symbol pair tuples
  Indexing: store the tuples in an inverted index

Searching stage:
  Query (Presentation MathML) => symbol layout tree => generate symbol pair tuples => form a query with the logical OR operator => select candidate documents using the Dice coefficient metric => re-rank the documents using the MSS metric

MIaS:

Indexing stage:
  Preprocessing: Math formula (Presentation MathML) => tokenization => formula (token) modification
  Indexing: index each token with an appropriate weight (discussed in the paper)

  Formula modification = ordering + unification of variables + unification of constants

Searching stage:
  Query (Presentation MathML) => formula modification => form a query with the logical OR operator => retrieve using the text search engine

I plan to send the draft proposal by the end of the day. Here are some thoughts on implementation as well. I believe the major work is in the preprocessing and searching stages (the latter needs a new weighting metric). The existing indexing technique can be used for the math part as well. My plan is to implement formula-only retrieval first (documents contain only math) and add keyword support (document = text + math) later. After that, I would also add support for queries in LaTeX format.

Please let me know if you have any comments or questions on the points I mentioned. Sorry for the delay.
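To make the Tangent search stage summarised above more concrete, here is a rough Python sketch of symbol pair generation, the OR-style candidate selection, and Dice scoring. It is only an illustration under my own assumptions and is not taken from the Tangent reference code: the `Node` class, the edge labels ("n" for next, "a" for above), the (ancestor, descendant, path) tuple layout, and the use of plain sets are all hypothetical simplifications. MathML parsing and the MSS re-ranking step are left out.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Node:
    """One symbol in a simplified, hypothetical symbol layout tree."""
    symbol: str
    # Edge label -> child, e.g. "n" = next on the baseline, "a" = above.
    children: dict = field(default_factory=dict)


def symbol_pairs(root):
    """Return the set of (ancestor, descendant, path) tuples for a tree."""
    pairs = set()

    def walk(node):
        # Collect (symbol, path) for every descendant of `node`.
        below = []
        for label, child in node.children.items():
            below.append((child.symbol, label))
            below.extend((sym, label + path) for sym, path in walk(child))
        pairs.update((node.symbol, sym, path) for sym, path in below)
        return below

    walk(root)
    return pairs


def dice(a, b):
    """Dice coefficient of two tuple sets: 2|A & B| / (|A| + |B|)."""
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 0.0


def build_index(docs):
    """Inverted index: tuple -> ids of the formulae that contain it."""
    index = defaultdict(set)
    for doc_id, pairs in docs.items():
        for pair in pairs:
            index[pair].add(doc_id)
    return index


def search(query_pairs, index, docs):
    """OR the query tuples to get candidates, then rank them by Dice."""
    candidates = set()
    for pair in query_pairs:
        candidates |= index.get(pair, set())
    scored = [(doc_id, dice(query_pairs, docs[doc_id])) for doc_id in candidates]
    return sorted(scored, key=lambda item: (-item[1], item[0]))


# Two indexed formulae, x^2 + y and x + y, and the query x^2.
docs = {
    "f1": symbol_pairs(Node("x", {"a": Node("2"), "n": Node("+", {"n": Node("y")})})),
    "f2": symbol_pairs(Node("x", {"n": Node("+", {"n": Node("y")})})),
}
query = symbol_pairs(Node("x", {"a": Node("2")}))
results = search(query, build_index(docs), docs)
print(results)
```

In this toy example only "f1" shares a tuple with the query, so "f2" never becomes a candidate. In Xapian the tuples would presumably be stored as terms and the OR step expressed as an ordinary OR query, with the Dice scoring done by a new weighting scheme.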
I would also like to mention that I am actively preparing by reading the Xapian codebase and the literature.

Thank you.

Regards,
Guruprasad

Link for Tangent paper: https://www.cs.rit.edu/~rlaz/files/ntcir2016_tangent.pdf
Link for MIaS paper: https://link.springer.com/chapter/10.1007/978-3-642-22673-1_16
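For comparison, the MIaS "formula modification" step mentioned above (ordering + unification of variables + unification of constants) could be sketched roughly as follows. This is my own simplified illustration, not code from MIaS: formulae are represented as nested tuples `(operator, operand, ...)` instead of MathML, the names `order`, `unify_variables`, `unify_constants` and `variants`, and the placeholders `id1`, `id2`, `const`, are hypothetical, and only "+" and "*" are assumed to be commutative. Which variants are actually indexed, and with what weights, is described in the MIaS paper and not reproduced here.

```python
COMMUTATIVE = {"+", "*"}  # assumption: only these operators are reordered


def is_number(token):
    """True for simple unsigned numerals such as "2" or "3.14"."""
    return token.replace(".", "", 1).isdigit()


def order(expr):
    """Sort operands of commutative operators so a+b and b+a look the same."""
    if isinstance(expr, str):
        return expr
    op, *args = expr
    args = [order(arg) for arg in args]
    if op in COMMUTATIVE:
        args.sort(key=repr)
    return (op, *args)


def unify_variables(expr, names=None):
    """Rename variables to id1, id2, ... in order of first occurrence."""
    names = {} if names is None else names
    if isinstance(expr, str):
        if is_number(expr):
            return expr
        return names.setdefault(expr, f"id{len(names) + 1}")
    op, *args = expr
    return (op, *[unify_variables(arg, names) for arg in args])


def unify_constants(expr):
    """Replace every numeric constant with the placeholder "const"."""
    if isinstance(expr, str):
        return "const" if is_number(expr) else expr
    op, *args = expr
    return (op, *[unify_constants(arg) for arg in args])


def variants(expr):
    """Return the ordered formula plus its progressively unified forms."""
    ordered = order(expr)
    unified = unify_variables(ordered)
    return [ordered, unified, unify_constants(ordered), unify_constants(unified)]


# After ordering and variable unification, b + a and y + x become identical.
left = unify_variables(order(("+", "b", "a")))
right = unify_variables(order(("+", "y", "x")))
print(left == right, variants(("+", "2", "a")))
```

Each variant would then be turned into a term and indexed by the text search engine, which is why MIaS needs no new index structure.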